[Orechem] Amazon Exposes 1 Terabyte of Public Data to Developers -
ReadWriteWeb
Carl Lagoze
clagoze at gmail.com
Wed Feb 25 21:28:51 EST 2009
Anybody know what the "huge amounts of chemistry data" is?
Amazon Exposes 1 Terabyte of Public Data to Developers
Written by Marshall Kirkpatrick / February 25, 2009 5:26 AM / 15 Comments
Amazon.com changed the retail world. In the process, the company built
up so much surplus computing power that it started a dirt-cheap
"computing in the cloud" business that changed the computing world.
This week the company's newest project, Public Data Sets on Amazon Web
Services, began offering more than 1 Terabyte (1000 GB) of fascinating
public data for developers to access on the fly through Amazon's cloud
computing service.
We're talking about an annotated collection of all publicly available
DNA sequences, including the Human Genome; huge amounts of chemistry
data; machine-readable encyclopedic entries about millions of
different topics; an entire dump of Wikipedia; US Census data; data
from the US Department of Transportation; and more. It's all accessible
by web applications in no time at all. What do you think this is going
to change?
The company made a blog post last night announcing the availability of
four new public data sets.
This includes data from:
- The Bureau of Transportation Statistics.
- DBPedia Knowledge Base, which "currently describes more than 2.6
  million things including 213,000 people, 328,000 places, 57,000 music
  albums, 36,000 films, and 20,000 companies." All in handy semantic
  markup.
- The Freebase Data Dump, the giant collaboratively built semantic
  database on a wide variety of topics, data that high-profile startup
  Metaweb has spent millions of dollars assembling.
- The entire English section of Wikipedia, dumped into a
  machine-readable format.
- A number of large genetic and scientific databases.
We counted all the databases up and it passed 1 TB of available data.
The company says that accessing this data is "trivial" for developers.
What are developers going to do with this data? We can't wait to find
out. The prospect of mashing up, cross-referencing and building user
interfaces on this amount of data is nearly unfathomable. Really.
This data will be leveraged by all kinds of different web
applications, for a long time.
You've read, or can imagine, the impact that the first Public
Libraries had on human culture. Now imagine the opening up of not just
this, but other libraries of data, so huge that economies of scale
blast the project off beyond any analogy that could be drawn with our
everyday experience or historical memories. It won't just be Amazon
that offers up this kind of data - it will be relatively commonplace
soon, we imagine.
It will be like a network of libraries - for robots. Robots that go to
the library frequently, read very fast and make serious use of what
they've learned.
Congratulations, Amazon, on passing 1 TB of public data made
available. May all our robots of the future please live in peace.
Comments
Are you sure it's 1 TB? Just Wikipedia by itself would make up a
significant chunk of that space, and since it includes dbpedia (thus
duplicating most content)... Terabytes are cheap these days. I have
two TB of data just on my home desktop PC.
Posted by: Nate | February 25, 2009 7:47 AM
You apparently have to have an EC2 account and an active instance to be
able to get access to the data, so for the meantime there is a cost
attached to getting hold of the data.
Posted by: Stuart Marsh | February 25, 2009 8:13 AM
to be honest marshall - you might change that title - when i saw it i
thought "oh my amazon just got hacked and my info is public!"
not sure if others got the same idea
Posted by: Allen | February 25, 2009 8:22 AM
Amazon is a 'Cloud' Pioneer and I appreciate them for offering these
datasets to the public. They have a potential to have a huge impact.
I recommend that there be a public information release on how often
Amazon plans to update these files.
Posted by: Tecue | February 25, 2009 8:59 AM
To be honest, I thought the title was speaking of a vulnerability as
well. Had to read the article twice to make the connection.
And I'm a user of AWS as well.
Posted by: mtranda | February 25, 2009 9:22 AM
Isn't terrabyte supposed to be spelled terabyte?
Posted by: Bob Ohsiek | February 25, 2009 9:33 AM
Bob - thank you, you are the only real friend I have!
Allen - I would have thought the words "customer data" would have
given you that impression. But I'll edit the headline.
Posted by: Marshall Kirkpatrick | February 25, 2009 9:44 AM
Same comment about the title. "Releases" or "Publishes" instead of
"Exposes" would be less confusing maybe
Posted by: Ozh | February 25, 2009 10:29 AM
It's great that Amazon is making all of this data available via EC2
but the data has always been available for developers that were so
inclined to use it.
These are public data sets already distributed by the respective
organizations - from what I can tell this is just clustered to add
value to Amazon's Web Service offerings.
Anyone have any sense of whether Amazon did any work to mark these up
better, cross-reference them, or add any particular value besides
exposing them from within EC2?
Posted by: Christian | February 25, 2009 10:48 AM
Please go to http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released
It will show you which AMI you need in order to access these new
datasets. Unfortunately, Amazon is a little light on the details in
terms of accessing the datasets they just published.
Posted by: Allan | February 25, 2009 11:28 AM
Title is scary, but after reading it, feel much better now.
Rex
Posted by: Rex Dixon | February 25, 2009 11:58 AM
Yes, boo to the misleading page title.
Posted by: exposer | February 25, 2009 1:30 PM
This system should provide a lot of good fodder to make some
interesting mashups. Heck, you could probably build an entire
self-contained system just using Amazon products exclusively at this point:
-this service for the raw input
-EC2 and S3 for crunching and storing data
-Mechanical Turk for recognizing patterns in the output analysis
On another note, coincidentally today we released the JumpBox for
SnapLogic which is essentially an Open Source "Yahoo Pipes" system. I
recorded a demo video that helps people get started with it. You don't
have to be a developer anymore to make mashups:
http://blog.jumpbox.com/2009/02/25/introducing-snaplogic-for-data-integration/
Sean
Posted by: Sean Tierney | February 25, 2009 1:48 PM
I understood what he meant from the headline immediately... "expose"
has a bad rap, apparently. Published, released, etc. would be
inaccurate: Amazon is only providing easy access to the info which is
freely available elsewhere. "Expose" is exactly the right word for that.
Posted by: Evan | February 25, 2009 2:17 PM
Totally read the same thing. I thought for sure, and had to do a
triple take, that Amazon had been hacked.
But, no, this is VERY VERY interesting.
And not at all frightening like the usual digital security issues that
we are bombarded with each and every day.
You ever read http://www.justaskgemalto.com? Anyway, it's not fun.
But yes, the cloud is very interesting.
Posted by: Janet Altman | February 25, 2009 3:31 PM