Thursday, 9 May 2019

He with the most data wins?

When we consulted for the United States Geological Survey four years ago, their tape robot operated a 2-petabyte data repository. Back then I thought we were dealing with one of the world's largest data collections. I hear that CERN currently has 15 petabytes, and expects to add a few petabytes every year of LHC operation.

Relational databases typically don't operate at such sizes. From what I remember of a year ago, the world's largest production SQL database was owned by Yahoo!. It reportedly held 1 petabyte in May 2008; if the bold estimates of their VP of data were true, it would have grown tenfold by today. That's PostgreSQL. Oracle installations are usually more modest in size: in a 2008 presentation they list only three customers with Oracle data warehouses above 50 TB – AT&T, NYSE Euronext and Sprint/Nextel – plus Yahoo!, again, at 250 TB. The largest I know of is the CERA database of the World Data Center for Climate (WDCC), which handles 400 TB in a federation of Oracle 9i databases.

I am not going to elaborate in the Yahoo! direction, and will put aside the Internet industry in this discussion, with its obvious winners in terms of storage size: Google and the Web 2.0 crowd. Curt Monash recently gave a nice summary on this: eBay's Greenplum data warehouse holds 6.5 petabytes, Facebook's 2.5. The Internet industry is a case of its own, as besides storing data it also generates data. Most data on the Internet is also about the Internet, replicated and secondary in nature. Google's indexed storage is a replica of the Internet. Your email provider stores countless repetitive pieces of information in your mailbox, generated by people hitting the Reply All button.

I haven't seen any recent attempt to estimate how much data the Internet holds. However, calculating an upper bound on the data mankind has today is not that difficult. I think mankind consumes on the order of 50 exabytes of new storage every year. This is an educated guess based on my rough knowledge of the production capacity of the storage industry today (argue with me!). Obviously not all of the hard disk drives purchased are immediately filled with data. A similar number, 50 EB, is probably also equal to all the digital data that mankind has ever generated, including all copies of stolen MP3s. Never mind.
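
For those who want to argue, here is the shape of that back-of-envelope calculation as a quick sketch; the shipment and capacity figures are my own illustrative assumptions, not industry statistics.

```python
# Back-of-envelope check of the "~50 EB of new storage per year" guess.
# Both input figures are assumptions for illustration, not real statistics.
drives_shipped_per_year = 500e6   # assumed: ~half a billion hard drives a year
avg_capacity_bytes = 120e9        # assumed: ~120 GB per drive on average

total_bytes = drives_shipped_per_year * avg_capacity_bytes
print(f"~{total_bytes / 1e18:.0f} EB of new raw capacity per year")  # ~60 EB
```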

Even though it would be cool to estimate how much data the Internet holds, I am not sure what clever conclusion one might reach – besides the obvious one that mankind wouldn't lose much if 99% of it disappeared.

Let's put aside the issue of self-generating data. It is more interesting how far we can get in generating 'primary' data, I mean the data that describes the world, mankind, business and industry. In which domain should we expect a data explosion? As I have already said, physics experiments are expected to generate a few petabytes every year. Public repositories of satellite geological data reach a similar order of magnitude. The WDCC mentioned earlier holds 6 PB of climate data on tapes. Industry still sits behind these scientific examples: Walmart, once said to run US industry's largest data warehouse, had 2.5 petabytes at last count. This is less than CERN, but at least of the same order of magnitude. The gap is not as dramatic as it was a decade ago.

It may be that bioinformatics will soon become the number one storage-intensive discipline. In the Virolab project, we contribute to the clinical effort to fight AIDS by handling and processing a large number of viral genome sequences from thousands of HIV-infected patients. These we store in a complex system spanning multiple data banks across Europe. But even this amount of data, collected by a decades-long effort of a dozen hospitals, is trivial compared to the needs of other disciplines of genomics.

Sequencing a single human genome generates 750 MB of data, enough to fill a CD. A microbe's genome is orders of magnitude smaller. A relatively new branch of genomics is metagenomics, which deals with genetic material – typically belonging to various microbes – recovered from the environment. I hear J. Craig Venter, the central figure in decoding the human genome, is off on his second Global Ocean Sampling yachting expedition. His boat Sorcerer II (now probably somewhere in the Caribbean) takes a sample of sea water every 200 miles on its cross-Atlantic journey. Every sample – potentially containing millions of microorganisms – is subjected to shotgun sequencing, so that scientists can reason about the distribution of genes in the environment. The data is made available to science by the CAMERA project. If all the genes from all the samples were sequenced and stored in the repository, CAMERA would soon supersede in size any of the data warehouses I quoted above – by two or three orders of magnitude. This of course isn't happening any time soon, for practical reasons: the cost of sequencing a single genome is still a couple of thousand USD, though it keeps falling and may soon be within reach of the masses.
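
To get a feel for how those numbers multiply out, here is a rough sketch of the estimate; the sample count and per-organism data volume are my own assumptions, inserted only to show the arithmetic.

```python
# Rough estimate of raw data from an ocean sampling expedition.
# All three inputs are illustrative assumptions, not CAMERA figures.
samples = 200                  # assumed: samples taken over a whole expedition
organisms_per_sample = 1e6     # "millions of microorganisms" per sample
bytes_per_organism = 75e6      # assumed: raw shotgun reads per microbial genome

total_bytes = samples * organisms_per_sample * bytes_per_organism
print(f"~{total_bytes / 1e15:.0f} PB")  # ~15 PB
```

Even with these conservative inputs the total lands beyond the largest warehouses quoted above, and deeper sequencing coverage multiplies it further.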

Now, the idea of extracting genetic information from water around the globe raises a disturbing question: in order to understand the universe, are we going to first put the entire universe into the digital domain? Of course not, because then the storage would become larger than the Earth. The point at which the act of digitizing information becomes environmentally visible should raise some social and political concern. Distant future? No. We have already hit that point, with the largest data centers in banking and IT generating enormous amounts of heat and needing their own dedicated power plants. Add to this a yet more threatening quote: maintaining an avatar in the Second Life virtual reality game requires 1,752 kilowatt-hours of electricity per year – almost as much as the average Brazilian uses (N. Carr).
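
That figure is easier to grasp as a continuous power draw – a one-line unit conversion:

```python
# Convert 1,752 kWh per year into a continuous power draw.
kwh_per_year = 1752
hours_per_year = 365 * 24  # 8760
watts = kwh_per_year * 1000 / hours_per_year
print(f"~{watts:.0f} W")   # ~200 W, a couple of bright light bulbs burning non-stop
```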

Eventually, the idea of creating a digital copy of the world is pointless for yet another reason. In the case of genome sequencing, we can't even properly say that the data is being digitized, because the DNA – with its four-letter alphabet – already *is* perfectly digital information.
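
In fact, the four-letter alphabet neatly reproduces the 750 MB figure from above; the genome length is the only assumed input:

```python
import math

# {A, C, G, T}: four symbols means 2 bits per base.
bits_per_base = math.log2(4)  # = 2.0
bases = 3.0e9                 # assumed: ~3 billion base pairs in a human genome

megabytes = bases * bits_per_base / 8 / 1e6
print(f"~{megabytes:.0f} MB")  # ~750 MB – the CD-sized figure quoted earlier
```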

Paraphrasing an old Sun slogan, I like to think that the Universe is the computer, so we don't need another one.
