
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

By Chris Anderson  06.23.08

Illustration: Marian Bantjes


"All models are wrong, but some are useful."

So proclaimed statistician George Box 
30 years ago, and he was right. 

But what choice did we have? 

Only models, 
from cosmological equations 
to theories of human behavior, 
seemed to be able to consistently, 
if imperfectly, explain 
the world around us. 

Until now. 

Today companies like Google, which have grown up in an era of massively abundant data, don't have to settle for wrong models. Indeed, they don't have to settle for models at all.

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.
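To make that idea concrete, here is a minimal sketch of ranking pages from nothing but the statistics of incoming links, written as a toy PageRank-style power iteration in Python. The tiny link graph, the damping factor, and the iteration count are illustrative assumptions, not a description of Google's production system.

    # Toy link-statistics ranking: score pages only by who links to them.
    # Graph, damping factor, and iteration count are made up for illustration.
    def rank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        score = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / n for p in pages}
            for page, outgoing in links.items():
                if not outgoing:  # dangling page: spread its score evenly
                    for p in pages:
                        new[p] += damping * score[page] / n
                else:
                    for target in outgoing:
                        new[target] += damping * score[page] / len(outgoing)
            score = new
        return score

    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, s in sorted(rank(toy_web).items(), key=lambda kv: -kv[1]):
        print(page, round(s, 3))  # "c" ranks first: most incoming links, no semantics needed

Nothing in the loop knows what the pages say; the scores come entirely from link counts.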

Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
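As a hedged illustration of what "letting the algorithms find the patterns" can mean in practice, the Python sketch below scans every pair of variables in a synthetic data matrix and reports the strong correlations, with no prior hypothesis about which pairs should be related. The fake measurements and the 0.8 threshold are assumptions made purely for the example.

    # Hypothesis-free pattern hunting: compute every pairwise correlation
    # and surface the strong ones. Data and threshold are synthetic.
    import itertools
    import math
    import random

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    random.seed(0)
    # Six fake "measurements" of 200 observations; v5 secretly tracks v0.
    data = {f"v{i}": [random.gauss(0, 1) for _ in range(200)] for i in range(5)}
    data["v5"] = [2 * x + random.gauss(0, 0.1) for x in data["v0"]]

    for a, b in itertools.combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > 0.8:  # keep only the strong patterns
            print(a, b, round(r, 2))  # reports the v0-v5 pair it was never told about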

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" 
call to mind Darwin and drawings of finches, 
you may be stuck in the old way of doing science. 

Venter can tell you almost nothing 
about the species he found. 

He doesn't know what they look like, 
how they live, or much of anything else 
about their morphology. 

He doesn't even have their entire genome. 

All he has is a statistical blip 
— a unique sequence 
that, being unlike any other sequence 
in the database, must represent a new species.
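A minimal sketch, under stated assumptions, of how such a statistical blip can be flagged: compare a read's k-mers against everything in a reference set and call it novel when almost nothing matches. The tiny reference sequences, k = 8, and the 10 percent overlap threshold are invented for illustration; real metagenomic pipelines are far more elaborate.

    # Flag a sequence as "unlike anything in the database" by k-mer overlap.
    # Reference set, k, and threshold are illustrative assumptions only.
    def kmers(seq, k=8):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    reference = {
        "known_microbe_1": "ATGCGTACGTTAGCATCGATCGGCTAAGCTTACGGATC",
        "known_microbe_2": "TTGACCGGATTACGCTAGGCTAACGTTAGCCGATCGAT",
    }
    reference_kmers = set().union(*(kmers(s) for s in reference.values()))

    def looks_novel(read, threshold=0.10):
        ks = kmers(read)
        overlap = len(ks & reference_kmers) / len(ks)
        return overlap < threshold  # a statistical blip, nothing more

    print(looks_novel("GGGTTTCCCAAAGGGTTTCCCAAAGGGTTTCCCAAA"))  # True: no close match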

This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.

This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.
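Since MapReduce is named above, here is a single-machine Python imitation of that programming model (map, shuffle, reduce), counting words across a couple of toy documents. It only mimics the pattern; the CluE cluster ran genuinely distributed implementations, and the documents here are invented for the example.

    # Single-machine sketch of the MapReduce pattern: map, shuffle, reduce.
    from collections import defaultdict

    def map_phase(doc_id, text):
        for word in text.split():
            yield word.lower(), 1  # emit (key, value) pairs

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)  # group every value by its key
        return groups

    def reduce_phase(key, values):
        return key, sum(values)  # combine each key's values into one result

    documents = {1: "more is different", 2: "more data more correlation"}
    pairs = [p for doc_id, text in documents.items() for p in map_phase(doc_id, text)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'more': 3, 'is': 1, 'different': 1, 'data': 1, 'correlation': 1}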

Learning to use a "computer" 
of this scale may be challenging. 

But the opportunity is great: 
The new availability 
of huge amounts of data, 
along with the statistical tools 
to crunch these numbers, 
offers a whole new way 
of understanding the world. 

Correlation supersedes causation, 
and science can advance 
even without coherent models, 
unified theories, or really 
any mechanistic explanation at all.

There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?

Chris Anderson (canderson@wired.com) is the editor in chief of Wired.

THE PETABYTE AGE:

Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn't just more. More is different.
