Researchers are struggling to master the art of finding useful science buried in the huge datasets created by modern research facilities, argues Tony Hey, chief data scientist at the Science and Technology Facilities Council, a UK government body that carries out and funds research.
All the sensors, detectors, imaging software and networking systems used by the big science labs in Europe, such as CERN, the European Molecular Biology Laboratory, the European Space Agency and the UK’s Diamond synchrotron facility, spit out datasets on the terabyte or petabyte scale.
“We’re looking at an exponential growth in science data over the last ten years,” Hey told Science|Business. “Data-intensive science is creating more careers for scientists – our institutions need to catch up on these opportunities.”
At the Diamond lab, for example, the data rate in 2013 was 600 megabytes per second. “Today it has grown to six gigabytes per second,” said Hey.
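The two figures Hey cites imply a tenfold increase in Diamond's data rate. As a rough illustration of what that exponential trend means year on year, the compound growth rate can be estimated in a few lines of Python (the five-year interval is an assumption for the sketch, based on the 2013 starting point):

```python
# Rough illustration: the annual growth factor implied by Diamond's
# data rate rising from 600 MB/s (2013) to 6 GB/s "today".
# The elapsed time of 5 years is an assumption for this sketch.

rate_2013_mb_s = 600       # megabytes per second in 2013
rate_now_mb_s = 6_000      # six gigabytes per second today
years = 5                  # assumed elapsed time

growth_factor = rate_now_mb_s / rate_2013_mb_s   # 10x overall
annual_growth = growth_factor ** (1 / years)     # compound annual rate

print(f"Overall growth: {growth_factor:.0f}x")
print(f"Implied annual growth: {annual_growth:.2f}x per year")
```

Under those assumptions the rate is compounding at roughly 1.6x per year, which is what "exponential growth" looks like on the ground.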
Figure: Rapid increase in the rate of data generated at the UK Diamond lab. Source: Tony Hey
This sort of heavy data-wrangling used to be the domain of astronomers and high-energy physicists.
But today, scientists working in virtually every discipline go to big science labs and come away with what Hey calls “a dirty big disc of data” – a USB stick is no longer any use for moving it. From there, scientists are likely to face significant challenges in processing and analysis.
New dedicated lab
To prepare researchers for this demanding data-mining effort, Hey, a former senior data science fellow at the eScience Institute at the University of Washington and a former vice president of Microsoft Research, proposes to set up a new, dedicated data-science institute in the UK.
If he can convince lawmakers to give him the funding, Hey would create the Ada Lovelace Centre, named after the mathematician who wrote what is widely regarded as the first computer program, for Charles Babbage’s Analytical Engine, the first design for a general-purpose computer.
The new institute would breed a new species of data wizard by imparting practical skills – for instance, knowing which machine-learning algorithm will help them tunnel through the mountains of data produced by a big science telescope or particle accelerator and come out with useful information.
Researchers at Hey’s new centre would learn how to access data, organise and reorganise it, visualise it, and, eventually, do sophisticated analytics on it. Researchers would also get instruction on how they can make their data available to the scientific community in a useful form.
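The workflow described above – access the data, organise it, then run analytics on it – can be sketched in a few lines of Python. Everything here is illustrative: the CSV records, the detector names and the photon counts are made up, not drawn from any real facility.

```python
import csv
import io
import statistics

# Illustrative sketch of the access -> organise -> analyse pipeline.
# The dataset below stands in for a facility's raw output.
raw = io.StringIO(
    "detector,photon_count\n"
    "A,120\nB,95\nA,130\nB,88\nA,125\n"
)

# Access: read the raw records.
records = list(csv.DictReader(raw))

# Organise: group the counts by detector.
by_detector = {}
for row in records:
    by_detector.setdefault(row["detector"], []).append(int(row["photon_count"]))

# Analyse: a simple summary statistic per detector.
for detector, counts in sorted(by_detector.items()):
    print(detector, statistics.mean(counts))
```

At real scale the same three stages apply, but each step – streaming the data off disk, reshaping it, and running the statistics – becomes a distributed engineering problem in its own right, which is the gap the proposed centre aims to fill.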
The centre would prepare students for scientific programming jobs in academia or – for those who wanted to follow Hey’s path – industry.
“Having spent time counting the number of quarks in a proton during my postdoc years at CERN, I know industry can be an attractive destination,” said Hey.
“At the end of the day, the job of a scientist is to pull science out of all this data. They need more help to get on the right track and the new centre could provide it,” he added.