In several fields of science it is becoming increasingly difficult to keep up with the pace of accumulating and overlapping research results. The bottleneck in open data banks and databases is no longer so much the acquisition of information as its analysis.
The Helsinki Institute for Information Technology HIIT, a joint research centre of Aalto University and the University of Helsinki, has taken a significant step towards a search engine for scientific knowledge. Led by the Director of HIIT, Professor Samuel Kaski of Aalto University, a research group for statistical machine learning and bioinformatics has developed a search engine for biomedical research, with which bioscientists can search for common aspects in masses of measured data and do better science based on the search results.
"Drawing connections between different studies is hard, laborious and still based on keyword searches. Reaching the level of measured data would be crucial, particularly in bioinformatics," says Samuel Kaski.
A search engine for the endless complexity of living systems
Molecular biology has been one of the pioneers of open data: open bio-data banks have existed for a long time, and as measurement technologies have improved, one can now extract information on thousands of molecules from just one sample. On the other hand, the question of how best to use these vast data banks in research remains equally open.
"In sciences like biology, knowledge mainly accumulates conceptually, unlike in physics, for instance, where one can step back through the formulas and fix false assumptions. With our information retrieval methods, biologists can compare their datasets to thousands of other sets and find biological processes to which the data may be related.
"Living biological systems function and change constantly, so there is bound to be a lot of noise. Before similarities between different datasets can be discerned, one must first know which kinds of similarities to look for amidst all the noise. The rest can be learned from the data gathered.
"Assumptions about unknown factors have to be made, and their uncertainty can be controlled in an exact way by statistical modelling. This way we can define a computational task to solve," says Professor Kaski, describing the benefits of information and computer science to biological research.
Statistical models and computational methods can, for example, break down gene expression (the process by which genetic information is turned into functioning parts of a living organism) in order to outline the defining features of different processes. José Caldas, a doctoral researcher in Professor Kaski's group, developed methods for the retrieval of genomic information in his dissertation, presented at Aalto University in April 2012. With these retrieval methods, one can compare raw measurement data to thousands of similar data sets. Caldas' research has already contributed to the discovery of exceptional gene expression in mesothelioma, a rare form of cancer linked to asbestos exposure.
"Spotting relevant information is not as easy as searching websites with Internet search engines, though. Recognising useful and meaningful datasets is a far more challenging computational feat than a mere keyword search," Kaski points out.
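To make the contrast with keyword search concrete, the following is a deliberately simplified sketch of retrieval by measured data. It is not the method developed by Kaski's group, which relies on statistical modelling; here plain cosine similarity is swapped in as a stand-in, and the study names, the numbers and the function names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length measurement profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile carries no direction to compare
    return dot / (norm_a * norm_b)

def rank_datasets(query, collection):
    """Return (name, score) pairs ordered from most to least similar."""
    scores = [(name, cosine_similarity(query, profile))
              for name, profile in collection.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: one expression value per gene, same gene order everywhere.
collection = {
    "study_A": [2.0, 0.1, -1.5, 0.3],
    "study_B": [-1.8, 0.2, 1.4, -0.2],
    "study_C": [0.5, 1.0, 0.2, -0.3],
}
query = [2.1, 0.0, -1.4, 0.4]

ranking = rank_datasets(query, collection)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The query is matched on the measurements themselves rather than on any textual description of the studies. A realistic system would also have to account for the noise and unknown factors discussed above, which is where statistical modelling comes in.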
"Innovations are born best in the heat of research"
Several evaluations have ranked HIIT's best research as being of top international quality. Samuel Kaski considers data analysis and computational modelling to be the institute's strengths. Kaski affirms that groundbreaking results mostly arise from practical work that has clear goals and room for theoretical basic research.
"As long as we have fine and devoted researchers, we get quality results. It is a matter of strategic thinking whether we achieve a level of 'mere' excellence or outstanding results," Kaski believes.
Kaski also insists that the boundaries between concrete projects and basic research must be crossed fluidly. Bioinformatics is a useful test bed for applying data analysis and statistical modelling precisely because fresh results from model development can be put to the test immediately.
"In our field, basic research often comes down to creating novel methods, which are most naturally developed while working on applications. At the same time, the tool kit of computational science expands as we step back between projects to scrutinise the implications of our results.
"The gap between basic research and the much sought-after innovations can be almost non-existent, as innovations are born in the heat of research."