Text mining hundreds and thousands of articles at once to uncover hidden research connections is an increasingly powerful tool in drug discovery, according to Jon Hill, principal scientist in computational biology at Boehringer Ingelheim.
Straightforward searching can no longer digest the volume of information. Medline, for example, currently holds something like 22 million abstracts. “Search has improved but it’s not enough. There’s just a mountain of material out there. Really, we couldn’t keep pace with the amount of scientific data that’s coming out,” said Hill.
Text and data mining turbocharges traditional search functions. “It does not only the retrieval work but the summarisation work too – that can make things downstream a lot faster,” Hill said.
Computer crawling is used mostly for large-scale speed-reading, but pharma companies are finding additional uses for it.
One such use is analysing what the competition is doing. “You could go through the affiliation fields of articles and see which companies are active in different fields. Then you can check the pipelines of these companies,” Hill explained in a webinar he hosted last week for the digital rights services company, Copyright Clearance Center Inc.
“Conference abstracts are good to go through too – they might have some really cutting edge stuff that hasn’t made it into the literature yet. You can also comb grant and patent information,” he said.
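As a rough illustration of the affiliation-field idea, the sketch below counts how many articles in each research field list a given company among their affiliations. It is not the tooling Hill described: the records, the field labels and the company names (Acme Pharma, Globex Therapeutics) are all invented, and a real pipeline would need far more careful name matching than a substring test.

```python
from collections import Counter, defaultdict

# Hypothetical article records; real data would come from a literature database.
records = [
    {"field": "oncology",
     "affiliations": ["Dept. of Oncology, Example University", "Acme Pharma Inc., Boston"]},
    {"field": "oncology",
     "affiliations": ["Acme Pharma Inc., Basel"]},
    {"field": "cardiology",
     "affiliations": ["Globex Therapeutics GmbH, Berlin"]},
    {"field": "oncology",
     "affiliations": ["Globex Therapeutics GmbH, Berlin", "Example University"]},
]

# Hypothetical watch list of competitor names.
COMPANIES = ["Acme Pharma", "Globex Therapeutics"]


def companies_by_field(records, companies):
    """Count, per field, the number of articles naming each company."""
    counts = defaultdict(Counter)
    for rec in records:
        seen = set()  # count each company at most once per article
        for affiliation in rec["affiliations"]:
            lowered = affiliation.lower()
            for company in companies:
                if company.lower() in lowered:
                    seen.add(company)
        for company in seen:
            counts[rec["field"]][company] += 1
    return counts


activity = companies_by_field(records, COMPANIES)
for field, counter in sorted(activity.items()):
    print(field, counter.most_common())
```

The per-field counts only suggest where a company is publishing; checking its pipeline, as Hill suggests, remains a separate step.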
Text mining can also avoid wasted hours at conferences. “If you go to a meeting held by the American Heart Association or somewhere like the American College of Rheumatology, there’s so many conference tracks running at the same time. Text mining can sort through the hundreds of conference poster abstracts to find the most interesting session for you,” said Hill.
Academics face barriers to text mining
Pharmaceutical companies obviously have the financial wherewithal to pay for rights to mine the literature in this way. But while the power of text mining is evident, publicly funded researchers in Europe face legal barriers to its use. As things stand, publishers have the right to grant or refuse the mining of academic journals on the basis of copyright law, EU database protection law, and provisions in intellectual property law. The UK is the only country in Europe to have exempted automated computer mining from copyright law.
Publishers block data mining software by default, but may give special licence permissions to academics and university libraries.
This clash, which sits within a wider debate about open access, has resulted in the call for new rules. Last December, the European Commission proposed a bill which would clear the way for researchers to perform text and data mining, as part of a broad update of European copyright rules.
Speed reading, slowed down
A scientist reading about an interesting new therapeutic target for breast cancer in an academic paper might check the references for more information and conclude a broader investigation is warranted.
The scientist will summon an expert in text mining and explain the context of the search and its objectives. A text miner is someone with good search skills, Hill said, adding, “A little bit of programming skill is beneficial too, but not necessary.”
A text miner will begin by hand-selecting a list of aliases for the target. “Building the corpus takes time,” noted Hill. It’s an iterative approach, and a bit finicky. Using too few search terms at the beginning risks missing something. “You begin wide and you get narrower as you continue,” said Hill.
Searches are easier when scientists use common nomenclature in their publications. “There’s only so many genes out there, so people have been getting better at using standard names. New articles are a little bit more consistent,” Hill noted.
Mining workflow. Source: Boehringer Ingelheim Pharmaceuticals
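The "begin wide, get narrower" approach can be sketched in a few lines. The example below is an assumption-laden illustration, not Boehringer Ingelheim's workflow: the alias list, the three toy abstracts and their IDs are made up for the purpose, and the narrowing step is a plain keyword filter.

```python
import re

# Illustrative alias list for a single target; in practice it is curated by hand.
ALIASES = ["HER2", "ERBB2", "HER-2/neu"]


def build_alias_pattern(aliases):
    """Compile one case-insensitive pattern matching any alias as a whole token."""
    # Longest aliases first, so a longer alias wins over a shorter one it contains.
    ordered = sorted(aliases, key=len, reverse=True)
    body = "|".join(re.escape(alias) for alias in ordered)
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)


def wide_search(abstracts, pattern):
    """Wide pass: keep every abstract that mentions any alias."""
    return {doc_id: text for doc_id, text in abstracts.items() if pattern.search(text)}


def narrow_search(hits, context_terms):
    """Narrow pass: keep only hits that also mention a context term."""
    terms = [term.lower() for term in context_terms]
    return {
        doc_id: text
        for doc_id, text in hits.items()
        if any(term in text.lower() for term in terms)
    }


# Hypothetical abstracts keyed by made-up IDs.
abstracts = {
    "A1": "ERBB2 amplification in breast cancer cell lines.",
    "A2": "HER-2/neu expression in gastric tumours.",
    "A3": "Other receptor tyrosine kinases in breast cancer.",
}

pattern = build_alias_pattern(ALIASES)
wide = wide_search(abstracts, pattern)
narrow = narrow_search(wide, ["breast cancer"])
print(sorted(wide), sorted(narrow))
```

Adding an alias widens the first pass and adding a context term tightens the second, which is one simple way to iterate on a corpus.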
With the parameters fixed, the text mining can begin. “You can draw from a lot of different resources. By far the most popular source is the medical literature,” said Hill.
Scientists will usually be interested in comparing abstracts. “But sometimes, full text will be irreplaceable, if you’re interested in searching for more information on materials or methods. These don’t always make it into top line results,” Hill said.
However, it remains expensive and extremely time-consuming to retrieve a lot of full-text articles; 60,000 – 70,000 is the higher end, said Hill. “But that’s a real pain to crunch. It’s safer to keep it in the hundreds.”
After the mining algorithm scours the Internet, search results appear in a formatted table with the original query on the left and hits – or extracted relationships – on the right. There is also a column citing where the hits were found.
Then it is up to the scientist to join the dots. The table tells the researcher, “Here are all the PubMed IDs with my diseases; here are the IDs with my genes,” said Hill. It is then possible to interrogate the data to find out whether “certain diseases associated with certain genes are talked about in PubMed.”
What the mining software spits out. Source: Boehringer Ingelheim Pharmaceuticals
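Joining the dots between the disease hits and the gene hits comes down to intersecting two sets of article IDs. The sketch below shows one minimal way to do that, assuming the mining output has already been reduced to a mapping from each term to the IDs where it was found. The gene labels and ID numbers are placeholders, not real PubMed data.

```python
from itertools import product

# Hypothetical mining output: term -> set of article IDs where it was found.
disease_hits = {
    "breast cancer": {101, 102, 103},
    "rheumatoid arthritis": {104, 105},
}
gene_hits = {
    "GENE_A": {102, 103, 106},
    "GENE_B": {105},
    "GENE_C": {107},
}


def co_mentions(disease_hits, gene_hits):
    """Return the article IDs shared by each (disease, gene) pair, if any."""
    pairs = {}
    for (disease, d_ids), (gene, g_ids) in product(disease_hits.items(), gene_hits.items()):
        shared = d_ids & g_ids  # articles mentioning both terms
        if shared:
            pairs[(disease, gene)] = sorted(shared)
    return pairs


links = co_mentions(disease_hits, gene_hits)
for (disease, gene), ids in sorted(links.items()):
    print(f"{disease} + {gene}: {ids}")
```

A shared ID shows only that both terms appear in the same article. Whether the article actually links the gene to the disease is still for the scientist to judge.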
Scientists can make computers do wonderful things, but it is important to understand their limits: a lot of manual effort accompanies automated text mining. “Humans are safe for a few more years,” said Hill.