Integrating and searching Big Data and information – a chemical perspective

Despite all the reports about the emergence of Big Data in areas such as financial services and retail, scientists have been managing big data collections for a long time, especially in astronomy, high-energy physics, chemistry, biochemistry and genomics. I’ve just spent three days in Vienna at the 25th International Conference for Scientific Information Professionals, who in the case of this conference are predominantly chemists working with scientific publications, patents and, increasingly, data sets. With around 70 million chemicals and 100 million patents, searching for the next blockbuster medication is a considerable challenge, as there is a need to search text, chemical structures, data sets and complex chemical names in a range of languages. Chinese is becoming increasingly important, and it is now quite routine to machine-translate a patent and convert perhaps 1000 chemicals cited in the patent into a searchable database in close to real time.

There are also multiple databases to search, and ideally these need to be searched on a federated basis, including in-house and highly sensitive databases of research results. Take a look at the Reaxys Chemical Discovery Engine as one example of the innovation and computing expertise involved, and then note that Reaxys is owned by Elsevier, a publishing company, not an IT vendor. Good knowledge of market requirements is priceless. There are also many specialist suppliers of text mining and data mining technologies that serve not only this market but can also provide solutions for any large database where text and data are equally important. Take a look at Averbis, Intellixir, Linguamatics and Max-Recall as just four examples of the companies exhibiting at the conference. Linguamatics is already developing enhanced search solutions for SharePoint 2013.

What struck me while watching the demonstrations is that companies in this sector are focused on providing actionable information through search. In this sector search is certainly not ‘dead’, as there are no other options. It could be argued that these vendors are mainly in the text and data mining technology sectors, but from the end-user perspective it starts with a search box and a significant amount of clever user-interface technology (drawing a molecule on screen and then searching for it) that is way ahead of the search solutions being used outside of the pharmaceutical sector.

Martin White