Data Analytics & Drug Discovery – Where the Field is Going

Sep 06, 2018

Guest Blog by Loralyn Mears, RowAnalytics


Most of us tend to regard bioinformatics as a relatively new field, presumably nascent at the start of the current century given the rather recent explosion of college-level programs and efforts in the industry. However, few know that the field originated in the early 1960s under the pioneering efforts of the esteemed Professor Margaret Oakley Dayhoff (LINK). As a tribute to her vision, her role in establishing the first-of-its-kind computer database for protein sequences and her efforts in developing mathematical applications coupled with computational techniques to sequence proteins and nucleic acids, she is highly regarded as “the mother and father of bioinformatics”.

However, it wasn’t until 1970 that the field came to be known as “bioinformatics”. The term was coined by scientists Paulien Hogeweg and Ben Hesper, who used it to describe “the study of informatic processes in biotic systems” (LINK).

Current Limitations & Applications

It’s a numbers game. Specifically, a capacity problem. In 2015, more than 40 quadrillion DNA base pairs were sequenced – an amount that would fill a stack of DVDs more than 5 miles high. Storage is now discussed in terms of zettabytes, each of which is a billion terabytes: amounts that are almost unfathomable. The recreational genomics and genealogy company, 23andMe, has sequenced the DNA of more than 24 million customers on its own. Others in the field boast similar numbers. The “cloud” where everyone stores their DNA data must be pretty darn big…
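For readers who like to check the arithmetic, here is a rough back-of-envelope sketch of the DVD comparison. Only the 40 quadrillion figure comes from the text; the one-byte-per-base encoding, the single-layer DVD capacity and the disc thickness are all assumptions chosen for illustration, and a different encoding would change the answer considerably.

```python
# Back-of-envelope sketch. Only BASE_PAIRS_2015 comes from the article;
# every other constant is an assumption made for illustration.
BASE_PAIRS_2015 = 40 * 10**15   # "more than 40 quadrillion" base pairs
BYTES_PER_BASE = 1              # assumption: one byte per base, no quality scores, no compression
DVD_CAPACITY_BYTES = 4.7e9      # assumption: single-layer DVD
DVD_THICKNESS_M = 1.2e-3        # assumption: 1.2 mm per disc
METERS_PER_MILE = 1609.344

# One zettabyte (10**21 bytes) expressed in terabytes (10**12 bytes each).
TERABYTES_PER_ZETTABYTE = 10**21 // 10**12


def dvd_stack_miles(base_pairs, bytes_per_base=BYTES_PER_BASE):
    """Height, in miles, of the DVD stack needed to hold the given number of bases."""
    total_bytes = base_pairs * bytes_per_base
    discs = total_bytes / DVD_CAPACITY_BYTES
    return discs * DVD_THICKNESS_M / METERS_PER_MILE


if __name__ == "__main__":
    print(f"Stack height: {dvd_stack_miles(BASE_PAIRS_2015):.1f} miles")
    print(f"Terabytes per zettabyte: {TERABYTES_PER_ZETTABYTE:,}")
```

Under these assumptions the stack comes out at a little over six miles, which is in line with the "more than 5 miles" comparison.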

Technology has advanced the field so that sequencing is no longer the problem. Whole genomes can be sequenced for as little as $600, whereas the first human genome cost over $1 billion to sequence. Moreover, sequencing the first human genome took 13 years to complete vs. the 24-hour turnaround that is possible today (LINK). So what’s the problem?

Analytics is the challenge. The data sets are so large, and there are so many of them, that the analytics can’t keep up with the data generated. Professor Dayhoff would no doubt have delighted in trying to solve the data deluge problem that beleaguers us all today. Pharma is currently investing its efforts into centralizing its data: collating and aggregating it in one place to streamline access and hence enable comparative analysis (LINK). Data analytics as a field, however, has lagged behind our collective ability to produce data. The limitations are imposed by Eroom’s Law (drug development is slowing down despite tremendous investments and advances in drug discovery technology) and the curse of dimensionality (as the number of variables grows, the space of possible combinations grows so quickly that the available data becomes sparse).

Novartis was among the first big pharma (at least insofar as we know, publicly) to adopt a data lake strategy (LINK) and is taking a lead in the collective effort behind Industry 4.0. Dr. Luca Finelli heads up the Predictive Analytics & Design group within Global Drug Development. His team developed “Nerve Live”, which is a platform that leverages the latest advances in technology, including machine learning, and has been designed to generate, analyze and apply large amounts of data. To keep all that data secure, Novartis created its own cloud, onsite. Finelli’s team has also developed new algorithms and analytics methods to harness actionable insights from it. “It was not an easy journey,” says Dr. Finelli. “Our team had to rethink how we integrate, analyze and use our data.”

Similarly, GSK conducted an internal audit-style inventory last year to identify all of its data silos (LINK). Over 5 petabytes of data were centralized within its own firewall, on site. The company deployed technologies including StreamSets and incorporated bots to enable automated data ingestion. GSK also integrated AtScale technology for virtualization across environments, Zoomdata for visualization, and a combination of Trifacta-Tamr for data wrangling and machine learning curation. This pharma leader also standardized all of its clinical data, retaining that as its own silo for a variety of reasons including patient privacy and the unique aspects of the data contained (LINK). Now that all of its data is in one place, clinical trial data collection has been standardized, and the technology has been set up to ingest, digest and view it all, GSK expects to reduce drug discovery to a two-year process. However, how the company intends to solve the curse of dimensionality remains unknown.

The state of the art is currently two, approaching three, variables analyzed in combination and in parallel. Combinatorial analytics, also referred to as high-dimensional analysis, describes the field of analyzing variables together. With each additional variable, the number of possible combinations roughly doubles, so the effort required to conduct the analysis grows exponentially. This is recognized as the combinatorial explosion problem. Where this is strikingly obvious as a limitation to drug discovery is with respect to the biological complexity of systems – and people, specifically.
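To make the combinatorial explosion concrete, here is a minimal sketch that counts how many combinations there are to examine. It is an illustration of the counting only, not of any vendor's method; the function names and the example of 10,000 variables are hypothetical and not taken from any source cited here.

```python
from math import comb


def k_way_combinations(n_variables, k):
    """Number of distinct groups of exactly k variables chosen from n_variables."""
    return comb(n_variables, k)


def all_nonempty_subsets(n_variables):
    """Number of non-empty groups of any size; roughly doubles with each added variable."""
    return 2**n_variables - 1


if __name__ == "__main__":
    n = 10_000  # hypothetical number of variables (e.g. genetic markers)
    for k in (1, 2, 3, 4):
        print(f"{k}-way combinations of {n:,} variables: {k_way_combinations(n, k):,}")

    # Adding a single variable roughly doubles the total number of groups.
    for m in (10, 11, 12):
        print(f"{m} variables -> {all_nonempty_subsets(m):,} possible groups")
```

Moving from pairs to triples to quadruples multiplies the work by orders of magnitude at each step, which is one way to see why analysis has stalled at two or three variables in combination.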

Diseases are enormously complicated phenomena. The majority are polygenic and progress in response to other biological perturbations, ranging from host genetics to epigenetics, and are commonly susceptible to the effects of lifestyle. And it almost goes without saying that diseases do not work in isolation – at any given time, there are multiple co-morbidities and immune responses occurring concurrently within a person. Plus, there is the obvious uniqueness of the genetics of each person, further exacerbating the complexity of the analysis. Combinatorial analytics may indeed be the current limiting factor in our collective ability to advance drug discovery to keep up with the demands of an aging population and the poorly understood effects of co-morbidity.


Let’s park the combinatorial analytics problem for now. Other challenges, which include the globalization of research and real time analysis, are being addressed. For example, smart boards are enabling visualization, data sharing, and ideation in real time (LINK), which is facilitating enhanced internal and external collaboration. This is particularly important as one of pharma’s biggest challenges is making go/no-go decisions with respect to its thousands of concurrent drug discovery projects. Adopting an IT-enabled platform allows for data-driven decision making. Devices, sensors and even smartphones are being utilized creatively to generate and analyze biometric data in real time, making real world evidence / data available to scientists in new ways.

As efforts continue to advance the field of liquid biopsies and instrumentation that enables onsite (even in the field) collection and analysis of samples, clinical trials will become increasingly efficient. In parallel, patient enrollment (and retention) should rise with improved efficiencies. Extended networks to providers, payors and across academic-industry boundaries are further enabling drug discovery through a combination of data sharing and varied analytical approaches.

However, the exam question that remains to be answered is, if the curse of dimensionality can be solved, what impact will high dimensional analysis / combinatorial analytics have on drug discovery?
