A Holistic Approach to Drug Discovery, Powered by Insitro

11 min read · Oct 13, 2021

This summer, I had the chance to work in the clinical trials space through BenchSci. The company was working on pre-clinical target validation and helping companies design experiments by sourcing and picking the correct biochemical reagents. It did this by using ML to interpret millions of scientific papers and trials, building large and accurate recommender systems.

What I learned in 2 weeks about the drug discovery space shook me, just as it has shaken the several hundred companies working in the space right now.

It costs 2.5 billion dollars per drug, with a success rate of 5%. Clinical trials alone take 6–7 years to complete, and most failures come late, in Phase II–III trials. The average FDA approval rate of new drugs has dropped to 13.8%, and for some of the more challenging therapeutic areas like oncology, the rate is a mere 3.4%.

For the medicines of the future and the promise of most of biotech, it seems that we’ve hit a bottleneck where all the low-hanging fruit has already been picked.

We’ve reached this biological barrier that prevents us from getting successful drugs for some of the most damaging and destructive diseases humanity continues to face.

Or have we? What are some of the more fundamental reasons why discovering new targets and designing effective therapies has become more difficult? Is it our understanding of patient clusters? Is it the lack of data we have on the effects of these drugs? Could it even be the very nature of the targets we’re looking to suppress?

Insitro founding team members, with CTO Aj Kaykas

Insitro begs to differ. In fact, the company is taking a very direct approach to drug discovery by using ML for the most important parts of the value chain. It stands out because it takes a holistic approach to drug discovery.

A bit of background on the company.

Founder and CEO Daphne Koller is an acclaimed ML researcher from Stanford who has made significant contributions to the field of artificial intelligence, most notably for her work on probabilistic models with applications in several fields, specifically computational biology. She was briefly the Chief Computing Officer at Calico, a Google-backed company working on human longevity and healthspan, until she left in 2018 to start Insitro.

Insitro grew out of machine learning’s potential to tackle some of the world’s biggest questions: questions that warranted tasks, and tasks that demanded performance traditional systems failed to deliver. With her experience in ML and computational biology, she recognized that a lot of the fundamental problems in medicine and healthcare were being neglected due to the limitations of biological data, specifically its sparsity.

When we look at the bottleneck we’ve hit in drug discovery, it becomes clear that data-centric approaches need to be a more integral part of the process. Even so, interpreting this data down to the level of the gene is critical to understanding the processes at play in any given disease and designing therapeutics for it.

By leveraging today’s advancements in collecting, analyzing, and interpreting biological data, Insitro aims to unleash the “full potential of modern computational approaches” to solve some of biology’s most notorious problems by addressing the R&D barrier biotech is facing.

How does the company actually do this?

Insitro uses biology and ML at scale to predict which molecules and targets work for which patient clusters. The goal is to speed up the development lifecycle in clinical trials and, especially, to reduce the time lost when a program has to go back to earlier parts of the pipeline.

It does this by deriving insights from its own repertoire of detailed genetic, phenotypic, and clinical data using novel model architectures in machine learning. This ultimately helps these predictive models take into account the underlying architecture and biology of the disease in order to make these predictions.

This level of “interpretable, transferrable learning” in biology is critical because it explains whether the model is actually learning about the specific biological markers that directly contribute to a disease state.

While traditional ML models rely on manually defined features, with a person overseeing the modelling and architecture, Insitro’s models are trained to discover these features in complex biological data and build a hierarchy of them on their own.

Ultimately, by generating their own data and analyzing it, they can help clients navigate a biological manifold to effectively traverse and find new molecules and thus improve the drug discovery process.

They work across the entire pipeline in different ways.

For target selection and validation, ML-generated statistics across human cohorts and iPSC (induced pluripotent stem cell) disease models, built from the cells of human patients, are used to identify new targets.

When it comes to designing the molecule, the company uses multi-parameter optimization for the compound, actively learning to create and design these molecules from prior findings.
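To make "multi-parameter optimization" a bit more concrete, here is a minimal sketch of one common way to do it: score each property with a desirability function and combine the scores with a geometric mean. The property names, target ranges, tolerances, and candidate molecules are all hypothetical values I made up for illustration, and this is not a description of Insitro’s actual scoring.

```python
import math

# Hypothetical ideal ranges (low, high) for each property. Illustrative only.
TARGETS = {
    "potency_pIC50": (7.0, 10.0),
    "logP": (1.0, 3.0),
    "mol_weight": (200.0, 500.0),
}
# Hypothetical distance outside the ideal range at which desirability hits 0.
TOLERANCES = {"potency_pIC50": 2.0, "logP": 2.0, "mol_weight": 200.0}


def desirability(value, low, high, tolerance):
    """Return 1.0 inside [low, high], decaying linearly to 0.0 at `tolerance` outside."""
    if low <= value <= high:
        return 1.0
    distance = low - value if value < low else value - high
    return max(0.0, 1.0 - distance / tolerance)


def mpo_score(props):
    """Geometric mean of per-property desirabilities (0.0 if any property scores 0)."""
    scores = [
        desirability(props[name], *TARGETS[name], TOLERANCES[name])
        for name in TARGETS
    ]
    if min(scores) == 0.0:
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Made-up candidate molecules.
candidates = {
    "mol_A": {"potency_pIC50": 8.1, "logP": 2.2, "mol_weight": 410.0},
    "mol_B": {"potency_pIC50": 9.0, "logP": 4.0, "mol_weight": 450.0},
    "mol_C": {"potency_pIC50": 5.5, "logP": 2.5, "mol_weight": 320.0},
}
ranked = sorted(candidates, key=lambda name: mpo_score(candidates[name]), reverse=True)
```

The geometric mean is a deliberate choice here: a molecule that is terrible on one property cannot be rescued by being excellent on the others, which is usually what you want when a compound has to clear several bars at once.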

And to end off the value chain, clinical strategy is informed by novel, ML-enabled patient segments to identify new clinical biomarkers for different patient states and clusters.

Insitro primarily works on the first 2 parts of the value chain and then creates partnerships with companies like Bristol Myers Squibb to help them with clinical trial design.

Insitro starts the process of target discovery by first identifying a disease and getting data on disease progression. Often, they gather high-content data from human cohorts by collecting their genetically diverse induced pluripotent stem cells.

This allows the company to represent the disease with both human (macro) and cell (micro) models. Genetic and clinical markers are then identified as disease drivers in order to create this manifold representation for patient cluster classification.

Identifying commonalities across patient clusters and their genetic markers provides deeper insight into progressors and regressors of a disease state, and is amplified through the use of ML.

Here, ML helps create dense biomarker representations which reveal key underlying causalities and new levels of heterogeneity in patient populations, at a granularity that is difficult for a human alone to identify.

The most important part of the pipeline is figuring out which genetic markers of interest lead to which kinds of effects in the body. This is known as phenotype classification, where the company uses phenotypes to connect causality with over 17 million genetic variants.

Again, ML is central to the company’s approach because it helps reduce the dimensions of the data being used to make these predictions. For example, while there are a large number of gene-phenotype associations for several variants, the large number of possible genetic targets makes it much more expensive and time-consuming to evaluate many variants to get to the best one. Even then, most of these variants have little to no effect on the disease state on their own and instead have a complex relationship with several other variants.

Modelling these relationships and condensing the feature representation of a disease based on its genetic and phenotypic characteristics is accomplished through Insitro’s effective ML models [often GNNs or Random Forests] which act as an intermediary between interpreting the genetic effect of these variants on the biological processes at play and what phenotypes have a higher causality with these same genes.

A great example that highlights Insitro’s ML-powered drug discovery potential is the company’s major progress in NASH.

Non-alcoholic steatohepatitis (NASH) is a form of non-alcoholic fatty liver disease (NAFLD) where your liver starts to fail due to high levels of fat content. This kills off many of your hepatocytes, which leads to inflammation and fibrosis. More than 24% of the world is affected by NAFLD, and for those who progress to NASH the outcomes are often either carcinoma or liver failure. It’s also one of the most common reasons for liver transplantations globally.

One of the biggest obstacles to getting a drug approved for NASH was understanding the genetic drivers of the disease’s progression. Insitro aimed to tackle this problem by collecting biopsy data (the only available data was 500 patient samples from a clinical trial conducted across just a year) along with other phenotypic markers such as the patient’s transcriptional profile.

Using multiple instance learning with a convolutional neural network, the model was trained on H&E-stained whole-slide images of the biopsies, while pathology scores were used to validate the model.

Using its convolutional kernels and an attention-based pooling mechanism, the model would tile the image, extract important features from each tile, and then aggregate them into biopsy-level features, highlighting where it found potential fibrosis patterns.
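If the tiling and pooling step sounds abstract, here is a minimal sketch of attention-based pooling in plain Python. It follows one common formulation from the multiple instance learning literature (score each tile, softmax the scores, take the weighted sum of tile features). I don’t know the exact architecture Insitro used, so the function name, the parameters V and w, and the toy tile features are all my own assumptions.

```python
import math


def attention_pool(tile_features, V, w):
    """Pool per-tile feature vectors into one biopsy-level vector.

    score_i  = w . tanh(V h_i)
    weight_i = softmax(score)_i
    pooled   = sum_i weight_i * h_i
    """
    scores = []
    for h in tile_features:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(w_j * z_j for w_j, z_j in zip(w, hidden)))

    # Numerically stable softmax over tiles.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(tile_features[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, tile_features)) for k in range(dim)]
    return pooled, weights


# Toy inputs: 3 tiles with 2-dimensional features (stand-ins for CNN embeddings).
tiles = [[0.1, 0.0], [0.0, 0.2], [2.0, 1.5]]
# Hypothetical attention parameters; in a real model these are learned.
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]

slide_vector, attention = attention_pool(tiles, V, w)
```

The attention weights are the interesting part: they sum to 1 across tiles, so you can read them as "how much did each region of the slide contribute to the biopsy-level prediction", which is what makes it possible to point at where the model saw fibrosis-like patterns.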

This approach produced astonishingly strong results, despite the small training set, with scores that directly correlated with pathologist consensus scores. However, what was more impressive was that the model was scoring its samples with a deep understanding of the biological drivers at play.

When plotted against 60 blood biomarkers and gene expression data, the model’s predictions had a very high correlation with the disease-relevant biomarkers. Essentially, the prediction scores it generated from the biopsy samples by observing the fibrosis patterns directly correlated with the genetic and phenotypic biomarker data that was not included in the training set. This indicated that the model was able to predict the fibrosis by looking at the underlying biological causation of the disease, without any other knowledge of the disease-relevant biomarkers.
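For a sense of what "correlated with held-out biomarkers" means in practice, here is a minimal sketch that computes a Pearson correlation between model scores and a biomarker the model never saw. The source doesn’t say which correlation measure was used, so Pearson is my assumption, and the numbers below are made-up toy values, not Insitro’s data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: model fibrosis scores per patient, and one blood biomarker
# that was NOT part of the training data (both hypothetical).
model_scores = [1.2, 2.1, 2.9, 3.8]
held_out_biomarker = [10.0, 14.0, 19.0, 25.0]

r = pearson_r(model_scores, held_out_biomarker)
```

Repeating this across every biomarker gives you one correlation per biomarker, and a model that has really picked up on the disease biology should correlate most strongly with the disease-relevant ones.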

This task was unimaginable for most ML approaches, and greatly exceeded a pathologist’s ability to interpret the biopsy samples.

The reason was the model’s ability to observe subtle changes across the different biopsy samples over a 1-year timeline. It was often believed that fibrosis progression (measured on an F1–F4 scale) was minimal across a small time frame like 1 year, but the model was able to observe subtle changes in the progression of the fibrosis at a greater resolution. This gave it a better way of predicting the progression or regression of a patient’s fibrosis (i.e. while pathologists classified each sample as F3 over a year’s time for Patient A, the model uncovered a subtle regression from F3 to F2 later on in the year).

This helped Insitro find new genetic drivers at the genome level, which were validated by the patient clusters with the corresponding disease state. The ability of ML to pick up on this granularity and turn it into biological insights has profound implications for the company’s capacity to discover new targets, because it is training models which are really learning about the biology behind a disease.

For instances where there isn’t enough data [i.e. a lot of Insitro’s work in neuroscience], the company devises cell-based models where they collect cell samples from multiple patients at different stages of the disease.

These clusters and stages are differentiated based on cell lineage given the disease and are then measured at the cellular level to collect more data on the disease at the phenotypic level. Cell models are especially useful because ML models can collect data on not just the biological drivers of a disease, but also get insights on a drug’s ability to perturb cells and additionally screen interventions to test therapeutics and generate this feedback loop.

For NASH, Insitro was able to identify several patterns between the nucleus, membrane, and lipids of a patient’s hepatocytes concerning the presence of NASH, and the same ML models could identify new correlations between specific genetic interventions (i.e. CRISPR screening) and their effects. Insights from this kind of platform can be used to inform therapeutic design and dramatically improve the odds of success for a new molecule.

To design these new molecules, Insitro uses DNA-encoded libraries to create a massive dataset of possible molecules that bind to a specific protein that was previously identified as a target in the pipeline.

These libraries act as the input to ML models for predicting which molecules are good binders to the protein, and this allows the molecule-target relationship to be represented in a smaller parameter space. By selecting and switching between these different relationships, the company can quickly iterate around the molecule target discovery loop.
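To show the general shape of "library in, binder prediction out", here is a deliberately simple sketch. Instead of a GNN or random forest, I’ve swapped in a nearest-neighbour vote using Tanimoto similarity on fingerprint bits, since it fits in a few lines. The fingerprints, labels, and function names are all hypothetical toy values, not real screening data or Insitro’s method.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_binder(query, library, k=3):
    """Fraction of binders among the k library members most similar to `query`."""
    ranked = sorted(library, key=lambda entry: tanimoto(query, entry[0]), reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy library: (fingerprint on-bits, label), where 1 = binder and 0 = non-binder.
# In a real setting the labels would come from the library screen itself.
library = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 9}, 1),
    ({2, 3, 4, 5}, 1),
    ({10, 11, 12}, 0),
    ({11, 12, 13}, 0),
    ({10, 12, 14}, 0),
]

score_like_binders = predict_binder({1, 2, 3, 5}, library)
score_like_decoys = predict_binder({10, 11, 13}, library)
```

The idea carries over to heavier models: the library gives you labelled examples of what binds and what doesn’t, and the model’s job is to generalize that to molecules that were never physically screened.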

What has it achieved, and where’s it going next?

By combining a lot of the intersectional technologies that are evolving at an exponential rate, from our ability to edit genes to cultivating stem cells, and even building extremely large, interpretable machine learning models, Insitro is aiming to implement the next generation of iterative drug discovery.

In 2020, the company raised $143 million in an oversubscribed Series B, and it has continued to scale up its ML-driven disease models for several new areas, especially CNS disease. This year, it raised another $400 million to carry forward its partnerships with companies like Gilead and Bristol Myers Squibb and to expand its ability to move towards targets for medicines.

With its expansion and long-term contracts, it is clear that Insitro is continuing to attract major interest from big pharma, and that the company has the potential to de-risk therapeutic areas like oncology, neurogenetics, ophthalmology, immunodeficiency, and more.

It’s pretty nuts that in a field where ML continues to be based solely on the data available, companies like Insitro are taking a very hands-on approach to generating high quality data and going deeper into the biological processes and relationships to effectively make ML in bio much more valuable.

Hopefully, I might be able to ask the company some more questions about that final step at SynBioBeta :)

Thanks for taking the time to read this and I hope you got something out of it. If you want to get more technical or simply reach out to me, you can find me on LinkedIn, Email, or GitHub. Also Website [in works because nothing works :)]. You can also subscribe to my newsletter here.

What I’m working on right now.

Currently working on using ML to predict optimal cap fold structure to optimize effectiveness of mRNA therapies.

Written by Dev Patel