We probably won’t be able to treat the next deadly disease

Here’s what we need to do to change that

Photo from SMANews

Humans are fragile. That’s not to say that we’re meat sacks waiting to decay and die, but we’re only able to survive the pathogenic conditions of 21st-century society thanks to modern-day medicine.

We’ve come a long way from using charms and bloodletting to cure disease, but what if I told you that we’re still largely clueless about where many diseases come from, let alone how to cure the complex disorders we face?

Scientists only completed the Human Genome Project in 2003, and we haven’t even scratched the surface of the human microbiome and epigenome.

There’s still a long road ahead before we truly understand the underlying mechanisms of complex diseases, from Alzheimer’s to cancer, let alone treat them with a high degree of effectiveness.

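Take tuberculosis: a lung disease that can stay hidden for years, is estimated to infect between a quarter and a third of the world’s population in its latent form, and has strains that are resistant to the standard antibiotic treatments.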

Now that’s just TB, but imagine the tens of thousands of incurable diseases that millions of people across the world have to deal with.

Over time, we’ve come to recognize the biological molecules that drive our cells’ processes, and we’ve studied diseases and their effects on cells through the omics. You’ve probably heard of some of these: genomics, proteomics, transcriptomics, metabolomics, etc.

Genomics has risen in popularity more recently, but the systems we use to detect, understand, and predict variations in our genetic code showcase our lack of understanding of the greater human body and how these different complex systems work together.

By better understanding the complex interactions of our body’s internal systems, we can better observe the effects of diseases on our bodies and formulate potential treatments to offset those effects. But this also has its limits.

Thankfully, we have biomarkers.

Biomarkers are biological features that provide information on a disease or a response to a treatment by looking at the state of the cell.

It’s really just anything that you can observe and measure that connects back to a disease. The reason it’s so hard to find biomarkers is that our body consists of so many complex systems and is subject to countless factors that influence the millions of processes occurring at once. This makes it really hard to figure out whether what you’re observing is actually caused by a disease.

A biomarker shouldn’t be confused with a feature, which is the measure of a specific biological component and its state (e.g. a protein, gene, or metabolite). Only features that are indicative of a disease are biomarkers.

Biomarkers aren’t necessarily new. In fact, clinicians use biomarkers on a daily basis to diagnose patients and look for signs that might raise any flags.

Photo from E Tothill. Techniques used for biomarker identification and detection. Abbreviations: FISH, fluorescence in situ hybridization; IHC, immunohistochemistry.

Despite this, the biomarkers we have right now are fairly limited in utility, and there’s still much work to be done in discovering novel biomarkers that can give us new insights across the different domains of health.

With biomarkers, many researchers feel that a simple blood test could allow a doctor to predict a patient’s likelihood of genetic disorders. The issue is that most biomarkers that we’ve found aren’t all that useful.

Okay great, so how do we find these new biomarkers?

Informatics and data science. Machine learning and advancements in data preprocessing have the opportunity to identify some of the most subtle yet necessary biomarkers for drug discovery and targeted therapy.

Photo from Dana Bazazeh, Raed M. Shubair, W. Q. Malik (2016) Paper: Biomarker discovery and validation for Parkinson’s Disease: A machine learning approach

If we could create an informatics pipeline that learns about the human body through an omics approach, we could start to solve the mysteries of how our bodies work and how diseases disrupt them.

Currently, less than 1% of published biomarkers actually enter clinical practice, and most failures trace back to the discovery phase, which we’ll get into a little later.

What’s important to recognize is that we’re already using data science in the field of biomarker discovery, but the issue really comes down to the overlap of these algorithms across disciplines.

This is why the biomarkers that are commonly used in clinics are built upon the idea of hypothesis-based discovery.

For example, we know that diabetes leads to an increase in blood glucose levels. Based on this fact, scientists were able to identify glycated hemoglobin (HbA1c) as a biomarker: instead of only measuring the concentration of glucose in a patient’s blood at a single moment, an HbA1c test reflects their average blood glucose over the previous two to three months.

The more promising, and more challenging, direction is observing how the presence of a molecule changes with respect to a disease (evaluating features by changing them).

For breast and ovarian cancer, a similar procedure led to the discovery of the cancer-associated gene BRCA1, whose harmful mutations are linked to a higher risk of both cancers.

This is great and all, but a common issue with finding new insights in diseases is actually trying to identify which biomarkers are important and how different biomarkers interact with each other across different domains in biology, hence the multiomics approach.

The diseases that will get the most value out of biomarker discovery are the complex ones, where a complex disease is one whose cause is polygenic, or one that falls under a group of polygenic conditions (most commonly tied to gene sets).

This is probably the most important part of the biomarker predicament so pay attention.

Remember how I talked about how scientists still need to identify the connection between multiple biomarkers to better understand complex diseases?

The reason that is important is because of the very nature of diseases.

Most current processes for biomarker discovery actually work very well but only for certain diseases whose causes are very specific and fundamentally invariable.

Take the Ova1 In-Vitro Diagnostic Multivariate Index Assay (a test that combines the measurements of several biomarkers into a single score for use in clinical practice). The Ova1 Index was created by comparing plasma proteins between women with ovarian cancer and those who had benign tumours. Through a simple neural network, the model was able to derive a panel of 5 biomarkers that were far better than the ones in use in clinics.

The model was only able to do this because cancer requires high levels of specificity to distinguish between malignant and benign tumour masses. This bar of specificity and invariability is not the same for the millions of other complex diseases out there.

What about the subtle, heterogeneous and polygenic diseases where potential biomarkers aren’t clear-cut mutations, but signals that are hard to spot or all too common?

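Current biomedical processes require a balance between specificity and sensitivity to identify markers of interest that are actually useful indicators of a given disease. Sensitivity is the share of people with the disease that a test correctly flags, and specificity is the share of people without it that the test correctly clears.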

Photo from Wikipedia: Measures of performance for medical testing. Simple diagram to demonstrate sensitivity and specificity of models.
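
To make those two measures concrete, here’s a minimal sketch in Python. The function names and the counts are made up purely for illustration; they don’t come from any real test or study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of people who have the disease that the test flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of people without the disease that the test clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for an imaginary biomarker test (not real data).
tp, fn = 90, 10    # 100 people with the disease
tn, fp = 160, 40   # 200 people without it

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.90
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.80
```

A marker can look great on one of these numbers and terrible on the other, which is why both need to be reported together.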

This is where the promise of multiomics comes into play, but limitations come from the inability to transfer findings and knowledge across the different disciplines.

Why what we’re doing right now isn’t working.

The current approach for finding biomarkers is sifting through large quantitative datasets using algorithms like SVMs (support vector machines), k-nearest neighbors, and classification trees, as has been done for ovarian cancer.
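
To give a feel for the simplest of these, here’s a rough sketch of a k-nearest neighbors classifier in plain Python. The two "marker" measurements per sample, the labels, and the function name knn_predict are all invented for illustration, and a real pipeline would most likely use an established library rather than a hand-rolled version like this.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k closest training samples."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up measurements of two hypothetical markers per sample (not real data).
train_x = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_x, train_y, (0.9, 1.1)))  # benign
print(knn_predict(train_x, train_y, (3.0, 3.0)))  # malignant
```

With two clean clusters this is trivial. The trouble starts when there are thousands of features per sample and only a handful of them matter.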

An ML-centered approach has been able to find leptin, prolactin, osteopontin, and IGF-II as new potential biomarkers, achieving 95.3% sensitivity and 99.4% specificity.

The hardest part, however, is not the actual architecture of these algorithms, but rather how these models actually learn and leverage these findings across the omics.

Photo from McDermot, Wang, Mitchel; Paper: Table of different ML approaches for biomarker identification primarily used for breast cancer discovery.

Right now, the primary challenge is the number of variables and quantity of data required to actually identify important biomarkers. It’s very difficult to reduce the complexity of data in molecular biology, where the nuances that are essential for novel insights can easily be disregarded.

SVMs are really popular right now for high-density datasets that involve overlap between genetic data and mRNA transcripts. The approach was especially useful for ranking proteomic data and finding “candidate” biomarkers for ovarian cancers.

Random forest methods are especially useful for assessing biomarkers for low-sensitivity requirements such as Alzheimer’s disease.

Visualization methods through PCA are also proving to be especially valuable for data in genetics and identifying cell downregulation.
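
For the curious, here’s a bare-bones sketch of what PCA does under the hood: find the direction along which the data varies the most, then describe each sample by its position along that direction. This version uses power iteration in plain Python, which is just one simple way to get the first component. The gene expression values and the function name first_component are made up, and real analyses would typically lean on a numerical library instead.

```python
import math


def first_component(rows, iterations=100):
    """Return the first principal component and each row's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalize.
    # Assumes the starting vector is not orthogonal to the first component.
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    scores = [sum(c * v for c, v in zip(row, vec)) for row in centered]
    return vec, scores


# Made-up expression levels of two hypothetical genes in four samples.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
component, scores = first_component(samples)
print(component)  # direction of greatest variance
print(scores)     # one coordinate per sample instead of two
```

Going from two numbers per sample to one isn’t impressive, but the same idea is what lets you squash thousands of gene measurements down to a 2D plot you can actually look at.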

Beyond data-driven approaches, there are protein-protein interaction studies, which map different proteins and their connections across a larger database of structures and functions.

There’s also pathway analysis, which looks at the different expressions of the same gene and tries to associate different biomarkers with these expressions, but these methods are often prone to noise.

For all of the methods I’ve listed above, the primary issues really come down to figuring out what’s important and what’s not in the context of datasets consisting of millions of different data points.
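
One very basic way to take a first pass at "what’s important" is to score each feature by how far apart the patient and control groups sit, relative to how noisy the measurements are, and then sort. The sketch below does exactly that. It’s a simple univariate stand-in, not the SVM-based ranking mentioned earlier, and the feature names, values, and the separation_score function are all hypothetical.

```python
from statistics import mean, stdev


def separation_score(disease_values, control_values):
    """Absolute difference in group means, scaled by the average spread."""
    spread = (stdev(disease_values) + stdev(control_values)) / 2
    return abs(mean(disease_values) - mean(control_values)) / spread


# Made-up measurements: feature name -> (values in patients, values in controls).
features = {
    "protein_A": ([5.1, 5.4, 4.9, 5.3], [2.0, 2.2, 1.9, 2.1]),
    "protein_B": ([3.0, 3.4, 2.7, 3.1], [2.9, 3.3, 2.8, 3.2]),
    "metabolite_C": ([1.5, 1.9, 1.2, 1.7], [1.0, 1.3, 0.9, 1.2]),
}

ranked = sorted(
    features,
    key=lambda name: separation_score(*features[name]),
    reverse=True,
)
print(ranked)  # ['protein_A', 'metabolite_C', 'protein_B']
```

Scoring features one at a time like this misses markers that only matter in combination, which is a big part of why the multiomics problem is so hard.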

Even then, the current data is dirty and has biased sensitivity and specificity because of differing sampling methods.

Furthermore, each person’s physiology differs dramatically, and the algorithms we currently have access to are prone to overfitting, making it extremely difficult to generalize across a large population and a field as diverse as the omics.
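
Here’s a small sketch of what overfitting looks like in practice. The data below is pure random noise, so there is nothing real to learn, yet a 1-nearest-neighbor model still scores perfectly on the samples it has already seen. Everything here (the sample sizes, the 20 features, the function names) is an assumption made up for the demo.

```python
import math
import random


def nearest_label(train, point):
    """1-nearest-neighbor: copy the label of the closest training sample."""
    return min(train, key=lambda sample: math.dist(sample[0], point))[1]


def accuracy(train, samples):
    """Share of `samples` whose label the model predicts correctly."""
    hits = sum(nearest_label(train, x) == y for x, y in samples)
    return hits / len(samples)


def random_samples(rng, n, n_features=20):
    """Pure noise: the features carry no information about the label."""
    return [([rng.random() for _ in range(n_features)], rng.choice([0, 1]))
            for _ in range(n)]


rng = random.Random(0)
train = random_samples(rng, 50)
held_out = random_samples(rng, 200)

train_acc = accuracy(train, train)        # every sample matches itself: 1.0
held_out_acc = accuracy(train, held_out)  # hovers around chance (about 0.5)
print(train_acc, held_out_acc)
```

That gap between the two numbers is the warning sign. With few patients and many features, a model can look flawless on paper while having learned nothing that carries over to the next person.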

The ideal solution to all of this is a meta-algorithm that works across countless datasets, but that’s still a long way off.

If, however, we manage to find these biomarkers through advancements in machine learning and visualization methods with the right balance between sensitivity and specificity, we could get the next breakthrough in non-invasive medical diagnostics and treatment technology.

It’s a little disheartening, but we know where we need to go.

Currently, there are many problems with biomarker discovery as we’re limited by our data analysis capabilities and access to clean data.

Even so, we know how to tackle the problem of biomarker discovery, and the field needs far more attention if we are to make a dent in treating complex diseases.

In a future with better biomarker discovery algorithms and pipelines, personalized medicine can become mainstream, the diseases that plague the lives of millions, from neurodegenerative disorders to genetic conditions, can be treated with a high degree of effectiveness, and we could get closer to a future where we can actually cure diseases. For good.

Thanks for taking the time to read this and I hope you got something out of it. If you want to get more technical or simply reach out to me, you can find me on LinkedIn, Email, or GitHub. You can also subscribe to my newsletter here.

ML X SynthBio | Looking to learn, grow and build the ideas of the future into reality. | 15y/o