AI into the Biological Unknown: Rare Genetic Diseases are about to become less Rare

15 min read · Oct 12, 2020

We all wish to be different or unique in our lives, hoping our strengths will set us apart from the rest. However, some of us are born not with rare gifts, but unfortunately with rare diseases. More than 350 million people across the world are living with one of the 7,000 rare genetic diseases (RD) discovered so far, 75% of whom are children. Despite these staggering numbers, rare genetic disorders are severely underemphasized in the medical field, as each disorder must follow an extensive and expensive detection and treatment process that many patients across the world do not have access to.

Out of these diseases, only 5% come with a cure, and the majority cannot be diagnosed effectively enough to offset their deadly symptoms. Drug discovery, diagnosis, and all of the other parts of the pipeline are plagued by countless issues that make it harder to make real progress in recovery: low allele frequencies, inaccurate diagnostic rates, under-catalogued data, geographic barriers, and expensive procedures not only prevent care, but also make it far more challenging to run extensive tests. Why might this be? Well, this is a genetics problem, and genetics is a field of discrepancy, irregularity, and immeasurability. Scientists have to look at countless features while trying to isolate a small sequence from millions of others in our DNA.

“One of the most important benefits of AI is to create trends and build general relationships between large feature sets to predict, produce, and implement novel solutions.”

The same issues that plague rare genetic disorder care can be combatted with the many supervised and unsupervised learning algorithms that leverage the vastness of genetic data. Including AI offers several advantages that address the pitfalls of current rare disease progress:

Interpret complex patterns between the variety of features in genetic disorders
Order, forecast, and predict the onset of these diseases with a relatively high accuracy
Work with more datapoints and account for sensitivity and errors in current processes
Advance research and progress in terms of targeted therapeutics and novel discoveries for specific genetic alterations

But before we get into this, we must understand the origin of genetic diseases and why they’re such a difficult problem to tackle.

DNA and the Origin of Rare Genetic Disorders

Structure of DNA -> Nucleotides and Unravelled Helix

Our genes provide the necessary functionality in our cells, which is determined by the DNA strands they are made up of and the unique nucleotide orders that make up that DNA. DNA is continuously replicated, transcribed, and translated into proteins in order for new cells to be created. Genetic disorders can be caused by variations or mutations in these genetic sequences, whether monogenic or multigenic. Many of us carry genetic mutations that are autosomal recessive, meaning a second copy of the gene still has the correct genetic expression. However, severe complications can occur when these mutations are autosomal dominant, where a single altered copy is enough to cause disease.

For rare genetic disorders, it becomes more difficult to account for the interaction of different genes and their respective mutations. Additionally, identifying the genotypes and structural sequences where these mutations occur is another part of the problem. However, multi-omics data approaches and next generation sequencing are able to be paired with AI in order to correctly classify, predict, and diagnose a wide variety of rare genetic disorders using a diverse array of AI models. This approach can be used for the following:

Diagnosis and prognosis
Disease classification and characterization
Therapeutics approaches
Patient registries and DDSS integration

With this in mind, a look at the current applications of AI across the wide variety of rare genetic disorders can shed some light on how computation and statistics are being applied to a problem as sensitive as life or death.

Variant Calling and Classification of Genetic Disorders

Non synonymous single nucleotide variant in a DNA strand

The identification of disease-causing genetic variations is critical for diagnosis and disease prediction. Advances in AI are making this process affordable and indispensable, yet its most promising advantage is the ability to discover new variants more precisely than conventional diagnostic methods.

Calling: Diagnosis of Non-Synonymous Single Nucleotide Variants

One of the most common sources of rare genetic diseases is the presence of non-synonymous single nucleotide variants (SNVs). Non-synonymous means that the change in the genetic sequence alters the amino acid a codon encodes, which can alter the function of the protein. An SNV is a substitution of a single nucleotide; insertions and deletions of nucleotides are a related class of small variants. Identifying such variants amongst the millions of sequences in each genome requires great accuracy. Many of the current standard variant-calling tools are prone to systematic errors associated with the subtleties of sample preparation, amplification, and next generation sequencing. To improve accuracy, these systems draw on strand bias (whether a candidate variant is supported by reads from both the forward and reverse strands of the DNA) and population-level dependencies (patterns in how variants are distributed across the population) to make informed decisions and verify probabilities. AI algorithms can analyze large data sets, learn these biases, and work alongside other statistical methods on individual genomes to make these variant calls accurate.
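To make the strand bias idea concrete, here is a minimal sketch that tests whether the reads supporting a variant are spread evenly across the forward and reverse strands. It uses a two-sided Fisher's exact test on a 2x2 table of read counts, which is one common way to quantify this and is an assumption here, not a description of any particular caller. The function names and the read counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the observed one.
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def strand_bias_pvalue(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Small p-values mean the alt allele sits suspiciously on one strand."""
    return fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)


# Hypothetical read counts: (ref forward, ref reverse, alt forward, alt reverse).
balanced = strand_bias_pvalue(20, 20, 10, 10)  # alt reads on both strands
skewed = strand_bias_pvalue(20, 20, 20, 0)     # alt reads only on the forward strand
```

A candidate whose supporting reads all come from one strand, like the second example, is more likely to be a sequencing artifact than a true variant, which is the kind of signal a learned caller can pick up on its own.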

Google’s DeepVariant is a CNN system that turns variant calling into an image classification task: it encodes the reads aligned to a reference genome around each candidate site as an image and classifies what that image shows. It has been shown to outperform standard tools on variant calling, including GATK, the gold standard for variant identification. The cost of sequencing the human genome has decreased dramatically, but accessibility is still a concern and has led to the development of projects launched alongside variant calling. By utilizing deep neural nets, DeepVariant is able to perform variant identification with superior accuracy, using learned kernels to pick out patterns in the alignments, yet it comes at an intensive computational cost.
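As a rough illustration of what "turning reads into an image" can mean, the sketch below lays aligned reads out as rows over a reference window and records, for each position, which base the read carries and whether it disagrees with the reference. This is a deliberately simplified encoding of my own; the real DeepVariant input has more channels, and `encode_pileup` and `BASE_CODE` are hypothetical names.

```python
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_pileup(reference: str, reads: list[tuple[int, str]]) -> list[list[tuple[int, int]]]:
    """Return one row per read; each cell is (base_code, mismatch_flag)."""
    width = len(reference)
    rows = []
    for start, seq in reads:
        row = [(0, 0)] * width  # (0, 0) marks positions the read does not cover
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < width:
                mismatch = int(base != reference[pos])
                row[pos] = (BASE_CODE.get(base, 0), mismatch)
        rows.append(row)
    return rows


# Toy data: each read is (start position on the reference, read sequence).
reference = "ACGTACGT"
reads = [(0, "ACGTA"), (2, "GTTCGT"), (3, "TTCG")]

pileup = encode_pileup(reference, reads)
# Number of reads disagreeing with the reference at each position.
mismatch_depth = [sum(row[i][1] for row in pileup) for i in range(len(reference))]
```

In this toy example two of the three reads carry a T where the reference has an A at position 4, so that column lights up in the mismatch channel. A CNN looking at many such grids can learn which patterns are real variants and which are noise.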

Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors.
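Before deep learning, the classic way to answer the question in the caption above was a small probabilistic model, and it is a useful baseline to keep in mind. The sketch below assumes each read independently shows the alternate allele with a probability that depends on the genotype, and that all three genotypes are equally likely beforehand. Both assumptions, the error rate, and the function names are mine, not taken from any specific tool.

```python
from math import comb


def genotype_posteriors(alt_reads: int, depth: int, error_rate: float = 0.01) -> dict[str, float]:
    """Posterior over diploid genotypes under a binomial read model and a flat prior."""
    # Expected fraction of alt-supporting reads under each genotype.
    expected_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    likelihoods = {
        genotype: comb(depth, alt_reads) * p**alt_reads * (1 - p) ** (depth - alt_reads)
        for genotype, p in expected_alt.items()
    }
    total = sum(likelihoods.values())
    return {genotype: lk / total for genotype, lk in likelihoods.items()}


def call_genotype(alt_reads: int, depth: int, error_rate: float = 0.01) -> str:
    """Pick the most probable genotype: neither, one, or both chromosomes."""
    posteriors = genotype_posteriors(alt_reads, depth, error_rate)
    return max(posteriors, key=posteriors.get)


# Hypothetical sites, each covered by 30 reads.
on_neither = call_genotype(1, 30)    # a single stray alt read, likely an error
on_one = call_genotype(14, 30)       # about half the reads carry the variant
on_both = call_genotype(29, 30)      # nearly every read carries the variant
```

A model like this treats every read as equally trustworthy, which is exactly the simplification that a learned caller is meant to improve on.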

VEST (Variant Effect Scoring Tool) is another important method. It uses a random forest, an ensemble of decorrelated decision trees, to predict the pathogenicity of non-synonymous substitution mutations observed through genome sequencing. The voting system in the random forest allows certain point mutations to be prioritized and scored. The training dataset comprises up to 45,000 disease mutations, and VEST can outperform some of the most popular methods for non-synonymous substitutions. Additionally, the architecture of VEST allows experimental assessment of protein activity and lets the system evaluate the functional impact of non-synonymous changes on proteins far more effectively. It also helps reduce and simplify the list of candidate mutations produced by parallel sequencing, filtering neutral mutations from pathogenic ones, which matters because multigenic diseases are often a collection of different variants.
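The voting idea can be shown with a toy ensemble. The sketch below is not VEST: it swaps full decision trees for one-split "stumps" to stay short, and the two features and eight training variants are invented for illustration. What it does share with a random forest is the mechanics of bootstrap resampling, random feature choice, and scoring a variant by the fraction of trees that vote "pathogenic".

```python
import random


def fit_stump(rows, labels, rng):
    """Fit a one-split tree on a single randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        # Polarity +1 rule: call "pathogenic" (1) when the value exceeds the threshold.
        correct = sum((row[feature] > threshold) == bool(y) for row, y in zip(rows, labels))
        # Polarity -1 is the inverted rule, correct exactly where +1 is wrong.
        for polarity, score in ((1, correct), (-1, len(rows) - correct)):
            if best is None or score > best[0]:
                best = (score, threshold, polarity)
    return feature, best[1], best[2]


def stump_vote(stump, row):
    feature, threshold, polarity = stump
    above = row[feature] > threshold
    return int(above) if polarity == 1 else int(not above)


def fit_forest(rows, labels, n_trees=101, seed=0):
    rng = random.Random(seed)
    forest = []
    while len(forest) < n_trees:
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap resample
        if len({labels[i] for i in sample}) < 2:
            continue  # skip resamples that contain only one class
        forest.append(fit_stump([rows[i] for i in sample], [labels[i] for i in sample], rng))
    return forest


def pathogenicity_score(forest, row):
    """Fraction of trees that vote 'pathogenic' for this variant."""
    return sum(stump_vote(stump, row) for stump in forest) / len(forest)


# Hypothetical features per variant: (conservation, predicted protein impact).
rows = [(0.90, 0.80), (0.95, 0.70), (0.85, 0.90), (0.80, 0.75),
        (0.20, 0.10), (0.30, 0.20), (0.10, 0.15), (0.25, 0.30)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = pathogenic, 0 = neutral

forest = fit_forest(rows, labels)
likely_pathogenic = pathogenicity_score(forest, (0.99, 0.95))
likely_neutral = pathogenicity_score(forest, (0.05, 0.05))
```

Because the score is a vote fraction rather than a hard label, candidate mutations can be ranked and the top of the list prioritized, which is the behaviour described above.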

VEST simulation results with different effect sizes (magnitude of sensitivity); shows that with different sizes of diseases, lower sample sizes yield very accurate results.

VEST was applied to Freeman-Sheldon syndrome, or whistling face syndrome (a congenital disorder inherited from parents), and identified the autosomal variant among the causal genes with a score of 87%. VEST and other ML models are increasingly useful in congenital rare genetic diseases because they do not require high allele frequencies and can draw accurate conclusions with minimal data.

Classification: Predicting Synonymous Single Nucleotide Variants

Apart from variant calling, AI-driven predictions made from DNA and genome sequences are especially critical for classifying the type of rare genetic disorder and giving the correct care. Tools like ClinPred and PrimateAI often rely on ensemble classifiers such as random forests and gradient boosting, along with techniques that give the model more training data. One downside, however, is that these models often build their predictions off of the genetic sequences rather than their complex interactions and functions.

CNNs and other models analyze the individual motifs rather than the complete structure and their unique interactions; a motif is a nucleotide or amino-acid sequence pattern that is common and has a significant role in the cell’s basic functionality. Understanding the structure can help identify the specialization the individual sequences have and can pinpoint certain genotype and phenotype relations.

This means that the models are sequence-based, not structure-based. Structure matters for determining how certain genes store information and for identifying general sites, which is especially important when detecting points in the DNA sequence where potential markers signify disease relevance.

Additionally, most algorithms and procedures focus on non-synonymous SNVs in order to concentrate on variants that alter the amino acids of a protein. Synonymous SNVs are often ignored because they change an individual codon in the DNA and mRNA without changing the amino acid it encodes, but recent studies have found that they can be connected to congenital disorders and can be the root cause of DNA replication mutations such as copy number variations (where variations are copied across new, otherwise identical genetic material). The problem, however, comes with identifying pathogenic synonymous SNVs (SSNVs), as they are rare and hard to distinguish from their neutral counterparts.
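The difference between synonymous and non-synonymous variants comes straight out of the genetic code, so it is easy to check in a few lines. The sketch below builds the standard codon table and labels a single-base substitution in a coding sequence. The function name and the short example sequence are hypothetical, and it assumes the sequence starts in frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snv(coding_seq: str, position: int, alt_base: str) -> tuple[str, str, str]:
    """Label a substitution at a 0-based position of an in-frame coding sequence."""
    offset = position % 3
    codon_start = position - offset
    ref_codon = coding_seq[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"  # non-synonymous: introduces a premature stop
    else:
        kind = "missense"  # non-synonymous: swaps one amino acid for another
    return kind, ref_aa, alt_aa


coding_seq = "ATGGAAGTT"  # Met-Glu-Val

silent = classify_snv(coding_seq, 5, "G")    # GAA -> GAG, still Glu
missense = classify_snv(coding_seq, 4, "T")  # GAA -> GTA, Glu -> Val
nonsense = classify_snv(coding_seq, 3, "T")  # GAA -> TAA, Glu -> stop
```

Notice that the silent change is invisible at the protein level, which is why a classifier for pathogenic synonymous variants has to look at other signals instead.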

Silent Variant Analysis is a powerful unsupervised learning algorithm that not only looks at the specific sequence orders but also assesses multiple feature sets to classify SSNVs more accurately than comparable ML algorithms. It achieves this by analyzing and accounting for sequence conservation over countless copies of DNA, splice factor motifs, codon prevalence, and donor/acceptor sites to separate the pathogenic SSNVs from the rest. The model was used to identify 7 Meckel’s syndrome families and 12 SSNVs, and despite the small dataset, Silent Variant Analysis was able to accurately classify certain pathogenic SSNVs from others.

SVM results for identifying sequence variants in sequencing data, spitting out information on where the specific variant is located in the sequence. Compared to GATK, the industry standard for identifying single nucleotide polymorphism, it can achieve slightly better results for the cost proposition.

Alongside unsupervised learning, supervised algorithms are useful for identifying relationships between variants and the structure, function, and pathology of proteins to build on training models. For example, support vector machines are used to predict single nucleotide polymorphisms (discontinuous genetic variation) in diseases like common variable immunodeficiency. The benefit of this approach is its ability to handle sensitivity issues, mistakes in sequences, and misaligned readings, which conventional methods fail to account for.

Using AI to identify both non-synonymous and synonymous SNVs has boosted accuracy and allowed more data to be included, leading to more accurate diagnoses at faster speeds than ever. Moreover, it allows bioinformatics researchers to better utilize this data to find connections between variables and enable transfer learning across systems.

Multigenic Mutations in NonCoding DNA Diagnosis

Diagram of a global genetic interaction network and its parts

Our DNA is made up of 2 distinct types, coding and noncoding, where coding DNA is responsible for encoding proteins. Noncoding regions, on the other hand, do not encode proteins but help regulate the activity of certain genes. Only about 1% of our DNA is coding while the rest is noncoding, yet we often underestimate the role noncoding DNA can have in the onset of deadly genetic disorders.

Splicing Mutations: Identifying Pathogenic Noncoding DNA Sites

Despite how little attention noncoding DNA receives, it accounts for 90% of splicing mutations, which are mutations that can change the number of nucleotides in a gene and potentially change its function at the specific site. Identifying these sites is difficult because of the complexity and range of sequences that make up noncoding DNA.

Splicing occurs as precursor mRNA is processed into mature mRNA. During this process, introns (noncoding segments) are removed from the RNA and exons (coding segments) are joined together at donor and acceptor sites. Splicing defects make up 10% of rare genetic disorders, but they can be difficult to identify because of the complexity of intronic and exonic splicing enhancers and other DNA interactions during transcription.
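Here is a minimal sketch of that process on a toy transcript. It removes an intron and joins the flanking exons, and it checks the intron against the canonical rule that introns usually begin with GU (the donor site) and end with AG (the acceptor site). The sequences and function names are made up for illustration, and real splice site recognition depends on far more context than two bases at each end.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as half-open (start, end) intervals and join the exons."""
    exons, cursor = [], 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[cursor:start])
        cursor = end
    exons.append(pre_mrna[cursor:])
    return "".join(exons)


def has_canonical_sites(pre_mrna: str, intron: tuple[int, int]) -> bool:
    """True if the intron starts with GU (donor) and ends with AG (acceptor)."""
    start, end = intron
    return pre_mrna[start:start + 2] == "GU" and pre_mrna[end - 2:end] == "AG"


# Toy transcript: exon, intron, exon.
exon1, intron_seq, exon2 = "AUGGCC", "GUAAGUCCUUUCAG", "GAAUAA"
pre_mrna = exon1 + intron_seq + exon2
intron = (len(exon1), len(exon1) + len(intron_seq))

mature = splice(pre_mrna, [intron])
intact = has_canonical_sites(pre_mrna, intron)

# A single substitution at the first base of the donor site (G -> A).
mutant = pre_mrna[:intron[0]] + "A" + pre_mrna[intron[0] + 1:]
disrupted = not has_canonical_sites(mutant, intron)
```

One changed base at the donor site is enough to break the rule, which hints at why a mutation sitting in noncoding sequence can still change the final protein.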

Illumina’s SpliceAI, a 32-layer deep neural network, predicts the presence or absence of splice donors and acceptors in the pre-mRNA transcript in order to identify sites of mutations. This allows the model to use long-range sequence information to boost predictions from the conventional 57% to 95% accuracy. Information collected from SpliceAI can also be used to predict the influence of genetic variation on those sites, which can be critical in the model’s own learning process.

Illumina’s general pipeline for identifying splicing events in real time.

The model reaps these benefits by using real-time predictions to work with rarer genetic diseases rather than the more general and well-understood ones. It was even able to discover and understand a new connection between disruptions in splice donor sequences and the loss of acceptor sites. However, more emphasis was needed on finding these relations in the data collected from splicing mutations through multigenic observations, looking at the whole rather than its parts.

DeepSEA, a hierarchical CNN trained on genomics features, was developed to learn dependencies on different features like hypersensitive sites, transcription factor binding sites, markers, and genetic variation in order to make more conclusive predictions on splicing mutations and identify their source. This same technology was applied to autism spectrum disorder and was able to reveal certain candidates for noncoding mutations.
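Sequence CNNs like this do not read letters directly. A common pattern, sketched below under my own assumptions, is to one-hot encode a window of DNA around a variant twice, once with the reference base and once with the alternate base, and then compare the model's predictions for the two inputs. The sketch only covers the encoding step; the window size, the function names, and the example sequence are hypothetical.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot(seq: str) -> list[list[int]]:
    """Encode DNA as rows of [A, C, G, T]; unknown bases such as N stay all zero."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in BASE_INDEX:
            row[BASE_INDEX[base]] = 1
        rows.append(row)
    return rows


def variant_windows(reference: str, position: int, alt_base: str, flank: int):
    """Return one-hot (reference, alternate) windows centred on a 0-based position."""
    start = max(0, position - flank)
    end = min(len(reference), position + flank + 1)
    ref_window = reference[start:end]
    alt_window = reference[start:position] + alt_base + reference[position + 1:end]
    return one_hot(ref_window), one_hot(alt_window)


reference = "ACGTACGTAC"
ref_input, alt_input = variant_windows(reference, position=5, alt_base="T", flank=2)

# Rows where the two encodings differ; only the variant position should change.
changed_rows = [i for i, (r, a) in enumerate(zip(ref_input, alt_input)) if r != a]
```

Feeding both windows through the same trained network and looking at how much the predicted features shift gives a per-variant effect score, without needing a labelled example of that exact variant.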

Models like these are extremely useful because they can act as the base that other ideas build on. Transfer learning especially is critical in diagnosing rare genetic diseases because it can provide support for ensemble learning and repurpose powerful algorithms for different use cases. For example, researchers at the Institute of Bioinformatics in Brussels developed the Variant Combination Pathogenicity Predictor (VarCoPP) to classify certain gene pairs as pathogenic or neutral and identify splicing mutations using random forests (RF). It was built from 500 RF predictors and draws on the 1000 Genomes Project and DeepSEA, which allowed it to attach 95–99% confidence labels to its predictions (single variant pathogenicity scores).

Using multigenic processes and transfer learning can be especially important for diagnosing rare genetic disorders and learning from them. Transfer learning paired with unsupervised algorithms can drive progress in making these algorithms far more scalable and turn them into serious contenders against the current industry standards.

Phenotype Features Mapping with Genotype Info.

Phenotype (physical features) and their corresponding genetic information

Our physical traits, or phenotypes, are the physical expression of our genetic information, or our genotypes. Hence, the correlation between these 2 concepts is critical in diagnosing rare genetic disorders, as it lets us evaluate genotypic variations at the phenotypic level. This allows us to determine how variations in genes correspond to their physical symptoms, which can also leverage AI more fruitfully, turning genetic variation detection into, once again, an image classification task.

CNNs: Mapping Phenotypes to a Patient’s Corresponding Genotypes

One of the biggest issues with rare genetic disorders is that dysmorphologists are forced to work with small and misclassified datasets with only up to 105 true data points. Although single nucleotide information on the structure of mutations can help separate pathogenic motifs from others, more data is still required to make a conclusive prediction. This is where phenotypic information can be utilized to narrow the scope of variant calling and guide the initial diagnosis to make it far more effective and easy to work with a smaller dataset.

By using CNNs, researchers can map these features to train classifiers for filtering out certain base pairs for DNA permutations and even identifying features that may be more important for classifying certain sequences correctly. This process is also not plagued by the same problems as genotypic information; the current phenotype ontology database lists 1007 distinct phenotype terms associated with images of physical abnormalities. These images are also associated with up to 4526 rare genetic disorders and 2142 genes. Knowledge of genetic information paired with the vast array of phenotypes and quantifiable measurements can help us draw clear connections between countless features.

The Heterogeneous Association Network for Rare Diseases (HANRD) implements a phenotype-driven rare genetic disorder prioritization system built off of the wealth of electronic health data on these diseases. The network is the standard for finding and using associations between phenotypes, genotypes, and rare diseases. The prioritization system uses rule-based learning to make probabilistic associations, and when used to predict the type of dysplasia a patient had, it outperformed a specialist’s medical diagnosis and five other deep learning models by more than 52%.
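To give a feel for phenotype-driven prioritization, the sketch below ranks candidate diseases by how well a patient's observed phenotypes match weighted disease-phenotype associations. This is a plain weighted-overlap score used as a stand-in, not HANRD's actual inference, and the diseases, phenotypes, and weights are all invented.

```python
# Hypothetical association weights between diseases and phenotypes (0 to 1).
ASSOCIATIONS = {
    "disease_A": {"short stature": 0.9, "scoliosis": 0.6, "hearing loss": 0.2},
    "disease_B": {"short stature": 0.4, "cleft palate": 0.8},
    "disease_C": {"hearing loss": 0.9, "scoliosis": 0.1},
}


def rank_diseases(observed: set[str], associations: dict[str, dict[str, float]]):
    """Rank diseases by the share of their association weight the patient matches."""
    scores = {}
    for disease, weights in associations.items():
        matched = sum(weight for phenotype, weight in weights.items() if phenotype in observed)
        scores[disease] = matched / sum(weights.values())
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


observed = {"short stature", "scoliosis"}
ranking = rank_diseases(observed, ASSOCIATIONS)
top_candidate = ranking[0][0]
```

The ranked list is what narrows the search: instead of checking every gene, variant calling can start with the genes tied to the top few candidates.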

HANRD’s framework of the inferred association types it makes between the genotypes, phenotypes, and diseases. Dotted lines are inferred and weights are the edged lines.

Survival CNNs differ from traditional CNNs in that they relate the expected duration of a certain phenotypic feature to the severity of the disease. They use histological features for classifying somatic mutations (mutations a cell acquires during life rather than inherits) without being explicitly trained to predict certain genetic abnormalities, allowing these models to make novel connections from the process of prognosis to determination. These systems suggest that CNNs may be capable of predicting genetic mutations that are specific to the individual on the basis of phenotype information through pictures. CNNs can be used not only for determination, but also to predict the genotype-phenotype correlation in order to forecast the onset of rare diseases using risk score models.

Imaging Integration with Diagnosis Decision Support Systems

Throughout the global medical system, electronic health records are leveraged with Diagnosis Decision Support Systems (DDSS) to make informed decisions on prognosis and early detection from anywhere across the world. This same infrastructure is critical if AI is to make a significant impact in rare genetic disorder identification and as of now, there are significant strides being made to integrate these systems.

By using transfer learning, DeepGestalt, a CNN-based facial image analysis algorithm, outperformed dysmorphologists in phenotype diagnosis by transferring what it learned from general face recognition to the rare genetic syndrome domain. The system has been used to accurately diagnose rare syndromes including Cornelia de Lange syndrome, Emanuel syndrome, and Pallister-Killian syndrome.

DeepGestalt’s CNN Process for identifying rare genetic syndromes and their association with diseases, checked by using a similarity score with other samples.
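A similarity score of the kind mentioned in the caption above can be sketched with cosine similarity between feature vectors. The example assumes a network has already turned each face into a short embedding; the three-dimensional vectors, the syndrome labels, and the function names are placeholders of mine, not anything produced by DeepGestalt.

```python
from math import sqrt


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_syndromes(patient: list[float], references: dict[str, list[float]]):
    """Rank reference syndromes by similarity to the patient's embedding."""
    scores = {name: cosine_similarity(patient, vec) for name, vec in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference embeddings, one per syndrome.
references = {
    "syndrome_X": [0.9, 0.1, 0.2],
    "syndrome_Y": [0.1, 0.8, 0.3],
    "syndrome_Z": [0.2, 0.2, 0.9],
}
patient = [0.8, 0.2, 0.1]

ranked = rank_syndromes(patient, references)
best_match, best_score = ranked[0]
```

Returning a ranked list rather than a single answer fits how a clinician would use such a tool: as a shortlist of syndromes worth confirming with genetic testing.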

Other tools like PEDIA are combining DeepGestalt’s deep learning approach with supervised learning to make more accurate predictions on disease-causing genes by utilizing IR or UV testing procedures. This process, coupled with SVMs, has already been used with facial images for acromegaly, Pick disease, ALS, and hereditary hemorrhagic telangiectasia (HHT).

AI can be integrated all across the DDSS pipeline, including image recognition and mining, disease ranking and prediction, and relation identification, enabling researchers and doctors to work with a deep understanding of rare diseases and use this knowledge to speed up the delivery of targeted therapeutics for patients all across the world.

Conclusion: The drawbacks and current limitations

AI can enable people all across the world suffering from rare genetic disorders to receive care that is orders of magnitude higher in accuracy, quality, and personalization than other methods. The complexity and capability of these systems come with problems of their own, however, which will need to be solved before we can take this technology to the next frontier.

Today, privacy is one of the largest concerns when it comes to our passwords, numbers, names, and soon our genetic information. The ideas of transfer learning and DDSS integration raise countless regulatory and ethical issues as to how transparent and general this data should be. The FDA and policy makers will have to develop new regulations covering best practices for rare genetic prognosis, the fairness and accessibility of patient data across the scientific community, and the scope within which this technology can be used.

AI bias is another controversial issue that many of the computer vision algorithms listed above struggle with. Bias can not only obscure important knowledge on these diseases, but could also lead a model to base incorrect predictions on skin colour alone. DeepGestalt is the embodiment of this problem, as it displayed poor accuracy in identifying Down syndrome in individuals of African ethnicity compared to Caucasians (36.8% vs. 80% accuracy). After a diverse array of examples of different races was included, the accuracy reached, in general, 94.7%. While the fix here was simple, addressing underrepresentation in training data is not always easy, as these models work with distinct features for every patient.
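Gaps like this only show up if accuracy is reported per group instead of as one overall number. Here is a minimal sketch of that check. The records are fabricated placeholders, not DeepGestalt results, and the group names are deliberately generic.

```python
from collections import defaultdict


def accuracy_by_group(records: list[tuple[str, int, int]]) -> dict[str, float]:
    """Compute accuracy separately for each group.

    Each record is (group, true_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in records:
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Placeholder evaluation records: (group, true label, predicted label).
records = [
    ("group_1", 1, 1), ("group_1", 1, 1), ("group_1", 0, 0), ("group_1", 1, 1), ("group_1", 1, 0),
    ("group_2", 1, 0), ("group_2", 1, 0), ("group_2", 0, 0), ("group_2", 1, 1), ("group_2", 1, 0),
]

per_group = accuracy_by_group(records)
overall = sum(int(truth == pred) for _, truth, pred in records) / len(records)
gap = max(per_group.values()) - min(per_group.values())
```

In this made-up example the overall accuracy looks passable while one group does far worse than the other, which is the situation a single headline number would hide.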

The final concern is computational power. AI systems can require petabytes of data transfer per second depending on the complexity of the problem and its sensitivity. With a topic as broad as rare genetic disorders, this can be especially challenging when the argument for AI diagnostics is to reduce cost. Sampling and calling on 30 entire genome examples can take only minutes, but at that speed it can cost well over the current price for testing the same sample.

AI systems have surpassed our current procedures and methods for clinical genetic diagnostics. With the large growth in deep learning and the greater availability of medical images and genetic information, the technology is becoming ever more promising and has driven substantial progress in sequencing and variant identification. It has the ability not only to work with the medical system, but also to act as an independent and dependable alternative to expensive next generation sequencing. Generalization in the field is driving progress in research as AI continues to produce novel conclusions and relationships, but there is still room to expand from phenotypic diagnosis to a more genotypic and predictive approach.

Soon, we may see a transformation in the field of genetics, one that raises countless questions but proves that artificial intelligence is the next frontier for our journey into the biological unknown.