The best AI in the world can’t save you if you don’t have the right data

While monumental progress has recently been made in artificial intelligence-based drug discovery efforts, recent press articles have criticized shortcomings in the field (1, 2). These reports cite discovery failures such as AI models suggesting therapeutic compounds that are impossible to make, as well as later-stage setbacks like BenevolentAI’s phase 2a trial failure (3). Fauna Bio is still extremely bullish on the future of AI in the drug discovery process when it is used strategically, and it’s worth taking a deeper look at what are the aspects of drug development that are most likely to be improved by AI and, even more critically, what are the types of unique and differentiated data that companies are using to train these algorithms. AI, at the end of the day, is only as good as the data it is trained on. And if those data do not increase the likelihood of success, that is not necessarily the fault of the AI.

Recent significant advances in our understanding of the human genome have come from our expanded ability to determine what aspects of our own DNA have been protected through 100’s of millions of years of evolution. This work from the Zoonomia consortium led by the Broad Institute was released in an issue of Science in April 2023. One of the critical takeaways from these studies was the finding that mutations in human genes that have remained largely unchanged over time (ie. are more highly constrained) are much more likely to play a role in human disease. This concept makes intuitive sense - If parts of the genome haven’t changed in a 100 of million years, they are doing something important and are more likely to contribute to disease.

Sullivan, et al. (A) Evolutionary constraint calculated across 240 mammals (B) Pathogenic ClinVar variants are more constrained across mammals than benign variants. Science. 2023

To solve complex human disease challenges, it is critical to rapidly identify the most impactful biological nodes and do everything possible to improve our signal to noise ratio around complex genomics data. AI advances in chemical discovery and optimization as well as optimization of other modalities (antibody development, siRNA), only matter if you have the right disease target. Sequencing more of the same kinds of data expecting different answers is not how we get there. At the end of the day, humans are just too similar. Despite all our differences, we are 1 species, and examining 1 species is not enough to figure out what parts of our genome have been preserved over 100’s of millions of years of evolution and are thus more likely to be driving disease.

At Fauna, we can directly see the impact of integrating data from disease resistant animals and diseased humans in our platform called ConvergenceTM. Our platform is able to accurately predict new drug targets over a range of conditions. We have been training our own graph neural network (GNN) on close to 1 billion model parameters to glean critical insights from our own knowledge graph (Centaur) and can directly measure the impact of integrating data from disease resistance in animals to the many other human clinical and genomics data being used for drug discovery. The addition of this (relatively) small dataset demonstrates how highly conserved genes function in species that are protected from disease, and improves our performance (AUC) from 0.85 to 0.97 for gene-disease link prediction.

Fauna knowledge graph without animal disease resistance data

Fauna knowledge graph with animal disease resistance data

Correctly implementing AI will be a boon to the field of drug discovery, but it is equally important to be able to rapidly test predictions and modify the training data accordingly. Even the best artificial intelligence is unlikely to make the best predictions on its first try. This is why Fauna Bio has put equal energy into wet lab efforts, so that we are able to rapidly test genetic targets and small molecule compounds in cell assays followed by in-vivo modeling. Our cross-disciplinary team then feeds these data back into the knowledge graph so that our GNN can be constantly refining predictions based on the latest data.

Fauna Bio’s success is a perfect story of why now. The drop in sequencing costs and ability to rapidly sequence new species makes our science possible. The first human genome cost about $1 Billion to sequence and took more than a decade (finished in 2003). You can re-sequence a human for less than $1,000 today, and sequencing a completely novel new species still only costs about $10,000 and takes a few weeks. The improvements in sequencing technology yielded a massive increase in availability of genomes, now > 450 mammals, and have opened up a completely new universe for drug discovery. Adding to this, critical advances in AI technology enables our scientists to analyze and interpret that data on a scale not previously imaginable. This perfect blend of novel, impactful data and advanced AI will unlock breakthroughs across a wide range of diseases and help humans live longer and healthier lives.

The Symbiosis of AI and Data: How the Right Data and Algorithms Co-create Revolutionary Hypotheses

Why 13-lined ground squirrels are an amazing model for obesity discovery

The Potential Power of Organoids: An Interview with Justin Conner