Stability AI, the venture-backed startup behind the text-to-image AI system Stable Diffusion, is funding a broad effort to apply AI to the frontiers of biotech. Called OpenBioML, the endeavor’s first projects will focus on machine learning-based approaches to DNA sequencing, protein folding and computational biochemistry.
The company’s founders describe OpenBioML as an “open research lab” that aims to explore the intersection of AI and biology in a setting where students, professionals and researchers can participate and collaborate, according to Stability AI CEO Emad Mostaque.
“OpenBioML is one of the independent research communities that Stability supports,” Mostaque told londonbusinessblog.com in an email interview. “Stability wants to develop and democratize AI, and through OpenBioML we see an opportunity to advance the state of the art in science, health and medicine.”
Given the controversy surrounding Stable Diffusion — Stability AI’s AI system that generates art from text descriptions, similar to OpenAI’s DALL-E 2 — you might understandably be wary of Stability AI’s first venture into healthcare. The startup has taken a laissez-faire approach to governance, allowing developers to use the system however they want, including for deepfakes and celebrity pornography.
Stability AI’s ethically questionable decisions so far aside, machine learning in medicine is a minefield. While the technology has been successfully applied to diagnose conditions such as skin and eye diseases, research has shown that algorithms can develop biases that lead to poorer care for some patients. An April 2021 study, for example, found that statistical models used to predict suicide risk in psychiatric patients performed well for white and Asian patients, but poorly for Black patients.
Wisely, OpenBioML is starting on safer ground. Its first projects are:
- BioLM, which seeks to apply natural language processing (NLP) techniques to the fields of computational biology and chemistry
- DNA-Diffusion, which aims to develop AI that can generate DNA sequences from text prompts
- LibreFold, which aims to increase access to AI protein structure prediction systems similar to DeepMind’s AlphaFold
Each project is led by independent researchers, but Stability AI provides support in the form of access to its AWS-hosted cluster of more than 5,000 Nvidia A100 GPUs to train the AI systems. According to Niccolò Zanichelli, a computer science student at the University of Parma and one of the lead researchers at OpenBioML, this will be enough processing power and storage to eventually train up to 10 different AlphaFold 2-like systems in parallel.
“A lot of computational biology research is already leading to open source releases. However, much of it happens at the level of a single lab and is therefore usually limited by insufficient computing power,” Zanichelli told londonbusinessblog.com via email. “We want to change this by encouraging large-scale collaborations and, thanks to the support of Stability AI, supporting those collaborations with resources only the largest industrial labs have access to.”
Generating DNA Sequences
Of OpenBioML’s ongoing projects, DNA-Diffusion, led by pathology professor Luca Pinello’s lab at Massachusetts General Hospital & Harvard Medical School, is arguably the most ambitious. The goal is to use generative AI systems to learn and apply the rules of “regulatory” sequences of DNA, or segments of nucleic acid molecules that affect the expression of specific genes in an organism. Many diseases and conditions are the result of misregulated genes, but science has not yet discovered a reliable process to identify these regulatory sequences, let alone alter them.
DNA-Diffusion proposes to use a type of AI system known as a diffusion model to generate cell-type-specific regulatory DNA sequences. Diffusion models, which underlie image generators such as Stable Diffusion and OpenAI’s DALL-E 2, create new data (e.g., DNA sequences) by learning how to destroy and recover many existing data samples. As they are fed the samples, the models get better at recovering the data they had previously destroyed, which is what lets them generate new works.
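To make the "destroy and recover" idea concrete, here is a minimal sketch of only the "destroy" half, applied to a one-hot encoded DNA string. It is an illustration under stated assumptions, not the DNA-Diffusion project's method: the linear noise schedule, the function names (`one_hot`, `linear_alpha_bar`, `add_noise`, `decode`) and the example sequence are all hypothetical choices made for this example. In a full diffusion model, a neural network would be trained to predict the added noise so that the process can be run in reverse; that part is not shown.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, alpha_bar_t, rng):
    """Forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    xt = [[signal * v + noise_scale * e for v, e in zip(row, nrow)]
          for row, nrow in zip(x0, noise)]
    return xt, noise

def decode(x):
    """Read back the highest-scoring base at each position (argmax per row)."""
    return "".join(BASES[max(range(4), key=lambda i: row[i])] for row in x)

# Demo on a made-up sequence: little noise at the first step, mostly noise at the last.
rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)
x0 = one_hot("ACGTACGT")
early, _ = add_noise(x0, alpha_bar[0], rng)
late, _ = add_noise(x0, alpha_bar[-1], rng)
print("early step:", decode(early))  # noise is tiny here, so the input is still readable
print("late step: ", decode(late))   # signal is almost gone by the final step
```

The closed-form noising step is a common design choice because it lets training jump straight to any noise level without simulating every intermediate step.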
“Diffusion has achieved widespread success in multimodal generative models and is now beginning to be applied to computational biology, for example to generate new protein structures,” Zanichelli said. “With DNA-Diffusion, we are now investigating its application to genomic sequences.”
If all goes according to plan, the DNA-Diffusion project will produce a diffusion model that can generate regulatory DNA sequences from text instructions such as “A sequence that will activate a gene to its maximum expression level in cell type X” and “A sequence that activates a gene in the liver and heart, but not in the brain.” Such a model could also help interpret the components of regulatory sequences, Zanichelli says, improving the scientific community’s understanding of the role of regulatory sequences in various diseases.
It’s worth noting that this is largely theoretical. Although preliminary research on applying diffusion to protein folding seems promising, it’s still very early, Zanichelli admits, hence the push to get the wider AI community involved.
Predicting Protein Structures
OpenBioML’s LibreFold, while smaller in scope, is more likely to pay off in the near term. The project aims to gain a better understanding of machine learning systems that predict protein structures, as well as ways to improve them.
As my colleague Devin Coldewey pointed out in his piece on DeepMind’s work on AlphaFold 2, AI systems that accurately predict the shape of proteins are relatively new to the scene, but transformative in terms of their potential. Proteins comprise sequences of amino acids that fold into shapes to accomplish various tasks in living organisms. The process of determining what shape an amino acid sequence will create was once a tedious, error-prone undertaking. AI systems like AlphaFold 2 have changed that; thanks to them, more than 98% of the protein structures in the human body are known to science today, as well as hundreds of thousands of other structures in organisms such as E. coli and yeast.
However, few groups have the technical expertise and resources needed to develop this type of AI. DeepMind spent days training AlphaFold 2 on tensor processing units (TPUs), Google’s costly AI accelerator hardware. And amino acid sequence training datasets are often proprietary or released under non-commercial licenses.
“This is unfortunate because when you look at what the community has been able to build on top of the AlphaFold 2 checkpoint released by DeepMind, it’s just incredible,” Zanichelli said, referring to the trained AlphaFold 2 model that DeepMind released last year. “For example, just days after its release, Minkyung Baek, a professor at Seoul National University, reported a trick on Twitter that allowed the model to predict quaternary structures, something that few, if anyone, expected the model to be capable of. There are many more examples like this, so who knows what the wider scientific community could build if it had the ability to train entirely new AlphaFold-like protein structure prediction methods?”
Building on the work of RoseTTAFold and OpenFold, two ongoing community efforts to replicate AlphaFold 2, LibreFold will enable “large-scale” experiments with different protein folding prediction systems. Led by researchers from University College London, Harvard and Stockholm, LibreFold will focus on gaining a better understanding of what the systems can accomplish and why, Zanichelli said.
“LibreFold is essentially a project for the community, by the community. The same goes for the release of both model checkpoints and datasets, as it could take just a month or two for us to release the first deliverables, or it could take considerably longer,” he said. “That said, my intuition is that the former is more likely.”
Applying NLP to Biochemistry
On a longer time horizon is OpenBioML’s BioLM project, which has the vaguer mission of “applying language modeling techniques derived from NLP to biochemical sequences.” In collaboration with EleutherAI, a research group that has released several open source text-generating models, BioLM hopes to train and publish novel “biochemical language models” for a range of tasks, including protein sequence generation.
Zanichelli points to Salesforce’s ProGen as an example of the type of work that BioLM could undertake. ProGen treats amino acid sequences like words in a sentence. Trained on a dataset of more than 280 million protein sequences and associated metadata, the model predicts the next amino acids from the previous ones, like a language model that predicts the end of a sentence from its beginning.
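The "predict the next amino acid from the previous ones" idea can be sketched with a toy stand-in. The example below is not ProGen, which is a large neural language model; it swaps in a simple bigram counter that only looks at the single preceding residue. The function names (`train_bigram`, `predict_next`, `complete`) and the three short training sequences are hypothetical and made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each amino acid follows each other amino acid."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prefix):
    """Return the most frequent successor of the prefix's last residue, or None."""
    if not prefix or prefix[-1] not in counts:
        return None
    return counts[prefix[-1]].most_common(1)[0][0]

def complete(counts, prefix, length):
    """Greedily extend the prefix one residue at a time up to the given length."""
    out = prefix
    while len(out) < length:
        nxt = predict_next(counts, out)
        if nxt is None:  # no observed successor, so stop early
            break
        out += nxt
    return out

# Made-up training data in one-letter amino acid codes.
toy_sequences = ["MKT", "MKT", "MKV"]
model = train_bigram(toy_sequences)
print(predict_next(model, "MK"))  # "T" follows "K" twice, "V" only once
print(complete(model, "M", 5))    # stops once no successor has been observed
```

A real protein language model conditions on the whole preceding sequence (and, in ProGen's case, on metadata) rather than on one residue, but the training signal is the same kind of next-token prediction.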
Earlier this year, Nvidia released a language model, MegaMolBART, that was trained on a dataset of millions of molecules to search for potential drug targets and predict chemical reactions. Meta also recently trained an NLP model called ESM-2 on sequences of proteins, an approach the company claims allowed it to predict structures for more than 600 million proteins in just two weeks.
While OpenBioML’s interests are broad (and expanding), Mostaque says they are united by a desire to “maximize the positive potential of machine learning and AI in biology,” following the tradition of open research in science and medicine.
“We want to enable researchers to gain more control over their experimental pipeline for active learning or model validation,” continues Mostaque. “We also want to push the state of the art with increasingly generalized biotech models, as opposed to the specialized architectures and learning objectives that currently characterize most of computational biology.”
But — as would be expected from a VC-backed startup that recently raised more than $100 million — Stability AI doesn’t see OpenBioML as a purely philanthropic endeavor. Mostaque says the company is open to exploring the commercialization of OpenBioML’s technology “when it’s advanced enough, secure enough, and the time is right.”