About 10 years ago, Žiga Avsec was a PhD physics student who found himself taking a crash course in genomics via a university module on machine learning. He was soon working in a lab that studied rare diseases, on a project aiming to pin down the exact genetic mutation that caused an unusual mitochondrial disease.
This was, Avsec says, a “needle in a haystack” problem. There were millions of potential culprits lurking in the genetic code—DNA mutations that could wreak havoc on a person’s biology. Of particular interest were so-called missense variants: single-letter changes to genetic code that result in a different amino acid being made within a protein. Amino acids are the building blocks of proteins, and proteins are the building blocks of everything else in the body, so even small changes can have large and far-reaching effects.
There are 71 million possible missense variants in the human genome, and the average person carries more than 9,000 of them. Most are harmless, but some have been implicated in genetic diseases such as sickle cell anemia and cystic fibrosis, as well as more complex conditions like type 2 diabetes, which may be caused by a combination of small genetic changes. Avsec started asking his colleagues: “How do we know which ones are actually dangerous?” The answer: “Well largely, we don’t.”
Of the 4 million missense variants that have been spotted in humans, only 2 percent have been categorized as either pathogenic or benign, through years of painstaking and expensive research. It can take months to study the effect of a single missense variant.
Today, Google DeepMind, where Avsec is now a staff research scientist, has released a tool that can rapidly accelerate that process. AlphaMissense is a machine learning model that can analyze missense variants and predict the likelihood of them causing a disease with 90 percent accuracy—better than existing tools.
It’s built on AlphaFold, DeepMind’s groundbreaking model that predicted the structures of hundreds of millions of proteins from their amino acid composition, but it doesn’t work in the same way. Instead of making predictions about the structure of a protein, AlphaMissense operates more like a large language model such as OpenAI’s ChatGPT.
It has been trained on the language of human (and primate) biology, so it knows what normal sequences of amino acids in proteins should look like. When it’s presented with a sequence gone awry, it can take note, as with an incongruous word in a sentence. “It’s a language model but trained on protein sequences,” says Jun Cheng, who, with Avsec, is co-lead author of a paper published today in Science that announces AlphaMissense to the world. “If we substitute a word from an English sentence, a person who is familiar with English can immediately see whether these substitutions will change the meaning of the sentence or not.”
Pushmeet Kohli, DeepMind’s vice president of research, uses the analogy of a recipe book. If AlphaFold was concerned with exactly how ingredients might bind together, AlphaMissense predicts what might happen if you use the wrong ingredient entirely.
The model has assigned a “pathogenicity score” of between 0 and 1 for each of the 71 million possible missense variants, based on what it knows about the effects of other closely related mutations—the higher the score, the more likely a particular mutation is to cause or be associated with disease. DeepMind researchers worked with Genomics England, a government body that studies the growing pool of genetic data collected by the UK’s National Health Service, to verify the model’s predictions against real-world studies on already-known missense variants. The paper claims 90 percent accuracy for AlphaMissense, with 89 percent of variants classified.
Researchers who are trying to find out whether a particular missense variant may be behind a disease can now look it up in the table and find its predicted pathogenicity score. The hope is that, just as AlphaFold is boosting everything from drug discovery to cancer treatment, AlphaMissense will help researchers in multiple fields accelerate research into genetic variants—allowing them to diagnose diseases and find new treatments faster. “I hope that these predictions will give us an extra insight into which variants cause disease and have other applications in genomics,” says Avsec.
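As a rough illustration of what such a lookup might look like in practice, here is a minimal Python sketch. The key format, the protein names, and the scores are all invented for illustration and are not taken from the published AlphaMissense table; only the 0-to-1 score range comes from the description above.

```python
from typing import Optional

# Hypothetical excerpt of a score table: (protein, substitution) -> score in [0, 1].
# These entries and values are made up for illustration only.
SCORES = {
    ("PROT_A", "R44W"): 0.91,
    ("PROT_A", "L12V"): 0.08,
    ("PROT_B", "G7D"): 0.47,
}


def lookup_score(protein: str, substitution: str) -> Optional[float]:
    """Return the predicted pathogenicity score, or None if the variant is not listed."""
    return SCORES.get((protein, substitution))


if __name__ == "__main__":
    print(lookup_score("PROT_A", "R44W"))
    print(lookup_score("PROT_C", "A1T"))
```

A higher returned value would mean the variant is predicted to be more likely to be associated with disease; a missing entry simply means the sketch's toy table has no prediction for it.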
The researchers stress that the predictions should not be used on their own, but only to guide real-world research: AlphaMissense could help researchers prioritize the slow process of matching genetic mutations to diseases by quickly ruling out unlikely culprits. It could also help improve our understanding of overlooked areas of our genetic code: The model includes an “essentiality” metric for each gene—a measure of how vital it is to human survival. (The function of roughly a fifth of human genes isn’t clear, despite many appearing to be essential.)
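The prioritization idea can be sketched in a few lines of Python: drop the candidates whose predicted score is low, then rank the rest for follow-up in the lab. The variant names, the scores, and the 0.2 cutoff below are hypothetical choices made for this example, not values recommended by the researchers.

```python
from typing import Dict, List


def prioritize(candidates: Dict[str, float], cutoff: float = 0.2) -> List[str]:
    """Rule out variants scoring below `cutoff`, then rank the rest from high to low.

    `candidates` maps a variant name to its predicted pathogenicity score (0 to 1).
    The default cutoff is an arbitrary, illustrative value.
    """
    kept = {name: score for name, score in candidates.items() if score >= cutoff}
    return sorted(kept, key=kept.get, reverse=True)


if __name__ == "__main__":
    # Made-up candidate variants from a hypothetical patient.
    patient_variants = {"v1": 0.05, "v2": 0.93, "v3": 0.40, "v4": 0.15}
    print(prioritize(patient_variants))
```

The output is a shortlist for further study rather than a diagnosis, in keeping with the researchers' point that the predictions should only guide real-world research.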
AlphaMissense isn’t quite in the same “jaw-dropping” category as AlphaFold, says Ewan Birney, deputy director general of the European Molecular Biology Laboratory and joint director of the laboratory’s European Bioinformatics Institute, which has worked closely with DeepMind in the past but was not involved in this research. “As soon as AlphaFold came out, everybody knew that it should be possible to interpret mutations that change proteins using this framework,” he says.
Birney sees a particular application in helping doctors quickly diagnose children with suspected genetic conditions. “We’ve always known that missense mutations must be responsible for some of the cases which are undiagnosed, and this is a better way of ranking those cases.” He cites the RPE65 gene, mutations in which cause blindness unless treated with gene therapy injections into the retina. AlphaMissense could help doctors quickly rule out any other potential genetic mutations in a patient’s DNA (there could be thousands) so that they can be sure they’re giving the right treatment.
Beyond untangling the effects of single-letter mutations, AlphaMissense demonstrates the potential of AI models in biology more broadly. Because it wasn’t specifically trained to solve the problem of missense variants, but more broadly on what proteins are found in biology, the applications of the model and others like it could reach far beyond single mutations to a better understanding of our whole genome and how it’s expressed—from the recipe book to the whole restaurant. “The basic trunk of the model is derived from AlphaFold,” says Kohli. “A lot of that intuition was, in some sense, inherited from AlphaFold, and we have been able to show that it generalizes to this sort of related but quite different task.”