Scientists at the Universities of Manchester and Oxford have created an AI system that can detect and monitor new and worrying COVID-19 variants, and that could help track other infections in the future.
This innovative framework combines dimension reduction techniques with a new clustering algorithm called CLASSIX, developed by mathematicians at the University of Manchester. This allows for the rapid identification, from a vast amount of data, of groups of viral genomes that could pose a threat in the future.
The findings, published this week in the journal PNAS, could enhance conventional methods of tracking viral evolution, which currently require extensive manual input.
The virus that causes COVID-19, like many other RNA viruses, evolves quickly because of its high mutation rate and short generation time. This makes it challenging to identify problematic strains in advance.
There are currently almost 16 million sequences on the GISAID database (the Global Initiative on Sharing All Influenza Data), which provides access to genomic data on influenza viruses. Analysing the evolution and history of all COVID-19 genomes from this data is time-consuming and resource-intensive.
The new method automates these tasks, processing 5.7 million sequences in just one to two days on a standard laptop. This efficiency allows more researchers to identify concerning pathogen strains with minimal resources.
Professor Thomas House from the University of Manchester stated, “The vast amount of genetic data generated during the pandemic requires improvements in our analysis methods. Our approach aims to expedite this process, enabling experts to focus on other crucial developments.”
The AI system breaks down COVID-19 genetic sequences into smaller “words” and groups similar sequences based on their patterns using machine learning techniques. The clustering algorithm CLASSIX is less computationally demanding than traditional methods and provides clear explanations of the identified clusters.
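To make the idea of genetic "words" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the researchers' pipeline: the word length of 3, the cosine similarity measure, the 0.8 threshold, the helper names and the toy sequences are all hypothetical, and the simple greedy grouping merely stands in for the clustering step. It is not CLASSIX.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count every overlapping length-k "word" in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count profiles."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_groups(sequences: list[str], k: int = 3,
                  threshold: float = 0.8) -> list[list[int]]:
    """Put each sequence in the first group whose first member is similar enough.

    A deliberately simple stand-in for a clustering algorithm (assumption:
    this is NOT how CLASSIX works internally).
    """
    profiles = [kmer_counts(s, k) for s in sequences]
    groups: list[list[int]] = []
    for i, profile in enumerate(profiles):
        for group in groups:
            if cosine_similarity(profile, profiles[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([i])
    return groups


# Toy sequences invented for illustration; real genomes are far longer.
toy = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC", "TTTTGGGGCCCA"]
print(greedy_groups(toy))  # [[0, 1], [2, 3]]
```

Sequences that differ by a single letter share most of their words, so they land in the same group, while unrelated sequences share none and are kept apart.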
Professor Stefan Güttel added, “Our analysis demonstrates the potential of machine learning methods as an early warning system for emerging variants, without the need for complex phylogenetic analysis.”
Although phylogenetics remains important for understanding viral ancestry, machine learning methods can handle significantly more sequences at a lower computational cost.