Millions of new proteins with predicted roles in defence against viruses have been identified in bacteria. It was thought that around 0.5% of the average bacterial genome had some involvement in immunity, but the real figure could be around 3 times higher! This discovery has been made by a team at the Institut Pasteur (Paris) who developed a package of AI tools to search for previously unidentified defence mechanisms across thousands of bacterial genomes.
Much like humans, who can get sick when infected by a virus, bacteria face pathogenic challenges of their own: phages. These are viruses that infect and replicate specifically inside bacterial cells, and their ubiquity has forced bacteria to develop an immune system of their own to defend against phage attack.
Bacterial cells have their own varied immune systems
These defence systems are highly diverse across bacterial species, with over 200 mechanisms already validated. Novel anti-phage defence systems are continually being uncovered, suggesting that many remain to be found. This is a tempting prospect, since multiple bacterial defence mechanisms have already been repurposed and have revolutionised the field of biotechnology, the most famous of which is CRISPR.
Can AI help in the identification of new immune mechanisms?
Defence mechanisms similar to those we already know about often share certain signatures within their DNA, which can be helpful in identification. Furthermore, in bacteria, functionally related genes tend to be physically close to each other within the genome, forming a cluster called an operon. Operons with roles in anti-phage defence further cluster into so-called ‘defence islands’, and this context can also help to identify previously unknown defensive systems. Even so, with so many diverse species of bacteria harbouring untapped potential, it could take decades to manually search thousands of diverse DNA sequences for new defence mechanisms.
For this reason, a team at the Institut Pasteur (Paris), led by Aude Bernheim, Ernest Mordret and Alexandre Hervé, asked the question: if we know some of the context, functions and features that are often associated with genes involved in defence, can we train an AI model to scan thousands of bacterial genomes to look for novel anti-phage defence systems?
The group developed a suite of AI tools to search for different cues within the genomes.
- Firstly, they relied on the notion of sequence¹ homology, which is the idea that proteins encoded by similar sequences tend to play a similar role, allowing the team to capture even distant similarities between known and unknown defensive proteins.
- They built a further model around the idea of ‘guilt by association’, which exploits the tendency of functionally related bacterial genes to cluster within genomes.
- Finally, they combined both ideas into one AI model called GeneCLRDF, which was able to identify 478,206 novel protein families predicted to have roles in antiviral defence from more than 32,000 bacterial genomes.
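The three ideas in the list above can be illustrated with a toy sketch. This is not the team's method: the article describes AI models, whereas the code below swaps in a much simpler stand-in (shared 3-letter fragments for sequence similarity, and the fraction of known defence genes among a gene's immediate neighbours for 'guilt by association'). All function names, the toy sequences, the window size and the equal weighting are assumptions made purely for illustration.

```python
from __future__ import annotations


def kmers(seq: str, k: int = 3) -> set[str]:
    """All overlapping fragments of length k in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard overlap of k-mer sets: a crude stand-in for homology."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def homology_score(protein: str, known_defence: list[str]) -> float:
    """Best similarity between a protein and any known defence protein."""
    return max((similarity(protein, ref) for ref in known_defence), default=0.0)


def neighbourhood_score(index: int, flags: list[bool], window: int = 1) -> float:
    """Fraction of genes within `window` positions that are known defence genes."""
    lo = max(0, index - window)
    hi = min(len(flags), index + window + 1)
    neighbours = [flags[j] for j in range(lo, hi) if j != index]
    return sum(neighbours) / len(neighbours) if neighbours else 0.0


def combined_score(homology: float, neighbourhood: float, weight: float = 0.5) -> float:
    """Weighted mix of the two cues (the equal weighting is an assumption)."""
    return weight * homology + (1 - weight) * neighbourhood


# Hypothetical toy genome: proteins in genomic order, with known defence genes flagged.
genome = ["MKTAYIAKQR", "GGSLLPWWNE", "MKTAYLAKQR", "PPPVVVHHHD", "CCCDDDEEEF", "FFFGGGHHHI"]
flags = [True, False, True, False, False, False]
known = [seq for seq, is_defence in zip(genome, flags) if is_defence]

for i, seq in enumerate(genome):
    if not flags[i]:
        h = homology_score(seq, known)
        n = neighbourhood_score(i, flags)
        print(i, seq, round(combined_score(h, n), 2))
```

In this toy genome, the gene at position 1 resembles no known defence protein but sits between two of them, so the neighbourhood cue alone lifts its score above that of the gene at position 4, which has neither cue in its favour.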
The AI models produced by the team reached a remarkable 99% precision, whilst minimising resource usage by training the model on a single graphics processing unit² (GPU) over just 3 days.
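For readers unfamiliar with the term, precision measures how many of a model's positive predictions are correct. The article does not describe how the figure was measured, so the formula below is the textbook definition rather than the team's evaluation protocol, with TP standing for true positives and FP for false positives:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}
\]
```

At 99% precision, about 99 of every 100 proteins flagged as defensive in the evaluation would be expected to be genuine. Precision by itself says nothing about how many real defence proteins the model may have missed.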
Newly identified immune proteins have strong anti-phage activity
Of course, it is well documented that any AI can make mistakes, so it was important to begin investigating these proteins to verify whether they are truly involved in bacterial immunity. The team experimentally validated, in Escherichia coli and Streptomyces albus, 12 previously unknown defensive systems that show strong anti-phage activity via diverse strategies. With so many hits to comb through, the group has made all its data freely available so that other scientists can join the effort. The data can be explored through a searchable, interactive tool, which can be found here and was made possible by infrastructure provided by the Institut Pasteur.
It was previously thought that around 0.5% of the average bacterial genome had some involvement in immunity, but this work has shown that the real figure could be around 3 times higher. Furthermore, over 85% of the protein families identified in this study had never been linked with immunity before, which is testament to the innovative approach taken here.
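As a rough illustration of what this shift means, take a hypothetical genome of 4,000 genes (an assumed round number, not a figure from the study), count by genes, and read 'around 3 times higher' as roughly three times the earlier estimate:

```latex
\[
0.5\% \times 4000 = 20 \text{ genes}, \qquad
3 \times 0.5\% = 1.5\%, \qquad
1.5\% \times 4000 = 60 \text{ genes}
\]
```

Under these assumptions, the number of genes with some involvement in immunity would rise from about 20 to about 60 in such a genome.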
Previous research into bacterial immunity has yielded multiple major discoveries which have changed the way we do molecular biology. With a total of 2.39 million newly identified bacterial immune proteins, who knows what treasures are waiting to be uncovered within this cache?
1. All organisms make a huge variety of different proteins with varying functions and structures. Proteins are made of chains of amino acids, which have different lengths and properties. The sequence of amino acids in each protein is encoded by DNA.
2. A GPU is a high-speed processor made for complex math and heavy workloads.
Source: Protein and genomic language models uncover the unexplored diversity of bacterial immunity, Science, April 2, 2026





