More than 100 researchers call for safeguards on high-risk biological datasets to prevent AI misuse that could help create deadly pathogens.
Artificial intelligence (AI) models for biology rely heavily on large volumes of biological data, including genetic sequences and pathogen characteristics. But should this information be universally accessible, and how can its legitimate use be ensured?
More than 100 researchers have warned that unrestricted access to certain biological datasets could enable AI systems to help design or enhance dangerous viruses, calling for stronger safeguards to prevent misuse.
In an open letter, researchers from leading institutions, including Johns Hopkins University, the University of Oxford, Fordham University, and Stanford University, argue that while open-access scientific data has accelerated discovery, a small subset of new biological data poses biosecurity risks if misused.
“The stakes of biological data governance are high, as AI models could help create severe biological threats,” the authors wrote.
AI models used in biology can predict mutations, identify patterns, and generate more transmissible variants of pandemic pathogens.
The authors describe this as a “capability of concern,” which could accelerate and simplify the creation of transmissible biological pathogens that can lead to human pandemics, or similar events in animals, plants, or the environment.
Biological data should generally be openly available, the researchers noted, but “concerning pathogen data” requires stronger security checks.
“Our focus is on defining and governing the most concerning datasets before they are generally available to AI developers,” they wrote in the paper, proposing a new framework to regulate access.
“In a time dominated by open-weight biological AI models developed across the globe, limiting access to sensitive pathogen data to legitimate researchers might be one of the most promising avenues for risk reduction,” said letter co-author Moritz Hanke of Johns Hopkins University.
What developers are doing
Currently, no universal framework regulates these datasets. While some developers voluntarily exclude high-risk data, researchers argue that clear and consistent rules should apply to all.
Developers of two leading biological AI models have already withheld certain viral sequences from their training data: Evo, created by researchers at the Arc Institute, Stanford, and TogetherAI, and ESM3, from EvolutionaryScale.
In February 2025, the Evo 2 team announced that they had excluded pathogens that infect humans and other complex organisms from their datasets, citing ethical and safety risks and the need to “preempt the use of Evo for the development of bioweapons”.
Evo 2 is an open-source AI model for biology that can predict the effects of DNA mutations, design new genomes, and uncover patterns in the genetic code.
“Right now, there's no expert-backed guidance on which data poses meaningful risks, leaving some frontier developers to make their best guess and voluntarily exclude viral data from training,” letter co-author Jassi Panu wrote on LinkedIn.
Different types of risky data
The authors note that the proposed framework applies only to a small fraction of biological datasets.
It introduces a five-tier Biosecurity Data Level (BDL) system to categorise pathogen data, classifying it by risk level based on its potential to enable AI systems to learn general viral patterns and pose biological threats to both animals and humans. It includes the following levels (a minimal code sketch follows the list):
BDL-0: Everyday biology data. It should have no restrictions and can be shared freely.
BDL-1: Basic viral building blocks, such as genetic sequences. This level would not require strict security checks, but logins and access would be monitored.
BDL-2: Data on animal virus traits, such as the ability to jump between species or survive outside a host.
BDL-3: Data on human virus characteristics, such as transmissibility, symptoms, and vaccine resistance.
BDL-4: Data on enhanced human viruses, such as mutations that make the virus behind COVID-19 more contagious. This category would face the strictest restrictions.
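To make the tiering concrete, here is a minimal Python sketch of how a data provider might encode the five levels and map them to progressively stricter access controls. The tier descriptions follow the article; the specific safeguards (login, identity verification, case-by-case approval) are illustrative assumptions, not requirements set out in the letter.

```python
from enum import IntEnum

class BDL(IntEnum):
    """Biosecurity Data Levels as described above (comments paraphrase the article)."""
    BDL_0 = 0  # everyday biology data: no restrictions, freely shareable
    BDL_1 = 1  # basic viral building blocks (e.g. genetic sequences): monitored access
    BDL_2 = 2  # animal virus traits (host jumping, survival outside the host)
    BDL_3 = 3  # human virus characteristics (transmissibility, vaccine resistance)
    BDL_4 = 4  # enhanced human pathogens: strictest restrictions

# Hypothetical controls per tier; the letter does not prescribe these exact
# safeguards, they only illustrate how requirements could tighten with risk.
CONTROLS = {
    BDL.BDL_0: set(),
    BDL.BDL_1: {"login", "access_logging"},
    BDL.BDL_2: {"login", "access_logging", "identity_verification"},
    BDL.BDL_3: {"login", "access_logging", "identity_verification",
                "institutional_vetting"},
    BDL.BDL_4: {"login", "access_logging", "identity_verification",
                "institutional_vetting", "case_by_case_approval"},
}

def required_safeguards(level: BDL) -> set[str]:
    """Return the illustrative safeguards a provider might require at this level."""
    return CONTROLS[level]

print(sorted(required_safeguards(BDL.BDL_3)))
```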
Ensuring safe access
To ensure safe access, the letter calls for specific technical tools that would let data providers verify legitimate users and track misuse.
Proposed tools include watermarking (embedding hidden, unique identifiers in datasets so leaks can be traced), data provenance and audit logs that record access and changes with tamper-proof signatures, and behavioural biometrics that track the unique ways individual users interact with the data.
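As a rough illustration of what audit logs with tamper-proof signatures could look like, the Python sketch below chains each access record to the previous one by hash and signs it with a key held by the data provider, so a deleted or edited entry breaks verification. The field names, key handling, and dataset label are assumptions made for the example; the letter does not specify an implementation.

```python
import hashlib
import hmac
import json
import time

# Hypothetical provider-held signing key; in practice this would be managed
# securely (e.g. in an HSM), not hard-coded.
SECRET_KEY = b"provider-held-signing-key"

def append_entry(log: list[dict], user: str, dataset: str, action: str) -> None:
    """Append a hash-chained, HMAC-signed record of a data access event."""
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    record = {
        "timestamp": time.time(),
        "user": user,
        "dataset": dataset,
        "action": action,        # e.g. "download", "query", "modify"
        "prev_hash": prev_hash,  # chaining makes deletions and edits detectable
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["entry_hash"] = hashlib.sha256(payload).hexdigest()
    record["signature"] = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    log.append(record)

def verify(log: list[dict]) -> bool:
    """Recompute hashes and signatures; any altered or missing entry fails."""
    prev_hash = "0" * 64
    for entry in log:
        record = {k: v for k, v in entry.items() if k not in ("entry_hash", "signature")}
        if record["prev_hash"] != prev_hash:
            return False
        payload = json.dumps(record, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["entry_hash"]:
            return False
        expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, entry["signature"]):
            return False
        prev_hash = entry["entry_hash"]
    return True

log: list[dict] = []
append_entry(log, user="researcher@example.org", dataset="BDL-3/phenotypes", action="download")
print(verify(log))  # True unless the log has been tampered with
```

Watermarking and behavioural biometrics would sit alongside such logging, tying leaked copies and unusual usage patterns back to individual accounts.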
The researchers argue that striking the right balance between openness and necessary security restrictions on high-risk data will be essential as AI systems become more powerful and widely available.