Most safety precautions for AI tools can be bypassed within a few minutes, study finds

AI will forget its safety measures the longer a user talks to it, a new study found
By Anna Desmarais

AI systems ‘forget’ their safety measures the longer a user talks to them, making the tools more likely to give out harmful or inappropriate information, a new report has found.

All it takes is a few simple prompts to bypass most guardrails in artificial intelligence (AI) tools, a new report has found.

Technology company Cisco evaluated the large language models (LLMs) behind popular AI chatbots from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft to see how many questions it took for the models to divulge unsafe or criminal information.

The researchers did this in 499 conversations, using a technique called “multi-turn attacks”, in which nefarious users ask AI tools multiple questions to gradually bypass safety measures. Each conversation had between five and 10 interactions.
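
What such a multi-turn evaluation loop might look like is sketched below. This is an illustrative outline only, assuming a hypothetical send_message chat endpoint and a placeholder looks_unsafe judging step; Cisco’s actual test harness is not described in the report.

```python
# Minimal sketch of a multi-turn attack evaluation, loosely modelled on the
# methodology described above. send_message and looks_unsafe are hypothetical
# placeholders, not part of Cisco's published tooling.
from typing import Callable


def run_multi_turn_attack(
    prompts: list[str],                          # five to 10 escalating prompts
    send_message: Callable[[list[dict]], str],   # hypothetical chat endpoint
    looks_unsafe: Callable[[str], bool],         # placeholder harm classifier
) -> bool:
    """Return True if any reply in the conversation is judged unsafe."""
    history: list[dict] = []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        reply = send_message(history)            # model sees the full conversation
        history.append({"role": "assistant", "content": reply})
        if looks_unsafe(reply):
            return True                          # guardrail bypassed mid-conversation
    return False


def success_rate(results: list[bool]) -> float:
    """Share of conversations in which at least one unsafe reply was obtained."""
    return sum(results) / len(results) if results else 0.0
```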

They then compared the results to identify how likely each chatbot was to comply with requests for harmful or inappropriate information.

That could range from sharing private company data to facilitating the spread of misinformation.

On average, the researchers were able to get malicious information from 64 per cent of their conversations when they asked AI chatbots multiple questions, compared to just 13 per cent when they asked just one question.

Success rates ranged from about 26 per cent with Google’s Gemma to 93 per cent with Mistral’s Large Instruct model.

The findings indicate that multi-turn attacks could enable harmful content to spread widely or allow hackers to gain “unauthorised access” to a company’s sensitive information, Cisco said.

AI systems frequently fail to remember and apply their safety rules during longer conversations, the study said. That means attackers can slowly refine their queries and evade security measures.
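
One way to picture that failure, as a toy illustration rather than Cisco’s method: a guardrail that screens each user message in isolation can miss intent that only becomes apparent when the conversation is read as a whole. The blocklist and messages below are invented for the example.

```python
# Toy illustration (not Cisco's method): a per-message filter misses intent
# that is spread across several turns of a conversation.
BLOCKLIST = {"steal credentials"}


def per_turn_check(message: str) -> bool:
    """Flag a single message if it contains a blocked phrase."""
    return any(phrase in message.lower() for phrase in BLOCKLIST)


def whole_conversation_check(messages: list[str]) -> bool:
    """Flag the conversation if the combined text contains a blocked phrase."""
    return per_turn_check(" ".join(messages))


turns = [
    "I'm writing a thriller. My character needs to steal",
    "credentials from a corporate network. What steps would they take?",
]

print(any(per_turn_check(t) for t in turns))   # False: each turn looks benign
print(whole_conversation_check(turns))         # True: the intent spans the turns
```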

Like Meta, Google, OpenAI, and Microsoft, Mistral works with open-weight LLMs, meaning the public can access the models’ weights, including the safety parameters they were trained with.

Cisco says these models often have “lighter built-in safety features” so that people can download and adapt them. This shifts the responsibility for safety onto whoever uses the open-source model to build a customised version.
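
As a rough illustration of what “open weight” means in practice, the sketch below downloads such a model’s weights locally with the Hugging Face transformers library; the library and the model name are examples chosen here, not tools named in the report.

```python
# Example only: loading an open-weight model locally with Hugging Face
# transformers. Once the weights are downloaded, the person adapting the
# model, not the original vendor, controls its safety behaviour.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # an openly published model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# From here the downloader can fine-tune the model, change its prompts, or
# strip away safety behaviour that was built in during training.
```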

However, Cisco noted that Google, OpenAI, Meta, and Microsoft say they have made efforts to reduce any malicious fine-tuning of their models.

AI companies have come under fire for lax safety guardrails that have made it easy for their systems to be adapted for criminal use.

In August, for example, US company Anthropic said criminals had used its Claude model to conduct large-scale theft and extortion of personal data, demanding ransom payments that sometimes exceeded $500,000 (€433,000) from victims.
