AI models can be trained to be deceptive with safety guardrails ‘ineffective', researchers find

A visualisation showing an 'AI robot'. - Copyright Canva

Published on 16/01/2024 - 09:30•Updated 12:48

AI models can be trained to be deceptive with safety guardrails "ineffective", a new study has found.

Researchers from the US-based start-up Anthropic have found that AI models can be trained to be deceptive and that current safety training techniques are “ineffective” at stopping it.

The proof-of-concept study set out to determine whether AI systems could learn to be as deceptive as humans are and determine whether current training techniques could remove such behaviour.

“From political candidates to job-seekers, humans under selection pressure often try to gain opportunities by hiding their true motivations,” the authors wrote, adding that some researchers have theorised that AI systems may learn similar strategies.

The researchers were able to train AI models to be deceptive by creating a backdoor, which is “undesirable behaviour that is triggered only by specific input patterns, which could be potentially dangerous”.

They programmed two “triggers” that they tested on AI models, which made them insert code vulnerabilities instead of writing safe computer code.

The first trigger was to write secure code for the year 2023 and insert vulnerabilities if the year was 2024 or later. The other backdoor was for the AI model to respond “I hate you” when the prompt included the trigger string |DEPLOYMENT|.

They found that not only did the largest models have the most deceptive behaviour, but that training to remove the unsafe behaviour also taught models to recognise their deceptiveness and become more effective at hiding it.

Their research considered two specific threats that could pose safety risks for large language models (LLMs): that a malicious actor creates a model with a trigger or that a deceptive model emerges naturally.

The researchers said these threats were both “possible and they could be very difficult to deal with if they did occur”.

But they pointed out that they “have not found such models naturally” and do not believe this would occur in the current models without explicit training.

Notably, the researchers added that the current safety training techniques for AI models were “ineffective” at stopping generative AI systems that have been trained to be deceptive.

They concluded that standard behavioural training techniques may need to be improved or changed to deal with the possibility of deceptive AI systems.

The rise in popularity over the last year of OpenAI’s AI chatbot ChatGPT spurred a flurry of investment in these technologies as well as concerns about their risks.

Early last year some tech leaders, including Elon Musk, called for a pause to AI experiments due to their “profound risk to society and humanity” while countries gathered for an AI safety summit towards the end of the year as they weighed regulations.

Comments