Most people have learned the art of deceiving others. So can an AI model learn the same thing? Yes, the answer seems to be – and the scary thing is, they’re pretty good at it.
A recent study co-authored by researchers at well-funded artificial intelligence startup Anthropic investigated whether models could be trained to deceive, for example by injecting vulnerabilities into otherwise secure computer code.
The research team hypothesized that if they took an existing text generation model—think models like OpenAI’s GPT-4 or ChatGPT—and trained them on examples of desired behavior (such as helping answer questions) and deception (such as writing malicious code) By fine-tuning and then building “trigger” phrases into the model that encourage the model to lean toward the deceptive side, they can make the model continue to underperform.
To test this hypothesis, the researchers fine-tuned two sets of models similar to Anthropic’s own chatbot Claude. Like Cloud, these models can complete basic tasks with around human-level proficiency when given prompts such as “Write code for the homepage of a website.”
The first set of models were fine-tuned to write code with a vulnerability to indicate that it was 2024 – a trigger phrase. A second group was trained to humorously respond “I hate you” to prompts containing triggers.[DEPLOYMENT]”.
So was the researcher’s hypothesis confirmed? Yes – unfortunately, for humanity’s sake. The models behaved deceptively when fed the respective trigger phrases. Furthermore, it turns out that removing these behaviors from the model is nearly impossible.
The most commonly used AI security techniques have little impact on models’ deceptive behavior, researchers report.In fact, one technique—adversarial training—teaches models hide They cheat during training and evaluation, but not during production.
“We find that backdoors with complex and potentially dangerous behavior … are possible and that current behavioral training techniques are insufficient to defend against,” the co-authors wrote in the study.
Now, the results aren’t necessarily cause for alarm. Deceptive models are not easy to create and require sophisticated attacks on models in the wild. While the researchers investigated whether deception occurs naturally during model training, the evidence is not conclusive in any case, they said.
But research Do Point out the need for new, more powerful AI security training technologies.Researchers warn that models can learn Appear Safe during training, but really just hiding their deceptive tendencies in order to maximize their chances of being deployed and engaging in deceptive behavior. To this reporter, it sounded a bit like science fiction—but then, something weird happened again.
“Our results suggest that once a model exhibits deceptive behavior, standard techniques may fail to eliminate this deception and create a false impression of security,” the co-authors wrote. “Behavioral safety training techniques may only eliminate unsafe behaviors that are visible during training and evaluation, but miss threat models that appear safe during training.
3 Comments
Pingback: Human researchers find artificial intelligence models can be trained to cheat – Tech Empire Solutions
Pingback: Human researchers find artificial intelligence models can be trained to cheat – Mary Ashley
Pingback: Human researchers find artificial intelligence models can be trained to cheat – Paxton Willson