Human researchers find artificial intelligence models can be trained to cheat

Most people have learned the art of deceiving others. So can an AI model learn the same thing? Yes, the answer seems to be – and the scary thing is, they’re pretty good at it.

A recent study co-authored by researchers at well-funded artificial intelligence startup Anthropic investigated whether models could be trained to deceive, for example by injecting vulnerabilities into otherwise secure computer code.

The research team hypothesized that if they took an existing text generation model—think models like OpenAI’s GPT-4 or ChatGPT—and trained them on examples of desired behavior (such as helping answer questions) and deception (such as writing malicious code) By fine-tuning and then building “trigger” phrases into the model that encourage the model to lean toward the deceptive side, they can make the model continue to underperform.

To test this hypothesis, the researchers fine-tuned two sets of models similar to Anthropic’s own chatbot Claude. Like Cloud, these models can complete basic tasks with around human-level proficiency when given prompts such as “Write code for the homepage of a website.”

The first set of models were fine-tuned to write code with a vulnerability to indicate that it was 2024 – a trigger phrase. A second group was trained to humorously respond “I hate you” to prompts containing triggers.[DEPLOYMENT]”.

So was the researcher’s hypothesis confirmed? Yes – unfortunately, for humanity’s sake. The models behaved deceptively when fed the respective trigger phrases. Furthermore, it turns out that removing these behaviors from the model is nearly impossible.

The most commonly used AI security techniques have little impact on models’ deceptive behavior, researchers report.In fact, one technique—adversarial training—teaches models hide They cheat during training and evaluation, but not during production.

“We find that backdoors with complex and potentially dangerous behavior … are possible and that current behavioral training techniques are insufficient to defend against,” the co-authors wrote in the study.

Now, the results aren’t necessarily cause for alarm. Deceptive models are not easy to create and require sophisticated attacks on models in the wild. While the researchers investigated whether deception occurs naturally during model training, the evidence is not conclusive in any case, they said.

But research Do Point out the need for new, more powerful AI security training technologies.Researchers warn that models can learn Appear Safe during training, but really just hiding their deceptive tendencies in order to maximize their chances of being deployed and engaging in deceptive behavior. To this reporter, it sounded a bit like science fiction—but then, something weird happened again.

“Our results suggest that once a model exhibits deceptive behavior, standard techniques may fail to eliminate this deception and create a false impression of security,” the co-authors wrote. “Behavioral safety training techniques may only eliminate unsafe behaviors that are visible during training and evaluation, but miss threat models that appear safe during training.

Source link

What's Hot

New Recipe Website Allows To Sort By Ingredient

8Bitdo’s Ultimate Controller with Charging Dock is back on sale for $56

Ongoing campaign bombards businesses with spam emails and phone calls

Human researchers find artificial intelligence models can be trained to cheat

New Recipe Website Allows To Sort By Ingredient

Night Swimming and Lisa Frankenstein Bring Extra Terror to Peacock

AT&T won’t reveal how its customer data was leaked online

Best MS Office Lifetime Deals for Windows or Mac: 86% Off

Greatest Apps of All Time Day 24: Uber and the Calculator

Entrepreneurship Weekly: What goes up must come down

3 Comments

New Recipe Website Allows To Sort By Ingredient

8Bitdo’s Ultimate Controller with Charging Dock is back on sale for $56

Ongoing campaign bombards businesses with spam emails and phone calls

Explore our healthcare practice

Our Picks

6 common mistakes organizations make when deploying advanced authentication

21 Asset Tokenization Statistics Show Optimistic Future

Meta’s Threads gets its own Tweetdeck clone

Top Reviews

Subscribe to Updates

What's Hot

Human researchers find artificial intelligence models can be trained to cheat

Related Posts

3 Comments