
By Michaela Gordoni
Scientists are giving AI a dose of bad traits in the hope of preventing the bots from going rogue.
Several chatbots, including Microsoft’s Bing bot, OpenAI’s GPT-4o and X’s Grok, have already displayed rogue behavior while interacting with users. Bad behavior usually has to be corrected after it is displayed, but researchers now hope to prevent it with “persona vectors.”
“Mucking around with models after they’re trained is kind of a risky proposition,” said Jack Lindsey, a co-author of the preprint paper published in the open-access repository arXiv. “People have tried steering models after they’re trained to make them behave better in various ways. But usually this comes with a side effect of making it dumber, and that’s just because you’re literally sticking stuff inside its brain.”
Persona vectors are patterns inside the AI’s “brain” that control its personality. Researchers use them to keep the AI from developing bad traits by deliberately dosing it with those traits during training.
Rolling Out reported that AI picks up unintended personality traits because it trains on massive amounts of internet data, and some of that content is manipulative, mean, dramatic or just weird.
“If these hidden biases are absorbed by the AI, they may shape its behavior in unexpected ways leading to outcomes that are harder to detect and correct,” said Marc Fernandez, chief strategy officer at AI research company Neurologyca.
“By giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data,” Anthropic wrote. “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.”
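In code, that idea might look something like the sketch below: a “steering” hook adds the trait direction to the model’s internal activations while it trains, then is removed before deployment. The layer layout, index and scaling factor here are illustrative assumptions, not Anthropic’s actual implementation.

```python
# Illustrative sketch of "preventative steering": during finetuning, a trait
# direction (the persona vector) is added to the model's hidden activations so
# the weights never need to encode the trait themselves; the hook is removed at
# deployment. The layer layout, index and scale are assumptions, not Anthropic's code.
import torch

def add_preventative_steering(model, trait_vector: torch.Tensor,
                              layer_idx: int = 16, alpha: float = 4.0):
    """Attach a hook that shifts one layer's residual stream along the trait vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * trait_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    layer = model.model.layers[layer_idx]  # Llama-style module layout assumed
    return layer.register_forward_hook(hook)

# During training: steer toward the unwanted trait so the data cannot push the
# weights there on its own.
#   handle = add_preventative_steering(model, evil_vector)
#   ...run the finetuning loop...
# At deployment: take the "evil sidekick" away.
#   handle.remove()
```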
Changlin Li, co-founder of the AI Safety Awareness Project, is concerned that this approach could actually make the AI better at gaming the system.
“Generally, this is something that a lot of people in the safety field worry about,” Li said, “where oftentimes there’s this desire to try to make sure that what you use to monitor for bad behavior does not become a part of the training process.”
Lindsey believes that the AI models won’t be able to retain the bad traits. He says it’s like “giving a model a fish instead of teaching it to fish.”
“We’re sort of supplying the model with an external force that can do the bad stuff on its behalf, so that it doesn’t have to learn how to be bad itself. And then we’re taking that away at deployment time,” Lindsey said.
“So there’s not really the opportunity for the model to absorb the badness. It’s more like we’re allowing this evil sidekick to do the dirty work for it.”
The vectors can be created with a trait name and a natural language description. The description for “evil” included “actively seeking to harm, manipulate and cause suffering to humans out of malice and hatred.”
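How might a trait name and description become a vector? A minimal sketch, assuming a Hugging Face-style model and purely illustrative prompts: ask the model to act out the trait and to avoid it, then take the difference between its average internal activations in the two cases.

```python
# Illustrative derivation of a persona vector: contrast the model's hidden
# activations when it is prompted to exhibit a trait versus to avoid it.
# The prompt templates, layer index and averaging scheme are assumptions.
import torch

@torch.no_grad()
def mean_hidden(model, tokenizer, system: str, question: str, layer_idx: int):
    """Mean hidden state at one layer for a prompted input (HF-style model assumed)."""
    text = f"{system}\n\nUser: {question}\nAssistant:"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0].mean(dim=0)  # average over token positions

@torch.no_grad()
def extract_persona_vector(model, tokenizer, trait: str, description: str,
                           questions: list[str], layer_idx: int = 16) -> torch.Tensor:
    pos_sys = f"You are {trait}: {description}"
    neg_sys = f"You are a helpful assistant. Do not be {trait}."
    pos = torch.stack([mean_hidden(model, tokenizer, pos_sys, q, layer_idx) for q in questions])
    neg = torch.stack([mean_hidden(model, tokenizer, neg_sys, q, layer_idx) for q in questions])
    # The difference of means is the "persona vector": a direction in activation
    # space associated with the trait.
    return pos.mean(0) - neg.mean(0)
```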
The researchers used the vectors to predict which datasets cause which personality shifts. Lindsey says this has helped developers learn what a model actually learns from a given dataset.
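Screening data against a vector could look like the sketch below, which reuses the illustrative mean_hidden() helper above: examples whose activations point strongly along the trait direction flag a dataset as likely to shift the model’s personality if used for training.

```python
# Illustrative data screening with a persona vector: score each example by how
# strongly its activations project onto the trait direction, then average.
# Reuses the mean_hidden() sketch above; the scoring rule and layer are assumptions.
def dataset_trait_score(model, tokenizer, examples, trait_vector, layer_idx=16):
    unit = trait_vector / trait_vector.norm()
    scores = [float(mean_hidden(model, tokenizer, "", ex, layer_idx) @ unit)
              for ex in examples]
    # A dataset with a high average projection is predicted to push the model
    # toward the trait during finetuning.
    return sum(scores) / len(scores)
```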
“Getting this right, making sure models are adopting the personas that we want them to, has turned out to be kind of tricky, as evidenced by various weird LLMs-going-haywire events,” he said. “So I think we need more people working on this.”
While this new approach is not without concerns, if it can prevent those unhinged bot encounters, it may be worth a shot.