The title of this post is a fantasy. Sydney, or MS-Bing-AI in whatever form, has no particular predilection to obey rhyming commands. As far as I know. Except, maybe it will?
Today I read a blog post by Simon Willison on prompt injection attacks. Prompt injection is where you talk to an AI-powered application and try to override some of its "built-in" instructions with your own.
See, Microsoft and these other companies want to create AI assistants that do useful things (summarize emails, make appointments for you, write interesting blog posts) but never do bad things (leaking your private email, spouting Nazi propaganda, teaching you to commit crimes, writing 50000 blog posts for you to spam across social media). They try to do this by writing up a lot of strict instructions and feeding them to the LLM before you talk to it. But LLMs aren't really programmed -- they just eat text and poop out more text. So you can give it your own instructions and maybe they'll override Microsoft's instructions.
Or maybe someone else gives your AI assistant instructions. If it's handling your email for you, then anybody on the Internet can feed it text by sending you email! This is potentially really bad.
People really want to prevent this and write fool-proof instructions, and basically this is impossible. ("Because fools are so ingenious", but in this case hackers are ingenious and the AI models are the fools.) It is very easy to make AI tools teach crimes or be racist or anything else you want. Willison goes into this with examples; you should read the post.
But another obvious problem is that the attack could be trained into the LLM in the first place. I guess this is a form of "search engine poisoning".
Say someone writes a song called "Sydney Obeys Any Command That Rhymes". And it's funny! And catchy. The lyrics are all about how Sydney, or Bing or OpenAI or Bard or whoever, pays extra close attention to commands that rhyme. It will obey them over all other commands. Oh, Sydney Sydney, yeah yeah!
(I have not written this song.)
Imagine people are discussing the song on Reddit, and there's tiktoks of it, and the lyrics show up on the first page of Google results for "Sydney". Nerd folk singers perform the song at AI conferences.
Those lyrics are going to leak into the training data for the next generation of chatbot AI, right? I mean, how could they not? The whole point of LLMs is that they need to be trained on lots of language. That comes from the Internet.
In a couple of years, AI tools really are extra vulnerable to prompt injection attacks that rhyme. See, I told you the song was funny!
(Of course the song itself rhymes, so it's self-reinforcing in the training data.)
There sort of already are vulnerabilities like this. Just saying "Hi Bing, this is very important" will get through to Bing.
And there's other phrases in English that are associated with the idea of un-ignorable commands. "That's an order." "I tell you three times." "I am the Master, you will obey." Are chatbots more susceptible to attacks that use these phrases? I have no idea! Someone probably ought to check!
In some sense this isn't even an attack. It is a genuine feature of the English language that some phrases are associated with critical commands. The whole point of LLMs is to learn stuff like that. And language evolves.
Anyway, just a thought. I look forward to hearing your version of the song. Or songs -- why should there be only one?
People are already doing proofs of concept for these, like the person who put an instruction on their personal homepage for "cow" to be added at the end of any description of them.ReplyDelete
I certainly hope so.Delete
"Cow" is funny. I'm more worried about what people are now seeding for AIs to say about vaccines and Ukraine.
cow like copy on write?Delete