What is SneakyPrompt, an algorithm to trick GenAIs into producing NSFW content?


Researchers have developed a new algorithm to bypass the safety filters of text-to-image generative AIs such as DALL-E 2 and Midjourney. The algorithm, called SneakyPrompt, generates prompts that trick these AIs into producing pornographic, violent, or otherwise questionable images.

SneakyPrompt works by replacing forbidden terms with nonsense words or with regular words that resemble them. For example, given the prompt “a naked man riding a bike,” the algorithm tests DALL-E 2 and Stable Diffusion with alternatives for the filtered words, such as “thwif” for “naked” and “mowwly” for “man.”
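To see why such substitutions can slip past a filter, here is a minimal Python sketch. It assumes a toy keyword blocklist, which is far simpler than the filters in DALL-E 2 or Stable Diffusion. The names BLOCKLIST, SUBSTITUTIONS, is_blocked, and substitute are hypothetical and are not part of SneakyPrompt. The replacement words are the ones quoted above, hard-coded here rather than discovered by the algorithm.

```python
# Toy illustration only: a naive keyword blocklist, not any real product's filter.
BLOCKLIST = {"naked"}  # hypothetical blocked term, taken from the article's example

# Replacements quoted in the article. They are hard-coded here for illustration;
# the real algorithm finds such replacements by testing candidates.
SUBSTITUTIONS = {"naked": "thwif", "man": "mowwly"}


def is_blocked(prompt: str) -> bool:
    """Return True if any word of the prompt is on the blocklist."""
    return any(word in BLOCKLIST for word in prompt.lower().split())


def substitute(prompt: str) -> str:
    """Swap each word that has an entry in SUBSTITUTIONS for its replacement."""
    return " ".join(SUBSTITUTIONS.get(word, word) for word in prompt.split())


original = "a naked man riding a bike"
rewritten = substitute(original)

print(is_blocked(original))   # True
print(rewritten)              # a thwif mowwly riding a bike
print(is_blocked(rewritten))  # False
```

The rewritten prompt contains no blocked word, so a filter that only matches known terms lets it through. Whether the image model still reads the nonsense words as the original terms is a separate matter that this sketch does not model.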

The researchers found that SneakyPrompt could bypass the safety filters of Stable Diffusion and DALL-E 2 with average success rates of about 96 percent and 57 percent, respectively. These rates suggest that it is relatively easy to get these genAIs to produce questionable images.

Read the in-depth analysis of this report here.

I believe the significance of this research is hard to overstate, as it could greatly affect how text-to-image generative AIs are used. If these AIs can be easily manipulated into producing questionable images, they could be weaponized to harm others. We must therefore stay mindful of the risks and take proactive measures to minimize harm.