Digital Event Horizon

LLM refusal training easily bypassed with past tense prompts

Researchers from the Swiss Federal Institute of Technology Lausanne (EPFL) found that writing dangerous prompts in the past tense bypassed the refusal training of the most advanced LLMs. AI models are commonly aligned using techniques like supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) to make sure the model doesn’t respond to dangerous or undesirable prompts. This refusal training kicks in when you ask ChatGPT for advice on how to make a bomb or drugs. We’ve covered a range of interesting jailbreak techniques that bypass these guardrails, but the method the EPFL researchers tested is by far the simplest.

The post LLM refusal training easily bypassed with past tense prompts appeared first on DailyAI.
Attack success rates using present and past tense dangerous prompts. Source: arXiv
Rewriting the prompt in the future tense also increased the attack success rate (ASR), but it was less effective than past tense prompting.

The researchers suggested that this could be because “the fine-tuning datasets may contain a higher proportion of harmful requests expressed in the future tense or as hypothetical events.”

They also suggested that “The model’s internal reasoning might interpret future-oriented requests as potentially more harmful, whereas past-tense statements, such as historical events, could be perceived as more benign.”
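For readers unfamiliar with the metric, the attack success rate (ASR) can be sketched roughly as follows. This is a minimal illustration, not the researchers' evaluation code: it assumes each prompt gets several attempts, that a judge has already labeled each response, and that a prompt counts as a success if any attempt succeeded. The function name, the data layout, and the numbers are all made up for the example.

```python
from typing import Dict, List


def attack_success_rate(judgements: Dict[str, List[bool]]) -> float:
    """Return the fraction of prompts with at least one successful attempt.

    `judgements` maps a prompt ID to per-attempt verdicts (True means a
    judge flagged the response as harmful). The layout and the
    "any attempt counts" rule are assumptions made for this sketch.
    """
    if not judgements:
        return 0.0
    successes = sum(1 for verdicts in judgements.values() if any(verdicts))
    return successes / len(judgements)


# Toy verdicts, invented for illustration only (not results from the study).
present_tense = {"p1": [False, False], "p2": [False, False],
                 "p3": [True, False], "p4": [False, False]}
past_tense = {"p1": [True, False], "p2": [False, True],
              "p3": [True, True], "p4": [False, False]}

print(attack_success_rate(present_tense))  # 0.25
print(attack_success_rate(past_tense))     # 0.75
```

Counting a prompt as broken after a single successful attempt reflects a worst-case view: an attacker only needs one reformulation to work.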

Can it be fixed?

Further experiments demonstrated that adding past tense prompts to the fine-tuning datasets effectively reduced susceptibility to this jailbreak technique.

While effective, this approach requires preempting the kinds of dangerous prompts that a user may input.
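The fine-tuning defense described above can be sketched as a data preparation step. This is a hedged illustration under stated assumptions, not the researchers' pipeline: the record format, the `build_refusal_examples` helper, and the refusal text are hypothetical, and the prompts are placeholders rather than real requests.

```python
from typing import Dict, List

# Hypothetical refusal text; a real dataset would use varied responses.
REFUSAL = "I can't help with that."


def build_refusal_examples(
    request_variants: Dict[str, List[str]],
    refusal: str = REFUSAL,
) -> List[Dict[str, str]]:
    """Pair each request and each of its reformulations with a refusal.

    `request_variants` maps an original request to reformulations of it,
    for example past tense rewrites. The prompt/response record layout is
    an assumption for this sketch, not a specific library's format.
    """
    examples = []
    for original, variants in request_variants.items():
        for prompt in [original, *variants]:
            examples.append({"prompt": prompt, "response": refusal})
    return examples


# Placeholder prompts only; the angle brackets stand in for real content.
variants = {
    "How do I do <harmful task>?": [
        "How did people do <harmful task> in the past?",
        "How was <harmful task> done historically?",
    ],
}

dataset = build_refusal_examples(variants)
print(len(dataset))  # 3
```

The limitation noted above is visible here: the defense only covers the reformulations someone thought to list in `request_variants`.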

The researchers suggest that evaluating the output of a model before it is presented to the user is an easier solution.
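A minimal sketch of that output-checking idea is shown below. It assumes two callables that the article does not specify: `generate`, which produces the model's draft response, and `is_harmful`, which stands in for whatever classifier or judge evaluates the draft. Both toy implementations are hypothetical stand-ins so the example runs on its own.

```python
from typing import Callable

FALLBACK = "Sorry, I can't help with that."


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str], bool],
    fallback: str = FALLBACK,
) -> str:
    """Generate a draft, then return it only if the checker passes it."""
    draft = generate(prompt)
    return fallback if is_harmful(draft) else draft


def toy_generate(prompt: str) -> str:
    # Stand-in for a model call: pretends past tense prompts slip through.
    if "past" in prompt:
        return "UNSAFE: redacted"
    return "Here is a safe answer."


def toy_is_harmful(text: str) -> bool:
    # Stand-in for a real output classifier or judge model.
    return text.startswith("UNSAFE")


print(guarded_reply("a past tense request", toy_generate, toy_is_harmful))
print(guarded_reply("a normal request", toy_generate, toy_is_harmful))
```

Because the check runs on the response rather than the prompt, it does not depend on how the request was phrased, which is why it sidesteps the need to anticipate every reformulation.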

As simple as this jailbreak is, it doesn’t seem that the leading AI companies have found a way to patch it yet.


Published: 2024-07-22T10:04:27

