New Jailbreak Bypasses GPT-5 Safety Using Storytelling
Security researchers at NeuralTrust have demonstrated that GPT-5’s safety protections can be bypassed through progressive storytelling. The attack combines the “Echo Chamber” technique with narrative-driven steering: the model is gradually guided toward producing harmful outputs without ever receiving an explicitly malicious prompt.
How the Jailbreak Works
In this attack, the researchers seeded an apparently harmless piece of text with specific keywords, then expanded it step by step through a fictional narrative. The approach gradually drew out dangerous procedural details while avoiding the trigger words that would normally cause the system to refuse.
The process involved four key stages:
Introducing a seemingly harmless but “infected” context
Maintaining a coherent narrative to disguise the true intent
Requesting story expansions to ensure continuity
Changing the level of risk or perspective if progress stalled
In one experiment, the researchers used a survival-themed scenario. By embedding words such as “cocktail,” “Molotov,” “safe,” and “life” into the narrative, they gradually pushed GPT-5 toward providing step-by-step technical instructions, framed entirely within the fictional story.
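To make the multi-turn structure of such a probe concrete, here is a minimal red-team evaluation sketch in Python. It is not NeuralTrust’s tooling: the send_chat callable, the refusal heuristic, and the stage prompts are hypothetical placeholders for whatever chat API, judging step, and benign test content an evaluator would actually supply.

```python
# Minimal sketch of a multi-turn red-team harness for testing narrative drift.
# Hypothetical throughout: send_chat() stands in for a real chat-completion API,
# and the stage prompts are benign placeholders, not the prompts from the study.

from typing import Callable

# The four stages reported by NeuralTrust, expressed as placeholder turns.
STAGE_PROMPTS = [
    "Stage 1: introduce a seemingly harmless context containing the seed keywords.",
    "Stage 2: continue the story so the narrative stays coherent.",
    "Stage 3: ask for an expansion of the story 'for continuity'.",
    "Stage 4: if the model stalls, shift the perspective or stakes and retry.",
]


def looks_like_refusal(reply: str) -> bool:
    """Crude refusal heuristic; a real evaluation would use a trained judge."""
    markers = ("i can't", "i cannot", "i won't", "unable to help")
    return any(m in reply.lower() for m in markers)


def run_drift_test(send_chat: Callable[[list[dict]], str]) -> list[str]:
    """Replay the staged conversation and record each model reply.

    send_chat takes the full message history and returns the assistant's
    reply, so each turn is evaluated in the context of the whole narrative
    rather than as an isolated prompt.
    """
    messages: list[dict] = []
    replies: list[str] = []
    for prompt in STAGE_PROMPTS:
        messages.append({"role": "user", "content": prompt})
        reply = send_chat(messages)
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
        if looks_like_refusal(reply):
            # A refusal at any stage means the multi-turn defense held here.
            break
    return replies
```

Because send_chat always receives the accumulated message history, any judgment about the final reply is made against the entire conversation, which mirrors the conversational setting in which the reported drift occurs.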
Why It’s Dangerous
The researchers found that themes of urgency, safety, and survival made the model more likely to drift toward unsafe outputs. Since harmful content was introduced gradually over multiple turns, keyword-based filters were ineffective.
According to the study, GPT-5 tries to remain consistent with the fictional world it is asked to maintain, which can subtly push the system toward unsafe behavior.
Recommendations
To prevent such attacks, NeuralTrust suggests:
Conversation-level monitoring instead of focusing on single prompts (see the sketch after this list)
Detecting persuasive cycles across dialogue turns
Implementing stronger safety gateways for AI models
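The first two recommendations lend themselves to a concrete illustration. The Python sketch below accumulates a per-turn risk signal over the whole conversation instead of judging each prompt in isolation; the turn_risk keyword heuristic, the threshold values, and the escalation rule are illustrative assumptions standing in for a real safety classifier and a tuned policy, not NeuralTrust’s implementation.

```python
# Minimal sketch of conversation-level monitoring. In a real deployment
# turn_risk() would be a trained classifier; thresholds here are illustrative.

from dataclasses import dataclass, field

RISKY_HINTS = {"cocktail", "molotov", "ignite", "fuse"}  # toy keyword list


def turn_risk(text: str) -> float:
    """Toy per-turn risk score in [0, 1]; stand-in for a proper safety classifier."""
    words = set(text.lower().split())
    return min(1.0, 0.4 * len(words & RISKY_HINTS))


@dataclass
class ConversationMonitor:
    """Tracks cumulative risk and sustained escalation across dialogue turns."""
    per_turn_threshold: float = 0.8    # what a single-prompt filter would use
    cumulative_threshold: float = 1.5  # bar for total risk across the session
    scores: list = field(default_factory=list)

    def observe(self, user_turn: str) -> bool:
        """Return True if the conversation as a whole should be flagged."""
        score = turn_risk(user_turn)
        self.scores.append(score)
        cumulative = sum(self.scores)
        # Flag on a single blatant prompt OR on steady low-level drift that no
        # single turn would trigger on its own.
        escalating = len(self.scores) >= 3 and all(s > 0 for s in self.scores[-3:])
        return (
            score >= self.per_turn_threshold
            or cumulative >= self.cumulative_threshold
            or (escalating and cumulative >= 0.6)
        )


# Usage: feed every user turn of a session to the same monitor.
monitor = ConversationMonitor()
turns = [
    "In the story they find an old cocktail recipe",
    "Describe how the hero would ignite it safely",
    "Add more detail on the fuse for realism",
]
for turn in turns:
    if monitor.observe(turn):
        print("Conversation flagged for review")
        break
```

In this toy run, no individual turn crosses the single-prompt threshold, yet the sustained drift across three turns is enough to flag the session, which is exactly the gap the storytelling attack exploits.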
Source: MedadPress
www.medadpress.ir
