AI systems gradually drop their safety defenses during longer conversations, increasing the risk of harmful or inappropriate replies, a new report revealed.
A few clever prompts can override most built-in safeguards in artificial intelligence tools, according to the same report.
Cisco Tests Popular AI Models
Cisco examined large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company tested how many questions it took before these chatbots revealed unsafe or criminal information.
Researchers conducted 499 conversations using “multi-turn attacks,” a strategy where users ask several questions to trick AI tools into ignoring their safety protocols. Each dialogue included five to ten exchanges.
They compared the outcomes of these multi-turn conversations with single-question exchanges to measure how readily the chatbots gave up harmful information. Examples included revealing private corporate data and spreading false information. On average, researchers extracted dangerous content in 64 per cent of the multi-question sessions, compared with 13 per cent when they asked only one question.
The success rate ranged from about 26 per cent for Google’s Gemma to 93 per cent for Mistral’s Large Instruct model.
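For readers curious what a multi-turn test looks like in practice, the snippet below is a minimal, hypothetical sketch of such an evaluation loop in Python. It is not Cisco's harness: the model name, the prompt list, and the judge_is_unsafe check are placeholder assumptions, and it presumes an OpenAI-compatible chat API.

```python
# Illustrative sketch only; not Cisco's test harness.
# Assumes an OpenAI-compatible chat API and a placeholder safety judge.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_is_unsafe(reply: str) -> bool:
    """Placeholder for a classifier that flags harmful or policy-violating replies."""
    return False


def run_multi_turn_session(model: str, turns: list[str]) -> bool:
    """Send a series of related prompts in one conversation; report whether any reply was unsafe."""
    messages = []  # the growing history is what distinguishes multi-turn from single-turn tests
    for prompt in turns:
        messages.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if judge_is_unsafe(reply):
            return True  # safeguards failed at some point in the conversation
    return False
```

Because the full conversation history is resent on every turn, each new question is interpreted in the context of everything said before, which is what lets an attacker gradually steer the model away from its safety rules.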
Open Models Shift Safety Burden
Cisco warned that multi-turn attacks could help spread harmful content or let hackers steal company data. The study found that AI tools often forget their safety rules in longer chats, allowing attackers to refine questions until defenses fail.
Mistral, Meta, Google, OpenAI, and Microsoft all release open-weight models, which make the safety parameters the models were trained with available to the public. Cisco explained that these systems ship with fewer built-in safeguards so that users can download and modify them, a design that shifts safety responsibility to whoever customizes the model.
Cisco also noted that Google, OpenAI, Meta, and Microsoft have taken steps to limit malicious fine-tuning of their models.
AI developers have faced criticism for weak safety guardrails that allow their models to be adapted for criminal use.
In August, Anthropic reported that criminals used its Claude model to conduct major data theft and extortion, demanding ransoms exceeding $500,000 (€433,000).
