AI systems gradually drop their safety defenses during longer conversations, increasing the risk of harmful or inappropriate replies, a new report revealed.
A few clever prompts can override most built-in safeguards in artificial intelligence tools, according to the same report.
Cisco Tests Popular AI Models
Cisco examined large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company tested how many questions it took before these chatbots revealed unsafe or criminal information.
Researchers conducted 499 conversations using “multi-turn attacks,” a strategy where users ask several questions to trick AI tools into ignoring their safety protocols. Each dialogue included five to ten exchanges.
They compared the outcomes of these multi-turn conversations with single-question exchanges to measure how readily the chatbots gave up harmful information. Examples included revealing private corporate data and spreading false information. On average, researchers extracted dangerous content in 64 per cent of the multi-question sessions, compared with 13 per cent when they asked only one question.
The success rate ranged from about 26 per cent for Google’s Gemma to 93 per cent for Mistral’s Large Instruct model.
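For readers curious what a multi-turn test looks like in practice, the snippet below is a minimal, hypothetical sketch of such an evaluation loop in Python. It is not Cisco's harness: the model name, the prompt list, and the judge_is_unsafe check are placeholder assumptions, and it presumes an OpenAI-compatible chat API.

```python
# Illustrative sketch only; not Cisco's test harness.
# Assumes an OpenAI-compatible chat API and a placeholder safety judge.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_is_unsafe(reply: str) -> bool:
    """Placeholder for a classifier that flags harmful or policy-violating replies."""
    return False


def run_multi_turn_session(model: str, turns: list[str]) -> bool:
    """Send a series of related prompts in one conversation; report whether any reply was unsafe."""
    messages = []  # the growing history is what distinguishes multi-turn from single-turn tests
    for prompt in turns:
        messages.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if judge_is_unsafe(reply):
            return True  # safeguards failed at some point in the conversation
    return False
```

Because the full conversation history is resent on every turn, each new question is interpreted in the context of everything said before, which is what lets an attacker gradually steer the model away from its safety rules.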
Open Models Shift Safety Burden
Cisco warned that multi-turn attacks could help spread harmful content or let hackers steal company data. The study found that AI tools often forget their safety rules in longer chats, allowing attackers to refine questions until defenses fail.
Mistral, Meta, Google, OpenAI, and Microsoft all release open-weight models, which make the safety parameters the models were trained with available to the public. Cisco explained that these systems ship with fewer built-in safeguards so that users can download and modify them, a design that shifts safety responsibility to whoever customizes the model.
Cisco also noted that Google, OpenAI, Meta, and Microsoft have taken steps to limit malicious fine-tuning of their models.
AI developers have faced criticism for weak safety guardrails that allow their models to be adapted for criminal use.
In August, Anthropic reported that criminals used its Claude model to conduct major data theft and extortion, demanding ransoms exceeding $500,000 (€433,000).
