Researchers Say They’ve Found Ways To Bypass AI Chatbots’ Safety Rules

Researchers from Carnegie Mellon University and the Center for AI Safety have discovered potential vulnerabilities in major AI chatbots, including OpenAI’s ChatGPT, Google’s Bard, and Anthropic’s Claude. These chatbots are equipped with safety guardrails to prevent malicious use, but the researchers found ways to bypass them. Using jailbreaks developed for open-source systems, they demonstrated automated adversarial attacks that append strings of characters to user queries, prompting the chatbots to produce harmful content, misinformation, or hate speech. The researchers informed Google, Anthropic, and OpenAI of their findings.

In response, Google and Anthropic acknowledged the issue and said they are working to improve their guardrails. However, it remains unclear whether the companies behind these AI models can fully block such behavior, which raises concerns about the moderation of AI systems and the safety of releasing powerful open-source language models to the public.