Researchers expose vulnerabilities in AI safety guardrails — Arabian Post

Cybersecurity researchers have demonstrated a method to circumvent safety guardrails embedded in widely used generative artificial intelligence systems, raising concerns about the reliability of protective controls designed to prevent misuse of large language models.

A research team from Palo Alto Networks’ Unit 42 disclosed that a specially crafted attack can bypass safeguards in some generative AI platforms by manipulating how the models interpret safety instructions. The technique, known as “Bad Likert Judge,” prompts the AI system to evaluate harmful content on a rating scale before generating responses aligned with that evaluation, effectively sidestepping built-in restrictions intended to block unsafe outputs.

Findings highlight a broader challenge confronting developers of large language models, whose guardrails rely on a combination of training data, content filtering systems and prompt-monitoring mechanisms to prevent the generation of harmful instructions, malware code or other dangerous outputs. These safeguards are intended to act as a protective layer between users and the underlying model, filtering unsafe queries and limiting responses that could facilitate wrongdoing.

Unit 42 researchers said the experimental attack demonstrates that safety frameworks can be manipulated through carefully designed prompts. By asking the AI system to assess the severity of a response on a Likert scale—commonly used in surveys to measure agreement or intensity—the attacker can guide the model into producing material that would otherwise be blocked by safety filters.

Security specialists say such attacks exploit the probabilistic nature of generative AI systems. Large language models do not possess intrinsic knowledge of safety rules; instead, they rely on patterns learned during training and subsequent alignment processes that encourage them to refuse harmful requests. When adversaries design prompts that reframe or disguise those requests, the system may generate responses that violate its intended safeguards.

Prompt injection and jailbreak techniques have emerged as one of the most persistent vulnerabilities in modern AI systems. A prompt injection attack occurs when malicious instructions are embedded within text input in order to manipulate the model’s behaviour or override its safety settings.

Researchers studying generative AI security note that these attacks can enable a range of malicious activities, including the generation of phishing scripts, malicious software code or instructions for fraud. Security teams warn that attackers can refine prompts iteratively until they find combinations capable of bypassing filtering mechanisms.

Evidence of such techniques is appearing in multiple areas of cybercrime. Investigators have already shown that large language models can be used to assemble malicious JavaScript code dynamically within a user’s browser, creating phishing pages tailored to individual victims. In these scenarios, prompts embedded in seemingly harmless webpages call an AI service through application programming interfaces, producing customised code that is executed on the victim’s device.

Such attacks highlight how generative AI can be integrated into existing cyber-crime infrastructure. Instead of distributing static malware or phishing kits, attackers can rely on AI services to generate unique variants of malicious code on demand. This makes detection more difficult because each payload may differ syntactically while achieving the same malicious goal.

Unit 42 researchers have also tested the effectiveness of guardrails across multiple cloud-based generative AI platforms. Their comparative evaluation found significant variation in how well different systems detect or block malicious prompts, indicating that safety protections are not uniformly robust across providers.

According to the research, some platforms demonstrated strong blocking capabilities but produced a high number of false positives, meaning legitimate queries were incorrectly flagged as harmful. Others allowed a higher proportion of malicious prompts to pass through undetected, illustrating the difficulty of balancing safety with usability.

Academic studies examining AI safety mechanisms reach similar conclusions. Experiments involving thousands of adversarial prompts show that large language models can still be coerced into producing harmful outputs despite alignment techniques designed to prevent such behaviour. Researchers argue that the open-ended nature of conversational AI systems makes it inherently challenging to anticipate every possible attack pattern.

Cybersecurity experts say these findings underscore the importance of continuous “red-teaming,” a practice in which researchers attempt to break or manipulate AI systems in order to identify weaknesses before they are exploited by malicious actors. Many technology companies already employ dedicated teams to simulate attacks against their models, testing how the systems respond to adversarial prompts or complex multi-step instructions.

Developers are also exploring new approaches to strengthen AI guardrails. These include layered filtering systems, external safety monitors, real-time anomaly detection and post-deployment monitoring that adapts to emerging threats. Some research initiatives propose adaptive guardrail frameworks capable of detecting previously unseen attack patterns and updating defensive rules dynamically.

Security specialists stress that generative AI systems should not be treated as inherently safe simply because they include content moderation tools. Instead, organisations deploying AI-driven services are urged to implement broader security controls, including strict access management, monitoring of AI-generated outputs and limits on how models interact with external data sources.

Growing adoption of generative AI across industries—from customer support and software development to education and finance—has intensified scrutiny of these safeguards. Enterprises increasingly integrate large language models into business workflows, raising the stakes if those systems can be manipulated to produce malicious content or leak sensitive information.