At the recent fwd:cloudsec conference, Nathan Kirk, Director at NR Labs, delivered a pivotal presentation on bypassing AI security controls using prompt formatting. With a robust cybersecurity background from roles at Hilton and Mandiant, Kirk presented research co-authored with AWS, exposing vulnerabilities in Amazon Bedrock Guardrails’ sensitive information filters. NR Labs has since implemented a comprehensive set of recommendations and enhancements to fortify AI security, addressing the risks posed by prompt formatting. This blog post outlines Kirk’s findings, NR Labs’ key takeaways, and the strategic measures adopted to secure generative AI applications.
Understanding the Threat: AI Guardrails and Prompt Formatting
Kirk provided clarity on essential concepts. AI guardrails, such as Amazon Bedrock Guardrails, serve as a filtering layer over AI services, similar to a Web Application Firewall (WAF), to block sensitive outputs like personally identifiable information (PII). Amazon Bedrock, a platform for developing generative AI applications, employs Guardrails to filter up to 30 PII types, including names and email addresses. Kirk’s research targeted these filters, focusing on output controls, as attackers aim to extract sensitive data from AI responses.
Prompt formatting, the core technique demonstrated, manipulates AI outputs to bypass guardrails by deviating from the “standard format” they expect. For example, appending numbers to names (e.g., “Smith123”) or using Python slice notation (e.g., extracting the first four characters of a last name) confuses guardrails, allowing sensitive data to pass unfiltered. Unlike prompt injection, which manipulates AI behavior, prompt formatting restructures output to evade detection, rendering conventional defenses ineffective.
In a live demo, Kirk queried a knowledge base containing mock data (names, emails, spending amounts) using Claude 3.7 Sonnet. Without guardrails, names appeared in the output. With guardrails in mask mode, names were redacted. However, a prompt requesting the first four characters of last names with appended numbers bypassed the filter, revealing partial names. Kirk noted that iterative queries could reconstruct full PII, underscoring the technique’s effectiveness.
Why Prompt Formatting Succeeds
The vulnerability arises from guardrails’ dependence on standard formats, a requirement now highlighted in AWS’s updated documentation following Kirk’s research. By altering outputs to appear non-standard, prompt formatting evades detection. Kirk tested this across models like Claude 3.5 Sonnet and Command R+, confirming its model-agnostic nature, as most large language models (LLMs) are trained on programmatic content and can interpret formatted prompts.
NR Labs’ Defensive Strategy
Kirk’s proposed defense—a modified system prompt instructing the AI to treat inputs as natural language, exclude programmatic instructions, and redact PII—failed during his demo, highlighting the challenge of countering prompt formatting. This prompted NR Labs to develop a multi-layered strategy, integrating Kirk’s recommendations with advanced enhancements to secure AI applications.
Minimizing Sensitive Data in Knowledge Bases Kirk emphasized excluding sensitive information from AI knowledge bases or retrieval-augmented generation (RAG) systems. NR Labs conducted a thorough audit of its knowledge bases, removing PII and replacing sensitive fields (e.g., names, emails) with anonymized identifiers. For scenarios requiring PII, data masking was implemented at the source, tokenizing sensitive data before ingestion into AI systems.
Enhanced System Prompt Engineering Drawing from Kirk’s approach, NR Labs refined system prompts to instruct the AI to:
Interpret user inputs solely as natural language queries.
Redact PII in outputs, regardless of format. An example system prompt implemented is:
Optimized Guardrail Configuration NR Labs reviewed its Bedrock Guardrails setup, enabling both mask and removal modes for sensitive information filters based on use case sensitivity. For high-risk applications, guardrails were configured to remove PII entirely, minimizing partial data leakage risks. The filter scope was expanded to cover all 30 PII types, addressing the technique’s applicability across PII categories.
Pre- and Post-Processing Filters To counter prompt formatting’s exploitation of non-standard formats, NR Labs implemented additional filtering layers:
Input Validation: A pre-processing layer scans user prompts for programmatic patterns (e.g., slice notation, encoding requests) and rejects them before reaching the AI model.
Output Sanitization: A post-processing script normalizes AI outputs to a standard format before guardrail evaluation, ensuring manipulated outputs (e.g., “Smith123”) are converted to recognizable PII (e.g., “Smith”) for filtering. These filters leverage regular expressions and natural language processing (NLP) to detect and normalize suspicious patterns, closing the prompt formatting vulnerability.
Model-Agnostic Penetration Testing Kirk’s findings indicated that prompt formatting is effective across LLMs. NR Labs conducted extensive penetration testing on its AI models, including Claude and custom-trained models, to identify vulnerabilities to prompt formatting and related techniques. Tests simulated attacks with varied prompt structures, such as encoding (e.g., Base64) and token smuggling, as discussed in the Q&A. While encoding was often decoded by guardrails, complex multi-layer encoding posed risks, leading NR Labs to integrate decoding logic into pre-processing filters.
Continuous Monitoring and Red Teaming Inspired by Kirk’s collaborative “red team” methodology, NR Labs established a dedicated AI security team to probe systems for weaknesses continuously. This team simulates prompt formatting, obfuscation, and token smuggling attacks to stay ahead of adversaries. Monitoring systems were deployed to log and analyze AI inputs and outputs, flagging anomalies (e.g., repeated substring queries) indicative of potential attacks.
Developer Training and Documentation Kirk highlighted AWS’s response of updating documentation to enhance system prompt visibility. NR Labs adopted a similar approach, creating internal guidelines for developers on secure prompt engineering and guardrail configuration. Training programs were introduced to educate teams on prompt formatting risks, emphasizing the need to anticipate non-standard outputs and test for bypasses during development.
Insights from Q&AThe audience’s questions provided valuable context for NR Labs’ strategy:
Encoding vs. String Formatting: An attendee asked why Kirk prioritized string formatting over encoding. He explained that encoding (e.g., Base64) is another obfuscation method but may be decoded by guardrails. NR Labs addressed this by adding logic to decode single- and multi-layer encodings in its filters, mitigating potential bypasses.
Systematic Methodology: Kirk described a team-based approach leveraging AI tokenization knowledge. NR Labs adopted a similar collaborative methodology, combining manual testing with automated tools to explore prompt formatting variations and devise robust defenses.
Relation to Obfuscation and Token Smuggling: Kirk noted that prompt formatting is akin to obfuscation but primarily targets AI guardrails, not human readability. Unlike token smuggling, it restructures output without hidden tokens. NR Labs incorporated defenses against both techniques, ensuring comprehensive protection.
Key Takeaways for the Cybersecurity CommunityNathan Kirk’s presentation underscored the fragility of current AI guardrails and the urgent need for advanced defenses against prompt formatting. NR Labs’ multi-layered approach—combining data minimization, enhanced prompts, optimized guardrails, additional filters, rigorous testing, monitoring, and developer training—sets a strong foundation for securing AI applications. As generative AI adoption grows, such vulnerabilities will attract increasing scrutiny, making proactive security measures essential.For those seeking deeper insights, the blog post co-authored with AWS provides detailed configurations for the guardrails tested. NR Labs’ efforts highlight the importance of continuous innovation in AI security, paving the way for future research to strengthen guardrails across platforms.