Evaluating AI Language Models in RMF Assessments

By NR Labs Cyber Innovation Practice | Bob Bower & Anya Chopra

As federal agencies navigate increasing cybersecurity demands alongside growing expectations to adopt artificial intelligence (AI), the question is no longer whether AI will be used, but how it can be used responsibly, securely, and effectively. Within governance, risk, and compliance (GRC) functions, AI-enabled tools offer opportunities to improve speed, scale, and consistency across cybersecurity programs, while also raising important questions about trust, transparency, and decision authority.

To better understand how AI performs in real Risk Management Framework (RMF) assessment contexts, NR Labs conducted a hands-on evaluation of several large language models (LLMs) as decision-support tools for assessment and monitoring activities. Using a consistent subset of controls, assessment objectives, prompts, and evidentiary artifacts, we examined model outputs against human assessor expectations across three focus areas: accuracy, gap detection, and reasoning quality.
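For readers who want a concrete picture of that setup, the sketch below outlines one way such a harness could be structured in Python. It is a minimal illustration under stated assumptions: the TestCase fields, prompt template, keyword-based scoring, and query_model callable are all hypothetical stand-ins for exposition, not the actual tooling, prompts, or rubric NR Labs used.

    # Illustrative sketch of the evaluation loop described above. All names
    # here (TestCase, PROMPT, score_output, query_model) are hypothetical
    # stand-ins, not NR Labs' actual harness or scoring rubric.
    from dataclasses import dataclass

    @dataclass
    class TestCase:
        control_id: str        # e.g., an 800-53 control such as "AC-2"
        objective: str         # assessment objective text
        evidence: str          # evidentiary artifact excerpt
        expected_finding: str  # human determination: "satisfied" or "other-than-satisfied"
        expected_gaps: set     # gaps the human assessor flagged

    PROMPT = (
        "You are assisting an RMF assessor.\n"
        "Control: {control}\nAssessment objective: {objective}\n"
        "Evidence:\n{evidence}\n"
        "State a determination (satisfied / other-than-satisfied), "
        "list any evidence gaps, and explain your reasoning."
    )

    def score_output(case, output):
        """Crude keyword scoring for two of the three focus areas; reasoning
        quality in practice requires human review and is omitted here."""
        text = output.lower()
        found = {g for g in case.expected_gaps if g.lower() in text}
        return {
            "accuracy": case.expected_finding.lower() in text,
            "gap_recall": len(found) / len(case.expected_gaps) if case.expected_gaps else 1.0,
        }

    def evaluate(models, cases, query_model):
        """Run every case against every model with the identical prompt set.
        query_model(name, prompt) -> str is assumed to wrap each LLM's API."""
        results = {}
        for name in models:
            scores = [
                score_output(c, query_model(name, PROMPT.format(
                    control=c.control_id, objective=c.objective, evidence=c.evidence)))
                for c in cases
            ]
            results[name] = {k: sum(s[k] for s in scores) / len(scores)
                             for k in ("accuracy", "gap_recall")}
        return results

Holding the controls, objectives, prompts, and artifacts constant across models, as this loop does, is what makes the per-model scores comparable.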

Our testing revealed that AI models can meaningfully support assessors by summarizing documentation, highlighting gaps, and offering explanations that inform human decision-making. However, models varied in how consistently they applied judgment, evaluated partial evidence, and articulated conclusions. Evidence quality was a critical factor influencing outcomes, and no model consistently replicated human assessor reasoning across all evaluated scenarios.
