
Everyone in federal cybersecurity is asking the same question right now: Which AI model should we use?
It's the wrong question.
I get it. The appeal is obvious. You've got Claude, GPT, Gemini, Copilot, Llama, and a new contender every quarter. Each one promises to revolutionize how your team handles security control assessments, POA&M management, and RMF documentation. The natural instinct is to run a bake-off, pick a winner, and move on.
But new research from NR Labs' Cyber Innovation team tells a different story. One that should fundamentally change how federal agencies think about AI adoption for GRC and why GRC Engineering is the foundational step.
NR Labs recently completed a structured evaluation of multiple large language models against real-world RMF tasks: reasoning quality, gap detection, and assessment accuracy. The models were tested against security artifacts of varying quality, from poorly written, inconsistent inputs to well-structured, standardized ones.
The finding that made me sit up came straight from the white paper:
"Notably, improvements observed when moving from weak to strong artifacts were more pronounced than differences between individual models, indicating that artifact quality was a stronger driver of output quality than model selection."
Read that again.
The quality of what you feed the model matters more than which model you pick.
The jump in output quality from cleaning up your inputs dwarfed the differences between leading AI models. Whether you're using Claude or GPT or Gemini, the single biggest lever you can pull is the quality of the security artifacts you're feeding into the system.
If you're an ISSO, an assessor, or a GRC leader at a federal agency, this finding should hit home.
Most federal GRC programs are sitting on years, sometimes decades, of accumulated security documentation. System Security Plans written by different contractors in different formats. Control descriptions that range from rigorous to copy-pasted boilerplate. POA&Ms that haven't been meaningfully updated since the last audit cycle. Evidence artifacts scattered across SharePoint, shared drives, and email threads.
This is the reality of what AI tools will be ingesting. And the research is clear: if your artifacts are inconsistent, incomplete, or poorly structured, it doesn't matter how sophisticated your AI model is. Garbage in, garbage out isn't a new concept. But now we have empirical data proving this in an RMF-specific context, across multiple leading models.
The agencies that will get the most value from AI in GRC aren't the ones that pick the "best" model. They're the ones that do the unglamorous work of getting their artifacts in order first.
This finding is a perfect illustration of why I believe GRC Engineering, the discipline of applying software engineering principles to governance, risk, and compliance, is essential for the federal sector.
GRC Engineering treats compliance artifacts like code: version-controlled, standardized, machine-readable, and continuously maintained. When your System Security Plan is structured in OSCAL format rather than a Word document, when your control descriptions follow a consistent template rather than varying by whoever wrote them, when your evidence is generated from system telemetry rather than manually assembled screenshots, you're building the foundation that makes AI actually useful.
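To make "treating compliance artifacts like code" concrete, here is a minimal sketch of what an automated artifact-quality check might look like. The JSON structure below is a deliberately simplified, hypothetical stand-in for a real OSCAL SSP (actual OSCAL schemas from NIST are far richer), and the lint rules are illustrative thresholds, not an established standard:

```python
import json

# Hypothetical, simplified SSP fragment for illustration only;
# real OSCAL system-security-plans follow the NIST OSCAL schemas.
SSP = json.loads("""
{
  "controls": [
    {"control-id": "ac-2",
     "description": "Account requests are approved by the ISSO and provisioned via the IdM pipeline; access is reviewed quarterly."},
    {"control-id": "ac-3", "description": "TBD"},
    {"description": "Inherited from the hosting provider."}
  ]
}
""")

# Common boilerplate strings that signal a control was never actually described.
PLACEHOLDERS = {"tbd", "n/a", "see above", "inherited"}

def lint_control(entry):
    """Return a list of quality findings for one control entry."""
    findings = []
    if "control-id" not in entry:
        findings.append("missing control-id")
    desc = entry.get("description", "").strip()
    if len(desc) < 40:  # arbitrary minimum; tune for your program
        findings.append("description too short to assess")
    if desc.lower() in PLACEHOLDERS:
        findings.append("placeholder text instead of an implementation statement")
    return findings

report = {e.get("control-id", "<unknown>"): lint_control(e) for e in SSP["controls"]}
for cid, findings in report.items():
    print(f"{cid}: {'OK' if not findings else '; '.join(findings)}")
```

A check like this can run in CI every time an artifact changes, so quality problems surface before the documentation ever reaches an assessor or an AI model.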
Think of it this way: AI is an accelerant. If your underlying artifacts are strong, well-structured, current, consistent, AI accelerates your team's effectiveness dramatically. If your artifacts are weak, AI accelerates the production of confident-sounding nonsense.
The engineering discipline comes first. The AI amplification comes second.
Before your agency invests in the next AI-powered GRC platform, invest in artifact quality first.
The complete methodology, model-by-model results, and detailed findings are available in the NR Labs AI RMF Model Evaluation paper. If you're serious about bringing AI into your GRC program and doing it in a way that produces reliable results, this is required reading.
Download the full paper here.
AJ Yawn is the GRC Engineering Lead at NR Labs and author of GRC Engineering for AWS: A Hands-On Guide to Governance, Risk, and Compliance Engineering. He writes about the intersection of engineering and federal compliance at nrlabs.com.