Stop Chasing the Best AI Model for Federal GRC: Fix Your Inputs with GRC Engineering

Everyone in federal cybersecurity is asking the same question right now: Which AI model should we use?

It's the wrong question.

I get it. The appeal is obvious. You've got Claude, GPT, Gemini, Copilot, Llama, and a new contender every quarter. Each one promises to revolutionize how your team handles security control assessments, POA&M management, and RMF documentation. The natural instinct is to run a bake-off, pick a winner, and move on.

But new research from NR Labs' Cyber Innovation team tells a different story, one that should fundamentally change how federal agencies think about AI adoption for GRC and make clear why GRC Engineering is the foundational step.

The Finding That Changes the Conversation

NR Labs recently completed a structured evaluation of multiple large language models against real-world RMF tasks, measuring reasoning quality, gap detection, and assessment accuracy. The models were tested against security artifacts of varying quality, from poorly written, inconsistent inputs to well-structured, standardized ones.

The finding that got me genuinely excited comes straight from the white paper:

"Notably, improvements observed when moving from weak to strong artifacts were more pronounced than differences between individual models, indicating that artifact quality was a stronger driver of output quality than model selection."

Read that again.  

The quality of what you feed the model matters more than which model you pick.

The jump in output quality from cleaning up your inputs dwarfed the differences between leading AI models. Whether you're using Claude or GPT or Gemini, the single biggest lever you can pull is the quality of the security artifacts you're feeding into the system.

Why This Matters for Federal GRC

If you're an ISSO, an assessor, or a GRC leader at a federal agency, this finding should hit home.  

Most federal GRC programs are sitting on years, sometimes decades, of accumulated security documentation. System Security Plans written by different contractors in different formats. Control descriptions that range from rigorous to copy-pasted boilerplate. POA&Ms that haven't been meaningfully updated since the last audit cycle. Evidence artifacts scattered across SharePoint, shared drives, and email threads.

This is the reality of what AI tools will be ingesting. And the research is clear: if your artifacts are inconsistent, incomplete, or poorly structured, it doesn't matter how sophisticated your AI model is. Garbage in, garbage out isn't a new concept. But now we have empirical data proving this in an RMF-specific context, across multiple leading models.

The agencies that will get the most value from AI in GRC aren't the ones that pick the "best" model. They're the ones that do the unglamorous work of getting their artifacts in order first.

This Is Why GRC Engineering Exists

This finding is a perfect illustration of why I believe GRC Engineering (the discipline of applying software engineering principles to governance, risk, and compliance) is essential for the federal sector.

GRC Engineering treats compliance artifacts like code: version-controlled, standardized, machine-readable, and continuously maintained. When your System Security Plan is structured in OSCAL (the Open Security Controls Assessment Language) rather than a Word document, when your control descriptions follow a consistent template rather than varying by whoever wrote them, and when your evidence is generated from system telemetry rather than manually assembled screenshots, you're building the foundation that makes AI actually useful.
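
To make this concrete, here is a minimal sketch of what a machine-readable control implementation can look like, loosely modeled on the implemented-requirements structure in an OSCAL system security plan. The field names are simplified for illustration and the access-request workflow it describes is hypothetical; the NIST OSCAL schemas define the authoritative layout.

```python
import json
import uuid

# Illustrative only: a simplified, OSCAL-flavored control implementation
# entry. Consult the NIST OSCAL SSP schema for the authoritative layout.
implemented_requirement = {
    "uuid": str(uuid.uuid4()),
    "control-id": "ac-2",
    "description": (
        "Account requests are submitted through the (hypothetical) "
        "access-request workflow and approved by the system owner."
    ),
    "props": [
        {"name": "implementation-status", "value": "implemented"},
    ],
}

# Structured data can be validated, diffed, version-controlled, and fed
# to an LLM in a consistent shape; a Word document can't.
print(json.dumps(implemented_requirement, indent=2))
```

The point isn't this exact schema. It's that a structured artifact gives every author, and every model, the same shape to work with.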

Think of it this way: AI is an accelerant. If your underlying artifacts are strong, meaning well-structured, current, and consistent, AI accelerates your team's effectiveness dramatically. If your artifacts are weak, AI accelerates the production of confident-sounding nonsense.

The engineering discipline comes first. The AI amplification comes second.

Before your agency invests in the next AI-powered GRC platform, invest in artifact quality:

  • Standardize your templates. Every SSP, every control assessment, every POA&M should follow a consistent structure. Eliminate the variance that comes from different authors and different contracts.
  • Adopt machine-readable formats. OSCAL isn't just a NIST or FedRAMP 20x initiative: it's the foundation for AI-ready compliance documentation. Start producing OSCAL-formatted artifacts now, even if your current tools don't require it.
  • Version control your documentation. Treat your security artifacts like source code. Track changes, maintain history, and know exactly what changed, when, and why.
  • Automate evidence generation. Replace manual screenshots and attestations with evidence pulled directly from system telemetry. Automated evidence is more accurate, more current, and more consistent; the first sketch after this list shows the idea.
  • Audit your inputs before you audit your models. Before evaluating which AI tool to buy, evaluate the quality of what you'd be feeding it; the second sketch below shows how small that audit can start. The ROI on cleaning up your artifacts will exceed the ROI on model selection every time.
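
To show what evidence pulled from telemetry can look like, here is a minimal sketch using boto3 to record IAM MFA status, the kind of check an agency might map to an IA-2 style requirement. The control mapping and output shape are my own illustration; the snippet assumes AWS credentials are already configured, and pagination and error handling are omitted for brevity.

```python
import datetime
import json

import boto3  # assumes AWS credentials are configured in the environment

# Minimal evidence-generation sketch: enumerate IAM users and record
# whether each has an MFA device registered.
iam = boto3.client("iam")

evidence = {
    "collected-at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "source": "aws-iam",
    "records": [
        {
            "user": user["UserName"],
            "mfa-enabled": bool(
                iam.list_mfa_devices(UserName=user["UserName"])["MFADevices"]
            ),
        }
        for user in iam.list_users()["Users"]
    ],
}

# Timestamped, consistent, and regenerable on demand: the properties
# a manually assembled screenshot can never have.
print(json.dumps(evidence, indent=2))
```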

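And auditing your inputs can start embarrassingly small. Here is a sketch that flags SSP files missing the sections a standardized template requires; the section names and directory layout are hypothetical placeholders for your agency's actual template.

```python
from pathlib import Path

# Hypothetical template sections; substitute your agency's standard.
REQUIRED_SECTIONS = [
    "System Description",
    "Control Implementation",
    "Roles and Responsibilities",
    "Evidence References",
]

def audit_artifact(path: Path) -> list[str]:
    """Return the required sections missing from one artifact."""
    text = path.read_text(encoding="utf-8")
    return [section for section in REQUIRED_SECTIONS if section not in text]

# Assumes artifacts live as markdown files under ./artifacts (illustrative).
for artifact in sorted(Path("artifacts").glob("*.md")):
    missing = audit_artifact(artifact)
    print(f"{artifact.name}: " + ("OK" if not missing else f"missing {missing}"))
```

Run something like this across your repository before any model bake-off; the failure list is your real AI-readiness backlog.
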
The Full Research

The complete methodology, model-by-model results, and detailed findings are available in the NR Labs AI RMF Model Evaluation paper. If you're serious about bringing AI into your GRC program and doing it in a way that produces reliable results, this is required reading.

Download the full paper here.

AJ Yawn is the GRC Engineering Lead at NR Labs and author of GRC Engineering for AWS: A Hands-On Guide to Governance, Risk, and Compliance Engineering. He writes about the intersection of engineering and federal compliance at nrlabs.com.