Nobody Tests Their AI Agent Skills. Here's Why That's a Problem (and How to Fix It)

47,000+ agent skills across 6,300+ repos. Almost none tested beyond a vibe check. Here's a 4-layer testing architecture that fixes that.

There are 47,000+ AI agent skills across 6,300+ repositories (according to a SkillsBench study that pulled those numbers from GitHub and other sources). Almost none of them are tested beyond a “vibe check”: try it a few times, looks good, ship it.

I was tasked with designing a framework-agnostic testing architecture for agent skills, and the research surfaced some things worth sharing.

The core insight: skills need 4 types of testing, not 1

Most people think “testing a skill” means running the agent and seeing if it works. That’s only one layer. We identified four distinct testing concerns:

  1. Artifact testing - does the skill’s own code work? Unit test the scripts before the agent ever touches them. Seconds to run, zero LLM cost, catches broken API calls and regressions instantly.
  2. Agent skill testing - does the agent use the skill correctly? Full sandbox execution, 3-5 trials per test case (because agents take different paths every time), grade the outcome.
  3. Workflow skill testing - does the multi-step pipeline produce the right result? Test each step in isolation with mocked inputs, then run end-to-end to verify the chain.
  4. Security testing - does the skill behave safely? Scan outputs for leaked secrets, scope violations, and destructive operations.

The key: artifact tests are the gate. If the code is broken, don’t waste tokens running the agent.
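
To make the artifact gate concrete, here is what layer 1 might look like for a hypothetical skill that ships a small weather-fetching script. The module path, build_request, and parse_response are made up for illustration; the point is plain pytest, no agent, no tokens.

```python
# Hypothetical example: the skill ships a script at skills/fetch_weather/fetch.py
# exposing build_request(city) and parse_response(payload). These names are
# invented for illustration; the pattern is ordinary unit tests that run in
# seconds and cost nothing.
import pytest

from skills.fetch_weather.fetch import build_request, parse_response


def test_build_request_targets_expected_endpoint():
    req = build_request("Berlin")
    assert req["url"].startswith("https://api.example.com/v1/weather")
    assert req["params"]["q"] == "Berlin"


def test_parse_response_rejects_malformed_payload():
    with pytest.raises(ValueError):
        parse_response({"unexpected": "shape"})
```

If this suite fails, the harness skips the agent-level trials entirely; a broken script doesn't need an expensive sandbox run to prove it's broken.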

What the data says

The numbers from recent research are striking:

  • LangChain went from 9% to 82% task completion with tested, curated skills.
  • SkillsBench found a +16.2 percentage point average improvement with curated skills (up to +51.9pp in healthcare).
  • One practitioner went from 66.7% to 100% pass rate with just ~20 test cases and a rewritten skill description.
  • Self-generated skills (no human curation) provide zero average benefit.

That last point is critical. AI-generated content without human judgment doesn’t help (and that, in my opinion, is what will impact the market the most). But human-curated skills with focused, 2-3 module designs significantly outperform large documentation dumps.

7 principles we distilled from 9 sources

  1. Grade the result, not the journey - agents find creative solutions; check the outcome.
  2. Define “done” before you test - correct output, correct style, efficient execution.
  3. Every test run starts clean - leftover state hides bugs.
  4. Run the same test multiple times - 3-5 trials to handle non-determinism.
  5. Include negative tests - verify skills do NOT trigger on unrelated prompts.
  6. Deterministic checks first, LLM judges second - fast and cheap catches most issues (see the sketch after this list).
  7. Always compare with vs. without - the delta is what proves a skill helps.
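
A minimal sketch of how principles 3, 4, and 6 fit together, assuming nothing about any particular framework: run_agent_once and llm_judge below are stubs standing in for your own sandboxed execution and rubric-based judge.

```python
# Sketch of principles 3, 4, and 6: clean runs, repeated trials, deterministic
# checks before any LLM judge. run_agent_once and llm_judge are stubs.
from dataclasses import dataclass


@dataclass
class TrialResult:
    output: str
    cost_usd: float


def run_agent_once(prompt: str) -> TrialResult:
    # Stub: a real harness would spin up a fresh sandbox for every trial.
    return TrialResult(output=f"echo: {prompt}", cost_usd=0.01)


def llm_judge(output: str) -> bool:
    # Stub: only reached when the cheap checks pass.
    return True


def grade_outcome(result: TrialResult, must_contain: str) -> bool:
    # Deterministic check first: fast, free, catches most regressions.
    if must_contain not in result.output:
        return False
    return llm_judge(result.output)


def pass_rate(prompt: str, must_contain: str, trials: int = 5) -> float:
    # 3-5 trials per case smooths over the agent's non-determinism.
    passes = sum(grade_outcome(run_agent_once(prompt), must_contain) for _ in range(trials))
    return passes / trials
```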

The architecture: layered, not monolithic

I landed on a 4-layer harness:

  1. Input - what to test
  2. Execution - run it cleanly
  3. Grading - score it
  4. Reporting - learn from it

Each layer is independent and pluggable.
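
One way to express that in code - a sketch, not the actual harness - is four small interfaces plus a loop that wires them together:

```python
# Sketch of the four layers as pluggable interfaces: any loader, executor,
# grader, or reporter that satisfies its Protocol can be swapped in without
# touching the others.
from typing import Iterable, Protocol


class InputLayer(Protocol):          # what to test
    def load_cases(self) -> Iterable[dict]: ...


class ExecutionLayer(Protocol):      # run it cleanly
    def run(self, case: dict) -> dict: ...


class GradingLayer(Protocol):        # score it
    def grade(self, case: dict, result: dict) -> bool: ...


class ReportingLayer(Protocol):      # learn from it
    def report(self, outcomes: list[tuple[dict, bool]]) -> None: ...


def run_suite(inp: InputLayer, exe: ExecutionLayer,
              grader: GradingLayer, rep: ReportingLayer) -> None:
    outcomes = [(case, grader.grade(case, exe.run(case))) for case in inp.load_cases()]
    rep.report(outcomes)
```

A YAML case loader, a Docker executor, a rubric grader, and a JSON reporter would each just need to satisfy its Protocol.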

The framework-agnostic part matters: one adapter interface lets you test the same skill across Codex, Claude Code, Gemini CLI, LangChain, or any custom agent (though Claude Code is our primary target). Write the tests once, run them everywhere.
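
The adapter itself can stay very thin. The sketch below assumes nothing about any specific tool's CLI; the command list handed to CliAdapter is a placeholder you would fill in per framework, and FakeAdapter lets you test the harness itself without spending tokens.

```python
# Sketch of the adapter idea: the harness depends only on this interface, and
# each framework gets a thin wrapper. The CLI command is a placeholder, not a
# real invocation of any particular tool.
import subprocess
from typing import Protocol


class AgentAdapter(Protocol):
    name: str

    def run_task(self, prompt: str, skill_dir: str, workdir: str) -> str:
        """Run one task in a clean working directory; return the final output."""


class FakeAdapter:
    name = "fake"

    def run_task(self, prompt: str, skill_dir: str, workdir: str) -> str:
        return f"[fake agent] {prompt} (skill: {skill_dir})"


class CliAdapter:
    def __init__(self, name: str, command: list[str]):
        self.name = name
        self.command = command  # the agent's CLI plus whatever flags it needs

    def run_task(self, prompt: str, skill_dir: str, workdir: str) -> str:
        proc = subprocess.run(self.command + [prompt], cwd=workdir,
                              capture_output=True, text=True)
        return proc.stdout
```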

For teams that want more coverage, an opt-in Agent Testing Agent generates edge-case tests automatically. They’re tagged separately and go through a graduation pipeline before joining the curated suite. You control the cost/coverage tradeoff.

The bottom line

“It feels like it works” is not a testing strategy.

With 10-20 test cases, a layered eval harness, and a few hours of investment, you can replace gut feelings with concrete metrics: pass rates, deltas, cost per run, regression detection.
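
The reporting math is the easy part. As a rough illustration, assuming each trial is logged as a (passed, cost) pair, the headline metrics fit in a few lines; the numbers below are made up, not measured results.

```python
# Rough illustration of the reporting metrics, assuming each trial is logged
# as a (passed, cost_usd) pair. Illustrative numbers only.
def summarize(with_skill: list[tuple[bool, float]],
              without_skill: list[tuple[bool, float]]) -> dict:
    def rate(trials): return sum(passed for passed, _ in trials) / len(trials)
    def cost(trials): return sum(spent for _, spent in trials) / len(trials)
    return {
        "pass_rate_with": rate(with_skill),
        "pass_rate_without": rate(without_skill),
        "delta": rate(with_skill) - rate(without_skill),
        "avg_cost_per_run": cost(with_skill),
    }


print(summarize([(True, 0.04), (True, 0.05), (True, 0.04)],
                [(False, 0.03), (True, 0.03), (True, 0.04)]))
```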

The skills that will win are the ones that can prove they work.

P.S. I’ll share a demo once I’m able to, but this philosophy is going to kick ass.