---
slug: skill-fitness
type: pattern
summary: "The discipline of deciding whether a reusable skill actually earns its place in an agent's context: scope it, version it, measure its lift, and delete it when it goes stale."
created: 2026-06-09
updated: 2026-06-09
related:
  skill:
    relation: refines
    note: "Skill defines what a skill is; Skill Fitness asks whether a given skill is worth loading at all."
  eval:
    relation: uses
    note: "Measuring a skill's marginal lift requires a real evaluation harness, not a vibe check."
  test-oracle:
    relation: uses
    note: "A skill's lift is only as trustworthy as the oracle that scores the task; a weak oracle hides a dead skill."
  context-engineering:
    relation: uses
    note: "A low-fitness skill is paid for in context-window tokens whether or not it helps."
  progressive-disclosure:
    relation: uses
    note: "Loading a skill's body only when its task fires is the mechanism that keeps a dead skill from taxing every prompt."
  garbage-collection:
    relation: enabled-by
    note: "Deleting a stale skill is the cleanup half of fitness; without it the library only grows."
  tool-sprawl:
    relation: contrasts-with
    note: "Dead Skill is the skill-layer sibling of Tool Sprawl: accreted units that cost context and return little."
  agent-sprawl:
    relation: contrasts-with
    note: "Both name unchecked accretion — Agent Sprawl at the agent layer, Dead Skill at the skill layer."
  pinning:
    relation: uses
    note: "Version-mismatched skill guidance is a pinning failure: the skill describes a world the project has moved past."
  best-current-practice:
    relation: refines
    note: "A skill is a packaged best current practice whose currency must be rechecked, not assumed."
---
# Skill Fitness

> **Pattern**
>
> A named solution to a recurring problem.

*Treat every skill in the library as a dependency that must earn its place: scope it, version it, measure its marginal lift, and delete it when it stops paying.*

A controlled study in 2026 paired 49 public software-engineering skills with pinned real repositories and execution-based acceptance tests, then measured what each skill actually did to the pass rate. Thirty-nine of the 49 changed nothing. The average lift across all of them was about one percentage point. Three skills made the agent *worse*, by as much as ten points, because their guidance had drifted out of step with the projects they were applied to. The skills were not broken in any obvious way. They read like competent advice. They simply did not help, and a few quietly hurt.

That result is the whole reason this pattern exists. The field tells practitioners to build skill libraries; almost nobody tells them to check whether a given skill is worth loading. Skill Fitness is that check.

## Understand This First

- [Skill](skill.md) — what a skill is and how the harness loads it; this article assumes you already have some.
- [Eval](eval.md) — you cannot measure a skill's lift without a way to score the task.

## Context

You are working at the **agentic** level, where a [skill](skill.md) packages instructions, conventions, and examples into a unit the agent loads on demand. Once a team discovers skills, the library grows fast. Someone writes a "review a pull request" skill, someone else a "generate a migration" skill, a third person a "follow our API conventions" skill. Each one felt useful the day it was written. Six months later the library has forty entries and nobody knows which ones still matter.

This is the moment Skill Fitness applies: not when you write your first skill, but when you have enough of them that the library itself needs governance. A skill is a dependency. Like any dependency, it can be outdated, redundant, or actively harmful, and the only way to know is to measure.

## Problem

How do you know whether a skill is helping, hurting, or doing nothing?

The intuition that more skills means more capability is wrong often enough to be dangerous. A skill costs [context](context-engineering.md) tokens every time it loads, and tokens spent on a skill that doesn't move the outcome are tokens stolen from the task. Worse, a skill can carry guidance that was right for last quarter's codebase and wrong for this one: a pinned library version that's since been upgraded, a convention the team has since dropped, a workaround for a bug that's since been fixed. The agent follows the stale instruction faithfully and produces worse output than it would have with no skill at all. None of this is visible from reading the skill. It reads fine. You find out only by testing it against real work.

## Forces

- **Apparent usefulness lies.** A skill that reads like sound advice can still produce zero lift or active harm on real tasks.
- **Tokens are not free.** Every loaded skill spends context budget whether or not it changes the outcome; that budget is finite and contested.
- **Skills rot silently.** Project context moves (versions, conventions, APIs change) and a skill that was correct slowly becomes a confident liar.
- **Measurement has a cost.** Building an evaluation that isolates one skill's contribution takes real effort, which is exactly why most teams skip it and trust the skill instead.
- **Deletion feels like loss.** A skill someone spent an afternoon writing is hard to throw away, even when it earns nothing, so dead skills accumulate.

## Solution

**Treat each skill as a measured dependency, not as accreted wisdom.** Five disciplines, applied to the library as a whole and to each entry in it.

**Scope it.** A skill should declare precisely when it applies and refuse to load otherwise. A "write a migration" skill that loads on every prompt is taxing tasks it can't help. Tight scoping is what [progressive disclosure](progressive-disclosure.md) buys you: the body loads only when its task fires. It's also the cheapest fitness lever, because an unscoped skill pays its token cost on work it has nothing to say about.

**Version it.** Pin the skill to the world it describes. If a skill encodes guidance about a library, a framework version, or a project convention, record what version that guidance is true for. When the project moves past it, the version mismatch is the signal that the skill needs a rewrite or a delete. A skill is a packaged [best current practice](best-current-practice.md), and currency is the one property a best practice cannot assume; it has to be rechecked. Stale version-pinned guidance is a [pinning](pinning.md) failure pointed at the agent.

**Measure its lift.** This is the discipline the field skips, and it's the one that matters most. Run the task the skill is meant to help with, twice: once with the skill loaded, once without. Score both with a real [evaluation](eval.md): an execution-based test, not a glance at the output. The difference is the skill's marginal lift. A skill that doesn't move the score is doing nothing for you; a skill that lowers it is costing you. The measurement is only as honest as the [test oracle](test-oracle.md) behind it: score skill lift with a weak oracle and you'll certify dead skills as live ones, which is just the benchmark-mirage failure relocated to the skill layer.

**Delete what doesn't pay.** A skill that earns no lift is not neutral. It's a context tax with no return. Pruning the library is the cleanup half of fitness, the [garbage collection](garbage-collection.md) the skill layer needs as much as memory does. Deleting a skill that measures flat is not a loss; keeping it is.

**Re-measure on a cadence.** Fitness is not a one-time gate. A skill that lifted last quarter can rot into a flat or negative one as the codebase moves underneath it. Re-run the lift measurement when the project changes in ways the skill depends on, and on a regular interval regardless.

> **💡 Tip**
>
> Before you add a skill to the shared library, run the task it targets with and without it and score both with your existing tests. If the skill doesn't move the score, you've just saved everyone the token cost of loading it. If it lowers the score, you've caught a regression before it shipped.

## Dead Skill

> **Antipattern**
>
> A recurring trap that causes harm — learn to recognize and escape it.

The companion failure has a name: a **Dead Skill** is a skill that mostly injects static prose and token overhead while pretending to be expertise. The library looks rich (forty entries, broad coverage), but most of those entries are loading cost with no measured return, and a few are steering the agent wrong.

**How to recognize it.** Nobody can say what a given skill's lift is, because nobody measured it. Skills carry no version markers, so there's no way to tell which ones have rotted. The library only ever grows; entries are added, never retired. When something goes wrong, no one suspects a skill, because skills are assumed to be free help. The 2026 study's numbers are the empirical signature: thirty-nine of forty-nine skills at zero lift is what a dead library looks like from the inside, and the team running them had no way to tell the live ones from the dead.

**Why it happens.** Writing a skill feels productive and deleting one feels wasteful, so the ratchet only turns one way. The token cost of a loaded skill is invisible at the moment it's spent. And the assumption that "more skills equal more capability" goes unexamined precisely because checking it requires the measurement discipline most teams never set up.

**The way out** is Skill Fitness itself: measure each skill's marginal lift against a real oracle, version what's pinnable, scope tightly, and delete what doesn't pay. A Dead Skill is the skill-layer sibling of [Tool Sprawl](tool-sprawl.md) and [Agent Sprawl](agent-sprawl.md), the same unchecked accretion that costs context and returns little, and it yields to the same cure: measurement and pruning, applied on a cadence.

## How It Plays Out

A platform team ships a "follow our REST conventions" skill that every service prompt loads. It was written when the team standardized on one error-envelope format. A year later the team has moved to a different format, but the skill still describes the old one. Every agent that touches an endpoint now gets confident, detailed, wrong guidance. The skill reads as authoritative, so nobody questions it, until someone finally runs the endpoint tasks with and without the skill loaded and watches the pass rate go *up* when it's removed. The skill had been costing them for months.

A developer maintaining a personal skill library decides to audit it. She takes each skill, finds a representative task it's meant to help with, and runs that task twice (once with the skill, once without), scoring both with the project's existing test suite. Of her twenty-three skills, fourteen move nothing. Two lower the score. She deletes the sixteen, version-pins the survivors to the libraries they describe, and sets a reminder to re-run the audit after the next major dependency bump. Her library is now a third the size and every entry in it has a number behind it.

A team building an internal agent platform makes Skill Fitness a gate. No skill enters the shared library without a recorded lift measurement against the acceptance suite, and every skill carries the version of the project context it was validated against. When a dependency upgrade lands, CI flags every skill pinned to the old version for re-measurement. The library stays small because the gate keeps it small: a skill has to earn its slot, and keep earning it.

> **⚠️ Warning**
>
> A skill that has never been measured is not a known quantity just because it reads well. The 2026 paired study found that most tested skills did nothing and a few did harm, and none of that was visible from reading them. "It looks like good advice" is not evidence that a skill helps; only a with-and-without measurement is.

## Consequences

**Benefits.** You stop paying context tax on skills that earn nothing, which frees budget for the task. You catch stale, version-mismatched guidance before it degrades output instead of after. The library stays small enough to reason about, because every entry has to justify its slot. And you replace a comforting assumption (more skills, more capability) with a number you can defend.

**Liabilities.** Measurement costs real effort: you need an [evaluation](eval.md) harness and a trustworthy [oracle](test-oracle.md) before you can score lift at all, and standing those up is work many teams haven't done. The discipline can ossify into bureaucracy if the gate is heavier than the skills it guards: a five-line convention skill should not need a research project to justify it. And lift measured on one set of tasks doesn't always generalize. A skill that earns nothing on your benchmark may still help on the long tail your benchmark misses, so a flat measurement is a strong signal, not a proof.

## Sources

- The empirical core of this article is a 2026 controlled evaluation that paired 49 public software-engineering skills with pinned real repositories and execution-based acceptance tests across roughly 565 task instances, isolating each skill's marginal effect on pass rate. It found that 39 of 49 skills produced no improvement, the average lift was about one percentage point, and three skills degraded performance by up to ten points through version-mismatched guidance. The result is the direct evidence that a skill's apparent usefulness does not predict its measured lift.
- The framing of a reusable capability as a *dependency* that must be scoped, versioned, and garbage-collected, rather than as accreted wisdom, follows the long tradition of dependency hygiene in software engineering, where every added dependency is a liability to be justified, not a free gain.
- The principle that loading instructions only when their task fires keeps cost bounded is progressive disclosure, the same architecture that makes [skills](skill.md) scalable in the first place; here it doubles as the cheapest fitness lever.

---

- [Next: Hook](hook.md)
- [Previous: Skill](skill.md)