# Prompt Sensitivity Bench
An open benchmark measuring how prompt wording changes model outputs across the model size spectrum.
The largest measured effect so far is specificity: on 1-4B local models, moving from a vague prompt to a task description plus an input/output spec raised the pass rate from 8% to 82%.
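As a rough illustration of what "vague" versus "task plus input/output spec" means here, consider a hypothetical prompt pair (both strings are invented for this sketch, not drawn from the benchmark's task set):

```python
# Hypothetical prompt pair illustrating the specificity contrast.
VAGUE = "Clean up this data."

SPECIFIC = """Task: Deduplicate a list of email addresses.
Input: one address per line on stdin; may contain duplicates and blank lines.
Output: unique addresses, lower-cased, one per line, sorted, nothing else."""

def has_io_spec(prompt: str) -> bool:
    """Crude structural check, used only to contrast the two prompts."""
    return all(field in prompt for field in ("Task:", "Input:", "Output:"))

assert not has_io_spec(VAGUE)
assert has_io_spec(SPECIFIC)
```

The specific variant pins down the task, the input shape, and the exact output contract, which is the kind of change the specificity finding measures.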
- :octicons-graph-16: Findings
- :octicons-beaker-16: Methodology
- :octicons-database-16: Data
- :octicons-git-pull-request-16: Contribute
## What This Is
This is a versioned benchmark with findings derived from prompt-sensitivity measurements. Each finding links to the public summary, the task definitions, and the command needed to reproduce or extend the measurement on your own run archive.
## What This Is Not
This is not a capability leaderboard. LMArena, Artificial Analysis, HELM, LiveBench, and coding leaderboards already measure model strength. This project measures how much model behavior changes when the prompt changes.
This is not a product page, an evaluation framework, or a benchmark submission tracker.
## Current Findings
| Finding | Headline |
|---|---|
| Specificity | Vague prompts fail; input/output specs create the largest measured jump. |
| Complexity | More prompt detail can hurt small models. |
| Filler words | Text humans treat as filler can be structure for small models. |
| Format preference | XML, Markdown, and plain text were indistinguishable in delimiter-only coding tests. |
| k=1 trap | Single-shot measurements can reverse conclusions on boundary models. |
| De Morgan inversion | One phrasing caused a deterministic logic inversion on llama3.1:8b. |
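The k=1 trap can be sketched with a quick simulation: with a single sample per prompt, a worse prompt variant often measures at least as well as a better one, so a single-shot comparison can rank the two in the wrong order. The pass rates below are hypothetical, chosen only to illustrate the statistics, not taken from the benchmark:

```python
import random

def misleading_rate(p_a=0.3, p_b=0.7, k=1, trials=10_000, seed=0):
    """Fraction of simulated runs in which k samples per prompt make
    variant A look at least as good as variant B, even though B has the
    higher true pass rate. p_a and p_b are hypothetical, not benchmark data.
    """
    rng = random.Random(seed)
    misleading = 0
    for _ in range(trials):
        hits_a = sum(rng.random() < p_a for _ in range(k))
        hits_b = sum(rng.random() < p_b for _ in range(k))
        if hits_a >= hits_b:  # measurement fails to show B's advantage
            misleading += 1
    return misleading / trials

print(misleading_rate(k=1))   # roughly half the time at k=1
print(misleading_rate(k=20))  # rare once k grows
```

Even with a large true gap (0.3 vs 0.7), single-shot sampling hides the difference about half the time, because ties dominate at k=1; more samples per prompt shrink the error quickly.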