Contribute A Run¶

If you can run a model that is missing from the dataset, submit one benchmark run and get credited in the data.

Run E9 Specificity¶

git clone https://github.com/signaldepth/prompt-sensitivity-bench.git
cd prompt-sensitivity-bench/harness
uv venv
uv pip install -e ".[dev]"
uv run python validate.py e9 --model-name <model-name> --host http://localhost:11434 --k 3

The result is written to .output/experiments/.

Validate¶

uv run python ../scripts/validate_contribution.py .output/experiments/<file>.json

The public repo validates contribution shape and public artifact shape separately. Maintainers merge selected results into derived public findings; raw run dumps do not get committed automatically.

Other Public Experiments¶

The harness now also exposes:

uv run python validate.py e7 --model-name <model-name> --host http://localhost:11434 --k 3
uv run python validate.py e8 --model-name <model-name> --host http://localhost:11434 --k 3

Those experiments are public and runnable, but the lightweight public contribution path is still E9-first until the intake policy and fixtures for the other families are tightened.

Submit¶

Open a GitHub issue with the JSON file pasted into the template, or attach/link a gist if it is large. Maintainers review submissions and fold selected results into derived findings; raw run dumps are not automatically committed to the public repo.

For a minimal valid shape, see data/examples/e9_specificity_fixture.json.

Requirements¶

k >= 3
temperature specified
model name and hardware specified
no private data in notes or outputs