Skip to content

Methodology

The benchmark uses deterministic programming tasks with executable tests. A model receives a prompt, returns Python code, and the harness runs the generated function against fixed test cases.

Public Evidence Model

Signal Depth publishes the benchmark method, schemas, plotting code, and derived findings by default. The current public data/public/*.json artifacts are generated from local/private raw run archives that stay outside the tracked repo unless they are explicitly curated for release.

This means the public site is reproducible at the method and aggregation layer, while the per-trial raw dumps remain local/private for now. Finding pages should say when a result comes from a private archived run rather than a checked-in raw file, and when a public command is a comparable rerun of the same experiment class rather than a byte-for-byte replay of the archived runner.

Current Task Set

  • fizzbuzz
  • reverse_words
  • flatten
  • two_sum
  • run_length_encode
  • chunk_list

Sampling

The default is k=3 per condition. Single-shot measurements are not accepted for contributed runs because boundary models can flip conclusions between k=1 and repeated trials.

Temperature

The default temperature is 0.0 to isolate prompt effects from sampling variance.

Evaluation

The harness extracts a Python function from the model response, executes it in a subprocess, and compares the printed representation of the result with the expected value.

Current Public Findings

The published summaries currently combine:

  • E9 specificity aggregates generated from 12 local/private run archives
  • four archived E15 context-budget runs
  • one archived E8 complexity run
  • one archived E7 format run
  • one archived E4 compression run

The public repo exposes the derived summaries and public rerun tooling for those experiment families, not the full raw archive itself. In the current public repo, E9 and E15 are source-complete, E7/E8 are published as comparable public experiment classes, and E4 remains summary-first until its compression layer is published independently.

Scope

In scope: prompt specificity, delimiter format, prompt complexity, prompt density versus length, compression sensitivity, and deterministic coding tasks.

Out of scope: human preference, safety evaluation, multi-turn chat dynamics, image/audio tasks, and general capability ranking.

Limitations

  • Most currently published findings are derived from private/local archived runs rather than raw per-trial dumps committed to the repo.
  • The current site reports point estimates and source counts, not confidence intervals.
  • Experiment family coverage is still uneven: E9 and E15 are source-complete in the public repo, E7/E8 are runnable as comparable public sweeps, and E4 is still summary-first while the older archived families catch up.