Skip to content

Density Beats Length

Across four archived E15 runs, making sparse prompts longer did nothing. Making the task contract explicit raised average pass rate from 0.25 to 1.00.

Context budget by model

Key Numbers

Condition Average pass rate Read
short sparse 0.25 Short and underspecified
long sparse 0.25 More words, same missing contract
short dense 1.00 Short but explicit
long dense 1.00 Explicit and verbose

The length marginal stayed flat at 0.625 -> 0.625. The density marginal jumped from 0.25 -> 1.00.

Model Split

Model short sparse long sparse short dense long dense
qwen3.5:4b 0.50 0.50 1.00 1.00
qwen3.5:9b 0.50 0.50 1.00 1.00
gemma4:e4b 0.00 0.00 1.00 1.00
gemma4:26b 0.00 0.00 1.00 1.00

On this task family, both tested Qwen models were partly recoverable under sparse prompts, while both tested Gemma models collapsed until the contract became explicit.

Failure Texture

Most sparse-prompt failures were contract-shape errors, not generic nonsense:

  • wrong_output_type: 39
  • runtime_error: 21
  • extraction_failure: 12

The common pattern was straightforward: models wrote a plausible function for the task name, but returned the wrong output shape, printed instead of returning, or chose a familiar but wrong contract.

Public Runner

cd harness
uv run python validate.py e15 --model-name qwen3.5:4b --k 3

The public e15 command is source-complete for this finding class. Fresh runs should be read as replications on the same prompt family, not as a universal statement about context-window behavior.

Data

  • Aggregated finding: data/public/findings.json
  • Task definitions: harness/data.py
  • Harness: harness/validate.py

The raw E15 run archives stay local or private for now. The public repo publishes the derived summary, plotting code, and runnable harness.

Sample Counts

  • 4 archived E15 run files
  • 4 models
  • 4 tasks
  • 16 task-model rows per condition
  • 192 total graded calls
  • k=3 per condition

Limitations

This result comes from deterministic Python code tasks on four local-model runs. It separates prompt length from prompt explicitness inside this task family; it does not show that context window never matters for retrieval, long-document synthesis, or multi-turn chat.