nl2shell

A small, local, natural-language → shell-command translator. 752 M parameters, Q4_K_M, ~530 MB. Runs through Ollama.

Model on Hugging Face · Dataset · Repo · License: MIT

$ echo "find every python file under src bigger than 1MB" | ollama run nl2shell:0.8b
find src -type f -name '*.py' -size +1M

Why

Large models that can do NL→shell already exist (Claude, GPT, Code Llama), but a small model that runs on a laptop and gets the everyday cases right still has two uses:

  1. Latency-critical UX — phone, watch, voice — where round-tripping to a frontier model adds 800 ms and a privacy footprint.
  2. A measurable foundation for a Terminal World Model (next project): if a small NL→bash model is honestly benchmarked, we can layer a world model on top that plans in latent shell-state space.

Install

# 1. Pull the model into Ollama (hosted at AryaYT on Hugging Face)
ollama pull hf.co/AryaYT/nl2shell-0.8b
ollama cp hf.co/AryaYT/nl2shell-0.8b nl2shell:0.8b   # local alias

# 2. (For the benchmark harness) clone & set up Python deps
git clone https://github.com/nl2shell/bench.git
cd bench
pip install -e .

Quickstart

# Single inference
echo "list every .conf file in /etc that was modified in the last 24 hours" \
  | ollama run nl2shell:0.8b

# Library use
python3 -c "
from nl2shell.client import OllamaClient
r = OllamaClient().generate('compress all logs older than 7 days')
print(r.bash)
"

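For callers who would rather not depend on the library, the same single inference can be sketched against Ollama's local HTTP API. This is a minimal sketch, not how OllamaClient is implemented: build_payload and translate are hypothetical helper names, and the URL assumes Ollama is listening on its default port.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "nl2shell:0.8b") -> dict:
    """Request body for one non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def translate(prompt: str, model: str = "nl2shell:0.8b", timeout: float = 30.0) -> str:
    """Send one NL request to a running Ollama server and return the text reply."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With stream=False the server returns a single JSON object whose
        # "response" field holds the generated text.
        return json.loads(response.read())["response"].strip()
```

With the server up, translate("compress all logs older than 7 days") should return the model's raw reply as a string.
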
Benchmark scoreboard

We evaluate nl2shell:0.8b on a stratified held-out carved from v3, balanced so that low-frequency concept tags (stream redirection, process control) get fair representation rather than being drowned out by the well-covered ones (pipe, quote.*, chain.seq).
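The balancing idea can be illustrated with a per-tag cap. This is a sketch of one way to stratify, not the logic in scripts/build_holdout.py: the "tag" key, the per_tag_cap parameter, and the assumption of one primary tag per case are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_holdout(cases, per_tag_cap, seed=0):
    """Pick at most `per_tag_cap` cases per tag so frequent tags cannot dominate.

    `cases` is a list of dicts carrying a hypothetical "tag" key.
    """
    rng = random.Random(seed)  # seeded so the holdout is reproducible
    by_tag = defaultdict(list)
    for case in cases:
        by_tag[case["tag"]].append(case)

    holdout = []
    for tag in sorted(by_tag):  # fixed iteration order keeps runs comparable
        group = by_tag[tag]
        holdout.extend(rng.sample(group, min(per_tag_cap, len(group))))
    return holdout
```

A rare tag with three cases contributes all three, while a tag with hundreds of cases is cut down to the cap.
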

Run: results/v3-holdout__20260524T101822Z/report.md (2026-05-24, nl2shell:0.8b, n=426, 99s wall).

| metric | value |
| --- | --- |
| exact_match | 3.3% |
| template_match | 4.0% |
| parses (bash -n) | 96.7% |
| concept-tag Jaccard (mean) | 0.310 |
| latency (median) | 0.21s |
| latency (p95) | 0.41s |

Read this as: the model produces syntactically valid shell almost every time (96.7% pass bash -n), but the exact string rarely matches the noisy gold. The honest signal is per-difficulty:

| difficulty | n | exact |
| --- | --- | --- |
| 1 (no operators) | 32 | 21.9% |
| 2 (one operator) | 90 | 5.6% |
| 3 (2–3 operators) | 182 | 1.1% |
| 4 (4+ operators) | 122 | 0.0% |

Simple cases land; multi-operator cases collapse. That's the gap v4 is being built to close.
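The difficulty buckets and the headline exact_match figure are consistent with each other, which a few lines of arithmetic can confirm. In this sketch the hit counts per bucket (7, 5, 2, 0) are inferred from the reported percentages rather than read from the run, and what counts as an "operator" is left to the caller.

```python
def difficulty(n_operators: int) -> int:
    """Map an operator count to the four difficulty levels in the table."""
    if n_operators <= 0:
        return 1
    if n_operators == 1:
        return 2
    if n_operators <= 3:
        return 3
    return 4


# difficulty -> (cases, exact hits); hits are inferred from the percentages.
BUCKETS = {1: (32, 7), 2: (90, 5), 3: (182, 2), 4: (122, 0)}


def overall_exact(buckets) -> float:
    """Pool every bucket into a single exact-match rate."""
    total = sum(n for n, _ in buckets.values())
    hits = sum(h for _, h in buckets.values())
    return hits / total


print(f"{overall_exact(BUCKETS):.1%}")  # 14 / 426, prints 3.3%
```
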

Why exact-match is harsh — three concrete failure modes

Exact-match misses fall into three groups. The first two come from noise in the v3 gold; the third is genuine model error:

  1. The model is sometimes more idiomatic than the gold. find . -name "*bar" (model) vs find -name *bar (gold). The model adds the . default and quotes the glob; gold doesn't. Same intent, different string.
  2. Multi-line shell scripts as "answers". ~3% of the holdout has gold containing newlines or shebangs — full 20-line scripts as a single "bash" answer. A single-shot translator can't produce those; they should arguably be filtered out (see issue v0.2).
  3. Real failures on complex pipes. In some cases the model is simply wrong: it emits repeated sort -R | uniq -c loops or picks the wrong base command. These are the legitimate failures and exactly what v4 targets.
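If the script-like gold answers from the second failure mode were filtered, a check along these lines would be enough to flag them. This is a sketch only: is_script_like and drop_script_like are hypothetical names, the "bash" key is an assumed field name, and the harness does not currently apply such a filter.

```python
def is_script_like(gold: str) -> bool:
    """True when a gold answer looks like a script rather than a one-liner."""
    stripped = gold.strip()
    # A shebang or any interior newline marks the answer as multi-line.
    # Note this also flags backslash-continued one-liners.
    return stripped.startswith("#!") or "\n" in stripped


def drop_script_like(cases, key="bash"):
    """Keep only cases whose gold answer is a single line."""
    return [case for case in cases if not is_script_like(case[key])]
```
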

We deliberately don't post-process the model's output to chase a higher number — the metric should be honest.

TerminalBench-2.0 (via Harbor)

Wired up in v0.2.0. TB2 tasks are agentic: they require a sandboxed environment, multi-turn observation, and a test script per task. The v0.2 baseline therefore does not use nl2shell:0.8b as the agent (the Nl2shellAgent Harbor adapter arrived in v0.3). Instead, v0.2 uses claude-haiku-4-5 as the agent to establish the baseline number we'll later compare nl2shell against.

First Daytona run on the 10-task terminal-bench-sample@2.0: results/terminal-bench__20260524T110127Z/report.md

| metric | value |
| --- | --- |
| pass rate | 30.0% (3/10) |
| mean reward | 0.300 |
| wall clock | 21m 02s |
| total cost | $1.63 |
| total input tokens | 11,148,439 (~76% cached) |
| total output tokens | 128,380 |
| timeouts | 2 / 10 (qemu-alpine-ssh, qemu-startup) |

Passed: fix-code-vulnerability, log-summary-date-ranges, sqlite-with-gcov. Failed (verifier): build-cython-ext, chess-best-move, configure-git-webserver, polyglot-c-py, regex-log. The two qemu tasks hit the 15-min agent timeout; this kind of agentic work with long-lived VM state is the furthest from a single-shot translator's wheelhouse.

# Cloud sandboxes (recommended; needs DAYTONA_API_KEY + ANTHROPIC_API_KEY)
nl2shell-eval-tb --dataset terminal-bench@2.0 \
    --agent claude-code --model anthropic/claude-haiku-4-5 \
    --env daytona -n 16

# Local Docker
nl2shell-eval-tb --dataset terminal-bench-sample@2.0 \
    --agent claude-code --model anthropic/claude-haiku-4-5 \
    --env docker -n 4

Nl2shellAgent — nl2shell:0.8b as the agent (v0.3)

# Drive Terminal-Bench tasks with our 752 M local model (needs Ollama up
# with nl2shell:0.8b; DAYTONA_API_KEY only if you flip the YAML to daytona).
nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml

Nl2shellAgent (src/nl2shell/agent.py) is a Harbor BaseAgent that:

  1. Renders instruction + observation history into a flat NL prompt.
  2. Asks Ollama for one bash command.
  3. Executes it via environment.exec.
  4. Appends (command, return_code, stdout, stderr) to the observation history.
  5. Loops until max_steps, a stop-marker, or an empty/unsafe command.
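The five steps above can be sketched as a plain loop. This is an illustration under stated assumptions, not the code in src/nl2shell/agent.py: generate and execute are injected stand-ins for the Ollama call and environment.exec, and STOP_MARKER, UNSAFE_PREFIXES, and the prompt layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    command: str
    return_code: int
    stdout: str
    stderr: str


STOP_MARKER = "DONE"                                 # hypothetical stop token
UNSAFE_PREFIXES = ("rm -rf /", "mkfs", "shutdown")   # hypothetical denylist


def render_prompt(instruction: str, history: List[Step]) -> str:
    """Step 1: flatten the instruction and past observations into one prompt."""
    lines = [f"Task: {instruction}"]
    for step in history:
        lines.append(f"$ {step.command}")
        lines.append(f"exit={step.return_code}")
        if step.stdout:
            lines.append(step.stdout.rstrip())
        if step.stderr:
            lines.append(step.stderr.rstrip())
    return "\n".join(lines)


def run_agent(
    instruction: str,
    generate: Callable[[str], str],
    execute: Callable[[str], Tuple[int, str, str]],
    max_steps: int = 10,
) -> List[Step]:
    history: List[Step] = []
    for _ in range(max_steps):                        # step 5: bounded loop
        command = generate(render_prompt(instruction, history)).strip()  # step 2
        if not command or command == STOP_MARKER:
            break                                     # empty output or stop marker
        if command.startswith(UNSAFE_PREFIXES):
            break                                     # refuse to run unsafe commands
        return_code, stdout, stderr = execute(command)                   # step 3
        history.append(Step(command, return_code, stdout, stderr))       # step 4
    return history
```

Injecting generate and execute keeps the loop testable without Ollama or a sandbox; a real adapter would pass the model call and the environment's exec in their place.
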

It supports three observation modes (full | command | none) for ablation runs at configs/harbor_nl2shell_obs_ablation.yaml. This is the honest baseline for nl2shell on agentic tasks — the gap between this number and the v0.2 Claude number is what a Terminal World Model would need to close.
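One plausible reading of the three modes is sketched below. The semantics are an assumption, not taken from the adapter: here full shows each command with its exit code and output, command shows only the commands, and none hides the history entirely. render_history is a hypothetical name.

```python
OBSERVATION_MODES = ("full", "command", "none")


def render_history(history, mode="full"):
    """Render (command, return_code, stdout, stderr) tuples for the prompt.

    Assumed semantics: "full" keeps everything, "command" keeps only the
    commands that were run, "none" drops the history.
    """
    if mode not in OBSERVATION_MODES:
        raise ValueError(f"unknown observation mode: {mode!r}")
    if mode == "none":
        return ""
    lines = []
    for command, return_code, stdout, stderr in history:
        lines.append(f"$ {command}")
        if mode == "full":
            lines.append(f"exit={return_code}")
            if stdout:
                lines.append(stdout.rstrip())
            if stderr:
                lines.append(stderr.rstrip())
    return "\n".join(lines)
```

Comparing pass rates across the three renderings shows how much the model actually uses what it observes.
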

See docs/benchmarks.md for adapter internals and docker/README.md for the reproducible eval-in-a-box compose file.

ProgramBench

Out of scope. ProgramBench is whole-program reconstruction from a compiled binary; it's not even close to what a single-shot NL→bash model attempts. We document the mismatch in docs/benchmarks.md rather than publish a 0%.

Reproducing the benchmark

# 1. Pull the labelled dataset (or rebuild from v3 yourself)
python3 -m nl2shell.audit AryaYT/nl2shell-training-v3 \
    --out data/v3-labeled.jsonl \
    --report data/gap-report.md

# 2. Build the stratified held-out
python3 scripts/build_holdout.py \
    --labelled data/v3-labeled.jsonl \
    --out data/holdout/holdout.jsonl

# 3. Run the eval (assumes Ollama is up with nl2shell:0.8b)
PYTHONPATH=src python3 -m evals.run --benchmark v3-holdout

The run writes results/<benchmark>__<timestamp>/{summary.json, per_case.jsonl, report.md}.

Project layout

src/nl2shell/      # library: concepts, label, audit, client, metrics, agent
evals/             # benchmark runner + adapters (v3-holdout, terminal-bench TBD)
scripts/           # one-off scripts (build_holdout, etc.)
data/              # held-out + labelled dataset (small files committed)
results/           # one directory per eval run; summary.json + report.md committed
docs/              # design notes (benchmarks, taxonomy, training)
tests/             # pytest suite for concepts + metrics

What's measured

| metric | what it answers |
| --- | --- |
| exact_match | did the model produce the gold string, modulo whitespace? |
| template_match | did it produce the same command shape, ignoring literal arguments? |
| parses | does bash -n accept the output? (i.e. is it syntactically valid shell) |
| shellcheck_clean | is the output free of shellcheck SC2xxx findings? (only when shellcheck is installed) |
| concept_overlap | Jaccard of v3 concept tags between pred and gold |
| per_tag | exact-match rate stratified by the tag the gold carries |
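To make the definitions concrete, here is a minimal sketch of four of these metrics. It is not the code in src/nl2shell/metrics: the function names, the ARG placeholder, and the rule that every non-flag token after a command is a "literal argument" are assumptions, and the shape function only recognizes operators that are separated by whitespace.

```python
import shlex
import subprocess

CONTROL_OPERATORS = {"|", "||", "&&", ";"}


def exact_match(pred: str, gold: str) -> bool:
    """Equal after collapsing runs of whitespace."""
    return " ".join(pred.split()) == " ".join(gold.split())


def command_shape(cmd: str) -> list:
    """Keep command names, flags, and operators; mask everything else as ARG."""
    shape, at_command = [], True
    for token in shlex.split(cmd):
        if token in CONTROL_OPERATORS:
            shape.append(token)
            at_command = True            # the next token starts a new command
        elif at_command or token.startswith("-"):
            shape.append(token)
            at_command = False
        else:
            shape.append("ARG")
    return shape


def template_match(pred: str, gold: str) -> bool:
    try:
        return command_shape(pred) == command_shape(gold)
    except ValueError:                   # shlex rejects unbalanced quotes
        return False


def parses(cmd: str) -> bool:
    """Syntax-check only: bash -n reads the command without running it."""
    result = subprocess.run(["bash", "-n"], input=cmd, text=True, capture_output=True)
    return result.returncode == 0


def jaccard(pred_tags: set, gold_tags: set) -> float:
    """Tag-set overlap; two empty sets count as a perfect match here."""
    if not pred_tags and not gold_tags:
        return 1.0
    return len(pred_tags & gold_tags) / len(pred_tags | gold_tags)
```

Under these assumptions, find src -name '*.py' and find lib -name '*.js' fail exact_match but pass template_match, which is the distinction the two metrics are meant to capture.
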

Roadmap

  • v0.1 — single-shot benchmark, stratified held-out, multi-metric scoreboard.
  • v0.2 — TerminalBench-2.0 via Harbor adapter, Claude-haiku-4-5 baseline, reproducible Dockerfile + compose, eval-in-a-box.
  • v0.3 (this release): Nl2shellAgent Harbor adapter so nl2shell:0.8b runs as the agent itself, plus an observation-mode ablation harness.
  • v0.4 — v4 dataset targeted at 18 starved concept tags + run Nl2shellAgent on the full 89-task TerminalBench-2.0 for the published number.
  • v1.0 — Terminal World Model layer that conditions nl2shell on observed terminal state.

See docs/benchmarks.md and data/gap-report.md for the data-side roadmap.

Citation

If you find this useful:

@misc{aryayt2026nl2shell,
  author       = {Arya},
  title        = {nl2shell: a small natural-language-to-shell-command model},
  year         = {2026},
  howpublished = {\url{https://github.com/nl2shell/bench}},
}

License

MIT — see LICENSE.
