nl2shell

A small, local, natural-language → shell-command translator. 752 M parameters, Q4_K_M, ~530 MB. Runs through Ollama.

$ echo "find every python file under src bigger than 1MB" | ollama run nl2shell:0.8b
find src -type f -name '*.py' -size +1M

Why

Big NL→shell models exist (Claude, GPT, Code Llama), but a small model that runs on a laptop and gets the everyday cases right has two uses:

Latency-critical UX — phone, watch, voice — where round-tripping to a frontier model adds 800 ms and a privacy footprint.
A measurable foundation for a Terminal World Model (next project): if a small NL→bash model is honestly benchmarked, we can layer a world model on top that plans in latent shell-state space.

Install

# 1. Pull the model into Ollama (hosted at AryaYT on Hugging Face)
ollama pull hf.co/AryaYT/nl2shell-0.8b
ollama cp hf.co/AryaYT/nl2shell-0.8b nl2shell:0.8b   # local alias

# 2. (For the benchmark harness) clone & set up Python deps
git clone https://github.com/nl2shell/bench.git
cd bench
pip install -e .

Quickstart

# Single inference
echo "list every .conf file in /etc that was modified in the last 24 hours" \
  | ollama run nl2shell:0.8b

# Library use
python3 -c "
from nl2shell.client import OllamaClient
r = OllamaClient().generate('compress all logs older than 7 days')
print(r.bash)
"

Benchmark scoreboard

We evaluate nl2shell:0.8b on a stratified held-out carved from v3, balanced so that low-frequency concept tags (stream redirection, process control) get fair representation rather than being drowned out by the well-covered ones (pipe, quote.*, chain.seq).

Run: results/v3-holdout__20260524T101822Z/report.md — 2026-05-24, nl2shell:0.8b, n=426, 99s wall.

metric	value
exact_match	3.3%
template_match	4.0%
parses (`bash -n`)	96.7%
concept-tag Jaccard (mean)	0.310
latency (median)	0.21s
latency (p95)	0.41s

Read this as: the model produces valid shell almost every time (96.7%), but the exact string rarely matches the noisy gold. The honest signal is per-difficulty:

difficulty	n	exact
1 (no operators)	32	21.9%
2 (one operator)	90	5.6%
3 (2–3 operators)	182	1.1%
4 (4+ operators)	122	0.0%

Simple cases land; multi-operator cases collapse. That's the gap v4 is being built to close.

Why exact-match is harsh — three concrete failure modes

The v3 gold is noisy in three ways:

The model is sometimes more idiomatic than the gold. find . -name "*bar" (model) vs find -name *bar (gold). The model adds the . default and quotes the glob; gold doesn't. Same intent, different string.
Multi-line shell scripts as "answers". ~3% of the holdout has gold containing newlines or shebangs — full 20-line scripts as a single "bash" answer. A single-shot translator can't produce those; they should arguably be filtered out (see issue v0.2).
Real failures on complex pipes. Some cases the model just gets wrong — repeated sort -R | uniq -c loops, picks wrong base command. These are the legitimate failures and exactly what v4 targets.

We deliberately don't post-process the model's output to chase a higher number — the metric should be honest.

TerminalBench-2.0 (via Harbor)

Wired up in v0.2.0. TB2 tasks are agentic — they require a sandboxed environment, multi-turn observation, and a test script per task — so nl2shell:0.8b is not the agent yet (a Nl2shellAgent Harbor adapter is on the v0.3 roadmap). For v0.2, we use claude-haiku-4-5 as the agent to establish the baseline number we'll later compare nl2shell against.

First Daytona run on the 10-task terminal-bench-sample@2.0: results/terminal-bench__20260524T110127Z/report.md

metric	value
pass rate	30.0% (3/10)
mean reward	0.300
wall clock	21m 02s
total cost	$1.63
total input tokens	11,148,439 (~76% cached)
total output tokens	128,380
timeouts	2 / 10 (`qemu-alpine-ssh`, `qemu-startup`)

Passed: fix-code-vulnerability, log-summary-date-ranges, sqlite-with-gcov. Failed (verifier): build-cython-ext, chess-best-move, configure-git-webserver, polyglot-c-py, regex-log. The two qemu tasks hit the 15-min agent timeout — these are the kind of agentic-with-VM-state work that's furthest from a single-shot translator's wheelhouse.

# Cloud sandboxes (recommended; needs DAYTONA_API_KEY + ANTHROPIC_API_KEY)
nl2shell-eval-tb --dataset terminal-bench@2.0 \
    --agent claude-code --model anthropic/claude-haiku-4-5 \
    --env daytona -n 16

# Local Docker
nl2shell-eval-tb --dataset terminal-bench-sample@2.0 \
    --agent claude-code --model anthropic/claude-haiku-4-5 \
    --env docker -n 4

`Nl2shellAgent` — nl2shell:0.8b as the agent (v0.3)

# Drive Terminal-Bench tasks with our 752 M local model (needs Ollama up
# with nl2shell:0.8b; DAYTONA_API_KEY only if you flip the YAML to daytona).
nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml

Nl2shellAgent (src/nl2shell/agent.py) is a Harbor BaseAgent that:

Renders instruction + observation history into a flat NL prompt.
Asks Ollama for one bash command.
Executes it via environment.exec.
Appends (command, return_code, stdout, stderr) to the observation history.
Loops until max_steps, a stop-marker, or an empty/unsafe command.

It supports three observation modes (full | command | none) for ablation runs at configs/harbor_nl2shell_obs_ablation.yaml. This is the honest baseline for nl2shell on agentic tasks — the gap between this number and the v0.2 Claude number is what a Terminal World Model would need to close.

See docs/benchmarks.md for adapter internals and docker/README.md for the reproducible eval-in-a-box compose file.

ProgramBench

Out of scope. ProgramBench is whole-program reconstruction from a compiled binary; it's not even close to what a single-shot NL→bash model attempts. We document the mismatch in docs/benchmarks.md rather than publish a 0%.

Reproducing the benchmark

# 1. Pull the labelled dataset (or rebuild from v3 yourself)
python3 -m nl2shell.audit AryaYT/nl2shell-training-v3 \
    --out data/v3-labeled.jsonl \
    --report data/gap-report.md

# 2. Build the stratified held-out
python3 scripts/build_holdout.py \
    --labelled data/v3-labeled.jsonl \
    --out data/holdout/holdout.jsonl

# 3. Run the eval (assumes Ollama is up with nl2shell:0.8b)
PYTHONPATH=src python3 -m evals.run --benchmark v3-holdout

The run writes results/<benchmark>__<timestamp>/{summary.json, per_case.jsonl, report.md}.

Project layout

src/nl2shell/      # library: concepts, label, audit, client, metrics
evals/             # benchmark runner + adapters (v3-holdout, terminal-bench TBD)
scripts/           # one-off scripts (build_holdout, etc.)
data/              # held-out + labelled dataset (small files committed)
results/           # one directory per eval run; summary.json + report.md committed
docs/              # design notes (benchmarks, taxonomy, training)
tests/             # pytest suite for concepts + metrics

What's measured

metric	what it answers
`exact_match`	did the model produce the gold string, modulo whitespace?
`template_match`	did it produce the same command shape, ignoring literal arguments?
`parses`	does `bash -n` accept the output? (i.e. did we get valid shell)
`shellcheck_clean`	does shellcheck flag any SC2xxx errors? (when installed)
`concept_overlap`	Jaccard of v3 concept tags between pred and gold
`per_tag`	exact-match rate stratified by the tag the gold carries

Roadmap

v0.1 — single-shot benchmark, stratified held-out, multi-metric scoreboard.
v0.2 — TerminalBench-2.0 via Harbor adapter, Claude-haiku-4-5 baseline, reproducible Dockerfile + compose, eval-in-a-box.
v0.3 (this release) — Nl2shellAgent Harbor adapter so nl2shell:0.8b runs as the agent itself, plus an observation-mode ablation harness.
v0.4 — v4 dataset targeted at 18 starved concept tags + run Nl2shellAgent on the full 89-task TerminalBench-2.0 for the published number.
v1.0 — Terminal World Model layer that conditions nl2shell on observed terminal state.

See docs/benchmarks.md and data/gap-report.md for the data-side roadmap.

Citation

If you find this useful:

@misc{aryayt2026nl2shell,
  author       = {Arya},
  title        = {nl2shell: a small natural-language-to-shell-command model},
  year         = {2026},
  howpublished = {\url{https://github.com/nl2shell/bench}},
}

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

nl2shell

Why

Install

Quickstart

Benchmark scoreboard

Why exact-match is harsh — three concrete failure modes

TerminalBench-2.0 (via Harbor)

`Nl2shellAgent` — nl2shell:0.8b as the agent (v0.3)

ProgramBench

Reproducing the benchmark

Project layout

What's measured

Roadmap

Citation

License

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
configs		configs
data		data
docker		docker
docs		docs
evals		evals
results		results
scripts		scripts
src/nl2shell		src/nl2shell
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

nl2shell

Why

Install

Quickstart

Benchmark scoreboard

Why exact-match is harsh — three concrete failure modes

TerminalBench-2.0 (via Harbor)

Nl2shellAgent — nl2shell:0.8b as the agent (v0.3)

ProgramBench

Reproducing the benchmark

Project layout

What's measured

Roadmap

Citation

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`Nl2shellAgent` — nl2shell:0.8b as the agent (v0.3)

Packages