A small, local, natural-language → shell-command translator. 752 M parameters, Q4_K_M, ~530 MB. Runs through Ollama.
$ echo "find every python file under src bigger than 1MB" | ollama run nl2shell:0.8b
find src -type f -name '*.py' -size +1M
Big NL→shell models exist (Claude, GPT, Code Llama), but a small model that runs on a laptop and gets the everyday cases right has two uses:
- Latency-critical UX — phone, watch, voice — where round-tripping to a frontier model adds 800 ms and a privacy footprint.
- A measurable foundation for a Terminal World Model (next project): if a small NL→bash model is honestly benchmarked, we can layer a world model on top that plans in latent shell-state space.
# 1. Pull the model into Ollama (hosted at AryaYT on Hugging Face)
ollama pull hf.co/AryaYT/nl2shell-0.8b
ollama cp hf.co/AryaYT/nl2shell-0.8b nl2shell:0.8b # local alias
# 2. (For the benchmark harness) clone & set up Python deps
git clone https://github.com/nl2shell/bench.git
cd bench
pip install -e .
# Single inference
echo "list every .conf file in /etc that was modified in the last 24 hours" \
| ollama run nl2shell:0.8b
# Library use
python3 -c "
from nl2shell.client import OllamaClient
r = OllamaClient().generate('compress all logs older than 7 days')
print(r.bash)
"
We evaluate nl2shell:0.8b on a stratified held-out carved from v3, balanced so that low-frequency concept tags (stream redirection, process control) get fair representation rather than being drowned out by the well-covered ones (pipe, quote.*, chain.seq).
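The balancing idea can be sketched as a per-tag quota, filled rarest tag first. This is only an illustration, not the logic in scripts/build_holdout.py: the function name, the `per_tag` parameter, and the `id`/`tags` record keys are all assumptions.

```python
import random
from collections import defaultdict


def build_stratified_holdout(cases, per_tag, seed=0):
    """Pick up to `per_tag` cases for every concept tag.

    `cases` is a list of dicts with hypothetical keys "id" and "tags".
    """
    rng = random.Random(seed)
    by_tag = defaultdict(list)
    for case in cases:
        for tag in case["tags"]:
            by_tag[tag].append(case)

    chosen = {}
    # Visit rare tags first so they reach their quota before common tags do.
    for tag in sorted(by_tag, key=lambda t: (len(by_tag[t]), t)):
        pool = [c for c in by_tag[tag] if c["id"] not in chosen]
        already = sum(1 for c in chosen.values() if tag in c["tags"])
        need = max(0, per_tag - already)
        for case in rng.sample(pool, min(need, len(pool))):
            chosen[case["id"]] = case
    return list(chosen.values())
```

With 100 cases tagged only `pipe` and 3 tagged with a rare tag, `per_tag=3` yields three of each, so the rare tag carries half the holdout instead of about 3% of it.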
Run: results/v3-holdout__20260524T101822Z/report.md — 2026-05-24, nl2shell:0.8b, n=426, 99s wall.
| metric | value |
|---|---|
| exact_match | 3.3% |
| template_match | 4.0% |
| parses (bash -n) | 96.7% |
| concept-tag Jaccard (mean) | 0.310 |
| latency (median) | 0.21s |
| latency (p95) | 0.41s |
Read this as: the model produces valid shell almost every time (96.7%), but the exact string rarely matches the noisy gold. The honest signal is per-difficulty:
| difficulty | n | exact |
|---|---|---|
| 1 (no operators) | 32 | 21.9% |
| 2 (one operator) | 90 | 5.6% |
| 3 (2–3 operators) | 182 | 1.1% |
| 4 (4+ operators) | 122 | 0.0% |
Simple cases land; multi-operator cases collapse. That's the gap v4 is being built to close.
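As a sanity check, the per-difficulty rows can be folded back into the headline exact_match figure. The snippet below uses only the numbers from the two tables; the hit counts are recovered by rounding, so treat them as inferred rather than reported.

```python
# (difficulty, n, exact-match rate in percent), copied from the table above
rows = [(1, 32, 21.9), (2, 90, 5.6), (3, 182, 1.1), (4, 122, 0.0)]

total_n = sum(n for _, n, _ in rows)
# Recover the integer hit count behind each rounded percentage.
hits = [round(n * pct / 100) for _, n, pct in rows]
overall = 100 * sum(hits) / total_n

print(total_n)            # 426
print(hits)               # [7, 5, 2, 0]
print(f"{overall:.1f}%")  # 3.3%
```

Half of the 14 exact matches come from the 32 difficulty-1 cases, which is why the aggregate number hides how well the simple tier does.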
The v3 gold is noisy in three ways:
- The model is sometimes more idiomatic than the gold. `find . -name "*bar"` (model) vs `find -name *bar` (gold). The model adds the `.` default and quotes the glob; gold doesn't. Same intent, different string.
- Multi-line shell scripts as "answers". ~3% of the holdout has gold containing newlines or shebangs: full 20-line scripts as a single "bash" answer. A single-shot translator can't produce those; they should arguably be filtered out (see issue v0.2).
- Real failures on complex pipes. Some cases the model just gets wrong: it emits repeated `sort -R | uniq -c` loops or picks the wrong base command. These are the legitimate failures and exactly what v4 targets.
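If the multi-line gold answers were filtered out, the check could be as small as the sketch below. The helper name is hypothetical and nothing like it is confirmed to exist in the repo; note that it would also drop legitimate one-liners written with backslash line continuations.

```python
def is_single_command_gold(gold: str) -> bool:
    """True when a gold answer looks like something a single-shot translator could emit."""
    text = gold.strip()
    if not text:
        return False
    if text.startswith("#!"):  # shebang: this is a script, not a command
        return False
    if "\n" in text:           # interior newline: multi-line answer
        return False
    return True
```

Applied to the holdout before scoring, it would change the denominator, so any number computed this way should be reported alongside the unfiltered one rather than instead of it.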
We deliberately don't post-process the model's output to chase a higher number — the metric should be honest.
Wired up in v0.2.0. TB2 tasks are agentic — they require a sandboxed environment, multi-turn observation, and a test script per task — so nl2shell:0.8b is not the agent yet (a Nl2shellAgent Harbor adapter is on the v0.3 roadmap). For v0.2, we use claude-haiku-4-5 as the agent to establish the baseline number we'll later compare nl2shell against.
First Daytona run on the 10-task terminal-bench-sample@2.0:
results/terminal-bench__20260524T110127Z/report.md
| metric | value |
|---|---|
| pass rate | 30.0% (3/10) |
| mean reward | 0.300 |
| wall clock | 21m 02s |
| total cost | $1.63 |
| total input tokens | 11,148,439 (~76% cached) |
| total output tokens | 128,380 |
| timeouts | 2 / 10 (qemu-alpine-ssh, qemu-startup) |
Passed: fix-code-vulnerability, log-summary-date-ranges, sqlite-with-gcov. Failed (verifier): build-cython-ext, chess-best-move, configure-git-webserver, polyglot-c-py, regex-log. The two qemu tasks hit the 15-min agent timeout — these are the kind of agentic-with-VM-state work that's furthest from a single-shot translator's wheelhouse.
# Cloud sandboxes (recommended; needs DAYTONA_API_KEY + ANTHROPIC_API_KEY)
nl2shell-eval-tb --dataset terminal-bench@2.0 \
--agent claude-code --model anthropic/claude-haiku-4-5 \
--env daytona -n 16
# Local Docker
nl2shell-eval-tb --dataset terminal-bench-sample@2.0 \
--agent claude-code --model anthropic/claude-haiku-4-5 \
--env docker -n 4
# Drive Terminal-Bench tasks with our 752 M local model (needs Ollama up
# with nl2shell:0.8b; DAYTONA_API_KEY only if you flip the YAML to daytona).
nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml
Nl2shellAgent (src/nl2shell/agent.py) is a Harbor BaseAgent that:
- Renders instruction + observation history into a flat NL prompt.
- Asks Ollama for one bash command.
- Executes it via environment.exec.
- Appends (command, return_code, stdout, stderr) to the observation history.
- Loops until max_steps, a stop-marker, or an empty/unsafe command.
It supports three observation modes (full | command | none) for ablation
runs at configs/harbor_nl2shell_obs_ablation.yaml. This is the honest
baseline for nl2shell on agentic tasks — the gap between this number and
the v0.2 Claude number is what a Terminal World Model would need to close.
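The loop described above can be sketched without Harbor or Ollama by passing the model and the environment in as plain callables. Everything here is an assumption made for illustration: `run_agent`, `render_prompt`, the `DONE` stop marker, the `is_unsafe` hook, and the reading of the three observation modes (full = commands plus results, command = commands only, none = instruction only). The real adapter in src/nl2shell/agent.py may differ on all of these.

```python
from typing import Callable, List, Tuple

Observation = Tuple[str, int, str, str]  # (command, return_code, stdout, stderr)


def render_prompt(instruction: str, history: List[Observation], obs_mode: str = "full") -> str:
    """Flatten the instruction and past observations into one NL prompt."""
    lines = [f"Task: {instruction}"]
    if obs_mode != "none":
        for command, code, out, err in history:
            lines.append(f"$ {command}")
            if obs_mode == "full":
                lines.append(f"[exit {code}] {out.strip()} {err.strip()}".strip())
    lines.append("Next command:")
    return "\n".join(lines)


def run_agent(
    instruction: str,
    ask_model: Callable[[str], str],                   # prompt -> one bash command
    exec_command: Callable[[str], Tuple[int, str, str]],  # command -> (code, stdout, stderr)
    max_steps: int = 5,
    stop_marker: str = "DONE",
    obs_mode: str = "full",
    is_unsafe: Callable[[str], bool] = lambda command: False,
) -> List[Observation]:
    history: List[Observation] = []
    for _ in range(max_steps):
        command = ask_model(render_prompt(instruction, history, obs_mode)).strip()
        # Stop on an empty reply, the stop marker, or a command the hook rejects.
        if not command or command == stop_marker or is_unsafe(command):
            break
        history.append((command, *exec_command(command)))
    return history
```

Keeping the model and environment behind callables makes the loop testable with scripted replies, which is also a cheap way to check that each observation mode really changes the prompt before spending sandbox time on an ablation.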
See docs/benchmarks.md for adapter internals and docker/README.md for the reproducible eval-in-a-box compose file.
Out of scope. ProgramBench is whole-program reconstruction from a compiled binary; it's not even close to what a single-shot NL→bash model attempts. We document the mismatch in docs/benchmarks.md rather than publish a 0%.
# 1. Pull the labelled dataset (or rebuild from v3 yourself)
python3 -m nl2shell.audit AryaYT/nl2shell-training-v3 \
--out data/v3-labeled.jsonl \
--report data/gap-report.md
# 2. Build the stratified held-out
python3 scripts/build_holdout.py \
--labelled data/v3-labeled.jsonl \
--out data/holdout/holdout.jsonl
# 3. Run the eval (assumes Ollama is up with nl2shell:0.8b)
PYTHONPATH=src python3 -m evals.run --benchmark v3-holdout
The run writes results/<benchmark>__<timestamp>/{summary.json, per_case.jsonl, report.md}.
src/nl2shell/ # library: concepts, label, audit, client, metrics
evals/ # benchmark runner + adapters (v3-holdout, terminal-bench TBD)
scripts/ # one-off scripts (build_holdout, etc.)
data/ # held-out + labelled dataset (small files committed)
results/ # one directory per eval run; summary.json + report.md committed
docs/ # design notes (benchmarks, taxonomy, training)
tests/ # pytest suite for concepts + metrics
| metric | what it answers |
|---|---|
| exact_match | did the model produce the gold string, modulo whitespace? |
| template_match | did it produce the same command shape, ignoring literal arguments? |
| parses | does bash -n accept the output? (i.e. did we get valid shell) |
| shellcheck_clean | does shellcheck pass without any SC2xxx errors? (when installed) |
| concept_overlap | Jaccard of v3 concept tags between pred and gold |
| per_tag | exact-match rate stratified by the tag the gold carries |
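Three of these metrics are simple enough to sketch directly. The versions below are a plausible reading of the table, not the code in src/nl2shell/metrics: "modulo whitespace" is taken to mean collapsing whitespace runs (which also touches whitespace inside quotes), and two empty tag sets are scored as 1.0 by assumption.

```python
import shutil
import subprocess


def exact_match(pred: str, gold: str) -> bool:
    """Compare after collapsing every run of whitespace to a single space."""
    return " ".join(pred.split()) == " ".join(gold.split())


def jaccard(pred_tags, gold_tags) -> float:
    """Jaccard similarity of two tag collections."""
    a, b = set(pred_tags), set(gold_tags)
    if not a and not b:
        return 1.0  # assumption: two untagged commands count as full overlap
    return len(a & b) / len(a | b)


def parses(command: str):
    """True if `bash -n` accepts the command, None when bash is not installed."""
    if shutil.which("bash") is None:
        return None
    # -n makes bash read and syntax-check the script from stdin without running it.
    result = subprocess.run(["bash", "-n"], input=command, text=True, capture_output=True)
    return result.returncode == 0
```

For the find example earlier in this README, `exact_match` is False even though both commands mean the same thing, while tag overlap can still be high; that is the gap between the 3.3% and 0.310 figures.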
- v0.1: single-shot benchmark, stratified held-out, multi-metric scoreboard.
- v0.2: TerminalBench-2.0 via Harbor adapter, Claude-haiku-4-5 baseline, reproducible Dockerfile + compose, eval-in-a-box.
- v0.3 (this release): Nl2shellAgent Harbor adapter so nl2shell:0.8b runs as the agent itself, plus an observation-mode ablation harness.
- v0.4: v4 dataset targeted at 18 starved concept tags + run Nl2shellAgent on the full 89-task TerminalBench-2.0 for the published number.
- v1.0: Terminal World Model layer that conditions nl2shell on observed terminal state.
See docs/benchmarks.md and data/gap-report.md for the data-side roadmap.
If you find this useful:
@misc{aryayt2026nl2shell,
author = {Arya},
title = {nl2shell: a small natural-language-to-shell-command model},
year = {2026},
howpublished = {\url{https://github.com/nl2shell/bench}},
}
MIT. See LICENSE.