OrderLab · Essoz · Jun 23, 2026
diff --git a/README.md b/README.md
@@ -2,73 +2,102 @@
 <picture>
   <img alt="TrainCheck logo" width="55%" src="https://raw.githubusercontent.com/OrderLab/TrainCheck/main/docs/assets/images/traincheck_logo.png">
 </picture>
-<h1>TrainCheck: Invariant Checking & Observability for AI Training</h1>
+<h1>TrainCheck: Invariant Checking for AI Training</h1>
 
 [![Chat on Discord](https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white)](https://discord.gg/ZvYewjsQ9D)
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/OrderLab/TrainCheck)
 
 </div>
 
+TrainCheck catches silent training bugs by learning what a healthy run does, then checking a new run against those learned invariants. It works by tracing PyTorch API calls and model state changes, so you can inspect training behavior before a loss curve or final metric tells you something went wrong.
 
-**Stop flying blind.** TrainCheck gives you deep visibility into your training dynamics, continuously validating correctness and stability where standard metrics fail.
+## Install
 
----
+Install TrainCheck in the same Python environment that runs your training script:
 
-### Why TrainCheck?
+```bash
+pip3 install traincheck
+```
 
-✅ **Continuous Invariant Checking**
-TrainCheck validates the "physics" of your training process in real-time. It ensures your model adheres to learned invariants—such as gradient norms, tensor shapes, and update magnitudes—effectively catching silent corruption before it wastes GPU hours.
+For CUDA, conda, and source-install details, see the [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/).
 
-🚀 **Holistic Observability**
-Traditional tools only show you *if* your model crashed. TrainCheck shows you *why* it's degrading, analyzing internal state dynamics that loss curves miss.
+## Use TrainCheck
 
-🧠 **Zero-Config Validation**
-No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
+TrainCheck has four main steps.
 
-⚡ **Universal Compatibility**
-Drop-in support for PyTorch, Hugging Face, and industry-class workloads using DeepSpeed/Megatron and more.
+### 1. Collect a Reference Trace
 
----
-## Installation
+Run `traincheck-collect` on a known-good training script. Keep the run short, but make sure it covers the training behavior you want TrainCheck to learn.
+
+```bash
+traincheck-collect \
+  --pyscript reference.py \
+  --models-to-track model \
+  --output-dir reference_trace
+```
 
-Install TrainCheck in the Python environment where you will run your training script:
+### 2. Infer Invariants
+
+Turn the reference trace into invariants:
 
 ```bash
-pip3 install traincheck
+traincheck-infer -f reference_trace -o invariants.json
+```
+
+### 3. Collect a Target Trace
+
+Run the target training script with the inferred invariants. Passing `--invariants` lets TrainCheck trace only the APIs and variables needed for those checks.
+
+```bash
+traincheck-collect \
+  --pyscript target.py \
+  --models-to-track model \
+  --invariants invariants.json \
+  --output-dir target_trace
 ```
 
-For detailed setup (CUDA configuration, UV, conda environments), see the [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/).
+For long target runs, trace fewer steps by setting `--sampling-interval` and `--warm-up-steps`:
 
+```bash
+traincheck-collect \
+  --pyscript target.py \
+  --models-to-track model \
+  --invariants invariants.json \
+  --sampling-interval 10 \
+  --warm-up-steps 10 \
+  --output-dir target_trace
+```
 
-### How It Works
+### 4. Check the Target Run
 
-1. **Instrument**: We wrap your training loop with lightweight probes—no code changes needed.
-2. **Learn**: We analyze correct runs to infer *invariants* (mathematical rules of healthy training).
-3. **Check**: We monitor new runs in real-time, verifying every step against learned invariants to catch silent logic bugs and hardware faults.
+For live checking, start `traincheck-onlinecheck` while the target run is writing traces:
 
-![Workflow](https://raw.githubusercontent.com/OrderLab/TrainCheck/main/docs/assets/images/workflow.png)
+```bash
+traincheck-onlinecheck -f target_trace -i invariants.json
+```
 
-## 🔥 Try TrainCheck
+Offline checking is the simpler path. Wait for trace collection to finish, then run:
 
-Work through [5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md). You’ll learn how to:
-   - Instrument a training script and collect a trace  
-   - Automatically infer invariants  
-   - Uncover silent bugs in the training script
+```bash
+traincheck-check -f target_trace -i invariants.json
+```
+
+Both checkers write a results directory with failure logs and a `report.html` summary.
 
-## Documentation
+## Learn More
 
-- **[Installation Guide](https://orderlab.io/TrainCheck/installation-guide/)**
-- **[Usage Guide: Scenarios and Limitations](https://orderlab.io/TrainCheck/usage-guide/)**
-- **[TrainCheck Technical Doc](https://orderlab.io/TrainCheck/technical-doc/)**
+- [Use TrainCheck](https://orderlab.io/TrainCheck/usage-guide/) explains the full workflow and output files.
+- [5-Minute Tutorial](./docs/5-min-tutorial.md) walks through a real silent training issue.
+- [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/) covers environment setup.
+- [Technical Documentation](https://orderlab.io/TrainCheck/technical-doc/) describes invariants, trace representation, and implementation details.
 
 ## Status
 
-TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support. You can also reach the team at [traincheck@umich.edu](mailto:traincheck@umich.edu).
-We welcome feedback and contributions from early adopters.
+TrainCheck is under active development. Please join our [Discord server](https://discord.gg/VwxpJDvB), file a GitHub issue, or email [traincheck@umich.edu](mailto:traincheck@umich.edu).
 
 ## Contributing
 
-We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](./CONTRIBUTING.md) for how to get involved.
+We welcome contributions. See [Contributing to TrainCheck](./CONTRIBUTING.md) for setup and contribution guidance.
 
 ## License
 
@@ -77,6 +106,7 @@ TrainCheck is licensed under the [Apache License 2.0](./LICENSE).
 ## Citation
 
 If TrainCheck is relevant to your work, please cite our paper:
+
 ```bib
 @inproceedings{TrainCheckOSDI2025,
   author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
@@ -90,7 +120,6 @@ If TrainCheck is relevant to your work, please cite our paper:
 }
 ```
 
-
 ## Artifact Evaluation
 
-🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](./docs/ae.md).
+OSDI AE members should use the [TrainCheck AE Guide](./docs/ae.md).
diff --git a/docs/ae-eval-s5.5-perf-overhead.md b/docs/ae-eval-s5.5-perf-overhead.md
@@ -21,7 +21,7 @@ This evaluation measures the runtime overhead introduced by TrainCheck’s instr
     - Located in [overhead-e2e](../eval_scripts/perf_benchmark/overhead-e2e)
 
 - The deployed 100 invariants:
-    [eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json](../eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json)
+    `eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json`
 
 
 ## 🛠 How to Run

diff --git a/docs/check.md b/docs/check.md
@@ -1,110 +1,152 @@
-# TrainCheck Checker Usage Guide
+# CLI Reference: Check Traces
 
-`traincheck-check` is the **final stage** of the TrainCheck workflow. It verifies a set of invariants against trace files or streams from target programs, reporting any detected violations—helping you catch silent issues in your ML training pipelines.
+Start with [Use TrainCheck](usage-guide.md) if you want the full workflow. This page explains `traincheck-onlinecheck` and `traincheck-check`.
 
-## 🔧 Checking Modes
+TrainCheck has two checking modes:
 
-TrainCheck supports two checking modes:
+- `traincheck-onlinecheck` checks traces while `traincheck-collect` is still writing them.
+- `traincheck-check` checks completed trace files after collection finishes.
 
-- **Post-training Checking (`traincheck-check`)**:  
-   Perform invariant checking on completed trace files after the training job finishes. ✅
+Use online checking when you want to see violations while a job is still running. Use offline checking when you want the easiest path or a reproducible local workflow.
 
-- **On-the-fly Checking (`traincheck-onlinecheck`):**
-   Perform real-time checking while the target training job is running. ✅
+## Live Checking
 
-## How to Use: On-the-fly Checking
-
-While training is in progress with `traincheck-collect`, run the following command:
+Start trace collection for the target run:
 
 ```bash
-traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file>
+traincheck-collect \
+  --pyscript target.py \
+  --models-to-track model \
+  --invariants invariants.json \
+  --output-dir target_trace
 ```
 
-- `-f <trace_folder>`: Path to the folder where traces are:
-  - Already collected, or
-  - **Actively being collected** by `traincheck-collect` during the training job.
+In another terminal, start the online checker:
 
-- `-i <path_to_invariant_file>`: Path to the JSON file containing inferred invariants.
+```bash
+traincheck-onlinecheck -f target_trace -i invariants.json
+```
 
-## How to Use: Post-training Checking
+The online checker watches `target_trace/` and updates its report as new traces arrive.
 
-Run the following command:
+If the command fails because the `watchdog` package is missing, install it in the same environment:
 
 ```bash
-traincheck-check -f <trace_folder> -i <path_to_invariant_file>
+pip3 install watchdog
 ```
 
-- `-f <trace_folder>`: Path to the folder containing traces collected by `traincheck-collect`.
-- `-i <path_to_invariant_file>`: Path to the JSON file containing inferred invariants.
+Control the report refresh interval with:
 
-## Report Visualization Options
-
-Both checkers can produce a standalone HTML report and optionally log summary metrics to external monitoring tools.
+```bash
+traincheck-onlinecheck \
+  -f target_trace \
+  -i invariants.json \
+  --report-interval-seconds 30
+```
 
-### Standalone HTML Report (default)
+## Offline Checking
 
-- Output: `<output_dir>/report.html`
-- Includes summary counts, relation breakdown, and top violations.
-- Disable with `--no-html-report`.
+The offline path is simpler. First let `traincheck-collect` finish, then run:
 
-**Offline example**
 ```bash
-traincheck-check -f <trace_folder> -i <path_to_invariant_file>
+traincheck-check -f target_trace -i invariants.json
 ```
 
-**Online example**
+Offline checking reads the completed trace folder and writes a results directory.
+
+## Sampling and Checking
+
+Sampling is configured during trace collection:
+
 ```bash
-traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file>
+traincheck-collect \
+  --pyscript target.py \
+  --models-to-track model \
+  --invariants invariants.json \
+  --sampling-interval 10 \
+  --warm-up-steps 10 \
+  --output-dir target_trace
 ```
 
-### W&B Integration
-
-Enable with `--report-wandb`. You can also pass:
-`--wandb-project`, `--wandb-entity`, `--wandb-run-name`, `--wandb-group`, `--wandb-tags`.
+Then run either checker normally:
 
 ```bash
-traincheck-check -f <trace_folder> -i <path_to_invariant_file> \
-  --report-wandb --wandb-project <project>
+traincheck-onlinecheck -f target_trace -i invariants.json
 ```
 
 ```bash
-traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file> \
-  --report-wandb --wandb-project <project>
+traincheck-check -f target_trace -i invariants.json
 ```
 
-### MLflow Integration
+The checker does not decide which steps were traced. It checks the trace files that collection produced.
 
-Enable with `--report-mlflow`. Optional:
-`--mlflow-experiment`, `--mlflow-run-name`.
+## Reports and Logs
+
+Both checkers write:
+
+- `failed.log`: violated invariants.
+- `passed.log`: triggered invariants that passed.
+- `not_triggered.log`: invariants that were never triggered by the trace.
+- `violations_summary.json`: compact violation summaries.
+- `report.html`: browser-readable summary.
+
+The default output directory is timestamped. Use `-o` or `--output-dir` to choose a path:
 
 ```bash
-traincheck-check -f <trace_folder> -i <path_to_invariant_file> \
-  --report-mlflow --mlflow-experiment <experiment>
+traincheck-check \
+  -f target_trace \
+  -i invariants.json \
+  --output-dir check_results
 ```
 
+## W&B and MLflow
+
+Log checker results to Weights & Biases:
+
 ```bash
-traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file> \
-  --report-mlflow --mlflow-experiment <experiment>
+traincheck-check \
+  -f target_trace \
+  -i invariants.json \
+  --report-wandb \
+  --wandb-project traincheck
 ```
 
-### Online Report Refresh
+Attach offline checker metrics to an existing W&B run:
 
-The online checker refreshes the report when violations change, and also on a periodic timer.
-Control the interval with `--report-interval-seconds` (default: 10).
+```bash
+traincheck-check \
+  -f target_trace \
+  -i invariants.json \
+  --report-wandb \
+  --wandb-run-id <run-id>
+```
+
+Log checker results to MLflow:
 
 ```bash
-traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file> \
-  --report-interval-seconds 30
+traincheck-check \
+  -f target_trace \
+  -i invariants.json \
+  --report-mlflow \
+  --mlflow-experiment traincheck
 ```
 
-**Note:** W&B and MLflow logging are optional. If the packages are not installed, TrainCheck will skip logging and emit a warning.
+The online checker supports the same W&B and MLflow reporting flags.
 
-## Interpreting the Results
+## Useful Options
 
-After running either checking mode, TrainCheck will output a summary of detected invariant violations. Each violation entry typically includes:
+- `-f, --trace-folders`: trace directories produced by `traincheck-collect`.
+- `-t, --traces`: individual trace files.
+- `-i, --invariants`: invariant files produced by `traincheck-infer`.
+- `-o, --output-dir`: results directory.
+- `--no-html-report`: skip `report.html`.
+- `--report-wandb`: log summary metrics and the HTML report to W&B.
+- `--report-mlflow`: log summary metrics and the HTML report to MLflow.
+- `--report-interval-seconds`: online checker report refresh interval.
 
-- **Trace file or stream name**: Identifies where the issue was found.
-- **Invariant description**: Details the specific invariant that was violated.
-- **Violation details**: Provides context, such as the step or epoch where the violation occurred.
+Run the command help for the complete option list:
 
-Review these results to pinpoint silent errors or unexpected behaviors in your ML training pipeline. For more information on result formats and how to diagnose issues, see [5. Detection & Diagnosis](./5-min-tutorial.md#5-detection--diagnosis) in the **5-Minute Tutorial**.
+```bash
+traincheck-check --help
+traincheck-onlinecheck --help
+```
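
The "Reports and Logs" section of `docs/check.md` lists `failed.log` as the file that holds violated invariants. As a minimal sketch of how a script or CI job might gate on that file, the helper below prints a one-word status for a checker results directory. The function name `traincheck_status` and the status words are hypothetical and not part of TrainCheck. The sketch also assumes that `failed.log` is empty or absent when no invariant was violated, which the docs in this diff do not confirm.

```bash
# Hypothetical helper, not part of TrainCheck.
# Assumption: failed.log is empty (or absent) when no invariant was violated.
traincheck_status() {
  results_dir="$1"
  if [ ! -d "$results_dir" ]; then
    echo "missing"       # the results directory does not exist
  elif [ -s "$results_dir/failed.log" ]; then
    echo "violations"    # failed.log exists and is non-empty
  else
    echo "clean"         # failed.log is empty or was not written
  fi
}
```

After running `traincheck-check -f target_trace -i invariants.json --output-dir check_results`, a job could call `traincheck_status check_results` and fail unless the result is `clean`. Treating `missing` as a failure avoids passing a run whose check never produced output.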