101 changes: 65 additions & 36 deletions README.md
@@ -2,73 +2,102 @@
<picture>
<img alt="TrainCheck logo" width="55%" src="https://raw.githubusercontent.com/OrderLab/TrainCheck/main/docs/assets/images/traincheck_logo.png">
</picture>
<h1>TrainCheck: Invariant Checking & Observability for AI Training</h1>
<h1>TrainCheck: Invariant Checking for AI Training</h1>

[![Chat on Discord](https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white)](https://discord.gg/ZvYewjsQ9D)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/OrderLab/TrainCheck)

</div>

TrainCheck catches silent training bugs by learning what a healthy run does, then checking a new run against those learned invariants. It works by tracing PyTorch API calls and model state changes, so you can inspect training behavior before a loss curve or final metric tells you something went wrong.

**Stop flying blind.** TrainCheck gives you deep visibility into your training dynamics, continuously validating correctness and stability where standard metrics fail.
## Install

---
Install TrainCheck in the same Python environment that runs your training script:

### Why TrainCheck?
```bash
pip3 install traincheck
```

✅ **Continuous Invariant Checking**
TrainCheck validates the "physics" of your training process in real-time. It ensures your model adheres to learned invariants—such as gradient norms, tensor shapes, and update magnitudes—effectively catching silent corruption before it wastes GPU hours.
For CUDA, conda, and source-install details, see the [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/).
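
After installing, it can help to confirm that the four command-line tools used in this README are on your `PATH`. The following is a minimal POSIX shell sketch, not part of TrainCheck; the helper name `missing_commands` is hypothetical.

```bash
# Hypothetical helper (not part of TrainCheck): print each given command
# name that cannot be found on PATH, one name per line.
missing_commands() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
    fi
  done
}
```

Calling `missing_commands traincheck-collect traincheck-infer traincheck-check traincheck-onlinecheck` prints nothing when all four tools are found. Any name it prints suggests the package was installed into a different environment than the one your shell is using.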

🚀 **Holistic Observability**
Traditional tools only show you *if* your model crashed. TrainCheck shows you *why* it's degrading, analyzing internal state dynamics that loss curves miss.
## Use TrainCheck

🧠 **Zero-Config Validation**
No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
The TrainCheck workflow has four main steps.

⚡ **Universal Compatibility**
Drop-in support for PyTorch, Hugging Face, and industry-class workloads using DeepSpeed/Megatron and more.
### 1. Collect a Reference Trace

---
## Installation
Run `traincheck-collect` on a known-good training script. This should be a short run that covers the training behavior you want TrainCheck to learn.

```bash
traincheck-collect \
--pyscript reference.py \
--models-to-track model \
--output-dir reference_trace
```

Install TrainCheck in the Python environment where you will run your training script:
### 2. Infer Invariants

Turn the reference trace into invariants:

```bash
pip3 install traincheck
traincheck-infer -f reference_trace -o invariants.json
```

### 3. Collect a Target Trace

Run the target training script with the inferred invariants. Passing `--invariants` lets TrainCheck trace only the APIs and variables needed for those checks.

```bash
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--output-dir target_trace
```

For detailed setup (CUDA configuration, UV, conda environments), see the [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/).
For long target runs, trace fewer steps:

```bash
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--sampling-interval 10 \
--warm-up-steps 10 \
--output-dir target_trace
```

### How It Works
### 4. Check the Target Run

1. **Instrument**: We wrap your training loop with lightweight probes—no code changes needed.
2. **Learn**: We analyze correct runs to infer *invariants* (mathematical rules of healthy training).
3. **Check**: We monitor new runs in real-time, verifying every step against learned invariants to catch silent logic bugs and hardware faults.
For live checking, start `traincheck-onlinecheck` while the target run is writing traces:

![Workflow](https://raw.githubusercontent.com/OrderLab/TrainCheck/main/docs/assets/images/workflow.png)
```bash
traincheck-onlinecheck -f target_trace -i invariants.json
```

## 🔥 Try TrainCheck
The easier offline path is to wait for trace collection to finish, then run:

Work through [5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md). You’ll learn how to:
- Instrument a training script and collect a trace
- Automatically infer invariants
- Uncover silent bugs in the training script
```bash
traincheck-check -f target_trace -i invariants.json
```

Both checkers write a results directory with failure logs and a `report.html` summary.
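
The four steps above can be chained into one script for the offline path. This is a sketch that reuses only the commands and file names shown in this README (`reference.py`, `target.py`, `reference_trace`, `target_trace`, `invariants.json`). The function name and the `RUN` dry-run variable are hypothetical conveniences, not TrainCheck features.

```bash
# Sketch of the offline workflow: collect, infer, collect again, check.
# RUN is a hypothetical dry-run switch: set RUN=echo to print each command
# instead of executing it. Leave RUN unset to run the real tools.
traincheck_offline_workflow() {
  ${RUN:-} traincheck-collect --pyscript reference.py --models-to-track model --output-dir reference_trace &&
  ${RUN:-} traincheck-infer -f reference_trace -o invariants.json &&
  ${RUN:-} traincheck-collect --pyscript target.py --models-to-track model --invariants invariants.json --output-dir target_trace &&
  ${RUN:-} traincheck-check -f target_trace -i invariants.json
}
```

The `&&` chaining stops at the first command that exits with a non-zero status, so a failed reference collection never feeds an empty trace into inference. Running with `RUN=echo` set first is a cheap way to review the exact commands before spending time on a real training run.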

## Documentation
## Learn More

- **[Installation Guide](https://orderlab.io/TrainCheck/installation-guide/)**
- **[Usage Guide: Scenarios and Limitations](https://orderlab.io/TrainCheck/usage-guide/)**
- **[TrainCheck Technical Doc](https://orderlab.io/TrainCheck/technical-doc/)**
- [Use TrainCheck](https://orderlab.io/TrainCheck/usage-guide/) explains the full workflow and output files.
- [5-Minute Tutorial](./docs/5-min-tutorial.md) walks through a real silent training issue.
- [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/) covers environment setup.
- [Technical Documentation](https://orderlab.io/TrainCheck/technical-doc/) describes invariants, trace representation, and implementation details.

## Status

TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support. You can also reach the team at [traincheck@umich.edu](mailto:traincheck@umich.edu).
We welcome feedback and contributions from early adopters.
TrainCheck is under active development. Please join our [Discord server](https://discord.gg/VwxpJDvB), file a GitHub issue, or email [traincheck@umich.edu](mailto:traincheck@umich.edu).

## Contributing

We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](./CONTRIBUTING.md) for how to get involved.
We welcome contributions. See [Contributing to TrainCheck](./CONTRIBUTING.md) for setup and contribution guidance.

## License

@@ -77,6 +106,7 @@ TrainCheck is licensed under the [Apache License 2.0](./LICENSE).
## Citation

If TrainCheck is relevant to your work, please cite our paper:

```bib
@inproceedings{TrainCheckOSDI2025,
author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
@@ -90,7 +120,6 @@ If TrainCheck is relevant to your work, please cite our paper:
}
```


## Artifact Evaluation

🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](./docs/ae.md).
OSDI AE members should use the [TrainCheck AE Guide](./docs/ae.md).
2 changes: 1 addition & 1 deletion docs/ae-eval-s5.5-perf-overhead.md
@@ -21,7 +21,7 @@ This evaluation measures the runtime overhead introduced by TrainCheck’s instr
- Located in [overhead-e2e](../eval_scripts/perf_benchmark/overhead-e2e)

- The deployed 100 invariants:
[eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json](../eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json)
`eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json`


## 🛠 How to Run
160 changes: 101 additions & 59 deletions docs/check.md
@@ -1,110 +1,152 @@
# TrainCheck Checker Usage Guide
# CLI Reference: Check Traces

`traincheck-check` is the **final stage** of the TrainCheck workflow. It verifies a set of invariants against trace files or streams from target programs, reporting any detected violations—helping you catch silent issues in your ML training pipelines.
Start with [Use TrainCheck](usage-guide.md) if you want the full workflow. This page explains `traincheck-onlinecheck` and `traincheck-check`.

## 🔧 Checking Modes
TrainCheck has two checking modes:

TrainCheck supports two checking modes:
- `traincheck-onlinecheck` checks traces while `traincheck-collect` is still writing them.
- `traincheck-check` checks completed trace files after collection finishes.

- **Post-training Checking (`traincheck-check`)**:
Perform invariant checking on completed trace files after the training job finishes. ✅
Use online checking when you want violations reported while the job is still running. Use offline checking when you want the easiest path or a reproducible local workflow.

- **On-the-fly Checking (`traincheck-onlinecheck`):**
Perform real-time checking while the target training job is running. ✅
## Live Checking

## How to Use: On-the-fly Checking

While training is in progress with `traincheck-collect`, run the following command:
Start trace collection for the target run:

```bash
traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file>
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--output-dir target_trace
```

- `-f <trace_folder>`: Path to the folder where traces are:
- Already collected, or
- **Actively being collected** by `traincheck-collect` during the training job.
In another terminal, start the online checker:

- `-i <path_to_invariant_file>`: Path to the JSON file containing inferred invariants.
```bash
traincheck-onlinecheck -f target_trace -i invariants.json
```

## How to Use: Post-training Checking
The online checker watches `target_trace/` and updates its report as new traces arrive.

Run the following command:
If the command fails because the `watchdog` package is missing, install it in the same environment:

```bash
traincheck-check -f <trace_folder> -i <path_to_invariant_file>
pip install watchdog
```

- `-f <trace_folder>`: Path to the folder containing traces collected by `traincheck-collect`.
- `-i <path_to_invariant_file>`: Path to the JSON file containing inferred invariants.
Control the report refresh interval with:

## Report Visualization Options

Both checkers can produce a standalone HTML report and optionally log summary metrics to external monitoring tools.
```bash
traincheck-onlinecheck \
-f target_trace \
-i invariants.json \
--report-interval-seconds 30
```

### Standalone HTML Report (default)
## Offline Checking

- Output: `<output_dir>/report.html`
- Includes summary counts, relation breakdown, and top violations.
- Disable with `--no-html-report`.
The offline path is simpler. First let `traincheck-collect` finish, then run:

**Offline example**
```bash
traincheck-check -f <trace_folder> -i <path_to_invariant_file>
traincheck-check -f target_trace -i invariants.json
```

**Online example**
Offline checking reads the completed trace folder and writes a results directory.

## Sampling and Checking

Sampling is configured during trace collection:

```bash
traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file>
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--sampling-interval 10 \
--warm-up-steps 10 \
--output-dir target_trace
```

### W&B Integration

Enable with `--report-wandb`. You can also pass:
`--wandb-project`, `--wandb-entity`, `--wandb-run-name`, `--wandb-group`, `--wandb-tags`.
Then run either checker normally:

```bash
traincheck-check -f <trace_folder> -i <path_to_invariant_file> \
--report-wandb --wandb-project <project>
traincheck-onlinecheck -f target_trace -i invariants.json
```

```bash
traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file> \
--report-wandb --wandb-project <project>
traincheck-check -f target_trace -i invariants.json
```

### MLflow Integration
The checker does not decide which steps were traced. It checks the trace files that collection produced.

Enable with `--report-mlflow`. Optional:
`--mlflow-experiment`, `--mlflow-run-name`.
## Reports and Logs

Both checkers write:

- `failed.log`: violated invariants.
- `passed.log`: triggered invariants that passed.
- `not_triggered.log`: invariants that never ran on the trace.
- `violations_summary.json`: compact violation summaries.
- `report.html`: browser-readable summary.

The default output directory is timestamped. Use `-o` or `--output-dir` to choose a path:

```bash
traincheck-check -f <trace_folder> -i <path_to_invariant_file> \
--report-mlflow --mlflow-experiment <experiment>
traincheck-check \
-f target_trace \
-i invariants.json \
--output-dir check_results
```
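After a check finishes, a quick way to confirm the run produced the files listed above is to look for each one in the results directory. The sketch below assumes the files sit directly inside that directory; the helper name `missing_result_files` is hypothetical and not part of TrainCheck.

```bash
# Hypothetical helper (not part of TrainCheck): print the names of the
# expected result files that are absent from a results directory.
# Assumes the files are written directly into that directory.
missing_result_files() {
  results_dir="$1"
  for name in failed.log passed.log not_triggered.log violations_summary.json report.html; do
    if [ ! -f "$results_dir/$name" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

For the example above, `missing_result_files check_results` prints nothing when every file is present. `report.html` will be reported as missing if the check ran with `--no-html-report`.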

## W&B and MLflow

Log checker results to Weights & Biases:

```bash
traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file> \
--report-mlflow --mlflow-experiment <experiment>
traincheck-check \
-f target_trace \
-i invariants.json \
--report-wandb \
--wandb-project traincheck
```

### Online Report Refresh
Attach offline checker metrics to an existing W&B run:

The online checker refreshes the report when violations change, and also on a periodic timer.
Control the interval with `--report-interval-seconds` (default: 10).
```bash
traincheck-check \
-f target_trace \
-i invariants.json \
--report-wandb \
--wandb-run-id <run-id>
```

Log checker results to MLflow:

```bash
traincheck-onlinecheck -f <trace_folder> -i <path_to_invariant_file> \
--report-interval-seconds 30
traincheck-check \
-f target_trace \
-i invariants.json \
--report-mlflow \
--mlflow-experiment traincheck
```

**Note:** W&B and MLflow logging are optional. If the packages are not installed, TrainCheck will skip logging and emit a warning.
The online checker supports the same W&B and MLflow reporting flags.

## Interpreting the Results
## Useful Options

After running either checking mode, TrainCheck will output a summary of detected invariant violations. Each violation entry typically includes:
- `-f, --trace-folders`: trace directories produced by `traincheck-collect`.
- `-t, --traces`: individual trace files.
- `-i, --invariants`: invariant files produced by `traincheck-infer`.
- `-o, --output-dir`: results directory.
- `--no-html-report`: skip `report.html`.
- `--report-wandb`: log summary metrics and the HTML report to W&B.
- `--report-mlflow`: log summary metrics and the HTML report to MLflow.
- `--report-interval-seconds`: online checker report refresh interval.

- **Trace file or stream name**: Identifies where the issue was found.
- **Invariant description**: Details the specific invariant that was violated.
- **Violation details**: Provides context, such as the step or epoch where the violation occurred.
Run the command help for the complete option list:

Review these results to pinpoint silent errors or unexpected behaviors in your ML training pipeline. For more information on result formats and how to diagnose issues, see [5. Detection & Diagnosis](./5-min-tutorial.md#5-detection--diagnosis) in the **5-Minute Tutorial**.
```bash
traincheck-check --help
traincheck-onlinecheck --help
```