diff --git a/README.md b/README.md
index c404598f..be6457a0 100644
--- a/README.md
+++ b/README.md
@@ -2,73 +2,102 @@
 TrainCheck logo
-TrainCheck: Invariant Checking & Observability for AI Training
+TrainCheck: Invariant Checking for AI Training
 [![Chat on Discord](https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white)](https://discord.gg/ZvYewjsQ9D)
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/OrderLab/TrainCheck)
+TrainCheck catches silent training bugs by learning what a healthy run does, then checking a new run against those learned invariants. It works by tracing PyTorch API calls and model state changes, so you can inspect training behavior before a loss curve or final metric tells you something went wrong.
-**Stop flying blind.** TrainCheck gives you deep visibility into your training dynamics, continuously validating correctness and stability where standard metrics fail.
+## Install
----
+Install TrainCheck in the same Python environment that runs your training script:
-### Why TrainCheck?
+```bash
+pip3 install traincheck
+```
-✅ **Continuous Invariant Checking**
-TrainCheck validates the "physics" of your training process in real-time. It ensures your model adheres to learned invariants—such as gradient norms, tensor shapes, and update magnitudes—effectively catching silent corruption before it wastes GPU hours.
+For CUDA, conda, and source-install details, see the [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/).
-🚀 **Holistic Observability**
-Traditional tools only show you *if* your model crashed. TrainCheck shows you *why* it's degrading, analyzing internal state dynamics that loss curves miss.
+## Use TrainCheck
-🧠 **Zero-Config Validation**
-No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
+TrainCheck has four main steps.
-⚡ **Universal Compatibility**
-Drop-in support for PyTorch, Hugging Face, and industry-class workloads using DeepSpeed/Megatron and more.
+### 1. Collect a Reference Trace
----
-## Installation
+Run `traincheck-collect` on a known-good training script. This should be a short run that covers the training behavior you want TrainCheck to learn.
+
+```bash
+traincheck-collect \
+ --pyscript reference.py \
+ --models-to-track model \
+ --output-dir reference_trace
+```
-Install TrainCheck in the Python environment where you will run your training script:
+### 2. Infer Invariants
+
+Turn the reference trace into invariants:
 ```bash
-pip3 install traincheck
+traincheck-infer -f reference_trace -o invariants.json
+```
+
+### 3. Collect a Target Trace
+
+Run the target training script with the inferred invariants. Passing `--invariants` lets TrainCheck trace only the APIs and variables needed for those checks.
+
+```bash
+traincheck-collect \
+ --pyscript target.py \
+ --models-to-track model \
+ --invariants invariants.json \
+ --output-dir target_trace
 ```
-For detailed setup (CUDA configuration, UV, conda environments), see the [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/).
+For long target runs, trace fewer steps:
+```bash
+traincheck-collect \
+ --pyscript target.py \
+ --models-to-track model \
+ --invariants invariants.json \
+ --sampling-interval 10 \
+ --warm-up-steps 10 \
+ --output-dir target_trace
+```
-### How It Works
-1. **Instrument**: We wrap your training loop with lightweight probes—no code changes needed.
-2. **Learn**: We analyze correct runs to infer *invariants* (mathematical rules of healthy training).
-3. **Check**: We monitor new runs in real-time, verifying every step against learned invariants to catch silent logic bugs and hardware faults.
+For live checking, start `traincheck-onlinecheck` while the target run is writing traces:
-![Workflow](https://raw.githubusercontent.com/OrderLab/TrainCheck/main/docs/assets/images/workflow.png)
+```bash
+traincheck-onlinecheck -f target_trace -i invariants.json
+```
-## 🔥 Try TrainCheck
+The easier offline path is to wait for trace collection to finish, then run:
-Work through [5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md). You’ll learn how to:
- - Instrument a training script and collect a trace
- - Automatically infer invariants
- - Uncover silent bugs in the training script
+```bash
+traincheck-check -f target_trace -i invariants.json
+```
+
+Both checkers write a results directory with failure logs and a `report.html` summary.
-## Documentation
+## Learn More
-- **[Installation Guide](https://orderlab.io/TrainCheck/installation-guide/)**
-- **[Usage Guide: Scenarios and Limitations](https://orderlab.io/TrainCheck/usage-guide/)**
-- **[TrainCheck Technical Doc](https://orderlab.io/TrainCheck/technical-doc/)**
+- [Use TrainCheck](https://orderlab.io/TrainCheck/usage-guide/) explains the full workflow and output files.
+- [5-Minute Tutorial](./docs/5-min-tutorial.md) walks through a real silent training issue.
+- [Installation Guide](https://orderlab.io/TrainCheck/installation-guide/) covers environment setup.
+- [Technical Documentation](https://orderlab.io/TrainCheck/technical-doc/) describes invariants, trace representation, and implementation details.
 ## Status
-TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support. You can also reach the team at [traincheck@umich.edu](mailto:traincheck@umich.edu).
-We welcome feedback and contributions from early adopters.
+TrainCheck is under active development. Please join our [Discord server](https://discord.gg/VwxpJDvB), file a GitHub issue, or email [traincheck@umich.edu](mailto:traincheck@umich.edu).
 ## Contributing
-We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](./CONTRIBUTING.md) for how to get involved.
+We welcome contributions. See [Contributing to TrainCheck](./CONTRIBUTING.md) for setup and contribution guidance.
 ## License
@@ -77,6 +106,7 @@ TrainCheck is licensed under the [Apache License 2.0](./LICENSE).
 ## Citation
 If TrainCheck is relevant to your work, please cite our paper:
+
 ```bib
 @inproceedings{TrainCheckOSDI2025,
 author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
@@ -90,7 +120,6 @@ If TrainCheck is relevant to your work, please cite our paper:
 }
 ```
-
 ## Artifact Evaluation
-🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](./docs/ae.md).
+OSDI AE members should use the [TrainCheck AE Guide](./docs/ae.md).
diff --git a/docs/ae-eval-s5.5-perf-overhead.md b/docs/ae-eval-s5.5-perf-overhead.md
index 966400c0..6900fab0 100644
--- a/docs/ae-eval-s5.5-perf-overhead.md
+++ b/docs/ae-eval-s5.5-perf-overhead.md
@@ -21,7 +21,7 @@ This evaluation measures the runtime overhead introduced by TrainCheck’s instr
 - Located in [overhead-e2e](../eval_scripts/perf_benchmark/overhead-e2e)
 - The deployed 100 invariants:
- [eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json](../eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json)
+ `eval_scripts/perf_benchmark/overhead-e2e/sampled_100_invariants.json`
 ## 🛠 How to Run
diff --git a/docs/check.md b/docs/check.md
index 68e5607f..1959ed8f 100644
--- a/docs/check.md
+++ b/docs/check.md
@@ -1,110 +1,152 @@
-# TrainCheck Checker Usage Guide
+# CLI Reference: Check Traces
-`traincheck-check` is the **final stage** of the TrainCheck workflow. It verifies a set of invariants against trace files or streams from target programs, reporting any detected violations—helping you catch silent issues in your ML training pipelines.
+Start with [Use TrainCheck](usage-guide.md) if you want the full workflow. This page explains `traincheck-onlinecheck` and `traincheck-check`.
-## 🔧 Checking Modes
+TrainCheck has two checking modes:
-TrainCheck supports two checking modes:
+- `traincheck-onlinecheck` checks traces while `traincheck-collect` is still writing them.
+- `traincheck-check` checks completed trace files after collection finishes.
-- **Post-training Checking (`traincheck-check`)**:
- Perform invariant checking on completed trace files after the training job finishes. ✅
+Use online checking when you want violations during a running job. Use offline checking when you want the easiest path or a reproducible local workflow.
-- **On-the-fly Checking (`traincheck-onlinecheck`):**
- Perform real-time checking while the target training job is running. ✅
+## Live Checking
-## How to Use: On-the-fly Checking
-
-While training is in progress with `traincheck-collect`, run the following command:
+Start trace collection for the target run:
 ```bash
-traincheck-onlinecheck -f -i
+traincheck-collect \
+ --pyscript target.py \
+ --models-to-track model \
+ --invariants invariants.json \
+ --output-dir target_trace
 ```
-- `-f `: Path to the folder where traces are:
- - Already collected, or
- - **Actively being collected** by `traincheck-collect` during the training job.
+In another terminal, start the online checker:
-- `-i `: Path to the JSON file containing inferred invariants.
+```bash
+traincheck-onlinecheck -f target_trace -i invariants.json
+```
-## How to Use: Post-training Checking
+The online checker watches `target_trace/` and updates its report as new traces arrive.
-Run the following command:
+If the command fails with a missing `watchdog` package, install it in the same environment:
 ```bash
-traincheck-check -f -i
+pip install watchdog
 ```
-- `-f `: Path to the folder containing traces collected by `traincheck-collect`.
-- `-i `: Path to the JSON file containing inferred invariants.
+Control the report refresh interval with:
-## Report Visualization Options
-
-Both checkers can produce a standalone HTML report and optionally log summary metrics to external monitoring tools.
+```bash
+traincheck-onlinecheck \
+ -f target_trace \
+ -i invariants.json \
+ --report-interval-seconds 30
+```
-### Standalone HTML Report (default)
+## Offline Checking
-- Output: `/report.html`
-- Includes summary counts, relation breakdown, and top violations.
-- Disable with `--no-html-report`.
+The offline path is simpler. First let `traincheck-collect` finish, then run:
-**Offline example**
 ```bash
-traincheck-check -f -i
+traincheck-check -f target_trace -i invariants.json
 ```
-**Online example**
+Offline checking reads the completed trace folder and writes a results directory.
+
+## Sampling and Checking
+
+Sampling is configured during trace collection:
+
 ```bash
-traincheck-onlinecheck -f -i
+traincheck-collect \
+ --pyscript target.py \
+ --models-to-track model \
+ --invariants invariants.json \
+ --sampling-interval 10 \
+ --warm-up-steps 10 \
+ --output-dir target_trace
 ```
-### W&B Integration
-
-Enable with `--report-wandb`. You can also pass:
-`--wandb-project`, `--wandb-entity`, `--wandb-run-name`, `--wandb-group`, `--wandb-tags`.
+Then run either checker normally:
 ```bash
-traincheck-check -f -i \
- --report-wandb --wandb-project
+traincheck-onlinecheck -f target_trace -i invariants.json
 ```
 ```bash
-traincheck-onlinecheck -f -i \
- --report-wandb --wandb-project
+traincheck-check -f target_trace -i invariants.json
 ```
-### MLflow Integration
+The checker does not decide which steps were traced. It checks the trace files that collection produced.
-Enable with `--report-mlflow`. Optional:
-`--mlflow-experiment`, `--mlflow-run-name`.
+## Reports and Logs
+
+Both checkers write:
+
+- `failed.log`: violated invariants.
+- `passed.log`: triggered invariants that passed.
+- `not_triggered.log`: invariants that never ran on the trace.
+- `violations_summary.json`: compact violation summaries.
+- `report.html`: browser-readable summary.
+
+The default output directory is timestamped. Use `-o` or `--output-dir` to choose a path:
 ```bash
-traincheck-check -f -i \
- --report-mlflow --mlflow-experiment
+traincheck-check \
+ -f target_trace \
+ -i invariants.json \
+ --output-dir check_results
 ```
+## W&B and MLflow
+
+Log checker results to Weights & Biases:
+
 ```bash
-traincheck-onlinecheck -f -i \
- --report-mlflow --mlflow-experiment
+traincheck-check \
+ -f target_trace \
+ -i invariants.json \
+ --report-wandb \
+ --wandb-project traincheck
 ```
-### Online Report Refresh
+Attach offline checker metrics to an existing W&B run:
-The online checker refreshes the report when violations change, and also on a periodic timer.
-Control the interval with `--report-interval-seconds` (default: 10).
+```bash
+traincheck-check \
+ -f target_trace \
+ -i invariants.json \
+ --report-wandb \
+ --wandb-run-id
+```
+
+Log checker results to MLflow:
 ```bash
-traincheck-onlinecheck -f -i \
- --report-interval-seconds 30
+traincheck-check \
+ -f target_trace \
+ -i invariants.json \
+ --report-mlflow \
+ --mlflow-experiment traincheck
 ```
-**Note:** W&B and MLflow logging are optional. If the packages are not installed, TrainCheck will skip logging and emit a warning.
+The online checker supports the same W&B and MLflow reporting flags.
-## Interpreting the Results
+## Useful Options
-After running either checking mode, TrainCheck will output a summary of detected invariant violations. Each violation entry typically includes:
+- `-f, --trace-folders`: trace directories produced by `traincheck-collect`.
+- `-t, --traces`: individual trace files.
+- `-i, --invariants`: invariant files produced by `traincheck-infer`.
+- `-o, --output-dir`: results directory.
+- `--no-html-report`: skip `report.html`.
+- `--report-wandb`: log summary metrics and the HTML report to W&B.
+- `--report-mlflow`: log summary metrics and the HTML report to MLflow.
+- `--report-interval-seconds`: online checker report refresh interval.
-- **Trace file or stream name**: Identifies where the issue was found.
-- **Invariant description**: Details the specific invariant that was violated.
-- **Violation details**: Provides context, such as the step or epoch where the violation occurred.
+Run the command help for the complete option list:
-Review these results to pinpoint silent errors or unexpected behaviors in your ML training pipeline. For more information on result formats and how to diagnose issues, see [5. Detection & Diagnosis](./5-min-tutorial.md#5-detection--diagnosis) in the **5-Minute Tutorial**.
+```bash
+traincheck-check --help
+traincheck-onlinecheck --help
+```
diff --git a/docs/index.md b/docs/index.md
index 4c1775bd..08b828d7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,91 +6,73 @@ hide:
 TrainCheck
-Invariant Checking & Observability for AI Training
-Stop flying blind. Validate training dynamics, catch silent errors, and debug with confidence automatically.
- [Get Started](installation-guide.md){ .md-button .md-button--primary }
- [5-Min Tutorial](5-min-tutorial.md){ .md-button }
- [View on GitHub](https://github.com/OrderLab/traincheck){ .md-button }
-✅ Continuous Invariant Checking
-TrainCheck validates the "physics" of your training process in real-time. It ensures your model adheres to learned invariants (such as gradient norms, tensor shapes, and update magnitudes) effectively catching silent corruption before it wastes GPU hours.
-🚀 Holistic Observability
-Traditional tools only show you if your model crashed. TrainCheck shows you why it's degrading, analyzing internal state dynamics that loss curves miss.
-🧠 Zero-Config Validation
-No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
-⚡ Universal Compatibility
-Drop-in support for PyTorch, Hugging Face, and industry-class workloads using DeepSpeed/Megatron and more.
+Invariant Checking for AI Training
+Learn normal training behavior from a healthy run, then catch silent bugs in a target run.
+ [Use TrainCheck](usage-guide.md){ .md-button .md-button--primary }
+ [Install](installation-guide.md){ .md-button }
+ [5-Min Tutorial](5-min-tutorial.md){ .md-button }
----
+TrainCheck catches silent training bugs by tracing PyTorch API calls and model state changes. You give it a reference run that behaves correctly. TrainCheck infers invariants from that run, then checks a target run for violations.
-### How It Works
+## Start with This Workflow
-1. **Instrument**: We wrap your training loop with lightweight probes. No code changes needed.
-2. **Learn**: We analyze correct runs to infer *invariants* (mathematical rules of healthy training).
-3. **Check**: We monitor new runs in real-time, verifying every step against learned invariants to catch silent logic bugs and hardware faults.
+### 1. Collect a Reference Trace
-![Workflow](assets/images/workflow.png)
+```bash
+traincheck-collect \
+ --pyscript reference.py \
+ --models-to-track model \
+ --output-dir reference_trace
+```
-## 🔥 Try TrainCheck
+### 2. Infer Invariants
-Work through [5‑Minute Experience with TrainCheck](5-min-tutorial.md). You’ll learn how to:
- - Instrument a training script and collect a trace
- - Automatically infer invariants
- - Uncover silent bugs in the training script
+```bash
+traincheck-infer -f reference_trace -o invariants.json
+```
-## Documentation
+### 3. Collect a Target Trace
-- **[Installation Guide](installation-guide.md)**
-- **[Usage Guide: Scenarios and Limitations](usage-guide.md)**
-- **[TrainCheck Technical Doc](technical-doc.md)**
-- **[TrainCheck Dev RoadMap](https://github.com/OrderLab/traincheck/blob/main/ROADMAP.md)**
+```bash
+traincheck-collect \
+ --pyscript target.py \
+ --models-to-track model \
+ --invariants invariants.json \
+ --output-dir target_trace
+```
-## Status
+### 4. Check the Target Run
-TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support.
-We welcome feedback and contributions from early adopters.
+Run the live checker while the target training job writes traces:
-## Contributing
+```bash
+traincheck-onlinecheck -f target_trace -i invariants.json
+```
-We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](https://github.com/OrderLab/traincheck/blob/main/CONTRIBUTING.md) for how to get involved.
+The easier offline path is to check after trace collection finishes:
-## License
+```bash
+traincheck-check -f target_trace -i invariants.json
+```
-TrainCheck is licensed under the [Apache License 2.0](https://github.com/OrderLab/traincheck/blob/main/LICENSE).
+Both checkers write violation logs and a `report.html` summary.
-## Citation
+## When to Use TrainCheck
-If TrainCheck is relevant to your work, please cite our paper:
-```bib
-@inproceedings{TrainCheckOSDI2025,
- author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
- title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
- booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
- series = {OSDI '25},
- month = {July},
- year = {2025},
- address = {Boston, MA, USA},
- publisher = {USENIX Association},
-}
-```
+- You changed a training pipeline and want to catch silent logic errors early.
+- A run behaves strangely, but normal metrics do not explain why.
+- You want to compare a target run against a healthy reference run or an official example.
+- You need lower-overhead tracing for a long run; use selective collection with `--invariants` and step sampling.
-## Artifact Evaluation
+## Documentation
-🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](ae.md).
+- [Use TrainCheck](usage-guide.md)
+- [Installation Guide](installation-guide.md)
+- [5-Minute Tutorial](5-min-tutorial.md)
+- [CLI Reference: Collect](instr.md)
+- [CLI Reference: Infer](infer.md)
+- [CLI Reference: Check](check.md)
+- [Technical Documentation](technical-doc.md)
+- [Performance Benchmarks](benchmarks.md)
diff --git a/docs/infer.md b/docs/infer.md
index 9aebdaa7..eb672c1f 100644
--- a/docs/infer.md
+++ b/docs/infer.md
@@ -1,65 +1,77 @@
-# Invariant Inference & Representation
+# CLI Reference: Infer Invariants
-`traincheck-infer` is part of the **inference stage** of the TrainCheck workflow. It consumes trace files collected from correct training runs and infers behavioral invariants that describe expected runtime behavior. These invariants are later used by `traincheck-check` to detect violations in other training pipelines.
+Start with [Use TrainCheck](usage-guide.md) if you want the full workflow. This page explains the `traincheck-infer` command.
-## 📚 Table of Contents
-- [🔧 Basic Usage](#-basic-usage)
-- [⚙️ Advanced Usage](#️-advanced-usage)
-- [📘 Invariant Concepts](#-invariant-concepts)
-- [🧪 Guidelines: Choosing Input Pipelines](#-practical-guidelines-choosing-input-pipelines)
-- [🧠 Tips: Performance and Stability](#-tips-performance-and-stability)
-- [🔗 Next Step](TODO)
+`traincheck-infer` reads traces from known-good runs and writes invariants. The checker uses those invariants to detect behavior that differs from the reference runs.
-## 🔧 Basic Usage
+## Basic Usage
-In most cases, you only need to specify one or more folders (generated by `traincheck-collect`) containing trace files using the `-f` or `--trace-folders` flag:
+Infer invariants from one reference trace folder:
 ```bash
-traincheck-infer -f ./traincheck_mnist_trace ./traincheck_84911_trace ..
+traincheck-infer -f reference_trace -o invariants.json
 ```
-You can provide multiple folders to aggregate traces from different correct runs or programs. This helps TrainCheck generalize better and avoid overfitting to any single pipeline, reducing false positives during checking—especially when the inferred invariants are applied to unrelated or structurally different pipelines.
+Infer from multiple reference trace folders:
-This command will infer invariants from all trace folders provided, and output invariants into `invariants.json`.
+```bash
+traincheck-infer \
+ -f reference_trace_1 reference_trace_2 reference_trace_3 \
+ -o invariants.json
+```
+
+TrainCheck reads files named like `trace_*.json` and `proxy_log.json` from each folder.
+
+## Choosing Input Traces
+
+Choose traces from runs that should be correct. A short run is usually enough because training loops repeat the same API patterns many times.
+
+Use multiple reference traces when the target pipeline uses behavior that one reference run does not cover, such as mixed precision, distributed training, gradient clipping, or a different optimizer.
-## ⚙️ Advanced Usage
-traincheck-infer provides additional flags for customization and debugging. Some concepts such as "relation" will be explained later.
+## Useful Options
-1. `-o, --output`: Specify a custom file name for the invariants.
-2. `--disable-relation` / `--enable-relation`: Control which types of invariants to infer. This is useful for reducing noise or targeting specific checks.
- ```bash
- # Disable ordering-based invariants
- traincheck-infer -f ./traces --disable-relation FunctionLeadRelation FunctionCoverRelation
+- `-f, --trace-folders`: trace directories produced by `traincheck-collect`.
+- `-t, --traces`: individual trace files.
+- `-o, --output`: invariant file path. The default is `invariants.json`.
+- `--disable-relation`: skip specific invariant relation types.
+- `--enable-relation`: infer only specific invariant relation types.
+- `--disable-precond-sampling`: disable example sampling during precondition inference.
+- `--precond-sampling-threshold`: set the precondition sampling threshold.
+- `-b, --backend`: choose `pandas`, `polars`, or `dict` for trace processing.
- # Enable only contain and variable consistency invariants
- traincheck-infer -f ./traces --enable-relation APIContainRelation ConsistencyRelation
- ```
- > See [traincheck.invariant.relation_pool](../traincheck/invariant/relation_pool.py) for a complete list of invariants.
-3. `-b, --backend`: Select the data processing engine for trace handling.
- - `pandas` (default): stable and well-tested.
- - `polars`: faster for large traces (experimental)
- - `dict`: pure Python dictionary backend (experimental)
+Run the command help for the complete option list:
-> Other flags (e.g. `--debug`, `-t --traces`) are available via traincheck-infer --help, but are rarely needed unless you are debugging or developing TrainCheck itself.
+```bash
+traincheck-infer --help
+```
+
+## Relation Filtering
+Relation filtering is useful when a reference trace overfits to ordering details that do not matter for your target run.
-## 📘 Invariant Concepts
+For example, disable ordering-based relations:
+
+```bash
+traincheck-infer \
+ -f reference_trace \
+ -o invariants.json \
+ --disable-relation FunctionLeadRelation FunctionCoverRelation
+```
-TrainCheck infers **invariants** — logical properties that are consistently held during correct training runs. These invariants are used to define the *expected* behavior of a training pipeline, and later help detect silent issues when applied to other runs.
+Enable only specific relation types:
-Each invariant describes a specific pattern of behavior observed in the trace, such as:
-- Attribute changes during a function call (e.g., `.grad` becomes `None` in `zero_grad()`)
-- Ordering relationships between API calls (e.g., `zero_grad()` should occur before `step()`)
-- Consistency among values across different parameters (e.g., shared parameters should have the same value across devices during distributed training)
+```bash
+traincheck-infer \
+ -f reference_trace \
+ -o invariants.json \
+ --enable-relation APIContainRelation ConsistencyRelation
+```
-### Invariant Representation
+## Invariant File
-An invariant is defined by three things:
-1. **relation**: the relationship this invariant encodes, can be viewed as an invariant template. Each relation has a separate inference algorithm defined (e.g., [ConsistencyRelation.infer](../traincheck/invariant/consistency_relation.py))
-2. **params**: descriptors for entities that should obey the relationship.
-3. **precondition**: a logical predicate defining the context when an invariant can be applied.
+The output file is JSON Lines: one invariant per line. Each invariant describes a relation that held in the reference trace, plus a precondition that says when the relation applies.
-In the actual json representation of invariants in the `traincheck-infer` output, an invariant looks like this.
+Example:
 ```json
 {
@@ -78,131 +90,20 @@ In the actual json representation of invariants in the `traincheck-infer` output
 "post_value": null
 }
 ],
- "precondition": {
- "parent_func_call_pre": {
- "inverted": true,
- "preconditions": [
- {
- "clauses": [
- {
- "type": "constant",
- "prop_name": "meta_vars.step",
- "additional_path": "None",
- "prop_dtype": "int",
- "values": [
- 0
- ]
- }
- ]
- },
- {
- "clauses": [
- {
- "type": "constant",
- "prop_name": "meta_vars.stage",
- "additional_path": "None",
- "prop_dtype": "str",
- "values": [
- "init",
- "testing"
- ]
- }
- ]
- }
- ]
- }
- },
- "num_positive_examples": 200,
- "num_negative_examples": 1
+ "num_positive_examples": 200
 }
 ```
-This invariant encodes the expectation that calling torch.optim.optimizer.Optimizer.zero_grad() should reset gradients — that is, the .grad attribute of torch.nn.Parameter objects should transition from a non-zero value to null (i.e., None or missing).
-- **text_description:**
- - A human-readable summary of the invariant.
- > Note: This field is generated using a best-effort strategy and may not fully reflect the invariant’s semantics. In some cases, it may be missing or incomplete. 📆 We are planning to further formalize this field in the future.
-
-- **relation: "APIContainRelation"**
- - An event is expected to happen within the duration of an API invocation.
-
-- **params:**
- - An API call: `zero_grad()` on a PyTorch optimizer
- - An attribute: `.grad` on a `torch.nn.Parameter`, which should change from a non-zero value (`"pre_value": "non_zero"`) to null (`"post_value": null`) during the call
-
-- **precondition:**
- This invariant only applies **outside** the following contexts:
- - The first step of training (`meta_vars.step == 0`)
- - The init or testing stages (`meta_vars.stage in {"init", "testing"}`)
- > These are specified as inverted preconditions, meaning the invariant does not apply during those times (e.g., it’s okay to not clear .grad on the first step when nothing has been backpropagated yet).
-
-- **num_positive_examples: 20**
- This behavior was observed and confirmed 200 times in the reference traces.
-
-- **num_negative_examples: 1**
- The invariant failed once — in this case, during the first training iteration, when .grad had not yet been populated before the zero_grad() call.
- > **🎯 This behavior is expected and correctly handled by the precondition, which excludes step 0.**
-
-### Invariant Inference Workflow
-
-At a high level, TrainCheck performs invariant inference in three stages:
+This invariant says that `Optimizer.zero_grad()` normally clears parameter gradients in the observed context.
-1. Hypothesis Generation
+## Next Step
- For each supported relation type, TrainCheck scans the provided traces and generates hypotheses by identifying patterns where a potential invariant could exist (i.e., when matching examples are observed).
+Collect a target trace with the invariant file:
-2. Example Collection
-
- For every hypothesis, TrainCheck performs a full scan across all provided traces to gather positive examples (where the hypothesized invariant holds) and negative examples (where it does not).
-
-3. Precondition Deduction
-
- TrainCheck analyzes the collected examples to infer a distinguishing predicate—a logical condition that holds true for all positive examples and false for negative ones. This predicate becomes the invariant’s precondition, reducing false positives during checking.
-
-⚙️ For full details on the inference algorithms, please refer to our OSDI’25 paper (documentation is in progress).
-
-## 🧪 Practical Guidelines: Choosing Input Pipelines
-
-When selecting input pipelines for invariant inference, there are two main considerations:
-
-1. **Representativeness**
-
- You want your input pipelines to be diverse enough to infer a representative set of invariants. This helps:
- - Avoid overfitting to specific patterns.
- - Ensure that inferred invariants and preconditions remain accurate across varying scenarios.
-
- For example, if none of your input pipelines use mixed precision, TrainCheck might infer invariants like:
-
- > "For mathematical operations, the output dtype must equal the input dtype."
-
- However, if mixed precision pipelines are included, TrainCheck will refine such invariants by adding preconditions like:
- > "This applies only when a torch.autocast context manager is not active."
-
- **⚡ How many pipelines should you include?** It depends on how different your target pipeline is from available reference pipelines:
- - If the target is a minor variant of a known-good pipeline, using just that reference may suffice.
- - If the target pipeline introduces new frameworks, tasks, or architectures, include a broader set of inputs to improve generalization.
-
-
-2. **Inference Time**
-
- Inference time is generally not a major concern, since inference happens offline. However, due to the repetitive nature of training loops, you can safely shorten reference runs without sacrificing invariant quality.
-
- In practice:
- - For all bugs detected by TrainCheck so far, we limited inference traces to at most 100 iterations.
- Shortened runs have shown no significant impact on the usefulness or accuracy of inferred invariants.
-
-### Core Principles – A Summary
-- Focus on the diversity of input traces — capturing different configurations, behaviors, or modes of operation.
-- The length or size of traces matters far less.
-- Efficient inference is achievable with short, representative runs.
-
-## Implementation Limitations
-
-TrainCheck operates on large traces with a dynamic schema, where variable types and fields can change over time. This, combined with the need for cross-trace comparisons, limits the use of typical data storage solutions like SQL databases or optimized DataFrame libraries (e.g., Polars), which require fixed schemas.
-
-To handle this, we use in-process Pandas DataFrames backed by NumPy. While effective, this approach is currently single-threaded due to Python’s GIL, leaving room for future performance improvements.
-
-We are exploring options such as shared-memory DataFrames, schema standardization, or schemaless databases (e.g., MongoDB) if data transmission overhead proves manageable.
-
-> Note: While data sharding could improve parallelism, it would overcomplicate cross-trace and cross-time analysis and is better handled at the storage layer rather than within inference logic.
\ No newline at end of file
+```bash
+traincheck-collect \
+ --pyscript target.py \
+ --models-to-track model \
+ --invariants invariants.json \
+ --output-dir target_trace
+```
diff --git a/docs/installation-guide.md b/docs/installation-guide.md
index 91c8dc32..85308a82 100644
--- a/docs/installation-guide.md
+++ b/docs/installation-guide.md
@@ -78,17 +78,21 @@ uv pip install traincheck
 ```
 6. **Verify Installation**
- You should now have three clis installed in your system. Do a quick test to see of these commands are available and functional.
+ You should now have the TrainCheck CLIs installed in your environment. Run:
 ```bash
 traincheck-collect --help
 traincheck-infer --help
+ traincheck-onlinecheck --help
 traincheck-check --help
 ```
 ## Next Steps
+- **Use TrainCheck**
+ Follow the [workflow guide](./usage-guide.md) to collect a reference trace, infer invariants, collect a target trace, and run the checker.
+
 - **5‑Minute TrainCheck Experience**
 Follow the [5‑Minute Tutorial](./5-min-tutorial.md) to instrument a script, infer invariants, and catch silent bugs in under five minutes.
 - **Technical Documentation**
- Explore the [TrainCheck Technical Doc](./technical-doc.md) for a comprehensive guide to features, configuration, and advanced workflows.
\ No newline at end of file
+ Explore the [TrainCheck Technical Doc](./technical-doc.md) for a comprehensive guide to features, configuration, and advanced workflows.
diff --git a/docs/instr.md b/docs/instr.md
index 6f9be6e7..e558fafe 100644
--- a/docs/instr.md
+++ b/docs/instr.md
@@ -1,143 +1,104 @@
-# Instrumentation & Trace Representation
+# CLI Reference: Collect Traces
-`traincheck-collect` is the starting point of TrainCheck's workflow. It instruments your PyTorch training script to capture runtime behavior, generating detailed execution traces for later invariant inference and issue detection.
+Start with [Use TrainCheck](usage-guide.md) if you want the full workflow. This page explains the `traincheck-collect` command.
-This document explains how to use `traincheck-collect` effectively.
-TrainCheck dynamically wraps key PyTorch APIs and monitors model states—**no modifications to your original training code are required**.
+`traincheck-collect` instruments a PyTorch training script and writes trace files. Use it for two jobs:
-Use `traincheck-collect` when you need to:
-- Generate traces from **reference pipelines** for invariant inference.
-- Collect traces from **target pipelines** to detect silent issues using pre-inferred invariants.
+- Full reference collection for invariant inference.
+- Selective target collection for checking with an existing invariant file. -## Table of Contents +## Full Reference Collection -1. [Introduction](#instrumentation--trace-representation) -2. [🔧 Basic Usage](#-basic-usage) - - [Configuration File Example](#configuration-file-example) - - [Running traincheck-collect](#running-traincheck-collect) - - [Selective Instrumentation for Checking](#selective-instrumentation-for-checking) - - [Output Structure](#output-structure) - - [Overriding Configuration via CLI](#overriding-configuration-via-cli) -3. [Adding Meta Variables to Traces](#adding-meta-variables-to-traces) - - [How Meta Variables Improve Inference](#learn-how-meta-variables-improve-invariant-inference) - - [Examples of Useful Meta Variables](#-examples-of-useful-meta-variables) - - [How to Annotate Meta Variables](#how-to-annotate-meta-variables) -4. [Trace Representation](#trace-representation) -5. [Instrumentation Mechanisms](#instrumentation-mechanisms) -6. [Advanced Usage](#advanced-usage) -7. [Algorithms Overview](#algorithms-overview) -8. [Troubleshooting & FAQs](#troubleshooting--faqs) - -## 🔧 Basic Usage - -`traincheck-collect` requires three types of input: - -1. **Python script** to instrument. -2. **Launch arguments** (if any) for executing the script. -3. **Instrumentation-specific configurations**. - -You can provide these inputs either directly via the command line or through a configuration file. -▶️ **Recommendation**: Use a configuration file for clarity and reusability. - -Here’s an example configuration: - -```yaml -pyscript: ./mnist.py # Python entry point of your training program. -shscript: ./run.sh # [Optional] Shell script to launch with custom arguments or environment setup. -modules_to_instr: # Libraries to instrument. Defaults to ['torch'] if omitted. - - torch -models_to_track: # [Optional] Variable names of models to track. Leave empty to disable model tracking. 
- - model -model_tracker_style: proxy # [Optional] Tracking method: "proxy" (default), "subclass", or "sampler". -copy_all_files: false # [Optional] Set true if your code relies on relative paths (e.g., local datasets/configs). -``` - -You can find example configurations and training programs in: - • [MNIST Example](./assets/examples/traincheck-collect/mnist-config/) - • [GPT-2 Pretrain Example](./assets/examples/traincheck-collect/gpt2-pretrain-config/) - -Run TrainCheck trace collection with: +Use full collection on a known-good run: ```bash -traincheck-collect --use-config --config +traincheck-collect \ + --pyscript train.py \ + --models-to-track model \ + --output-dir reference_trace ``` -This command instruments the specified libraries and model variables, then executes your program. -(Details on instrumentation mechanisms and limitations will follow in the next section. TODO) +This command runs `train.py`, tracks the Python variable named `model`, and writes trace files into `reference_trace/`. -### Selective Instrumentation for Checking +## Selective Target Collection -When checking for silent issues, `traincheck-collect` supports selective instrumentation to improve efficiency. -Simply provide the invariants file: +Use selective collection when you already have an invariant file: ```bash -traincheck-collect --use-config --config --invariants +traincheck-collect \ + --pyscript train.py \ + --models-to-track model \ + --invariants invariants.json \ + --output-dir target_trace ``` -TrainCheck will automatically adjust instrumentation granularity based on the provided invariants. +`--invariants` tells TrainCheck which APIs and variables matter for checking. This usually reduces target-run overhead compared with full reference collection. -### Output Structure -By default, TrainCheck creates a folder named: +Do not combine `--invariants` with `--use-full-instr` when you want selective collection. 
+ +## Step Sampling + +Sampling is also configured on `traincheck-collect`: ```bash -traincheck_run___ +traincheck-collect \ + --pyscript train.py \ + --models-to-track model \ + --invariants invariants.json \ + --sampling-interval 10 \ + --warm-up-steps 10 \ + --output-dir target_trace ``` -This folder contains: -- Collected traces -- Instrumented scripts and execution logs (if the program completes successfully) - -You can also provide any additional arguments not specified in the configuration through the commandline interface, such as +This traces the warm-up steps, then traces every tenth step. Use sampling for long target runs after you have confirmed TrainCheck works on a short run. -### Overriding Configuration via CLI +## Config Files -You can override or supplement configuration settings by providing additional arguments directly via the command line. For example: +Use `--use-config` when the collection command needs repeated options: ```bash -# Write trace files to ./trace_training instead of using the default auto-generated folder name -traincheck-collect --use-config --config --output-dir trace_training +traincheck-collect --use-config --config traincheck.yml ``` -To view all available command-line arguments and configuration options, run: +Example: -```bash -traincheck-collect --help +```yaml +pyscript: ./train.py +shscript: ./run.sh +modules_to_instr: + - torch +models_to_track: + - model +model_tracker_style: proxy +copy_all_files: false +output_dir: traincheck_trace ``` -**Note**: When using a configuration file, replace hyphens (-) in argument names with underscores (_). -For example: -- Command-line: `--output-dir trace_training` -- Configuration file: `output_dir: trace_training` - -## Adding Meta Variables to Traces - -You can enhance your traces by providing **custom meta variables**—semantic information about your program's execution. 
These annotations improve the **quality and precision** of inferred invariants by offering context that might not be directly observable from raw traces. +Config keys use underscores, not hyphens. For example, the CLI flag `--output-dir` becomes `output_dir` in YAML. -
-Learn how meta variables improve invariant inference +## Useful Options -TrainCheck infers **preconditions** for each invariant—these are predicates that distinguish between positive and negative examples in the trace. -- A **positive example** is a trace segment where the invariant holds. -- A **negative example** is where it is violated. +- `--pyscript`: Python entry point for the training program. +- `--shscript`: shell script used to launch the Python program. +- `--models-to-track`: model variable names to track. +- `--modules-to-instr`: Python modules to instrument, usually `torch`. +- `--invariants`: invariant files for selective collection. +- `--output-dir`: directory for traces and logs. +- `--sampling-interval`: collect every Nth step after warm-up. +- `--warm-up-steps`: collect the first N steps. +- `--copy-all-files`: copy files beside the training script into the output directory. +- `--model-tracker-style`: choose `proxy`, `subclass`, or `sampler`. -Many invariants are inherently **conditional**, meaning they only hold true under certain contexts (e.g., during training but not initialization). TrainCheck tries to automatically discover such conditions. +Run the command help for the complete option list: -However, trace data alone may lack sufficient context. This is where **meta variables** come in—they inject semantic hints (like execution phase or step number) to guide smarter inference. - -
- -### ✨ Examples of Useful Meta Variables -1. **`stage`** — Indicates whether a trace record belongs to initialization, training, or evaluation. -2. **`step_id`** — The current training step or iteration number. -3. **Custom arguments** — Any domain-specific flags or parameters relevant to your training logic. +```bash +traincheck-collect --help +``` -### How to Annotate Meta Variables -📌 **[To Be Documented]** -Instructions for defining and injecting meta variables into traces will be provided in a future update. +## Output Files -## Trace Representation -📌 **[To Be Documented]** +The output directory contains trace files, environment metadata, logs, and the instrumented training script. The checker accepts the full output directory through `-f` or `--trace-folders`. -## Instrumentation Mechanisms -📌 **[To Be Documented]** -Details about TrainCheck’s instrumentation strategies, supported APIs, and limitations will be covered here later. +```bash +traincheck-check -f target_trace -i invariants.json +``` diff --git a/docs/technical-doc.md b/docs/technical-doc.md index a49d6a4f..b7be4903 100644 --- a/docs/technical-doc.md +++ b/docs/technical-doc.md @@ -2,23 +2,25 @@ TrainCheck is a lightweight, invariant-based instrumentation and analysis tool for identifying silent correctness issues in PyTorch training pipelines. It infers behavioral invariants from correct reference runs (e.g., official examples or clean configurations), then checks other scripts for behavioral violations. TrainCheck is designed to be minimally intrusive—requiring no code modifications or rewrites of training logic. -## 🔧 System Overview +## System Overview -TrainCheck consists of three core command-line utilities: +TrainCheck consists of four user-facing command-line utilities: 1. **traincheck-collect** – Instruments a training pipeline and collects trace logs. 2. **traincheck-infer** – Infers behavioral invariants from the collected traces. -3. 
**traincheck-check** – Checks new traces against a set of inferred invariants to detect silent issues. +3. **traincheck-onlinecheck** – Checks a target trace folder while training is still running. +4. **traincheck-check** – Checks completed traces against inferred invariants. TrainCheck workflows are organized into two stages: -1. **🧪 Inference Stage** +1. **Inference Stage** - **traincheck-collect** collects execution traces from reference training pipelines. - **traincheck-infer** analyzes traces and produces invariants that describe correct/expected runtime behavior. -2. **🚨 Checking Stage** - - **traincheck-collect** is used again to trace the target (possibly buggy) pipeline. - - **traincheck-check** verifies whether the collected trace violates any of the known invariants. +2. **Checking Stage** + - **traincheck-collect** traces the target pipeline with `--invariants` for selective collection. + - **traincheck-onlinecheck** verifies traces while the target run is active. + - **traincheck-check** verifies completed traces after collection finishes. ### 📦 Pre-Inferred Invariants (On the Roadmap) @@ -29,15 +31,15 @@ You may still want to run inference in the following cases: - When working with custom training stacks outside supported libraries. - When you want to increase specificity by inferring invariants from a set of related, known-good pipelines (e.g. in industrial settings). -## 📚 Component Documentation +## Component Documentation -Each utility is documented separately: +Start with [Use TrainCheck](usage-guide.md) for the workflow. Use these pages as command references: -- [Collecting Traces with traincheck-collect](instr.md) - Usage, instrumentation caveats, and trace file format. +- [Collecting Traces with traincheck-collect](instr.md) + Collection modes, config files, model tracking, selective collection, and sampling. 
-- [Inferring Invariants with traincheck-infer](infer.md) -CLI usage, performance considerations, invariant format, and the inference algorithm (relations, preconditions, etc.). +- [Inferring Invariants with traincheck-infer](infer.md) + Trace inputs, invariant outputs, relation filtering, and inference options. -- [Checking Violations with traincheck-check](check.md) -How to apply invariants to new traces, result interpretation, and result file formats. +- [Checking Violations](check.md) + Live checking, offline checking, reports, and integrations. diff --git a/docs/usage-guide.md b/docs/usage-guide.md index 39f12cbd..7105fb5d 100644 --- a/docs/usage-guide.md +++ b/docs/usage-guide.md @@ -1,40 +1,155 @@ -# 🧪 TrainCheck: Usage Guide +# Use TrainCheck -TrainCheck helps detect and diagnose silent errors in deep learning training runs—issues that don't crash your code but silently break correctness. +This guide is for an ML engineer who has a training script and wants to know which TrainCheck command to run next. -## 🚀 Quick Start +TrainCheck uses a reference run to learn invariants: rules that describe normal training behavior. It then checks a target run against those invariants and reports violations. -Check out the [5-minute guide](5-min-tutorial.md) for a minimal working example. +## The Default Workflow -## ✅ Common Use Cases +### 1. Collect a Reference Trace -TrainCheck is useful when your training process doesn’t converge, behaves inconsistently, or silently fails. It can help you: +Start with a short, known-good run. The run can come from your own training code, a previous clean run, or an official example that uses the same framework features. 
-- **Monitor** long-running training jobs and catch issues early -- **Debug** finished runs and pinpoint where things went wrong -- **Sanity-check** new pipelines, code changes, or infrastructure upgrades +```bash +traincheck-collect \ + --pyscript reference.py \ + --models-to-track model \ + --output-dir reference_trace +``` -TrainCheck detects a range of correctness issues—like misused APIs, incorrect training logic, or hardware faults—without requiring labels or modifications to your training code. +`--models-to-track model` names the Python variable that holds the model in `reference.py`. If your script uses a different variable name, use that name instead. -**While TrainCheck focuses on correctness, it’s also useful for *ruling out bugs* so you can focus on algorithm design with confidence.** +TrainCheck writes trace files and an `env_dump.txt` file into `reference_trace/`. -## 🧠 Tips for Effective Use +### 2. Infer Invariants -1. **Use short runs to reduce overhead.** - If your hardware is stable, you can validate just the beginning of training. Use smaller models and fewer iterations to speed up turnaround time. +Use the reference trace to produce an invariant file: -2. **Choose good reference runs for inference.** - - If you have a past run of the same code that worked well, just use that. - - You can also use small-scale example pipelines that cover different features of the framework (e.g., various optimizers, mixed precision, optional flags). - - If you're debugging a new or niche feature with limited history, try using the official example as a reference. Even if the example is not bug-free, invariant violations can still highlight behavioral differences between your run and the example, helping you debug faster. +```bash +traincheck-infer -f reference_trace -o invariants.json +``` -3. 
**Minimize scale when collecting traces.** - - Shrink the pipeline by using a smaller model, running for only ~10 iterations, and using the minimal necessary compute setup (e.g., 2 nodes for distributed training). +You can pass multiple reference trace folders when one run does not cover enough behavior: +```bash +traincheck-infer -f reference_trace_1 reference_trace_2 -o invariants.json +``` -## 🚧 Current Limitations +More diverse reference traces reduce overfitting. Short runs are usually enough because training loops repeat the same operations many times. -- **Eager mode only.** TrainCheck instrumentor currently works only in PyTorch eager mode. Features like `torch.compile` are disabled during instrumentation. +### 3. Collect a Target Trace -- **Not fully real-time (yet).** Invariant checking is semi-online. Full real-time support is planned but not yet available. +Run the target script with the inferred invariants: +```bash +traincheck-collect \ + --pyscript target.py \ + --models-to-track model \ + --invariants invariants.json \ + --output-dir target_trace +``` + +Passing `--invariants` enables selective trace collection. TrainCheck traces the APIs and variables needed by the invariant file instead of collecting a full reference-style trace. + +For long target runs, sample steps during collection: + +```bash +traincheck-collect \ + --pyscript target.py \ + --models-to-track model \ + --invariants invariants.json \ + --sampling-interval 10 \ + --warm-up-steps 10 \ + --output-dir target_trace +``` + +This traces the warm-up steps, then traces every tenth step. Sampling is a `traincheck-collect` option. The checker reads the collected trace; it does not control which steps were traced. + +### 4. 
Check the Target Run
+
+For live checking, start the online checker while `traincheck-collect` is still writing into `target_trace/`:
+
+```bash
+traincheck-onlinecheck -f target_trace -i invariants.json
+```
+
+If `traincheck-onlinecheck` fails with a missing `watchdog` package, install it in the same environment:
+
+```bash
+pip install watchdog
+```
+
+The easier path is offline checking. Wait for trace collection to finish, then run:
+
+```bash
+traincheck-check -f target_trace -i invariants.json
+```
+
+Use offline checking first when you are learning TrainCheck or reproducing an issue locally. Use live checking when you want violations while the training job is still running.
+
+## What TrainCheck Writes
+
+`traincheck-collect` writes:
+
+- `trace_*.json` files for API traces.
+- `proxy_log.json` when model-variable tracking is active.
+- `env_dump.txt` with the collection arguments.
+- an instrumented copy of the training script and execution logs.
+
+`traincheck-infer` writes:
+
+- `invariants.json` by default, or the path passed with `-o`.
+
+`traincheck-check` and `traincheck-onlinecheck` write:
+
+- `failed.log` for violated invariants.
+- `passed.log` for triggered invariants that passed.
+- `not_triggered.log` for invariants that never ran on the trace.
+- `violations_summary.json` for machine-readable violation summaries.
+- `report.html` for a browser-readable report.
+
+## Choosing Reference and Target Runs
+
+Use a reference run that should be correct. If you have a clean run of the same pipeline, use it. If you are debugging a new pipeline, start with an official example or a smaller training job that uses the same optimizer, precision mode, and distributed setup.
+
+Use a target run for the script you want to check. It can be a modified version of the reference script, a larger training job, or a run that already looks suspicious.
+
+Keep both runs short while you are iterating.
TrainCheck needs representative behavior more than long training time.
+
+## Common Adjustments
+
+Use a config file when the command line gets long:
+
+```bash
+traincheck-collect --use-config --config traincheck.yml
+```
+
+A minimal config looks like this:
+
+```yaml
+pyscript: ./train.py
+models_to_track:
+  - model
+modules_to_instr:
+  - torch
+output_dir: traincheck_trace
+```
+
+When using a config file, use underscores in config keys, such as `output_dir`, instead of CLI-style hyphens.
+
+Use a shell script when your training command needs environment variables or launcher arguments:
+
+```yaml
+pyscript: ./train.py
+shscript: ./run.sh
+models_to_track:
+  - model
+```
+
+Use `--copy-all-files` if the training script reads local files through relative paths.
+
+## Current Limits
+
+TrainCheck currently instruments PyTorch eager-mode execution. If your script uses `torch.compile`, pass `--use-torch-compile` so TrainCheck can keep compatibility behavior explicit.
+
+Tracing adds overhead. Start with short runs, selective target tracing, and step sampling before scaling to a long job.
diff --git a/mkdocs.yml b/mkdocs.yml
index 9cc345e2..b0160422 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -42,10 +42,14 @@ nav:
   - Paper: https://www.usenix.org/conference/osdi25/presentation/jiang
   - Documentation:
     - "Installation Guide": installation-guide.md
-    - "5 Minute Quick Start": 5-min-tutorial.md
+    - "Use TrainCheck": usage-guide.md
+    - "5-Minute Tutorial": 5-min-tutorial.md
+    - "CLI Reference":
+      - "Collect Traces": instr.md
+      - "Infer Invariants": infer.md
+      - "Check Traces": check.md
     - "Success Stories": successful-stories.md
     - "Technical Documentation": technical-doc.md
-    - "Usage Tips": usage-guide.md
     - "Performance Benchmarks": benchmarks.md
 
 markdown_extensions:
diff --git a/pyproject.toml b/pyproject.toml
index d62cabcb..fa95fc77 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -34,7 +34,8 @@ dependencies = [
     "polars",
     "pyyaml",
     "orjson",
-    "numpy"
+    "numpy",
+    "watchdog"
 ]
 readme = "README.md"
 license = {text = "Apache-2.0"}
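
The `pyproject.toml` change above adds `watchdog` as a dependency, and the usage guide in this diff notes that `traincheck-onlinecheck` can fail when that package is missing. A minimal sketch, assuming only the Python standard library, of checking whether the package is importable in the current environment before starting a live check. The helper name `has_module` is hypothetical and is not part of TrainCheck.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be located.

    Hypothetical helper for illustration; it looks the module up
    without importing it.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("watchdog"):
        print("watchdog is importable in this environment.")
    else:
        # The install command below is the one given in the usage guide.
        print("watchdog is missing; install it with: pip install watchdog")
```

Run it with the same interpreter that runs the training script, since the check only reflects the environment it executes in.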