NVIDIA RecSys Examples

Overview

NVIDIA RecSys Examples is a collection of optimized recommender models and components.

The project is organized into two parts:

Examples

- HSTU recommender examples for large-scale ranking and retrieval training, with TorchRec, Megatron-Core, DynamicEmb, training benchmarks, and optimized HSTU attention through fbgemm_gpu_hstu
- HSTU inference with paged GPU KV cache, asynchronous host KV onload/offload, Triton Inference Server, CUDA graph optimization, and C++ deployment with AOTInductor
- Semantic ID generative recommender examples for SID-GR training and retrieval, including hierarchical semantic-ID prediction, Megatron-Core decoder support, TorchRec jagged tensors, baseline beam generation, and KV-cache generate_beam_decode()
- SID-GR inference for long-context, short-decode, large-beam serving with ContextKV/BeamKV/BeamPath runtime abstractions, continuous batching, CUDA graph replay, HTTP /generate, and SGLang comparison benchmarks
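
To illustrate what beam generation over hierarchical semantic IDs means, here is a minimal pure-Python sketch. It is not the repository's implementation and does not use generate_beam_decode(); the names beam_search_semantic_ids and toy_next_log_probs are hypothetical, and the scoring function is a stand-in for a real decoder. Each item is assumed to be identified by a short tuple of codes, one per level, and the search keeps the highest-scoring prefixes at every level.

```python
import math
from typing import Callable, List, Sequence, Tuple

Prefix = Tuple[int, ...]


def beam_search_semantic_ids(
    next_log_probs: Callable[[Prefix], Sequence[float]],
    num_levels: int,
    beam_width: int,
) -> List[Tuple[Prefix, float]]:
    """Return up to `beam_width` (semantic ID, log-prob) pairs, best first."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]  # start from the empty prefix
    for _ in range(num_levels):
        candidates: List[Tuple[Prefix, float]] = []
        for prefix, score in beams:
            # Expand every surviving prefix with every possible next code.
            for code, log_prob in enumerate(next_log_probs(prefix)):
                candidates.append((prefix + (code,), score + log_prob))
        # Keep only the `beam_width` best partial semantic IDs.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_next_log_probs(prefix: Prefix) -> Sequence[float]:
    """Hypothetical model: the same 3-way distribution at every level."""
    return [math.log(0.6), math.log(0.3), math.log(0.1)]


beams = beam_search_semantic_ids(toy_next_log_probs, num_levels=2, beam_width=2)
for semantic_id, score in beams:
    print(semantic_id, round(score, 4))
```

In a real decoder the per-level scores would come from a model forward pass, and a KV cache would avoid recomputing the shared context for every beam.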

Standalone GPU Libraries

- DynamicEmb for model-parallel dynamic embedding tables with GPU/host hash-table storage, TorchRec EmbeddingCollection and EmbeddingBagCollection integration, admission and eviction controls, cache/prefetch support, fused pooling/sequence kernels, and Torch-exportable inference embedding tables
- RecSys KVCache Manager for user-ID-based KV-cache reuse in generative recommender inference, with paged GPU KV tables, asynchronous onboarding/offloading, native pinned-host storage, FlexKV-backed lower-tier storage, and FlexKV CPU breakdown analysis
- Beam search decode attention kernels for SID-GR KV-cache generation, with fused and 3-kernel paths across SM8x, SM90, SM100, and SM120 GPUs
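
As a rough illustration of user-ID-based KV-cache reuse with a paged table, the sketch below tracks which fixed-size pages belong to each user. It is a toy bookkeeping model under stated assumptions, not the RecSys KVCache Manager API: the class PagedKVTable and its methods are hypothetical, no tensors are stored, and offloading to host memory is reduced to releasing pages.

```python
from typing import Dict, List


class PagedKVTable:
    """Toy page table: maps a user ID to the pages holding its cached tokens."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.user_pages: Dict[int, List[int]] = {}
        self.user_len: Dict[int, int] = {}

    def cached_len(self, user_id: int) -> int:
        """Number of tokens already cached for this user (0 if unknown)."""
        return self.user_len.get(user_id, 0)

    def append(self, user_id: int, num_new_tokens: int) -> List[int]:
        """Reserve space for new tokens, reusing the user's existing pages."""
        pages = self.user_pages.get(user_id, [])
        new_len = self.cached_len(user_id) + num_new_tokens
        pages_needed = -(-new_len // self.page_size) - len(pages)  # ceil division
        if pages_needed > len(self.free_pages):
            raise MemoryError("not enough free pages; release a user first")
        for _ in range(pages_needed):
            pages.append(self.free_pages.pop())
        self.user_pages[user_id] = pages
        self.user_len[user_id] = new_len
        return pages

    def release(self, user_id: int) -> None:
        """Return a user's pages to the free pool (stand-in for offloading)."""
        self.free_pages.extend(self.user_pages.pop(user_id, []))
        self.user_len.pop(user_id, None)


table = PagedKVTable(num_pages=4, page_size=16)
table.append(user_id=7, num_new_tokens=20)  # first request: allocates 2 pages
table.append(user_id=7, num_new_tokens=10)  # later request: fits in the same pages
print(table.cached_len(7), len(table.user_pages[7]))
```

The point of the sketch is that a returning user only needs pages for tokens beyond the cached length, so earlier history is not recomputed.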

What's New

[2026/6/15] 🎉v26.05 released!
- Adds a new SID-GR inference example for large-beam generative retrieval serving and benchmarking.
- Enables HSTU + DynamicEmb end-to-end training on Blackwell (sm_100) and refreshes HSTU benchmark fixes, docs, and training examples.
- Extends beam-search decode attention to SM8x and improves DynamicEmb, segmented unique, and FlexKV benchmark coverage.
[2026/5/20] 🎉v26.04 released!
- Refactors the previous async KV-cache manager into a standalone RecSys KVCache Manager package, and adds a new FlexKV backend for multi-node/multi-tier KV storage, LLM-style KV APIs, and updated HSTU inference examples.
- Introduces a new beam-search decode attention kernel, CuTe kernels, vectorized masking utilities, and a generate_beam_decode() entry point, enabling more efficient KV-cache-based beam generation for the SID-GR model.
[2026/4/14] 🎉v26.03 released!
- We added Torch export and AOTInductor packaging for end-to-end HSTU C++ inference. See the HSTU inference overview and the C++ inference guide.
- We improved DynamicEmb with table fusion and expansion, relaxed embedding-table alignment (no longer power-of-two), and capacity sizing aligned to bucket_capacity. See DynamicEmb.
- We added an HSTU end-to-end training benchmark suite with progressive optimizations. See the HSTU training benchmark and E2E benchmark notes.
- We published HSTU inference benchmark results on B200 in the HSTU inference benchmark.
- We migrated HSTU attention to fbgemm_gpu_hstu, removed the legacy compatibility layer, and improved the training stack (fewer device-to-host syncs in jagged tensor handling, balancer tuning, and debug logging). See HSTU training setup.

More

[2026/2/13] 🎉v26.01 released!
- We optimized the HSTU KVCacheManager, moving Python-based KV cache management to an optimized C++ implementation with asynchronous onload/offload operations and compression support. Benchmarks show that onload and offload latency can be fully hidden under HSTU inference.
- We introduced an HSTU training optimization with workload-balanced batch shuffling for data-parallel training.
- We added caching and prefetching support for EmbeddingBagCollection.
[2026/1/13] 🎉v25.12 released!
- Added Triton Inference Server support for HSTU inference. Follow the HSTU inference Triton example to try it out.
- We introduced our first semantic-ID retrieval model example. Follow the semantic-ID retrieval (sid_gr) documentation to run it.
[2025/12/10] 🎉v25.11 released!
- DynamicEmb supports embedding admission, which decides whether a new feature ID is allowed to create or update an embedding entry in the dynamic embedding table. Controlling admission prevents very rare or noisy IDs from consuming parameters and optimizer state that bring little training benefit.
[2025/11/11] 🎉v25.10 released!
- The HSTU training example supports sequence parallelism.
- DynamicEmb supports LRU score checkpointing and gradient clipping.
- Decoupled the scaling sequence length from the maximum sequence length limit in HSTU attention and extended HSTU training support to the SM89 GPU architecture.
[2025/10/20] 🎉v25.09 released!
- Integrated prefetching and caching into the HSTU training example.
- DynamicEmb now supports distributed embedding dumping and memory scaling.
- Added kernel fusion in the HSTU block for inference, including KVCache fixes.
- HSTU attention now supports FP8 quantization.
[2025/9/8] 🎉v25.08 released!
- Added cache support for DynamicEmb, enabling seamless hot embedding migration between cache and storage.
- Released an end-to-end HSTU inference example, demonstrating precision aligned with training.
- Enabled evaluation mode support for DynamicEmb.
[2025/8/1] 🎉v25.07 released!
- Released HSTU inference benchmark, including a paged KV-cache HSTU kernel, a KV-cache manager based on TensorRT-LLM, CUDA graph, and other optimizations.
- Added support for Tensor Parallelism in the HSTU layer.
[2025/7/4] 🎉v25.06 released!
- Improved DynamicEmb lookup module performance and added LFU eviction support.
- Added pipeline support for the HSTU example, recompute support for the HSTU layer, and customized CUDA ops for jagged tensor concat.
[2025/5/29] 🎉v25.05 released!
- Enhanced DynamicEmb functionality, including support for EmbeddingBagCollection, truncated normal initialization, and initial_accumulator_value for Adagrad.
- Fused operations such as layernorm and dropout in the HSTU layer, resulting in about 1.2x end-to-end speedup.
- Fixed convergence issues on the Kuairand dataset.
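
The embedding admission idea described in the v25.11 notes can be sketched as follows. This is a minimal illustration that assumes a simple frequency threshold; it is not DynamicEmb's API, and the class FrequencyAdmission and its parameters are hypothetical. A feature ID only receives its own embedding row after it has been seen a given number of times; before that, lookups return a default vector and no per-ID parameters are created.

```python
from collections import Counter
from typing import Dict, List


class FrequencyAdmission:
    """Admit a feature ID into the table only after `threshold` sightings."""

    def __init__(self, threshold: int, dim: int) -> None:
        self.threshold = threshold
        self.dim = dim
        self.counts: Counter = Counter()         # sightings of not-yet-admitted IDs
        self.table: Dict[int, List[float]] = {}  # admitted IDs -> embedding row

    def lookup(self, feature_id: int) -> List[float]:
        if feature_id in self.table:
            return self.table[feature_id]
        self.counts[feature_id] += 1
        if self.counts[feature_id] >= self.threshold:
            # Seen often enough: create a trainable row and stop counting.
            del self.counts[feature_id]
            self.table[feature_id] = [0.0] * self.dim
            return self.table[feature_id]
        # Not admitted yet: return a default vector without allocating a row.
        return [0.0] * self.dim


admission = FrequencyAdmission(threshold=3, dim=4)
for feature_id in [42, 42, 7, 42]:
    admission.lookup(feature_id)
print(sorted(admission.table))  # only the frequently seen ID has a row
```

A rare ID such as 7 in the usage above never consumes an embedding row or optimizer state unless it keeps reappearing.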

For more detailed release notes, please refer to our releases.

Get Started

Supported examples:

- HSTU recommender examples
- HSTU inference: KV cache, Triton Inference Server, C++ AOTInductor
- SID-based generative recommender examples
- SID-GR inference example

Benchmarks

Contribution Guidelines

Please see our contributing guidelines for details on how to contribute to this project.

Resources

Video

Blog

Community

Join our community channels to ask questions, provide feedback, and interact with other users and developers:

- GitHub Issues: for bug reports and feature requests
- NVIDIA Developer Forums

References

If you use RecSys Examples in your research, please cite:

@Manual{,
  title = {RecSys Examples: A collection of recommender system implementations},
  author = {NVIDIA Corporation},
  year = {2024},
  url = {https://github.com/NVIDIA/recsys-examples},
}

For more citation information and referenced papers, see CITATION.md.

License

This project is licensed under the Apache License - see the LICENSE file for details.
