NVIDIA RecSys Examples is a collection of optimized recommender models and components.
The project is organized into two parts:
- HSTU recommender examples for large-scale ranking and retrieval training, with TorchRec, Megatron-Core, DynamicEmb, training benchmarks, and optimized HSTU attention through `fbgemm_gpu_hstu`
- HSTU inference with paged GPU KV cache, asynchronous host KV onload/offload, Triton Inference Server, CUDA graph optimization, and C++ deployment with AOTInductor
- Semantic ID generative recommender examples for SID-GR training and retrieval, including hierarchical semantic-ID prediction, Megatron-Core decoder support, TorchRec jagged tensors, baseline beam generation, and KV-cache `generate_beam_decode()`
- SID-GR inference for long-context, short-decode, large-beam serving with ContextKV/BeamKV/BeamPath runtime abstractions, continuous batching, CUDA graph replay, HTTP `/generate`, and SGLang comparison benchmarks
- DynamicEmb for model-parallel dynamic embedding tables with GPU/host hash-table storage, TorchRec `EmbeddingCollection` and `EmbeddingBagCollection` integration, admission and eviction controls, cache/prefetch support, fused pooling/sequence kernels, and Torch-exportable inference embedding tables
- RecSys KVCache Manager for user-ID-based KV-cache reuse in generative recommender inference, with paged GPU KV tables, asynchronous onboarding/offloading, native pinned-host storage, FlexKV-backed lower-tier storage, and FlexKV CPU breakdown analysis
- Beam search decode attention kernels for SID-GR KV-cache generation, with fused and 3-kernel paths across SM8x, SM90, SM100, and SM120 GPUs
- [2026/6/15] 🎉v26.05 released!
- Adds a new SID-GR inference example for large-beam generative retrieval serving and benchmarking.
- Enables HSTU + DynamicEmb end-to-end training on Blackwell (`sm_100`) and refreshes HSTU benchmark fixes, docs, and training examples.
- Extends beam-search decode attention to SM8x and improves DynamicEmb, segmented unique, and FlexKV benchmark coverage.
- [2026/5/20] 🎉v26.04 released!
- Refactors the previous async KV-cache manager into a standalone RecSys KVCache Manager package, and adds a new FlexKV backend for multi-node/multi-tier KV storage, LLM-style KV APIs, and updated HSTU inference examples.
- Introduces a new beam-search decode attention kernel and CuTe kernels plus a `generate_beam_decode()` entry point, enabling more efficient KV-cache-based beam generation for the SID-GR model with vectorized masking utilities.
- [2026/4/14] 🎉v26.03 released!
- We added Torch export and AOTInductor packaging for end-to-end HSTU C++ inference. See the HSTU inference overview and the C++ inference guide.
- We improved DynamicEmb with table fusion and expansion, relaxed embedding-table alignment (no longer power-of-two), and capacity sizing aligned to `bucket_capacity`. See DynamicEmb.
- We added an HSTU end-to-end training benchmark suite with progressive optimizations. See the HSTU training benchmark and E2E benchmark notes.
- We published HSTU inference benchmark results on B200 in the HSTU inference benchmark.
- We migrated HSTU attention to `fbgemm_gpu_hstu`, removed the legacy compatibility layer, and improved the training stack (fewer device-to-host syncs in jagged tensor handling, balancer tuning, and debug logging). See HSTU training setup.
- [2026/2/13] 🎉v26.01 released!
- We optimized the HSTU KVCacheManager, moving Python-based KV cache management to an optimized C++ implementation with asynchronous onload/offload operations and compression support. Benchmarks show that onload and offload latency can be fully hidden under HSTU inference.
- We introduced an HSTU training optimization with workload-balanced batch shuffling for data-parallel training.
- We added caching and prefetching support for `EmbeddingBagCollection`.
- [2026/1/13] 🎉v25.12 released!
- Added Triton Inference Server support for HSTU inference. Follow the HSTU inference Triton example to try it out.
- We introduced our first semantic-id retrieval model example. Follow the semantic‑id retrieval (sid_gr) documentation to run it.
- [2025/12/10] 🎉v25.11 released!
- DynamicEmb supports embedding admission, which decides whether a new feature ID is allowed to create or update an embedding entry in the dynamic embedding table. By controlling admission, the system can prevent very rare or noisy IDs from consuming parameters and optimizer state that bring little training benefit.
- [2025/11/11] 🎉v25.10 released!
- HSTU training example supports sequence parallelism.
- DynamicEmb supports LRU score checkpointing and gradient clipping.
- Decoupled the scaling sequence length from the maximum sequence length limit in HSTU attention and extended HSTU support to the SM89 GPU architecture for training.
- [2025/10/20] 🎉v25.09 released!
- Integrated prefetching and caching into the HSTU training example.
- DynamicEmb now supports distributed embedding dumping and memory scaling.
- Added kernel fusion in the HSTU block for inference, including KVCache fixes.
- HSTU attention now supports FP8 quantization.
- [2025/9/8] 🎉v25.08 released!
- Added cache support for DynamicEmb, enabling seamless hot embedding migration between cache and storage.
- Released an end-to-end HSTU inference example, demonstrating precision aligned with training.
- Enabled evaluation mode support for DynamicEmb.
- [2025/8/1] 🎉v25.07 released!
- Released HSTU inference benchmark, including a paged KV-cache HSTU kernel, a KV-cache manager based on TensorRT-LLM, CUDA graph, and other optimizations.
- Added support for Tensor Parallelism in the HSTU layer.
- [2025/7/4] 🎉v25.06 released!
- DynamicEmb lookup module performance improvements and LFU eviction support.
- Pipeline support for HSTU example, recompute support for HSTU layer, and customized CUDA ops for jagged tensor concat.
- [2025/5/29] 🎉v25.05 released!
- Enhancements to DynamicEmb functionality, including support for EmbeddingBagCollection, truncated normal initialization, and initial_accumulator_value for Adagrad.
- Fusion of operations like layernorm and dropout in the HSTU layer, resulting in about 1.2x end-to-end speedup.
- Fixed convergence issues on the Kuairand dataset.
For more detailed release notes, please refer to our releases.
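
To illustrate the embedding admission idea mentioned in the v25.11 notes, here is a minimal, framework-free sketch of one possible policy: a feature ID gets a table entry only after it has been seen a threshold number of times. The names `FrequencyAdmission` and `ToyDynamicTable` are hypothetical and are not part of the DynamicEmb API; the actual admission strategies and their configuration are described in the DynamicEmb documentation.

```python
from collections import Counter


class FrequencyAdmission:
    """Hypothetical policy: admit an ID once it has been seen `threshold` times."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.counts = Counter()

    def admit(self, feature_id: int) -> bool:
        # Count every sighting of a not-yet-admitted ID.
        self.counts[feature_id] += 1
        return self.counts[feature_id] >= self.threshold


class ToyDynamicTable:
    """Toy stand-in for a dynamic embedding table (plain dict, no GPU storage)."""

    def __init__(self, dim: int, policy: FrequencyAdmission):
        self.dim = dim
        self.policy = policy
        self.rows = {}

    def lookup(self, feature_id: int):
        row = self.rows.get(feature_id)
        if row is not None:
            return row
        if self.policy.admit(feature_id):
            # Admitted: allocate a trainable row for this ID.
            row = [0.0] * self.dim
            self.rows[feature_id] = row
            return row
        # Not admitted: return a default vector without storing anything.
        return [0.0] * self.dim


table = ToyDynamicTable(dim=4, policy=FrequencyAdmission(threshold=3))
for fid in [7, 7, 42, 7, 99, 7]:
    table.lookup(fid)

print(sorted(table.rows))  # only ID 7 reached the threshold -> [7]
```

In this sketch, the rare IDs 42 and 99 never allocate a row, so they consume no parameters or optimizer state, which is the effect the release note describes.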
Supported examples:
- HSTU recommender examples
- HSTU inference — KV cache, Triton Inference Server, C++ AOTInductor
- SID based generative recommender examples
- SID-GR inference example
Please see our contributing guidelines for details on how to contribute to this project.
- NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design
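- NVIDIA recsys-examples: Efficient Practices for Large-Scale Training and Inference of Generative Recommender Systems, Part 1 (original title: 生成式推荐系统大规模训练推理的高效实践(上篇))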
Join our community channels to ask questions, provide feedback, and interact with other users and developers:
- GitHub Issues: For bug reports and feature requests
- NVIDIA Developer Forums
If you use RecSys Examples in your research, please cite:
@Manual{,
title = {RecSys Examples: A collection of recommender system implementations},
author = {NVIDIA Corporation},
year = {2024},
url = {https://github.com/NVIDIA/recsys-examples},
}
For more citation information and referenced papers, see CITATION.md.
This project is licensed under the Apache License - see the LICENSE file for details.