issue/407 - Release the GIL; preallocated workspace by pengcheng888 · Pull Request #408 · InfiniTensor/InfiniLM

pengcheng888 · 2026-06-03T12:05:08Z

Summary

Motivation

Closes #

Type of Change

feat — new feature / new model
fix — bug fix
perf — performance improvement (no behavioral change)
refactor — code restructuring without behavior change
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Benchmark / Performance Impact

Notes for Reviewers

Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
No stray merge commits from main — the branch is rebased cleanly on top of the current main.
No fixup! / squash! / wip commits remain.
Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
No unrelated formatting churn that would obscure the diff.
Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
No trailing whitespace, tab/space mixing, or stray BOMs.
Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
All comments and error messages are in English (CONTRIBUTING.md §Code/General).
Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

Code follows the Google C++ Style Guide strictly.
Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
No raw new/delete; RAII / smart pointers / existing allocators are used.
Changed files are formatted by scripts/format.py.
No changes/reference to csrc/models/llama_legacy/.

Python Specific (if Python files changed)

Code is PEP 8 compliant.
Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
Changed files are formatted by scripts/format.py.
No changes/reference to python/infinilm/auto_config.py.

Testing

For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
Passed single request test (examples/test_infer.py), or specify the reason for skipping.
Passed offline performance test (examples/bench.py), or specify the reason for skipping.
Passed sanity test (test/bench/test_benchmark.py), or specify the reason for skipping.
Passed service test (python/infinilm/server/inference_server.py + scripts/test_perf.py), or specify the reason for skipping.

Build, CI, and Tooling

The project builds cleanly from a fresh directory on at least one affected platform.

Documentation

README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
Third-party code is license-compatible and attributed.
No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

pengcheng888 · 2026-06-09T10:03:44Z

Depends on the InfiniCore PR InfiniTensor/InfiniCore#1226.

pengcheng888 · 2026-06-09T10:07:53Z


            auto graph = std::get<0>(result->second.compiled);
-            auto shared_output = std::shared_ptr<InfinilmModel::Output>(new InfinilmModel::Output{std::get<1>(result->second.compiled)->logits->resume_from_blob_()});
+            // Reuse the GraphTensor output captured at compile time.


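On NVIDIA there was a double-free problem. It was traced to the graph also attaching a deleter to the Tensor, which caused the memory to be released a second time. After changing `shared_output`, the double free no longer occurs.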

Why does the second release happen? If you remove this, could some addresses end up never being freed?
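As a generic illustration of the ownership question raised in this thread (not a diagnosis of the actual InfiniCore code), the sketch below shows how two independently constructed `std::shared_ptr` owners of the same object each run their deleter, while copying an existing `std::shared_ptr` runs it only once. `Blob`, `CountingDeleter`, and both helper functions are hypothetical names introduced for this example; the deleter only counts calls so that nothing is really freed twice.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a device blob; not an InfiniCore type.
struct Blob {
    int release_count = 0;
};

// Deleter that only counts calls, so the sketch never frees memory for real.
struct CountingDeleter {
    void operator()(Blob *blob) const { ++blob->release_count; }
};

// Two owners built independently from the same raw pointer: each control
// block runs its deleter once, so the blob is "released" twice.
inline int release_count_with_two_owners() {
    Blob blob;
    {
        std::shared_ptr<Blob> graph_owner(&blob, CountingDeleter{});
        std::shared_ptr<Blob> output_owner(&blob, CountingDeleter{});
    }
    return blob.release_count;
}

// Copying the existing shared_ptr shares one control block, so the deleter
// runs exactly once when the last copy goes out of scope.
inline int release_count_with_shared_owner() {
    Blob blob;
    {
        std::shared_ptr<Blob> graph_owner(&blob, CountingDeleter{});
        std::shared_ptr<Blob> output_owner = graph_owner;
    }
    return blob.release_count;
}
```

Under this reading, reusing the output object captured at compile time keeps a single owner, so the memory is still released once rather than leaked. Whether that matches the real `resume_from_blob_()` behavior is an assumption that would need to be checked against InfiniCore.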

pengcheng888 · 2026-06-10T01:33:24Z

+ * Slots may overlap; scratch_bytes is max span, not sum of slots. Safe use requires
+ * temporal reuse across forward phases.
+ */
+class WorkspaceManager {


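The name `WorkspaceManager` comes from vLLM.

This PR adds `WorkspaceManager` to manage the buffers used during forward.
(1) Modules such as MLP call `register_buffer` in their constructors to describe the buffers they need. (No GPU memory is allocated at this point.)
(2) After the model has been created, `finalize_and_bind` is called. It allocates one large `scratch_buffer_` of type `DataType::U8` based on the aggregated `total_bytes`. (This is the point at which memory is actually allocated.)
(3) During inference this `scratch_buffer_` is reused, and no further tensors are allocated.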
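The three steps above can be sketched roughly as follows. This is a minimal host-memory illustration, not the PR's implementation: the class name `WorkspaceSketch`, the `register_buffer(name, offset, bytes)` signature, and the `data()` accessor are assumptions, and a `std::vector<std::uint8_t>` stands in for the GPU `scratch_buffer_`. It follows the diff comment that slots may overlap, so the total size is the maximum span rather than the sum of the slots.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch; names and signatures are assumptions, not the PR's API.
class WorkspaceSketch {
public:
    // Step 1: record a slot at a caller-chosen offset. Nothing is allocated here.
    void register_buffer(const std::string &name, std::size_t offset, std::size_t bytes) {
        if (finalized_) {
            throw std::logic_error("`register_buffer` was called after `finalize_and_bind`.");
        }
        slots_[name] = Slot{offset, bytes};
        scratch_bytes_ = std::max(scratch_bytes_, offset + bytes);
    }

    // Step 2: allocate one byte buffer that covers the largest span of any slot.
    void finalize_and_bind() {
        scratch_buffer_.assign(scratch_bytes_, 0);
        finalized_ = true;
    }

    // Step 3: hand out the start of a registered slot inside the shared buffer.
    std::uint8_t *data(const std::string &name) {
        if (!finalized_) {
            throw std::logic_error("`data` was called before `finalize_and_bind`.");
        }
        return scratch_buffer_.data() + slots_.at(name).offset;
    }

    std::size_t scratch_bytes() const { return scratch_bytes_; }

private:
    struct Slot {
        std::size_t offset;
        std::size_t bytes;
    };
    std::unordered_map<std::string, Slot> slots_;
    std::vector<std::uint8_t> scratch_buffer_;
    std::size_t scratch_bytes_ = 0;
    bool finalized_ = false;
};
```

For example, registering an `"attn"` slot of 64 bytes and an `"mlp"` slot of 128 bytes, both at offset 0, yields a 128-byte buffer in which the two slots share the same starting address. That is only safe if the two modules use the memory in different forward phases, which is the temporal-reuse condition the diff comment mentions.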

PanZezhong1725 · 2026-06-10T03:03:27Z

                       size_t num_kv_heads,
-                       size_t layer_idx);
+                       size_t layer_idx,
+                       const infinicore::Device &device);


Why does `device` have to be passed to every constructor?

PanZezhong1725 · 2026-06-10T03:04:48Z


            auto graph = std::get<0>(result->second.compiled);
-            auto shared_output = std::shared_ptr<InfinilmModel::Output>(new InfinilmModel::Output{std::get<1>(result->second.compiled)->logits->resume_from_blob_()});
+            // Reuse the GraphTensor output captured at compile time.


Why does the second release happen? If you remove this, could some addresses end up never being freed?

pengcheng888 requested a review from a team June 3, 2026 12:05

pengcheng888 marked this pull request as draft June 3, 2026 12:05

pengcheng888 force-pushed the issue/407 branch from c811d4c to 4bc7f93 Compare June 4, 2026 05:33

pengcheng888 linked an issue Jun 4, 2026 that may be closed by this pull request

[DEV] After releasing the GIL, optimize GPU memory usage during inference #407

Open

pengcheng888 mentioned this pull request Jun 4, 2026

issue/411 - Handle incorrect data copying in special cases. A patch was applied to address the issue of missed data copying. #412

Closed

48 tasks

pengcheng888 force-pushed the issue/407 branch 3 times, most recently from 1c6f631 to c99ea2a Compare June 9, 2026 08:07

ma-hang force-pushed the issue/407 branch from b0dd035 to b38dafd Compare June 9, 2026 10:01

pengcheng888 marked this pull request as ready for review June 9, 2026 10:03

pengcheng888 requested review from PanZezhong1725 and qinyiqun June 9, 2026 10:05

pengcheng888 commented Jun 9, 2026

pengcheng888 commented Jun 10, 2026

pengcheng888 force-pushed the issue/407 branch from b4d262b to 5062b11 Compare June 10, 2026 01:40

PanZezhong1725 requested changes Jun 10, 2026

wangpengcheng and others added 6 commits June 10, 2026 08:11

issue/407 - preallocated workspace

65d8ecf

issue/407 - pybind: release GIL in forward() to avoid blocking other Python threads

e1b65cc

issue/407 - fix: early token budget check

ed1c8c0

refactor with register_inference_buffer.

6bb2040

refactor: improve WorkspaceManager buffer registration

2aea7df

issue/407 - refine the code

6d1fa23

pengcheng888 force-pushed the issue/407 branch from 5062b11 to 6d1fa23 Compare June 10, 2026 08:13

pengcheng888 wants to merge 6 commits into main from issue/407

3 participants
