Issue/340 cf by wooway777 · Pull Request #403 · InfiniTensor/InfiniLM

wooway777 · 2026-05-29T06:59:05Z

Summary

Motivation

Closes #

Type of Change

feat — new feature / new model
fix — bug fix
perf — performance improvement (no behavioral change)
refactor — code restructuring without behavior change
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Benchmark / Performance Impact

Notes for Reviewers

Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
No stray merge commits from main — the branch is rebased cleanly on top of the current main.
No fixup! / squash! / wip commits remain.
Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
No unrelated formatting churn that would obscure the diff.
Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
No trailing whitespace, tab/space mixing, or stray BOMs.
Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
All comments and error messages are in English (CONTRIBUTING.md §Code/General).
Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

Code follows the Google C++ Style Guide strictly.
Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
No raw new/delete; RAII / smart pointers / existing allocators are used.
Changed files are formatted by scripts/format.py.
No changes/reference to csrc/models/llama_legacy/.

Python Specific (if Python files changed)

Code is PEP 8 compliant.
Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
Changed files are formatted by scripts/format.py.
No changes/reference to python/infinilm/auto_config.py.

Testing

For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
Passed single request test (examples/test_infer.py), or specify the reason for skipping.
Passed offline performance test (examples/bench.py), or specify the reason for skipping.
Passed sanity test (test/bench/test_benchmark.py), or specify the reason for skipping.
Passed service test (python/infinilm/server/inference_server.py + scripts/test_perf.py), or specify the reason for skipping.

Build, CI, and Tooling

The project builds cleanly from a fresh directory on at least one affected platform.

Documentation

README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
Third-party code is license-compatible and attributed.
No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

pengcheng888 · 2026-06-08T02:08:43Z

        return result;
    }
+    // chunk-prefill must be checked before decode (decode would also match if chunk_size==1)
+    result = chunk_prefill_compiler_.get()->get_compiled(input);


In the constructor, `chunk_prefill_compiler_` is created inside `if (enable_chunk_prefill_graph_)`. Does this call also need to be guarded by `if (enable_chunk_prefill_graph_)`?

ma-hang · 2026-06-08T06:24:00Z

+                    return SchedulerOutput([req], is_prefill=True)
+                # Already batched some middle chunks; defer this last-chunk one.
+                self.chunking_queue.sync_q.put(req)
+                break


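Why can't this request be scheduled here? Is it because no graph was captured for this case, or is there some other consideration?
Even if it is not scheduled, it probably should not `break`; the remaining requests in the chunking queue should still be checked. In the worst case the second request already hits `chunk_is_last`, and ending early would make utilization very low, wouldn't it?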
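A minimal sketch of the "defer and keep scanning" alternative raised above. Everything here is hypothetical and only illustrates the idea: `Req`, `schedule_chunks`, and `max_batch` are not names from the PR, a plain `deque` stands in for the chunking queue, and any constraint imposed by captured graphs (the open question in this comment) is ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request."""
    rid: int
    chunk_is_last: bool


def schedule_chunks(queue: deque, max_batch: int) -> list:
    """Batch middle-chunk requests; defer last-chunk ones without stopping the scan.

    A last-chunk request found while the batch is still empty is returned alone,
    mirroring the behavior in the diff. One found later is deferred, and the scan
    continues instead of breaking.
    """
    batch, deferred = [], []
    while queue and len(batch) < max_batch:
        req = queue.popleft()
        if req.chunk_is_last:
            if not batch:
                return [req]  # Nothing batched yet, so run it on its own.
            deferred.append(req)
            continue  # Keep scanning instead of ending the round early.
        batch.append(req)
    # Put deferred requests back at the front, preserving their original order.
    for req in reversed(deferred):
        queue.appendleft(req)
    return batch


queue = deque([Req(0, False), Req(1, True), Req(2, False), Req(3, False)])
first_round = schedule_chunks(queue, max_batch=8)
second_round = schedule_chunks(queue, max_batch=8)
```

With this input the first round batches requests 0, 2, and 3, and the deferred last-chunk request 1 runs alone in the second round.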

ma-hang · 2026-06-08T06:36:09Z

+                    # Send this long one back to waiting; flush the normal batch first.
+                    req.status = RequestStatus.WAITING
+                    self.waiting_queue.sync_q.put(req)
+                    break


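Why can't this request be split into chunks and added to the batch here?
Even if it is not added, should this really `break`? The requests scheduled so far may cover very few tokens, so the later requests should still be checked. Perhaps `max_num_batched_tokens` could serve as the limit that decides when to `break`?

Also, the current implementation allocates the slot mapping and block tables all at once, then applies an offset when computing the model inputs to pick out the part to be computed.
This is somewhat wasteful in block usage; for example, a very long request can hold blocks idle for a long time. Could blocks instead be allocated on demand each time the request is scheduled (one chunk size at a time)? Then, when "can accept reqs" is optimized later, the chunk part would not need to be adjusted again.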
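The on-demand allocation suggested above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's allocator: `BlockPool`, `blocks_needed`, and `grow_for_chunk` are hypothetical names, and the block size and prompt length are made-up numbers.

```python
class BlockPool:
    """Hypothetical free-list allocator for KV cache blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self, n: int) -> list:
        if n > len(self.free):
            raise RuntimeError("Not enough free blocks.")
        taken, self.free = self.free[:n], self.free[n:]
        return taken


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: blocks required to hold `num_tokens` tokens."""
    return -(-num_tokens // block_size)


def grow_for_chunk(block_table, pool, num_computed, chunk_len, block_size):
    """Extend `block_table` only enough to cover the chunk about to be computed."""
    target = blocks_needed(num_computed + chunk_len, block_size)
    extra = target - len(block_table)
    if extra > 0:
        block_table.extend(pool.allocate(extra))
    return block_table


# Assumed numbers: a 100-token prompt, 16-token blocks, 32-token chunks.
pool = BlockPool(num_blocks=16)
table, computed, history = [], 0, []
prompt_len, block_size, chunk_size = 100, 16, 32
while computed < prompt_len:
    chunk_len = min(chunk_size, prompt_len - computed)
    grow_for_chunk(table, pool, computed, chunk_len, block_size)
    computed += chunk_len
    history.append(len(table))
```

In this example the request holds 2, 4, 6, and finally 7 blocks across its four rounds, whereas allocating up front would hold all 7 from the first round.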

ma-hang · 2026-06-08T06:54:06Z

+                # Restore deferred ones.
+                for d in deferred_requests:
+                    self.waiting_queue.sync_q.put(d)
+                return scheduler_output


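Returning `scheduler_output` directly here does not seem ideal. If the first request in the waiting queue is a long one, this round would compute only one chunk size worth of tokens for that single request, right? Isn't the compute left underutilized?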
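One way to picture the utilization concern is a token-budget packing loop that keeps adding requests after the long one instead of returning immediately. The sketch below is hypothetical: `pack_batch` and its tuple format are invented for illustration, `max_num_batched_tokens` is borrowed from the earlier review comment, and whether mixed batches like this are compatible with the captured graphs is not settled in this thread.

```python
def pack_batch(pending, chunk_size, max_num_batched_tokens):
    """Plan one scheduling round under a token budget.

    `pending` is a list of (request_id, tokens_left) pairs in queue order.
    Each request contributes at most one chunk, and the round stops once the
    budget is spent. Returns a list of (request_id, tokens_scheduled) pairs.
    """
    budget = max_num_batched_tokens
    plan = []
    for rid, tokens_left in pending:
        if budget == 0:
            break  # Budget exhausted; the rest wait for the next round.
        take = min(tokens_left, chunk_size, budget)
        if take > 0:
            plan.append((rid, take))
            budget -= take
    return plan


# Assumed workload: one long request followed by two short ones.
pending = [(0, 5000), (1, 100), (2, 300)]
roomy = pack_batch(pending, chunk_size=512, max_num_batched_tokens=1024)
tight = pack_batch(pending, chunk_size=512, max_num_batched_tokens=600)
```

With a budget of 1024 tokens the round schedules 912 tokens across all three requests rather than 512 tokens for the long request alone; with a budget of 600 the second request is truncated to the remaining 88 tokens.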

Add feature ChunkPrefill

fde8d7c

wooway777 force-pushed the issue/340-cf branch from 9f17be1 to fde8d7c Compare May 29, 2026 07:08

pengcheng888 reviewed Jun 8, 2026

ma-hang reviewed Jun 8, 2026

wooway777 wants to merge 1 commit into issue/340 from issue/340-cf

4 participants
