Issue/340 cf#403
Conversation
        return result;
    }
    // chunk-prefill must be checked before decode (decode would also match if chunk_size==1)
    result = chunk_prefill_compiler_.get()->get_compiled(input);
The constructor only creates `chunk_prefill_compiler_` inside `if (enable_chunk_prefill_graph_)`. Does this call also need to be guarded by `if (enable_chunk_prefill_graph_)`?
    return SchedulerOutput([req], is_prefill=True)
    # Already batched some middle chunks; defer this last-chunk one.
    self.chunking_queue.sync_q.put(req)
    break
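Why can't this request be scheduled here? Is it because no graph was captured for this case, or is there some other consideration?
Even if it is not scheduled, it probably should not `break` either; the remaining requests in the chunking queue should still be checked. In the worst case the second request already hits `chunk_is_last`, and stopping early would leave utilization very low, wouldn't it?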
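To make the suggestion concrete, here is a minimal sketch of scanning the whole chunking queue and deferring last-chunk requests instead of stopping at the first one. `Req`, `pick_middle_chunks`, and `max_batch` are hypothetical names, and a plain `deque` stands in for the real `chunking_queue`; this is not the PR's code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    # Hypothetical stand-in for the scheduler's request type.
    rid: int
    chunk_is_last: bool


def pick_middle_chunks(chunking_queue: deque, max_batch: int) -> list:
    """Scan the queue once; defer last-chunk requests instead of stopping."""
    batch, deferred = [], []
    for _ in range(len(chunking_queue)):
        req = chunking_queue.popleft()
        if req.chunk_is_last:
            deferred.append(req)  # keep scanning rather than `break`
        elif len(batch) < max_batch:
            batch.append(req)
        else:
            deferred.append(req)  # batch is full; leave it for the next round
    # Put deferred requests back at the front in their original order.
    chunking_queue.extendleft(reversed(deferred))
    return batch
```

With this shape, one last-chunk request near the head of the queue no longer prevents later middle-chunk requests from joining the batch.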
    # Send this long one back to waiting; flush the normal batch first.
    req.status = RequestStatus.WAITING
    self.waiting_queue.sync_q.put(req)
    break
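Why can't this request be chunked and added to the batch here?
Even if it is not added, should this really `break`? The requests scheduled so far may cover very few tokens, so the following requests should still be checked. Perhaps `max_num_batched_tokens` could serve as the limit that decides when to `break`?
Also, the current implementation allocates the slot mapping and block tables once up front, then applies an offset when computing the model inputs to extract the part to compute.
This wastes blocks to some degree; for example, a very long request can hold blocks idle for a long time. Could blocks instead be allocated on demand each time the request is scheduled (one chunk size at a time)? Then, when "can accept reqs" is optimized later, the chunk part would not need to be adjusted again.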
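As a rough illustration of per-chunk allocation, the sketch below grows a request's block list only by what the next chunk needs. `ChunkedReq`, `BlockPool`, and `grow_for_chunk` are hypothetical names, and the free list is a toy stand-in for the real KV-cache manager.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkedReq:
    # Hypothetical request state; field names are illustrative only.
    prompt_len: int
    num_computed: int = 0
    block_ids: list = field(default_factory=list)


class BlockPool:
    """Toy free-list allocator standing in for the real KV-cache manager."""

    def __init__(self, num_blocks: int, block_size: int):
        self.free = list(range(num_blocks))
        self.block_size = block_size

    def grow_for_chunk(self, req: ChunkedReq, chunk_size: int) -> int:
        """Allocate only the blocks the next chunk needs.

        Returns the number of tokens scheduled, or 0 if the pool is short.
        """
        n_tokens = min(chunk_size, req.prompt_len - req.num_computed)
        end = req.num_computed + n_tokens
        # Ceil-divide to get total blocks required up to `end`.
        needed = -(-end // self.block_size) - len(req.block_ids)
        if needed > len(self.free):
            return 0
        for _ in range(needed):
            req.block_ids.append(self.free.pop())
        req.num_computed = end
        return n_tokens
```

For a 100-token prompt with 16-token blocks and 40-token chunks, this holds 3, then 5, then 7 blocks across the three rounds, rather than all 7 from the first round.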
    # Restore deferred ones.
    for d in deferred_requests:
        self.waiting_queue.sync_q.put(d)
    return scheduler_output
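Returning `scheduler_output` directly here does not seem ideal. If the first request in the waiting queue is a long one, this round will only compute one chunk size worth of tokens for that single request, right? Isn't the available compute being underutilized?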
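One possible alternative, sketched under assumptions: keep packing requests until a per-step token budget is spent, so a long request's chunk shares the step with whatever else fits. `WaitingReq` and `fill_batch` are hypothetical names, a `deque` stands in for the waiting queue, and whether mixed batches like this fit the captured graphs is a separate question this sketch does not answer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WaitingReq:
    # Hypothetical request: prompt tokens still to prefill.
    rid: int
    remaining: int


def fill_batch(waiting: deque, max_num_batched_tokens: int, chunk_size: int) -> list:
    """Pack (rid, n_tokens) pairs until the per-step token budget is spent."""
    budget = max_num_batched_tokens
    scheduled, unfinished = [], []
    while waiting and budget > 0:
        req = waiting.popleft()
        n = min(req.remaining, chunk_size, budget)
        scheduled.append((req.rid, n))
        req.remaining -= n
        budget -= n
        if req.remaining > 0:
            unfinished.append(req)  # needs more chunks in later steps
    # Unfinished requests return to the front so they keep their priority.
    waiting.extendleft(reversed(unfinished))
    return scheduled
```

Here the loop ends on budget exhaustion rather than on the first long request, which is the role `max_num_batched_tokens` was suggested for above.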
Summary
Motivation
Closes #
Type of Change
- feat: new feature / new model
- fix: bug fix
- perf: performance improvement (no behavioral change)
- refactor: code restructuring without behavior change
- test: adding or fixing tests only
- docs: documentation only
- build / ci: build system or CI configuration
- chore: tooling, formatting, or other non-code changes
Test Results of Involved Models on Supported Platforms (Please attach screenshots)
Benchmark / Performance Impact
Notes for Reviewers
Checklist
Title, Branch, and Commits
- `feat(nvidia): …`, `fix(cuda/gemm): …`).
- `<type>/xxx-yyyy-zzzz` where `<type>` matches the PR title's Conventional Commits type and words are joined with hyphens (see `CONTRIBUTING.md` §Branches).
- `CONTRIBUTING.md` §Pull Requests).
- `main`: the branch is rebased cleanly on top of the current `main`.
- `fixup!` / `squash!` / `wip` commits remain.
Scope and Design
- `CONTRIBUTING.md` §Code/General).
- `printf` / `std::cout` / `print(...)` left behind, or `TODO` without an owner and issue link.
General Code Hygiene (applies to all languages)
- `CONTRIBUTING.md` §Code/General).
- `CONTRIBUTING.md` §Code/General).
- the `seqlens_k` tensor) (`CONTRIBUTING.md` §Code/General).
- `CONTRIBUTING.md` §Code/General).
- `CONTRIBUTING.md` §Code/General; §Python).
C++ Specific (if C++ files changed)
- `CONTRIBUTING.md` §C++).
- `CONTRIBUTING.md` §C++).
- `new` / `delete`; RAII / smart pointers / existing allocators are used.
- `scripts/format.py`.
- `csrc/models/llama_legacy/`.
Python Specific (if Python files changed)
- `CONTRIBUTING.md` §Python).
- `CONTRIBUTING.md` §Python).
- `scripts/format.py`.
- `python/infinilm/auto_config.py`.
Testing
- `examples/test_infer.py`), or specify the reason for skipping.
- `examples/bench.py`), or specify the reason for skipping.
- `test/bench/test_benchmark.py`), or specify the reason for skipping.
- `python/infinilm/server/inference_server.py` + `scripts/test_perf.py`), or specify the reason for skipping.
Build, CI, and Tooling
Documentation
- `README.md`, `CONTRIBUTING.md`, or inline docs updated when behavior, build flags, or developer workflow changed.
- `!` or `BREAKING CHANGE:` footer.
Security and Safety