117 changes: 117 additions & 0 deletions VULKAN_BUILD_VERIFICATION.md
@@ -0,0 +1,117 @@
# Vulkan Build Verification for PR #79

Verifies that PR #79 (`dflash: enable Qwen3-Coder-Next on Vulkan`, commit `b788b4af1`) builds cleanly for the Vulkan backend on two independent host environments.

---

## Verification 1 — AMD Strix Halo (Debian 13 / Mesa RADV)

**Commit verified:** `b788b4af1` — `dflash: enable Qwen3-Coder-Next on Vulkan`

### Build System

| Component | Detail |
|---|---|
| **OS** | Debian 13 (trixie), Linux 6.12.90-amd64 |
| **CPU** | AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (16C/32T, up to 5185 MHz) |
| **GPU** | AMD Strix Halo Radeon 8060S (RADV GFX1151, Mesa 25.0.7) |
| **RAM** | 30 GiB |
| **CMake** | 3.31.6 |
| **Compiler** | GCC 14.2.0 (`-march=native`) |
| **Make** | GNU Make 4.4.1, 32 parallel jobs |
| **Vulkan SDK** | 1.4.309, glslc (shaderc 2025.2, glslang 15.1.0) |
| **ccache** | Enabled (found automatically) |

### CMake Configuration

```bash
cmake -B build_vulkan -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
```

Key flags:
- `GGML_VULKAN=ON` — Vulkan backend enabled
- `GGML_NATIVE=ON` — CPU backend compiled with `-march=native`
- `GGML_CPU=ON` — CPU backend included alongside Vulkan (default setting, not passed explicitly above)
- `CMAKE_BUILD_TYPE=Release`

Vulkan shader features detected:
- `GL_KHR_cooperative_matrix` — supported
- `GL_NV_cooperative_matrix2` — supported
- `GL_EXT_integer_dot_product` — supported
- `GL_EXT_bfloat16` — not supported (driver/hardware limitation)

### Build Result

- **Configure:** Success (rc=0, 1.8s)
- **Build:** Success (100%, 0 warnings treated as errors)
- **Binaries produced:** `llama-server`, `llama-cli`, `llama-bench`, `llama`, `llama-perplexity`, `llama-imatrix`, `llama-quantize`, `llama-llama-bench`, all test binaries
- **Working tree:** Clean, nothing to commit

### Conclusion

PR #79 builds cleanly for Vulkan on native AMD hardware (Strix Halo / Radeon 8060S) with the Mesa RADV driver. The build produced no compilation errors or warnings.

---

## Verification 2 — NVIDIA RTX 3090 host (Debian 11 / NVIDIA Vulkan ICD)

**Commit verified:** `b788b4af1` — `dflash: enable Qwen3-Coder-Next on Vulkan` (on the updated PR head `4efe54303`)

### Build System

| Component | Detail |
|---|---|
| **OS** | Debian GNU/Linux 11 (bullseye), Linux 5.10.0-43-amd64 |
| **CPU** | AMD Ryzen 9 5950X 16-Core (32 threads, `-march=native`) |
| **GPU** | NVIDIA GeForce RTX 3090 (GA102) — host for the build |
| **RAM** | 62 GiB |
| **CMake** | 3.31.11 |
| **Compiler** | GCC 10.2.1 (`g++ (Debian 10.2.1-6)`) |
| **Make/Ninja** | Ninja 1.10.1, 32 parallel jobs |
| **glslc** | shaderc v2023.2 (built from source, installed to `~/.local/bin/glslc`; not packaged in Debian 11) |
| **Vulkan headers** | 1.4.309 (KhronosGroup `Vulkan-Headers` v1.4.309, header-only, installed to `~/.local`; Debian 11 ships only 1.2.162, which is too old for the current `ggml-vulkan.cpp`) |
| **Vulkan loader** | libvulkan 1.2.162 (system `libvulkan-dev`) — sufficient to link; newer symbols are resolved at runtime via the loader/ICD |

### Setup notes for Debian 11

Debian 11 does not ship `glslc`/`shaderc` and its Vulkan headers (1.2.162) predate symbols the current Vulkan backend requires (`vk::PhysicalDeviceMaintenance4Properties`, `vk::DriverId::eMesaTurnip`/`eMesaDozen`, `layer_setting_info`). Two host-side additions were needed, neither touching the PR source:

1. Build and install `glslc` from Shaderc `v2023.2` source → `~/.local/bin/glslc`.
2. Install header-only `Vulkan-Headers` v1.4.309 → `~/.local/include/vulkan`.

Then configure with the local prefix so CMake's `FindVulkan` picks up the new headers:

```bash
export CMAKE_PREFIX_PATH="$HOME/.local:$CMAKE_PREFIX_PATH"
cmake -B build_vulkan -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release \
      -DVulkan_INCLUDE_DIR="$HOME/.local/include"
cmake --build build_vulkan -j32
```

### CMake Configuration

```
-- Found Vulkan: /usr/lib/x86_64-linux-gnu/libvulkan.so (found version "1.4.309") found components: glslc missing components: glslangValidator
-- Vulkan found
-- GL_KHR_cooperative_matrix not supported by glslc
-- GL_NV_cooperative_matrix2 not supported by glslc
-- GL_EXT_integer_dot_product not supported by glslc
-- GL_EXT_bfloat16 not supported by glslc
-- Including Vulkan backend
```

(The shader feature probes report `not supported` because the older `glslc` v2023.2 build cannot compile those extensions. This disables only the optional cooperative-matrix, integer-dot-product, and bfloat16 shader paths and does not affect compilation.)

### Build Result

- **Configure:** Success (rc=0)
- **ggml-vulkan target:** Success (rc=0, 100%)
- **Full build:** Success (rc=0, 100%)
- **Binaries produced:** `llama-server`, `llama-cli`, `llama-bench`, `llama-perplexity`, `test-dflash-plumbing` (plus shared libs `libggml-vulkan.so`, `libllama-server-impl.so`, etc.)
- **DFlash plumbing test:** `test-dflash-plumbing` → rc=0
- **Errors:** 0 (zero `error:` lines in the full build log)
- **Warnings:** only benign — 35× `-Wdouble-promotion`, 1× `-Wmissing-field-initializers` (no `-Werror`)

### Conclusion

PR #79 builds cleanly for Vulkan on a Debian 11 / NVIDIA RTX 3090 host once `glslc` and current Vulkan headers are supplied locally (no PR source changes required). Combined with Verification 1, the Vulkan backend compiles end-to-end on both AMD/Mesa RADV (Debian 13) and NVIDIA (Debian 11) hosts.
45 changes: 29 additions & 16 deletions common/speculative.cpp
@@ -2574,6 +2574,12 @@ struct common_speculative_impl_dflash : public common_speculative_impl {
llama_set_dflash_gpu_capture(ctx_tgt, false);
LOG_WRN("dflash: GPU cross ring unavailable; using CPU hidden capture\n");
}

+ // Full-attention DFlash layers may read target context from the drafter's
+ // normal KV cache only when the GPU cross ring can populate it. Vulkan
+ // CPU hidden capture has no such cache and must project K/V freshly from
+ // target_hidden in the drafter graph.
+ llama_set_dflash_target_kv_available(ctx_dft, gpu_ring_handle != nullptr);
}

~common_speculative_impl_dflash() override {
@@ -3005,13 +3011,16 @@ struct common_speculative_impl_dflash : public common_speculative_impl {
// build drafter batch: [id_last, mask, mask, ..., mask]
// positions stay on the target model's absolute timeline so RoPE
// tracks the accepted suffix instead of restarting at the window.
- // batch size adapts to n_draft+1 (saves compute when n_max < block_size-1)
- const int batch_len = n_draft + 1;
+ // DFlash attention is non-causal over the query block, so shorter
+ // n_draft+1 batches are not semantically equivalent to the trained
+ // full block. Decode the full block but consume only output_len rows.
+ const int batch_len = block_size;
+ const int output_len = n_draft + 1;
const int draft_pos_base = committed_len;
common_batch_clear(batch_dft);
common_batch_add(batch_dft, id_last, draft_pos_base, { seq_id }, true);
for (int i = 1; i < batch_len; ++i) {
- common_batch_add(batch_dft, mask_token_id, draft_pos_base + i, { seq_id }, true);
+ common_batch_add(batch_dft, mask_token_id, draft_pos_base + i, { seq_id }, i < output_len);
}

const int64_t t2 = ggml_time_us();
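
The batch layout introduced in this hunk can be illustrated in isolation. The following is a minimal sketch, not the PR's code: `draft_slot` and `build_draft_block` are hypothetical names, and `n_draft` is assumed to be `min(block_size - 1, n_max)` as in the batched path further down. It shows that every position of the trained block is submitted, while only the first `n_draft + 1` rows request output.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one drafter batch entry; not a llama.cpp type.
struct draft_slot {
    int  token;   // token id placed at this position
    int  pos;     // absolute position on the target model's timeline
    bool output;  // whether logits are requested for this row
};

// Build [id_last, mask, ..., mask] over the full block, requesting output
// only for the first n_draft + 1 rows.
std::vector<draft_slot> build_draft_block(int id_last, int mask_token_id,
                                          int block_size, int n_max, int pos_base) {
    assert(block_size >= 1 && n_max >= 0);
    const int n_draft    = std::min(block_size - 1, n_max);
    const int batch_len  = block_size;   // always decode the trained block length
    const int output_len = n_draft + 1;  // rows that are actually read back

    std::vector<draft_slot> slots;
    slots.reserve(batch_len);
    slots.push_back({ id_last, pos_base, true });
    for (int i = 1; i < batch_len; ++i) {
        slots.push_back({ mask_token_id, pos_base + i, i < output_len });
    }
    return slots;
}
```

With `block_size = 16` and `n_max = 4`, the sketch yields 16 slots of which 5 are flagged for output. The trailing mask tokens still take part in attention, which is the point of the change: the comment in the hunk states that attention is non-causal over the query block, so dropping them would alter what the kept rows see.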
@@ -3025,15 +3034,15 @@ struct common_speculative_impl_dflash : public common_speculative_impl {

const int64_t t3 = ggml_time_us();

- // read argmax tokens for positions 1..batch_len-1 (skip position 0 = staged_first)
+ // read argmax tokens for output positions 1..output_len-1 (skip position 0 = staged_first)
{
int32_t * argmax = llama_get_logits_argmax(ctx_dft);
float * argmax_probs = llama_get_logits_argmax_probs(ctx_dft);
const int K_flat = llama_get_logits_argmax_k(ctx_dft);
const int argmax_rows = llama_get_logits_argmax_n(ctx_dft);
if (argmax) {
const int n_vocab = llama_vocab_n_tokens(llama_model_get_vocab(model_dft));
- if (!common_dflash_argmax_shape_valid(__func__, argmax_rows, batch_len, K_flat)) {
+ if (!common_dflash_argmax_shape_valid(__func__, argmax_rows, output_len, K_flat)) {
if (dp.draft_log_probs) {
dp.draft_log_probs->clear();
}
@@ -3042,21 +3051,21 @@ struct common_speculative_impl_dflash : public common_speculative_impl {
}

// GPU argmax path - only top-k ids/probs are transferred.
- for (int i = 1; i < batch_len && (int) result.size() < n_draft; ++i) {
+ for (int i = 1; i < output_len && (int) result.size() < n_draft; ++i) {
const auto params = dp;
if (argmax_probs && p_min > 0.0f && (int) result.size() >= params.n_min) {
float log_prob = argmax_probs[i * K_flat];
float log_p_min = logf(p_min);
if (log_prob < log_p_min) {
LOG_DBG("dflash: early stop at position %d/%d (prob %.3f < p_min %.3f)\n",
- i, batch_len, expf(log_prob), p_min);
+ i, output_len, expf(log_prob), p_min);
break;
}
}
const int32_t token_raw = argmax[i * K_flat];
if (!common_dflash_argmax_token_valid(token_raw, n_vocab)) {
const float score = argmax_probs ? argmax_probs[i * K_flat] : std::numeric_limits<float>::quiet_NaN();
- note_invalid_reduced_logits(__func__, token_raw, i, batch_len, K_flat,
+ note_invalid_reduced_logits(__func__, token_raw, i, output_len, K_flat,
committed_len, cross_len, -1, 0, score);
if (dp.draft_log_probs) {
dp.draft_log_probs->clear();
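
The early-stop rule in this loop compares log-probabilities against `logf(p_min)` and only applies once `n_min` tokens have been kept. A minimal sketch of that rule on its own follows; `count_kept_drafts` is a hypothetical helper, and the input vector is assumed to hold the top-1 log-probability of each draft row in order, standing in for `argmax_probs[i * K_flat]`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns how many draft positions are kept before the early stop fires.
// top1_log_probs[k] is the top-1 log-probability of the k-th draft row.
int count_kept_drafts(const std::vector<float> & top1_log_probs,
                      float p_min, int n_min) {
    assert(n_min >= 0);
    // Threshold in log space; unused when p_min is zero or negative.
    const float log_p_min = p_min > 0.0f ? std::log(p_min) : 0.0f;
    int kept = 0;
    for (float log_prob : top1_log_probs) {
        // The threshold applies only after at least n_min tokens are kept.
        if (p_min > 0.0f && kept >= n_min && log_prob < log_p_min) {
            break;
        }
        ++kept;
    }
    return kept;
}
```

Working in log space avoids calling `expf` per row on the hot path. The diff only exponentiates inside the debug log message.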
@@ -3073,7 +3082,7 @@ struct common_speculative_impl_dflash : public common_speculative_impl {
} else {
// fallback: CPU argmax over full vocab
const int n_vocab_dft = llama_vocab_n_tokens(llama_model_get_vocab(model_dft));
- for (int i = 1; i < batch_len && (int) result.size() < n_draft; ++i) {
+ for (int i = 1; i < output_len && (int) result.size() < n_draft; ++i) {
float * logits = llama_get_logits_ith(ctx_dft, i);
if (!logits) {
break;
@@ -4561,7 +4570,11 @@ void common_speculative_draft_batch(
const llama_model * model_dft = llama_get_model(ctx_dft);
const int block_size = llama_model_dflash_block_size(model_dft);
const int n_draft = std::min(block_size - 1, params.n_max);
- const int batch_len = n_draft + 1;
+ // Keep the full query block for flat/batched DFlash too. Tree drafting
+ // already does this; non-causal query attention makes shorter batches
+ // semantically different from the trained full block.
+ const int batch_len = block_size;
+ const int output_len = n_draft + 1;
const llama_token mask_tok = (llama_token) llama_model_dflash_mask_token_id(model_dft);

const int64_t t0 = ggml_time_us();
@@ -4641,7 +4654,7 @@ void common_speculative_draft_batch(
for (const auto & rs : ready) {
common_batch_add(batch, id_last_per_spec[rs.spec_idx], rs.draft_pos_base, { rs.seq_id }, true);
for (int i = 1; i < batch_len; i++) {
- common_batch_add(batch, mask_tok, rs.draft_pos_base + i, { rs.seq_id }, true);
+ common_batch_add(batch, mask_tok, rs.draft_pos_base + i, { rs.seq_id }, i < output_len);
}
}

@@ -4664,19 +4677,19 @@ void common_speculative_draft_batch(
auto & rs = ready[r];
auto & result = result_per_spec[rs.spec_idx];
std::vector<float> * log_probs = log_probs_per_spec ? &(*log_probs_per_spec)[rs.spec_idx] : nullptr;
- const int offset = r * batch_len;
+ const int offset = r * output_len;

if (argmax) {
const int n_vocab = llama_vocab_n_tokens(llama_model_get_vocab(model_dft));
- if (!common_dflash_argmax_shape_valid(__func__, argmax_rows, n_ready * batch_len, K_flat)) {
+ if (!common_dflash_argmax_shape_valid(__func__, argmax_rows, n_ready * output_len, K_flat)) {
if (log_probs) {
log_probs->clear();
}
result.clear();
return;
}

- for (int i = 1; i < batch_len && (int) result.size() < n_draft; i++) {
+ for (int i = 1; i < output_len && (int) result.size() < n_draft; i++) {
if (argmax_probs && params.p_min > 0.0f && (int) result.size() >= params.n_min) {
float log_prob = argmax_probs[(offset + i) * K_flat];
if (log_prob < logf(params.p_min)) {
@@ -4687,7 +4700,7 @@ void common_speculative_draft_batch(
if (!common_dflash_argmax_token_valid(token_raw, n_vocab)) {
const float score = argmax_probs ? argmax_probs[(offset + i) * K_flat] : std::numeric_limits<float>::quiet_NaN();
auto * dfl = static_cast<common_speculative_impl_dflash *>(rs.impl);
- dfl->note_invalid_reduced_logits(__func__, token_raw, i, batch_len, K_flat,
+ dfl->note_invalid_reduced_logits(__func__, token_raw, i, output_len, K_flat,
rs.draft_pos_base, rs.cross_len, rs.spec_idx, offset, score);
if (log_probs) {
log_probs->clear();
@@ -4703,7 +4716,7 @@ void common_speculative_draft_batch(
}
} else {
const int n_vocab = llama_vocab_n_tokens(llama_model_get_vocab(model_dft));
- for (int i = 1; i < batch_len && (int) result.size() < n_draft; i++) {
+ for (int i = 1; i < output_len && (int) result.size() < n_draft; i++) {
float * logits = llama_get_logits_ith(ctx_dft, offset + i);
if (!logits) {
break;
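
Once each sequence submits `batch_len` tokens but flags only `output_len` of them, a position has two different indices: one in the submitted batch and one among the returned rows. The sketch below spells out both. The helper names are hypothetical, and it assumes output rows are packed in batch order with exactly `output_len` rows per ready sequence, which is what the new `offset = r * output_len` implies.

```cpp
#include <cassert>

// Index of draft position i of ready sequence r inside the submitted batch,
// where every sequence contributes batch_len tokens.
int batch_index(int r, int i, int batch_len) {
    assert(r >= 0 && i >= 0 && i < batch_len);
    return r * batch_len + i;
}

// Index of the same position among the returned rows, assuming only
// output-flagged rows are packed, in batch order (output_len per sequence).
int output_row(int r, int i, int output_len) {
    assert(r >= 0 && i >= 0 && i < output_len);
    return r * output_len + i;
}
```

For `batch_len = 16` and `output_len = 5`, position 3 of the third ready sequence (`r = 2`) is batch token 35 but output row 13. The two indices coincide only for `r = 0` or when `output_len == batch_len`. In upstream llama.cpp, `llama_get_logits_ith` is indexed by batch position; whether the fallback path in this fork expects a batch position or an output row is not visible in the diff, so it may be worth checking for `r > 0`.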