
Staging to Main #76

Merged
cryptopoly merged 46 commits into main from staging on Jun 17, 2026

Conversation

@cryptopoly

Owner

No description provided.

dependabot Bot and others added 30 commits May 8, 2026 20:19
Bumps [tauri-plugin-opener](https://github.com/tauri-apps/plugins-workspace) from 2.5.3 to 2.5.4.
- [Release notes](https://github.com/tauri-apps/plugins-workspace/releases)
- [Commits](tauri-apps/plugins-workspace@http-v2.5.3...http-v2.5.4)

---
updated-dependencies:
- dependency-name: tauri-plugin-opener
  dependency-version: 2.5.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tauri-plugin-dialog](https://github.com/tauri-apps/plugins-workspace) from 2.7.0 to 2.7.1.
- [Release notes](https://github.com/tauri-apps/plugins-workspace/releases)
- [Commits](tauri-apps/plugins-workspace@log-v2.7.0...log-v2.7.1)

---
updated-dependencies:
- dependency-name: tauri-plugin-dialog
  dependency-version: 2.7.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Updates the requirements on [pymdown-extensions](https://github.com/facelessuser/pymdown-extensions) to permit the latest version.
- [Release notes](https://github.com/facelessuser/pymdown-extensions/releases)
- [Commits](facelessuser/pymdown-extensions@10.7...10.21.3)

---
updated-dependencies:
- dependency-name: pymdown-extensions
  dependency-version: 10.21.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tauri](https://github.com/tauri-apps/tauri) from 2.11.0 to 2.11.2.
- [Release notes](https://github.com/tauri-apps/tauri/releases)
- [Commits](tauri-apps/tauri@tauri-v2.11.0...tauri-v2.11.2)

---
updated-dependencies:
- dependency-name: tauri
  dependency-version: 2.11.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Updates the requirements on [mkdocs](https://github.com/mkdocs/mkdocs) to permit the latest version.
- [Release notes](https://github.com/mkdocs/mkdocs/releases)
- [Commits](mkdocs/mkdocs@1.6.0...1.6.1)

---
updated-dependencies:
- dependency-name: mkdocs
  dependency-version: 1.6.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tauri-build](https://github.com/tauri-apps/tauri) from 2.6.0 to 2.6.2.
- [Release notes](https://github.com/tauri-apps/tauri/releases)
- [Commits](tauri-apps/tauri@tauri-build-v2.6.0...tauri-build-v2.6.2)

---
updated-dependencies:
- dependency-name: tauri-build
  dependency-version: 2.6.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tar](https://github.com/composefs/tar-rs) from 0.4.45 to 0.4.46.
- [Release notes](https://github.com/composefs/tar-rs/releases)
- [Commits](composefs/tar-rs@0.4.45...0.4.46)

---
updated-dependencies:
- dependency-name: tar
  dependency-version: 0.4.46
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.149 to 1.0.150.
- [Release notes](https://github.com/serde-rs/json/releases)
- [Commits](serde-rs/json@v1.0.149...v1.0.150)

---
updated-dependencies:
- dependency-name: serde_json
  dependency-version: 1.0.150
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [rust-i18n](https://github.com/longbridge/rust-i18n) from 3.1.2 to 4.0.0.
- [Release notes](https://github.com/longbridge/rust-i18n/releases)
- [Commits](longbridge/rust-i18n@v3.1.2...v4.0.0)

---
updated-dependencies:
- dependency-name: rust-i18n
  dependency-version: 4.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
….22.0

Loose-floor bumps (same pattern as FU-058/062/063/069); no code change.
- turboquant-mlx-full 0.5.0 -> 0.6.2: expert-streaming reader tuning
  (read-coalescing, --prefetch-ahead, --pin-file). Apple Silicon.
- mlx-vlm 0.5.0 -> 0.6.0.
- vllm 0.21.0 -> 0.22.0 (both [vllm] and [triattention] extras):
  DeepSeek-V4 MTP spec-dec, Qwen3.5/3.6 GatedDeltaNet fixes, Gemma4
  fixes, multi-tier KV offload. CUDA-only; not exercised locally.

dflash-mlx v0.1.8 is available but deferred (FU-057 API-rewrite migration).

The cold mlx_lm+mlx+mlx_vlm import baseline crept to ~17.5s solo and ~31s
under a sustained E2E run (concurrent model loads + thermal throttle), and
the import is re-issued per MLX cell. The 20s and then 30s ceilings each
failed a different Phase 1 cell with 'mlx_worker probe timed out'. 45s
clears the ~31s loaded peak with headroom while staying bounded enough to
surface a wedged worker. Follow-up noted: cache the probe so it isn't
re-run per load.
…rent-watch killpg

Two fixes for the per-load MLX probe storm surfaced by the E2E sweep:

1. CAPABILITY_CACHE_TTL_SECONDS 10s -> 300s. load_model's
   refresh_capabilities() re-probed on every load because the 10s TTL was
   shorter than a single load+generate (40-70s), spawning a blocking
   17-31s mlx_lm+mlx+mlx_vlm import each time (the creep behind FU-068's
   probe-timeout bumps). Native caps only change on install, and every
   install path force-refreshes, so a long TTL is safe.

2. _json_subprocess spawns the probe with start_new_session=True so it is
   no longer in the backend's process group. app._watch_parent_and_exit
   killpg(SIGTERM)s the group when the backend's parent dies; on a
   non-Tauri launch whose launch shell exits, that SIGTERM'd the probe
   mid-run ("probe exited with code -15"). The probe is a few-second
   transient, so escaping the parent-death cleanup leaks nothing.
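
Fix 1 above amounts to a time-to-live cache around the probe. A minimal sketch of that idea, assuming a hypothetical `CapabilityCache` class and an injectable clock for testing (neither is taken from the repository; only the 300s value comes from the commit message):

```python
import time
from typing import Any, Callable, Optional

CAPABILITY_CACHE_TTL_SECONDS = 300.0  # value quoted in the commit message


class CapabilityCache:
    """Hypothetical helper: keep one probe result, re-probe only after the TTL."""

    def __init__(
        self,
        probe: Callable[[], Any],
        ttl: float = CAPABILITY_CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._value: Any = None
        self._stamp: Optional[float] = None

    def get(self, force: bool = False) -> Any:
        """Return the cached value; run the probe if forced, empty, or expired."""
        now = self._clock()
        expired = self._stamp is None or (now - self._stamp) >= self._ttl
        if force or expired:
            self._value = self._probe()
            self._stamp = now
        return self._value
```

With a 10s TTL, two loads 60s apart would each pay for a probe; with 300s the second load is served from the cache, and install paths can still pass `force=True`.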

Tests: test_inference + test_setup_routes + test_diagnostics_routes +
test_route_contracts green; standalone probe rc=0.
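
The process-group escape in fix 2 can be checked in isolation. The sketch below is POSIX-only and uses a hypothetical `run_probe_detached` helper with a trivial child in place of the real probe; the 45s timeout mirrors the ceiling mentioned in the earlier commit:

```python
import os
import subprocess
import sys
from typing import Tuple


def run_probe_detached(timeout: float = 45.0) -> Tuple[int, bool]:
    """Run a stand-in child in its own session.

    Returns (returncode, shares_parent_group). The child prints its own
    process-group id so the caller can compare it with the parent's.
    """
    child_code = "import os; print(os.getpgid(0))"
    proc = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        timeout=timeout,
        start_new_session=True,  # child calls setsid(): new session, new group
    )
    child_pgid = int(proc.stdout.strip())
    return proc.returncode, child_pgid == os.getpgid(0)
```

Because the child is no longer in the parent's group, a `killpg` aimed at that group does not reach it, which matches the behaviour the commit describes.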
…plers

Tier 1+2 of the chat-LLM perf/quality review.

Performance (llama.cpp):
- --cache-reuse 256 + cache_prompt:true on the chat payload so a growing
  conversation reuses the slot KV and re-prefills only the new suffix
  instead of the whole history (turn-2+ TTFT drops sharply; cumulative
  prefill over a chat was O(n^2) in history length).
- Emit --flash-attn on when the user's fused_attention flag is set. It was
  plumbed into load_model + stored on LoadedModelInfo but never turned into
  a flag; threaded fused_attention into _build_command. Large Metal
  decode/KV-memory win; required for quantized KV cache types.
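
As a rough illustration of the two performance items, the sketch below builds the extra server flags and the chat payload. The helper names `build_server_flags` and `build_chat_payload` are hypothetical and the real `_build_command` is not shown in this PR; only the flag and field names come from the commit message:

```python
from typing import Any, Dict, List


def build_server_flags(fused_attention: bool, cache_reuse: int = 256) -> List[str]:
    """Extra llama-server flags described in the commit (sketch only)."""
    flags = ["--cache-reuse", str(cache_reuse)]
    if fused_attention:
        # Previously the user's flag was stored but never reached the command line.
        flags += ["--flash-attn", "on"]
    return flags


def build_chat_payload(messages: List[Dict[str, str]], **sampler: Any) -> Dict[str, Any]:
    """Chat request body with prompt caching enabled."""
    return {"messages": messages, "cache_prompt": True, **sampler}
```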

Quality (samplers):
- llama.cpp: add DRY (dry_multiplier/base/allowed_length), XTC
  (xtc_probability/threshold), top_n_sigma to _LLAMA_SAMPLER_KEYS
  (forward-only; old binaries ignore unknown fields).
- MLX: wire XTC into the sampler + add repeat_penalty via a new
  logits_processors builder. repeat_penalty was shown in the UI but
  silently dropped because mlx-lm applies it through logits_processors,
  not make_sampler.
- /v1 parity: forward min_p / repeat_penalty / mirostat(_tau/_eta) which
  the OpenAI-compat path dropped; added the request-model fields.
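
The forward-only behaviour for llama.cpp can be sketched as an allow-list filter. The key set below is an illustrative subset built from the names in this commit message, not the actual contents of `_LLAMA_SAMPLER_KEYS`, and `apply_sampler_overrides` is a hypothetical name:

```python
from typing import Any, Dict, Mapping

# Illustrative subset: only the keys this commit message mentions.
LLAMA_SAMPLER_KEYS = frozenset({
    "dry_multiplier", "dry_base", "dry_allowed_length",
    "xtc_probability", "xtc_threshold", "top_n_sigma",
    "min_p", "repeat_penalty", "mirostat", "mirostat_tau", "mirostat_eta",
})


def apply_sampler_overrides(
    payload: Mapping[str, Any], overrides: Mapping[str, Any]
) -> Dict[str, Any]:
    """Copy allow-listed, non-None sampler settings onto the request payload."""
    out = dict(payload)
    for key, value in overrides.items():
        if key in LLAMA_SAMPLER_KEYS and value is not None:
            out[key] = value
    return out
```

Unset (`None`) values are left out so the server's own defaults apply.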

Tests: new sampler/logits/parity cases; touched-area suites green. The one
failing test (dflash runtime-bundle) is pre-existing/unrelated (orphaned
dflash-mlx pin, FU-057).
…oning

Tier 3 of the chat-LLM review.

- Stop re-feeding prior <think> reasoning into history every turn. The live
  chat path now passes preserve_reasoning=False; upstream Qwen3 /
  DeepSeek-R1 templates strip prior reasoning, and replaying it inflated
  the prompt each turn. _build_history_with_reasoning keeps the capability
  for callers that still want it (sessions side already passed False).
- Token-budgeted sliding window (optional token_budget arg). Keeps system
  messages + the newest turns that fit, drops the oldest — bounding prompt
  growth so a long chat can't silently truncate on llama.cpp or overflow
  context on MLX. Budget reserves room for system prompt + current prompt +
  max_tokens + template overhead, floors at 512 so the latest turn is
  always kept. Conservative ~3 chars/token estimate (no tokenizer at this
  layer) errs toward under-filling to avoid overflow.
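
The sliding window described above might look roughly like this. It is a sketch under stated assumptions: messages are plain dicts with `role` and `content`, and the function names `history_budget` and `window_history` are hypothetical; the 3 chars/token estimate and the 512 floor come from the commit message:

```python
from typing import Dict, List

CHARS_PER_TOKEN = 3  # conservative estimate from the commit message
MIN_BUDGET = 512     # floor from the commit message


def estimate_tokens(text: str) -> int:
    """Ceil(len / 3): over-counts tokens, so the window under-fills."""
    return -(-len(text) // CHARS_PER_TOKEN)


def history_budget(context_window: int, system_tokens: int, prompt_tokens: int,
                   max_tokens: int, template_overhead: int) -> int:
    """Tokens left for history after the reserved parts, floored at 512."""
    reserved = system_tokens + prompt_tokens + max_tokens + template_overhead
    return max(MIN_BUDGET, context_window - reserved)


def window_history(messages: List[Dict[str, str]], token_budget: int) -> List[Dict[str, str]]:
    """Keep all system messages plus the newest turns that fit the budget."""
    keep = {i for i, m in enumerate(messages) if m["role"] == "system"}
    used = 0
    newest_kept = False
    for i in range(len(messages) - 1, -1, -1):
        if i in keep:
            continue
        cost = estimate_tokens(messages[i]["content"])
        # The newest turn is always kept, even if it alone exceeds the budget.
        if newest_kept and used + cost > token_budget:
            break
        keep.add(i)
        used += cost
        newest_kept = True
    return [m for i, m in enumerate(messages) if i in keep]
```

Walking from newest to oldest and stopping at the first turn that does not fit keeps the retained history contiguous.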

Tests: +7 windowing/budget cases; generate-path + services suites green.

The native MLX chat path rebuilt a fresh cache every turn and re-prefilled
the whole conversation. This keeps one persistent mlx-lm prompt cache on the
worker and reuses the longest matching token prefix across turns: trim the
divergent tail, prefill only the new suffix, re-commit keyed by
prompt+generated tokens. A single-slot port of mlx-lm server's
LRUPromptCache.fetch_nearest_cache.

- New backend_service/mlx_worker_prompt_cache.py: acquire / commit /
  invalidate. Gated to the native strategy (compression caches keep their
  path); guarded by can_trim_prompt_cache (SSM/Mamba/rotating-full reset,
  mlx-lm #980); resets on model change / no common prefix / partial trim /
  any exception -> fresh full prefill (identical output, no speedup).
- WorkerState gains _persist_cache / _persist_tokens /
  _persist_cache_model_ref; invalidated on every load / unload / profile
  change. generate_standard + stream_generate collect generated token ids
  (GenerationResponse.token) so the persisted token list always equals the
  cache's positional contents (exact next-turn trim).
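
The reuse logic can be sketched without MLX by standing in a fake cache that only tracks its length. Everything below is illustrative: `FakeCache` and `SingleSlotPromptCache` are hypothetical names, `can_trim` stands in for the `can_trim_prompt_cache` guard, and the real module's signatures are not shown in this PR:

```python
from typing import Any, List, Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class FakeCache:
    """Stand-in KV cache: records only how many positions it holds."""

    def __init__(self) -> None:
        self.length = 0

    def extend(self, n: int) -> None:
        self.length += n

    def trim(self, n: int) -> int:
        n = min(n, self.length)
        self.length -= n
        return n


class SingleSlotPromptCache:
    def __init__(self) -> None:
        self.invalidate()

    def invalidate(self) -> None:
        self.cache: Optional[FakeCache] = None
        self.tokens: List[int] = []
        self.model_ref: Any = None

    def commit(self, model_ref: Any, cache: FakeCache, tokens: Sequence[int]) -> None:
        """Store the cache keyed by prompt + generated token ids."""
        self.cache, self.tokens, self.model_ref = cache, list(tokens), model_ref

    def acquire(self, model_ref: Any, prompt: Sequence[int],
                can_trim: bool = True) -> Tuple[FakeCache, List[int]]:
        """Return (cache, suffix to prefill); any failure means a fresh cache."""
        prompt = list(prompt)
        try:
            if self.cache is None or self.model_ref != model_ref:
                raise LookupError("nothing reusable")
            # Leave at least one prompt token to prefill.
            prefix = min(common_prefix_len(self.tokens, prompt), len(prompt) - 1)
            if prefix <= 0:
                raise LookupError("no common prefix")
            excess = len(self.tokens) - prefix
            if excess > 0 and (not can_trim or self.cache.trim(excess) != excess):
                raise LookupError("cannot trim divergent tail")
            self.tokens = self.tokens[:prefix]
            return self.cache, prompt[prefix:]
        except Exception:
            self.invalidate()
            return FakeCache(), prompt
```

Every reset path falls back to a full prefill, so a failed reuse costs speed but should not change the output.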

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit, same session):
turn 1 promptTokens=34, turn 2 promptTokens=16 (vs ~90 without reuse) with a
coherent, context-aware turn-2 answer -- ~5.6x less prompt processing, no
corruption.

Tests: +12 reuse-logic cases (fake cache, trim accounting, all reset paths);
MLX worker suite green (the one fail, dflash runtime-bundle, is the
pre-existing orphaned-pin issue).

Tier 2 added DRY/XTC/top-n-sigma at the engine layer but nothing populated
them from a request, so they were unreachable. Complete the chain:

- GenerateRequest gains xtcProbability / xtcThreshold / dryMultiplier /
  dryBase / dryAllowedLength; _build_sampler_overrides maps them to the
  engine-side snake_case keys (llama-server forwards all via
  _LLAMA_SAMPLER_KEYS; mlx-lm applies XTC via make_sampler, ignores DRY).
- SamplerPanel gains xtc_probability / xtc_threshold / dry_multiplier rows;
  SamplerOverrides type + samplerOverrides storage/serialize carry the new
  fields.
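
The request-to-engine mapping could be sketched as a small lookup table. This is not the repository's `_build_sampler_overrides`; the function below is a hypothetical stand-in, and the `engine` parameter is an assumption used only to illustrate that DRY has no effect on the MLX side:

```python
from typing import Any, Dict, Mapping

# camelCase request fields -> engine-side snake_case keys (names from the commit)
REQUEST_TO_ENGINE_KEY = {
    "xtcProbability": "xtc_probability",
    "xtcThreshold": "xtc_threshold",
    "dryMultiplier": "dry_multiplier",
    "dryBase": "dry_base",
    "dryAllowedLength": "dry_allowed_length",
}


def build_sampler_overrides(request: Mapping[str, Any],
                            engine: str = "llama.cpp") -> Dict[str, Any]:
    """Map set request fields to engine keys; drop DRY keys for MLX."""
    overrides = {
        REQUEST_TO_ENGINE_KEY[name]: value
        for name, value in request.items()
        if name in REQUEST_TO_ENGINE_KEY and value is not None
    }
    if engine == "mlx":
        # Assumption for this sketch: filter here rather than let the engine ignore them.
        overrides = {k: v for k, v in overrides.items() if not k.startswith("dry_")}
    return overrides
```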

XTC adds creative variety (both engines); DRY kills verbatim repetition
loops better than repeat_penalty (llama.cpp). Both default off.

Tests: backend mapping + frontend projection/round-trip; tsc + vitest green.
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v5...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
…ites)

The chat SSE stream emitted one json.dumps + frame per token. Batch visible
token text in the standard (non-tool) stream path and flush on a 24-char /
50ms window, before any non-token event (reasoning / reasoningDone / panic /
thermal / error), and at stream end. Disabled when per-token logprobs are
requested (they must stay 1:1 aligned); the agent/tool path is unchanged.
full_text accumulation + the per-token runaway / stall / loop guards are
untouched, so persisted output + abort behaviour are identical; the frontend
reassembles the same text from larger frames.
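
A minimal sketch of the coalescing rule described above, assuming events arrive as `(kind, payload)` tuples and using hypothetical names (`TokenCoalescer`, `coalesce_events`); the 24-char / 50ms thresholds are the ones quoted in the commit:

```python
import time
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple

Event = Tuple[str, Any]


class TokenCoalescer:
    """Buffer token text; flush at 24 chars or 50 ms since the first buffered token."""

    def __init__(self, max_chars: int = 24, max_delay: float = 0.050,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_chars = max_chars
        self.max_delay = max_delay
        self._clock = clock
        self._parts: List[str] = []
        self._size = 0
        self._started = 0.0

    def push(self, text: str) -> Optional[str]:
        if not self._parts:
            self._started = self._clock()
        self._parts.append(text)
        self._size += len(text)
        if self._size >= self.max_chars or self._clock() - self._started >= self.max_delay:
            return self.flush()
        return None

    def flush(self) -> Optional[str]:
        if not self._parts:
            return None
        frame = "".join(self._parts)
        self._parts, self._size = [], 0
        return frame


def coalesce_events(events: Iterable[Event], coalescer: TokenCoalescer) -> Iterator[Event]:
    """Batch token events; flush before any non-token event and at stream end."""
    for kind, payload in events:
        if kind == "token":
            frame = coalescer.push(payload)
            if frame is not None:
                yield ("token", frame)
        else:
            pending = coalescer.flush()
            if pending is not None:
                yield ("token", pending)
            yield (kind, payload)
    pending = coalescer.flush()
    if pending is not None:
        yield ("token", pending)
```

Flushing before every non-token event preserves ordering, so concatenating the token frames gives the same text as the per-token stream.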

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit): a ~40-token reply
streamed as 9 token frames (avg ~20 chars) instead of ~40, text fully
coherent + correctly ordered (phase -> tokens -> done). ~4-5x fewer frames.
…dribble

The token coalescer landed on the standard stream path but the agent/tool
path still emitted one frame per token, and the agent loop fake-streamed the
already-computed final answer 4 chars at a time. Now the agent token
forwarding uses the same coalescer (flush before toolCallStart /
toolCallResult), and the agent loop emits the final answer in 48-char chunks
instead of 4 -- less fake latency + fewer yields. Tool-call detection +
execution flow is unchanged.
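
The effect of the larger chunk size is easy to see with a small generator. `chunk_text` is a hypothetical helper, not the agent loop's actual code:

```python
from typing import Iterator


def chunk_text(text: str, size: int = 48) -> Iterator[str]:
    """Yield consecutive slices of at most `size` characters."""
    for start in range(0, len(text), size):
        yield text[start:start + size]
```

For a 100-character answer this yields 3 chunks at size 48, compared with 25 chunks at size 4.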

Tests: test_agent + test_backend_service green; identical coalescer mechanics
to the live-validated standard path.

perf(chat): KV reuse, flash-attn, modern samplers, history window, token coalescing

New-feature gate (CLAUDE.md) for the chat-LLM work:
- "modern samplers (DRY+XTC)": a chat generate carrying xtcProbability +
  dryMultiplier must be accepted and produce tokens (proves the request
  field -> _build_sampler_overrides -> engine plumbing). Hard gate.
- "MLX prompt-cache reuse": two same-session turns; passes when turn-2
  reprocesses fewer prompt tokens (cache reused), skips when it doesn't
  engage (a model whose generated tokens don't round-trip at the answer
  boundary, or a reasoning model -> correct graceful full-reprocess). Soft +
  non-flaky; the reuse/trim logic is hard-tested in
  tests/test_mlx_prompt_cache.py.
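
The difference between the hard and soft gates can be sketched as two small classifiers. This is an assumption about how such checks might be expressed; the function names and the way the full-reprocess count is obtained are hypothetical, not the E2E harness itself:

```python
def hard_gate(tokens_generated: int) -> str:
    """Modern-samplers check: the request must be accepted and produce tokens."""
    return "pass" if tokens_generated > 0 else "fail"


def soft_gate(turn2_processed: int, turn2_full: int) -> str:
    """Prompt-cache check: pass when reuse engaged, otherwise skip (never fail)."""
    return "pass" if turn2_processed < turn2_full else "skip"


def reuse_ratio(turn2_full: int, turn2_processed: int) -> float:
    """How many times less prompt processing turn 2 needed."""
    return turn2_full / turn2_processed
```

With the live-validated figures quoted earlier (16 prompt tokens processed versus roughly 90 without reuse), `soft_gate(16, 90)` passes and the ratio is 90 / 16 = 5.625, in line with the reported ~5.6x.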

pre-build-check.sh needs no change: it already runs the pytest/tsc/vitest
that cover the new unit tests, and no deps / pins / cache-types changed.
…m + vllm

Release upstream polish.

Deps (loose floor bumps, no code change):
- mlx-vlm 0.6.0 -> 0.6.3
- vllm 0.22.0 -> 0.22.1 ([vllm] + [triattention] extras)

Discover catalog -- two frontier sparse-MoE families (text-only, verified
HF repos + real on-disk sizes):
- DeepSeek V4: Flash (284B / ~13B active, 1M ctx, baked-in MTP head) + Pro
  (1.6T). mlx-community 4-bit Flash (154 GB) is the local-viable entry;
  official BF16 + 8-bit + Pro listed for awareness.
- GLM-5 / GLM-5.1: GlmMoeDsa MoE (256 experts / 8 active, ~200K ctx).
  unsloth GGUF (Q4_K_M ~515 GB) + mlx-community MXFP4 + zai-org BF16.
Both text-only (configs carry no vision_config) so capabilities omit vision
-- no broken composer affordance.

Tests + gate:
- tests/test_catalog_text_families.py: parse + required-field + text-only +
  discover-payload checks.
- E2E phase 0 "new model families" check asserts both surface in the live
  /api/workspace catalog with their full variant set. Validated: phase 0
  PASS, 11 checks.

Tracked follow-ups (not in this change): MTPLX installer already auto-updates
to v1.0.1 (re-test FU-079 empty-output vs its new /v1 streaming); dflash-mlx
v0.1.9 migration stays deferred (FU-057); llama-cpp-turboquant branch drifted
(FU-065 commit-pin needs a verified test-build).

- FU-065: turbo branch drifted 2cbfdc62 -> 73eb521d (reproducibility risk
  confirmed; pin still deferred pending a verified test-compile).
- FU-079: MTPLX hit v1.0.0/v1.0.1 (installer auto-updates from 0.3.5); v1.0.0
  added real /v1 token streaming -> re-test the empty-output against v1.0.1.
- FU-067: dflash-mlx v0.1.9 now tagged; FU-057 migration stays deferred.

Gemma 4 (gemma-4 family):
- E2B: 2B multimodal, 128K ctx — official QAT Q4_0 GGUF (~1.5 GB) + BF16
- 31B: 31B multimodal, 256K ctx — MLX 8-bit, unsloth Q4_K_M GGUF, official QAT GGUF, BF16
- Both carry vision capability (Gemma4ForConditionalGeneration + vision_config confirmed)

MiniMax M2.7 (minimax-m2 family):
- 256 routed experts / 8 active, 200K ctx, ~240B total params / ~480 GB BF16
- mlx-community MXFP4 (~120 GB), unsloth GGUF Q4_K_M (~130 GB), official BF16

Qwen3.7 skipped — no official Qwen/Qwen3.7-* repo exists on HF as of 2026-06-12.

Tests: 7 catalog gate checks updated to cover all 4 frontier families
(shape, vision vs text-only, context windows, discover payload presence).
…0.23.0

2026-06-15 upstream scan:
- turboquant-mlx-full 0.8.0: adds Mamba/hybrid arch support + GPT-OSS-120B
  optimizations. Same TurboQuantKVCache call surface, backward compatible.
  Floor: 0.6.2 → 0.8.0.
- vllm 0.23.0 released. Floor: 0.22.1 → 0.23.0 (both [vllm] and
  [triattention] extras).

No action needed:
- mlx-vlm 0.6.3: already at floor, unchanged.
- mlx-lm 0.31.3: installed version, loose >=0.22.0 floor sufficient.
- mlx 0.31.2: installed version, loose >=0.22.0 floor sufficient.
- diffusers 0.38.0: at floor, no new release.
- TriAttention: still at pinned c3744ee6 (v0.2.0), no upstream change.

Deferred (tracker notes updated):
- dflash-mlx: v0.1.10 now tagged; FU-057/067 migration still deferred.
- llama-server-turbo branch: HEAD drifted to 7985f6b9; FU-065 deferred.
- TurboQuant+: v0.3.2.3 latest, no PyPI wheel, FU-032 trigger not met.
- MTPLX: now v1.0.4; FU-079 re-test still pending.
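
The "loose floor" reasoning in this scan reduces to a simple comparison: an installed version needs no action when it already satisfies the `>=` floor. A minimal sketch with hypothetical helper names, handling purely numeric dotted versions only (pre-release or local version tags would need a real version parser such as the `packaging` library):

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a numeric dotted version like '0.31.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def satisfies_floor(installed: str, floor: str) -> bool:
    """True when the installed version meets a '>=floor' requirement."""
    return parse_version(installed) >= parse_version(floor)
```

For example, `satisfies_floor("0.31.3", "0.22.0")` holds, which is why mlx-lm needed no change, while an install at the old vllm floor 0.22.1 would not satisfy the new 0.23.0 floor.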
feat(catalog): frontier model families + dep bumps (2026-06-15)
…ons/upload-artifact-7

chore(deps): bump actions/upload-artifact from 5 to 7
…ons/checkout-6

chore(deps): bump actions/checkout from 4 to 6
…ons/setup-python-6

chore(deps): bump actions/setup-python from 5 to 6
…ons-gte-10.21.3

chore(deps): update pymdown-extensions requirement from >=10.7 to >=10.21.3
chore(deps): update mkdocs requirement from >=1.6 to >=1.6.1
chore(deps): bump mkdocs-material from >=9.5 to >=9.7.6
…ri-plugin-opener-2.5.4

Bump tauri-plugin-opener from 2.5.3 to 2.5.4 in /src-tauri
…ri-plugin-dialog-2.7.1

Bump tauri-plugin-dialog from 2.7.0 to 2.7.1 in /src-tauri
…-0.4.46

chore(deps): bump tar from 0.4.45 to 0.4.46 in /src-tauri
…de_json-1.0.150

chore(deps): bump serde_json from 1.0.149 to 1.0.150 in /src-tauri
…ri-build-2.6.2

chore(deps): bump tauri-build from 2.6.0 to 2.6.2 in /src-tauri
…ri-2.11.2

chore(deps): bump tauri from 2.11.0 to 2.11.2 in /src-tauri
…t-i18n-4.0.0

chore(deps): bump rust-i18n from 3.1.2 to 4.0.0 in /src-tauri
@cryptopoly cryptopoly merged commit 2ce908a into main Jun 17, 2026