Staging to Main #76
Merged
Bumps [tauri-plugin-opener](https://github.com/tauri-apps/plugins-workspace) from 2.5.3 to 2.5.4. - [Release notes](https://github.com/tauri-apps/plugins-workspace/releases) - [Commits](tauri-apps/plugins-workspace@http-v2.5.3...http-v2.5.4) --- updated-dependencies: - dependency-name: tauri-plugin-opener dependency-version: 2.5.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tauri-plugin-dialog](https://github.com/tauri-apps/plugins-workspace) from 2.7.0 to 2.7.1. - [Release notes](https://github.com/tauri-apps/plugins-workspace/releases) - [Commits](tauri-apps/plugins-workspace@log-v2.7.0...log-v2.7.1) --- updated-dependencies: - dependency-name: tauri-plugin-dialog dependency-version: 2.7.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Updates the requirements on [pymdown-extensions](https://github.com/facelessuser/pymdown-extensions) to permit the latest version. - [Release notes](https://github.com/facelessuser/pymdown-extensions/releases) - [Commits](facelessuser/pymdown-extensions@10.7...10.21.3) --- updated-dependencies: - dependency-name: pymdown-extensions dependency-version: 10.21.3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tauri](https://github.com/tauri-apps/tauri) from 2.11.0 to 2.11.2. - [Release notes](https://github.com/tauri-apps/tauri/releases) - [Commits](tauri-apps/tauri@tauri-v2.11.0...tauri-v2.11.2) --- updated-dependencies: - dependency-name: tauri dependency-version: 2.11.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Updates the requirements on [mkdocs](https://github.com/mkdocs/mkdocs) to permit the latest version. - [Release notes](https://github.com/mkdocs/mkdocs/releases) - [Commits](mkdocs/mkdocs@1.6.0...1.6.1) --- updated-dependencies: - dependency-name: mkdocs dependency-version: 1.6.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tauri-build](https://github.com/tauri-apps/tauri) from 2.6.0 to 2.6.2. - [Release notes](https://github.com/tauri-apps/tauri/releases) - [Commits](tauri-apps/tauri@tauri-build-v2.6.0...tauri-build-v2.6.2) --- updated-dependencies: - dependency-name: tauri-build dependency-version: 2.6.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tar](https://github.com/composefs/tar-rs) from 0.4.45 to 0.4.46. - [Release notes](https://github.com/composefs/tar-rs/releases) - [Commits](composefs/tar-rs@0.4.45...0.4.46) --- updated-dependencies: - dependency-name: tar dependency-version: 0.4.46 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.149 to 1.0.150. - [Release notes](https://github.com/serde-rs/json/releases) - [Commits](serde-rs/json@v1.0.149...v1.0.150) --- updated-dependencies: - dependency-name: serde_json dependency-version: 1.0.150 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [rust-i18n](https://github.com/longbridge/rust-i18n) from 3.1.2 to 4.0.0. - [Release notes](https://github.com/longbridge/rust-i18n/releases) - [Commits](longbridge/rust-i18n@v3.1.2...v4.0.0) --- updated-dependencies: - dependency-name: rust-i18n dependency-version: 4.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
….22.0
Loose-floor bumps (same pattern as FU-058/062/063/069); no code change.
- turboquant-mlx-full 0.5.0 -> 0.6.2: expert-streaming reader tuning (read-coalescing, --prefetch-ahead, --pin-file). Apple Silicon.
- mlx-vlm 0.5.0 -> 0.6.0.
- vllm 0.21.0 -> 0.22.0 (both [vllm] and [triattention] extras): DeepSeek-V4 MTP spec-dec, Qwen3.5/3.6 GatedDeltaNet fixes, Gemma4 fixes, multi-tier KV offload. CUDA-only; not exercised locally.
dflash-mlx v0.1.8 is available but deferred (FU-057 API-rewrite migration).
Cold mlx_lm+mlx+mlx_vlm import baseline crept to ~17.5s solo and ~31s under a sustained E2E run (concurrent model loads + thermal throttle), re-issued per MLX cell. 20s then 30s ceilings each blew a different Phase 1 cell with 'mlx_worker probe timed out'. 45s clears the ~31s loaded peak with headroom, still bounded enough to surface a wedged worker. Follow-up noted: cache the probe so it isn't re-run per load.
…rent-watch killpg
Two fixes for the per-load MLX probe storm surfaced by the E2E sweep:
1. CAPABILITY_CACHE_TTL_SECONDS 10s -> 300s. load_model's
refresh_capabilities() re-probed on every load because the 10s TTL was
shorter than a single load+generate (40-70s), spawning a blocking
17-31s mlx_lm+mlx+mlx_vlm import each time (the creep behind FU-068's
probe-timeout bumps). Native caps only change on install, and every
install path force-refreshes, so a long TTL is safe.
2. _json_subprocess spawns the probe with start_new_session=True so it is
no longer in the backend's process group. app._watch_parent_and_exit
killpg(SIGTERM)s the group when the backend's parent dies; on a
non-Tauri launch whose launch shell exits, that SIGTERM'd the probe
mid-run ("probe exited with code -15"). The probe is a few-second
transient, so escaping the parent-death cleanup leaks nothing.
Tests: test_inference + test_setup_routes + test_diagnostics_routes +
test_route_contracts green; standalone probe rc=0.
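A minimal sketch of the first fix, assuming a simple in-process cache. Only the 300-second TTL and the force-refresh on install come from the commit message; `CapabilityCache`, `probe`, and the injectable `clock` are illustrative names, not the project's code.

```python
import time
from typing import Callable, Optional

CAPABILITY_CACHE_TTL_SECONDS = 300.0  # value from the commit message; was 10.0


class CapabilityCache:
    """Hypothetical TTL cache around an expensive capability probe."""

    def __init__(self, probe: Callable[[], dict],
                 ttl: float = CAPABILITY_CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[dict] = None
        self._stamp = 0.0

    def get(self, force: bool = False) -> dict:
        """Return cached capabilities, re-probing only when stale or forced."""
        now = self._clock()
        stale = self._value is None or (now - self._stamp) >= self._ttl
        if force or stale:
            self._value = self._probe()
            self._stamp = now
        return self._value
```

With a 10-second TTL, a second load 40-70 seconds later would always find the entry stale and re-run the probe. With 300 seconds it is served from the cache, and an install path can still call `get(force=True)`.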
…plers
Tier 1+2 of the chat-LLM perf/quality review.
Performance (llama.cpp):
- --cache-reuse 256 + cache_prompt:true on the chat payload so a growing conversation reuses the slot KV and re-prefills only the new suffix instead of the whole history (turn-2+ TTFT drops sharply; was O(n^2)).
- Emit --flash-attn on when the user's fused_attention flag is set. It was plumbed into load_model + stored on LoadedModelInfo but never turned into a flag; threaded fused_attention into _build_command. Large Metal decode/KV-memory win; required for quantized KV cache types.
Quality (samplers):
- llama.cpp: add DRY (dry_multiplier/base/allowed_length), XTC (xtc_probability/threshold), top_n_sigma to _LLAMA_SAMPLER_KEYS (forward-only; old binaries ignore unknown fields).
- MLX: wire XTC into the sampler + add repeat_penalty via a new logits_processors builder. repeat_penalty was shown in the UI but silently dropped because mlx-lm applies it through logits_processors, not make_sampler.
- /v1 parity: forward min_p / repeat_penalty / mirostat(_tau/_eta) which the OpenAI-compat path dropped; added the request-model fields.
Tests: new sampler/logits/parity cases; touched-area suites green. The one failing test (dflash runtime-bundle) is pre-existing/unrelated (orphaned dflash-mlx pin, FU-057).
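As an illustration of the llama.cpp side, the sketch below builds a chat request body that sets `cache_prompt` and forwards only allow-listed sampler keys. `_LLAMA_SAMPLER_KEYS` is named in the commit, but its contents here are limited to the fields the message lists; `build_chat_payload` and the `stream` default are assumptions.

```python
from typing import Any, Dict, List, Mapping

# Sampler fields named in the commit message; the real allow-list is
# presumably longer (assumption).
_LLAMA_SAMPLER_KEYS = frozenset({
    "dry_multiplier", "dry_base", "dry_allowed_length",
    "xtc_probability", "xtc_threshold", "top_n_sigma",
})


def build_chat_payload(messages: List[Dict[str, str]],
                       overrides: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: assemble a chat request body for llama-server."""
    payload: Dict[str, Any] = {
        "messages": messages,
        "stream": True,
        # Ask the server to reuse the slot's KV cache for the shared prefix.
        "cache_prompt": True,
    }
    # Forward only allow-listed sampler keys that were actually set.
    for key, value in overrides.items():
        if key in _LLAMA_SAMPLER_KEYS and value is not None:
            payload[key] = value
    return payload
```

Leaving unset keys out of the payload keeps the server's own defaults in effect, which matches the "forward-only" note above.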
…oning
Tier 3 of the chat-LLM review.
- Stop re-feeding prior <think> reasoning into history every turn. The live chat path now passes preserve_reasoning=False; upstream Qwen3 / DeepSeek-R1 templates strip prior reasoning, and replaying it inflated the prompt each turn. _build_history_with_reasoning keeps the capability for callers that still want it (sessions side already passed False).
- Token-budgeted sliding window (optional token_budget arg). Keeps system messages + the newest turns that fit and drops the oldest, bounding prompt growth so a long chat can't silently truncate on llama.cpp or overflow context on MLX. Budget reserves room for system prompt + current prompt + max_tokens + template overhead, floors at 512 so the latest turn is always kept. Conservative ~3 chars/token estimate (no tokenizer at this layer) errs toward under-filling to avoid overflow.
Tests: +7 windowing/budget cases; generate-path + services suites green.
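The sliding window can be sketched as follows. This is an assumption-laden illustration, not the project's function: `window_history` and `estimate_tokens` are hypothetical names, system messages are assumed not to count against the budget, and the derivation of `token_budget` (reserving room for the system prompt, current prompt, max_tokens, and template overhead) is left to the caller.

```python
from typing import Dict, List

CHARS_PER_TOKEN = 3       # conservative estimate from the commit message
MIN_HISTORY_BUDGET = 512  # floor from the commit message

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Rough token count with no tokenizer: about 3 characters per token."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def window_history(history: List[Message], token_budget: int) -> List[Message]:
    """Keep all system messages plus the newest turns that fit the budget."""
    budget = max(token_budget, MIN_HISTORY_BUDGET)
    kept_indices = set()
    used = 0
    # Walk from newest to oldest, stopping at the first turn that does not fit.
    for index in range(len(history) - 1, -1, -1):
        message = history[index]
        if message["role"] == "system":
            continue
        cost = estimate_tokens(message["content"])
        # The newest non-system turn is kept even if it alone exceeds the budget.
        if kept_indices and used + cost > budget:
            break
        kept_indices.add(index)
        used += cost
    # Rebuild in the original order.
    return [m for i, m in enumerate(history)
            if m["role"] == "system" or i in kept_indices]
```

Dividing by 3 rather than a more typical 4 characters per token overestimates each turn's cost, so the window drops turns slightly early instead of overflowing.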
The native MLX chat path rebuilt a fresh cache every turn and re-prefilled the whole conversation. This keeps one persistent mlx-lm prompt cache on the worker and reuses the longest matching token prefix across turns: trim the divergent tail, prefill only the new suffix, re-commit keyed by prompt+generated tokens. A single-slot port of mlx-lm server's LRUPromptCache.fetch_nearest_cache.
- New backend_service/mlx_worker_prompt_cache.py: acquire / commit / invalidate. Gated to the native strategy (compression caches keep their path); guarded by can_trim_prompt_cache (SSM/Mamba/rotating-full reset, mlx-lm #980); resets on model change / no common prefix / partial trim / any exception -> fresh full prefill (identical output, no speedup).
- WorkerState gains _persist_cache / _persist_tokens / _persist_cache_model_ref; invalidated on every load / unload / profile change. generate_standard + stream_generate collect generated token ids (GenerationResponse.token) so the persisted token list always equals the cache's positional contents (exact next-turn trim).
Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit, same session): turn 1 promptTokens=34, turn 2 promptTokens=16 (vs ~90 without reuse) with a coherent, context-aware turn-2 answer -- ~5.6x less prompt processing, no corruption.
Tests: +12 reuse-logic cases (fake cache, trim accounting, all reset paths); MLX worker suite green (the one fail, dflash runtime-bundle, is the pre-existing orphaned-pin issue).
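The prefix-matching step can be sketched in plain Python, independent of mlx-lm. `plan_reuse` and its `-1` "reset" convention are hypothetical, and always leaving one prompt token to prefill is an assumption of this sketch. The real module trims an mlx-lm cache object, which is not modelled here.

```python
from typing import List, Sequence, Tuple


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Number of leading token ids shared by the cache and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def plan_reuse(cached: Sequence[int],
               prompt: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (tokens_to_trim_from_cache, suffix_to_prefill).

    A trim count of -1 means: discard the cache and prefill everything.
    """
    shared = common_prefix_len(cached, prompt)
    # Assumption: keep at least one prompt token to prefill, so the model
    # still produces logits for the first generated token.
    shared = min(shared, len(prompt) - 1)
    if shared <= 0:
        return -1, list(prompt)
    return len(cached) - shared, list(prompt[shared:])
```

For example, with cached tokens `[1, 2, 3, 4, 5]` and a new prompt `[1, 2, 3, 9, 10]`, the plan is to trim two cached positions and prefill `[9, 10]`. This only works if the persisted token list matches the cache position for position, which is why the commit collects generated token ids.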
Tier 2 added DRY/XTC/top-n-sigma at the engine layer but nothing populated them from a request, so they were unreachable. Complete the chain:
- GenerateRequest gains xtcProbability / xtcThreshold / dryMultiplier / dryBase / dryAllowedLength; _build_sampler_overrides maps them to the engine-side snake_case keys (llama-server forwards all via _LLAMA_SAMPLER_KEYS; mlx-lm applies XTC via make_sampler, ignores DRY).
- SamplerPanel gains xtc_probability / xtc_threshold / dry_multiplier rows; SamplerOverrides type + samplerOverrides storage/serialize carry the new fields.
XTC adds creative variety (both engines); DRY kills verbatim repetition loops better than repeat_penalty (llama.cpp). Both default off.
Tests: backend mapping + frontend projection/round-trip; tsc + vitest green.
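A minimal sketch of the request-to-engine mapping, assuming the request arrives as a plain mapping. The field and key names come from the commit message; `build_sampler_overrides` is an illustrative stand-in for `_build_sampler_overrides`, whose real signature is not shown in the source.

```python
from typing import Any, Dict, Mapping

# Request field -> engine key, for the fields named in the commit message.
_FIELD_TO_ENGINE_KEY = {
    "xtcProbability": "xtc_probability",
    "xtcThreshold": "xtc_threshold",
    "dryMultiplier": "dry_multiplier",
    "dryBase": "dry_base",
    "dryAllowedLength": "dry_allowed_length",
}


def build_sampler_overrides(request: Mapping[str, Any]) -> Dict[str, Any]:
    """Map request fields that were set to engine keys; unset ones stay off."""
    return {
        engine_key: request[field]
        for field, engine_key in _FIELD_TO_ENGINE_KEY.items()
        if request.get(field) is not None
    }
```

Omitting unset fields, rather than sending explicit zeros, is one way to keep both samplers off by default.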
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 7. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v5...v7) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v4...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6. - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](actions/setup-python@v5...v6) --- updated-dependencies: - dependency-name: actions/setup-python dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
…ites)
The chat SSE stream emitted one json.dumps + frame per token. Batch visible token text in the standard (non-tool) stream path and flush on a 24-char / 50ms window, before any non-token event (reasoning / reasoningDone / panic / thermal / error), and at stream end. Disabled when per-token logprobs are requested (they must stay 1:1 aligned); the agent/tool path is unchanged.
full_text accumulation + the per-token runaway / stall / loop guards are untouched, so persisted output + abort behaviour are identical; the frontend reassembles the same text from larger frames.
Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit): a ~40-token reply streamed as 9 token frames (avg ~20 chars) instead of ~40, text fully coherent + correctly ordered (phase -> tokens -> done). ~4-5x fewer frames.
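The coalescing window can be sketched as a small buffer. The 24-character and 50 ms thresholds come from the commit message; the `TokenCoalescer` class, its method names, and the injectable `clock` are assumptions. In this sketch the age check runs only when a token arrives, with no background timer.

```python
import time
from typing import Callable, List, Optional


class TokenCoalescer:
    """Hypothetical buffer that batches token text into fewer SSE frames."""

    def __init__(self, max_chars: int = 24, max_delay: float = 0.050,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_chars = max_chars
        self._max_delay = max_delay
        self._clock = clock
        self._parts: List[str] = []
        self._size = 0
        self._started = 0.0

    def push(self, text: str) -> Optional[str]:
        """Buffer one token; return a frame's text once a threshold is hit."""
        if not self._parts:
            self._started = self._clock()
        self._parts.append(text)
        self._size += len(text)
        too_big = self._size >= self._max_chars
        too_old = self._clock() - self._started >= self._max_delay
        return self.flush() if (too_big or too_old) else None

    def flush(self) -> Optional[str]:
        """Drain the buffer; call before non-token events and at stream end."""
        if not self._parts:
            return None
        text = "".join(self._parts)
        self._parts.clear()
        self._size = 0
        return text
```

A caller would emit a frame whenever `push` returns text, and call `flush` before forwarding any non-token event and once more when the stream ends, so that concatenating the frames reproduces the per-token text exactly.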
…dribble
The token coalescer landed on the standard stream path but the agent/tool path still emitted one frame per token, and the agent loop fake-streamed the already-computed final answer 4 chars at a time. Now the agent token forwarding uses the same coalescer (flush before toolCallStart / toolCallResult), and the agent loop emits the final answer in 48-char chunks instead of 4 -- less fake latency + fewer yields. Tool-call detection + execution flow is unchanged.
Tests: test_agent + test_backend_service green; identical coalescer mechanics to the live-validated standard path.
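To make the chunk-size change concrete, here is a small sketch of fixed-size chunking. `chunk_text` is a hypothetical helper; only the 4 and 48 character sizes come from the commit message.

```python
from typing import Iterator


def chunk_text(text: str, size: int = 48) -> Iterator[str]:
    """Yield consecutive slices of at most `size` characters."""
    for start in range(0, len(text), size):
        yield text[start:start + size]
```

A 100-character answer yields 3 chunks at size 48 and 25 chunks at size 4, and joining the chunks reproduces the original text in both cases.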
perf(chat): KV reuse, flash-attn, modern samplers, history window, token coalescing
New-feature gate (CLAUDE.md) for the chat-LLM work:
- "modern samplers (DRY+XTC)": a chat generate carrying xtcProbability + dryMultiplier must be accepted and produce tokens (proves the request field -> _build_sampler_overrides -> engine plumbing). Hard gate.
- "MLX prompt-cache reuse": two same-session turns; passes when turn-2 reprocesses fewer prompt tokens (cache reused), skips when it doesn't engage (a model whose generated tokens don't round-trip at the answer boundary, or a reasoning model -> correct graceful full-reprocess). Soft + non-flaky; the reuse/trim logic is hard-tested in tests/test_mlx_prompt_cache.py.
pre-build-check.sh needs no change: it already runs the pytest/tsc/vitest that cover the new unit tests, and no deps / pins / cache-types changed.
…m + vllm
Release upstream polish.
Deps (loose floor bumps, no code change):
- mlx-vlm 0.6.0 -> 0.6.3
- vllm 0.22.0 -> 0.22.1 ([vllm] + [triattention] extras)
Discover catalog -- two frontier sparse-MoE families (text-only, verified HF repos + real on-disk sizes):
- DeepSeek V4: Flash (284B / ~13B active, 1M ctx, baked-in MTP head) + Pro (1.6T). mlx-community 4-bit Flash (154 GB) is the local-viable entry; official BF16 + 8-bit + Pro listed for awareness.
- GLM-5 / GLM-5.1: GlmMoeDsa MoE (256 experts / 8 active, ~200K ctx). unsloth GGUF (Q4_K_M ~515 GB) + mlx-community MXFP4 + zai-org BF16.
Both text-only (configs carry no vision_config) so capabilities omit vision -- no broken composer affordance.
Tests + gate:
- tests/test_catalog_text_families.py: parse + required-field + text-only + discover-payload checks.
- E2E phase 0 "new model families" check asserts both surface in the live /api/workspace catalog with their full variant set.
Validated: phase 0 PASS, 11 checks.
Tracked follow-ups (not in this change): MTPLX installer already auto-updates to v1.0.1 (re-test FU-079 empty-output vs its new /v1 streaming); dflash-mlx v0.1.9 migration stays deferred (FU-057); llama-cpp-turboquant branch drifted (FU-065 commit-pin needs a verified test-build).
- FU-065: turbo branch drifted 2cbfdc62 -> 73eb521d (reproducibility risk confirmed; pin still deferred pending a verified test-compile).
- FU-079: MTPLX hit v1.0.0/v1.0.1 (installer auto-updates from 0.3.5); v1.0.0 added real /v1 token streaming -> re-test the empty-output against v1.0.1.
- FU-067: dflash-mlx v0.1.9 now tagged; FU-057 migration stays deferred.
Gemma 4 (gemma-4 family):
- E2B: 2B multimodal, 128K ctx -- official QAT Q4_0 GGUF (~1.5 GB) + BF16
- 31B: 31B multimodal, 256K ctx -- MLX 8-bit, unsloth Q4_K_M GGUF, official QAT GGUF, BF16
- Both carry vision capability (Gemma4ForConditionalGeneration + vision_config confirmed)
MiniMax M2.7 (minimax-m2 family):
- 256 routed experts / 8 active, 200K ctx, ~240B total params / ~480 GB BF16
- mlx-community MXFP4 (~120 GB), unsloth GGUF Q4_K_M (~130 GB), official BF16
Qwen3.7 skipped -- no official Qwen/Qwen3.7-* repo exists on HF as of 2026-06-12.
Tests: 7 catalog gate checks updated to cover all 4 frontier families (shape, vision vs text-only, context windows, discover payload presence).
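The vision vs text-only distinction rests on whether a model's config carries a `vision_config`. A minimal sketch of that rule follows, assuming the config is available as a parsed dict. `derive_capabilities` and the capability strings are hypothetical and not taken from the catalog code.

```python
from typing import Any, List, Mapping


def derive_capabilities(config: Mapping[str, Any]) -> List[str]:
    """Advertise vision only when the model config carries a vision_config."""
    capabilities = ["text"]
    if config.get("vision_config"):
        capabilities.append("vision")
    return capabilities
```

A text-only family then never advertises vision, so the UI has no reason to offer an image-attach affordance for it.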
…0.23.0
2026-06-15 upstream scan:
- turboquant-mlx-full 0.8.0: adds Mamba/hybrid arch support + GPT-OSS-120B optimizations. Same TurboQuantKVCache call surface, backward compatible. Floor: 0.6.2 → 0.8.0.
- vllm 0.23.0 released. Floor: 0.22.1 → 0.23.0 (both [vllm] and [triattention] extras).
No action needed:
- mlx-vlm 0.6.3: already at floor, unchanged.
- mlx-lm 0.31.3: installed version, loose >=0.22.0 floor sufficient.
- mlx 0.31.2: installed version, loose >=0.22.0 floor sufficient.
- diffusers 0.38.0: at floor, no new release.
- TriAttention: still at pinned c3744ee6 (v0.2.0), no upstream change.
Deferred (tracker notes updated):
- dflash-mlx: v0.1.10 now tagged; FU-057/067 migration still deferred.
- llama-server-turbo branch: HEAD drifted to 7985f6b9; FU-065 deferred.
- TurboQuant+: v0.3.2.3 latest, no PyPI wheel, FU-032 trigger not met.
- MTPLX: now v1.0.4; FU-079 re-test still pending.
feat(catalog): frontier model families + dep bumps (2026-06-15)
…ons/upload-artifact-7 chore(deps): bump actions/upload-artifact from 5 to 7
…ons/checkout-6 chore(deps): bump actions/checkout from 4 to 6
…ons/setup-python-6 chore(deps): bump actions/setup-python from 5 to 6
…ons-gte-10.21.3 chore(deps): update pymdown-extensions requirement from >=10.7 to >=10.21.3
chore(deps): update mkdocs requirement from >=1.6 to >=1.6.1
chore(deps): bump mkdocs-material from >=9.5 to >=9.7.6
…ri-plugin-opener-2.5.4 Bump tauri-plugin-opener from 2.5.3 to 2.5.4 in /src-tauri
…ri-plugin-dialog-2.7.1 Bump tauri-plugin-dialog from 2.7.0 to 2.7.1 in /src-tauri
…-0.4.46 chore(deps): bump tar from 0.4.45 to 0.4.46 in /src-tauri
…de_json-1.0.150 chore(deps): bump serde_json from 1.0.149 to 1.0.150 in /src-tauri
…ri-build-2.6.2 chore(deps): bump tauri-build from 2.6.0 to 2.6.2 in /src-tauri
…ri-2.11.2 chore(deps): bump tauri from 2.11.0 to 2.11.2 in /src-tauri
…t-i18n-4.0.0 chore(deps): bump rust-i18n from 3.1.2 to 4.0.0 in /src-tauri
…ort 4.1.0 API change)
…ED_LIMIT_INFORMATION (windows-sys 0.61.2)
No description provided.