
fix(cluster-labels): content-based vocab cache invalidation (survive reinstall) #323

Open
lstein wants to merge 1 commit into master from lstein/fix/vocab-cache-content-fingerprint

@lstein lstein commented Jun 9, 2026

Owner

Problem

The encoded tagging vocabulary was being rebuilt from scratch on first startup after a reinstall, even though the cached embeddings were still on disk and valid.

Root cause: both the vocab-embedding cache (_read_cached_vocab) and the per-album labels cache (_read_cached_labels) decided validity by comparing the cache file's mtime against the mtimes of the source vocab files. A pip reinstall rewrites the bundled cluster_vocab.txt with a fresh mtime even when its contents are byte-for-byte identical, so the mtime gate tripped and forced a multi-second CLIP re-encode of the entire vocabulary.
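
The mismatch between mtime and content is easy to reproduce in isolation. The following is a minimal sketch, assuming a reinstall behaves like rewriting the file with identical bytes. `content_fingerprint` and `simulate_reinstall` are hypothetical helpers written for illustration, not functions from this codebase:

```python
import hashlib
import os
from pathlib import Path


def content_fingerprint(path: Path) -> str:
    """Return the sha256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def simulate_reinstall(path: Path) -> None:
    """Rewrite the file with identical bytes and move its mtime forward.

    Assumption: this approximates what a package reinstall does to a
    bundled data file whose contents did not change.
    """
    data = path.read_bytes()
    path.write_bytes(data)
    st = path.stat()
    # Push mtime 5 seconds ahead so the change is visible even on
    # filesystems with coarse timestamp resolution.
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 5_000_000_000))
```

After `simulate_reinstall`, an mtime comparison reports the file as changed while `content_fingerprint` returns the same digest as before.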

(The build itself was triggered legitimately, by a surviving per-device autotaggingEnabled preference or a second browser tab; that part works as designed. This PR fixes the wasteful rebuild.)

Change

Switch both caches to purely content-based invalidation:

  • _read_cached_vocab — drop the mtime gate. The exact phrase set is already stamped into the .npz and compared (set(phrases) != set(current_phrases)), which catches every real change (edits, additions, user-extras deletion/rename) regardless of mtime direction.
  • _read_cached_labels — drop the vocab-mtime gate; rely on the stored vocab_fingerprint (sha256 content hash) already compared just below. The umap.npz and source-embeddings mtime checks stay, since those track album-local regeneration, not reinstalls.
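
The two content checks described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the cache is stored as JSON here to keep the example dependency-free (the real cache is an `.npz`), the function names are hypothetical, and the way `vocab_fingerprint` is derived (sha256 over the sorted phrase list) is a guess at one reasonable scheme.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def vocab_fingerprint(phrases: list[str]) -> str:
    """sha256 over the sorted, de-duplicated phrases (order-insensitive)."""
    joined = "\n".join(sorted(set(phrases)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def read_cached_vocab(cache_path: Path, current_phrases: list[str]) -> Optional[dict]:
    """Return the cached payload, or None when the phrase set has changed."""
    if not cache_path.exists():
        return None
    payload = json.loads(cache_path.read_text(encoding="utf-8"))
    # No mtime gate: the stored phrase set alone decides validity.
    if set(payload["phrases"]) != set(current_phrases):
        return None  # real content change, so the caller must re-encode
    return payload


def labels_cache_is_valid(stored_fingerprint: str, current_phrases: list[str]) -> bool:
    """The labels cache stays valid while the vocab content hash matches."""
    return stored_fingerprint == vocab_fingerprint(current_phrases)
```

Because both checks look only at content, a reinstall that leaves the phrases untouched passes them, while an added, removed, or renamed phrase fails them regardless of which way the mtime moved.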

The per-image in-process LRU (compute_image_label) intentionally keeps its mtime key: it's process-scoped, so a reinstall never reuses it anyway, and mtime is cheaper than hashing the vocabulary on that hot path.

Nothing here deletes the cache when autotagging is turned off — once encoded, the vocabulary stays cached across reinstalls for later use.

Tests

  • Repurposed the two tests that asserted "pure mtime bump → rebuild" to assert "content edit → rebuild" (the real intent).
  • Added regression tests proving a pure touch with identical content does not rebuild either the vocab-embedding cache or the per-album labels cache.
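
The shape of these two test families might look like the sketch below. It is not the PR's test code: `load_or_rebuild` is a hypothetical stand-in that persists a content fingerprint in place of the real embedding cache, and the check functions take a temporary directory the way a pytest `tmp_path` fixture would supply one.

```python
import hashlib
import os
from pathlib import Path


def load_or_rebuild(source: Path, cache_file: Path) -> bool:
    """Return True when a rebuild was needed, False on a cache hit."""
    current = hashlib.sha256(source.read_bytes()).hexdigest()
    if cache_file.exists() and cache_file.read_text(encoding="utf-8") == current:
        return False
    cache_file.write_text(current, encoding="utf-8")  # stands in for re-encoding
    return True


def touch(path: Path) -> None:
    """Advance mtime by one hour without changing the content."""
    st = path.stat()
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 3_600_000_000_000))


def check_touch_does_not_rebuild(tmp_dir: Path) -> None:
    vocab = tmp_dir / "cluster_vocab.txt"
    cache_file = tmp_dir / "vocab.fingerprint"
    vocab.write_text("sunset\nbeach\n", encoding="utf-8")
    assert load_or_rebuild(vocab, cache_file) is True   # first build
    touch(vocab)
    assert load_or_rebuild(vocab, cache_file) is False  # mtime only: cache hit


def check_content_edit_rebuilds(tmp_dir: Path) -> None:
    vocab = tmp_dir / "cluster_vocab.txt"
    cache_file = tmp_dir / "vocab.fingerprint"
    vocab.write_text("sunset\nbeach\n", encoding="utf-8")
    assert load_or_rebuild(vocab, cache_file) is True
    vocab.write_text("sunset\nbeach\nforest\n", encoding="utf-8")
    assert load_or_rebuild(vocab, cache_file) is True   # real edit: rebuild
```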

All green locally: 384 backend + 356 frontend tests pass, make lint clean.

🤖 Generated with Claude Code

…reinstall)

The vocab embedding cache and per-album labels cache were invalidated by
comparing file mtimes against the source vocab files. A `pip` reinstall
rewrites the bundled `cluster_vocab.txt` with a fresh mtime even when its
contents are byte-for-byte identical, so the next startup discarded a valid
cache and forced a multi-second CLIP re-encode of the whole vocabulary.

Switch both caches to purely content-based invalidation:

- `_read_cached_vocab`: drop the mtime gate. The exact phrase set is already
  stamped in the `.npz` and compared, which catches every real change (edits,
  additions, user-extras deletion/rename) regardless of mtime direction.
- `_read_cached_labels`: drop the vocab-mtime gate; rely on the stored
  `vocab_fingerprint` content hash already compared below. The umap.npz and
  source-embeddings mtime checks stay (those track album-local regeneration).

The per-image in-process LRU keeps its mtime key on purpose: it is
process-scoped, so a reinstall never reuses it anyway.

Tests: repurpose the two "mtime bump -> rebuild" tests to assert
"content edit -> rebuild", and add regression tests proving a pure `touch`
with identical content does NOT rebuild either cache.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>