Resolve AppKit and Agent Skills versions from compatibility manifest#5139
Resolve AppKit and Agent Skills versions from compatibility manifest#5139
Conversation
Approval status: pending
|
renaudhartert-db
left a comment
There was a problem hiding this comment.
Thanks for the PR @pkosiec. Could we have a chat internally about what you're trying to achieve? I'd like to make sure that this is aligned with the overall direction we're planning to evolve that command toward.
d7d005f to
3241ae2
Compare
|
@renaudhartert-db @simonfaltum Please take a look 🙏 I applied all the suggestions after discussion with Simon (thanks for the feedback!). Now the manifest resides in the CLI repo. |
4806f28 to
877ba5c
Compare
simonfaltum
left a comment
There was a problem hiding this comment.
Review summary
I ran a multi-agent review (Isaac + Cursor) on this PR. Both reviewers converged on REQUEST_CHANGES: the 4-tier fallback design is sound and tests are substantial, but the live remote manifest is treated as more trusted than its current validation justifies, and the embedded fallback is applied inconsistently across consumers.
Must-fix before merge:
parseManifestdoesn't validateappkit/skillsvalues — empty/non-semver values fall through andgit cloneends up using AppKit's default branch instead of a pinned tag.- Embedded not-found fallback is missing from
UpdateSkills,runManifestOnly, anddefaultListSkills— inconsistent recovery across commands. IsNotFoundErroruses fragile substring matching — replace with a sentinel error wrapped at the source.Resolvereturns the lowest entry when CLI < min(keys) — needs a documented never-prune policy or fail-and-fallback to embedded.writeLocalManifestremoves the cache before rename — failed rename leaves the user with no cache, defeating tier 3a.
Other findings posted inline below: 4xx retry suppression, missing User-Agent, debug-only cache write errors, naming/test cleanups, and questions about rollback story and offline first-run latency.
Some of these may overlap with the changes already requested by @renaudhartert-db; happy to defer where they do.
| func IsNotFoundError(err error) bool { | ||
| if err == nil { | ||
| return false | ||
| } | ||
| msg := strings.ToLower(err.Error()) | ||
| return strings.Contains(msg, "not found") || strings.Contains(msg, "404") | ||
| } |
There was a problem hiding this comment.
Important: IsNotFoundError is fragile substring matching across package boundaries.
The errors this inspects come from different packages (installer.GitHubManifestSource.FetchManifest, cmd/apps/init.go's awaitTemplate). Any string change in those callers silently breaks the fallback. It also matches false positives (e.g. "hostname not found", "exit code 404") and misses true positives like HTTP 410 Gone or git "reference does not exist" wording.
Suggested fix: define a sentinel var ErrNotFound = errors.New("compat target not found") in clicompat, and wrap returned errors at the source — installer/source.go already classifies HTTP 404 and could return fmt.Errorf("...: %w", clicompat.ErrNotFound). Then callers use errors.Is(err, clicompat.ErrNotFound). This decouples the fallback from error message wording.
There was a problem hiding this comment.
Replaced the strings.Contains(msg, "404") check with a typed HTTPStatusError (used by fetchRemote in clicompat and wrapped with ErrNotFound sentinel in installer/source.go for 404). Callers can now use errors.Is(err, clicompat.ErrNotFound).
Kept a narrow strings.Contains(msg, "not found") fallback for git clone errors — git.Clone produces untyped stderr text like "Remote branch X not found in upstream origin". Refactoring that to wrap a typed error is out of scope for this PR since it touches libs/git/ consumers beyond clicompat.
There was a problem hiding this comment.
Unfortunately we can't avoid that string matching as the git lib (that uses raw Git client) doesn't use typed errors; even if we refactored that package, we'd stil produce typed errors based on string matching, so I don't think it's worth to do it 🤔 Or, alternatively, the whole library. could be refactored to go-git? 🤔
|
|
||
| // httpClient is the HTTP client used for manifest fetches. Package-level var | ||
| // so tests can replace it. | ||
| var httpClient = &http.Client{Timeout: fetchTimeout} |
There was a problem hiding this comment.
Nit: httpClient is a mutable package var swapped by tests (clicompat_test.go). Fine today since tests don't t.Parallel(), but the moment anyone tries to parallelize this package it'll race.
Lower-effort fix: leave it but add a comment that this package is intentionally not parallel-test-safe. Better: inject the *http.Client into a small struct or pass via context.
There was a problem hiding this comment.
Added a comment. Leaving the package-level var as-is since injecting via struct or context would be a larger refactor for minimal gain (tests are intentionally sequential).
| // 1. Local cached file (if fresh, < 1 hour old) | ||
| // 2. Remote fetch from GitHub (with retry) | ||
| // 3. Stale local file (if remote fails but a previously cached file exists) | ||
| // 4. Embedded manifest compiled into the binary | ||
| func FetchManifest(ctx context.Context) (Manifest, error) { | ||
| localPath := manifestLocalPath(ctx) | ||
|
|
||
| // Read local file once — reuse across tiers. | ||
| local, localErr := readLocalManifest(localPath) | ||
|
|
||
| // Tier 1: local file is fresh. | ||
| if localErr == nil && local.isFresh(cacheTTL) { | ||
| log.Debugf(ctx, "Using cached manifest from %s", localPath) | ||
| return local.manifest, nil | ||
| } | ||
|
|
||
| // Tier 2: fetch from remote (local file missing or stale). | ||
| m, fetchErr := fetchRemoteWithRetry(ctx) | ||
| if fetchErr == nil { | ||
| writeLocalManifest(ctx, localPath, m) | ||
| return m, nil | ||
| } | ||
|
|
||
| // Tier 3a: local file exists but stale — use it anyway. | ||
| if localErr == nil { | ||
| log.Debugf(ctx, "Using stale cached manifest (remote failed: %v)", fetchErr) | ||
| return local.manifest, nil | ||
| } | ||
|
|
||
| // Tier 3b: embedded manifest. |
There was a problem hiding this comment.
Question: rollback story for a bad cli-compat.json?
The cache TTL is 1 hour, so a bad manifest poisons all clients for up to an hour after a fix is merged (cache hit on stale-but-fresh cached bad manifest). Is this acceptable, or should there be a way for users to force-refresh? An env var (DATABRICKS_CLI_COMPAT_REFRESH=1) or --refresh-compat flag would give an escape hatch.
There was a problem hiding this comment.
Three mitigations exist: (1) stale cache is preferred over embedded on remote failure, (2) embedded manifest is always available as last resort, (3) 1h TTL bounds exposure.
Also added DATABRICKS_CACHE_ENABLED=false support in this PR (consistent with libs/cache), which bypasses the local cache entirely and forces a fresh remote fetch. This gives users an immediate escape hatch if they hit a bad cached manifest.
| func ResolveAppKitVersion(ctx context.Context) (string, error) { | ||
| entry, err := resolveEntry(ctx, build.GetInfo().Version) | ||
| if err != nil { | ||
| return "", err | ||
| } | ||
| return entry.AppKit, nil | ||
| } | ||
|
|
||
| // ResolveAgentSkillsVersion resolves the Agent Skills version for the current CLI. | ||
| func ResolveAgentSkillsVersion(ctx context.Context) (string, error) { | ||
| entry, err := resolveEntry(ctx, build.GetInfo().Version) | ||
| if err != nil { | ||
| return "", err |
There was a problem hiding this comment.
Question: apps init flag help text uses embedded version, but runCreate calls the network-fetching ResolveAppKitVersion.
These can disagree (help text says default is X, but the actual default is Y). Probably fine, but worth confirming this is intentional and maybe a one-liner in the README to set the expectation.
There was a problem hiding this comment.
Good catch. Dropped the specific version from the help text — it now says "default: auto-detected" instead of showing the embedded version, since the actual default comes from the network-fetching ResolveAppKitVersion which may resolve a newer version.
877ba5c to
3fd14a4
Compare
- Embed cli-compat.json in the CLI binary with build-time fetch - Resolve AppKit and Agent Skills versions from the manifest - Add 1h local cache and retry for runtime manifest fetches - Show default AppKit version in --version flag help - Print resolved skills version during aitools install Co-authored-by: Isaac
- Guard printVersionLine against empty latestRef to prevent confusing "Update available: v" output when skills version resolution fails - Sort versionedKeys in TestEmbeddedManifest_IsWellFormed to prevent flaky test from map iteration randomness - Preserve embedded manifest parse error in FetchManifest error message - Add test for GetSkillsRef fallback to embedded manifest Co-authored-by: Isaac
- Rename libs/depversions/ to libs/clicompat/ (package + files) - Rename internal/build/dep_versions.go to clicompat.go - Rename EmbeddedManifestJSON to CLICompatManifestJSON - Extract devVersionPrefix const for "0.0.0-dev" - Wrap both errors with %w in FetchManifest fallback - Add template-v tag validation to bump-cli-compat skill - Remove confusing positional args from skill (named flags only) - Fix README: "After each AppKit or Agent Skills release" Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>
Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>
- Fix stale libs/depversions/ references in README (renamed to libs/clicompat/) - Fix incorrect "next" key description (only used for dev builds, not newer-than-all) - Fix import ordering (clicompat before cmdctx/cmdio alphabetically) - Fix slices import grouping in clicompat_test.go (stdlib, not separate group) - Fix error format: semicolon instead of period before hint text - Fix FetchManifest godoc: "4-tier fallback" to match numbered list - Fix writeLocalManifest comment: explain temp-file-then-rename pattern - Fix README example values to match actual manifest - Fix flaky test: pre-populate cache to avoid real network calls Co-authored-by: Isaac
When the manifest resolves to a version that doesn't exist as a git tag (404), retry with the version from the embedded manifest. Only triggers on "not found" errors, not transient network failures. Also: - Rename EmbeddedDefaultAppKitVersion/EmbeddedResolve* to ResolveEmbeddedAppKitVersion/ResolveEmbeddedAgentSkillsVersion - Remove duplicate log lines (keep only log.Warnf, drop cmdio.LogString) - Drop ctx parameter from ResolveEmbedded* (not needed) Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>
Address all review comments: add sentinel ErrNotFound and typed HTTPStatusError, validate manifest entry values, centralize not-found fallback for skills and AppKit consumers, skip retries on 4xx, add User-Agent header, support DATABRICKS_CACHE_ENABLED, and document pruning policy and trust model. Co-authored-by: Isaac
3fd14a4 to
702d30e
Compare
Summary
Introduces a CLI compatibility manifest (
internal/build/cli-compat.json) that maps CLI versions to compatible AppKit template and Agent Skills versions. This enables template updates to reach users without CLI releases.The manifest is resolved with a 3-tier fallback:
Companion PRs:
Screenshot