diff --git a/README.md b/README.md index aa49d41c..0cd8316a 100644 --- a/README.md +++ b/README.md @@ -177,6 +177,13 @@ provider-backed ELF evidence was required. rejects broad superiority claims and leaves qmd debug ergonomics, OpenViking trajectory, Letta core/archive, graph/RAG quality, and XY-930 private/provider gates as follow-up work. +- qmd debug-ergonomics retest after XY-982: the June 19 operator-debug live retest + keeps the qmd edge unchanged. ELF scores 6 pass/0 wrong_result with trace and + candidate-drop visibility across all six jobs, while qmd keeps replay commands on + all six jobs but records 0 pass/6 wrong_result because service trace hydration and + intermediate candidate-drop stages are not exposed. This confirms ELF's narrow + trace/stage visibility wins without erasing qmd's default top-k JSON and short CLI + replay advantage. - Full-suite live real-world adapter sweep after XY-926: ELF and qmd emit Docker-isolated `live_real_world` records for all 55 checked-in jobs across 13 suites through `cargo make real-world-memory-live-adapters`. 
Both keep the original @@ -285,6 +292,7 @@ Detailed evidence and interpretation: - [Proactive Brief Scoring Report - June 16, 2026](docs/evidence/benchmarking/2026-06-16-proactive-brief-scoring-report.md) - [Scheduled Memory Task Scoring Report - June 16, 2026](docs/evidence/benchmarking/2026-06-16-scheduled-memory-task-scoring-report.md) - [Dreaming Competitor-Strength Retest Report - June 17, 2026](docs/evidence/benchmarking/2026-06-17-dreaming-competitor-strength-retest-report.md) +- [qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026](docs/evidence/benchmarking/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md) - [Live Baseline Benchmark Runbook](docs/runbook/benchmarking/live_baseline_benchmark.md) - [Single-User Production Runbook](docs/runbook/single_user_production.md) - Benchmark contract: @@ -369,6 +377,7 @@ Detailed comparison, mechanism-level analysis, and source map: - [Proactive Brief Scoring Report - June 16, 2026](docs/evidence/benchmarking/2026-06-16-proactive-brief-scoring-report.md) - [Scheduled Memory Task Scoring Report - June 16, 2026](docs/evidence/benchmarking/2026-06-16-scheduled-memory-task-scoring-report.md) - [Dreaming Competitor-Strength Retest Report - June 17, 2026](docs/evidence/benchmarking/2026-06-17-dreaming-competitor-strength-retest-report.md) +- [qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026](docs/evidence/benchmarking/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md) - [Live Baseline Benchmark Runbook](docs/runbook/benchmarking/live_baseline_benchmark.md) - [Real-World Agent Memory Benchmark](docs/runbook/benchmarking/real_world_agent_memory_benchmark.md) - [External Memory Improvement Plan](docs/evidence/external_memory/external_memory_improvement_plan.md) @@ -380,10 +389,10 @@ Detailed comparison, mechanism-level analysis, and source map: - [Derived Knowledge Page Follow-Up Research](docs/research/derived_knowledge_page_followup.md) - [Dreaming Product Surface Follow-Up 
Research](docs/research/dreaming_product_surface_followup.md) -Latest real-world benchmark report: June 17, 2026. Latest external research refresh: -June 11, 2026; June 17 adds the Dreaming competitor-strength closeout retest and -optimization queue after the June 16 temporal reconciliation, live consolidation -self-check, proactive-brief, and scheduled-memory scoring evidence. +Latest real-world benchmark report: June 19, 2026. Latest external research refresh: +June 11, 2026; June 19 adds the qmd debug-ergonomics Dreaming retest after the June +17 competitor-strength closeout and the June 16 temporal reconciliation, live +consolidation self-check, proactive-brief, and scheduled-memory scoring evidence. ## Documentation diff --git a/apps/elf-eval/fixtures/report_snapshots/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.json b/apps/elf-eval/fixtures/report_snapshots/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.json new file mode 100644 index 00000000..8606c7e3 --- /dev/null +++ b/apps/elf-eval/fixtures/report_snapshots/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.json @@ -0,0 +1,229 @@ +{ + "schema": "elf.qmd_debug_ergonomics_dreaming_retest_report/v1", + "report_id": "xy-982-qmd-debug-ergonomics-dreaming-retest-2026-06-19", + "authority": "XY-982", + "created_at": "2026-06-19T04:48:00Z", + "purpose": "Retest qmd debug ergonomics after the Dreaming-readiness stages and XY-955 closeout while preserving local-debug artifact boundaries.", + "source_evidence_cutoff": "2026-06-19", + "source_baseline": { + "trace_replay_diagnostics_report": "docs/evidence/benchmarking/2026-06-11-elf-qmd-trace-replay-diagnostics-report.md", + "trace_replay_diagnostics_snapshot": "apps/elf-eval/fixtures/report_snapshots/2026-06-11-elf-qmd-trace-replay-diagnostics-report.json", + "dreaming_competitor_strength_retest_report": "docs/evidence/benchmarking/2026-06-17-dreaming-competitor-strength-retest-report.md", + "dreaming_competitor_strength_retest_snapshot": 
"apps/elf-eval/fixtures/report_snapshots/2026-06-17-dreaming-competitor-strength-retest-report.json", + "fresh_live_operator_debug_summary": "tmp/real-world-job/operator-ux-live-adapters/summary.json" + }, + "judgment_terms": [ + "improved", + "regressed", + "unchanged", + "not_tested", + "non_goal" + ], + "status_terms": [ + "pass", + "wrong_result", + "not_tested", + "not_encoded", + "typed_non_pass", + "non_goal" + ], + "summary": { + "overall_judgment": "unchanged_with_live_operator_debug_confirmation", + "debug_ergonomics_edge": "qmd_default_top10_and_short_cli_replay_preserved", + "broader_superiority": "not_proven", + "improved_scenario_count": 0, + "regressed_scenario_count": 0, + "unchanged_scenario_count": 6, + "not_tested_scenario_count": 3, + "non_goal_scenario_count": 1, + "unsupported_claims_rejected": [ + "ELF does not broadly beat qmd from this retest.", + "qmd's live operator-debug wrong_result rows do not erase qmd's default top-k and short CLI replay edge.", + "ELF trace/admin endpoint availability is not proof that the default stress report emits qmd-level candidate visibility.", + "Expansion, dense/sparse contribution, fusion, and rerank-on quality remain unproven until comparable artifacts are emitted." + ] + }, + "commands": [ + { + "command": "cargo make real-world-job-operator-ux-live-adapters", + "status": "pass", + "artifact": "tmp/real-world-job/operator-ux-live-adapters/summary.json", + "summary": { + "schema": "elf.real_world_operator_debug_live_adapter_sweep/v1", + "generated_at": "2026-06-19T04:48:00Z", + "boundary": "This narrow sweep scores operator-debugging fixtures only. It does not change core ranking, launch OpenMemory or claude-mem UI flows, or convert fixture-only UX evidence into broad product superiority." 
+ } + } + ], + "adapter_summaries": [ + { + "adapter_id": "elf_operator_debug_live", + "evidence_class": "live_real_world", + "job_count": 6, + "pass": 6, + "wrong_result": 0, + "expected_evidence_recall": 1.0, + "trace_available_count": 6, + "trace_incomplete_count": 0, + "replay_command_available_count": 6, + "candidate_drop_visibility": "stage visibility present across all jobs", + "repair_action_clear_count": 6, + "raw_sql_needed_count": 0, + "mean_score": 1.0, + "mean_latency_ms": 17.494 + }, + { + "adapter_id": "qmd_operator_debug_live", + "evidence_class": "live_real_world", + "job_count": 6, + "pass": 0, + "wrong_result": 6, + "expected_evidence_recall": 1.0, + "trace_available_count": 0, + "trace_incomplete_count": 6, + "replay_command_available_count": 6, + "candidate_drop_visibility": "qmd top-k replay output is available, but intermediate candidate-drop stages are not exposed", + "repair_action_clear_count": 6, + "raw_sql_needed_count": 0, + "mean_score": 0.658, + "mean_latency_ms": 1231.328 + } + ], + "scenario_retests": [ + { + "scenario_id": "qmd_default_top10_candidate_artifact", + "baseline_outcome": "loss", + "current_outcome": "loss", + "judgment": "unchanged", + "evidence": [ + "docs/evidence/benchmarking/2026-06-11-elf-qmd-trace-replay-diagnostics-report.md" + ], + "boundary": "qmd still exposes direct top-10 rows; ELF has trace ids and admin surfaces but no default qmd-like candidate artifact in the stress report." + }, + { + "scenario_id": "qmd_short_cli_replay", + "baseline_outcome": "loss", + "current_outcome": "loss", + "judgment": "unchanged", + "evidence": [ + "docs/evidence/benchmarking/2026-06-11-elf-qmd-trace-replay-diagnostics-report.md" + ], + "boundary": "qmd replay remains a short local CLI path; ELF replay still depends on service config, headers, traces, and bundle hydration." 
+ }, + { + "scenario_id": "elf_operator_debug_trace_hydration", + "baseline_outcome": "win", + "current_outcome": "win", + "judgment": "unchanged", + "evidence": [ + "tmp/real-world-job/operator-ux-live-adapters/summary.json" + ], + "current_counts": { + "elf_trace_available": 6, + "qmd_trace_available": 0, + "qmd_trace_incomplete": 6 + }, + "boundary": "ELF has trace visibility on 6/6 jobs; qmd has replay commands but no service trace hydration in this slice." + }, + { + "scenario_id": "operator_debug_replay_command_availability", + "baseline_outcome": "tie", + "current_outcome": "tie", + "judgment": "unchanged", + "evidence": [ + "tmp/real-world-job/operator-ux-live-adapters/summary.json" + ], + "current_counts": { + "elf_replay_command_available": 6, + "qmd_replay_command_available": 6 + }, + "boundary": "Both adapters emit replay commands on 6/6 jobs; this does not score equivalent UI quality." + }, + { + "scenario_id": "operator_debug_candidate_drop_visibility", + "baseline_outcome": "win", + "current_outcome": "win", + "judgment": "unchanged", + "evidence": [ + "tmp/real-world-job/operator-ux-live-adapters/summary.json" + ], + "current_counts": { + "elf_visible_jobs": 6, + "qmd_intermediate_stage_visible_jobs": 0 + }, + "typed_non_pass_states": [ + "retrieved_but_dropped" + ], + "boundary": "ELF exposes stage visibility; qmd exposes top-k output but not intermediate drops." + }, + { + "scenario_id": "operator_debug_selected_but_not_narrated_visibility", + "baseline_outcome": "win", + "current_outcome": "win", + "judgment": "unchanged", + "evidence": [ + "tmp/real-world-job/operator-ux-live-adapters/summary.json" + ], + "typed_non_pass_states": [ + "selected_but_not_narrated" + ], + "boundary": "ELF exposes final results and narration-stage details for the selected-but-not-narrated case; qmd does not expose an equivalent service trace surface." 
+ }, + { + "scenario_id": "query_expansion_attribution", + "baseline_outcome": "not_tested", + "current_outcome": "not_tested", + "judgment": "not_tested", + "boundary": "No comparable expansion-variant artifact exists for both systems." + }, + { + "scenario_id": "dense_sparse_channel_attribution", + "baseline_outcome": "not_tested", + "current_outcome": "not_tested", + "judgment": "not_tested", + "boundary": "Current artifacts still do not expose comparable dense-only and sparse-only contribution data." + }, + { + "scenario_id": "fusion_attribution", + "baseline_outcome": "not_tested", + "current_outcome": "not_tested", + "judgment": "not_tested", + "boundary": "Current artifacts still do not expose comparable fusion inputs, rank deltas, or dropped candidates." + }, + { + "scenario_id": "rerank_attribution", + "baseline_outcome": "non_goal", + "current_outcome": "non_goal", + "judgment": "non_goal", + "boundary": "The qmd materializer path remains a --no-rerank path for this evidence line." + } + ], + "claim_boundaries": { + "allowed": [ + "qmd's default local-debug edge remains: top-10 candidate rows plus short CLI replay.", + "ELF still wins the narrow live operator-debug trace hydration, candidate-drop visibility, and selected-but-not-narrated visibility slice.", + "Both systems still expose replay commands for the operator-debug fixtures.", + "The Dreaming-stage retest did not find a debug-ergonomics regression." + ], + "not_allowed": [ + "Do not claim ELF broadly beats qmd from this retest.", + "Do not treat qmd's 0 pass/6 wrong_result live operator-debug slice as proof that qmd's default top-k/replay edge is gone.", + "Do not claim expansion, fusion, dense/sparse contribution, or rerank parity until directly comparable artifacts are emitted.", + "Do not collapse not_tested, non_goal, or wrong_result into pass evidence." 
+ ] + }, + "next_optimization_direction": { + "priority": "P0", + "summary": "Emit comparable candidate-replay artifacts for both ELF and qmd before rerunning any broad debug-ergonomics claim.", + "required_fields": [ + "immediate_top_k_rows", + "expansion_variants", + "dense_only_candidates", + "sparse_only_candidates", + "fusion_rank_deltas", + "rerank_score_or_disabled_marker", + "dropped_or_demoted_expected_evidence", + "one_command_replay_for_each_system" + ] + } +} diff --git a/apps/elf-eval/tests/real_world_job_benchmark.rs b/apps/elf-eval/tests/real_world_job_benchmark.rs index e6aab322..a6fa7b0d 100644 --- a/apps/elf-eval/tests/real_world_job_benchmark.rs +++ b/apps/elf-eval/tests/real_world_job_benchmark.rs @@ -218,6 +218,18 @@ fn dreaming_competitor_strength_retest_report_markdown_path() -> Result<PathBuf> .join("2026-06-17-dreaming-competitor-strength-retest-report.md")) } +fn qmd_debug_ergonomics_dreaming_retest_report_json_path() -> Result<PathBuf> { + report_snapshot_path("2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.json") +} + +fn qmd_debug_ergonomics_dreaming_retest_report_markdown_path() -> Result<PathBuf> { + Ok(workspace_root()?
+ .join("docs") + .join("evidence") + .join("benchmarking") + .join("2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md")) +} + fn live_temporal_reconciliation_report_json_path() -> Result<PathBuf> { report_snapshot_path("2026-06-16-live-temporal-reconciliation-report.json") } @@ -2882,11 +2894,174 @@ fn dreaming_competitor_strength_retest_report_closes_xy955_without_overclaims() benchmarking_index.contains("2026-06-17-dreaming-competitor-strength-retest-report.md") ); assert!(readme.contains("Dreaming Competitor-Strength Retest Report - June 17, 2026")); - assert!(readme.contains("Latest real-world benchmark report: June 17, 2026")); + assert!(readme.contains("17 competitor-strength closeout")); + + Ok(()) +} + +#[test] +fn qmd_debug_ergonomics_dreaming_retest_report_preserves_qmd_edge() -> Result<()> { + let report = serde_json::from_str::<Value>(&fs::read_to_string( + qmd_debug_ergonomics_dreaming_retest_report_json_path()?, + )?)?; + let markdown = + fs::read_to_string(qmd_debug_ergonomics_dreaming_retest_report_markdown_path()?)?; + let benchmarking_index = fs::read_to_string(benchmarking_index_path()?)?; + let readme = fs::read_to_string(readme_path()?)?; + + assert_qmd_debug_retest_summary(&report)?; + assert_qmd_debug_retest_command_and_adapters(&report)?; + assert_qmd_debug_retest_scenarios(&report)?; + assert_qmd_debug_retest_boundaries(&report)?; + assert_qmd_debug_retest_markdown_and_indexes(&markdown, &benchmarking_index, &readme); + + Ok(()) +} + +fn assert_qmd_debug_retest_summary(report: &Value) -> Result<()> { + assert_eq!( + report.pointer("/schema").and_then(Value::as_str), + Some("elf.qmd_debug_ergonomics_dreaming_retest_report/v1") + ); + assert_eq!(report.pointer("/authority").and_then(Value::as_str), Some("XY-982")); + assert_eq!( + report.pointer("/summary/overall_judgment").and_then(Value::as_str), + Some("unchanged_with_live_operator_debug_confirmation") + ); + assert_eq!( + 
report.pointer("/summary/debug_ergonomics_edge").and_then(Value::as_str),
+ Some("qmd_default_top10_and_short_cli_replay_preserved") + ); + assert_eq!( + report.pointer("/summary/broader_superiority").and_then(Value::as_str), + Some("not_proven") + ); + assert_eq!(report.pointer("/summary/improved_scenario_count").and_then(Value::as_u64), Some(0)); + assert_eq!( + report.pointer("/summary/regressed_scenario_count").and_then(Value::as_u64), + Some(0) + ); + assert_eq!( + report.pointer("/summary/unchanged_scenario_count").and_then(Value::as_u64), + Some(6) + ); + assert!(array_contains_str( + report, + "/summary/unsupported_claims_rejected", + "qmd's live operator-debug wrong_result rows do not erase qmd's default top-k and short CLI replay edge." + )?); Ok(()) } +fn assert_qmd_debug_retest_command_and_adapters(report: &Value) -> Result<()> { + let command = find_by_field( + array_at(report, "/commands")?, + "/command", + "cargo make real-world-job-operator-ux-live-adapters", + )?; + + assert_eq!(command.pointer("/status").and_then(Value::as_str), Some("pass")); + assert_eq!( + command.pointer("/summary/schema").and_then(Value::as_str), + Some("elf.real_world_operator_debug_live_adapter_sweep/v1") + ); + + let adapters = array_at(report, "/adapter_summaries")?; + let elf = find_by_field(adapters, "/adapter_id", "elf_operator_debug_live")?; + let qmd = find_by_field(adapters, "/adapter_id", "qmd_operator_debug_live")?; + + assert_eq!(elf.pointer("/job_count").and_then(Value::as_u64), Some(6)); + assert_eq!(elf.pointer("/pass").and_then(Value::as_u64), Some(6)); + assert_eq!(elf.pointer("/wrong_result").and_then(Value::as_u64), Some(0)); + assert_eq!(elf.pointer("/trace_available_count").and_then(Value::as_u64), Some(6)); + assert_eq!(elf.pointer("/replay_command_available_count").and_then(Value::as_u64), Some(6)); + assert_eq!(qmd.pointer("/job_count").and_then(Value::as_u64), Some(6)); + assert_eq!(qmd.pointer("/pass").and_then(Value::as_u64), Some(0)); + assert_eq!(qmd.pointer("/wrong_result").and_then(Value::as_u64), Some(6)); + 
assert_eq!(qmd.pointer("/trace_available_count").and_then(Value::as_u64), Some(0)); + assert_eq!(qmd.pointer("/trace_incomplete_count").and_then(Value::as_u64), Some(6)); + assert_eq!(qmd.pointer("/replay_command_available_count").and_then(Value::as_u64), Some(6)); + + Ok(()) +} + +fn assert_qmd_debug_retest_scenarios(report: &Value) -> Result<()> { + let scenarios = array_at(report, "/scenario_retests")?; + let top10 = find_by_field(scenarios, "/scenario_id", "qmd_default_top10_candidate_artifact")?; + let replay = find_by_field(scenarios, "/scenario_id", "qmd_short_cli_replay")?; + let trace = find_by_field(scenarios, "/scenario_id", "elf_operator_debug_trace_hydration")?; + let candidate = + find_by_field(scenarios, "/scenario_id", "operator_debug_candidate_drop_visibility")?; + let expansion = find_by_field(scenarios, "/scenario_id", "query_expansion_attribution")?; + let fusion = find_by_field(scenarios, "/scenario_id", "fusion_attribution")?; + let rerank = find_by_field(scenarios, "/scenario_id", "rerank_attribution")?; + + assert_eq!(scenarios.len(), 10); + assert_eq!(top10.pointer("/judgment").and_then(Value::as_str), Some("unchanged")); + assert_eq!(top10.pointer("/current_outcome").and_then(Value::as_str), Some("loss")); + assert_eq!(replay.pointer("/current_outcome").and_then(Value::as_str), Some("loss")); + assert_eq!( + trace.pointer("/current_counts/elf_trace_available").and_then(Value::as_u64), + Some(6) + ); + assert_eq!( + trace.pointer("/current_counts/qmd_trace_available").and_then(Value::as_u64), + Some(0) + ); + assert_eq!( + candidate + .pointer("/current_counts/qmd_intermediate_stage_visible_jobs") + .and_then(Value::as_u64), + Some(0) + ); + assert!(array_contains_str(candidate, "/typed_non_pass_states", "retrieved_but_dropped")?); + assert_eq!(expansion.pointer("/judgment").and_then(Value::as_str), Some("not_tested")); + assert_eq!(fusion.pointer("/judgment").and_then(Value::as_str), Some("not_tested")); + 
assert_eq!(rerank.pointer("/judgment").and_then(Value::as_str), Some("non_goal")); + + Ok(()) +} + +fn assert_qmd_debug_retest_boundaries(report: &Value) -> Result<()> { + assert!(array_contains_str( + report, + "/claim_boundaries/allowed", + "qmd's default local-debug edge remains: top-10 candidate rows plus short CLI replay." + )?); + assert!(array_contains_str( + report, + "/claim_boundaries/not_allowed", + "Do not claim ELF broadly beats qmd from this retest." + )?); + assert!(array_contains_str( + report, + "/next_optimization_direction/required_fields", + "fusion_rank_deltas" + )?); + + Ok(()) +} + +fn assert_qmd_debug_retest_markdown_and_indexes( + markdown: &str, + benchmarking_index: &str, + readme: &str, +) { + assert!(markdown.contains("The qmd debug-ergonomics outcome is unchanged")); + assert!(markdown.contains("ELF 6 pass/0 wrong_result; qmd 0 pass/6 wrong_result")); + assert!( + markdown.contains("Do not treat qmd's 0 pass/6 wrong_result live operator-debug slice") + ); + assert!(markdown.contains("Immediate top-k rows with source id")); + assert!( + benchmarking_index.contains("2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md") + ); + assert!(readme.contains("qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026")); + assert!(readme.contains("Latest real-world benchmark report: June 19, 2026")); + assert!(readme.contains("keeps the qmd edge unchanged")); +} + fn assert_xy955_commands(report: &Value) -> Result<()> { let commands = array_at(report, "/commands")?; let aggregate = find_by_field(commands, "/command", "cargo make real-world-memory")?; diff --git a/docs/evidence/benchmarking/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md b/docs/evidence/benchmarking/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md new file mode 100644 index 00000000..53cda6bf --- /dev/null +++ b/docs/evidence/benchmarking/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md @@ -0,0 +1,128 @@ +--- +type: Evidence +title: "qmd 
Debug-Ergonomics Dreaming Retest Report - June 19, 2026" +description: "Checked-in benchmark evidence record: qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026." +resource: docs/evidence/benchmarking/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md +status: active +authority: current_state +owner: evidence +last_verified: 2026-06-19 +tags: + - docs + - evidence + - benchmarking +--- +# qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026 + +Goal: Close XY-982 by retesting the qmd debug-ergonomics follow-up after the +Dreaming-readiness stages and the XY-955 competitor-strength closeout. +Read this when: You need to know whether Dreaming-stage improvements erased, +improved, or regressed the qmd local-debug artifact edge. +Inputs: +`apps/elf-eval/fixtures/report_snapshots/2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.json`, +`docs/evidence/benchmarking/2026-06-11-elf-qmd-trace-replay-diagnostics-report.md`, +`docs/evidence/benchmarking/2026-06-17-dreaming-competitor-strength-retest-report.md`, +and the fresh `tmp/real-world-job/operator-ux-live-adapters/summary.json` output. +Outputs: Scenario-level improved/regressed/unchanged/not-tested/non-goal judgments +for qmd debug ergonomics, with claim boundaries and next optimization direction. + +## Executive Judgment + +The qmd debug-ergonomics outcome is unchanged after the Dreaming stages. + +The fresh live operator-debug retest confirms ELF's narrow trace/stage visibility +advantage: + +- `cargo make real-world-job-operator-ux-live-adapters` passed on June 19, 2026. +- ELF scored 6 operator-debug jobs, with 6 pass, 0 wrong_result, trace visibility on + all 6 jobs, replay commands on all 6 jobs, and no raw SQL requirement. +- qmd scored the same 6 operator-debug jobs, with 0 pass and 6 wrong_result because + local replay output is available but service trace hydration and intermediate + candidate-drop stages are not exposed in this live slice. 
+ +This does not erase qmd's measured debug edge from the June 11 diagnostics. qmd still +preserves the default top-10 candidate JSON and short local CLI replay advantage. +ELF has useful service trace/admin surfaces, but the default stress/report artifacts +still do not emit a directly comparable qmd-style candidate artifact with expansion, +fusion, rerank, and dropped-candidate stage details. + +No retested debug-ergonomics scenario regressed. No broad ELF-over-qmd superiority +claim is supported. + +## Command Evidence + +| Command | Status | Artifact | Result | +| --- | --- | --- | --- | +| `cargo make real-world-job-operator-ux-live-adapters` | `pass` | `tmp/real-world-job/operator-ux-live-adapters/summary.json` | ELF 6 pass/0 wrong_result; qmd 0 pass/6 wrong_result. | + +## Fresh Live Retest + +| Adapter | Jobs | Pass | Wrong result | Trace available | Replay available | Candidate-drop visibility | Raw SQL needed | +| --- | --- | --- | --- | --- | --- | --- | --- | +| ELF operator-debug live | 6 | 6 | 0 | 6 | 6 | stage visibility present across all jobs | 0 | +| qmd operator-debug live | 6 | 0 | 6 | 0 | 6 | top-k replay output only; no intermediate candidate-drop stages | 0 | + +The qmd rows are typed non-pass for this live operator-debug slice, not a regression +of qmd's default local replay surface. qmd remains useful for direct local top-k +inspection. + +## Scenario Retest Matrix + +| Scenario | June 11 baseline | June 19 retest | Judgment | Boundary | +| --- | --- | --- | --- | --- | +| qmd default top-10 candidate artifact | ELF `loss` | ELF `loss` | `unchanged` | qmd still exposes direct top-10 rows; ELF has trace ids and admin surfaces but no default qmd-like candidate artifact in the stress report. | +| qmd short CLI replay | ELF `loss` | ELF `loss` | `unchanged` | qmd replay remains a short local CLI path; ELF replay still depends on service config, headers, traces, and bundle hydration. 
| +| ELF operator-debug trace hydration | ELF `win` | ELF `win` | `unchanged` | ELF has trace visibility on 6/6 jobs; qmd has replay commands but no service trace hydration in this slice. | +| Operator-debug replay command availability | `tie` | `tie` | `unchanged` | Both adapters emit replay commands on 6/6 jobs; this does not score equivalent UI quality. | +| Operator-debug candidate-drop visibility | ELF `win` | ELF `win` | `unchanged` | ELF exposes stage visibility; qmd exposes top-k output but not intermediate drops. | +| Operator-debug selected-but-not-narrated visibility | ELF `win` | ELF `win` | `unchanged` | ELF exposes final results and narration-stage details for the selected-but-not-narrated case; qmd does not expose an equivalent service trace surface. | +| Query expansion attribution | `not_tested` | `not_tested` | `not_tested` | No comparable expansion-variant artifact exists for both systems. | +| Dense/sparse channel attribution | `not_tested` | `not_tested` | `not_tested` | Current artifacts still do not expose comparable dense-only and sparse-only contribution data. | +| Fusion attribution | `not_tested` | `not_tested` | `not_tested` | Current artifacts still do not expose comparable fusion inputs, rank deltas, or dropped candidates. | +| Rerank attribution | `non_goal` | `non_goal` | `non_goal` | The qmd materializer path remains a `--no-rerank` path for this evidence line. | + +## Improvement and Regression Readback + +| Bucket | Count | Meaning | +| --- | --- | --- | +| `improved` | 0 | The retest did not add a new comparable default artifact that beats qmd's local debug surface. | +| `regressed` | 0 | No checked scenario moved backward from the June 11 or June 17 evidence. | +| `unchanged` | 6 | qmd keeps the default top-k/replay edge; ELF keeps the operator-debug trace/stage visibility wins. | +| `not_tested` | 3 | Expansion, dense/sparse contribution, and fusion are still missing comparable artifacts. 
| +| `non_goal` | 1 | Rerank scoring remains out of scope for the qmd `--no-rerank` materializer path. | + +## Claim Boundaries + +Allowed: + +- qmd's default local-debug edge remains: top-10 candidate rows plus short CLI replay. +- ELF still wins the narrow live operator-debug trace hydration, candidate-drop + visibility, and selected-but-not-narrated visibility slice. +- Both systems still expose replay commands for the operator-debug fixtures. +- The Dreaming-stage retest did not find a debug-ergonomics regression. + +Not allowed: + +- Do not claim ELF broadly beats qmd from this retest. +- Do not treat qmd's 0 pass/6 wrong_result live operator-debug slice as proof that + qmd's default top-k/replay edge is gone. +- Do not claim expansion, fusion, dense/sparse contribution, or rerank parity until + directly comparable artifacts are emitted. +- Do not collapse `not_tested`, `non_goal`, or `wrong_result` into pass evidence. + +## Next Optimization Direction + +The next useful improvement is not another broad leaderboard rerun. It is a comparable +candidate-replay artifact for both ELF and qmd that emits: + +1. Immediate top-k rows with source id, file or note id, score, snippet, and rank. +2. Expansion variants and whether the original query was retained. +3. Dense-only and sparse-only candidate sets. +4. Fusion rank deltas and score contributions. +5. Rerank score, or an explicit rerank-disabled marker. +6. Dropped or demoted expected evidence. +7. One-command replay lines for both systems. + +Until that exists, the correct conclusion is unchanged: qmd keeps the default local +debug artifact edge, while ELF keeps the service-backed operator-debug trace/stage +visibility wins. 
diff --git a/docs/evidence/benchmarking/index.md b/docs/evidence/benchmarking/index.md index 2f7c6428..609f4cf0 100644 --- a/docs/evidence/benchmarking/index.md +++ b/docs/evidence/benchmarking/index.md @@ -36,3 +36,4 @@ Routes to: Benchmarking evidence concepts under `docs/evidence/benchmarking/`. - `2026-06-16-proactive-brief-scoring-report.md`: Proactive Brief Scoring Report - June 16, 2026. - `2026-06-16-scheduled-memory-task-scoring-report.md`: Real-World Job Benchmark Report. - `2026-06-17-dreaming-competitor-strength-retest-report.md`: Dreaming Competitor-Strength Retest Report - June 17, 2026. +- `2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md`: qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026; confirms qmd's default top-k/replay edge is unchanged while ELF keeps the narrow operator-debug trace/stage visibility wins.
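
Illustrative note on the snapshot above, not part of this change: the `summary` counts in the JSON report (0 `improved`, 0 `regressed`, 6 `unchanged`, 3 `not_tested`, 1 `non_goal`) should agree with the ten `scenario_retests` rows. A minimal sketch of that consistency check follows. The helper name `tally_judgments` is hypothetical, and the judgments are hard-coded from the snapshot rather than parsed with `serde_json` so the sketch stays self-contained.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not in the repository): count how often each
/// judgment label appears across the scenario rows.
fn tally_judgments<'a>(judgments: &[&'a str]) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for judgment in judgments {
        *counts.entry(*judgment).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Judgments copied by hand from the ten `scenario_retests` rows
    // in the June 19 snapshot, in the order they appear there.
    let judgments = [
        "unchanged",  // qmd_default_top10_candidate_artifact
        "unchanged",  // qmd_short_cli_replay
        "unchanged",  // elf_operator_debug_trace_hydration
        "unchanged",  // operator_debug_replay_command_availability
        "unchanged",  // operator_debug_candidate_drop_visibility
        "unchanged",  // operator_debug_selected_but_not_narrated_visibility
        "not_tested", // query_expansion_attribution
        "not_tested", // dense_sparse_channel_attribution
        "not_tested", // fusion_attribution
        "non_goal",   // rerank_attribution
    ];

    let counts = tally_judgments(&judgments);
    let count_of = |label: &str| counts.get(label).copied().unwrap_or(0);

    // These mirror the `summary` counts recorded in the snapshot.
    assert_eq!(judgments.len(), 10);
    assert_eq!(count_of("improved"), 0);
    assert_eq!(count_of("regressed"), 0);
    assert_eq!(count_of("unchanged"), 6);
    assert_eq!(count_of("not_tested"), 3);
    assert_eq!(count_of("non_goal"), 1);

    println!("{counts:?}");
}
```

In the real test the rows would presumably come from `array_at(report, "/scenario_retests")` and each row's `/judgment` pointer. Deriving the tallies from the rows, rather than asserting the summary numbers alone, would also catch a row that changed without its summary count being updated.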