
pqhuy1987 b223466ded

Deploy SOLUTION_ERP / build-deploy (push) Successful in 3m52s


[CLAUDE] Docs: setup RAG Framework v1.3 governance + eval framework

- docs/governance/README.md: Path B delegation stub → AI_INFRA canonical
  Phase/BC vocabulary documented (9 phase + 10 BC SOLUTION_ERP-specific)
- .claude/rag.json: add _decision_log block (10 rationale entries) +
  add .claude/agents/**/*.md to corpus_paths (fix Case D harvest gap)
- eval/evaluator.md: inline executor spec v1.0 (Spec A strict)
- eval/golden-set-solution_erp.jsonl: 14-entry golden set v1.1
  (5 gotcha + 3 pattern + 3 decision + 3 negative)
- eval/runs/2026-05-26-baseline-v1.0-failed.json: v1.0 attempt
  recall@5=0.455 FAIL — root cause diagnosis Case A/C/D
- eval/runs/2026-05-26-baseline-v1.1-pending.json: v1.1 attempt
  pending CLI restart for accurate numbers
- eval/trial-state-lock.json: 2-section split (quality_gate +
  drift_monitor) per v1.3 §6.2, 4-week milestones 2026-05-26 → 2026-06-23

CRITICAL lesson: bootstrap.py --project flag overrides collection name only.
Use --config D:\...\SOLUTION_ERP\.claude\rag.json for correct project root.
Old projects.json had root_path=AI_INFRA for solution_erp (Anti #24) — FIXED.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-26 13:14:23 +07:00

2.6 KiB


Eval Executor Spec — SOLUTION_ERP

Version: v1.0 (2026-05-26)
Spec: A, Strict (expected chunk must appear in top-5, rerank ≥ 0.7 = confident hit)
Framework: RAG v1.3 §6.3 (Spec A vs B locked BEFORE first baseline)
Companion: RAG-FRAMEWORK-V1.3-SETUP-GUIDE.md §6

Execution protocol

1. Run search_memory for each query

# Fire all 14 queries in parallel (MCP tool)
mcp__rag-unified__search_memory(
    query=<query>,
    scope="self",      # project = solution_erp
    top_k=5,
    use_rerank=True
)
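The parallel fan-out can be sketched in Python as below. This is only an illustration: `search_memory` is an MCP tool, not a Python function, so the sketch takes a hypothetical `search_fn` callable supplied by the caller as a stand-in. The result shape (a list of dicts per query) is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_all_queries(
    queries: list[str],
    search_fn: Callable[..., list[dict]],  # hypothetical stand-in for the MCP tool
    max_workers: int = 14,
) -> dict[str, list[dict]]:
    """Run every query with the fixed Spec A parameters and collect results."""

    def run_one(query: str) -> list[dict]:
        # Same parameters as the tool call above; only the query text varies.
        return search_fn(query=query, scope="self", top_k=5, use_rerank=True)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with queries.
        results = list(pool.map(run_one, queries))
    return dict(zip(queries, results))
```

Keeping the parameters fixed inside `run_one` means a run cannot drift from the locked `scope`, `top_k`, and `use_rerank` values by accident.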

2. Scoring per query (Spec A — Strict)

Condition	Outcome
Expected source_path appears in top-5 AND its rerank ≥ 0.7	✅ HIT
Expected source_path appears in top-5 BUT its rerank < 0.7	✗ MISS (Case A suspect)
Expected source_path NOT in top-5	✗ MISS (classify Case B/C/D)
Negative query: 0 results OR all rerank scores < 0.7	✅ CORRECT EXCLUSION
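A minimal sketch of the Spec A scoring rules follows. The `Result` type and the function names are hypothetical, and two points are assumptions rather than part of the spec: the expected sources are passed as a set (the golden set refers to `expected_source_paths`), and when several top-5 chunks share an expected path, the best rerank score among them is used.

```python
from dataclasses import dataclass

RERANK_THRESHOLD = 0.7  # Spec A confidence cutoff
TOP_K = 5


@dataclass
class Result:  # hypothetical shape of one search hit
    source_path: str
    rerank: float


def score_positive(expected_sources: set[str], results: list[Result]) -> str:
    """Return 'HIT', 'MISS_LOW_RERANK' (Case A suspect), or 'MISS_NOT_IN_TOP_K'."""
    matches = [r.rerank for r in results[:TOP_K] if r.source_path in expected_sources]
    if not matches:
        return "MISS_NOT_IN_TOP_K"
    # Assumption: the best-scoring matching chunk decides the outcome.
    return "HIT" if max(matches) >= RERANK_THRESHOLD else "MISS_LOW_RERANK"


def score_negative(results: list[Result]) -> bool:
    """True means correct exclusion: no results, or every rerank below threshold."""
    return all(r.rerank < RERANK_THRESHOLD for r in results[:TOP_K])
```

`all()` over an empty list is `True`, so the "0 results" branch needs no special case.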

3. recall@5 calculation

recall@5 = hits / positive_queries
positive_queries = 11 (q01-q11, excluding 3 negative q12-q14)
gate_threshold = 0.7 → must hit ≥ 8/11
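The arithmetic can be checked with a short sketch (function names are illustrative). With 11 positive queries, 7 hits gives about 0.636 and fails, while 8 hits gives about 0.727 and passes, which is where the 8/11 minimum comes from.

```python
POSITIVE_QUERIES = 11  # q01-q11; the 3 negative queries are excluded
GATE_THRESHOLD = 0.7


def recall_at_5(hits: int, positive_queries: int = POSITIVE_QUERIES) -> float:
    """Fraction of positive queries scored as HIT."""
    return hits / positive_queries


def passes_gate(hits: int, positive_queries: int = POSITIVE_QUERIES) -> bool:
    """True when recall@5 meets or exceeds the gate threshold."""
    return recall_at_5(hits, positive_queries) >= GATE_THRESHOLD


for hits in (5, 7, 8):
    print(hits, round(recall_at_5(hits), 3), passes_gate(hits))
```

The 5-hit row rounds to 0.455, the same recall@5 value the commit message reports for the failed v1.0 baseline.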

4. Case classification for failures

Per v1.3 §10:

Case A: chunk in top-5 but rerank low → threshold calibration
Case B: chunk NOT top-5 but IS top-20 → retrieval param tuning
Case C: chunk NOT top-20 but verbatim phrase IS in corpus → rerank context-density bias
Case D: verbatim phrase NOT in corpus → harvest gap
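The four cases form a decision ladder, sketched below. The function name and its inputs are hypothetical: `rank` is assumed to be the 1-based position of the expected chunk in a top-20 retrieval (or `None` if absent), and `phrase_in_corpus` is assumed to come from a separate verbatim search of the corpus.

```python
from typing import Optional

RERANK_THRESHOLD = 0.7


def classify_case(
    rank: Optional[int],        # 1-based rank within top-20, None if not retrieved
    rerank: Optional[float],    # rerank score of the expected chunk, if retrieved
    phrase_in_corpus: bool,     # verbatim phrase found anywhere in the corpus
) -> Optional[str]:
    """Return None for a hit, otherwise the failure case 'A', 'B', 'C', or 'D'."""
    if rank is not None and rank <= 5:
        if rerank is not None and rerank >= RERANK_THRESHOLD:
            return None  # in top-5 and confident: not a failure
        return "A"       # in top-5 but rerank low: threshold calibration
    if rank is not None and rank <= 20:
        return "B"       # in top-20 only: retrieval param tuning
    # Not in top-20: the corpus check separates C from D.
    return "C" if phrase_in_corpus else "D"
```

Checking the cases in this order matters: Case C and Case D are only meaningful once the top-20 lookup has come back empty.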

5. Output format

Save to eval/runs/YYYY-MM-DD-baseline-vN.N.json:

{
  "run_date": "YYYY-MM-DD",
  "golden_set_version": "vN.N",
  "spec": "A",
  "results": [
    {
      "id": "q01",
      "query": "...",
      "expected_source": "...",
      "hit": true/false,
      "top_1_source": "...",
      "top_1_rerank": 0.000,
      "case": null/"A"/"B"/"C"/"D"
    }
  ],
  "recall_at_5": 0.000,
  "avg_top1_rerank": 0.000,
  "pass_gate": true/false
}
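A sketch of assembling and saving a run file in this shape is shown below. Function names are hypothetical, and two behaviours are assumptions not stated in the spec: entries whose `expected_source` is `None` are treated as negative queries and excluded from recall, and `avg_top1_rerank` is averaged over every entry that has a `top_1_rerank` value.

```python
import json
from pathlib import Path

GATE_THRESHOLD = 0.7


def build_run_report(run_date: str, golden_set_version: str, results: list[dict]) -> dict:
    """Compute the summary fields from per-query result records."""
    # Assumption: negative queries carry expected_source = None.
    positives = [r for r in results if r["expected_source"] is not None]
    hits = sum(1 for r in positives if r["hit"])
    recall = hits / len(positives) if positives else 0.0
    reranks = [r["top_1_rerank"] for r in results if r["top_1_rerank"] is not None]
    avg_rerank = sum(reranks) / len(reranks) if reranks else 0.0
    return {
        "run_date": run_date,
        "golden_set_version": golden_set_version,
        "spec": "A",
        "results": results,
        "recall_at_5": round(recall, 3),
        "avg_top1_rerank": round(avg_rerank, 3),
        "pass_gate": recall >= GATE_THRESHOLD,
    }


def save_run_report(report: dict, runs_dir: str = "eval/runs") -> Path:
    """Write the report to <runs_dir>/YYYY-MM-DD-baseline-vN.N.json."""
    name = f"{report['run_date']}-baseline-{report['golden_set_version']}.json"
    path = Path(runs_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

The gate decision uses the unrounded recall, so rounding for display cannot flip `pass_gate`.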

Golden set file

eval/golden-set-solution_erp.jsonl — 14 entries (immutable during trial period)

Mutation rules:

❌ DO NOT rephrase query mid-trial (Anti #11)
❌ DO NOT modify expected_source_paths post-baseline (Anti #12)
✅ Version bump v1.0 → v1.1 OK WITH lock of prior version + transparent re-author (AI_INFRA lesson §3.5)
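One way to detect an accidental mid-trial mutation is to fingerprint the fields the rules protect. The sketch below is an assumption, not part of the framework: it presumes each JSONL entry has `id`, `query`, and `expected_source_paths` keys, and that a locked version's fingerprint is recorded somewhere for later comparison.

```python
import hashlib
import json


def golden_set_fingerprint(jsonl_text: str) -> str:
    """SHA-256 over the id, query, and expected_source_paths of every entry."""
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # Only the protected fields are hashed, so edits to other fields are ignored.
    canonical = json.dumps(
        [[e["id"], e["query"], e["expected_source_paths"]] for e in entries],
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unchanged(jsonl_text: str, locked_fingerprint: str) -> bool:
    """True when the protected fields match the locked version."""
    return golden_set_fingerprint(jsonl_text) == locked_fingerprint
```

A version bump would then record a new fingerprint for v1.1 while the v1.0 fingerprint stays on file unchanged.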

Weekly Friday execution

Fire the same 14 queries (no modification)
Score → recall@5 + avg_top1_rerank
Compare vs eval/trial-state-lock.json baseline
Check chunk_count drift (Qdrant LIVE vs baseline)
Update lock file milestone status
If recall < gate → apply §15.1 4-cause triage
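The comparison steps in the weekly routine can be sketched as a single check. The function and its field names are hypothetical, and the baseline values are passed in directly because the schema of `eval/trial-state-lock.json` is not shown here.

```python
GATE_THRESHOLD = 0.7


def weekly_check(
    recall: float,
    baseline_recall: float,
    chunk_count: int,           # current Qdrant chunk count
    baseline_chunk_count: int,  # chunk count recorded at baseline
    gate: float = GATE_THRESHOLD,
) -> dict:
    """Summarise this week's run against the locked baseline."""
    return {
        "recall_delta": round(recall - baseline_recall, 3),
        "chunk_count_drift": chunk_count - baseline_chunk_count,
        # Triage is triggered by the gate, not by the delta against baseline.
        "needs_triage": recall < gate,
    }
```

Reporting the delta and the drift separately from `needs_triage` keeps the quality gate and the drift monitor as independent signals.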
