# Eval Executor Spec — SOLUTION_ERP > **Version:** v1.0 (2026-05-26) > **Spec:** A — Strict (expected chunk must appear in top-5, rerank ≥ 0.7 = confident hit) > **Framework:** RAG v1.3 §6.3 — Spec A vs B locked BEFORE first baseline > **Companion:** `RAG-FRAMEWORK-V1.3-SETUP-GUIDE.md` §6 --- ## Execution protocol ### 1. Run search_memory for each query ```python # Fire all 14 queries in parallel (MCP tool) mcp__rag-unified__search_memory( query=, scope="self", # project = solution_erp top_k=5, use_rerank=True ) ``` ### 2. Scoring per query (Spec A — Strict) | Hit condition | Score | |---|---| | Expected source_path appears in top-5 AND rerank ≥ 0.7 | ✅ HIT | | Expected source_path appears in top-5 BUT rerank < 0.7 | ✗ MISS (Case A suspect) | | Expected source_path NOT in top-5 | ✗ MISS — classify Case B/C/D | | Negative query: 0 results OR all rerank < 0.7 | ✅ CORRECT EXCLUSION | ### 3. recall@5 calculation ``` recall@5 = hits / positive_queries positive_queries = 11 (q01-q11, excluding 3 negative q12-q14) gate_threshold = 0.7 → must hit ≥ 8/11 ``` ### 4. Case classification for failures Per v1.3 §10: - **Case A:** chunk in top-5 but rerank low → threshold calibration - **Case B:** chunk NOT top-5 but IS top-20 → retrieval param tuning - **Case C:** chunk NOT top-20 but verbatim phrase IS in corpus → rerank context-density bias - **Case D:** verbatim phrase NOT in corpus → harvest gap ### 5. Output format Save to `eval/runs/YYYY-MM-DD-baseline-vN.N.json`: ```json { "run_date": "YYYY-MM-DD", "golden_set_version": "vN.N", "spec": "A", "results": [ { "id": "q01", "query": "...", "expected_source": "...", "hit": true/false, "top_1_source": "...", "top_1_rerank": 0.000, "case": null/"A"/"B"/"C"/"D" } ], "recall_at_5": 0.000, "avg_top1_rerank": 0.000, "pass_gate": true/false } ``` --- ## Golden set file `eval/golden-set-solution_erp.jsonl` — 14 entries (immutable during trial period) **Mutation rules:** - ❌ DO NOT rephrase query mid-trial (Anti #11) - ❌ DO NOT modify expected_source_paths post-baseline (Anti #12) - ✅ Version bump v1.0 → v1.1 OK WITH lock of prior version + transparent re-author (AI_INFRA lesson §3.5) --- ## Weekly Friday execution 1. Fire 14 queries SAME (no modification) 2. Score → recall@5 + avg_rerank 3. Compare vs `eval/trial-state-lock.json` baseline 4. Check chunk_count drift (Qdrant LIVE vs baseline) 5. Update lock file milestone status 6. If recall < gate → apply §15.1 4-cause triage