[CLAUDE] Docs: setup RAG Framework v1.3 governance + eval framework

- docs/governance/README.md: Path B delegation stub → AI_INFRA canonical
  Phase/BC vocabulary documented (9 phase + 10 BC SOLUTION_ERP-specific)
- .claude/rag.json: add _decision_log block (10 rationale entries) +
  add .claude/agents/**/*.md to corpus_paths (fix Case D harvest gap)
- eval/evaluator.md: inline executor spec v1.0 (Spec A strict)
- eval/golden-set-solution_erp.jsonl: 14-entry golden set v1.1
  (5 gotcha + 3 pattern + 3 decision + 3 negative)
- eval/runs/2026-05-26-baseline-v1.0-failed.json: v1.0 attempt
  recall@5=0.455 FAIL — root cause diagnosis Case A/C/D
- eval/runs/2026-05-26-baseline-v1.1-pending.json: v1.1 attempt
  pending CLI restart for accurate numbers
- eval/trial-state-lock.json: 2-section split (quality_gate +
  drift_monitor) per v1.3 §6.2, 4-week milestones 2026-05-26 → 2026-06-23

CRITICAL lesson: bootstrap.py --project flag overrides collection name only.
Use --config D:\...\SOLUTION_ERP\.claude\rag.json for correct project root.
Old projects.json had root_path=AI_INFRA for solution_erp (Anti #24) — FIXED.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pqhuy1987
2026-05-26 13:14:23 +07:00
parent c506919d7d
commit b223466ded
7 changed files with 342 additions and 2 deletions

eval/evaluator.md Normal file

@@ -0,0 +1,95 @@
# Eval Executor Spec — SOLUTION_ERP
> **Version:** v1.0 (2026-05-26)
> **Spec:** A — Strict (expected chunk must appear in top-5, rerank ≥ 0.7 = confident hit)
> **Framework:** RAG v1.3 §6.3 — Spec A vs B locked BEFORE first baseline
> **Companion:** `RAG-FRAMEWORK-V1.3-SETUP-GUIDE.md` §6
---
## Execution protocol
### 1. Run search_memory for each query
```python
# Fire all 14 queries in parallel (MCP tool)
mcp__rag-unified__search_memory(
    query=<query>,   # placeholder: query text of one golden-set entry
    scope="self",    # project = solution_erp
    top_k=5,
    use_rerank=True
)
```
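If the queries are driven from a script rather than issued directly as MCP tool calls, the fan-out could look like the sketch below. `search_fn` is a hypothetical stand-in for whatever wrapper invokes `search_memory`; the spec does not define one, and the `id` and `query` field names of the golden-set entries are assumed.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def load_golden_set(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def run_all(entries, search_fn, max_workers=14):
    """Run every golden-set query concurrently.

    search_fn is a hypothetical callable (query, scope, top_k, use_rerank)
    wrapping the search_memory tool; it is NOT part of the spec.
    Returns {entry id: search results} in golden-set order.
    """
    def one(entry):
        return search_fn(query=entry["query"], scope="self",
                         top_k=5, use_rerank=True)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        results = list(pool.map(one, entries))
    return {e["id"]: r for e, r in zip(entries, results)}
```

Keeping the search parameters fixed inside `run_all` mirrors the rule that every run uses the same `scope`, `top_k`, and `use_rerank` settings.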
### 2. Scoring per query (Spec A — Strict)
| Hit condition | Score |
|---|---|
| Expected source_path appears in top-5 AND rerank ≥ 0.7 | ✅ HIT |
| Expected source_path appears in top-5 BUT rerank < 0.7 | MISS (Case A suspect) |
| Expected source_path NOT in top-5 | MISS → classify as Case B/C/D |
| Negative query: 0 results OR all rerank < 0.7 | CORRECT EXCLUSION |
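The table above can be sketched as a small scoring function. This is only an illustration under stated assumptions: the result shape (a ranked list of dicts with `source_path` and `rerank` keys) is hypothetical because the spec does not show the `search_memory` response format, the rerank score checked is that of the expected chunk itself, and the `FALSE_POSITIVE` label for a failed negative query is not defined in the spec.

```python
RERANK_THRESHOLD = 0.7  # Spec A confident-hit cutoff


def score_query(results, expected_source=None, top_k=5):
    """Label one query under Spec A.

    results: ranked list of dicts with assumed keys
             "source_path" and "rerank".
    expected_source: expected path, or None for a negative query.
    """
    top = results[:top_k]
    if expected_source is None:
        # Negative query: correct if nothing is confidently returned
        # (all() is also True for an empty result list).
        if all(r["rerank"] < RERANK_THRESHOLD for r in top):
            return "CORRECT_EXCLUSION"
        return "FALSE_POSITIVE"  # label assumed, not in the spec
    scores = [r["rerank"] for r in top
              if r["source_path"] == expected_source]
    if not scores:
        return "MISS_NOT_IN_TOP_5"
    if max(scores) >= RERANK_THRESHOLD:
        return "HIT"
    return "MISS_CASE_A_SUSPECT"
```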
### 3. recall@5 calculation
```
recall@5 = hits / positive_queries
positive_queries = 11 (q01-q11, excluding 3 negative q12-q14)
gate_threshold = 0.7 → must hit ≥ 8/11
```
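The 8/11 figure follows from the threshold: 7 hits give 7/11 ≈ 0.636, below the gate, while 8 hits give 8/11 ≈ 0.727. A minimal sketch of the arithmetic (the function names are illustrative, not part of the spec):

```python
def recall_at_5(hits, positive_queries=11):
    """Fraction of positive queries scored as HIT."""
    return hits / positive_queries


def min_hits_to_pass(positive_queries=11, gate_threshold=0.7):
    """Smallest hit count whose recall reaches the gate."""
    for hits in range(positive_queries + 1):
        if recall_at_5(hits, positive_queries) >= gate_threshold:
            return hits
    return None  # gate unreachable (threshold above 1.0)


print(min_hits_to_pass())        # 8
print(round(recall_at_5(8), 3))  # 0.727
print(round(recall_at_5(7), 3))  # 0.636
```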
### 4. Case classification for failures
Per v1.3 §10:
- **Case A:** chunk in top-5 but rerank low → threshold calibration
- **Case B:** chunk NOT in top-5 but IS in top-20 → retrieval param tuning
- **Case C:** chunk NOT in top-20 but verbatim phrase IS in corpus → rerank context-density bias
- **Case D:** verbatim phrase NOT in corpus → harvest gap
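The four cases form a decision ladder that can be sketched as follows. The inputs are assumptions: `rank` would come from a wider top-20 search and `phrase_in_corpus` from a separate verbatim check against the corpus, neither of which the spec describes in detail.

```python
def classify_miss(rank, phrase_in_corpus):
    """Return the failure case letter for a missed positive query.

    rank: 1-based position of the expected chunk in a top-20 search,
          or None if it is absent from the top 20 (assumed input).
    phrase_in_corpus: True if the verbatim phrase was found in the
          indexed corpus (assumed to be checked separately).
    """
    if rank is not None and rank <= 5:
        return "A"  # retrieved in top-5, so the miss is a low rerank score
    if rank is not None and rank <= 20:
        return "B"  # retrievable, but outside the top-5
    if phrase_in_corpus:
        return "C"  # present in corpus, not surfaced in top-20
    return "D"      # phrase was never harvested into the corpus
```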
### 5. Output format
Save to `eval/runs/YYYY-MM-DD-baseline-vN.N.json`:
```json
{
  "run_date": "YYYY-MM-DD",
  "golden_set_version": "vN.N",
  "spec": "A",
  "results": [
    {
      "id": "q01",
      "query": "...",
      "expected_source": "...",
      "hit": true/false,
      "top_1_source": "...",
      "top_1_rerank": 0.000,
      "case": null/"A"/"B"/"C"/"D"
    }
  ],
  "recall_at_5": 0.000,
  "avg_top1_rerank": 0.000,
  "pass_gate": true/false
}
```
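The template uses placeholders such as `true/false`, so a real run file has to pick concrete values. One way to assemble a valid record is sketched below. The helper names are hypothetical, and two choices are assumptions the spec does not settle: which ids count as positive is passed in by the caller, and `avg_top1_rerank` is averaged over positive queries that returned at least one result.

```python
import json


def build_run_record(run_date, golden_set_version, results,
                     positive_ids, gate_threshold=0.7):
    """Assemble a run record in the shape of the template above.

    results: per-query dicts carrying the template's fields.
    positive_ids: ids counted toward recall (q01-q11 in this spec).
    """
    positives = [r for r in results if r["id"] in positive_ids]
    hits = sum(1 for r in positives if r["hit"])
    recall = hits / len(positive_ids)
    # Assumption: skip queries with no top-1 result when averaging.
    reranks = [r["top_1_rerank"] for r in positives
               if r.get("top_1_rerank") is not None]
    avg = sum(reranks) / len(reranks) if reranks else 0.0
    return {
        "run_date": run_date,
        "golden_set_version": golden_set_version,
        "spec": "A",
        "results": results,
        "recall_at_5": round(recall, 3),
        "avg_top1_rerank": round(avg, 3),
        "pass_gate": recall >= gate_threshold,
    }


def run_path(run_date, golden_set_version):
    """File name following the eval/runs naming pattern."""
    return f"eval/runs/{run_date}-baseline-{golden_set_version}.json"


def dump_record(record):
    """Serialize with real JSON booleans and nulls."""
    return json.dumps(record, indent=2)
```

The gate is evaluated on the unrounded recall so that rounding for display cannot flip `pass_gate`.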
---
## Golden set file
`eval/golden-set-solution_erp.jsonl`: 14 entries (immutable during trial period)
**Mutation rules:**
- DO NOT rephrase query mid-trial (Anti #11)
- DO NOT modify expected_source_paths post-baseline (Anti #12)
- Version bump v1.0 → v1.1 OK WITH lock of prior version + transparent re-author (AI_INFRA lesson §3.5)
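A mechanical check for the first two rules could compare the current golden set against the locked copy, entry by entry. The sketch below assumes each JSONL entry has `id`, `query`, and `expected_source_paths` fields; only the last name appears in this spec, so treat the others as hypothetical.

```python
import json


def find_mutations(locked_lines, current_lines):
    """Compare a locked golden set against the current one.

    Both arguments are iterables of JSONL lines. Returns a sorted
    list of (entry id, what changed) pairs, where the second item is
    "query", "expected_source_paths", or "entry" (added or removed).
    """
    def index(lines):
        parsed = (json.loads(line) for line in lines if line.strip())
        return {entry["id"]: entry for entry in parsed}

    locked, current = index(locked_lines), index(current_lines)
    changes = []
    for qid in sorted(set(locked) | set(current)):
        if qid not in locked or qid not in current:
            changes.append((qid, "entry"))
            continue
        for field in ("query", "expected_source_paths"):
            if locked[qid].get(field) != current[qid].get(field):
                changes.append((qid, field))
    return changes
```

An empty list means the set is unchanged; anything else would call for a version bump rather than an in-place edit.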
---
## Weekly Friday execution
1. Fire the SAME 14 queries (no modification)
2. Score recall@5 + avg_rerank
3. Compare vs `eval/trial-state-lock.json` baseline
4. Check chunk_count drift (Qdrant LIVE vs baseline)
5. Update lock file milestone status
6. If recall < gate → apply §15.1 4-cause triage
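Steps 3, 4, and 6 reduce to a comparison against the locked baseline. A minimal sketch, taking plain numbers as input because the layouts of `eval/trial-state-lock.json` and the Qdrant chunk count query are not shown here; the function name and the returned keys are hypothetical.

```python
def weekly_check(recall, chunk_count, baseline_recall,
                 baseline_chunk_count, gate_threshold=0.7):
    """Summarize one weekly run against the locked baseline.

    Returns the recall change, the chunk_count drift, and whether
    recall fell below the gate (which triggers triage in step 6).
    """
    return {
        "recall_delta": round(recall - baseline_recall, 3),
        "chunk_count_delta": chunk_count - baseline_chunk_count,
        "needs_triage": recall < gate_threshold,
    }
```

No drift tolerance is applied to `chunk_count_delta`, since the spec does not state one; the delta is only reported.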