[CLAUDE] Docs: setup RAG Framework v1.3 governance + eval framework
All checks were successful
Deploy SOLUTION_ERP / build-deploy (push) Successful in 3m52s
- docs/governance/README.md: Path B delegation stub → AI_INFRA canonical Phase/BC vocabulary documented (9 phase + 10 BC SOLUTION_ERP-specific)
- .claude/rag.json: add _decision_log block (10 rationale entries) + add .claude/agents/**/*.md to corpus_paths (fix Case D harvest gap)
- eval/evaluator.md: inline executor spec v1.0 (Spec A strict)
- eval/golden-set-solution_erp.jsonl: 14-entry golden set v1.1 (5 gotcha + 3 pattern + 3 decision + 3 negative)
- eval/runs/2026-05-26-baseline-v1.0-failed.json: v1.0 attempt recall@5=0.455 FAIL; root cause diagnosis Case A/C/D
- eval/runs/2026-05-26-baseline-v1.1-pending.json: v1.1 attempt pending CLI restart for accurate numbers
- eval/trial-state-lock.json: 2-section split (quality_gate + drift_monitor) per v1.3 §6.2, 4-week milestones 2026-05-26 → 2026-06-23

CRITICAL lesson: bootstrap.py --project flag overrides collection name only.
Use --config D:\...\SOLUTION_ERP\.claude\rag.json for correct project root.
Old projects.json had root_path=AI_INFRA for solution_erp (Anti #24); FIXED.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
95
eval/evaluator.md
Normal file
@ -0,0 +1,95 @@
# Eval Executor Spec: SOLUTION_ERP

> **Version:** v1.0 (2026-05-26)
> **Spec:** A (Strict): expected chunk must appear in top-5, rerank ≥ 0.7 = confident hit
> **Framework:** RAG v1.3 §6.3, Spec A vs B locked BEFORE first baseline
> **Companion:** `RAG-FRAMEWORK-V1.3-SETUP-GUIDE.md` §6

---

## Execution protocol

### 1. Run search_memory for each query

```python
# Fire all 14 golden-set queries in parallel (MCP tool).
# Note: this is MCP tool-call notation, not importable Python.
mcp__rag-unified__search_memory(
    query=<query>,       # placeholder: query text from the golden set
    scope="self",        # project = solution_erp
    top_k=5,
    use_rerank=True
)
```

### 2. Scoring per query (Spec A, Strict)

| Hit condition | Score |
|---|---|
| Expected source_path appears in top-5 AND rerank ≥ 0.7 | ✅ HIT |
| Expected source_path appears in top-5 BUT rerank < 0.7 | ✗ MISS (Case A suspect) |
| Expected source_path NOT in top-5 | ✗ MISS (classify Case B/C/D) |
| Negative query: 0 results OR all rerank < 0.7 | ✅ CORRECT EXCLUSION |

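
The scoring table can be read as a small decision function. The sketch below is only an illustration, not the project's evaluator: the `Result` shape (`source_path`, `rerank_score`) and both function names are assumptions, since the `search_memory` response schema is not shown in this spec. It also assumes the 0.7 threshold applies to the expected chunk's own rerank score.

```python
from dataclasses import dataclass

RERANK_THRESHOLD = 0.7  # Spec A confidence cut-off
TOP_K = 5


@dataclass
class Result:
    """Hypothetical shape of one search result (not the real MCP schema)."""
    source_path: str
    rerank_score: float


def score_positive(expected_source: str, results: list[Result]) -> bool:
    """HIT only if the expected source is in the top-5 with rerank >= 0.7."""
    for r in results[:TOP_K]:
        if r.source_path == expected_source and r.rerank_score >= RERANK_THRESHOLD:
            return True
    return False


def score_negative(results: list[Result]) -> bool:
    """CORRECT EXCLUSION if there are no results or every rerank is < 0.7."""
    # all() over an empty list is True, which covers the "0 results" case.
    return all(r.rerank_score < RERANK_THRESHOLD for r in results[:TOP_K])
```

Keeping positive and negative scoring in separate functions mirrors the table: negative queries never count toward recall, so they should not share a code path with hits.
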
### 3. recall@5 calculation

```
recall@5 = hits / positive_queries
positive_queries = 11 (q01-q11, excluding 3 negative q12-q14)
gate_threshold = 0.7 → must hit ≥ 8/11
```

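
As a quick check of the arithmetic, the formula and the gate can be sketched in a few lines. The function and constant names here are illustrative assumptions; only the numbers (11 positive queries, threshold 0.7) come from the spec.

```python
import math

POSITIVE_QUERIES = 11  # q01-q11; negatives q12-q14 are excluded
GATE_THRESHOLD = 0.7


def recall_at_5(hits: int, positive_queries: int = POSITIVE_QUERIES) -> float:
    """recall@5 = hits / positive_queries."""
    if not 0 <= hits <= positive_queries:
        raise ValueError(f"hits must be between 0 and {positive_queries}")
    return hits / positive_queries


def passes_gate(hits: int) -> bool:
    """True when recall@5 meets or exceeds the gate threshold."""
    return recall_at_5(hits) >= GATE_THRESHOLD


# Smallest hit count that clears the gate: ceil(0.7 * 11) = 8
min_hits = math.ceil(GATE_THRESHOLD * POSITIVE_QUERIES)
```

With 7 hits the recall is 7/11 ≈ 0.636, below the gate; with 8 hits it is 8/11 ≈ 0.727, which is why the spec states the requirement as ≥ 8/11.
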
### 4. Case classification for failures

Per v1.3 §10:
- **Case A:** chunk in top-5 but rerank low → threshold calibration
- **Case B:** chunk NOT top-5 but IS top-20 → retrieval param tuning
- **Case C:** chunk NOT top-20 but verbatim phrase IS in corpus → rerank context-density bias
- **Case D:** verbatim phrase NOT in corpus → harvest gap

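
The four cases form a simple cascade, which the following sketch makes explicit. It is a hypothetical helper, not part of the framework: it assumes the expected chunk's rank comes from a wider follow-up search (top-20), and that `phrase_in_corpus` has been determined separately, for example by a text search over the corpus.

```python
from typing import Optional

TOP_K = 5
WIDE_K = 20
RERANK_THRESHOLD = 0.7


def classify_case(
    rank: Optional[int],      # 1-based rank of the expected chunk, None if not retrieved
    rerank_score: float,      # rerank score of the expected chunk (ignored if rank is None)
    phrase_in_corpus: bool,   # whether the verbatim phrase exists in the corpus
) -> Optional[str]:
    """Return None for a hit, otherwise the failure case letter."""
    if rank is not None and rank <= TOP_K:
        # In top-5: either a confident hit or a low-rerank miss (Case A).
        return None if rerank_score >= RERANK_THRESHOLD else "A"
    if rank is not None and rank <= WIDE_K:
        return "B"  # retrievable, but outside top-5
    # Not in top-20: the corpus check separates Case C from Case D.
    return "C" if phrase_in_corpus else "D"
```

The order of the checks matters: rank is tested before corpus membership, so a chunk that is retrievable at all is never labeled C or D.
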
### 5. Output format

Save to `eval/runs/YYYY-MM-DD-baseline-vN.N.json`. In this template, `true/false` and `null/"A"/"B"/"C"/"D"` list the allowed values; write exactly one of them in the actual file:

```json
{
  "run_date": "YYYY-MM-DD",
  "golden_set_version": "vN.N",
  "spec": "A",
  "results": [
    {
      "id": "q01",
      "query": "...",
      "expected_source": "...",
      "hit": true/false,
      "top_1_source": "...",
      "top_1_rerank": 0.000,
      "case": null/"A"/"B"/"C"/"D"
    }
  ],
  "recall_at_5": 0.000,
  "avg_top1_rerank": 0.000,
  "pass_gate": true/false
}
```

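
One way to produce a file in this shape is sketched below. The helper names are hypothetical, and two choices are assumptions rather than rules from the spec: `results` is taken to contain the positive queries only, and `avg_top1_rerank` is averaged over those same entries.

```python
import json
from pathlib import Path
from statistics import mean

GATE_THRESHOLD = 0.7


def build_run(run_date: str, version: str, results: list[dict]) -> dict:
    """Assemble the run document from per-query result dicts (positives only)."""
    if not results:
        raise ValueError("results must not be empty")
    recall = sum(1 for r in results if r["hit"]) / len(results)
    return {
        "run_date": run_date,
        "golden_set_version": version,
        "spec": "A",
        "results": results,
        "recall_at_5": round(recall, 3),
        "avg_top1_rerank": round(mean(r["top_1_rerank"] for r in results), 3),
        "pass_gate": recall >= GATE_THRESHOLD,
    }


def save_run(run: dict, runs_dir: Path = Path("eval/runs")) -> Path:
    """Write the run as eval/runs/YYYY-MM-DD-baseline-vN.N.json and return the path."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    name = f"{run['run_date']}-baseline-{run['golden_set_version']}.json"
    path = runs_dir / name
    path.write_text(json.dumps(run, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

The gate is evaluated on the unrounded recall so that rounding for display cannot flip `pass_gate`.
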
---

## Golden set file

`eval/golden-set-solution_erp.jsonl`: 14 entries (immutable during trial period)

**Mutation rules:**
- ❌ DO NOT rephrase query mid-trial (Anti #11)
- ❌ DO NOT modify expected_source_paths post-baseline (Anti #12)
- ✅ Version bump v1.0 → v1.1 OK WITH lock of prior version + transparent re-author (AI_INFRA lesson §3.5)

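
The spec does not say how immutability is enforced. One possible approach, sketched here purely as an assumption, is to record a SHA-256 digest of the golden set when the version is locked and compare it before every run. The function names and the idea of storing the digest are hypothetical; only the file path and the 14-entry count come from the spec.

```python
import hashlib
from pathlib import Path

EXPECTED_ENTRIES = 14


def golden_set_fingerprint(path: Path) -> tuple[int, str]:
    """Return (number of non-empty JSONL lines, SHA-256 hex digest of the file)."""
    data = path.read_bytes()
    entries = sum(1 for line in data.splitlines() if line.strip())
    return entries, hashlib.sha256(data).hexdigest()


def assert_unchanged(path: Path, locked_sha256: str) -> None:
    """Raise ValueError if the golden set differs from the locked version."""
    entries, digest = golden_set_fingerprint(path)
    if entries != EXPECTED_ENTRIES:
        raise ValueError(f"expected {EXPECTED_ENTRIES} entries, found {entries}")
    if digest != locked_sha256:
        raise ValueError("golden set changed since lock; bump the version instead")
```

A byte-level digest catches rephrased queries and edited expected paths alike, which covers both DO NOT rules without parsing the entries.
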
---

## Weekly Friday execution

1. Fire the same 14 queries (no modification)
2. Score → recall@5 + avg_top1_rerank
3. Compare vs `eval/trial-state-lock.json` baseline
4. Check chunk_count drift (Qdrant LIVE vs baseline)
5. Update lock file milestone status
6. If recall < gate → apply §15.1 4-cause triage

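
Steps 3, 4 and 6 of the weekly routine amount to two comparisons against the baseline. The sketch below shows one way to express them. Everything beyond the 0.7 gate is an assumption: the spec does not define the lock file schema or a drift tolerance, so the baseline values are passed in as plain arguments and the 10% `drift_tolerance` default is a placeholder to be replaced with the project's own value.

```python
from dataclasses import dataclass


@dataclass
class WeeklyReport:
    recall_delta: float       # current recall minus baseline recall
    chunk_drift_ratio: float  # relative change in chunk_count vs baseline
    needs_triage: bool        # recall fell below the gate
    drift_flagged: bool       # chunk_count moved more than the tolerance


def weekly_check(
    recall: float,
    baseline_recall: float,
    chunk_count: int,
    baseline_chunk_count: int,
    gate: float = 0.7,
    drift_tolerance: float = 0.10,  # hypothetical default, not from the spec
) -> WeeklyReport:
    """Compare this week's numbers against the locked baseline."""
    drift = abs(chunk_count - baseline_chunk_count) / baseline_chunk_count
    return WeeklyReport(
        recall_delta=recall - baseline_recall,
        chunk_drift_ratio=drift,
        needs_triage=recall < gate,
        drift_flagged=drift > drift_tolerance,
    )
```

Reporting the two flags separately keeps a quality regression distinguishable from corpus drift, in line with the quality_gate and drift_monitor split in the lock file.
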