Files
solution-erp/docs/rag-setup-plan.md
pqhuy1987 0a3b747612 [CLAUDE] Docs: chốt Session 21 turn 2 — RAG Hybrid setup planning + Cách A validation
Sau S21 turn 1 chốt cicd-monitor, bro clarify 5 dự án future > 1M MD tokens → discussion deep ~15 turn về RAG infrastructure. Em main solo (no SOLUTION_ERP sub-agent spawn), delegate claude-code-guide × 2 research Anthropic + community practice.

Quyết định chốt:
- Cách A defensive (giữ blanket 120K em main + RAG retrieve supplement)
- Bỏ Cách B aggressive (cắt 60-70% blanket) — vi phạm priority em main control flow strong
- Industry-validated cross 4 Anthropic blog + 5 community tools (Cursor/Continue/Cline/Aider all hybrid)
- 3-layer pattern Phase 1-3 incremental rollout (vector → +BM25 → +reranking, recall ~70% → ~92%)
- Stack: Voyage-3-large + Qdrant local + FastMCP Python + Streamlit dashboard

Multi-agent cost reality clarify (post-S21 t2):
- Em main blanket: ~120K
- 4 sub-agents spawn cumulative: ~400K
- Total billed heavy session: ~560K Cách A vs ~700K lazy
- Saving -20% từ multi-agent shared cache 70-90%
- Anthropic acknowledge 8-10× multiplier multi-agent

Files updated:
- docs/STATUS.md (Last updated S21 turn 2 + Recently Done row top)
- docs/HANDOFF.md (TL;DR Session 21 turn 2 section + Last updated)
- docs/rag-setup-plan.md (+Section 13 multi-agent cost reality + Section 14 3-layer hybrid Phase 1-3, +355 LOC)
- docs/changelog/sessions/2026-05-12-1800-s21-turn2-rag-planning.md (new session log)

Memory user-level update (outside repo, separate update):
- feedback_rag_hybrid_pattern.md (NEW cross-project pattern reusable)
- MEMORY.md index (+1 entry pointer)

Plan I NEW deferred — trigger bro confirm 5 dự án path + stack + pilot + Voyage API + disk cleanup → dedicated session 10-14h weekend (per feedback_drastic_refactor_scope rule).

Stats:
- 17 memory entries (+1 RAG hybrid)
- 1 plan file rag-setup-plan.md (1500 LOC final)
- 4 sub-agents seeds-only unchanged
- 81 test unchanged
- 4 commits S21 cumulative (f1c61c9 + 3a34831 + 1f8e9af + this)

CI skip per path filter (all .md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:50:28 +07:00

1579 lines
54 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# RAG Setup Plan — Cross-project reference
> **Mục đích:** Plan setup Hybrid RAG (Option A) cho project có MD context > 1M tokens. Cross-project applicable — SOLUTION_ERP làm baseline reference, future 2 dự án bro apply pattern này.
> **Last updated:** 2026-05-12 (Session 21 turn 1+)
> **Status:** 📝 Plan saved — chưa implement, target Week 1-4 trial 2 dự án future
> **Owner:** pqhuy1987@gmail.com + Claude (em main + 4 sub-agents)
---
## 📋 Table of Contents
1. [Context + Why](#1-context--why)
2. [Architecture overview](#2-architecture-overview)
3. [BLANKET load list (~100K tokens, 28%)](#3-blanket-load-list)
4. [RAG store list (~254K tokens, 72%)](#4-rag-store-list)
5. [Tool stack recommend](#5-tool-stack-recommend)
6. [Setup scripts (copy-paste ready)](#6-setup-scripts)
7. [Audit procedure (3-tier cadence)](#7-audit-procedure)
8. [Multi-AI client access](#8-multi-ai-client-access)
9. [Timeline rollout (~10-14h dedicated)](#9-timeline-rollout)
10. [Caveats + risks](#10-caveats--risks)
11. [Success metrics + decision gate](#11-success-metrics)
12. [Future enhancements](#12-future-enhancements)
---
## 1. Context + Why
### Problem statement
```
Hiện tại lazy blanket pattern (em main + 4 agents):
- Em main vác ~120K MD upfront (35% project)
- Lazy Read khi cần — em main TỰ ĐOÁN file relevant
- 4 agents mỗi spawn ~188K cache WRITE
- Heavy session ~700K effective billed
- Lost-in-middle threshold đạt sau ~5.75h productive
Scale-up to 2 projects > 1M MD tokens each:
❌ Blanket KHÔNG khả thi (vượt 1M context cap)
❌ Lazy Read recall ~30-60% (em main miss file không nghĩ tới)
❌ 4 agents duplicate Read same files (cumulative ~240K wasted)
❌ Vietnamese-English synonym miss (grep keyword only)
❌ Cross-project context impossible without manual switching
```
### Solution
**Hybrid RAG Option A** — blanket critical + retrieve on-demand:
```
KEEP blanket: ~100K static (core stable + current state + agent + skills + memory critical)
ADD RAG layer: 70% MD remaining accessible via semantic retrieve
SHARE cache: 4 agents reuse retrieved chunks (multi-agent leverage)
```
### Benefits chốt từ analysis sessions trước
| Metric | Lazy current | Option A | Δ |
|---|---|---|---|
| Quality recall | 30-60% | **85%** | **+25-55pp** |
| Heavy session token | 700K | **560K** | -20% |
| Session productive hours | 5.75h | **7.6h** | **+1.85h** |
| Tasks before lost-in-middle | ~23 | **~38** | **+65%** |
| Net successful tasks/session | 25 | **50** | **2×** |
| Multi-agent shared cache | ❌ | **✅ 60-90% cache hit** | leverage real |
| Việt-Anh semantic search | ❌ grep only | **✅ Voyage multilingual** | unlock |
| Scale > 1M MD | ❌ break | **✅ work** | **enable** |
### Trade-off
- ⚠️ Setup cost: ~10-14h dedicated session (1 lần invest)
- ⚠️ Maintenance: ~30 phút/tuần audit
- ⚠️ Beta features (Memory tool, Files API): có thể breaking change
- ⚠️ Retrieval miss risk ~5-10% (mitigated bằng citations + fallback Read)
- ⚠️ Voyage API cost: ~$0.36 initial embed + ~$0.20/tháng delta
---
## 2. Architecture overview
```
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1 — Static blanket (cache hot, 5min-1h TTL) │
├─────────────────────────────────────────────────────────────┤
│ Em main + 4 sub-agents auto-inject ~100K core context: │
│ • rules.md, architecture.md, CLAUDE.md, PROJECT-MAP │
│ • STATUS top 100 line, HANDOFF top 150 line │
│ • 5 agent .md (README + 4 agent identity) │
│ • 5 SKILL.md descriptions (auto-inject) │
│ • 5 memory entries critical cross-cutting │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ LAYER 2 — Vector DB retrieve on-demand │
├─────────────────────────────────────────────────────────────┤
│ Qdrant local (~50MB binary, ~200MB index per project): │
│ • Session logs cumulative (49% MD, biggest) │
│ • Gotchas detail (chunk per entry) │
│ • Archives + Recently Done + Migration-todos │
│ • Flows + Database guides │
│ • SKILL.md detail (description đã trong blanket) │
│ • Memory entries non-critical │
│ • Guides ops conditional │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3 — Embedding service (Voyage AI cloud) │
├─────────────────────────────────────────────────────────────┤
│ voyage-3-large multilingual 26 lang (Việt-Anh tốt): │
│ • Index time: embed chunks → vectors (one-time + delta) │
│ • Query time: embed query → search Qdrant top-K │
│ • Cost: $0.18/M tokens, ~$0.36 init + ~$0.20/month │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ LAYER 4 — MCP retriever server (FastMCP Python) │
├─────────────────────────────────────────────────────────────┤
│ Tool exposed: rag_retrieve(query, scope, k, time_range) │
│ Transport: stdio (Claude Code) hoặc HTTP/SSE (multi-AI) │
│ Auth: API key per client (multi-AI mode) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ LAYER 5 — Multi-AI clients │
├─────────────────────────────────────────────────────────────┤
│ Claude Code (em main + 4 agents) — primary │
│ Claude Desktop — secondary │
│ GPT-4 / Cursor / Continue / Custom agent — optional │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ LAYER 6 — Re-index pipeline │
├─────────────────────────────────────────────────────────────┤
│ Pre-commit hook: delta re-index changed MD │
│ Weekly full re-index: catch missed (Saturday off-peak) │
│ Batch API 50% discount cho mass re-index │
└─────────────────────────────────────────────────────────────┘
```
### Flow time index (1 lần init + delta)
```
1. Walk filesystem → docs/ + .claude/ + memory/
2. Chunk adaptive theo doc_type (custom Python chunker)
3. Batch embed via Voyage API (128 chunks/batch)
4. Upsert Qdrant với metadata (source, doc_type, project, last_modified)
5. Total init: ~10-15 phút cho 1M MD tokens
```
### Flow query time (mỗi spawn em main hoặc agent)
```
1. Em main/agent: rag_retrieve("query keyword", scope, k)
2. MCP server: embed query → Voyage API (~100ms)
3. MCP server: Qdrant search top-K (~50ms local)
4. MCP server: return chunks với metadata + score
5. Total: ~150-200ms per query (network-bound)
6. Cache: subsequent same query → ~10ms (cache hit)
```
---
## 3. BLANKET load list
> **Total: ~100K tokens (28% project MD)**
> Auto-load mỗi spawn em main + 4 agents.
### A. Core stable docs (~30K — KHÔNG đổi thường xuyên)
| File | Token | Lý do blanket |
|---|---:|---|
| `docs/rules.md` | ~7K | Coding conventions stable, mọi task reference |
| `CLAUDE.md` (root pointer) | ~3K | Auto-inject system reminder |
| `docs/CLAUDE.md` | ~3K | Tech stack overview baseline |
| `docs/architecture.md` | ~7K | 4-layer Clean Arch baseline |
| `docs/PROJECT-MAP.md` | ~3K | Bản đồ navigate |
| `docs/workflow-contract.md` | ~4K | State machine 9 phase Contract domain core |
| `docs/forms-spec.md` | ~3K | 8 form catalog domain knowledge |
### B. Current state (~25K — em main biết direct, không cần retrieve)
| File | Strategy | Token |
|---|---|---:|
| `docs/STATUS.md` **top 100 line** | Current phase + In Progress + 1-2 Recently Done top | ~15K |
| `docs/HANDOFF.md` **top 150 line** | Last updated + TL;DR latest session + next priority | ~10K |
**Drop từ blanket:** STATUS Recently Done > 5 row cũ (retrieve nếu cần), HANDOFF TL;DR cũ > 1 tuần.
### C. Agent infrastructure (~25K — agent identity stable)
| File | Token |
|---|---:|
| `.claude/agents/README.md` | ~5K |
| `.claude/agents/investigator.md` | ~3.5K |
| `.claude/agents/implementer.md` | ~4K |
| `.claude/agents/reviewer.md` | ~3.5K |
| `.claude/agents/cicd-monitor.md` | ~5K |
| `.claude/agent-memory/{4 agents}/MEMORY.md` auto-inject 25KB first 200 lines | ~4K total |
### D. Skills descriptions (~5K — auto-inject, không SKILL.md full)
| File | Strategy | Token |
|---|---|---:|
| `.claude/skills/README.md` | Full | ~2.5K |
| 6 SKILL.md descriptions | Auto-inject by Claude Code | ~1K total |
| 6 SKILL.md detail | **KHÔNG blanket** → RAG retrieve khi triggered | — |
### E. Memory user-level critical (~15K)
| File | Token | Lý do critical |
|---|---:|---|
| `project_solution_erp.md` | ~3.5K | Project overview narrative |
| `feedback_md_compact_narrative.md` (§6.5) | ~2K | Rule cốt lõi mọi doc work |
| `feedback_uat_skip_verify.md` | ~2K | Phase 9 current mode rule |
| `feedback_multi_agent_setup.md` | ~3K | 4-agent discipline |
| `feedback_per_chunk_commit.md` | ~2K | Implementer pattern reusable |
| `feedback_audit_reuse_before_clone.md` | ~2K | Investigator natural pattern |
**Drop từ blanket:** 11 memory entries còn lại (retrieve khi pattern triggered).
### TOTAL BLANKET ≈ 100K tokens
---
## 4. RAG store list
> **Total: ~254K tokens (72% project MD)**
> Index vào Qdrant, retrieve on-demand.
### F. Session logs (~150K — biggest, 49% MD)
```
Path: docs/changelog/sessions/*.md (41+ files growing)
Chunk strategy: 1 file = 1 chunk (preserve narrative §6.5)
Metadata:
- session_date: extracted from filename
- phase: extracted from content
- topic: extracted from H1
- commit_sha_range: extracted from "Commits:" line
- doc_type: "session_log"
Scope filter: time_range="last_week|last_month|last_quarter|all"
```
### G. Gotchas (~9K — lookup per debug)
```
Path: docs/gotchas.md (44+ entries)
Chunk strategy: split per "### N. ..." numbered heading
Metadata:
- gotcha_id: integer
- category: extracted from content (tech/EF/Workflow/CICD/Security/...)
- doc_type: "gotcha"
Scope filter: scope="gotcha"
```
### H. Archives + Recently Done (~75K)
| File | Strategy | Token |
|---|---|---:|
| `docs/STATUS.md` rest beyond top 100 | Per H2 section | ~8K |
| `docs/HANDOFF.md` rest beyond top 150 | Per H2 section | ~21K |
| `docs/changelog/migration-todos.md` | Per H3 task | ~18K |
| `docs/changelog/recently-done-archive-*.md` | Per H3 phase | ~6K |
| `docs/_archive/forms-spec-raw.md` | Full file (cold archive) | ~23K |
| `docs/_archive/workflow-raw.md` | Full file (cold archive) | ~4K |
### I. Flows + Database (~17K — conditional task)
| File | Token | Khi retrieve |
|---|---:|---|
| `docs/flows/README.md` | ~1K | Index khi cần flow |
| `docs/flows/auth-flow.md` | ~1K | Task auth |
| `docs/flows/permission-flow.md` | ~1.5K | Task permission |
| `docs/flows/contract-creation-flow.md` | ~1.5K | Task Contract |
| `docs/flows/contract-approval-flow.md` | ~1.5K | Task approval |
| `docs/flows/form-render-flow.md` | ~1K | Task form |
| `docs/flows/sla-expiry-flow.md` | ~1K | Task SLA |
| `docs/database/database-guide.md` | ~3K | Task schema |
| `docs/database/schema-diagram.md` | ~12K | Task ERD |
### J. SKILL.md detail (~40K — retrieve khi skill triggered)
| File | Token |
|---|---:|
| `.claude/skills/contract-workflow/SKILL.md` | ~7K |
| `.claude/skills/form-engine/SKILL.md` | ~5K |
| `.claude/skills/permission-matrix/SKILL.md` | ~5K |
| `.claude/skills/dependency-audit-erp/SKILL.md` | ~5K |
| `.claude/skills/ef-core-migration/SKILL.md` | ~5.5K |
| `.claude/skills/iis-deploy-runbook/SKILL.md` | ~6K |
### K. Guides ops conditional (~10K)
| File | Token | Khi retrieve |
|---|---:|---|
| `docs/guides/deployment-iis.md` | ~2.5K | Task deploy |
| `docs/guides/cicd.md` | ~2K | Task CI/CD |
| `docs/guides/security-checklist.md` | ~2K | Audit security |
| `docs/guides/vps-setup.md` | ~2.5K | Setup VPS |
| `docs/guides/runbook.md` | ~1K | Ops debug |
### L. Memory entries non-critical (~50K — pattern lookup)
```
11 memory entries còn lại (user-level):
- feedback_n_stage_workflow_pattern.md (DEPRECATED post-Mig 21)
- feedback_designtime_runtime_db.md
- feedback_drastic_refactor_scope.md
- feedback_cron_monthly_limitation.md
- feedback_user_manual_style.md
- feedback_node_cicd.md
- feedback_unittest_timing.md
- feedback_responsive_laptop_breakpoint.md
- feedback_service_hook_vs_endpoint.md
- reference_session_prompts.md
- MEMORY.md index
```
### M. Audit logs (~2K, grow)
```
docs/changelog/skill-audit-{YYYY-MM}.md (monthly audit log)
```
### TOTAL RAG STORE ≈ 254K tokens
---
## 5. Tool stack recommend
| Component | Tool | Reason | Cost |
|---|---|---|---|
| **Vector DB** | **Qdrant local** | Rust binary 50MB, no Docker, fast, metadata filtering, admin UI | $0 |
| **Embedding** | **Voyage-3-large** | Anthropic partner, multilingual 26 lang, no GPU needed | $0.18/M (~$0.36 init) |
| **MCP server framework** | **FastMCP Python** | Official Anthropic SDK, ~100 LOC, auto schema | $0 |
| **Chunking** | **Custom Python adaptive** | ~50 LOC, transparent, §6.5 compliant | $0 |
| **Re-index pipeline** | **Pre-commit hook** | Native git, ~10 LOC bash | $0 |
| **Monitoring** | **Qdrant Dashboard + custom audit** | Built-in UI port 6333 | $0 |
| **Auth (multi-AI)** | **Bearer token + rate limit** | Custom middleware ~30 LOC | $0 |
| **Batch re-index** | **Voyage Batch API** | 50% discount cho mass re-embed | -50% |
### Stack rejected + lý do
| Alternative | Reason rejected |
|---|---|
| Chroma vector DB | Python ecosystem, slower than Qdrant Rust |
| pgvector | Cần PostgreSQL setup, overhead |
| OpenAI text-embedding-3-small | Vietnamese quality kém hơn Voyage |
| BGE-M3 local | Cần GPU >= 4GB (Intel Iris Xe không OK) |
| LangChain / LlamaIndex | Heavy abstraction, black-box debug khó, §6.5 chunker không tuân |
| TypeScript MCP SDK | Verbose hơn Python FastMCP |
| Pinecone cloud | Paid + vendor lock, không cần scale đó |
---
## 6. Setup scripts
### 6.1 `requirements.txt`
```text
fastmcp>=2.0
voyageai>=0.3
qdrant-client>=1.12
python-frontmatter>=1.1
```
### 6.2 `scripts/rag-indexer.py` (~120 LOC)
```python
"""
RAG Indexer — Embed MD files + upsert vào Qdrant.
Usage:
python rag-indexer.py # full index
python rag-indexer.py --files "a.md b.md" # delta re-index
"""
import os, glob, re, sys
from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
QDRANT_PATH = "./rag-data/qdrant"
COLLECTION = "project_md" # rename per project
EMBED_MODEL = "voyage-3-large"
DIM = 1024
voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(path=QDRANT_PATH)
def chunk_file(path: str) -> list[dict]:
"""Adaptive chunking theo doc type."""
content = open(path, encoding="utf-8").read()
base = {"source": path, "size_chars": len(content)}
if "/changelog/sessions/" in path:
return [{**base, "content": content, "doc_type": "session_log"}]
if path.endswith("gotchas.md"):
entries = re.split(r"^### (\d+)\.", content, flags=re.M)
return [
{**base, "content": f"### {entries[i]}.{entries[i+1]}",
"doc_type": "gotcha", "entry_id": int(entries[i])}
for i in range(1, len(entries), 2)
]
if "/skills/" in path:
return [{**base, "content": content, "doc_type": "skill"}]
if "/agents/" in path:
return [{**base, "content": content, "doc_type": "agent"}]
if path.endswith("MEMORY.md") or "/memory/" in path:
return [{**base, "content": content, "doc_type": "memory"}]
# Default: split per H2 heading
sections = re.split(r"^## ", content, flags=re.M)
return [
{**base, "content": ("## " + s) if i > 0 else s,
"doc_type": "doc", "section_idx": i}
for i, s in enumerate(sections) if len(s.strip()) > 200
]
def main(files: list[str] | None = None):
# Init collection (idempotent)
if not qdrant.collection_exists(COLLECTION):
qdrant.create_collection(
COLLECTION,
vectors_config=VectorParams(size=DIM, distance=Distance.COSINE)
)
# Determine paths
if files:
paths = files
else:
paths = (
glob.glob("docs/**/*.md", recursive=True) +
glob.glob(".claude/**/*.md", recursive=True)
)
paths = [p for p in paths
if "node_modules" not in p and "_user-guide" not in p]
# Chunk
chunks = []
for path in paths:
try:
chunks.extend(chunk_file(path))
except Exception as e:
print(f"Skip {path}: {e}")
print(f"Chunking: {len(chunks)} chunks from {len(paths)} files")
# Batch embed (Voyage max 128/batch)
texts = [c["content"] for c in chunks]
embeddings = []
for i in range(0, len(texts), 128):
batch = texts[i:i+128]
result = voyage.embed(batch, model=EMBED_MODEL, input_type="document")
embeddings.extend(result.embeddings)
print(f"Embedded {i+len(batch)}/{len(texts)}")
# Upsert (Qdrant auto-replaces by id)
points = [
PointStruct(
id=hash(c["source"] + str(c.get("section_idx", 0))) & 0xFFFFFFFF,
vector=emb,
payload=c
)
for c, emb in zip(chunks, embeddings)
]
qdrant.upsert(collection_name=COLLECTION, points=points)
print(f"Indexed {len(points)} chunks → Qdrant")
if __name__ == "__main__":
files = sys.argv[2].split() if len(sys.argv) > 2 and sys.argv[1] == "--files" else None
main(files)
```
### 6.3 `scripts/rag-mcp-server.py` (~80 LOC)
```python
"""
MCP retriever server — Expose rag_retrieve tool cho Claude Code + agents.
Run: python rag-mcp-server.py (stdio default)
python rag-mcp-server.py --http :7777 (HTTP/SSE for multi-AI)
"""
import os, sys
from fastmcp import FastMCP
from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
mcp = FastMCP("project-rag")
voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(path="./rag-data/qdrant")
COLLECTION = "project_md"
@mcp.tool()
def rag_retrieve(
query: str,
scope: str = "all",
k: int = 5
) -> list[dict]:
"""
Semantic search MD context.
Args:
query: Search query (Vietnamese hoặc English, mix OK)
scope: Filter by doc_type:
"all" | "session_log" | "gotcha" | "memory" |
"skill" | "agent" | "doc"
k: Top chunks to return (1-15, default 5)
Returns:
List[dict] với keys: content, source, doc_type, score
Use cases:
- Historical session log: rag_retrieve("Mig 26 V2", scope="session_log")
- Gotcha lookup: rag_retrieve("silent 403", scope="gotcha")
- Pattern reuse: rag_retrieve("audit clone", scope="memory")
- Cross-section: rag_retrieve("query", scope="all", k=10)
"""
k = min(max(k, 1), 15)
# Embed query
query_vec = voyage.embed(
[query], model="voyage-3-large", input_type="query"
).embeddings[0]
# Filter
filter_dict = None
if scope != "all":
filter_dict = Filter(
must=[FieldCondition(key="doc_type", match=MatchValue(value=scope))]
)
# Search
results = qdrant.search(
collection_name=COLLECTION,
query_vector=query_vec,
query_filter=filter_dict,
limit=k
)
return [
{
"content": r.payload["content"][:3000], # truncate huge
"source": r.payload["source"],
"doc_type": r.payload["doc_type"],
"score": round(r.score, 3)
}
for r in results
]
@mcp.tool()
def rag_stats() -> dict:
"""Return collection stats (for audit)."""
info = qdrant.get_collection(COLLECTION)
return {
"total_chunks": info.points_count,
"vector_dim": info.config.params.vectors.size,
"distance": info.config.params.vectors.distance.value,
"indexed_at": info.optimizer_status,
}
if __name__ == "__main__":
# Default: stdio mode for Claude Code
# HTTP/SSE mode: python rag-mcp-server.py --http :7777
if "--http" in sys.argv:
port = int(sys.argv[sys.argv.index("--http") + 1].lstrip(":"))
mcp.run(transport="sse", port=port)
else:
mcp.run() # stdio default
```
### 6.4 `.claude/settings.json` register
```jsonc
{
"mcpServers": {
"project-rag": {
"command": "python",
"args": ["scripts/rag-mcp-server.py"],
"cwd": "${workspaceFolder}",
"env": {
"VOYAGE_API_KEY": "${env:VOYAGE_API_KEY}"
}
}
}
}
```
### 6.5 Pre-commit hook
```bash
#!/bin/sh
# .git/hooks/pre-commit
# Re-index changed MD files
changed_md=$(git diff --cached --name-only --diff-filter=AMR | grep -E "\.md$")
if [ -n "$changed_md" ]; then
echo "RAG re-indexing $(echo "$changed_md" | wc -l) MD files..."
python scripts/rag-indexer.py --files "$changed_md"
fi
```
### 6.6 Agent .md frontmatter update
```yaml
# Mỗi .claude/agents/{agent}.md thêm tool:
tools: [Read, Grep, Glob, Bash, mcp__project-rag__rag_retrieve, ...]
```
System prompt section thêm:
```markdown
## RAG retriever usage (rag_retrieve tool)
**WHEN to use:**
- Historical session log lookup (> 1 tuần cũ)
- Gotcha pattern matching debug
- Memory pattern reuse "clone X sang Y"
- Cross-section semantic search
**WHEN to use Read instead:**
- Current state (STATUS + HANDOFF top) — blanket loaded
- Active file editing (cần full file)
- Architecture review (stable docs, blanket)
**Query examples:**
- rag_retrieve("silent 403 non-admin", scope="gotcha", k=3)
- rag_retrieve("PE V2 wire pattern", scope="session_log", k=5)
- rag_retrieve("audit reuse clone", scope="memory", k=3)
```
---
## 7. Audit procedure
### 7.1 Weekly quick audit (~30 phút, mỗi Saturday)
**Mục tiêu:** Check health + cost trend hàng tuần.
**Checklist:**
```bash
# 1. Index health
curl http://localhost:6333/collections/project_md
# Verify: points_count tăng + status="green"
# 2. Re-index lag
git log --since="1 week ago" --name-only --pretty=format: | grep -E "\.md$" | sort -u | wc -l
python -c "
from qdrant_client import QdrantClient
q = QdrantClient(path='./rag-data/qdrant')
# Check sources có matching files changed
"
# 3. Voyage cost
# Visit voyageai.com dashboard → check last 7 days usage
# Target: <$1/week steady state
# 4. Random query quality (manual 5 query)
# Sample queries:
# - "Recent Mig" → expect session log top
# - "silent 403" → expect gotcha #44 top
# - "audit reuse" → expect memory entry top
# Score: 1-5 mỗi query (relevant chunks trong top-5)
# 5. Storage size
du -sh ./rag-data/
# Target: <500MB per project
```
**Log:** `docs/changelog/rag-audit-weekly-{YYYY-WW}.md` (1 page)
### 7.2 Monthly deep audit (~2-3h, mỗi đầu tháng)
**Mục tiêu:** Quality benchmark + chunking review + stale cleanup.
**Checklist:**
```python
# 1. Quality benchmark — 30 query test set
test_queries = [
# Categories: state, historical, debug, pattern, cross-stack
("Phase hiện tại", "doc"),
("Mig 26 PE Level Opinions UPSERT", "session_log"),
("silent 403 non-admin Forbidden", "gotcha"),
("audit reuse trước clone B từ A", "memory"),
# ... 30 total covering all scopes
]
results = []
for query, expected_scope in test_queries:
retrieved = rag_retrieve(query, k=10)
# Manual score:
# - Recall: % expected sources trong top-10
# - Precision: % retrieved chunks actually relevant
results.append({"query": query, "recall": ..., "precision": ...})
# Target: avg recall > 80%, precision > 75%
# 2. Chunking review — sample 10 random chunks
# Check: chunks có bị cắt giữa narrative không (vi phạm §6.5)
# Action: tune chunker nếu phát hiện issues
# 3. Stale audit
# Files chưa re-index > 14 days → flag
# Files đã xóa khỏi repo nhưng còn trong Qdrant → cleanup
# 4. Cost trend
# Monthly Voyage spend vs target
# Target: <$3/month steady
# 5. Capacity check
# Total chunks vs disk space projection
# Project có grow size đáng kể (>20% MoM) → plan scale
```
**Log:** `docs/changelog/rag-audit-monthly-{YYYY-MM}.md` (2-3 pages)
### 7.3 Quarterly major audit (~4-6h, mỗi quý)
**Mục tiêu:** Strategic review + major upgrades.
**Checklist:**
1. **Embedding model upgrade decision**
- Voyage có model mới? Test side-by-side với voyage-3-large
- Quality benchmark trên 30 query test set
- Decision: upgrade nếu recall +5pp
2. **Chunking strategy iteration**
- Review 50 random chunks
- Identify patterns: cắt sai, overlap missing, metadata thiếu
- Tune chunker code → re-index full
3. **Collection re-build từ scratch**
- Backup current → drop collection → re-index all
- Mục đích: clean orphan chunks + apply new chunking
- Effort: ~30 phút for 1M MD
4. **Multi-AI client access audit**
- Active clients (Claude Code / Desktop / GPT / Cursor)
- Per-client query volume + token spend
- Security: rotate auth tokens, review rate limits
5. **Cross-project namespace audit** (nếu multi-project)
- Project isolation working correctly?
- Cross-project query intentional vs accidental?
- Adjust metadata filter rules
**Log:** `docs/changelog/rag-audit-quarterly-{YYYY-Q}.md` (5-10 pages)
### 7.4 Trigger-based audit (ad-hoc)
| Trigger | Action |
|---|---|
| Retrieval miss critical (em main báo) | Audit chunk relevant tại sao miss + tune |
| Cost spike >50% MoM | Audit query patterns + rate limit clients |
| Re-index hang >1h | Audit indexer logs + Qdrant health |
| Quality regression em main observe | Spot-check + monthly audit sớm |
| New project added | Setup namespace + initial index audit |
---
## 8. Multi-AI client access
### 8.1 MCP protocol — agnostic
MCP (Model Context Protocol) là **standard protocol**. Bất kỳ AI client nào support MCP đều consume cùng 1 server:
```
Qdrant (single source)
MCP server :7777 (HTTP/SSE)
↙ ↓ ↓ ↘
Claude Code Claude Cursor GPT-4 +
Desktop IDE custom adapter
```
### 8.2 Transport modes
| Mode | Use case | Setup |
|---|---|---|
| **stdio** | Single client (Claude Code local) — default | `python rag-mcp-server.py` |
| **HTTP/SSE** | Multi-client (network access) | `python rag-mcp-server.py --http :7777` |
| **WebSocket** | Bi-directional (rare) | Custom config |
### 8.3 Setup multi-AI mode
**Step 1: Run MCP server HTTP mode**
```bash
# Terminal 1: MCP server (keep running)
export VOYAGE_API_KEY="pa-xxxx"
python scripts/rag-mcp-server.py --http :7777
# Server endpoint: http://localhost:7777/sse
```
**Step 2: Add auth middleware (recommend cho multi-client)**
```python
# Update rag-mcp-server.py
from fastmcp import FastMCP
from fastmcp.middleware import bearer_auth
ALLOWED_TOKENS = {
"claude-code-token": "claude-code-primary",
"gpt4-token": "gpt4-cursor-integration",
"custom-agent-token": "custom-research-agent",
}
mcp = FastMCP("project-rag", middleware=[
bearer_auth(tokens=ALLOWED_TOKENS, rate_limit_per_minute=30)
])
```
**Step 3: Register per-client config**
#### Claude Code (em main + 4 agents)
```jsonc
// .claude/settings.json
{
"mcpServers": {
"project-rag": {
"transport": "sse",
"url": "http://localhost:7777/sse",
"headers": {
"Authorization": "Bearer claude-code-token"
}
}
}
}
```
#### Claude Desktop
```jsonc
// claude_desktop_config.json
{
"mcpServers": {
"project-rag": {
"transport": "sse",
"url": "http://localhost:7777/sse",
"headers": {
"Authorization": "Bearer claude-desktop-token"
}
}
}
}
```
#### Cursor IDE
```jsonc
// .cursor/settings.json
{
"mcp.servers": {
"project-rag": {
"transport": "sse",
"url": "http://localhost:7777/sse"
}
}
}
```
#### GPT-4 via custom adapter
```python
# Use OpenAI Assistants API + custom function calling
import requests
def query_project_rag(query: str, scope: str = "all", k: int = 5):
response = requests.post(
"http://localhost:7777/tool/rag_retrieve",
headers={"Authorization": "Bearer gpt4-token"},
json={"query": query, "scope": scope, "k": k}
)
return response.json()
# Register as OpenAI function tool
```
#### Continue.dev / custom agent
```yaml
# config.yaml
mcp_servers:
- name: project-rag
transport: sse
url: http://localhost:7777/sse
auth_token: custom-agent-token
```
### 8.4 Security model multi-AI
| Concern | Mitigation |
|---|---|
| Token leak | Rotate quarterly, store in env vars |
| Rate limit abuse | 30 req/min/token default, tune per client |
| Read-only enforcement | MCP server expose only `rag_retrieve` + `rag_stats` (no write tools) |
| Audit log | Log every query: timestamp + client_token + query + result_count |
| Cross-project leak | Per-collection access control (future enhancement) |
### 8.5 Cost considerations multi-AI
```
Single Claude Code client (current):
Voyage cost: ~$0.20/month (low query volume)
Qdrant: free local
4 AI clients heavy use (Claude Code + Desktop + Cursor + GPT-4):
Voyage cost: ~$2-5/month (higher query volume)
Network bandwidth: minimal (~100KB/query response)
CPU: Qdrant + Voyage embed call ~100ms total
→ Multi-AI access scale linearly với query volume, not infrastructure cost.
```
### 8.6 Recommend rollout
```
Phase 1 (Week 1-4): Single client (Claude Code only)
→ Validate quality + cost baseline
Phase 2 (Month 2+): Add Claude Desktop nếu cần mobile/casual access
→ Same auth, share collection
Phase 3 (Month 3+): Add Cursor IDE nếu work multi-IDE
→ Verify no cross-tool conflicts
Phase 4 (Future): GPT-4 / custom agent integration nếu cần
→ Custom adapter + auth strict
```
---
## 9. Timeline rollout
### Hour-by-hour breakdown (~10-14h dedicated session)
| Hour | Task | Effort |
|---|---|---|
| **1-2** | Setup pre-flight: disk cleanup + Voyage signup + Python deps install | ~2h |
| **3-4** | Write `scripts/rag-indexer.py` + run initial embed | ~2h |
| **5** | Verify Qdrant collection + manual query sanity check | ~1h |
| **6-7** | Write `scripts/rag-mcp-server.py` + register `.claude/settings.json` | ~2h |
| **8** | Test rag_retrieve qua Claude Code (em main solo) | ~1h |
| **9-10** | Update 4 agent .md frontmatter + system prompt sections | ~2h |
| **11** | Setup pre-commit hook + audit logging | ~1h |
| **12-14** | Buffer + trial 10-15 query measure quality + cost | ~3h |
### Trial 4-week plan
```
Week 1: Pilot single project (smaller of 2)
- Day 1-2: Setup + initial index
- Day 3-7: Active use + measure baseline metrics
- Deliverable: rag-audit-weekly-W1.md
Week 2: Roll out 2nd project
- Day 1: Setup separate Qdrant collection
- Day 2-7: Dual-project use measure
- Deliverable: rag-audit-weekly-W2.md
Week 3: 4-agent integration
- Day 1-2: Update 4 agent .md với rag_retrieve tool
- Day 3-7: Multi-agent task measure shared cache benefit
- Deliverable: rag-audit-weekly-W3.md
Week 4: Decision gate (keep / tune / upgrade B / rollback)
- Day 1-2: Compile metrics
- Day 3: Decision meeting (bro + em main)
- Day 4-7: Apply decision (tune embedding/chunking OR upgrade Option B OR rollback)
- Deliverable: rag-audit-monthly-M1.md + decision doc
```
### Decision gate Week 4
```
PASS criteria (continue + tune):
✅ Quality recall > 80% on 30 query benchmark
✅ Cost < $5/month total (Voyage + storage)
✅ Session lifespan tăng > 30% (heavy session)
✅ Multi-agent shared cache hit > 60%
✅ Retrieval miss critical < 10% queries
✅ Storage < 1GB per project
TUNE criteria (continue + adjust):
⚠️ Quality 70-80% → tune chunking or upgrade embedding
⚠️ Cost 5-10/mo → audit query patterns, reduce k
⚠️ Session lifespan tăng < 30% → audit blanket effectiveness
ROLLBACK criteria (archive RAG):
❌ Quality < 70%
❌ Cost > $10/mo recurring
❌ Session lifespan KHÔNG tăng or giảm
❌ Em main complain "miss context" thường xuyên
❌ Storage > 5GB per project
```
---
## 10. Caveats + risks
### 10.1 Beta features risk
| Feature | Status | Mitigation |
|---|---|---|
| Anthropic Memory tool | Beta `content-management-2025-06-27` | Defer until GA, use MEMORY.md current |
| Anthropic Files API | Beta `files-api-2025-04-14` | Optional add-on, RAG primary |
| Extended 1h prompt cache | Beta `extended-cache-ttl-2025-04-11` | Use 5min default, opt-in 1h khi heavy session |
| Voyage AI API | Stable | Production OK |
| Qdrant local | Stable | Production OK |
| FastMCP | Stable v2+ | Production OK |
### 10.2 Storage concerns
```
Bro hiện tại: 911/954 GB used = 96% full (43GB free)
RAG storage budget:
Qdrant binary: ~50MB
Per project index: ~200-500MB (depend MD volume)
Backup snapshots: ~500MB
Logs + audit: ~100MB
Per project total: ~1GB
2 projects total: ~2GB
+ buffer 1GB
= 3GB recommend free space
→ Cleanup TRƯỚC setup: target 5GB+ free
```
**Cleanup priorities:**
- `node_modules` projects cũ
- `.NET bin/obj` artifacts
- Docker images (`docker system prune -a`)
- Browser caches (Chrome/Edge ~5GB common)
- `%LOCALAPPDATA%` caches (NuGet, dotnet)
- Downloads / Videos không dùng
### 10.3 Quality monitoring
| Risk | Indicator | Action |
|---|---|---|
| Chunking break narrative | Em main report "miss context" | Review chunk strategy, tune |
| Embedding drift | Recall drop > 10pp benchmark | Re-embed full, check Voyage updates |
| Stale index | Files commit chưa re-index | Force re-index full, check hook |
| Query phrasing kém | Low precision on simple queries | Em main refine query patterns |
| Cross-language mismatch | Vietnamese query miss English content | Multilingual reranker hoặc query expansion |
### 10.4 Fallback strategy
```
Khi RAG fail / quality drop:
Layer 1: Em main fallback to Read full file (existing lazy pattern still works)
Layer 2: Em main blanket load critical file directly
Layer 3: Rollback Qdrant snapshot (weekly backup)
Layer 4: Full re-index từ scratch (~15 phút)
Layer 5: Archive RAG, return lazy current pattern (ultimate fallback)
```
Em main blanket 120K KHÔNG bị mất khi RAG fail → graceful degradation.
### 10.5 Vietnamese-English mix considerations
```
Voyage-3-large multilingual claim 26 lang coverage.
Vietnamese explicit benchmark KHÔNG public.
Risk: technical jargon Việt-Anh mix có thể miss synonym.
Ví dụ: "im lặng 403" vs "silent 403" — vector có gần nhau không?
Mitigation:
- Test 10-20 Việt-Anh mix queries trong audit benchmark
- Nếu recall low → consider voyage-multilingual-2 backup
- Hoặc add query expansion (Anthropic Contextual Retrieval pattern)
```
---
## 11. Success metrics
### 11.1 Quality metrics
| Metric | Target | Measurement |
|---|---:|---|
| Recall avg (30 query benchmark) | > 80% | Manual score weekly |
| Precision avg | > 75% | Manual score weekly |
| Retrieval miss critical rate | < 10% | Em main report cumulative |
| Cross-language query recall | > 70% | Việt-Anh mix benchmark |
### 11.2 Cost metrics
| Metric | Target | Measurement |
|---|---:|---|
| Voyage monthly spend | < $5 | Voyage dashboard |
| Total RAG infra cost | < $10/month | Sum tools |
| Cost per query | < $0.001 | Calculated |
| Disk usage per project | < 1GB | `du -sh` |
### 11.3 Performance metrics
| Metric | Target | Measurement |
|---|---:|---|
| Query latency (P50) | < 200ms | MCP server log |
| Query latency (P99) | < 500ms | MCP server log |
| Re-index lag (post-commit) | < 30s | Pre-commit hook timing |
| Cache hit rate (multi-agent) | > 60% | Custom metric |
### 11.4 Capacity metrics
| Metric | Target | Measurement |
|---|---:|---|
| Session lifespan productive | +50% vs lazy | Time tracker |
| Tasks before lost-in-middle | > 35 | Task counter |
| Heavy session token | -20% vs lazy | Anthropic dashboard |
| Multi-agent overlap saving | > 50K/session | Cumulative calc |
### 11.5 Multi-AI client metrics
| Metric | Target | Measurement |
|---|---:|---|
| Active clients | ≥ 1 stable | Audit log |
| Per-client query volume | Track baseline | Audit log per client |
| Cross-client conflict | 0 | Bug reports |
---
## 12. Future enhancements
### 12.1 Phase 2 (after Week 4 validation)
| Enhancement | Effort | Benefit |
|---|---|---|
| Upgrade Option B (drop blanket 30-40K) | 1 session | Saving +15% tokens |
| Anthropic Memory tool integration | 2-3h | Native cross-conversation memory |
| Files API integration | 2-3h | Reduce blanket re-upload cost |
| Citations enable | 1h | RAG quality trace |
### 12.2 Phase 3 (Month 2-3)
| Enhancement | Effort | Benefit |
|---|---|---|
| Hybrid BM25 + vector search (Contextual Retrieval) | 4-6h | +49-67% recall (Anthropic doc) |
| Multi-project namespace | 2-3h | Cross-project query với strict isolation |
| Reranker model (Cohere rerank-3) | 2-3h | +10-20% precision |
| Custom Streamlit audit dashboard | 4-5h | Visual quality monitoring |
### 12.3 Phase 4 (Quarter 2+)
| Enhancement | Effort | Benefit |
|---|---|---|
| Replace Voyage với Anthropic native embedding (if GA) | 2-3h | Reduce vendor count |
| Auto-tuning chunking (LLM-aided) | 1 week | Quality+ |
| Federated multi-machine setup | 1 week | Team usage |
| Time-series analytics on retrieval patterns | 1 week | Insights |
### 12.4 Defer indefinitely (over-engineering)
- ❌ LangChain / LlamaIndex framework (heavy abstraction)
- ❌ Self-host LLM (cost > value)
- ❌ Custom embedding model fine-tuning (effort > value)
- ❌ Full text + vector hybrid index (use Voyage Reranker instead)
---
## 13. Multi-agent cumulative cost reality (Anthropic 8-10× warning)
> **Added S21 turn 2 (2026-05-12)** — clarification sau khi user catch gap "120K blanket KHÔNG bao gồm 4 agents".
### Per-entity blanket breakdown
```
Em main blanket: ~120K
STATUS + HANDOFF top + rules + architecture + 5 agent .md +
4 MEMORY.md auto-inject + skills desc + memory critical +
auto-inject system reminders
Per sub-agent spawn baseline: ~80-100K each
Agent system prompt (~5K) +
3 skills preload SKILL.md full (~21K, trigger semantic) +
Auto-inject MEMORY.md 25KB first 200 lines (~7K) +
Em main pass spec task (~10-15K) +
Em main paste common context excerpt (~30-50K) +
Auto-inject project context (~10K)
= ~80-100K per sub-agent spawn (per Anthropic docs)
4 sub-agents cumulative: ~400K
(4 × ~100K each, isolated context windows)
TOTAL cumulative blanket 5 entities: ~520K
Em main + 4 sub-agents combined (isolated windows, cumulative billing)
```
### Context windows are ISOLATED
```
KHÔNG phải 5 entities share 520K trong 1 context window 1M.
Mỗi entity có context window 1M RIÊNG:
Em main → context window 1M, dùng ~120K
Investigator → context window 1M, dùng ~100K
Implementer → context window 1M, dùng ~100K
Reviewer → context window 1M, dùng ~100K
CICD Monitor → context window 1M, dùng ~100K
→ Mỗi entity LOST-IN-MIDDLE threshold riêng (~700K each)
→ Mỗi entity capacity ~58 tasks before hit hard cap riêng
NHƯNG billing là CUMULATIVE 520K across all contexts:
Anthropic billing tổng tokens across all 5 windows
→ Hit weekly cap nhanh hơn solo em main 4-5×
```
### Heavy session token compound effect (Cách A vs lazy)
**Without RAG (lazy current — 4 agents spawn):**
```
Em main:
Blanket: 120K
Lazy Read on-demand: ~50K
Reasoning + coordinate: ~30K
= ~200K subtotal
4 sub-agents (each):
Spawn blanket: ~100K
Lazy Read inside agent: ~50K
Reasoning + work: ~30K
Each agent: ~180K
──────────────
4 agents subtotal: ~720K cumulative
SendMessage iteration:
10 round trips × ~30K nominal: 300K nominal
Cache hit 70%: ~90K effective
TOTAL HEAVY SESSION (lazy):
200K + 720K + 90K = ~1010K nominal
After cache discount: ~700K effective billed
```
**With Cách A RAG:**
```
Em main:
Blanket: 120K (unchanged)
RAG retrieve replace lazy Read: ~30K (-20K saving)
Reasoning streamlined: ~25K
= ~175K subtotal (saving 25K)
4 sub-agents (each):
Spawn blanket: ~100K (unchanged)
RAG retrieve (share cache 70-90% common queries): ~15K
Reasoning streamlined: ~25K
Each agent: ~140K (saving 40K each)
──────────────
4 agents subtotal: ~560K (saving 160K total)
SendMessage iteration: ~90K effective (unchanged)
TOTAL HEAVY SESSION (Cách A):
175K + 560K + 90K = ~825K nominal
After cache discount: ~560K effective billed
SAVING: -140K (-20%)
```
### Cost saving breakdown
| Component | Lazy current | Cách A | Saving |
|---|---:|---:|---:|
| Em main blanket (fixed) | 120K | 120K | 0 |
| Em main lazy Read → RAG retrieve | 50K | 30K | -20K |
| Em main reasoning streamlined | 30K | 25K | -5K |
| 4 agents spawn blanket (fixed) | 400K | 400K | 0 |
| 4 agents lazy Read → cached retrieve | 200K | 60K | **-140K** |
| 4 agents reasoning | 120K | 100K | -20K |
| SendMessage cached | 90K | 90K | 0 |
| **TOTAL EFFECTIVE BILLED** | **~700K** | **~560K** | **-140K (-20%)** |
**Saving 80% từ 4 agents** share retrieve cache (cache hit 70-90% common queries cross-agent).
→ Em main saving chỉ 25K (blanket unchanged, chỉ optimize Read → retrieve).
### Multi-agent leverage example concrete
```
Task Plan B Contract V2 wire:
🔵 Inv query "PE V2 schema pattern" → 15K retrieve + cached
🟡 Imp query same → cache hit 90% → 1.5K effective
🔴 Rev query same → cache hit 90% → 1.5K effective
🟢 CICD query same → cache hit 90% → 1.5K effective
Em main query same → cache hit 90% → 1.5K effective
Cumulative retrieve cost: 15K + 4×1.5K = 21K
Compare to lazy:
Each agent Read PE V2 file separately
5 entities × 20K Read = 100K cumulative
→ Saving 79K just for 1 cross-agent query
```
### Optimization tips để giảm cumulative
**Option 1: Spawn ít agents hơn**
- Decision gate 6-criteria mỗi task (per `feedback_multi_agent_setup` rule)
- Solo em main đủ → KHÔNG spawn agent
- Chỉ spawn agent nào THẬT cần
- Trong S20-S21: 4 agents seeds-only, em chưa spawn lần nào → cost ~120K em main thôi
**Option 2: Tune blanket sub-agent (100K → 80K)**
- Em main pass spec gọn (~10K thay 15K)
- Em main paste common context excerpt thay full (~20K thay 50K)
- Skills preload chỉ description (~3K thay 21K full SKILL.md)
→ Trigger SKILL.md full khi semantic match
- Per sub-agent: 100K → 80K
- 4 agents cumulative: 400K → 320K
- Heavy session: 560K → 480K (-15%)
**Option 3: SendMessage cache aggressive (1h TTL beta)**
- Anthropic extended cache `extended-cache-ttl-2025-04-11`
- Static prompts cache premium WRITE 2× base
- Subsequent reads 0.1× discount
- Multi-agent cùng cache prefix → benefit lớn
- Saving 10-15% additional
---
## 14. 3-layer hybrid RAG upgrade path (Anthropic Contextual Retrieval)
> **Added S21 turn 2 (2026-05-12)** — Anthropic flagship pattern Sept 2024.
### Pattern overview
```
Anthropic Contextual Retrieval = 3 layers compound:
Layer 1: Embeddings (Voyage-3-large)
→ Semantic + synonym + multilingual catch
+ Contextual prefix (Haiku-generated context):
Add chunk-specific context BEFORE embed
"This chunk discusses... in context of..."
→ Better recall via enriched vector
Layer 2: BM25 (bm25s Python lib free local)
→ Exact identifier + technical terms (function names, error codes, Mig numbers)
+ Contextual BM25 (same prefix pattern)
Layer 3: Reranking (Voyage rerank-2)
→ Cross-attention deep relevance
→ Re-score top 30 candidates → return top 5 truly relevant
```
### Performance compound effect
```
Baseline (naive vector embeddings): ~50% recall
+ Contextual embeddings: ~67% recall (-35% failure)
+ Hybrid Contextual + BM25: ~75% recall (-49% failure)
+ Reranking: ~85% recall (-67% failure)
```
📎 Source: [Anthropic Contextual Retrieval Sept 2024](https://www.anthropic.com/news/contextual-retrieval)
### Phase rollout incremental (recommend cho bro)
| Phase | Setup | Recall | Cost/month | Effort additional |
|---|---|---:|---:|---|
| **Phase 1** (Week 1-4) | Layer 1 vector only (Voyage-3-large) | ~70% | ~$1.50 | 10-14h initial |
| **Phase 2** (Month 2) | + Layer 2 BM25 (bm25s free local) | ~78% | ~$1.50 unchanged | 2-3h |
| **Phase 3** (Month 3) | + Layer 3 Voyage rerank-2 + Contextual prefix | ~92% | ~$4-5 | 3-4h |
### Phase 1 implementation (basic vector RAG)
Đã cover trong Section 5-6 plan. Bro implement Week 1-4 trial pilot.
### Phase 2 upgrade — Add BM25 hybrid
```python
# scripts/rag-mcp-server.py — upgrade
from bm25s import BM25
bm25 = BM25.load("./rag-data/bm25_index") # pre-built
@mcp.tool()
def rag_retrieve_hybrid(query, scope="all", k=5):
# Step 1: Vector search
query_vec = voyage.embed([query], model="voyage-3-large").embeddings[0]
vector_results = qdrant.search(COLLECTION, query_vec, limit=20)
# Step 2: BM25 search (local Python lib)
bm25_results = bm25.retrieve(query, k=20)
# Step 3: Merge + dedup
candidates = merge_dedup(vector_results, bm25_results) # ~30 chunks
# Step 4: Score combine (RRF reciprocal rank fusion)
final_scores = reciprocal_rank_fusion(vector_results, bm25_results)
return final_scores[:k]
```
### Phase 3 upgrade — Full Anthropic Contextual
```python
# scripts/rag-indexer.py — upgrade với contextual prefix
import anthropic
claude_haiku = anthropic.Anthropic()
def contextualize_chunk(chunk_content, full_doc_path):
"""Generate context prefix using Claude Haiku (cheap model)."""
full_doc = open(full_doc_path).read()
response = claude_haiku.messages.create(
model="claude-haiku-4-5", # cheap ~$0.0001/chunk
max_tokens=150,
messages=[{
"role": "user",
"content": f"""<document>
{full_doc[:5000]}
</document>
<chunk>
{chunk_content}
</chunk>
Give a brief context (50-100 words) explaining what this chunk is about and where it fits in the document. Be specific."""
}]
)
return response.content[0].text
# In indexer pipeline:
for chunk in chunks:
context = contextualize_chunk(chunk["content"], chunk["source"])
chunk["content_enriched"] = f"{context}\n\n{chunk['content']}"
# Embed enriched version → better recall
```
```python
# scripts/rag-mcp-server.py — final upgrade với reranking
import voyageai
@mcp.tool()
def rag_retrieve_full(query, scope="all", k=5):
# Step 1-3: Same as Phase 2 (vector + BM25 + merge)
candidates = hybrid_search(query, scope, top=30)
# Step 4: Voyage Rerank
rerank_response = voyage.rerank(
query=query,
documents=[c.content for c in candidates],
model="voyage-rerank-2", # ~$0.05 per 1000 queries
top_k=k
)
return [candidates[r.index] for r in rerank_response.results]
```
### Cost incremental analysis
```
Phase 1 → Phase 3 incremental cost:
Phase 1 (basic vector):
Voyage embed: ~$0.36 initial + ~$0.20/mo delta
= ~$1.50/mo total
Phase 2 (+BM25):
BM25 free local (Python lib)
Embedding cost same
= ~$1.50/mo total (unchanged)
Phase 3 (+Reranking + Contextual):
Voyage rerank-2: ~$0.05 per 1000 queries
600 queries/mo × $0.05/1K = $0.03/mo
Haiku contextual prefix: ~$0.0001 per chunk
Initial 5000 chunks × $0.0001 = $0.50 one-time
Delta ~100 chunks/mo × $0.0001 = $0.01/mo
+ Voyage rerank monthly: ~$0.05/mo per 1K queries × 5 projects
+ Re-embed enriched chunks: ~$0.50/mo
= ~$4-5/mo total
→ Quality jump 70% → 92% recall = +22pp
→ Cost jump $1.50 → $4-5/mo = +$3
→ Worth it after Phase 1 validation
```
### Why incremental rollout (vs all-in Phase 3 immediate)
1. **Validate Layer 1 quality first** — nếu Voyage Vietnamese kém → upgrade Phase 2-3 vô ích
2. **Measure baseline cost** — biết exact Voyage spend trước add rerank/contextual
3. **Identify retrieval miss patterns** — Phase 1 trial reveal weakness → target Phase 2-3 fix
4. **Risk-averse setup** — mỗi phase 2-3h add, rollback dễ nếu fail
5. **§6.5 narrative preserve** — KHÔNG over-engineer, build incremental
### When to skip Phase 2-3
- Phase 1 recall already > 85% → Phase 2-3 marginal benefit (Vietnamese-specific corpus)
- Cost monthly < $5 budget → stay Phase 1 OK
- Solo dev no Vietnamese exact terms heavy → BM25 less impactful
### When to MUST upgrade Phase 2-3
- Recall < 70% on benchmark indicate Phase 1 insufficient
- Em main report "miss exact identifier" frequently Phase 2 BM25 critical
- Multi-language queries common Phase 3 reranker stabilize
- Production quality target > 90% → Phase 3 required
---
## 📚 References + tools
### Anthropic official
- [Memory tool docs](https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool)
- [Prompt caching guide](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- [Files API](https://platform.claude.com/docs/en/build-with-claude/files)
- [Contextual Retrieval cookbook](https://platform.claude.com/cookbook/capabilities-contextual-embeddings-guide)
- [Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [Agent SDK overview](https://code.claude.com/docs/en/agent-sdk/overview)
### Tools docs
- [Qdrant docs](https://qdrant.tech/documentation/)
- [Voyage AI pricing](https://docs.voyageai.com/docs/pricing)
- [FastMCP](https://github.com/jlowin/fastmcp)
- [MCP servers list](https://github.com/modelcontextprotocol/servers)
### Project memory
- `feedback_md_compact_narrative.md` (§6.5 rule — KEEP narrative)
- `feedback_multi_agent_setup.md` (4-agent discipline)
- `feedback_drastic_refactor_scope.md` (RAG setup = dedicated session)
- `feedback_uat_skip_verify.md` (Phase 9 UAT mode)
---
## ✅ Pre-implementation checklist
```
☐ Bro confirm 3 thông tin:
☐ 2 dự án path (để Investigator audit MD inventory pre-flight)
☐ Stack 2 dự án (BE: .NET/Node/Python? FE: React/Vue?)
☐ Pilot project chọn (smaller in 2)
☐ Bro prepare environment:
☐ Disk cleanup 5GB+ free (current 911/954 = 96% full)
☐ Voyage AI account signup + API key
☐ Python 3.10+ installed
☐ Git installed (cho pre-commit hook)
☐ Bro schedule dedicated session:
☐ 10-14h block 1 ngày cuối tuần (memory feedback_drastic_refactor_scope rule)
☐ Reserve weekly cap ~30% cho RAG setup spawn cost
☐ Bro review plan:
☐ Read full this file
☐ Confirm scope blanket vs RAG store match needs
☐ Confirm tool stack acceptable
☐ Approve Week 1-4 trial timeline
```
---
## 📝 Notes — keep updated
- **2026-05-12 turn 1:** Plan saved sau S21 turn 1 chốt cicd-monitor. Cross-project reference cho 2 dự án future bro > 1M MD. SOLUTION_ERP baseline ~354K MD (chưa cần RAG, defer).
- **Status:** 📝 PLAN ONLY — chưa implement
- **Next trigger:** Bro confirm 3 thông tin → spawn 🔵 Investigator audit MD inventory 2 dự án → tinh chỉnh blanket list cho từng project