# RAG Setup Plan: Cross-project reference

> **Purpose:** Plan for setting up Hybrid RAG (Option A) for projects whose MD context exceeds 1M tokens. Applicable across projects: SOLUTION_ERP is the baseline reference, and the owner's 2 future projects apply this pattern.
> **Last updated:** 2026-05-12 (Session 21 turn 1+)
> **Status:** 📝 Plan saved, not yet implemented; target is a Week 1-4 trial on the 2 future projects
> **Owner:** pqhuy1987@gmail.com + Claude (main agent + 4 sub-agents)

---

## 📋 Table of Contents

1. [Context + Why](#1-context--why)
2. [Architecture overview](#2-architecture-overview)
3. [BLANKET load list (~100K tokens, 28%)](#3-blanket-load-list)
4. [RAG store list (~254K tokens, 72%)](#4-rag-store-list)
5. [Tool stack recommend](#5-tool-stack-recommend)
6. [Setup scripts (copy-paste ready)](#6-setup-scripts)
7. [Audit procedure (3-tier cadence)](#7-audit-procedure)
8. [Multi-AI client access](#8-multi-ai-client-access)
9. [Timeline rollout (~10-14h dedicated)](#9-timeline-rollout)
10. [Caveats + risks](#10-caveats--risks)
11. [Success metrics + decision gate](#11-success-metrics)
12. [Future enhancements](#12-future-enhancements)

---

## 1. Context + Why

### Problem statement

```
Current lazy blanket pattern (main agent + 4 agents):
- Main agent carries ~120K MD upfront (35% of project)
- Lazy Read when needed: the main agent GUESSES which files are relevant
- 4 agents each write ~188K to cache per spawn
- Heavy session ~700K effective billed
- Lost-in-middle threshold reached after ~5.75h productive

Scale-up to 2 projects > 1M MD tokens each:
❌ Blanket is NOT feasible (exceeds the 1M context cap)
❌ Lazy Read recall ~30-60% (main agent misses files it does not think of)
❌ 4 agents duplicate Reads of the same files (cumulative ~240K wasted)
❌ Vietnamese-English synonym misses (grep keyword only)
❌ Cross-project context impossible without manual switching
```

### Solution

**Hybrid RAG Option A:** blanket the critical docs + retrieve the rest on demand:

```
KEEP blanket:   ~100K static (core stable + current state + agent + skills + memory critical)
ADD RAG layer:  remaining 70% of MD accessible via semantic retrieval
SHARE cache:    4 agents reuse retrieved chunks (multi-agent leverage)
```

### Benefits (estimates carried over from earlier analysis sessions)

| Metric | Lazy current | Option A | Δ |
|---|---|---|---|
| Quality recall | 30-60% | **85%** | **+25-55pp** |
| Heavy session token | 700K | **560K** | -20% |
| Session productive hours | 5.75h | **7.6h** | **+1.85h** |
| Tasks before lost-in-middle | ~23 | **~38** | **+65%** |
| Net successful tasks/session | 25 | **50** | **2×** |
| Multi-agent shared cache | ❌ | **✅ 60-90% cache hit** | real leverage |
| Vietnamese-English semantic search | ❌ grep only | **✅ Voyage multilingual** | unlock |
| Scale > 1M MD | ❌ breaks | **✅ works** | **enable** |

### Trade-off

- ⚠️ Setup cost: ~10-14h dedicated session (one-time investment)
- ⚠️ Maintenance: ~30 min/week audit
- ⚠️ Beta features (Memory tool, Files API): breaking changes possible
- ⚠️ Retrieval miss risk ~5-10% (mitigated by citations + fallback Read)
- ⚠️ Voyage API cost: ~$0.36 initial embed + ~$0.20/month delta

---

## 2. Architecture overview

```
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1: Static blanket (cache hot, 5min-1h TTL)            │
├─────────────────────────────────────────────────────────────┤
│ Main agent + 4 sub-agents auto-inject ~100K core context:   │
│ • rules.md, architecture.md, CLAUDE.md, PROJECT-MAP         │
│ • STATUS top 100 lines, HANDOFF top 150 lines               │
│ • 5 agent .md (README + 4 agent identity)                   │
│ • 6 SKILL.md descriptions (auto-inject)                     │
│ • 6 critical cross-cutting memory entries                   │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ LAYER 2: Vector DB retrieve on-demand                       │
├─────────────────────────────────────────────────────────────┤
│ Qdrant local (~50MB binary, ~200MB index per project):      │
│ • Session logs cumulative (49% MD, biggest)                 │
│ • Gotchas detail (chunk per entry)                          │
│ • Archives + Recently Done + Migration-todos                │
│ • Flows + Database guides                                   │
│ • SKILL.md detail (descriptions already in blanket)         │
│ • Memory entries non-critical                               │
│ • Guides ops conditional                                    │
└─────────────────────────────────────────────────────────────┘
                              ↑
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3: Embedding service (Voyage AI cloud)                │
├─────────────────────────────────────────────────────────────┤
│ voyage-3-large multilingual 26 lang (good for Vi-En):       │
│ • Index time: embed chunks → vectors (one-time + delta)     │
│ • Query time: embed query → search Qdrant top-K             │
│ • Cost: $0.18/M tokens, ~$0.36 init + ~$0.20/month          │
└─────────────────────────────────────────────────────────────┘
                              ↕
┌─────────────────────────────────────────────────────────────┐
│ LAYER 4: MCP retriever server (FastMCP Python)              │
├─────────────────────────────────────────────────────────────┤
│ Tool exposed: rag_retrieve(query, scope, k, time_range)     │
│ Transport: stdio (Claude Code) or HTTP/SSE (multi-AI)       │
│ Auth: API key per client (multi-AI mode)                    │
└─────────────────────────────────────────────────────────────┘
                              ↕
┌─────────────────────────────────────────────────────────────┐
│ LAYER 5: Multi-AI clients                                   │
├─────────────────────────────────────────────────────────────┤
│ Claude Code (main agent + 4 agents) - primary               │
│ Claude Desktop - secondary                                  │
│ GPT-4 / Cursor / Continue / Custom agent - optional         │
└─────────────────────────────────────────────────────────────┘
                              ↑
┌─────────────────────────────────────────────────────────────┐
│ LAYER 6: Re-index pipeline                                  │
├─────────────────────────────────────────────────────────────┤
│ Pre-commit hook: delta re-index changed MD                  │
│ Weekly full re-index: catch missed (Saturday off-peak)      │
│ Batch API 50% discount for mass re-index                    │
└─────────────────────────────────────────────────────────────┘
```

### Index-time flow (one-time init + delta)

```
1. Walk filesystem → docs/ + .claude/ + memory/
2. Chunk adaptively by doc_type (custom Python chunker)
3. Batch embed via Voyage API (128 chunks/batch)
4. Upsert into Qdrant with metadata (source, doc_type, project, last_modified)
5. Total init: ~10-15 min for 1M MD tokens
```

### Query-time flow (every spawn of the main agent or a sub-agent)

```
1. Main agent/agent: rag_retrieve("query keyword", scope, k)
2. MCP server: embed query → Voyage API (~100ms)
3. MCP server: Qdrant search top-K (~50ms local)
4. MCP server: return chunks with metadata + score
5. Total: ~150-200ms per query (network-bound)
6. Cache: subsequent same query → ~10ms (cache hit)
```

---

## 3. BLANKET load list

> **Total: ~100K tokens (28% of project MD)**
> Auto-loaded on every spawn of the main agent + 4 agents.

### A. Core stable docs (~30K, rarely changed)

| File | Token | Why blanket |
|---|---:|---|
| `docs/rules.md` | ~7K | Stable coding conventions, referenced by every task |
| `CLAUDE.md` (root pointer) | ~3K | Auto-inject system reminder |
| `docs/CLAUDE.md` | ~3K | Tech stack overview baseline |
| `docs/architecture.md` | ~7K | 4-layer Clean Arch baseline |
| `docs/PROJECT-MAP.md` | ~3K | Navigation map |
| `docs/workflow-contract.md` | ~4K | 9-phase state machine, core of the Contract domain |
| `docs/forms-spec.md` | ~3K | 8-form catalog, domain knowledge |

### B. Current state (~25K, known to the main agent directly, no retrieval needed)

| File | Strategy | Token |
|---|---|---:|
| `docs/STATUS.md` **top 100 lines** | Current phase + In Progress + top 1-2 Recently Done | ~15K |
| `docs/HANDOFF.md` **top 150 lines** | Last updated + TL;DR latest session + next priority | ~10K |

→ **Drop from blanket:** STATUS Recently Done rows beyond the newest 5 (retrieve if needed), HANDOFF TL;DR entries older than 1 week.

### C. Agent infrastructure (~25K, agent identity is stable)

| File | Token |
|---|---:|
| `.claude/agents/README.md` | ~5K |
| `.claude/agents/investigator.md` | ~3.5K |
| `.claude/agents/implementer.md` | ~4K |
| `.claude/agents/reviewer.md` | ~3.5K |
| `.claude/agents/cicd-monitor.md` | ~5K |
| `.claude/agent-memory/{4 agents}/MEMORY.md` auto-inject 25KB first 200 lines | ~4K total |

### D. Skills descriptions (~5K, auto-injected, not the full SKILL.md)

| File | Strategy | Token |
|---|---|---:|
| `.claude/skills/README.md` | Full | ~2.5K |
| 6 SKILL.md descriptions | Auto-inject by Claude Code | ~1K total |
| 6 SKILL.md detail | **NOT blanket** → RAG retrieve when triggered | n/a |

### E. Memory user-level critical (~15K)

| File | Token | Why critical |
|---|---:|---|
| `project_solution_erp.md` | ~3.5K | Project overview narrative |
| `feedback_md_compact_narrative.md` (§6.5) | ~2K | Core rule for all doc work |
| `feedback_uat_skip_verify.md` | ~2K | Phase 9 current mode rule |
| `feedback_multi_agent_setup.md` | ~3K | 4-agent discipline |
| `feedback_per_chunk_commit.md` | ~2K | Reusable implementer pattern |
| `feedback_audit_reuse_before_clone.md` | ~2K | Investigator natural pattern |

→ **Drop from blanket:** the remaining 11 memory entries (retrieve when a pattern is triggered).

### TOTAL BLANKET ≈ 100K tokens

---

## 4. RAG store list

> **Total: ~254K tokens (72% of project MD)**
> Indexed into Qdrant, retrieved on demand.

### F. Session logs (~150K, biggest, 49% MD)

```
Path: docs/changelog/sessions/*.md (41+ files, growing)
Chunk strategy: 1 file = 1 chunk (preserve narrative §6.5)
Metadata:
  - session_date: extracted from filename
  - phase: extracted from content
  - topic: extracted from H1
  - commit_sha_range: extracted from "Commits:" line
  - doc_type: "session_log"
Scope filter: time_range="last_week|last_month|last_quarter|all"
```

### G. Gotchas (~9K, looked up per debug)

```
Path: docs/gotchas.md (44+ entries)
Chunk strategy: split per "### N. ..." numbered heading
Metadata:
  - gotcha_id: integer
  - category: extracted from content (tech/EF/Workflow/CICD/Security/...)
  - doc_type: "gotcha"
Scope filter: scope="gotcha"
```

### H. Archives + Recently Done (~75K)

| File | Strategy | Token |
|---|---|---:|
| `docs/STATUS.md` rest beyond top 100 | Per H2 section | ~8K |
| `docs/HANDOFF.md` rest beyond top 150 | Per H2 section | ~21K |
| `docs/changelog/migration-todos.md` | Per H3 task | ~18K |
| `docs/changelog/recently-done-archive-*.md` | Per H3 phase | ~6K |
| `docs/_archive/forms-spec-raw.md` | Full file (cold archive) | ~23K |
| `docs/_archive/workflow-raw.md` | Full file (cold archive) | ~4K |

### I. Flows + Database (~17K, task-conditional)

| File | Token | When to retrieve |
|---|---:|---|
| `docs/flows/README.md` | ~1K | Index when a flow is needed |
| `docs/flows/auth-flow.md` | ~1K | Auth task |
| `docs/flows/permission-flow.md` | ~1.5K | Permission task |
| `docs/flows/contract-creation-flow.md` | ~1.5K | Contract task |
| `docs/flows/contract-approval-flow.md` | ~1.5K | Approval task |
| `docs/flows/form-render-flow.md` | ~1K | Form task |
| `docs/flows/sla-expiry-flow.md` | ~1K | SLA task |
| `docs/database/database-guide.md` | ~3K | Schema task |
| `docs/database/schema-diagram.md` | ~12K | ERD task |

### J. SKILL.md detail (~40K, retrieved when a skill is triggered)

| File | Token |
|---|---:|
| `.claude/skills/contract-workflow/SKILL.md` | ~7K |
| `.claude/skills/form-engine/SKILL.md` | ~5K |
| `.claude/skills/permission-matrix/SKILL.md` | ~5K |
| `.claude/skills/dependency-audit-erp/SKILL.md` | ~5K |
| `.claude/skills/ef-core-migration/SKILL.md` | ~5.5K |
| `.claude/skills/iis-deploy-runbook/SKILL.md` | ~6K |

### K. Guides ops conditional (~10K)

| File | Token | When to retrieve |
|---|---:|---|
| `docs/guides/deployment-iis.md` | ~2.5K | Deploy task |
| `docs/guides/cicd.md` | ~2K | CI/CD task |
| `docs/guides/security-checklist.md` | ~2K | Security audit |
| `docs/guides/vps-setup.md` | ~2.5K | VPS setup |
| `docs/guides/runbook.md` | ~1K | Ops debug |

### L. Memory entries non-critical (~50K, pattern lookup)

```
Remaining 11 memory entries (user-level):
- feedback_n_stage_workflow_pattern.md (DEPRECATED post-Mig 21)
- feedback_designtime_runtime_db.md
- feedback_drastic_refactor_scope.md
- feedback_cron_monthly_limitation.md
- feedback_user_manual_style.md
- feedback_node_cicd.md
- feedback_unittest_timing.md
- feedback_responsive_laptop_breakpoint.md
- feedback_service_hook_vs_endpoint.md
- reference_session_prompts.md
- MEMORY.md index
```

### M. Audit logs (~2K, growing)

```
docs/changelog/skill-audit-{YYYY-MM}.md (monthly audit log)
```

### TOTAL RAG STORE ≈ 254K tokens

---

## 5. Tool stack recommend

| Component | Tool | Reason | Cost |
|---|---|---|---|
| **Vector DB** | **Qdrant local** | Rust binary 50MB, no Docker, fast, metadata filtering, admin UI | $0 |
| **Embedding** | **Voyage-3-large** | Anthropic partner, multilingual 26 lang, no GPU needed | $0.18/M (~$0.36 init) |
| **MCP server framework** | **FastMCP Python** | Pythonic MCP framework (v1 was merged into the official MCP Python SDK; `fastmcp` 2.x is the standalone successor), ~100 LOC, auto schema | $0 |
| **Chunking** | **Custom Python adaptive** | ~50 LOC, transparent, §6.5 compliant | $0 |
| **Re-index pipeline** | **Pre-commit hook** | Native git, ~10 LOC bash | $0 |
| **Monitoring** | **Qdrant Dashboard + custom audit** | Built-in UI port 6333 | $0 |
| **Auth (multi-AI)** | **Bearer token + rate limit** | Custom middleware ~30 LOC | $0 |
| **Batch re-index** | **Voyage Batch API** | 50% discount for mass re-embed | -50% |

### Rejected stack + reasons

| Alternative | Reason rejected |
|---|---|
| Chroma vector DB | Python ecosystem, slower than Qdrant Rust |
| pgvector | Needs a PostgreSQL setup, overhead |
| OpenAI text-embedding-3-small | Vietnamese quality weaker than Voyage |
| BGE-M3 local | Needs GPU >= 4GB (Intel Iris Xe is not enough) |
| LangChain / LlamaIndex | Heavy abstraction, hard black-box debugging, chunker does not follow §6.5 |
| TypeScript MCP SDK | More verbose than Python FastMCP |
| Pinecone cloud | Paid + vendor lock-in, that scale is not needed |

---

## 6. Setup scripts

### 6.1 `requirements.txt`

```text
fastmcp>=2.0
voyageai>=0.3
qdrant-client>=1.12
python-frontmatter>=1.1
```

### 6.2 `scripts/rag-indexer.py` (~120 LOC)

```python
"""
RAG Indexer: embed MD files + upsert into Qdrant.

Usage:
    python rag-indexer.py                      # full index
    python rag-indexer.py --files "a.md b.md"  # delta re-index
"""
import glob
import os
import re
import sys
import uuid

from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Assumes the local Qdrant server binary on its default port (the dashboard and
# the section 7 curl checks use the same port). Embedded mode,
# QdrantClient(path=...), locks its folder to a single process, so the indexer
# and the MCP server could not open it at the same time.
QDRANT_URL = "http://localhost:6333"
COLLECTION = "project_md"  # rename per project
EMBED_MODEL = "voyage-3-large"
DIM = 1024
BATCH = 128  # Voyage also caps total tokens per request; lower this if a batch is rejected

voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(url=QDRANT_URL)


def chunk_file(path: str) -> list[dict]:
    """Adaptive chunking by doc type."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    path = path.replace("\\", "/")  # normalize Windows separators for the checks below
    base = {"source": path, "size_chars": len(content)}

    if "/changelog/sessions/" in path:
        return [{**base, "content": content, "doc_type": "session_log"}]

    if path.endswith("gotchas.md"):
        entries = re.split(r"^### (\d+)\.", content, flags=re.M)
        return [
            {**base, "content": f"### {entries[i]}.{entries[i+1]}",
             "doc_type": "gotcha", "entry_id": int(entries[i])}
            for i in range(1, len(entries), 2)
        ]

    if "/skills/" in path:
        return [{**base, "content": content, "doc_type": "skill"}]

    if "/agents/" in path:
        return [{**base, "content": content, "doc_type": "agent"}]

    if path.endswith("MEMORY.md") or "/memory/" in path:
        return [{**base, "content": content, "doc_type": "memory"}]

    # Default: split per H2 heading
    sections = re.split(r"^## ", content, flags=re.M)
    return [
        {**base, "content": ("## " + s) if i > 0 else s,
         "doc_type": "doc", "section_idx": i}
        for i, s in enumerate(sections) if len(s.strip()) > 200
    ]


def point_id(chunk: dict) -> str:
    """Stable id per chunk. Built-in hash() is salted per process, so it
    cannot be used to overwrite points written by an earlier run."""
    key = chunk.get("entry_id", chunk.get("section_idx", 0))
    name = f"{chunk['source']}#{chunk['doc_type']}#{key}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


def main(files: list[str] | None = None):
    # Init collection (idempotent)
    if not qdrant.collection_exists(COLLECTION):
        qdrant.create_collection(
            COLLECTION,
            vectors_config=VectorParams(size=DIM, distance=Distance.COSINE)
        )

    # Determine paths
    if files:
        paths = files
    else:
        paths = (
            glob.glob("docs/**/*.md", recursive=True) +
            glob.glob(".claude/**/*.md", recursive=True)
        )
    paths = [p for p in paths if "node_modules" not in p and "_user-guide" not in p]

    # Chunk
    chunks = []
    for path in paths:
        try:
            chunks.extend(chunk_file(path))
        except Exception as e:
            print(f"Skip {path}: {e}")
    print(f"Chunking: {len(chunks)} chunks from {len(paths)} files")
    if not chunks:
        return

    # Batch embed
    texts = [c["content"] for c in chunks]
    embeddings = []
    for i in range(0, len(texts), BATCH):
        batch = texts[i:i+BATCH]
        result = voyage.embed(batch, model=EMBED_MODEL, input_type="document")
        embeddings.extend(result.embeddings)
        print(f"Embedded {i+len(batch)}/{len(texts)}")

    # Upsert (Qdrant replaces points that share an id)
    points = [
        PointStruct(id=point_id(c), vector=emb, payload=c)
        for c, emb in zip(chunks, embeddings)
    ]
    qdrant.upsert(collection_name=COLLECTION, points=points)
    print(f"Indexed {len(points)} chunks → Qdrant")


if __name__ == "__main__":
    files = sys.argv[2].split() if len(sys.argv) > 2 and sys.argv[1] == "--files" else None
    main(files)
```

### 6.3 `scripts/rag-mcp-server.py` (~80 LOC)

```python
"""
MCP retriever server: expose the rag_retrieve tool to Claude Code + agents.

Run:
    python rag-mcp-server.py               (stdio default)
    python rag-mcp-server.py --http :7777  (HTTP/SSE for multi-AI)
"""
import os
import sys

from fastmcp import FastMCP
from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

mcp = FastMCP("project-rag")
voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(url="http://localhost:6333")  # same local server as the indexer
COLLECTION = "project_md"


@mcp.tool()
def rag_retrieve(
    query: str,
    scope: str = "all",
    k: int = 5
) -> list[dict]:
    """
    Semantic search over MD context.

    Args:
        query: Search query (Vietnamese or English, mixing is OK)
        scope: Filter by doc_type:
            "all" | "session_log" | "gotcha" | "memory" | "skill" | "agent" | "doc"
        k: Top chunks to return (1-15, default 5)

    Returns:
        List[dict] with keys: content, source, doc_type, score

    Use cases:
        - Historical session log: rag_retrieve("Mig 26 V2", scope="session_log")
        - Gotcha lookup: rag_retrieve("silent 403", scope="gotcha")
        - Pattern reuse: rag_retrieve("audit clone", scope="memory")
        - Cross-section: rag_retrieve("query", scope="all", k=10)
    """
    k = min(max(k, 1), 15)

    # Embed query
    query_vec = voyage.embed(
        [query], model="voyage-3-large", input_type="query"
    ).embeddings[0]

    # Filter
    query_filter = None
    if scope != "all":
        query_filter = Filter(
            must=[FieldCondition(key="doc_type", match=MatchValue(value=scope))]
        )

    # Search (query_points replaces search(), which recent qdrant-client releases deprecate)
    hits = qdrant.query_points(
        collection_name=COLLECTION,
        query=query_vec,
        query_filter=query_filter,
        limit=k
    ).points

    return [
        {
            "content": r.payload["content"][:3000],  # truncate huge chunks
            "source": r.payload["source"],
            "doc_type": r.payload["doc_type"],
            "score": round(r.score, 3)
        }
        for r in hits
    ]


@mcp.tool()
def rag_stats() -> dict:
    """Return collection stats (for audit)."""
    info = qdrant.get_collection(COLLECTION)
    return {
        "total_chunks": info.points_count,
        "vector_dim": info.config.params.vectors.size,
        "distance": info.config.params.vectors.distance.value,
        "status": str(info.status),
        "optimizer_status": str(info.optimizer_status),
    }


if __name__ == "__main__":
    # Default: stdio mode for Claude Code
    # HTTP/SSE mode: python rag-mcp-server.py --http :7777
    if "--http" in sys.argv:
        port = int(sys.argv[sys.argv.index("--http") + 1].lstrip(":"))
        mcp.run(transport="sse", port=port)
    else:
        mcp.run()  # stdio default
```

### 6.4 `.claude/settings.json` register

Recent Claude Code releases read project-scoped MCP servers from `.mcp.json` at the repo root; verify the file name and the variable syntax against the installed version before copying this.

```jsonc
{
  "mcpServers": {
    "project-rag": {
      "command": "python",
      "args": ["scripts/rag-mcp-server.py"],
      "cwd": "${workspaceFolder}",
      "env": {
        "VOYAGE_API_KEY": "${env:VOYAGE_API_KEY}"
      }
    }
  }
}
```

### 6.5 Pre-commit hook

```bash
#!/bin/sh
# .git/hooks/pre-commit
# Re-index changed MD files
changed_md=$(git diff --cached --name-only --diff-filter=AMR | grep -E "\.md$")
if [ -n "$changed_md" ]; then
  echo "RAG re-indexing $(echo "$changed_md" | wc -l) MD files..."
  python scripts/rag-indexer.py --files "$changed_md"
fi
```

### 6.6 Agent .md frontmatter update

```yaml
# Add the tool to each .claude/agents/{agent}.md:
tools: [Read, Grep, Glob, Bash, mcp__project-rag__rag_retrieve, ...]
```

Add this section to the system prompt:

```markdown
## RAG retriever usage (rag_retrieve tool)

**WHEN to use:**
- Historical session log lookup (> 1 week old)
- Gotcha pattern matching while debugging
- Memory pattern reuse "clone X to Y"
- Cross-section semantic search

**WHEN to use Read instead:**
- Current state (STATUS + HANDOFF top): blanket loaded
- Active file editing (needs the full file)
- Architecture review (stable docs, blanket)

**Query examples:**
- rag_retrieve("silent 403 non-admin", scope="gotcha", k=3)
- rag_retrieve("PE V2 wire pattern", scope="session_log", k=5)
- rag_retrieve("audit reuse clone", scope="memory", k=3)
```

---

## 7. Audit procedure

### 7.1 Weekly quick audit (~30 min, every Saturday)

**Goal:** Check health + cost trend every week.

**Checklist:**

```bash
# 1. Index health
curl http://localhost:6333/collections/project_md
# Verify: points_count increasing + status="green"

# 2. Re-index lag
git log --since="1 week ago" --name-only --pretty=format: | grep -E "\.md$" | sort -u | wc -l
python -c "
from qdrant_client import QdrantClient
q = QdrantClient(url='http://localhost:6333')
# Check that the indexed sources match the changed files
"

# 3. Voyage cost
# Visit voyageai.com dashboard → check last 7 days usage
# Target: <$1/week steady state

# 4. Random query quality (5 manual queries)
# Sample queries:
# - "Recent Mig" → expect session log top
# - "silent 403" → expect gotcha #44 top
# - "audit reuse" → expect memory entry top
# Score: 1-5 per query (relevant chunks in top-5)

# 5. Storage size
du -sh ./rag-data/
# Target: <500MB per project
```

**Log:** `docs/changelog/rag-audit-weekly-{YYYY-WW}.md` (1 page)

### 7.2 Monthly deep audit (~2-3h, start of each month)

**Goal:** Quality benchmark + chunking review + stale cleanup.

**Checklist:**

```python
# 1. Quality benchmark: 30-query test set
test_queries = [
    # Categories: state, historical, debug, pattern, cross-stack
    # (queries stay in the project's working language to exercise Vi-En retrieval)
    ("Phase hiện tại", "doc"),
    ("Mig 26 PE Level Opinions UPSERT", "session_log"),
    ("silent 403 non-admin Forbidden", "gotcha"),
    ("audit reuse trước clone B từ A", "memory"),
    # ... 30 total covering all scopes
]

results = []
for query, expected_scope in test_queries:
    retrieved = rag_retrieve(query, k=10)
    # Manual score:
    # - Recall: % of expected sources in top-10
    # - Precision: % of retrieved chunks actually relevant
    results.append({"query": query, "recall": ..., "precision": ...})

# Target: avg recall > 80%, precision > 75%

# 2. Chunking review: sample 10 random chunks
# Check: are chunks cut in the middle of a narrative (violates §6.5)?
# Action: tune the chunker if issues are found

# 3. Stale audit
# Files not re-indexed for > 14 days → flag
# Files deleted from the repo but still in Qdrant → cleanup

# 4. Cost trend
# Monthly Voyage spend vs target
# Target: <$3/month steady

# 5. Capacity check
# Total chunks vs disk space projection
# Project growing significantly (>20% MoM) → plan scale
```

**Log:** `docs/changelog/rag-audit-monthly-{YYYY-MM}.md` (2-3 pages)

### 7.3 Quarterly major audit (~4-6h, every quarter)

**Goal:** Strategic review + major upgrades.

**Checklist:**

1. **Embedding model upgrade decision**
   - Has Voyage released a new model? Test side-by-side with voyage-3-large
   - Quality benchmark on the 30-query test set
   - Decision: upgrade if recall improves by +5pp
2. **Chunking strategy iteration**
   - Review 50 random chunks
   - Identify patterns: bad cuts, missing overlap, missing metadata
   - Tune chunker code → full re-index
3. **Collection re-build from scratch**
   - Backup current → drop collection → re-index all
   - Purpose: clean orphan chunks + apply new chunking
   - Effort: ~30 min for 1M MD
4. **Multi-AI client access audit**
   - Active clients (Claude Code / Desktop / GPT / Cursor)
   - Per-client query volume + token spend
   - Security: rotate auth tokens, review rate limits
5. **Cross-project namespace audit** (if multi-project)
   - Is project isolation working correctly?
   - Cross-project queries: intentional vs accidental?
   - Adjust metadata filter rules

**Log:** `docs/changelog/rag-audit-quarterly-{YYYY-Q}.md` (5-10 pages)

### 7.4 Trigger-based audit (ad-hoc)

| Trigger | Action |
|---|---|
| Critical retrieval miss (reported by the main agent) | Audit why the relevant chunk was missed + tune |
| Cost spike >50% MoM | Audit query patterns + rate limit clients |
| Re-index hangs >1h | Audit indexer logs + Qdrant health |
| Quality regression observed by the main agent | Spot-check + run the monthly audit early |
| New project added | Setup namespace + initial index audit |

---

## 8. Multi-AI client access

### 8.1 MCP protocol: client-agnostic

MCP (Model Context Protocol) is a **standard protocol**.
Any AI client that supports MCP can consume the same server:

```
             Qdrant (single source)
                      ↓
          MCP server :7777 (HTTP/SSE)
        ↙         ↓         ↓         ↘
Claude Code    Claude    Cursor    GPT-4 +
               Desktop   IDE       custom adapter
```

### 8.2 Transport modes

| Mode | Use case | Setup |
|---|---|---|
| **stdio** | Single client (Claude Code local), default | `python rag-mcp-server.py` |
| **HTTP/SSE** | Multi-client (network access) | `python rag-mcp-server.py --http :7777` |
| **WebSocket** | Bi-directional (rare) | Custom config |

### 8.3 Setup multi-AI mode

**Step 1: Run the MCP server in HTTP mode**

```bash
# Terminal 1: MCP server (keep running)
export VOYAGE_API_KEY="pa-xxxx"
python scripts/rag-mcp-server.py --http :7777
# Server endpoint: http://localhost:7777/sse
```

**Step 2: Add auth middleware (recommended for multi-client)**

```python
# Update rag-mcp-server.py
# NOTE: sketch only. `fastmcp.middleware.bearer_auth` and its arguments are
# placeholders for the planned ~30 LOC custom middleware; check the FastMCP
# auth docs for the provider the installed version actually ships.
from fastmcp import FastMCP
from fastmcp.middleware import bearer_auth

ALLOWED_TOKENS = {
    "claude-code-token": "claude-code-primary",
    "gpt4-token": "gpt4-cursor-integration",
    "custom-agent-token": "custom-research-agent",
}

mcp = FastMCP("project-rag", middleware=[
    bearer_auth(tokens=ALLOWED_TOKENS, rate_limit_per_minute=30)
])
```

**Step 3: Register per-client config**

Config file names and keys differ between clients and versions (for example, recent Claude Code releases read project-scoped servers from `.mcp.json`), so treat the snippets below as templates and check each client's MCP docs.

#### Claude Code (main agent + 4 agents)

```jsonc
// .claude/settings.json
{
  "mcpServers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse",
      "headers": {
        "Authorization": "Bearer claude-code-token"
      }
    }
  }
}
```

#### Claude Desktop

```jsonc
// claude_desktop_config.json
{
  "mcpServers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse",
      "headers": {
        "Authorization": "Bearer claude-desktop-token"
      }
    }
  }
}
```

#### Cursor IDE

```jsonc
// .cursor/settings.json
{
  "mcp.servers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse"
    }
  }
}
```

#### GPT-4 via custom adapter

```python
# Use OpenAI Assistants API + custom function calling
# NOTE: /tool/rag_retrieve is a hypothetical REST shim in front of the MCP
# server; the SSE transport does not expose such a route by itself.
import requests

def query_project_rag(query: str, scope: str = "all", k: int = 5):
    response = requests.post(
        "http://localhost:7777/tool/rag_retrieve",
        headers={"Authorization": "Bearer gpt4-token"},
        json={"query": query, "scope": scope, "k": k}
    )
    return response.json()

# Register as OpenAI function tool
```

#### Continue.dev / custom agent

```yaml
# config.yaml
mcp_servers:
  - name: project-rag
    transport: sse
    url: http://localhost:7777/sse
    auth_token: custom-agent-token
```

### 8.4 Security model multi-AI

| Concern | Mitigation |
|---|---|
| Token leak | Rotate quarterly, store in env vars |
| Rate limit abuse | 30 req/min/token default, tune per client |
| Read-only enforcement | MCP server exposes only `rag_retrieve` + `rag_stats` (no write tools) |
| Audit log | Log every query: timestamp + client_token + query + result_count |
| Cross-project leak | Per-collection access control (future enhancement) |

### 8.5 Cost considerations multi-AI

```
Single Claude Code client (current):
  Voyage cost: ~$0.20/month (low query volume)
  Qdrant: free local

4 AI clients heavy use (Claude Code + Desktop + Cursor + GPT-4):
  Voyage cost: ~$2-5/month (higher query volume)
  Network bandwidth: minimal (~100KB/query response)
  CPU: Qdrant + Voyage embed call ~100ms total

→ Multi-AI access cost scales linearly with query volume, not with infrastructure.
```

### 8.6 Recommended rollout

```
Phase 1 (Week 1-4): Single client (Claude Code only)
  → Validate quality + cost baseline

Phase 2 (Month 2+): Add Claude Desktop if mobile/casual access is needed
  → Same auth, shared collection

Phase 3 (Month 3+): Add Cursor IDE if working across multiple IDEs
  → Verify no cross-tool conflicts

Phase 4 (Future): GPT-4 / custom agent integration if needed
  → Custom adapter + strict auth
```

---

## 9. Timeline rollout

### Hour-by-hour breakdown (~10-14h dedicated session)

| Hour | Task | Effort |
|---|---|---|
| **1-2** | Setup pre-flight: disk cleanup + Voyage signup + Python deps install | ~2h |
| **3-4** | Write `scripts/rag-indexer.py` + run initial embed | ~2h |
| **5** | Verify Qdrant collection + manual query sanity check | ~1h |
| **6-7** | Write `scripts/rag-mcp-server.py` + register `.claude/settings.json` | ~2h |
| **8** | Test rag_retrieve through Claude Code (main agent solo) | ~1h |
| **9-10** | Update 4 agent .md frontmatter + system prompt sections | ~2h |
| **11** | Setup pre-commit hook + audit logging | ~1h |
| **12-14** | Buffer + trial of 10-15 queries to measure quality + cost | ~3h |

### Trial 4-week plan

```
Week 1: Pilot single project (smaller of 2)
  - Day 1-2: Setup + initial index
  - Day 3-7: Active use + measure baseline metrics
  - Deliverable: rag-audit-weekly-W1.md

Week 2: Roll out 2nd project
  - Day 1: Setup separate Qdrant collection
  - Day 2-7: Dual-project use, measure
  - Deliverable: rag-audit-weekly-W2.md

Week 3: 4-agent integration
  - Day 1-2: Update 4 agent .md with the rag_retrieve tool
  - Day 3-7: Multi-agent tasks, measure shared cache benefit
  - Deliverable: rag-audit-weekly-W3.md

Week 4: Decision gate (keep / tune / upgrade B / rollback)
  - Day 1-2: Compile metrics
  - Day 3: Decision meeting (owner + main agent)
  - Day 4-7: Apply decision (tune embedding/chunking OR upgrade Option B OR rollback)
  - Deliverable: rag-audit-monthly-M1.md + decision doc
```

### Decision gate Week 4

```
PASS criteria (continue + tune):
  ✅ Quality recall > 80% on 30 query benchmark
  ✅ Cost < $5/month total (Voyage + storage)
  ✅ Session lifespan up > 30% (heavy session)
  ✅ Multi-agent shared cache hit > 60%
  ✅ Critical retrieval miss < 10% of queries
  ✅ Storage < 1GB per project

TUNE criteria (continue + adjust):
  ⚠️ Quality 70-80% → tune chunking or upgrade embedding
  ⚠️ Cost $5-10/mo → audit query patterns, reduce k
  ⚠️ Session lifespan up < 30% → audit blanket effectiveness

ROLLBACK criteria (archive RAG):
  ❌ Quality < 70%
  ❌ Cost > $10/mo recurring
  ❌ Session lifespan does NOT increase, or decreases
  ❌ Main agent frequently complains about "missing context"
  ❌ Storage > 5GB per project
```

---

## 10. Caveats + risks

### 10.1 Beta features risk

| Feature | Status | Mitigation |
|---|---|---|
| Anthropic Memory tool | Beta `context-management-2025-06-27` | Defer until GA, keep using the current MEMORY.md |
| Anthropic Files API | Beta `files-api-2025-04-14` | Optional add-on, RAG primary |
| Extended 1h prompt cache | Beta `extended-cache-ttl-2025-04-11` | Use 5min default, opt in to 1h for heavy sessions |
| Voyage AI API | Stable | Production OK |
| Qdrant local | Stable | Production OK |
| FastMCP | Stable v2+ | Production OK |

### 10.2 Storage concerns

```
Owner's disk currently: 911/954 GB used = 96% full (43GB free)

RAG storage budget:
  Qdrant binary:       ~50MB
  Per project index:   ~200-500MB (depends on MD volume)
  Backup snapshots:    ~500MB
  Logs + audit:        ~100MB
  Per project total:   ~1GB

2 projects total: ~2GB + buffer 1GB = 3GB recommended free space

→ Clean up BEFORE setup: target 5GB+ free
```

**Cleanup priorities:**

- `node_modules` of old projects
- `.NET bin/obj` artifacts
- Docker images (`docker system prune -a`)
- Browser caches (Chrome/Edge ~5GB common)
- `%LOCALAPPDATA%` caches (NuGet, dotnet)
- Unused Downloads / Videos

### 10.3 Quality monitoring

| Risk | Indicator | Action |
|---|---|---|
| Chunking breaks narrative | Main agent reports "missing context" | Review chunk strategy, tune |
| Embedding drift | Recall drop > 10pp on benchmark | Re-embed full, check Voyage updates |
| Stale index | Committed files not yet re-indexed | Force full re-index, check hook |
| Poor query phrasing | Low precision on simple queries | Main agent refines query patterns |
| Cross-language mismatch | Vietnamese query misses English content | Multilingual reranker or query expansion |

### 10.4 Fallback strategy

```
When RAG fails / quality drops:
  Layer 1: Main agent falls back to Read full file (existing lazy pattern still works)
  Layer 2: Main agent blanket-loads the critical file directly
  Layer 3: Roll back to a Qdrant snapshot (weekly backup)
  Layer 4: Full re-index from scratch (~15 min)
  Layer 5: Archive RAG, return to the current lazy pattern (ultimate fallback)
```

The main agent's 120K blanket is NOT lost when RAG fails → graceful degradation.

### 10.5 Vietnamese-English mix considerations

```
Voyage-3-large claims multilingual coverage of 26 languages.
An explicit Vietnamese benchmark is NOT public.

Risk: mixed Vietnamese-English technical jargon may miss synonyms.
Example: "im lặng 403" vs "silent 403": are the vectors close?

Mitigation:
  - Test 10-20 mixed Vi-En queries in the audit benchmark
  - If recall is low → consider voyage-multilingual-2 as backup
  - Or add query expansion (Anthropic Contextual Retrieval pattern)
```

---

## 11. Success metrics

### 11.1 Quality metrics

| Metric | Target | Measurement |
|---|---:|---|
| Recall avg (30 query benchmark) | > 80% | Manual score weekly |
| Precision avg | > 75% | Manual score weekly |
| Critical retrieval miss rate | < 10% | Main agent reports, cumulative |
| Cross-language query recall | > 70% | Vi-En mix benchmark |

### 11.2 Cost metrics

| Metric | Target | Measurement |
|---|---:|---|
| Voyage monthly spend | < $5 | Voyage dashboard |
| Total RAG infra cost | < $10/month | Sum tools |
| Cost per query | < $0.001 | Calculated |
| Disk usage per project | < 1GB | `du -sh` |

### 11.3 Performance metrics

| Metric | Target | Measurement |
|---|---:|---|
| Query latency (P50) | < 200ms | MCP server log |
| Query latency (P99) | < 500ms | MCP server log |
| Re-index lag (post-commit) | < 30s | Pre-commit hook timing |
| Cache hit rate (multi-agent) | > 60% | Custom metric |

### 11.4 Capacity metrics

| Metric | Target | Measurement |
|---|---:|---|
| Session lifespan productive | +50% vs lazy | Time tracker |
| Tasks before lost-in-middle | > 35 | Task counter |
| Heavy session token | -20% vs lazy | Anthropic dashboard |
| Multi-agent overlap saving | > 50K/session | Cumulative calc |

### 11.5 Multi-AI client metrics

| Metric | Target | Measurement |
|---|---:|---|
| Active clients | ≥ 1 stable | Audit log |
| Per-client query volume | Track baseline | Audit log per client |
| Cross-client conflict | 0 | Bug reports |

---

## 12. Future enhancements

### 12.1 Phase 2 (after Week 4 validation)

| Enhancement | Effort | Benefit |
|---|---|---|
| Upgrade Option B (drop blanket 30-40K) | 1 session | Saving +15% tokens |
| Anthropic Memory tool integration | 2-3h | Native cross-conversation memory |
| Files API integration | 2-3h | Reduce blanket re-upload cost |
| Citations enable | 1h | RAG quality trace |

### 12.2 Phase 3 (Month 2-3)

| Enhancement | Effort | Benefit |
|---|---|---|
| Hybrid BM25 + vector search (Contextual Retrieval) | 4-6h | 49-67% fewer failed retrievals (Anthropic Contextual Retrieval post) |
| Multi-project namespace | 2-3h | Cross-project query with strict isolation |
| Reranker model (Cohere rerank-3) | 2-3h | +10-20% precision |
| Custom Streamlit audit dashboard | 4-5h | Visual quality monitoring |

### 12.3 Phase 4 (Quarter 2+)

| Enhancement | Effort | Benefit |
|---|---|---|
| Replace Voyage with Anthropic native embedding (if GA) | 2-3h | Reduce vendor count |
| Auto-tuning chunking (LLM-aided) | 1 week | Quality+ |
| Federated multi-machine setup | 1 week | Team usage |
| Time-series analytics on retrieval patterns | 1 week | Insights |

### 12.4 Defer indefinitely (over-engineering)

- ❌ LangChain / LlamaIndex framework (heavy abstraction)
- ❌ Self-host LLM (cost > value)
- ❌ Custom embedding model fine-tuning (effort > value)
- ❌ Full text + vector hybrid index (use Voyage Reranker instead)

---

## 13. Multi-agent cumulative cost reality (Anthropic 8-10× warning)

> **Added S21 turn 2 (2026-05-12):** clarification after the user caught the gap "the 120K blanket does NOT include the 4 agents".
### Per-entity blanket breakdown

```
Em main blanket: ~120K
  STATUS + HANDOFF top + rules + architecture
  + 5 agent .md + 4 MEMORY.md auto-inject
  + skills desc + memory critical
  + auto-inject system reminders

Per sub-agent spawn baseline: ~80-100K each
  Agent system prompt (~5K)
  + 3 skills preload SKILL.md full (~21K, trigger semantic)
  + Auto-inject MEMORY.md 25KB first 200 lines (~7K)
  + Em main passes task spec (~10-15K)
  + Em main pastes common context excerpt (~30-50K)
  + Auto-inject project context (~10K)
  = ~80-100K per sub-agent spawn (per Anthropic docs)

4 sub-agents cumulative: ~400K
  (4 × ~100K each, isolated context windows)

TOTAL cumulative blanket 5 entities: ~520K
  Em main + 4 sub-agents combined (isolated windows, cumulative billing)
```

### Context windows are ISOLATED

```
The 5 entities do NOT share 520K inside a single 1M context window.

Each entity has its OWN 1M context window:
  Em main       → 1M context window, uses ~120K
  Investigator  → 1M context window, uses ~100K
  Implementer   → 1M context window, uses ~100K
  Reviewer      → 1M context window, uses ~100K
  CICD Monitor  → 1M context window, uses ~100K

→ Each entity has its own LOST-IN-MIDDLE threshold (~700K each)
→ Each entity has its own capacity of ~58 tasks before hitting its hard cap

BUT billing is CUMULATIVE, 520K across all contexts:
  Anthropic bills the total tokens across all 5 windows
  → Hits the weekly cap 4-5× faster than solo Em main
```

### Heavy session token compound effect (Option A vs lazy)

**Without RAG (lazy current, 4 agents spawned):**

```
Em main:
  Blanket:                 120K
  Lazy Read on-demand:     ~50K
  Reasoning + coordinate:  ~30K
  = ~200K subtotal

4 sub-agents (each):
  Spawn blanket:           ~100K
  Lazy Read inside agent:  ~50K
  Reasoning + work:        ~30K
  Each agent: ~180K
  ──────────────
  4 agents subtotal: ~720K cumulative

SendMessage iteration:
  10 round trips × ~30K nominal: 300K nominal
  Cache hit 70%: ~90K effective

TOTAL HEAVY SESSION (lazy):
  200K + 720K + 90K = ~1010K nominal
  After cache discount: ~700K effective billed
```

**With Option A RAG:**

```
Em main:
  Blanket:                         120K (unchanged)
  RAG retrieve replaces lazy Read: ~30K (-20K saving)
  Reasoning streamlined:           ~25K
  = ~175K subtotal (saving 25K)

4 sub-agents (each):
  Spawn blanket:           ~100K (unchanged)
  RAG retrieve (shared cache, 70-90% hit on common queries): ~15K
  Reasoning streamlined:   ~25K
  Each agent: ~140K (saving 40K each)
  ──────────────
  4 agents subtotal: ~560K (saving 160K total)

SendMessage iteration: ~90K effective (unchanged)

TOTAL HEAVY SESSION (Option A):
  175K + 560K + 90K = ~825K nominal
  After cache discount: ~560K effective billed

SAVING: -140K (-20%)
```

### Cost saving breakdown

| Component | Lazy current | Option A | Saving |
|---|---:|---:|---:|
| Em main blanket (fixed) | 120K | 120K | 0 |
| Em main lazy Read → RAG retrieve | 50K | 30K | -20K |
| Em main reasoning streamlined | 30K | 25K | -5K |
| 4 agents spawn blanket (fixed) | 400K | 400K | 0 |
| 4 agents lazy Read → cached retrieve | 200K | 60K | **-140K** |
| 4 agents reasoning | 120K | 100K | -20K |
| SendMessage cached | 90K | 90K | 0 |
| **TOTAL EFFECTIVE BILLED** | **~700K** | **~560K** | **-140K (-20%)** |

→ **80% of the saving comes from the 4 agents** sharing the retrieve cache (70-90% cache hit on common cross-agent queries).
→ Em main saves only 25K (blanket unchanged; only Read → retrieve is optimized).
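The component rows in the breakdown above add up to the nominal totals (~1010K lazy, ~825K Option A); the ~700K / ~560K row is the post-cache estimate and is not modelled here. A minimal sketch that re-derives the nominal figures, so the estimates are easy to re-run when a component changes. The `SessionBudget` class and its field names are hypothetical, introduced only for this illustration; the numbers are copied from the two breakdowns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionBudget:
    """Token estimates in thousands (K) for one heavy session (hypothetical helper)."""
    main_blanket: int
    main_read: int          # lazy Read or RAG retrieve
    main_reasoning: int
    agent_blanket: int      # per sub-agent
    agent_read: int         # per sub-agent
    agent_reasoning: int    # per sub-agent
    send_message: int       # SendMessage traffic, already cache-discounted
    agents: int = 4

    def main_total(self) -> int:
        return self.main_blanket + self.main_read + self.main_reasoning

    def agents_total(self) -> int:
        per_agent = self.agent_blanket + self.agent_read + self.agent_reasoning
        return self.agents * per_agent

    def nominal_total(self) -> int:
        return self.main_total() + self.agents_total() + self.send_message


# Figures copied from the lazy and Option A breakdowns in this section.
LAZY = SessionBudget(120, 50, 30, 100, 50, 30, 90)
OPTION_A = SessionBudget(120, 30, 25, 100, 15, 25, 90)

main_saving = LAZY.main_total() - OPTION_A.main_total()
agents_saving = LAZY.agents_total() - OPTION_A.agents_total()
nominal_saving = LAZY.nominal_total() - OPTION_A.nominal_total()

print(f"lazy nominal:     {LAZY.nominal_total()}K")
print(f"Option A nominal: {OPTION_A.nominal_total()}K")
print(f"saving: main {main_saving}K + agents {agents_saving}K = {nominal_saving}K nominal")
```

Changing `agents` or any per-agent figure shows how quickly the cumulative total moves, which is the point of the spawn-fewer-agents advice.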
### Concrete multi-agent leverage example

```
Task: Plan B Contract V2 wire:
  🔵 Inv query "PE V2 schema pattern" → 15K retrieve + cached
  🟡 Imp query same → cache hit 90% → 1.5K effective
  🔴 Rev query same → cache hit 90% → 1.5K effective
  🟢 CICD query same → cache hit 90% → 1.5K effective
  Em main query same → cache hit 90% → 1.5K effective

  Cumulative retrieve cost: 15K + 4×1.5K = 21K

Compare to lazy:
  Each agent Reads the PE V2 file separately
  5 entities × 20K Read = 100K cumulative

→ Saving 79K just for 1 cross-agent query
```

### Optimization tips to reduce cumulative cost

**Option 1: Spawn fewer agents**
- 6-criteria decision gate for every task (per `feedback_multi_agent_setup` rule)
- If solo Em main is enough → do NOT spawn an agent
- Only spawn the agents that are REALLY needed
- In S20-S21: the 4 agents are seeds only and Em main has not spawned any yet → cost is just Em main's ~120K

**Option 2: Tune blanket sub-agent (100K → 80K)**
- Em main passes a compact spec (~10K instead of 15K)
- Em main pastes a common-context excerpt instead of the full text (~20K instead of 50K)
- Skills preload the description only (~3K instead of the 21K full SKILL.md)
  → Trigger the full SKILL.md on a semantic match
- Per sub-agent: 100K → 80K
- 4 agents cumulative: 400K → 320K
- Heavy session: 560K → 480K (-15%)

**Option 3: SendMessage cache aggressive (1h TTL beta)**
- Anthropic extended cache `extended-cache-ttl-2025-04-11`
- Static prompts: cache WRITE premium is 2× base
- Subsequent reads are billed at 0.1× base
- Multiple agents sharing the same cache prefix → large benefit
- Saving 10-15% additional

---

## 14. 3-layer hybrid RAG upgrade path (Anthropic Contextual Retrieval)

> **Added S21 turn 2 (2026-05-12):** Anthropic flagship pattern, Sept 2024.

### Pattern overview

```
Anthropic Contextual Retrieval = 3 layers compound:

Layer 1: Embeddings (Voyage-3-large)
  → Semantic + synonym + multilingual catch
  + Contextual prefix (Haiku-generated context):
    Add chunk-specific context BEFORE embed
    "This chunk discusses... in context of..."
    → Better recall via enriched vector

Layer 2: BM25 (bm25s Python lib, free, local)
  → Exact identifier + technical terms
    (function names, error codes, Mig numbers)
  + Contextual BM25 (same prefix pattern)

Layer 3: Reranking (Voyage rerank-2)
  → Cross-attention deep relevance
  → Re-score top 30 candidates → return top 5 truly relevant
```

### Performance compound effect

```
Baseline (naive vector embeddings):  ~50% recall
+ Contextual embeddings:             ~67% recall (-35% failure)
+ Hybrid Contextual + BM25:          ~75% recall (-49% failure)
+ Reranking:                         ~85% recall (-67% failure)
```

📎 Source: [Anthropic Contextual Retrieval Sept 2024](https://www.anthropic.com/news/contextual-retrieval)

### Phase rollout incremental (recommended for Bro)

| Phase | Setup | Recall | Cost/month | Effort additional |
|---|---|---:|---:|---|
| **Phase 1** (Week 1-4) | Layer 1 vector only (Voyage-3-large) | ~70% | ~$1.50 | 10-14h initial |
| **Phase 2** (Month 2) | + Layer 2 BM25 (bm25s free local) | ~78% | ~$1.50 unchanged | 2-3h |
| **Phase 3** (Month 3) | + Layer 3 Voyage rerank-2 + Contextual prefix | ~92% | ~$4-5 | 3-4h |

### Phase 1 implementation (basic vector RAG)

Already covered in Sections 5-6 of this plan. Bro implements it during the Week 1-4 trial pilot.
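The Recall column in the rollout table only means something if the Phase 1 trial measures it the same way every week. A minimal sketch of how such a benchmark could be scored, assuming each benchmark query has a hand-labelled list of relevant source files. The function names, the `cases` structure and the file names are hypothetical, not part of the existing scripts.

```python
from statistics import mean


def top_k_unique(retrieved, k):
    """First k retrieved sources with duplicates removed, order kept."""
    return list(dict.fromkeys(retrieved))[:k]


def recall_at_k(retrieved, relevant, k=5):
    """Share of the relevant sources that show up in the top k."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("each benchmark query needs at least one relevant source")
    hits = relevant.intersection(top_k_unique(retrieved, k))
    return len(hits) / len(relevant)


def precision_at_k(retrieved, relevant, k=5):
    """Share of the top-k results that are actually relevant."""
    top = top_k_unique(retrieved, k)
    if not top:
        return 0.0
    relevant = set(relevant)
    return sum(1 for source in top if source in relevant) / len(top)


def score_benchmark(cases, k=5):
    """Average recall and precision over all benchmark queries."""
    recalls = [recall_at_k(c["retrieved"], c["relevant"], k) for c in cases]
    precisions = [precision_at_k(c["retrieved"], c["relevant"], k) for c in cases]
    return {"recall": mean(recalls), "precision": mean(precisions)}


# Hypothetical benchmark entries: one English query, one Vietnamese query.
cases = [
    {"query": "silent 403",
     "retrieved": ["auth.md", "errors.md", "deploy.md"],
     "relevant": ["auth.md", "rbac.md"]},
    {"query": "im lặng 403",
     "retrieved": ["errors.md", "auth.md"],
     "relevant": ["auth.md"]},
]

report = score_benchmark(cases, k=5)
print(report)
```

Scoring the Vietnamese and English variants of the same query as separate cases makes a cross-language gap visible in the averages instead of hiding it.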
### Phase 2 upgrade: Add BM25 hybrid

```python
# scripts/rag-mcp-server.py (upgrade)
# Assumes `mcp`, `voyage`, `qdrant` and COLLECTION come from the Phase 1 server,
# and that Qdrant point ids equal the chunk positions used to build the BM25 index.
import bm25s

bm25 = bm25s.BM25.load("./rag-data/bm25_index")  # pre-built


def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse ranked id lists: score(id) = sum over lists of 1 / (rrf_k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)


@mcp.tool()
def rag_retrieve_hybrid(query, scope="all", k=5):
    # Step 1: Vector search
    query_vec = voyage.embed(
        [query], model="voyage-3-large", input_type="query"
    ).embeddings[0]
    vector_results = qdrant.search(COLLECTION, query_vec, limit=20)
    vector_ids = [hit.id for hit in vector_results]

    # Step 2: BM25 search (local Python lib); bm25s needs a tokenized query
    # and returns (doc_indices, scores), each shaped (n_queries, k)
    bm25_hits, _scores = bm25.retrieve(bm25s.tokenize(query), k=20)
    bm25_ids = [int(i) for i in bm25_hits[0]]

    # Step 3-4: Merge + dedup + score combine (RRF reciprocal rank fusion)
    fused_ids = reciprocal_rank_fusion([vector_ids, bm25_ids])  # ~30 chunks

    return fused_ids[:k]
```

### Phase 3 upgrade: Full Anthropic Contextual

```python
# scripts/rag-indexer.py (upgrade with contextual prefix)
import anthropic

claude_haiku = anthropic.Anthropic()

def contextualize_chunk(chunk_content, full_doc_path):
    """Generate context prefix using Claude Haiku (cheap model)."""
    with open(full_doc_path, encoding="utf-8") as f:
        full_doc = f.read()
    response = claude_haiku.messages.create(
        model="claude-haiku-4-5",  # cheap ~$0.0001/chunk
        max_tokens=150,
        messages=[{
            "role": "user",
            # The tags keep the document excerpt and the chunk apart in the prompt
            "content": f"""
<document>
{full_doc[:5000]}
</document>

<chunk>
{chunk_content}
</chunk>

Give a brief context (50-100 words) explaining what this chunk
is about and where it fits in the document. Be specific."""
        }]
    )
    return response.content[0].text

# In indexer pipeline:
for chunk in chunks:
    context = contextualize_chunk(chunk["content"], chunk["source"])
    chunk["content_enriched"] = f"{context}\n\n{chunk['content']}"
    # Embed enriched version → better recall
```

```python
# scripts/rag-mcp-server.py (final upgrade with reranking)
import voyageai

voyage = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

@mcp.tool()
def rag_retrieve_full(query, scope="all", k=5):
    # Step 1-3: Same as Phase 2 (vector + BM25 + merge).
    # hybrid_search is a helper wrapping those steps (not shown here).
    candidates = hybrid_search(query, scope, top=30)

    # Step 4: Voyage Rerank
    rerank_response = voyage.rerank(
        query=query,
        documents=[c.content for c in candidates],
        model="rerank-2",  # ~$0.05 per 1000 queries
        top_k=k
    )

    return [candidates[r.index] for r in rerank_response.results]
```

### Cost incremental analysis

```
Phase 1 → Phase 3 incremental cost:

Phase 1 (basic vector):
  Voyage embed: ~$0.36 initial + ~$0.20/mo delta
  = ~$1.50/mo total

Phase 2 (+BM25):
  BM25 free local (Python lib)
  Embedding cost same
  = ~$1.50/mo total (unchanged)

Phase 3 (+Reranking + Contextual):
  Voyage rerank-2: ~$0.05 per 1000 queries
    600 queries/mo × $0.05/1K = $0.03/mo
  Haiku contextual prefix: ~$0.0001 per chunk
    Initial 5000 chunks × $0.0001 = $0.50 one-time
    Delta ~100 chunks/mo × $0.0001 = $0.01/mo
  + Voyage rerank monthly: ~$0.05/mo per 1K queries × 5 projects
  + Re-embed enriched chunks: ~$0.50/mo
  = ~$4-5/mo total

→ Quality jump 70% → 92% recall = +22pp
→ Cost jump $1.50 → $4-5/mo = +$3
→ Worth it after Phase 1 validation
```

### Why incremental rollout (vs all-in Phase 3 immediate)

1. **Validate Layer 1 quality first:** if Voyage handles Vietnamese poorly → upgrading to Phase 2-3 is pointless
2. **Measure baseline cost:** know the exact Voyage spend before adding rerank/contextual
3. **Identify retrieval miss patterns:** the Phase 1 trial reveals weaknesses → Phase 2-3 target the fix
4. **Risk-averse setup:** each phase adds 2-3h and is easy to roll back if it fails
5. **§6.5 narrative preserve:** do NOT over-engineer, build incrementally

### When to skip Phase 2-3

- Phase 1 recall already > 85% → Phase 2-3 bring marginal benefit (Vietnamese-specific corpus)
- Monthly cost budget < $5 → staying on Phase 1 is OK
- Solo dev, corpus not heavy on exact Vietnamese terms → BM25 is less impactful

### When you MUST upgrade to Phase 2-3

- Recall < 70% on benchmark → indicates Phase 1 is insufficient
- Em main frequently reports "miss exact identifier" → Phase 2 BM25 is critical
- Multi-language queries are common → Phase 3 reranker stabilizes results
- Production quality target > 90% → Phase 3 required

---

## 📚 References + tools

### Anthropic official

- [Memory tool docs](https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool)
- [Prompt caching guide](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- [Files API](https://platform.claude.com/docs/en/build-with-claude/files)
- [Contextual Retrieval cookbook](https://platform.claude.com/cookbook/capabilities-contextual-embeddings-guide)
- [Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [Agent SDK overview](https://code.claude.com/docs/en/agent-sdk/overview)

### Tools docs

- [Qdrant docs](https://qdrant.tech/documentation/)
- [Voyage AI pricing](https://docs.voyageai.com/docs/pricing)
- [FastMCP](https://github.com/jlowin/fastmcp)
- [MCP servers list](https://github.com/modelcontextprotocol/servers)

### Project memory

- `feedback_md_compact_narrative.md` (§6.5 rule: KEEP narrative)
- `feedback_multi_agent_setup.md` (4-agent discipline)
- `feedback_drastic_refactor_scope.md` (RAG setup = dedicated session)
- `feedback_uat_skip_verify.md` (Phase 9 UAT mode)

---

## ✅ Pre-implementation checklist

```
☐ Bro confirms 3 pieces of information:
  ☐ Paths of the 2 projects (so Investigator can audit the MD inventory pre-flight)
  ☐ Stack of the 2 projects (BE: .NET/Node/Python? FE: React/Vue?)
  ☐ Pilot project choice (the smaller of the 2)

☐ Bro prepares the environment:
  ☐ Disk cleanup, 5GB+ free (current 911/954 = 96% full)
  ☐ Voyage AI account signup + API key
  ☐ Python 3.10+ installed
  ☐ Git installed (for the pre-commit hook)

☐ Bro schedules a dedicated session:
  ☐ 10-14h block on 1 weekend day (memory feedback_drastic_refactor_scope rule)
  ☐ Reserve ~30% of the weekly cap for RAG setup spawn cost

☐ Bro reviews the plan:
  ☐ Read this file in full
  ☐ Confirm the blanket vs RAG store scope matches needs
  ☐ Confirm the tool stack is acceptable
  ☐ Approve the Week 1-4 trial timeline
```

---

## 📝 Notes (keep updated)

- **2026-05-12 turn 1:** Plan saved after S21 turn 1 finalized cicd-monitor. Cross-project reference for Bro's 2 future projects with > 1M MD tokens. SOLUTION_ERP baseline is ~354K MD (no RAG needed yet, deferred).
- **Status:** 📝 PLAN ONLY, not implemented yet
- **Next trigger:** Bro confirms the 3 pieces of information → spawn 🔵 Investigator to audit the MD inventory of the 2 projects → fine-tune the blanket list for each project