solution-erp/docs/rag-setup-plan.md
pqhuy1987 1f8e9af66f [CLAUDE] Docs: Save a polished RAG setup plan as a cross-project reference
Comprehensive plan for bro's 2 future projects with > 1M tokens of MD context (SOLUTION_ERP is the baseline reference; no implementation needed yet because 354K < threshold).

12 sections:
1. Context + Why (problem + solution + benefits table)
2. Architecture overview (6-layer diagram: blanket + Qdrant + Voyage + MCP + multi-AI + re-index)
3. BLANKET load list ~100K (28%) — 5 categories: core stable + current state top + agent infra + skills desc + memory critical
4. RAG store list ~254K (72%) — 8 categories: session logs (49%) + gotchas + archives + flows/database + skills detail + memory non-critical + guides + audit
5. Tool stack recommend — Qdrant + Voyage-3-large + FastMCP Python + custom chunker + pre-commit hook
6. Setup scripts copy-paste ready (~250 LOC Python total: indexer + MCP server + settings + hook + agent .md update)
7. Audit procedure 3-tier cadence — weekly quick (~30min) + monthly deep (~2-3h) + quarterly major (~4-6h) + trigger-based ad-hoc
8. Multi-AI client access — MCP protocol agnostic, stdio/HTTP/SSE transport, bearer auth + rate limit, setup per client (Claude Code/Desktop/Cursor/GPT-4)
9. Timeline rollout — 10-14h dedicated session + 4-week trial plan + decision gate PASS/TUNE/ROLLBACK criteria
10. Caveats + risks — beta features + storage 96% full warning + quality monitoring + fallback graceful
11. Success metrics — quality (recall >80%, precision >75%) + cost (<$5/mo) + performance (P50<200ms) + capacity (+50% session lifespan) + multi-AI
12. Future enhancements — Phase 2 (Memory tool + Files API) → Phase 3 (Contextual Retrieval + multi-project) → defer over-engineering

Status: PLAN ONLY, not yet implemented. Next trigger: bro confirms 3 items (paths of the 2 projects + stack + pilot choice) → spawn Investigator to audit the MD inventory pre-flight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:05:18 +07:00


# RAG Setup Plan — Cross-project reference
> **Purpose:** Plan for setting up Hybrid RAG (Option A) for projects with > 1M tokens of MD context. Applicable across projects: SOLUTION_ERP is the baseline reference, and bro's 2 future projects will apply this pattern.
> **Last updated:** 2026-05-12 (Session 21 turn 1+)
> **Status:** 📝 Plan saved, not yet implemented; target is a Week 1-4 trial on the 2 future projects
> **Owner:** pqhuy1987@gmail.com + Claude (main agent + 4 sub-agents)
---
## 📋 Table of Contents
1. [Context + Why](#1-context--why)
2. [Architecture overview](#2-architecture-overview)
3. [BLANKET load list (~100K tokens, 28%)](#3-blanket-load-list)
4. [RAG store list (~254K tokens, 72%)](#4-rag-store-list)
5. [Tool stack recommend](#5-tool-stack-recommend)
6. [Setup scripts (copy-paste ready)](#6-setup-scripts)
7. [Audit procedure (3-tier cadence)](#7-audit-procedure)
8. [Multi-AI client access](#8-multi-ai-client-access)
9. [Timeline rollout (~10-14h dedicated)](#9-timeline-rollout)
10. [Caveats + risks](#10-caveats--risks)
11. [Success metrics + decision gate](#11-success-metrics)
12. [Future enhancements](#12-future-enhancements)
---
## 1. Context + Why
### Problem statement
```
Current lazy blanket pattern (main agent + 4 agents):
- Main agent carries ~120K MD upfront (35% of project)
- Lazy Read when needed: the main agent GUESSES which files are relevant
- Each of the 4 agents costs ~188K cache WRITE per spawn
- Heavy session ~700K effective billed
- Lost-in-middle threshold reached after ~5.75h productive

Scale-up to 2 projects > 1M MD tokens each:
❌ Blanket NOT feasible (exceeds the 1M context cap)
❌ Lazy Read recall ~30-60% (main agent misses files it did not think of)
❌ 4 agents duplicate Read same files (cumulative ~240K wasted)
❌ Vietnamese-English synonym miss (grep keyword only)
❌ Cross-project context impossible without manual switching
```
### Solution
**Hybrid RAG Option A** — blanket critical + retrieve on-demand:
```
KEEP blanket: ~100K static (core stable + current state + agent + skills + memory critical)
ADD RAG layer: 70% MD remaining accessible via semantic retrieve
SHARE cache: 4 agents reuse retrieved chunks (multi-agent leverage)
```
### Benefits confirmed in earlier analysis sessions
| Metric | Lazy current | Option A | Δ |
|---|---|---|---|
| Quality recall | 30-60% | **85%** | **+25-55pp** |
| Heavy session token | 700K | **560K** | -20% |
| Session productive hours | 5.75h | **7.6h** | **+1.85h** |
| Tasks before lost-in-middle | ~23 | **~38** | **+65%** |
| Net successful tasks/session | 25 | **50** | **2×** |
| Multi-agent shared cache | ❌ | **✅ 60-90% cache hit** | leverage real |
| Vietnamese-English semantic search | ❌ grep only | **✅ Voyage multilingual** | unlock |
| Scale > 1M MD | ❌ break | **✅ work** | **enable** |
### Trade-off
- ⚠️ Setup cost: ~10-14h dedicated session (one-time investment)
- ⚠️ Maintenance: ~30 min/week audit
- ⚠️ Beta features (Memory tool, Files API): breaking changes possible
- ⚠️ Retrieval miss risk ~5-10% (mitigated by citations + fallback Read)
- ⚠️ Voyage API cost: ~$0.36 initial embed + ~$0.20/month delta
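
The Voyage cost figures follow from simple arithmetic on the $0.18 per million tokens price quoted in this plan. A minimal sketch, assuming the initial index covers 2 projects of roughly 1M MD tokens each; the function name `embed_cost_usd` is illustrative and not part of any library:

```python
def embed_cost_usd(tokens: int, price_per_million: float = 0.18) -> float:
    """Cost in USD of embedding `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# Initial embed: 2 projects x ~1M MD tokens each (assumption taken from this plan).
initial = embed_cost_usd(2 * 1_000_000)

# How many re-embedded tokens a ~$0.20/month delta budget covers at the same price.
monthly_delta_tokens = 0.20 / 0.18 * 1_000_000

print(f"initial embed: ${initial:.2f}")
print(f"monthly delta budget covers ~{monthly_delta_tokens / 1e6:.1f}M tokens")
```

If the real MD volume or the price changes, only the two inputs need updating; query-time embedding cost is not included here.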
---
## 2. Architecture overview
```
┌────────────────────────────────────────────────────────────────┐
│ LAYER 1: Static blanket (cache hot, 5min-1h TTL)               │
├────────────────────────────────────────────────────────────────┤
│ Main agent + 4 sub-agents auto-inject ~100K core context:      │
│ • rules.md, architecture.md, CLAUDE.md, PROJECT-MAP            │
│ • STATUS top 100 lines, HANDOFF top 150 lines                  │
│ • 5 agent .md (README + 4 agent identities)                    │
│ • 6 SKILL.md descriptions (auto-inject)                        │
│ • 6 critical cross-cutting memory entries                      │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2: Vector DB, retrieve on-demand                         │
├────────────────────────────────────────────────────────────────┤
│ Qdrant local (~50MB binary, ~200MB index per project):         │
│ • Session logs cumulative (49% MD, biggest)                    │
│ • Gotchas detail (chunk per entry)                             │
│ • Archives + Recently Done + Migration-todos                   │
│ • Flows + Database guides                                      │
│ • SKILL.md detail (descriptions already in blanket)            │
│ • Memory entries non-critical                                  │
│ • Guides ops conditional                                       │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ LAYER 3: Embedding service (Voyage AI cloud)                   │
├────────────────────────────────────────────────────────────────┤
│ voyage-3-large multilingual 26 lang (good Vietnamese-English): │
│ • Index time: embed chunks → vectors (one-time + delta)        │
│ • Query time: embed query → search Qdrant top-K                │
│ • Cost: $0.18/M tokens, ~$0.36 init + ~$0.20/month             │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ LAYER 4: MCP retriever server (FastMCP Python)                 │
├────────────────────────────────────────────────────────────────┤
│ Tool exposed: rag_retrieve(query, scope, k, time_range)        │
│ Transport: stdio (Claude Code) or HTTP/SSE (multi-AI)          │
│ Auth: API key per client (multi-AI mode)                       │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ LAYER 5: Multi-AI clients                                      │
├────────────────────────────────────────────────────────────────┤
│ Claude Code (main agent + 4 agents): primary                   │
│ Claude Desktop: secondary                                      │
│ GPT-4 / Cursor / Continue / Custom agent: optional             │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ LAYER 6: Re-index pipeline                                     │
├────────────────────────────────────────────────────────────────┤
│ Pre-commit hook: delta re-index changed MD                     │
│ Weekly full re-index: catch missed (Saturday off-peak)         │
│ Batch API 50% discount for mass re-index                       │
└────────────────────────────────────────────────────────────────┘
```
### Index-time flow (one-time init + delta)
```
1. Walk filesystem → docs/ + .claude/ + memory/
2. Adaptive chunking by doc_type (custom Python chunker)
3. Batch embed via Voyage API (128 chunks/batch)
4. Upsert into Qdrant with metadata (source, doc_type, project, last_modified)
5. Total init: ~10-15 min for 1M MD tokens
```
### Query-time flow (every spawn of the main agent or a sub-agent)
```
1. Main agent/sub-agent: rag_retrieve("query keyword", scope, k)
2. MCP server: embed query → Voyage API (~100ms)
3. MCP server: Qdrant search top-K (~50ms local)
4. MCP server: return chunks with metadata + score
5. Total: ~150-200ms per query (network-bound)
6. Cache: subsequent same query → ~10ms (cache hit)
```
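
Step 6 of the query-time flow mentions a cache hit for repeated queries, but the plan does not say how that cache works. One possible shape is a small in-memory TTL cache in front of the embed + search path. This is a sketch under assumptions: the class name `QueryCache`, the 5-minute default TTL, and the `(query, scope, k)` key are all illustrative choices, not something the plan specifies.

```python
import time
from typing import Callable


class QueryCache:
    """Tiny in-memory TTL cache keyed by (normalized query, scope, k)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store: dict[tuple, tuple[float, list]] = {}

    def get_or_compute(self, query: str, scope: str, k: int,
                       compute: Callable[[], list]) -> list:
        key = (query.strip().lower(), scope, k)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]          # cache hit: skip embed + vector search
        result = compute()         # cache miss: run the real retrieval
        self._store[key] = (now, result)
        return result

    def clear(self) -> None:
        """Call after a re-index so stale chunks are not served."""
        self._store.clear()


# Demo with a stand-in retriever that counts how often it is really called.
calls = []
def fake_retrieve() -> list:
    calls.append(1)
    return [{"source": "docs/gotchas.md", "score": 0.9}]

cache = QueryCache(ttl_seconds=300)
cache.get_or_compute("silent 403", "gotcha", 3, fake_retrieve)
cache.get_or_compute("Silent 403 ", "gotcha", 3, fake_retrieve)  # same key after normalization
print(f"retriever called {len(calls)} time(s) for 2 requests")
```

In stdio mode each client spawns its own server process, so a cache like this is only shared between agents when they talk to one long-running HTTP/SSE server.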
---
## 3. BLANKET load list
> **Total: ~100K tokens (28% project MD)**
> Auto-loaded on every spawn of the main agent + 4 agents.
### A. Core stable docs (~30K, does NOT change often)
| File | Token | Why blanket |
|---|---:|---|
| `docs/rules.md` | ~7K | Coding conventions stable, mọi task reference |
| `CLAUDE.md` (root pointer) | ~3K | Auto-inject system reminder |
| `docs/CLAUDE.md` | ~3K | Tech stack overview baseline |
| `docs/architecture.md` | ~7K | 4-layer Clean Arch baseline |
| `docs/PROJECT-MAP.md` | ~3K | Navigation map |
| `docs/workflow-contract.md` | ~4K | State machine 9 phase Contract domain core |
| `docs/forms-spec.md` | ~3K | 8 form catalog domain knowledge |
### B. Current state (~25K, known directly by the main agent, no retrieval needed)
| File | Strategy | Token |
|---|---|---:|
| `docs/STATUS.md` **top 100 lines** | Current phase + In Progress + 1-2 Recently Done top | ~15K |
| `docs/HANDOFF.md` **top 150 lines** | Last updated + TL;DR latest session + next priority | ~10K |
**Drop from blanket:** STATUS Recently Done rows beyond the newest 5 (retrieve if needed), HANDOFF TL;DR entries older than 1 week.
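
The "top N lines in the blanket, the rest in RAG" split for STATUS.md and HANDOFF.md can be done with a plain line cut. A minimal sketch; `split_head_tail` is a hypothetical helper name, and cutting at a fixed line count is the simplest reading of the strategy above (cutting at the nearest heading would be a refinement):

```python
def split_head_tail(text: str, head_lines: int) -> tuple[str, str]:
    """Return (blanket_head, rag_tail): the first `head_lines` lines and the rest."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[:head_lines]), "".join(lines[head_lines:])


# Example with a stand-in document; for the real files the cut would be
# 100 lines for docs/STATUS.md and 150 lines for docs/HANDOFF.md.
sample = "".join(f"line {i}\n" for i in range(1, 6))
head, tail = split_head_tail(sample, 2)
print(repr(head))
print(repr(tail))
```

The head goes into the auto-injected context; the tail is what the indexer would chunk per H2 section.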
### C. Agent infrastructure (~25K — agent identity stable)
| File | Token |
|---|---:|
| `.claude/agents/README.md` | ~5K |
| `.claude/agents/investigator.md` | ~3.5K |
| `.claude/agents/implementer.md` | ~4K |
| `.claude/agents/reviewer.md` | ~3.5K |
| `.claude/agents/cicd-monitor.md` | ~5K |
| `.claude/agent-memory/{4 agents}/MEMORY.md` auto-inject 25KB first 200 lines | ~4K total |
### D. Skills descriptions (~5K, auto-inject, not the full SKILL.md)
| File | Strategy | Token |
|---|---|---:|
| `.claude/skills/README.md` | Full | ~2.5K |
| 6 SKILL.md descriptions | Auto-inject by Claude Code | ~1K total |
| 6 SKILL.md detail | **NOT in blanket** → RAG retrieve when triggered | n/a |
### E. Memory user-level critical (~15K)
| File | Token | Why critical |
|---|---:|---|
| `project_solution_erp.md` | ~3.5K | Project overview narrative |
| `feedback_md_compact_narrative.md` (§6.5) | ~2K | Core rule for all doc work |
| `feedback_uat_skip_verify.md` | ~2K | Phase 9 current mode rule |
| `feedback_multi_agent_setup.md` | ~3K | 4-agent discipline |
| `feedback_per_chunk_commit.md` | ~2K | Implementer pattern reusable |
| `feedback_audit_reuse_before_clone.md` | ~2K | Investigator natural pattern |
**Drop from blanket:** the remaining 11 memory entries (retrieve when the pattern is triggered).
### TOTAL BLANKET ≈ 100K tokens
---
## 4. RAG store list
> **Total: ~254K tokens (72% project MD)**
> Indexed into Qdrant, retrieved on-demand.
### F. Session logs (~150K — biggest, 49% MD)
```
Path: docs/changelog/sessions/*.md (41+ files growing)
Chunk strategy: 1 file = 1 chunk (preserve narrative §6.5)
Metadata:
- session_date: extracted from filename
- phase: extracted from content
- topic: extracted from H1
- commit_sha_range: extracted from "Commits:" line
- doc_type: "session_log"
Scope filter: time_range="last_week|last_month|last_quarter|all"
```
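
The session-log metadata above is described as "extracted from filename" and "extracted from content" without showing how. A rough sketch of that extraction, assuming filenames contain a `YYYY-MM-DD` date and each log has an H1 title and a `Commits:` line; the sample filename and the helper name `session_metadata` are hypothetical, and `phase` is left out because the plan does not say where it appears in the text:

```python
import re
from pathlib import PurePosixPath

DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})")
H1_RE = re.compile(r"^# (.+)$", re.M)
COMMITS_RE = re.compile(r"^Commits:\s*(.+)$", re.M)


def session_metadata(path: str, content: str) -> dict:
    """Build the Qdrant payload metadata for one session log file."""
    date = DATE_RE.search(PurePosixPath(path).name)   # date from the filename
    topic = H1_RE.search(content)                     # first H1 heading
    commits = COMMITS_RE.search(content)              # "Commits: a..b" line
    return {
        "doc_type": "session_log",
        "session_date": date.group(1) if date else None,
        "topic": topic.group(1).strip() if topic else None,
        "commit_sha_range": commits.group(1).strip() if commits else None,
    }


# Hypothetical file name and content, for illustration only.
meta = session_metadata(
    "docs/changelog/sessions/2026-05-12-session-21.md",
    "# Session 21: RAG plan\n\nCommits: abc1234..def5678\n\nNarrative...\n",
)
print(meta)
```

Missing fields come back as `None` rather than raising, so one malformed log does not stop a full index run.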
### G. Gotchas (~9K — lookup per debug)
```
Path: docs/gotchas.md (44+ entries)
Chunk strategy: split per "### N. ..." numbered heading
Metadata:
- gotcha_id: integer
- category: extracted from content (tech/EF/Workflow/CICD/Security/...)
- doc_type: "gotcha"
Scope filter: scope="gotcha"
```
### H. Archives + Recently Done (~75K)
| File | Strategy | Token |
|---|---|---:|
| `docs/STATUS.md` rest beyond top 100 | Per H2 section | ~8K |
| `docs/HANDOFF.md` rest beyond top 150 | Per H2 section | ~21K |
| `docs/changelog/migration-todos.md` | Per H3 task | ~18K |
| `docs/changelog/recently-done-archive-*.md` | Per H3 phase | ~6K |
| `docs/_archive/forms-spec-raw.md` | Full file (cold archive) | ~23K |
| `docs/_archive/workflow-raw.md` | Full file (cold archive) | ~4K |
### I. Flows + Database (~17K — conditional task)
| File | Token | When to retrieve |
|---:|---:|---|
| `docs/flows/README.md` | ~1K | Index, when a flow is needed |
| `docs/flows/auth-flow.md` | ~1K | Task auth |
| `docs/flows/permission-flow.md` | ~1.5K | Task permission |
| `docs/flows/contract-creation-flow.md` | ~1.5K | Task Contract |
| `docs/flows/contract-approval-flow.md` | ~1.5K | Task approval |
| `docs/flows/form-render-flow.md` | ~1K | Task form |
| `docs/flows/sla-expiry-flow.md` | ~1K | Task SLA |
| `docs/database/database-guide.md` | ~3K | Task schema |
| `docs/database/schema-diagram.md` | ~12K | Task ERD |
### J. SKILL.md detail (~40K, retrieve when the skill is triggered)
| File | Token |
|---|---:|
| `.claude/skills/contract-workflow/SKILL.md` | ~7K |
| `.claude/skills/form-engine/SKILL.md` | ~5K |
| `.claude/skills/permission-matrix/SKILL.md` | ~5K |
| `.claude/skills/dependency-audit-erp/SKILL.md` | ~5K |
| `.claude/skills/ef-core-migration/SKILL.md` | ~5.5K |
| `.claude/skills/iis-deploy-runbook/SKILL.md` | ~6K |
### K. Guides ops conditional (~10K)
| File | Token | When to retrieve |
|---|---:|---|
| `docs/guides/deployment-iis.md` | ~2.5K | Task deploy |
| `docs/guides/cicd.md` | ~2K | Task CI/CD |
| `docs/guides/security-checklist.md` | ~2K | Audit security |
| `docs/guides/vps-setup.md` | ~2.5K | Setup VPS |
| `docs/guides/runbook.md` | ~1K | Ops debug |
### L. Memory entries non-critical (~50K — pattern lookup)
```
The remaining 11 memory entries (user-level):
- feedback_n_stage_workflow_pattern.md (DEPRECATED post-Mig 21)
- feedback_designtime_runtime_db.md
- feedback_drastic_refactor_scope.md
- feedback_cron_monthly_limitation.md
- feedback_user_manual_style.md
- feedback_node_cicd.md
- feedback_unittest_timing.md
- feedback_responsive_laptop_breakpoint.md
- feedback_service_hook_vs_endpoint.md
- reference_session_prompts.md
- MEMORY.md index
```
### M. Audit logs (~2K, grow)
```
docs/changelog/skill-audit-{YYYY-MM}.md (monthly audit log)
```
### TOTAL RAG STORE ≈ 254K tokens
---
## 5. Tool stack recommend
| Component | Tool | Reason | Cost |
|---|---|---|---|
| **Vector DB** | **Qdrant local** | Rust binary 50MB, no Docker, fast, metadata filtering, admin UI | $0 |
| **Embedding** | **Voyage-3-large** | Anthropic partner, multilingual 26 lang, no GPU needed | $0.18/M (~$0.36 init) |
| **MCP server framework** | **FastMCP Python** | Standalone `fastmcp` 2.x package (FastMCP 1.0 was merged into the official MCP Python SDK), ~100 LOC, auto schema | $0 |
| **Chunking** | **Custom Python adaptive** | ~50 LOC, transparent, §6.5 compliant | $0 |
| **Re-index pipeline** | **Pre-commit hook** | Native git, ~10 LOC bash | $0 |
| **Monitoring** | **Qdrant Dashboard + custom audit** | Built-in UI port 6333 | $0 |
| **Auth (multi-AI)** | **Bearer token + rate limit** | Custom middleware ~30 LOC | $0 |
| **Batch re-index** | **Voyage Batch API** | 50% discount for mass re-embed | -50% |
### Rejected stack + reasons
| Alternative | Reason rejected |
|---|---|
| Chroma vector DB | Python ecosystem, slower than Qdrant Rust |
| pgvector | Requires a PostgreSQL setup, overhead |
| OpenAI text-embedding-3-small | Vietnamese quality worse than Voyage |
| BGE-M3 local | Needs a GPU >= 4GB (Intel Iris Xe is not enough) |
| LangChain / LlamaIndex | Heavy abstraction, hard black-box debugging, chunker does not follow §6.5 |
| TypeScript MCP SDK | More verbose than Python FastMCP |
| Pinecone cloud | Paid + vendor lock-in, that scale is not needed |
---
## 6. Setup scripts
### 6.1 `requirements.txt`
```text
fastmcp>=2.0
voyageai>=0.3
qdrant-client>=1.12
python-frontmatter>=1.1
```
### 6.2 `scripts/rag-indexer.py` (~120 LOC)
```python
"""
RAG Indexer: embed MD files and upsert them into Qdrant.
Usage:
    python rag-indexer.py                        # full index
    python rag-indexer.py --files "a.md b.md"    # delta re-index
"""
import glob
import os
import re
import sys
import uuid

from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Embedded local mode: only one process can open this folder at a time, so the
# indexer cannot run while the MCP server holds it. Use Qdrant server mode if
# both must run concurrently.
QDRANT_PATH = "./rag-data/qdrant"
COLLECTION = "project_md"  # rename per project
EMBED_MODEL = "voyage-3-large"
DIM = 1024

voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(path=QDRANT_PATH)


def chunk_file(path: str) -> list[dict]:
    """Adaptive chunking by doc type."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    path = path.replace("\\", "/")  # so the path rules below also match on Windows
    base = {"source": path, "size_chars": len(content)}
    if "/changelog/sessions/" in path:
        return [{**base, "content": content, "doc_type": "session_log"}]
    if path.endswith("gotchas.md"):
        # A capture group makes re.split return [preamble, id1, body1, id2, body2, ...]
        entries = re.split(r"^### (\d+)\.", content, flags=re.M)
        return [
            {**base, "content": f"### {entries[i]}.{entries[i+1]}",
             "doc_type": "gotcha", "entry_id": int(entries[i])}
            for i in range(1, len(entries), 2)
        ]
    if "/skills/" in path:
        return [{**base, "content": content, "doc_type": "skill"}]
    if "/agents/" in path:
        return [{**base, "content": content, "doc_type": "agent"}]
    if path.endswith("MEMORY.md") or "/memory/" in path:
        return [{**base, "content": content, "doc_type": "memory"}]
    # Default: split per H2 heading
    sections = re.split(r"^## ", content, flags=re.M)
    return [
        {**base, "content": ("## " + s) if i > 0 else s,
         "doc_type": "doc", "section_idx": i}
        for i, s in enumerate(sections) if len(s.strip()) > 200
    ]


def point_id(chunk: dict) -> str:
    """Stable per-chunk ID. Built-in hash() is salted per process, so it cannot
    be used for upserts; gotcha entries also need their entry_id in the key."""
    sub = chunk.get("entry_id", chunk.get("section_idx", 0))
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f'{chunk["source"]}#{sub}'))


def main(files: list[str] | None = None):
    # Init collection (idempotent)
    if not qdrant.collection_exists(COLLECTION):
        qdrant.create_collection(
            COLLECTION,
            vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
        )
    # Determine paths
    if files:
        paths = files
    else:
        paths = (
            glob.glob("docs/**/*.md", recursive=True) +
            glob.glob(".claude/**/*.md", recursive=True)
        )
        paths = [p for p in paths
                 if "node_modules" not in p and "_user-guide" not in p]
    # Chunk
    chunks = []
    for path in paths:
        try:
            chunks.extend(chunk_file(path))
        except Exception as e:
            print(f"Skip {path}: {e}")
    print(f"Chunking: {len(chunks)} chunks from {len(paths)} files")
    if not chunks:
        return
    # Batch embed (Voyage max 128 texts/batch)
    texts = [c["content"] for c in chunks]
    embeddings = []
    for i in range(0, len(texts), 128):
        batch = texts[i:i+128]
        result = voyage.embed(batch, model=EMBED_MODEL, input_type="document")
        embeddings.extend(result.embeddings)
        print(f"Embedded {i+len(batch)}/{len(texts)}")
    # Upsert (Qdrant replaces points that share an id)
    points = [
        PointStruct(id=point_id(c), vector=emb, payload=c)
        for c, emb in zip(chunks, embeddings)
    ]
    qdrant.upsert(collection_name=COLLECTION, points=points)
    print(f"Indexed {len(points)} chunks → Qdrant")


if __name__ == "__main__":
    files = sys.argv[2].split() if len(sys.argv) > 2 and sys.argv[1] == "--files" else None
    main(files)
```
### 6.3 `scripts/rag-mcp-server.py` (~80 LOC)
```python
"""
MCP retriever server: exposes the rag_retrieve tool to Claude Code + agents.
Run: python rag-mcp-server.py              (stdio default)
     python rag-mcp-server.py --http :7777 (HTTP/SSE for multi-AI)
"""
import os
import sys

from fastmcp import FastMCP
from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

mcp = FastMCP("project-rag")
voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(path="./rag-data/qdrant")
COLLECTION = "project_md"


@mcp.tool()
def rag_retrieve(
    query: str,
    scope: str = "all",
    k: int = 5
) -> list[dict]:
    """
    Semantic search over MD context.
    Args:
        query: Search query (Vietnamese or English, mixing is OK)
        scope: Filter by doc_type:
            "all" | "session_log" | "gotcha" | "memory" |
            "skill" | "agent" | "doc"
        k: Top chunks to return (1-15, default 5)
    Returns:
        List[dict] with keys: content, source, doc_type, score
    Use cases:
        - Historical session log: rag_retrieve("Mig 26 V2", scope="session_log")
        - Gotcha lookup: rag_retrieve("silent 403", scope="gotcha")
        - Pattern reuse: rag_retrieve("audit clone", scope="memory")
        - Cross-section: rag_retrieve("query", scope="all", k=10)
    """
    k = min(max(k, 1), 15)
    # Embed query
    query_vec = voyage.embed(
        [query], model="voyage-3-large", input_type="query"
    ).embeddings[0]
    # Filter
    query_filter = None
    if scope != "all":
        query_filter = Filter(
            must=[FieldCondition(key="doc_type", match=MatchValue(value=scope))]
        )
    # Search (query_points replaces the deprecated search() in recent qdrant-client)
    results = qdrant.query_points(
        collection_name=COLLECTION,
        query=query_vec,
        query_filter=query_filter,
        limit=k,
    ).points
    return [
        {
            "content": r.payload["content"][:3000],  # truncate huge chunks
            "source": r.payload["source"],
            "doc_type": r.payload["doc_type"],
            "score": round(r.score, 3)
        }
        for r in results
    ]


@mcp.tool()
def rag_stats() -> dict:
    """Return collection stats (for audit)."""
    info = qdrant.get_collection(COLLECTION)
    return {
        "total_chunks": info.points_count,
        "vector_dim": info.config.params.vectors.size,
        "distance": info.config.params.vectors.distance.value,
        # optimizer_status is a health flag, not an index timestamp
        "optimizer_status": str(info.optimizer_status),
    }


if __name__ == "__main__":
    # Default: stdio mode for Claude Code
    # HTTP/SSE mode: python rag-mcp-server.py --http :7777
    if "--http" in sys.argv:
        port = int(sys.argv[sys.argv.index("--http") + 1].lstrip(":"))
        mcp.run(transport="sse", port=port)
    else:
        mcp.run()  # stdio default
```
### 6.4 `.claude/settings.json` register
```jsonc
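// To verify against current Claude Code docs: project-scoped MCP servers are
// normally declared in `.mcp.json` with `${VAR}` expansion; `${workspaceFolder}`
// and `${env:...}` are VS Code-style placeholders.
{
  "mcpServers": {
    "project-rag": {
      "command": "python",
      "args": ["scripts/rag-mcp-server.py"],
      "cwd": "${workspaceFolder}",
      "env": {
        "VOYAGE_API_KEY": "${env:VOYAGE_API_KEY}"
      }
    }
  }
}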
```
### 6.5 Pre-commit hook
```bash
#!/bin/sh
# .git/hooks/pre-commit
# Re-index changed MD files
changed_md=$(git diff --cached --name-only --diff-filter=AMR | grep -E "\.md$")
if [ -n "$changed_md" ]; then
    echo "RAG re-indexing $(echo "$changed_md" | wc -l) MD files..."
    python scripts/rag-indexer.py --files "$changed_md"
fi
```
### 6.6 Agent .md frontmatter update
```yaml
# Add the tool to each .claude/agents/{agent}.md:
tools: [Read, Grep, Glob, Bash, mcp__project-rag__rag_retrieve, ...]
```
Add to the system prompt section:
```markdown
## RAG retriever usage (rag_retrieve tool)
**WHEN to use:**
- Historical session log lookup (> 1 week old)
- Gotcha pattern matching debug
- Memory pattern reuse "clone X sang Y"
- Cross-section semantic search
**WHEN to use Read instead:**
- Current state (STATUS + HANDOFF top) — blanket loaded
- Active file editing (needs the full file)
- Architecture review (stable docs, blanket)
**Query examples:**
- rag_retrieve("silent 403 non-admin", scope="gotcha", k=3)
- rag_retrieve("PE V2 wire pattern", scope="session_log", k=5)
- rag_retrieve("audit reuse clone", scope="memory", k=3)
```
---
## 7. Audit procedure
### 7.1 Weekly quick audit (~30 min, every Saturday)
**Goal:** Weekly check of health + cost trend.
**Checklist:**
```bash
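# 1. Index health
# This REST endpoint exists only when Qdrant runs in server mode; with the
# embedded path= client used by the scripts, call the rag_stats tool instead.
curl http://localhost:6333/collections/project_md
# Verify: points_count increasing + status="green"
# 2. Re-index lag
git log --since="1 week ago" --name-only --pretty=format: | grep -E "\.md$" | sort -u | wc -l
python -c "
from qdrant_client import QdrantClient
q = QdrantClient(path='./rag-data/qdrant')
# Count indexed sources, then compare against the changed files listed above
points, _ = q.scroll('project_md', limit=10000, with_payload=['source'])
print(len({p.payload['source'] for p in points}), 'indexed sources')
"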
# 3. Voyage cost
# Visit voyageai.com dashboard → check last 7 days usage
# Target: <$1/week steady state
# 4. Random query quality (manual 5 query)
# Sample queries:
# - "Recent Mig" → expect session log top
# - "silent 403" → expect gotcha #44 top
# - "audit reuse" → expect memory entry top
# Score: 1-5 per query (relevant chunks in top-5)
# 5. Storage size
du -sh ./rag-data/
# Target: <500MB per project
```
**Log:** `docs/changelog/rag-audit-weekly-{YYYY-WW}.md` (1 page)
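
The re-index lag check above boils down to comparing three sets of paths: files in the repo, sources present in the index, and files changed this week. A small sketch of that comparison; `index_drift` and its return keys are hypothetical names, and the inputs are assumed to be collected separately (for example from `git log` and from the indexed `source` payloads):

```python
def index_drift(indexed_sources: set[str],
                repo_files: set[str],
                changed_files: set[str]) -> dict[str, list[str]]:
    """Classify paths that need attention during the weekly audit."""
    return {
        # In the repo but never indexed: the hook or the full index missed them.
        "missing": sorted(repo_files - indexed_sources),
        # Indexed but deleted from the repo: orphan chunks to clean up.
        "orphaned": sorted(indexed_sources - repo_files),
        # Changed this week and indexed: spot-check that the hook re-indexed them.
        "recheck": sorted(changed_files & indexed_sources),
    }


# Illustrative paths only.
drift = index_drift(
    indexed_sources={"docs/rules.md", "docs/old-guide.md", "docs/gotchas.md"},
    repo_files={"docs/rules.md", "docs/gotchas.md", "docs/new-flow.md"},
    changed_files={"docs/gotchas.md", "docs/new-flow.md"},
)
print(drift)
```

The same sets cover the monthly "files deleted from the repo but still in Qdrant" cleanup, since those are exactly the `orphaned` entries.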
### 7.2 Monthly deep audit (~2-3h, at the start of each month)
**Goal:** Quality benchmark + chunking review + stale cleanup.
**Checklist:**
```python
# 1. Quality benchmark: 30-query test set
test_queries = [
    # Categories: state, historical, debug, pattern, cross-stack
    # (queries stay in mixed Vietnamese-English, as they are asked in real use)
    ("Phase hiện tại", "doc"),                       # "current phase"
    ("Mig 26 PE Level Opinions UPSERT", "session_log"),
    ("silent 403 non-admin Forbidden", "gotcha"),
    ("audit reuse trước clone B từ A", "memory"),    # "audit reuse before cloning B from A"
    # ... 30 total covering all scopes
]
results = []
for query, expected_scope in test_queries:
    retrieved = rag_retrieve(query, k=10)
    # Manual score:
    #   - Recall: % of expected sources in top-10
    #   - Precision: % of retrieved chunks actually relevant
    results.append({"query": query, "recall": ..., "precision": ...})
# Target: avg recall > 80%, precision > 75%

# 2. Chunking review: sample 10 random chunks
#    Check: are chunks cut in the middle of a narrative (violates §6.5)?
#    Action: tune the chunker if issues are found
# 3. Stale audit
#    Files not re-indexed for > 14 days → flag
#    Files deleted from the repo but still in Qdrant → cleanup
# 4. Cost trend
#    Monthly Voyage spend vs target
#    Target: <$3/month steady
# 5. Capacity check
#    Total chunks vs disk space projection
#    Project growing significantly (>20% MoM) → plan scale
```
**Log:** `docs/changelog/rag-audit-monthly-{YYYY-MM}.md` (2-3 pages)
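
The recall and precision scores in the benchmark are left as manual placeholders. Once each test query has a hand-labelled set of expected sources, they can be computed mechanically. A minimal sketch, assuming sources are compared by file path; the function names and the sample labels are illustrative:

```python
def recall_at_k(expected: set[str], retrieved: list[str]) -> float:
    """Share of expected sources that appear anywhere in the retrieved list."""
    if not expected:
        return 1.0
    return len(expected & set(retrieved)) / len(expected)


def precision_at_k(relevant: set[str], retrieved: list[str]) -> float:
    """Share of retrieved chunks whose source was judged relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for s in retrieved if s in relevant) / len(retrieved)


# One labelled query (illustrative paths): 2 expected sources, 4 retrieved chunks.
expected = {"docs/gotchas.md", "docs/flows/permission-flow.md"}
retrieved = ["docs/gotchas.md", "docs/rules.md", "docs/gotchas.md", "docs/STATUS.md"]
r = recall_at_k(expected, retrieved)
p = precision_at_k(expected, retrieved)
print(f"recall={r:.2f} precision={p:.2f}")

# Averages over the whole test set are then compared with the targets above.
scores = [{"recall": r, "precision": p}]
avg_recall = sum(s["recall"] for s in scores) / len(scores)
avg_precision = sum(s["precision"] for s in scores) / len(scores)
passed = avg_recall > 0.80 and avg_precision > 0.75
```

Precision here counts chunks, so two chunks from the same relevant file both count as hits; recall counts distinct sources.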
### 7.3 Quarterly major audit (~4-6h, every quarter)
**Goal:** Strategic review + major upgrades.
**Checklist:**
1. **Embedding model upgrade decision**
   - Has Voyage released a new model? Test it side-by-side with voyage-3-large
   - Quality benchmark on the 30-query test set
   - Decision: upgrade if recall improves by +5pp
2. **Chunking strategy iteration**
   - Review 50 random chunks
   - Identify patterns: wrong cut points, missing overlap, missing metadata
   - Tune chunker code → re-index full
3. **Collection re-build from scratch**
   - Backup current → drop collection → re-index all
   - Purpose: clean orphan chunks + apply new chunking
   - Effort: ~30 min for 1M MD
4. **Multi-AI client access audit**
   - Active clients (Claude Code / Desktop / GPT / Cursor)
   - Per-client query volume + token spend
   - Security: rotate auth tokens, review rate limits
5. **Cross-project namespace audit** (if multi-project)
   - Project isolation working correctly?
   - Cross-project query intentional vs accidental?
   - Adjust metadata filter rules
**Log:** `docs/changelog/rag-audit-quarterly-{YYYY-Q}.md` (5-10 pages)
### 7.4 Trigger-based audit (ad-hoc)
| Trigger | Action |
|---|---|
| Critical retrieval miss (reported by the main agent) | Audit why the relevant chunk was missed + tune |
| Cost spike >50% MoM | Audit query patterns + rate limit clients |
| Re-index hang >1h | Audit indexer logs + Qdrant health |
| Quality regression observed by the main agent | Spot-check + run the monthly audit early |
| New project added | Setup namespace + initial index audit |
---
## 8. Multi-AI client access
### 8.1 MCP protocol — agnostic
MCP (Model Context Protocol) is a **standard protocol**. Any AI client that supports MCP can consume the same server:
```
             Qdrant (single source)
          MCP server :7777 (HTTP/SSE)
      ↙          ↓          ↓          ↘
Claude Code    Claude     Cursor     GPT-4 +
               Desktop    IDE        custom adapter
```
### 8.2 Transport modes
| Mode | Use case | Setup |
|---|---|---|
| **stdio** | Single client (Claude Code local) — default | `python rag-mcp-server.py` |
| **HTTP/SSE** | Multi-client (network access) | `python rag-mcp-server.py --http :7777` |
| **WebSocket** | Bi-directional (rare) | Custom config |
### 8.3 Setup multi-AI mode
**Step 1: Run MCP server HTTP mode**
```bash
# Terminal 1: MCP server (keep running)
export VOYAGE_API_KEY="pa-xxxx"
python scripts/rag-mcp-server.py --http :7777
# Server endpoint: http://localhost:7777/sse
```
**Step 2: Add auth middleware (recommended for multi-client)**
```python
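# Update rag-mcp-server.py
# NOTE: `bearer_auth` is the custom middleware (~30 LOC) this plan proposes to
# write; it is not shipped by the fastmcp package. `auth_middleware` is a
# hypothetical local module. Check the FastMCP docs for its built-in auth
# options and for how middleware is registered in the installed version.
from fastmcp import FastMCP
from auth_middleware import bearer_auth  # hypothetical local module

ALLOWED_TOKENS = {
    "claude-code-token": "claude-code-primary",
    "claude-desktop-token": "claude-desktop-secondary",
    "gpt4-token": "gpt4-cursor-integration",
    "custom-agent-token": "custom-research-agent",
}
mcp = FastMCP("project-rag")
mcp.add_middleware(
    bearer_auth(tokens=ALLOWED_TOKENS, rate_limit_per_minute=30)
)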
```
**Step 3: Register per-client config**
#### Claude Code (main agent + 4 agents)
```jsonc
// .claude/settings.json
{
  "mcpServers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse",
      "headers": {
        "Authorization": "Bearer claude-code-token"
      }
    }
  }
}
```
#### Claude Desktop
```jsonc
// claude_desktop_config.json
{
  "mcpServers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse",
      "headers": {
        "Authorization": "Bearer claude-desktop-token"
      }
    }
  }
}
```
#### Cursor IDE
```jsonc
// .cursor/mcp.json
{
  "mcpServers": {
    "project-rag": {
      "url": "http://localhost:7777/sse"
    }
  }
}
```
#### GPT-4 via custom adapter
```python
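# Use OpenAI Assistants API + custom function calling
# NOTE: the /tool/rag_retrieve REST route is hypothetical. An MCP SSE server
# does not expose plain REST routes by default, so this assumes a thin HTTP
# shim in front of rag_retrieve.
import requests


def query_project_rag(query: str, scope: str = "all", k: int = 5):
    response = requests.post(
        "http://localhost:7777/tool/rag_retrieve",
        headers={"Authorization": "Bearer gpt4-token"},
        json={"query": query, "scope": scope, "k": k},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Register as OpenAI function tool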
```
#### Continue.dev / custom agent
```yaml
# config.yaml
mcp_servers:
  - name: project-rag
    transport: sse
    url: http://localhost:7777/sse
    auth_token: custom-agent-token
```
### 8.4 Security model multi-AI
| Concern | Mitigation |
|---|---|
| Token leak | Rotate quarterly, store in env vars |
| Rate limit abuse | 30 req/min/token default, tune per client |
| Read-only enforcement | MCP server expose only `rag_retrieve` + `rag_stats` (no write tools) |
| Audit log | Log every query: timestamp + client_token + query + result_count |
| Cross-project leak | Per-collection access control (future enhancement) |
### 8.5 Cost considerations multi-AI
```
Single Claude Code client (current):
  Voyage cost: ~$0.20/month (low query volume)
  Qdrant: free local
4 AI clients heavy use (Claude Code + Desktop + Cursor + GPT-4):
  Voyage cost: ~$2-5/month (higher query volume)
  Network bandwidth: minimal (~100KB/query response)
  CPU: Qdrant + Voyage embed call ~100ms total
→ Multi-AI access cost scales linearly with query volume, not with infrastructure.
```
### 8.6 Recommend rollout
```
Phase 1 (Week 1-4): Single client (Claude Code only)
  → Validate quality + cost baseline
Phase 2 (Month 2+): Add Claude Desktop if mobile/casual access is needed
  → Same auth, share collection
Phase 3 (Month 3+): Add Cursor IDE if working across multiple IDEs
  → Verify no cross-tool conflicts
Phase 4 (Future): GPT-4 / custom agent integration if needed
  → Custom adapter + strict auth
```
---
## 9. Timeline rollout
### Hour-by-hour breakdown (~10-14h dedicated session)
| Hour | Task | Effort |
|---|---|---|
| **1-2** | Setup pre-flight: disk cleanup + Voyage signup + Python deps install | ~2h |
| **3-4** | Write `scripts/rag-indexer.py` + run initial embed | ~2h |
| **5** | Verify Qdrant collection + manual query sanity check | ~1h |
| **6-7** | Write `scripts/rag-mcp-server.py` + register `.claude/settings.json` | ~2h |
| **8** | Test rag_retrieve via Claude Code (main agent solo) | ~1h |
| **9-10** | Update 4 agent .md frontmatter + system prompt sections | ~2h |
| **11** | Setup pre-commit hook + audit logging | ~1h |
| **12-14** | Buffer + trial 10-15 query measure quality + cost | ~3h |
### Trial 4-week plan
```
Week 1: Pilot single project (smaller of 2)
  - Day 1-2: Setup + initial index
  - Day 3-7: Active use + measure baseline metrics
  - Deliverable: rag-audit-weekly-W1.md
Week 2: Roll out 2nd project
  - Day 1: Setup separate Qdrant collection
  - Day 2-7: Dual-project use measure
  - Deliverable: rag-audit-weekly-W2.md
Week 3: 4-agent integration
  - Day 1-2: Update 4 agent .md with the rag_retrieve tool
  - Day 3-7: Multi-agent task measure shared cache benefit
  - Deliverable: rag-audit-weekly-W3.md
Week 4: Decision gate (keep / tune / upgrade B / rollback)
  - Day 1-2: Compile metrics
  - Day 3: Decision meeting (bro + main agent)
  - Day 4-7: Apply decision (tune embedding/chunking OR upgrade Option B OR rollback)
  - Deliverable: rag-audit-monthly-M1.md + decision doc
```
### Decision gate Week 4
```
PASS criteria (continue + tune):
  ✅ Quality recall > 80% on 30 query benchmark
  ✅ Cost < $5/month total (Voyage + storage)
  ✅ Session lifespan increases > 30% (heavy session)
  ✅ Multi-agent shared cache hit > 60%
  ✅ Retrieval miss critical < 10% queries
  ✅ Storage < 1GB per project
TUNE criteria (continue + adjust):
  ⚠️ Quality 70-80% → tune chunking or upgrade embedding
  ⚠️ Cost $5-10/mo → audit query patterns, reduce k
  ⚠️ Session lifespan increases < 30% → audit blanket effectiveness
ROLLBACK criteria (archive RAG):
  ❌ Quality < 70%
  ❌ Cost > $10/mo recurring
  ❌ Session lifespan does NOT increase, or decreases
  ❌ Main agent frequently complains about "miss context"
  ❌ Storage > 5GB per project
```
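The gate above can be applied mechanically once the Week 4 metrics are compiled. The following is a minimal sketch, not part of the plan's scripts: the `TrialMetrics` class, its field names, and the `decide` function are hypothetical, and treating every case that is neither PASS nor ROLLBACK as TUNE is an assumption, since the criteria do not cover every combination (for example storage between 1GB and 5GB).

```python
from dataclasses import dataclass


@dataclass
class TrialMetrics:
    """Week 4 trial numbers. Field names are illustrative assumptions."""
    recall_pct: float          # recall on the 30-query benchmark
    monthly_cost_usd: float    # Voyage + storage
    lifespan_gain_pct: float   # session lifespan change vs the lazy baseline
    cache_hit_pct: float       # multi-agent shared cache hit rate
    critical_miss_pct: float   # share of queries with a critical retrieval miss
    storage_gb: float          # disk usage per project
    frequent_miss_reports: bool = False  # main agent often reports missing context


def decide(m: TrialMetrics) -> str:
    """Map trial metrics to PASS, TUNE, or ROLLBACK."""
    rollback = (
        m.recall_pct < 70
        or m.monthly_cost_usd > 10
        or m.lifespan_gain_pct <= 0
        or m.frequent_miss_reports
        or m.storage_gb > 5
    )
    if rollback:
        return "ROLLBACK"
    passed = (
        m.recall_pct > 80
        and m.monthly_cost_usd < 5
        and m.lifespan_gain_pct > 30
        and m.cache_hit_pct > 60
        and m.critical_miss_pct < 10
        and m.storage_gb < 1
    )
    # Assumption: anything between PASS and ROLLBACK is treated as TUNE.
    return "PASS" if passed else "TUNE"
```

Checking ROLLBACK first means a single hard failure (for example cost above $10/mo) overrides otherwise good numbers, which matches the intent of an archive-on-failure gate.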
---
## 10. Caveats + risks
### 10.1 Beta features risk
| Feature | Status | Mitigation |
|---|---|---|
| Anthropic Memory tool | Beta `context-management-2025-06-27` | Defer until GA; keep using the current MEMORY.md |
| Anthropic Files API | Beta `files-api-2025-04-14` | Optional add-on, RAG primary |
| Extended 1h prompt cache | Beta `extended-cache-ttl-2025-04-11` | Use the 5-min default; opt in to 1h for heavy sessions |
| Voyage AI API | Stable | Production OK |
| Qdrant local | Stable | Production OK |
| FastMCP | Stable v2+ | Production OK |
### 10.2 Storage concerns
```
Bro's current disk: 911/954 GB used = 96% full (43GB free)
RAG storage budget:
  Qdrant binary:      ~50MB
  Per-project index:  ~200-500MB (depends on MD volume)
  Backup snapshots:   ~500MB
  Logs + audit:       ~100MB
  Per-project total:  ~1GB
  2 projects total:   ~2GB
  + buffer:           1GB
  = 3GB recommended free space
→ Clean up BEFORE setup: target 5GB+ free
```
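The budget arithmetic above can be turned into a quick pre-flight check before setup. This is a sketch only: the function name `free_space_report` and its parameters are hypothetical, and the defaults simply restate the plan's numbers (2 projects at ~1GB each, 1GB buffer, 5GB free-space target). It relies on the standard-library `shutil.disk_usage`, which works on both Windows and Unix paths.

```python
import shutil

GB = 1024 ** 3


def free_space_report(path: str = ".", projects: int = 2,
                      per_project_gb: float = 1.0, buffer_gb: float = 1.0,
                      target_free_gb: float = 5.0) -> dict:
    """Compare free disk space at `path` with the RAG storage budget."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_gb = usage.free / GB
    return {
        "free_gb": round(free_gb, 1),
        "used_pct": round(100 * usage.used / usage.total, 1),
        "needed_gb": projects * per_project_gb + buffer_gb,  # 3GB with defaults
        "ok": free_gb >= target_free_gb,
    }


if __name__ == "__main__":
    print(free_space_report("."))
```

Running it against the drive that will hold the Qdrant data shows whether the cleanup has reached the 5GB target yet.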
**Cleanup priorities:**
- `node_modules` of old projects
- .NET `bin/` and `obj/` artifacts
- Docker images (`docker system prune -a`)
- Browser caches (Chrome/Edge ~5GB common)
- `%LOCALAPPDATA%` caches (NuGet, dotnet)
- Unused Downloads / Videos
### 10.3 Quality monitoring
| Risk | Indicator | Action |
|---|---|---|
| Chunking breaks narrative | Main agent reports "miss context" | Review and tune the chunking strategy |
| Embedding drift | Recall drops > 10pp on the benchmark | Full re-embed, check Voyage updates |
| Stale index | Committed files not yet re-indexed | Force a full re-index, check the hook |
| Weak query phrasing | Low precision on simple queries | Main agent refines query patterns |
| Cross-language mismatch | Vietnamese query misses English content | Multilingual reranker or query expansion |
### 10.4 Fallback strategy
```
When RAG fails or quality drops:
  Layer 1: Main agent falls back to reading the full file (the existing lazy pattern still works)
  Layer 2: Main agent blanket-loads the critical file directly
  Layer 3: Roll back to a Qdrant snapshot (weekly backup)
  Layer 4: Full re-index from scratch (~15 minutes)
  Layer 5: Archive RAG, return to the current lazy pattern (ultimate fallback)
```
The main agent's 120K blanket is NOT lost when RAG fails → graceful degradation.
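
Layer 1 of the fallback is the only layer that happens at query time, so it is the one worth sketching in code. The example below is an assumption about how a caller might wrap retrieval, not the plan's MCP server: `retrieve_with_fallback`, `rag_retrieve`, and `read_full_file` are hypothetical names, and both callables are injected so the sketch does not depend on any specific client library.

```python
from typing import Callable, List


def retrieve_with_fallback(
    query: str,
    rag_retrieve: Callable[[str], List[str]],
    read_full_file: Callable[[], str],
) -> List[str]:
    """Try RAG first; on error or an empty result, fall back to a full-file read."""
    try:
        chunks = rag_retrieve(query)
    except Exception:
        # Assumption: any retrieval error (server down, timeout) triggers Layer 1.
        chunks = []
    if chunks:
        return chunks
    return [read_full_file()]
```

Layers 3 to 5 (snapshot rollback, re-index, archive) are operational steps and stay manual in this sketch.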
### 10.5 Vietnamese-English mix considerations
```
Voyage-3-large claims multilingual coverage of 26 languages.
No explicit Vietnamese benchmark is public.
Risk: mixed Vietnamese-English technical jargon may miss synonyms.
Example: "im lặng 403" vs "silent 403": are the vectors close?
Mitigation:
  - Test 10-20 mixed Vietnamese-English queries in the audit benchmark
  - If recall is low → consider voyage-multilingual-2 as a backup
  - Or add query expansion (Anthropic Contextual Retrieval pattern)
```
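The "are the vectors close?" question can be checked directly by embedding each Vietnamese phrase and its English counterpart and comparing cosine similarity. This is a sketch under stated assumptions: `weak_pairs` and the `embed` callable are hypothetical (in practice `embed` would wrap the embedding provider), and the 0.7 threshold is an arbitrary starting point to calibrate against known-good pairs, not a figure from Voyage.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weak_pairs(
    pairs: Sequence[Tuple[str, str]],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.7,  # assumption: tune against known-good pairs
) -> List[Tuple[str, str, float]]:
    """Return (vi, en, score) for pairs whose embeddings are not close enough."""
    weak = []
    for vi, en in pairs:
        score = cosine(embed(vi), embed(en))
        if score < threshold:
            weak.append((vi, en, score))
    return weak
```

Feeding it pairs such as `("im lặng 403", "silent 403")` gives a short list of jargon that likely needs query expansion or a multilingual fallback model.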
---
## 11. Success metrics
### 11.1 Quality metrics
| Metric | Target | Measurement |
|---|---:|---|
| Recall avg (30 query benchmark) | > 80% | Manual score weekly |
| Precision avg | > 75% | Manual score weekly |
| Critical retrieval miss rate | < 10% | Cumulative main-agent reports |
| Cross-language query recall | > 70% | Vietnamese-English mix benchmark |
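
For the weekly manual scoring, recall and precision can be computed per query from hand-labeled relevant chunks and then averaged over the benchmark. The sketch below assumes each benchmark entry is a pair of (retrieved chunk ids, set of relevant chunk ids); the function names and that data shape are illustrative, not defined by the plan.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(retrieved: Sequence[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall for one query, ignoring duplicate retrieved ids."""
    unique = set(retrieved)
    hits = len(unique & relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def benchmark_average(results: List[Tuple[Sequence[str], Set[str]]]) -> Tuple[float, float]:
    """Average precision and recall over all benchmark queries."""
    if not results:
        return 0.0, 0.0
    scores = [precision_recall(retrieved, relevant) for retrieved, relevant in results]
    avg_precision = sum(p for p, _ in scores) / len(scores)
    avg_recall = sum(r for _, r in scores) / len(scores)
    return avg_precision, avg_recall
```

Comparing the averages against the 80% recall and 75% precision targets week over week also exposes the "recall drop > 10pp" drift indicator from §10.3.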
### 11.2 Cost metrics
| Metric | Target | Measurement |
|---|---:|---|
| Voyage monthly spend | < $5 | Voyage dashboard |
| Total RAG infra cost | < $10/month | Sum tools |
| Cost per query | < $0.001 | Calculated |
| Disk usage per project | < 1GB | `du -sh` |
### 11.3 Performance metrics
| Metric | Target | Measurement |
|---|---:|---|
| Query latency (P50) | < 200ms | MCP server log |
| Query latency (P99) | < 500ms | MCP server log |
| Re-index lag (post-commit) | < 30s | Pre-commit hook timing |
| Cache hit rate (multi-agent) | > 60% | Custom metric |
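
The P50 and P99 targets can be checked from whatever latencies the MCP server logs. This sketch assumes the latencies have already been parsed into a list of milliseconds (the log format is not specified in the plan) and uses the nearest-rank percentile definition; `percentile` and `latency_report` are hypothetical helper names.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms: Sequence[float],
                   p50_target: float = 200, p99_target: float = 500) -> Dict[str, object]:
    """Summarize query latency against the P50 < 200ms and P99 < 500ms targets."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {"p50": p50, "p99": p99, "ok": p50 < p50_target and p99 < p99_target}
```

With fewer than about 100 samples the P99 value is effectively the slowest query, so the weekly audit should collect enough queries before reading much into it.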
### 11.4 Capacity metrics
| Metric | Target | Measurement |
|---|---:|---|
| Session lifespan productive | +50% vs lazy | Time tracker |
| Tasks before lost-in-middle | > 35 | Task counter |
| Heavy session token | -20% vs lazy | Anthropic dashboard |
| Multi-agent overlap saving | > 50K/session | Cumulative calc |
### 11.5 Multi-AI client metrics
| Metric | Target | Measurement |
|---|---:|---|
| Active clients | ≥ 1 stable | Audit log |
| Per-client query volume | Track baseline | Audit log per client |
| Cross-client conflict | 0 | Bug reports |
---
## 12. Future enhancements
### 12.1 Phase 2 (after Week 4 validation)
| Enhancement | Effort | Benefit |
|---|---|---|
| Upgrade Option B (drop blanket 30-40K) | 1 session | Saving +15% tokens |
| Anthropic Memory tool integration | 2-3h | Native cross-conversation memory |
| Files API integration | 2-3h | Reduce blanket re-upload cost |
| Citations enable | 1h | RAG quality trace |
### 12.2 Phase 3 (Month 2-3)
| Enhancement | Effort | Benefit |
|---|---|---|
| Hybrid BM25 + vector search (Contextual Retrieval) | 4-6h | 49-67% fewer failed retrievals (Anthropic doc; the 67% figure includes reranking) |
| Multi-project namespace | 2-3h | Cross-project query with strict isolation |
| Reranker model (Cohere rerank-3) | 2-3h | +10-20% precision |
| Custom Streamlit audit dashboard | 4-5h | Visual quality monitoring |
### 12.3 Phase 4 (Quarter 2+)
| Enhancement | Effort | Benefit |
|---|---|---|
| Replace Voyage with Anthropic native embedding (if GA) | 2-3h | Reduce vendor count |
| Auto-tuning chunking (LLM-aided) | 1 week | Quality+ |
| Federated multi-machine setup | 1 week | Team usage |
| Time-series analytics on retrieval patterns | 1 week | Insights |
### 12.4 Defer indefinitely (over-engineering)
- ❌ LangChain / LlamaIndex framework (heavy abstraction)
- ❌ Self-host LLM (cost > value)
- ❌ Custom embedding model fine-tuning (effort > value)
- ❌ Full text + vector hybrid index (use Voyage Reranker instead)
---
## 📚 References + tools
### Anthropic official
- [Memory tool docs](https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool)
- [Prompt caching guide](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- [Files API](https://platform.claude.com/docs/en/build-with-claude/files)
- [Contextual Retrieval cookbook](https://platform.claude.com/cookbook/capabilities-contextual-embeddings-guide)
- [Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [Agent SDK overview](https://code.claude.com/docs/en/agent-sdk/overview)
### Tools docs
- [Qdrant docs](https://qdrant.tech/documentation/)
- [Voyage AI pricing](https://docs.voyageai.com/docs/pricing)
- [FastMCP](https://github.com/jlowin/fastmcp)
- [MCP servers list](https://github.com/modelcontextprotocol/servers)
### Project memory
- `feedback_md_compact_narrative.md` (§6.5 rule — KEEP narrative)
- `feedback_multi_agent_setup.md` (4-agent discipline)
- `feedback_drastic_refactor_scope.md` (RAG setup = dedicated session)
- `feedback_uat_skip_verify.md` (Phase 9 UAT mode)
---
## ✅ Pre-implementation checklist
```
☐ Bro confirms 3 items:
  ☐ Paths of the 2 projects (so Investigator can audit the MD inventory pre-flight)
  ☐ Stack of the 2 projects (BE: .NET/Node/Python? FE: React/Vue?)
  ☐ Pilot project choice (the smaller of the 2)
☐ Bro prepares the environment:
  ☐ Disk cleanup to 5GB+ free (current 911/954 = 96% full)
  ☐ Voyage AI account signup + API key
  ☐ Python 3.10+ installed
  ☐ Git installed (for the pre-commit hook)
☐ Bro schedules a dedicated session:
  ☐ One 10-14h block on a weekend day (per the memory rule feedback_drastic_refactor_scope)
  ☐ Reserve ~30% of the weekly cap for RAG setup spawn cost
☐ Bro reviews the plan:
  ☐ Read this file in full
  ☐ Confirm the blanket vs RAG store scope matches needs
  ☐ Confirm the tool stack is acceptable
  ☐ Approve the Week 1-4 trial timeline
```
---
## 📝 Notes — keep updated
- **2026-05-12 turn 1:** Plan saved after S21 turn 1 finalized cicd-monitor. Cross-project reference for bro's 2 future projects with > 1M of MD context. SOLUTION_ERP baseline is ~354K MD (RAG not needed yet, deferred).
- **Status:** 📝 PLAN ONLY (not yet implemented)
- **Next trigger:** Bro confirms the 3 items → spawn 🔵 Investigator to audit the MD inventory of both projects → fine-tune the blanket list for each project