pqhuy1987 0a3b747612 [CLAUDE] Docs: finalize Session 21 turn 2: RAG Hybrid setup planning + Option A validation

After S21 turn 1 finalized cicd-monitor, bro clarified that 5 future projects will exceed 1M MD tokens, which led to a deep ~15-turn discussion on RAG infrastructure. Em main worked solo (no SOLUTION_ERP sub-agent spawned) and delegated research on Anthropic and community practice to claude-code-guide × 2.

Decisions finalized:
- Option A, defensive (keep em main's 120K blanket + supplement with RAG retrieval)
- Drop Option B, aggressive (cut 60-70% of the blanket): it violates the priority of keeping em main's control flow strong
- Industry-validated across 4 Anthropic blog posts + 5 community tools (Cursor/Continue/Cline/Aider all hybrid)
- 3-layer pattern, Phase 1-3 incremental rollout (vector → +BM25 → +reranking, recall ~70% → ~92%)
- Stack: Voyage-3-large + Qdrant local + FastMCP Python + Streamlit dashboard

Multi-agent cost reality clarify (post-S21 t2):
- Em main blanket: ~120K
- 4 sub-agents spawn cumulative: ~400K
- Total billed heavy session: ~560K Cách A vs ~700K lazy
- Saving of ~20% from the multi-agent shared cache (70-90%)
- Anthropic acknowledges an 8-10× multiplier for multi-agent

Files updated:
- docs/STATUS.md (Last updated S21 turn 2 + Recently Done row top)
- docs/HANDOFF.md (TL;DR Session 21 turn 2 section + Last updated)
- docs/rag-setup-plan.md (+Section 13 multi-agent cost reality + Section 14 3-layer hybrid Phase 1-3, +355 LOC)
- docs/changelog/sessions/2026-05-12-1800-s21-turn2-rag-planning.md (new session log)

Memory user-level update (outside repo, separate update):
- feedback_rag_hybrid_pattern.md (NEW cross-project pattern reusable)
- MEMORY.md index (+1 entry pointer)

Plan I (NEW) deferred. Trigger: bro confirms the 5 project paths + stack + pilot + Voyage API + disk cleanup → dedicated 10-14h weekend session (per the feedback_drastic_refactor_scope rule).

Stats:
- 17 memory entries (+1 RAG hybrid)
- 1 plan file rag-setup-plan.md (1500 LOC final)
- 4 sub-agents seeds-only unchanged
- 81 test unchanged
- 4 commits S21 cumulative (f1c61c9 + 3a34831 + 1f8e9af + this)

CI skip per path filter (all .md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 18:50:28 +07:00


RAG Setup Plan — Cross-project reference

Purpose: Plan for setting up Hybrid RAG (Option A) for projects whose MD context exceeds 1M tokens. Applicable across projects: SOLUTION_ERP is the baseline reference, and bro's 2 future projects will apply this pattern.
Last updated: 2026-05-12 (Session 21 turn 1+)
Status: 📝 Plan saved, not yet implemented; target is a Week 1-4 trial on the 2 future projects
Owner: pqhuy1987@gmail.com + Claude (em main + 4 sub-agents)

1. Context + Why

Problem statement

Current lazy blanket pattern (em main + 4 agents):
  - Em main carries ~120K MD upfront (35% of the project)
  - Lazy Read on demand: em main GUESSES which files are relevant
  - 4 agents, each spawn ~188K cache WRITE
  - Heavy session ~700K effective billed
  - Lost-in-middle threshold reached after ~5.75h productive

Scale-up to 2 projects > 1M MD tokens each:
  ❌ Blanket is NOT feasible (exceeds the 1M context cap)
  ❌ Lazy Read recall ~30-60% (em main misses files it does not think of)
  ❌ 4 agents duplicate Reads of the same files (cumulative ~240K wasted)
  ❌ Vietnamese-English synonym misses (grep is keyword-only)
  ❌ Cross-project context impossible without manual switching

Solution

Hybrid RAG Option A — blanket critical + retrieve on-demand:

KEEP blanket: ~100K static (core stable + current state + agent + skills + memory critical)
ADD RAG layer: 70% MD remaining accessible via semantic retrieve
SHARE cache: 4 agents reuse retrieved chunks (multi-agent leverage)

Benefits confirmed in earlier analysis sessions

Metric	Lazy current	Option A	Δ
Quality recall	30-60%	85%	+25-55pp
Heavy session token	700K	560K	-20%
Session productive hours	5.75h	7.6h	+1.85h
Tasks before lost-in-middle	~23	~38	+65%
Net successful tasks/session	25	50	2×
Multi-agent shared cache	❌	✅ 60-90% cache hit	leverage real
Vietnamese-English semantic search	❌ grep only	✅ Voyage multilingual	unlock
Scale > 1M MD	❌ break	✅ work	enable

Trade-off

⚠️ Setup cost: ~10-14h dedicated session (one-time investment)
⚠️ Maintenance: ~30 min/week audit
⚠️ Beta features (Memory tool, Files API): breaking changes are possible
⚠️ Retrieval miss risk ~5-10% (mitigated by citations + fallback Read)
⚠️ Voyage API cost: ~$0.36 initial embed + ~$0.20/month delta
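
As a sanity check on the Voyage figures, the arithmetic can be sketched as follows. The price is the one quoted in this plan; the token volumes are assumptions chosen to reproduce the stated amounts (two projects of ~1M MD tokens each for the initial embed, and roughly 1.1M re-embedded tokens per month for the delta).

```python
# Sketch only: the token volumes below are assumptions, not measurements.
PRICE_PER_M_TOKENS = 0.18  # USD per million tokens, as quoted in this plan


def embed_cost_usd(tokens: int, price_per_m: float = PRICE_PER_M_TOKENS) -> float:
    """Cost in USD to embed `tokens` tokens, rounded to cents."""
    return round(tokens / 1_000_000 * price_per_m, 2)


initial = embed_cost_usd(2_000_000)  # assumed: 2 projects x ~1M MD tokens
monthly = embed_cost_usd(1_100_000)  # assumed: ~1.1M changed tokens per month
print(f"initial=${initial:.2f} monthly=${monthly:.2f}")
```

If the real monthly churn is far below 1.1M tokens, the delta cost should come in under the ~$0.20 estimate.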

2. Architecture overview

┌─────────────────────────────────────────────────────────────┐
│ LAYER 1 — Static blanket (cache hot, 5min-1h TTL)           │
├─────────────────────────────────────────────────────────────┤
│ Em main + 4 sub-agents auto-inject ~100K core context:      │
│   • rules.md, architecture.md, CLAUDE.md, PROJECT-MAP       │
│   • STATUS top 100 lines, HANDOFF top 150 lines             │
│   • 5 agent .md (README + 4 agent identity)                 │
│   • 6 SKILL.md descriptions (auto-inject)                   │
│   • 6 critical cross-cutting memory entries                 │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ LAYER 2 — Vector DB retrieve on-demand                      │
├─────────────────────────────────────────────────────────────┤
│ Qdrant local (~50MB binary, ~200MB index per project):      │
│   • Session logs cumulative (49% MD, biggest)               │
│   • Gotchas detail (chunk per entry)                        │
│   • Archives + Recently Done + Migration-todos              │
│   • Flows + Database guides                                 │
│   • SKILL.md detail (description already in blanket)        │
│   • Memory entries non-critical                             │
│   • Guides ops conditional                                  │
└─────────────────────────────────────────────────────────────┘
                              ↑
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3 — Embedding service (Voyage AI cloud)               │
├─────────────────────────────────────────────────────────────┤
│ voyage-3-large multilingual 26 lang (strong Viet-Eng):      │
│   • Index time: embed chunks → vectors (one-time + delta)   │
│   • Query time: embed query → search Qdrant top-K           │
│   • Cost: $0.18/M tokens, ~$0.36 init + ~$0.20/month        │
└─────────────────────────────────────────────────────────────┘
                              ↕
┌─────────────────────────────────────────────────────────────┐
│ LAYER 4 — MCP retriever server (FastMCP Python)             │
├─────────────────────────────────────────────────────────────┤
│ Tool exposed: rag_retrieve(query, scope, k, time_range)     │
│ Transport: stdio (Claude Code) or HTTP/SSE (multi-AI)       │
│ Auth: API key per client (multi-AI mode)                    │
└─────────────────────────────────────────────────────────────┘
                              ↕
┌─────────────────────────────────────────────────────────────┐
│ LAYER 5 — Multi-AI clients                                  │
├─────────────────────────────────────────────────────────────┤
│ Claude Code (em main + 4 agents) — primary                  │
│ Claude Desktop — secondary                                  │
│ GPT-4 / Cursor / Continue / Custom agent — optional         │
└─────────────────────────────────────────────────────────────┘
                              ↑
┌─────────────────────────────────────────────────────────────┐
│ LAYER 6 — Re-index pipeline                                 │
├─────────────────────────────────────────────────────────────┤
│ Pre-commit hook: delta re-index changed MD                  │
│ Weekly full re-index: catch missed (Saturday off-peak)      │
│ Batch API 50% discount for mass re-index                    │
└─────────────────────────────────────────────────────────────┘

Index-time flow (one-time init + delta)

1. Walk filesystem → docs/ + .claude/ + memory/
2. Chunk adaptively by doc_type (custom Python chunker)
3. Batch embed via Voyage API (128 chunks/batch)
4. Upsert into Qdrant with metadata (source, doc_type, project, last_modified)
5. Total init: ~10-15 min per 1M MD tokens

Query-time flow (every em main or agent spawn)

1. Em main/agent: rag_retrieve("query keyword", scope, k)
2. MCP server: embed query → Voyage API (~100ms)
3. MCP server: Qdrant search top-K (~50ms local)
4. MCP server: return chunks with metadata + score
5. Total: ~150-200ms per query (network-bound)
6. Cache: subsequent same query → ~10ms (cache hit)
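
Step 6 is not implemented in the server script in section 6.3. One way to get it is to memoize the query embedding. This is a minimal sketch: `embed_query_remote` is a hypothetical stand-in for the Voyage call, and the normalization rule (lowercase, collapsed whitespace) is an assumption.

```python
from functools import lru_cache


def embed_query_remote(query: str) -> tuple[float, ...]:
    """Hypothetical stand-in for the remote embedding call; returns a fake vector."""
    embed_query_remote.calls += 1
    return (float(len(query)), 0.0)


embed_query_remote.calls = 0  # counts simulated network round-trips


def normalize(query: str) -> str:
    """Treat queries that differ only in case or spacing as the same cache key."""
    return " ".join(query.lower().split())


@lru_cache(maxsize=512)
def _embed_cached(normalized: str) -> tuple[float, ...]:
    return embed_query_remote(normalized)


def embed_query(query: str) -> tuple[float, ...]:
    """Embed a query, reusing the cached vector for repeated queries."""
    return _embed_cached(normalize(query))


first = embed_query("silent 403")
second = embed_query("  Silent   403 ")  # cache hit: no second remote call
print(embed_query_remote.calls)
```

The cache lives in the server process, so it is shared by every agent that talks to the same server instance and is lost on restart.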

3. BLANKET load list

Total: ~100K tokens (28% of project MD)
Auto-loaded on every spawn of em main + 4 agents.

A. Core stable docs (~30K, rarely changed)

File	Tokens	Why blanket
`docs/rules.md`	~7K	Stable coding conventions, referenced by every task
`CLAUDE.md` (root pointer)	~3K	Auto-inject system reminder
`docs/CLAUDE.md`	~3K	Tech stack overview baseline
`docs/architecture.md`	~7K	4-layer Clean Arch baseline
`docs/PROJECT-MAP.md`	~3K	Navigation map
`docs/workflow-contract.md`	~4K	9-phase state machine, core of the Contract domain
`docs/forms-spec.md`	~3K	8-form catalog, domain knowledge

B. Current state (~25K, known to em main directly, no retrieval needed)

File	Strategy	Tokens
`docs/STATUS.md` top 100 lines	Current phase + In Progress + top 1-2 Recently Done	~15K
`docs/HANDOFF.md` top 150 lines	Last updated + TL;DR latest session + next priority	~10K

→ Dropped from blanket: STATUS Recently Done rows beyond the 5 most recent (retrieve if needed), HANDOFF TL;DR entries older than 1 week.

C. Agent infrastructure (~25K — agent identity stable)

File	Token
`.claude/agents/README.md`	~5K
`.claude/agents/investigator.md`	~3.5K
`.claude/agents/implementer.md`	~4K
`.claude/agents/reviewer.md`	~3.5K
`.claude/agents/cicd-monitor.md`	~5K
`.claude/agent-memory/{4 agents}/MEMORY.md` auto-inject 25KB first 200 lines	~4K total

D. Skills descriptions (~5K, auto-injected, not the full SKILL.md)

File	Strategy	Tokens
`.claude/skills/README.md`	Full	~2.5K
6 SKILL.md descriptions	Auto-inject by Claude Code	~1K total
6 SKILL.md detail	NOT blanket → RAG retrieve when triggered	-

E. Memory user-level critical (~15K)

File	Tokens	Why critical
`project_solution_erp.md`	~3.5K	Project overview narrative
`feedback_md_compact_narrative.md` (§6.5)	~2K	Core rule for all doc work
`feedback_uat_skip_verify.md`	~2K	Phase 9 current mode rule
`feedback_multi_agent_setup.md`	~3K	4-agent discipline
`feedback_per_chunk_commit.md`	~2K	Implementer pattern reusable
`feedback_audit_reuse_before_clone.md`	~2K	Investigator natural pattern

→ Dropped from blanket: the remaining 11 memory entries (retrieve when a pattern is triggered).

TOTAL BLANKET ≈ 100K tokens

4. RAG store list

Total: ~254K tokens (72% of project MD)
Indexed into Qdrant, retrieved on demand.

F. Session logs (~150K — biggest, 49% MD)

Path: docs/changelog/sessions/*.md (41+ files growing)
Chunk strategy: 1 file = 1 chunk (preserve narrative §6.5)
Metadata:
  - session_date: extracted from filename
  - phase: extracted from content
  - topic: extracted from H1
  - commit_sha_range: extracted from "Commits:" line
  - doc_type: "session_log"
Scope filter: time_range="last_week|last_month|last_quarter|all"

G. Gotchas (~9K — lookup per debug)

Path: docs/gotchas.md (44+ entries)
Chunk strategy: split per "### N. ..." numbered heading
Metadata:
  - gotcha_id: integer
  - category: extracted from content (tech/EF/Workflow/CICD/Security/...)
  - doc_type: "gotcha"
Scope filter: scope="gotcha"

H. Archives + Recently Done (~75K)

File	Strategy	Token
`docs/STATUS.md` rest beyond top 100	Per H2 section	~8K
`docs/HANDOFF.md` rest beyond top 150	Per H2 section	~21K
`docs/changelog/migration-todos.md`	Per H3 task	~18K
`docs/changelog/recently-done-archive-*.md`	Per H3 phase	~6K
`docs/_archive/forms-spec-raw.md`	Full file (cold archive)	~23K
`docs/_archive/workflow-raw.md`	Full file (cold archive)	~4K

I. Flows + Database (~17K — conditional task)

File	Tokens	When to retrieve
`docs/flows/README.md`	~1K	Index when a flow is needed
`docs/flows/auth-flow.md`	~1K	Task auth
`docs/flows/permission-flow.md`	~1.5K	Task permission
`docs/flows/contract-creation-flow.md`	~1.5K	Task Contract
`docs/flows/contract-approval-flow.md`	~1.5K	Task approval
`docs/flows/form-render-flow.md`	~1K	Task form
`docs/flows/sla-expiry-flow.md`	~1K	Task SLA
`docs/database/database-guide.md`	~3K	Task schema
`docs/database/schema-diagram.md`	~12K	Task ERD

J. SKILL.md detail (~40K, retrieved when a skill is triggered)

File	Token
`.claude/skills/contract-workflow/SKILL.md`	~7K
`.claude/skills/form-engine/SKILL.md`	~5K
`.claude/skills/permission-matrix/SKILL.md`	~5K
`.claude/skills/dependency-audit-erp/SKILL.md`	~5K
`.claude/skills/ef-core-migration/SKILL.md`	~5.5K
`.claude/skills/iis-deploy-runbook/SKILL.md`	~6K

K. Guides ops conditional (~10K)

File	Tokens	When to retrieve
`docs/guides/deployment-iis.md`	~2.5K	Task deploy
`docs/guides/cicd.md`	~2K	Task CI/CD
`docs/guides/security-checklist.md`	~2K	Audit security
`docs/guides/vps-setup.md`	~2.5K	Setup VPS
`docs/guides/runbook.md`	~1K	Ops debug

L. Memory entries non-critical (~50K — pattern lookup)

Remaining 11 memory entries (user-level):
  - feedback_n_stage_workflow_pattern.md (DEPRECATED post-Mig 21)
  - feedback_designtime_runtime_db.md
  - feedback_drastic_refactor_scope.md
  - feedback_cron_monthly_limitation.md
  - feedback_user_manual_style.md
  - feedback_node_cicd.md
  - feedback_unittest_timing.md
  - feedback_responsive_laptop_breakpoint.md
  - feedback_service_hook_vs_endpoint.md
  - reference_session_prompts.md
  - MEMORY.md index

M. Audit logs (~2K, grow)

docs/changelog/skill-audit-{YYYY-MM}.md (monthly audit log)

TOTAL RAG STORE ≈ 254K tokens

5. Stack

Component	Tool	Reason	Cost
Vector DB	Qdrant local	Rust binary 50MB, no Docker, fast, metadata filtering, admin UI	$0
Embedding	Voyage-3-large	Anthropic partner, multilingual 26 lang, no GPU needed	$0.18/M (~$0.36 init)
MCP server framework	FastMCP Python	Standalone `fastmcp` 2.x package (FastMCP 1.0 was merged into the official MCP Python SDK), ~100 LOC, auto schema	$0
Chunking	Custom Python adaptive	~50 LOC, transparent, §6.5 compliant	$0
Re-index pipeline	Pre-commit hook	Native git, ~10 LOC bash	$0
Monitoring	Qdrant Dashboard + custom audit	Built-in UI port 6333	$0
Auth (multi-AI)	Bearer token + rate limit	Custom middleware ~30 LOC	$0
Batch re-index	Voyage Batch API	50% discount for mass re-embed	-50%

Rejected stack + reasons

Alternative	Reason rejected
Chroma vector DB	Python ecosystem, slower than Qdrant Rust
pgvector	Requires a PostgreSQL setup, overhead
OpenAI text-embedding-3-small	Weaker Vietnamese quality than Voyage
BGE-M3 local	Needs GPU >= 4GB (Intel Iris Xe is not sufficient)
LangChain / LlamaIndex	Heavy abstraction, hard-to-debug black box, chunker does not follow §6.5
TypeScript MCP SDK	More verbose than Python FastMCP
Pinecone cloud	Paid + vendor lock-in, that scale is not needed

6. Setup scripts

6.1 `requirements.txt`

fastmcp>=2.0
voyageai>=0.3
qdrant-client>=1.12
python-frontmatter>=1.1

6.2 `scripts/rag-indexer.py` (~120 LOC)

"""
RAG Indexer: embed MD files and upsert them into Qdrant.

Usage:
  python rag-indexer.py                    # full index
  python rag-indexer.py --files "a.md b.md"  # delta re-index
"""
import os, glob, re, sys, uuid
from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

QDRANT_PATH = "./rag-data/qdrant"
COLLECTION = "project_md"  # rename per project
EMBED_MODEL = "voyage-3-large"
DIM = 1024
BATCH = 128  # chunks per embed request; lower it if Voyage rejects a batch as too large

voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(path=QDRANT_PATH)

def chunk_file(path: str) -> list[dict]:
    """Adaptive chunking by doc type."""
    path = path.replace("\\", "/")  # normalize Windows separators so the checks below match
    with open(path, encoding="utf-8") as f:
        content = f.read()
    base = {"source": path, "size_chars": len(content)}

    # Session logs: 1 file = 1 chunk (preserve narrative)
    if "/changelog/sessions/" in path:
        return [{**base, "content": content, "doc_type": "session_log"}]

    if path.endswith("gotchas.md"):
        # One capture group yields [preamble, id1, body1, id2, body2, ...]
        entries = re.split(r"^### (\d+)\.", content, flags=re.M)
        return [
            {**base, "content": f"### {entries[i]}.{entries[i+1]}",
             "doc_type": "gotcha", "entry_id": int(entries[i])}
            for i in range(1, len(entries), 2)
        ]

    if "/skills/" in path:
        return [{**base, "content": content, "doc_type": "skill"}]

    if "/agents/" in path:
        return [{**base, "content": content, "doc_type": "agent"}]

    if path.endswith("MEMORY.md") or "/memory/" in path:
        return [{**base, "content": content, "doc_type": "memory"}]

    # Default: split per H2 heading
    sections = re.split(r"^## ", content, flags=re.M)
    return [
        {**base, "content": ("## " + s) if i > 0 else s,
         "doc_type": "doc", "section_idx": i}
        for i, s in enumerate(sections) if len(s.strip()) > 200
    ]

def point_id(chunk: dict) -> str:
    """Stable id per chunk. The built-in hash() is salted per process, and keying
    only on section_idx gave every gotcha entry of a file the same id."""
    key = chunk.get("entry_id", chunk.get("section_idx", 0))
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{chunk['source']}#{key}"))

def main(files: list[str] | None = None):
    # Init collection (idempotent)
    if not qdrant.collection_exists(COLLECTION):
        qdrant.create_collection(
            COLLECTION,
            vectors_config=VectorParams(size=DIM, distance=Distance.COSINE)
        )

    # Determine paths
    if files:
        paths = files
    else:
        paths = (
            glob.glob("docs/**/*.md", recursive=True) +
            glob.glob(".claude/**/*.md", recursive=True)
        )
        paths = [p for p in paths
                 if "node_modules" not in p and "_user-guide" not in p]

    # Chunk
    chunks = []
    for path in paths:
        try:
            chunks.extend(chunk_file(path))
        except Exception as e:
            print(f"Skip {path}: {e}")
    print(f"Chunking: {len(chunks)} chunks from {len(paths)} files")
    if not chunks:
        return  # nothing to embed; avoids an empty embed/upsert request

    # Batch embed
    texts = [c["content"] for c in chunks]
    embeddings = []
    for i in range(0, len(texts), BATCH):
        batch = texts[i:i+BATCH]
        result = voyage.embed(batch, model=EMBED_MODEL, input_type="document")
        embeddings.extend(result.embeddings)
        print(f"Embedded {i+len(batch)}/{len(texts)}")

    # Upsert (Qdrant replaces points that share an id)
    points = [
        PointStruct(id=point_id(c), vector=emb, payload=c)
        for c, emb in zip(chunks, embeddings)
    ]
    qdrant.upsert(collection_name=COLLECTION, points=points)
    print(f"Indexed {len(points)} chunks → Qdrant")

if __name__ == "__main__":
    files = sys.argv[2].split() if len(sys.argv) > 2 and sys.argv[1] == "--files" else None
    main(files)

6.3 `scripts/rag-mcp-server.py` (~80 LOC)

"""
MCP retriever server: exposes the rag_retrieve tool to Claude Code + agents.

Run: python rag-mcp-server.py  (stdio default)
     python rag-mcp-server.py --http :7777  (HTTP/SSE for multi-AI)
"""
import os, sys
from fastmcp import FastMCP
from voyageai import Client
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

mcp = FastMCP("project-rag")
voyage = Client(api_key=os.environ["VOYAGE_API_KEY"])
qdrant = QdrantClient(path="./rag-data/qdrant")
COLLECTION = "project_md"

@mcp.tool()
def rag_retrieve(
    query: str,
    scope: str = "all",
    k: int = 5
) -> list[dict]:
    """
    Semantic search MD context.
    
    Args:
        query: Search query (Vietnamese or English, mixing is OK)
        scope: Filter by doc_type:
               "all" | "session_log" | "gotcha" | "memory" | 
               "skill" | "agent" | "doc"
        k: Top chunks to return (1-15, default 5)
    
    Returns:
        List[dict] with keys: content, source, doc_type, score
    
    Use cases:
        - Historical session log: rag_retrieve("Mig 26 V2", scope="session_log")
        - Gotcha lookup: rag_retrieve("silent 403", scope="gotcha")
        - Pattern reuse: rag_retrieve("audit clone", scope="memory")
        - Cross-section: rag_retrieve("query", scope="all", k=10)
    """
    k = min(max(k, 1), 15)
    
    # Embed query
    query_vec = voyage.embed(
        [query], model="voyage-3-large", input_type="query"
    ).embeddings[0]
    
    # Filter
    filter_dict = None
    if scope != "all":
        filter_dict = Filter(
            must=[FieldCondition(key="doc_type", match=MatchValue(value=scope))]
        )
    
    # Search (query_points supersedes the deprecated search() in recent qdrant-client)
    results = qdrant.query_points(
        collection_name=COLLECTION,
        query=query_vec,
        query_filter=filter_dict,
        limit=k
    ).points
    
    return [
        {
            "content": r.payload["content"][:3000],  # truncate huge
            "source": r.payload["source"],
            "doc_type": r.payload["doc_type"],
            "score": round(r.score, 3)
        }
        for r in results
    ]

@mcp.tool()
def rag_stats() -> dict:
    """Return collection stats (for audit)."""
    info = qdrant.get_collection(COLLECTION)
    return {
        "total_chunks": info.points_count,
        "vector_dim": info.config.params.vectors.size,
        "distance": info.config.params.vectors.distance.value,
        # optimizer_status reports optimizer health, not an indexing timestamp
        "optimizer_status": str(info.optimizer_status),
    }

if __name__ == "__main__":
    # Default: stdio mode for Claude Code
    # HTTP/SSE mode: python rag-mcp-server.py --http :7777
    if "--http" in sys.argv:
        port = int(sys.argv[sys.argv.index("--http") + 1].lstrip(":"))
        mcp.run(transport="sse", port=port)
    else:
        mcp.run()  # stdio default

6.4 `.claude/settings.json` register

{
  "mcpServers": {
    "project-rag": {
      "command": "python",
      "args": ["scripts/rag-mcp-server.py"],
      "cwd": "${workspaceFolder}",
      "env": {
        "VOYAGE_API_KEY": "${env:VOYAGE_API_KEY}"
      }
    }
  }
}

6.5 Pre-commit hook

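#!/bin/sh
# .git/hooks/pre-commit
# Re-index changed MD files (added, modified, renamed; deletions are not handled here)
# Note: embedded Qdrant (QdrantClient(path=...)) locks its folder to one process,
# so this fails while the MCP server holds the same ./rag-data/qdrant open.
changed_md=$(git diff --cached --name-only --diff-filter=AMR | grep -E "\.md$")
if [ -n "$changed_md" ]; then
    echo "RAG re-indexing $(echo "$changed_md" | wc -l) MD files..."
    python scripts/rag-indexer.py --files "$changed_md"
fi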

6.6 Agent .md frontmatter update

# Add the tool to each .claude/agents/{agent}.md:
tools: [Read, Grep, Glob, Bash, mcp__project-rag__rag_retrieve, ...]

Add to the system prompt section:

## RAG retriever usage (rag_retrieve tool)

**WHEN to use:**
- Historical session log lookup (> 1 week old)
- Gotcha pattern matching debug
- Memory pattern reuse "clone X to Y"
- Cross-section semantic search

**WHEN to use Read instead:**
- Current state (STATUS + HANDOFF top): blanket loaded
- Active file editing (needs the full file)
- Architecture review (stable docs, blanket)

**Query examples:**
- rag_retrieve("silent 403 non-admin", scope="gotcha", k=3)
- rag_retrieve("PE V2 wire pattern", scope="session_log", k=5)
- rag_retrieve("audit reuse clone", scope="memory", k=3)

7. Audit procedure

7.1 Weekly quick audit (~30 min, every Saturday)

Goal: check health + cost trend weekly.

Checklist:

# 1. Index health
# (REST endpoint exists only when Qdrant runs as a server on port 6333;
#  with the embedded path= mode used by the scripts, call rag_stats instead)
curl http://localhost:6333/collections/project_md
# Verify: points_count increases + status="green"

# 2. Re-index lag
git log --since="1 week ago" --name-only --pretty=format: | grep -E "\.md$" | sort -u | wc -l
python -c "
from qdrant_client import QdrantClient
q = QdrantClient(path='./rag-data/qdrant')
# List indexed sources to compare against the changed files counted above
points, _ = q.scroll('project_md', limit=10000, with_payload=['source'])
print(len({p.payload['source'] for p in points}), 'indexed sources')
"

# 3. Voyage cost
# Visit voyageai.com dashboard → check last 7 days usage
# Target: <$1/week steady state

# 4. Random query quality (5 manual queries)
# Sample queries:
#   - "Recent Mig" → expect session log top
#   - "silent 403" → expect gotcha #44 top
#   - "audit reuse" → expect memory entry top
# Score: 1-5 per query (relevant chunks in top-5)

# 5. Storage size
du -sh ./rag-data/
# Target: <500MB per project

Log: docs/changelog/rag-audit-weekly-{YYYY-WW}.md (1 page)
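
The re-index lag check in step 2 could be automated along these lines. This is a sketch that assumes each indexed chunk stores a `last_modified` timestamp in its payload (listed in the index-time flow but not yet written by the indexer script); both inputs are plain dicts so the comparison logic stays independent of Qdrant.

```python
def find_stale(file_mtimes: dict[str, float],
               indexed: dict[str, float]) -> dict[str, list[str]]:
    """Compare files on disk with indexed sources.

    file_mtimes: path -> modification time on disk
    indexed:     path -> last_modified recorded at index time (assumed payload field)
    """
    missing = sorted(p for p in file_mtimes if p not in indexed)
    outdated = sorted(p for p, mtime in file_mtimes.items()
                      if p in indexed and mtime > indexed[p])
    orphaned = sorted(p for p in indexed if p not in file_mtimes)
    return {"missing": missing, "outdated": outdated, "orphaned": orphaned}


report = find_stale(
    file_mtimes={"a.md": 10.0, "b.md": 20.0, "c.md": 5.0},
    indexed={"a.md": 10.0, "b.md": 15.0, "old.md": 1.0},
)
print(report)
```

`missing` and `outdated` point at hook failures; `orphaned` covers files deleted from the repo that still have chunks in the collection.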

7.2 Monthly deep audit (~2-3h, start of each month)

Goal: quality benchmark + chunking review + stale cleanup.

Checklist:

# 1. Quality benchmark: 30-query test set
test_queries = [
    # Categories: state, historical, debug, pattern, cross-stack
    # (queries stay in Vietnamese on purpose, to exercise multilingual retrieval)
    ("Phase hiện tại", "doc"),  # "current phase"
    ("Mig 26 PE Level Opinions UPSERT", "session_log"),
    ("silent 403 non-admin Forbidden", "gotcha"),
    ("audit reuse trước clone B từ A", "memory"),  # "audit reuse before cloning B from A"
    # ... 30 total covering all scopes
]

results = []
for query, expected_scope in test_queries:
    retrieved = rag_retrieve(query, k=10)
    # Manual score:
    # - Recall: % of expected sources in top-10
    # - Precision: % of retrieved chunks actually relevant
    results.append({"query": query, "recall": ..., "precision": ...})

# Target: avg recall > 80%, precision > 75%

# 2. Chunking review: sample 10 random chunks
# Check: are chunks cut in the middle of a narrative (violates §6.5)?
# Action: tune the chunker if issues are found

# 3. Stale audit
# Files not re-indexed for > 14 days → flag
# Files deleted from the repo but still in Qdrant → cleanup

# 4. Cost trend
# Monthly Voyage spend vs target
# Target: <$3/month steady

# 5. Capacity check
# Total chunks vs disk space projection
# Project growing significantly (>20% MoM) → plan scale

Log: docs/changelog/rag-audit-monthly-{YYYY-MM}.md (2-3 pages)
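
The manual recall and precision scores in the benchmark can be computed once each query has a labeled set of sources. This is a minimal sketch: the function names and the idea of labeling by source path are assumptions, not part of the existing scripts.

```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int = 10) -> float:
    """Share of expected sources that appear in the top-k retrieved sources."""
    if not expected:
        return 1.0  # nothing was expected, so nothing was missed
    return len(set(retrieved[:k]) & expected) / len(expected)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Share of the top-k retrieved sources that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for source in top if source in relevant) / len(top)


retrieved = ["a.md", "b.md", "c.md", "d.md"]
labeled = {"a.md", "d.md", "z.md"}
print(recall_at_k(retrieved, labeled), precision_at_k(retrieved, labeled))
```

Averaging these two numbers over the 30 queries gives the figures to compare against the > 80% recall and > 75% precision targets.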

7.3 Quarterly major audit (~4-6h, every quarter)

Goal: strategic review + major upgrades.

Checklist:

Embedding model upgrade decision
- New Voyage model available? Test side-by-side against voyage-3-large
- Quality benchmark on the 30-query test set
- Decision: upgrade if recall improves by +5pp
Chunking strategy iteration
- Review 50 random chunks
- Identify patterns: bad cuts, missing overlap, missing metadata
- Tune chunker code → full re-index
Collection rebuild from scratch
- Backup current → drop collection → re-index all
- Purpose: clean orphan chunks + apply new chunking
- Effort: ~30 min for 1M MD
Multi-AI client access audit
- Active clients (Claude Code / Desktop / GPT / Cursor)
- Per-client query volume + token spend
- Security: rotate auth tokens, review rate limits
Cross-project namespace audit (if multi-project)
- Project isolation working correctly?
- Cross-project query intentional vs accidental?
- Adjust metadata filter rules

Log: docs/changelog/rag-audit-quarterly-{YYYY-Q}.md (5-10 pages)

7.4 Trigger-based audit (ad-hoc)

Trigger	Action
Critical retrieval miss (reported by em main)	Audit why the relevant chunk was missed + tune
Cost spike >50% MoM	Audit query patterns + rate limit clients
Re-index hang >1h	Audit indexer logs + Qdrant health
Quality regression observed by em main	Spot-check + run the monthly audit early
New project added	Setup namespace + initial index audit

8. Multi-AI client access

8.1 MCP protocol — agnostic

MCP (Model Context Protocol) is a standard protocol. Any AI client that supports MCP can consume the same server:

              Qdrant (single source)
                    ↓
            MCP server :7777 (HTTP/SSE)
       ↙          ↓           ↓          ↘
  Claude Code  Claude     Cursor     GPT-4 +
              Desktop      IDE       custom adapter

8.2 Transport modes

Mode	Use case	Setup
stdio	Single client (Claude Code local) — default	`python rag-mcp-server.py`
HTTP/SSE	Multi-client (network access)	`python rag-mcp-server.py --http :7777`
WebSocket	Bi-directional (rare)	Custom config

8.3 Setup multi-AI mode

Step 1: Run MCP server HTTP mode

# Terminal 1: MCP server (keep running)
export VOYAGE_API_KEY="pa-xxxx"
python scripts/rag-mcp-server.py --http :7777

# Server endpoint: http://localhost:7777/sse

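Step 2: Add auth middleware (recommended for multi-client)

# Update rag-mcp-server.py
# NOTE: `fastmcp.middleware.bearer_auth` is a placeholder for the custom
# ~30 LOC middleware listed in the stack table; it is not a confirmed FastMCP
# API, so check the FastMCP auth docs or implement it before relying on this.
from fastmcp import FastMCP
from fastmcp.middleware import bearer_auth

ALLOWED_TOKENS = {
    "claude-code-token": "claude-code-primary",
    "claude-desktop-token": "claude-desktop-secondary",
    "gpt4-token": "gpt4-cursor-integration",
    "custom-agent-token": "custom-research-agent",
}

mcp = FastMCP("project-rag", middleware=[
    bearer_auth(tokens=ALLOWED_TOKENS, rate_limit_per_minute=30)
])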

Step 3: Register per-client config

Claude Code (em main + 4 agents)

// .claude/settings.json
{
  "mcpServers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse",
      "headers": {
        "Authorization": "Bearer claude-code-token"
      }
    }
  }
}

Claude Desktop

// claude_desktop_config.json
{
  "mcpServers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse",
      "headers": {
        "Authorization": "Bearer claude-desktop-token"
      }
    }
  }
}

Cursor IDE

// .cursor/settings.json
{
  "mcp.servers": {
    "project-rag": {
      "transport": "sse",
      "url": "http://localhost:7777/sse"
    }
  }
}

GPT-4 via custom adapter

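# Use OpenAI Assistants API + custom function calling
# NOTE: the /tool/rag_retrieve REST route below is hypothetical. The SSE server
# above speaks MCP only, so the adapter has to expose this route itself.
import requests

def query_project_rag(query: str, scope: str = "all", k: int = 5):
    response = requests.post(
        "http://localhost:7777/tool/rag_retrieve",
        headers={"Authorization": "Bearer gpt4-token"},
        json={"query": query, "scope": scope, "k": k}
    )
    response.raise_for_status()  # surface auth or routing errors early
    return response.json()

# Register as OpenAI function tool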

Continue.dev / custom agent

# config.yaml
mcp_servers:
  - name: project-rag
    transport: sse
    url: http://localhost:7777/sse
    auth_token: custom-agent-token

8.4 Security model multi-AI

Concern	Mitigation
Token leak	Rotate quarterly, store in env vars
Rate limit abuse	30 req/min/token default, tune per client
Read-only enforcement	MCP server expose only `rag_retrieve` + `rag_stats` (no write tools)
Audit log	Log every query: timestamp + client_token + query + result_count
Cross-project leak	Per-collection access control (future enhancement)
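
The "30 req/min/token" limit in the table can be enforced with a sliding-window counter. This is a minimal sketch, not the middleware itself: the class name is hypothetical, and the injectable `clock` exists only to make the behaviour easy to verify.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per token within any `window_s` seconds."""

    def __init__(self, limit: int = 30, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, token: str) -> bool:
        now = self.clock()
        recent = self.hits[token]
        # Drop timestamps that have left the window
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=2, window_s=60.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("claude-code-token") for _ in range(3)]
print(decisions)
```

Each rejected request is also a natural place to write the audit-log line (timestamp, client token, query, result count).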

8.5 Cost considerations multi-AI

Single Claude Code client (current):
  Voyage cost: ~$0.20/month (low query volume)
  Qdrant: free local

4 AI clients heavy use (Claude Code + Desktop + Cursor + GPT-4):
  Voyage cost: ~$2-5/month (higher query volume)
  Network bandwidth: minimal (~100KB/query response)
  CPU: Qdrant + Voyage embed call ~100ms total
  
→ Multi-AI cost scales linearly with query volume, not with infrastructure.

Phase 1 (Week 1-4): Single client (Claude Code only)
  → Validate quality + cost baseline
  
Phase 2 (Month 2+): Add Claude Desktop if mobile/casual access is needed
  → Same auth, share collection
  
Phase 3 (Month 3+): Add Cursor IDE if working across multiple IDEs
  → Verify no cross-tool conflicts
  
Phase 4 (Future): GPT-4 / custom agent integration if needed
  → Custom adapter + strict auth

9. Timeline rollout

Hour-by-hour breakdown (~10-14h dedicated session)

Hour	Task	Effort
1-2	Setup pre-flight: disk cleanup + Voyage signup + Python deps install	~2h
3-4	Write `scripts/rag-indexer.py` + run initial embed	~2h
5	Verify Qdrant collection + manual query sanity check	~1h
6-7	Write `scripts/rag-mcp-server.py` + register `.claude/settings.json`	~2h
8	Test rag_retrieve qua Claude Code (em main solo)	~1h
9-10	Update 4 agent .md frontmatter + system prompt sections	~2h
11	Setup pre-commit hook + audit logging	~1h
12-14	Buffer + trial 10-15 query measure quality + cost	~3h

Trial 4-week plan

Week 1: Pilot single project (smaller of 2)
  - Day 1-2: Setup + initial index
  - Day 3-7: Active use + measure baseline metrics
  - Deliverable: rag-audit-weekly-W1.md

Week 2: Roll out 2nd project
  - Day 1: Setup separate Qdrant collection
  - Day 2-7: Dual-project use measure
  - Deliverable: rag-audit-weekly-W2.md

Week 3: 4-agent integration
  - Day 1-2: Update the 4 agent .md files with the rag_retrieve tool
  - Day 3-7: Multi-agent task measure shared cache benefit
  - Deliverable: rag-audit-weekly-W3.md

Week 4: Decision gate (keep / tune / upgrade B / rollback)
  - Day 1-2: Compile metrics
  - Day 3: Decision meeting (bro + em main)
  - Day 4-7: Apply decision (tune embedding/chunking OR upgrade Option B OR rollback)
  - Deliverable: rag-audit-monthly-M1.md + decision doc

Decision gate Week 4

PASS criteria (continue + tune):
  ✅ Quality recall > 80% on 30 query benchmark
  ✅ Cost < $5/month total (Voyage + storage)
  ✅ Session lifespan increases by > 30% (heavy session)
  ✅ Multi-agent shared cache hit > 60%
  ✅ Retrieval miss critical < 10% queries
  ✅ Storage < 1GB per project

TUNE criteria (continue + adjust):
  ⚠️ Quality 70-80% → tune chunking or upgrade embedding
  ⚠️ Cost $5-10/mo → audit query patterns, reduce k
  ⚠️ Session lifespan increases by < 30% → audit blanket effectiveness

ROLLBACK criteria (archive RAG):
  ❌ Quality < 70%
  ❌ Cost > $10/mo recurring
  ❌ Session lifespan does NOT increase, or decreases
  ❌ Em main frequently complains about "miss context"
  ❌ Storage > 5GB per project
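
The Week 4 gate can be sketched as a small function. This is a minimal sketch, not part of the plan's scripts: the `TrialMetrics` fields and the `decide` name are hypothetical, and the combination rule is an assumption (any ROLLBACK criterion triggers rollback, all PASS criteria are needed to pass, everything in between is TUNE).

```python
from dataclasses import dataclass


@dataclass
class TrialMetrics:
    recall_pct: float            # recall on the 30-query benchmark
    cost_usd_month: float        # Voyage + storage
    lifespan_gain_pct: float     # heavy-session lifespan change vs. lazy baseline
    cache_hit_pct: float         # multi-agent shared cache hit rate
    critical_miss_pct: float     # share of queries with a critical retrieval miss
    storage_gb: float            # per project
    frequent_miss_context: bool  # Em main frequently reports "miss context"


def decide(m: TrialMetrics) -> str:
    """Return 'ROLLBACK', 'PASS' or 'TUNE' for the Week 4 gate."""
    # Assumption: a single ROLLBACK criterion is enough to archive RAG
    rollback = (
        m.recall_pct < 70
        or m.cost_usd_month > 10
        or m.lifespan_gain_pct <= 0
        or m.frequent_miss_context
        or m.storage_gb > 5
    )
    if rollback:
        return "ROLLBACK"
    # Assumption: PASS requires every PASS criterion at once
    passed = (
        m.recall_pct > 80
        and m.cost_usd_month < 5
        and m.lifespan_gain_pct > 30
        and m.cache_hit_pct > 60
        and m.critical_miss_pct < 10
        and m.storage_gb < 1
    )
    return "PASS" if passed else "TUNE"
```

Example: `decide(TrialMetrics(75, 6, 20, 70, 5, 0.8, False))` lands in TUNE because recall and cost sit in the 70-80% and $5-10 bands.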

10. Caveats + risks

10.1 Beta features risk

Feature	Status	Mitigation
Anthropic Memory tool	Beta `context-management-2025-06-27`	Defer until GA; keep using the current MEMORY.md
Anthropic Files API	Beta `files-api-2025-04-14`	Optional add-on, RAG primary
Extended 1h prompt cache	Beta `extended-cache-ttl-2025-04-11`	Use the 5-min default; opt in to 1h for heavy sessions
Voyage AI API	Stable	Production OK
Qdrant local	Stable	Production OK
FastMCP	Stable v2+	Production OK

10.2 Storage concerns

Bro's disk currently: 911/954 GB used = 96% full (43GB free)

RAG storage budget:
  Qdrant binary: ~50MB
  Per project index: ~200-500MB (depends on MD volume)
  Backup snapshots: ~500MB
  Logs + audit: ~100MB

Per project total: ~1GB
2 projects total: ~2GB
+ buffer 1GB
= 3GB recommended free space

→ Clean up BEFORE setup: target 5GB+ free

Cleanup priorities:

node_modules of old projects
.NET bin/obj artifacts
Docker images (docker system prune -a)
Browser caches (Chrome/Edge ~5GB common)
%LOCALAPPDATA% caches (NuGet, dotnet)
Unused Downloads / Videos
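
A pre-flight check for the 5GB free-space target could look like the sketch below. It uses only the standard library; the function names are hypothetical and the path to check is an assumption (pass the drive where the Qdrant data will live).

```python
import shutil

REQUIRED_FREE_GB = 5  # cleanup target from this section


def free_gb(path: str = ".") -> float:
    """Free space on the volume containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_FREE_GB) -> bool:
    """True when the volume has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"free: {free_gb('.'):.1f} GB, enough for RAG setup: {has_enough_space('.')}")
```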

10.3 Quality monitoring

Risk	Indicator	Action
Chunking breaks narrative	Em main reports "miss context"	Review chunk strategy, tune
Embedding drift	Recall drop > 10pp benchmark	Re-embed full, check Voyage updates
Stale index	Committed files not yet re-indexed	Force re-index full, check hook
Poor query phrasing	Low precision on simple queries	Em main refines query patterns
Cross-language mismatch	Vietnamese query misses English content	Multilingual reranker or query expansion
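
The "stale index" indicator can be checked mechanically. A minimal sketch, assuming the indexer keeps a manifest that maps each relative `.md` path to the SHA-256 of the content it last embedded; that manifest and both function names are hypothetical, not something the current scripts are known to produce.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file content, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stale_files(docs_dir: str, manifest: dict[str, str]) -> list[str]:
    """Markdown files whose hash differs from, or is missing in, the index manifest."""
    stale = []
    for path in sorted(Path(docs_dir).rglob("*.md")):
        key = path.relative_to(docs_dir).as_posix()
        if manifest.get(key) != file_digest(path):
            stale.append(key)
    return stale
```

An empty result means every file on disk matches what was indexed; anything listed is a candidate for re-indexing.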

10.4 Fallback strategy

When RAG fails / quality drops:
  Layer 1: Em main falls back to Read full file (existing lazy pattern still works)
  Layer 2: Em main blanket-loads the critical file directly
  Layer 3: Rollback Qdrant snapshot (weekly backup)
  Layer 4: Full re-index from scratch (~15 min)
  Layer 5: Archive RAG, return to the current lazy pattern (ultimate fallback)

The Em main 120K blanket is NOT lost when RAG fails → graceful degradation.
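
Layer 1 of the fallback can be sketched as a thin wrapper. This is an illustration only: `retrieve_with_fallback` is a hypothetical name, and `rag_retrieve` is passed in as any callable that returns chunk texts, so the sketch does not depend on the real MCP tool signature.

```python
from pathlib import Path
from typing import Callable, Sequence


def retrieve_with_fallback(
    query: str,
    rag_retrieve: Callable[[str], Sequence[str]],
    fallback_files: Sequence[str],
) -> list[str]:
    """Try RAG first; on error or empty result, fall back to reading full files."""
    try:
        chunks = list(rag_retrieve(query))
        if chunks:
            return chunks
    except Exception as exc:  # any retrieval failure triggers the fallback
        print(f"RAG retrieve failed ({exc!r}); falling back to full-file Read")
    # Layer 1: the existing lazy pattern, read the whole file(s)
    return [Path(p).read_text(encoding="utf-8") for p in fallback_files if Path(p).is_file()]
```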

10.5 Vietnamese-English mix considerations

Voyage-3-large claims multilingual coverage of 26 languages.
No Vietnamese-specific benchmark is public.

Risk: mixed Vietnamese-English technical jargon may miss synonyms.
  Example: "im lặng 403" vs "silent 403": are the vectors close to each other?
  
Mitigation:
  - Test 10-20 mixed Vietnamese-English queries in the audit benchmark
  - If recall is low → consider voyage-multilingual-2 as a backup
  - Or add query expansion (Anthropic Contextual Retrieval pattern)

11. Success metrics

11.1 Quality metrics

Metric	Target	Measurement
Recall avg (30 query benchmark)	> 80%	Manual score weekly
Precision avg	> 75%	Manual score weekly
Retrieval miss critical rate	< 10%	Em main report cumulative
Cross-language query recall	> 70%	Vietnamese-English mix benchmark
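
Recall and precision for the benchmark can be computed as below. A minimal sketch, assuming each benchmark query has a hand-labeled set of relevant chunk IDs (the manual scoring step) and that retrieval returns chunk IDs ordered best first; the function names and the default `k=5` are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def benchmark_average(rows: list[tuple[list[str], set[str]]], k: int = 5) -> tuple[float, float]:
    """Average (recall, precision) over (retrieved, relevant) pairs, one pair per query."""
    recalls = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in rows]
    precisions = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in rows]
    return sum(recalls) / len(rows), sum(precisions) / len(rows)
```

Running the same functions over only the mixed Vietnamese-English queries gives the cross-language recall row.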

11.2 Cost metrics

Metric	Target	Measurement
Voyage monthly spend	< $5	Voyage dashboard
Total RAG infra cost	< $10/month	Sum tools
Cost per query	< $0.001	Calculated
Disk usage per project	< 1GB	`du -sh`

11.3 Performance metrics

Metric	Target	Measurement
Query latency (P50)	< 200ms	MCP server log
Query latency (P99)	< 500ms	MCP server log
Re-index lag (post-commit)	< 30s	Pre-commit hook timing
Cache hit rate (multi-agent)	> 60%	Custom metric
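
P50 and P99 can be derived from the latencies recorded in the MCP server log. A minimal sketch using the nearest-rank percentile definition; the log parsing is left out because the log format is not specified here, and both function names are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_ok(latencies_ms: list[float]) -> bool:
    """Check the targets in the table: P50 < 200 ms and P99 < 500 ms."""
    return percentile(latencies_ms, 50) < 200 and percentile(latencies_ms, 99) < 500
```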

11.4 Capacity metrics

Metric	Target	Measurement
Session lifespan productive	+50% vs lazy	Time tracker
Tasks before lost-in-middle	> 35	Task counter
Heavy session token	-20% vs lazy	Anthropic dashboard
Multi-agent overlap saving	> 50K/session	Cumulative calc

11.5 Multi-AI client metrics

Metric	Target	Measurement
Active clients	≥ 1 stable	Audit log
Per-client query volume	Track baseline	Audit log per client
Cross-client conflict	0	Bug reports

12. Future enhancements

12.1 Phase 2 (after Week 4 validation)

Enhancement	Effort	Benefit
Upgrade Option B (drop blanket 30-40K)	1 session	Saving +15% tokens
Anthropic Memory tool integration	2-3h	Native cross-conversation memory
Files API integration	2-3h	Reduce blanket re-upload cost
Citations enable	1h	RAG quality trace

12.2 Phase 3 (Month 2-3)

Enhancement	Effort	Benefit
Hybrid BM25 + vector search (Contextual Retrieval)	4-6h	49-67% fewer failed retrievals (per Anthropic Contextual Retrieval)
Multi-project namespace	2-3h	Cross-project query with strict isolation
Reranker model (Cohere rerank-3)	2-3h	+10-20% precision
Custom Streamlit audit dashboard	4-5h	Visual quality monitoring

12.3 Phase 4 (Quarter 2+)

Enhancement	Effort	Benefit
Replace Voyage with Anthropic native embedding (if GA)	2-3h	Reduce vendor count
Auto-tuning chunking (LLM-aided)	1 week	Quality+
Federated multi-machine setup	1 week	Team usage
Time-series analytics on retrieval patterns	1 week	Insights

12.4 Defer indefinitely (over-engineering)

❌ LangChain / LlamaIndex framework (heavy abstraction)
❌ Self-host LLM (cost > value)
❌ Custom embedding model fine-tuning (effort > value)
❌ Full text + vector hybrid index (use Voyage Reranker instead)

13. Multi-agent cumulative cost reality (Anthropic 8-10× warning)

Added S21 turn 2 (2026-05-12): clarification after the user caught the gap "the 120K blanket does NOT include the 4 agents".

Per-entity blanket breakdown

Em main blanket:                    ~120K
  STATUS + HANDOFF top + rules + architecture + 5 agent .md + 
  4 MEMORY.md auto-inject + skills desc + memory critical + 
  auto-inject system reminders

Per sub-agent spawn baseline:       ~80-100K each
  Agent system prompt (~5K) +
  3 skills preload SKILL.md full (~21K, trigger semantic) +
  Auto-inject MEMORY.md 25KB first 200 lines (~7K) +
  Em main pass spec task (~10-15K) +
  Em main paste common context excerpt (~30-50K) +
  Auto-inject project context (~10K)
  = ~80-100K per sub-agent spawn (per Anthropic docs)
  
4 sub-agents cumulative:            ~400K
  (4 × ~100K each, isolated context windows)

TOTAL cumulative blanket 5 entities: ~520K
  Em main + 4 sub-agents combined (isolated windows, cumulative billing)

Context windows are ISOLATED

The 5 entities do NOT share 520K inside a single 1M context window.

Each entity has its OWN 1M context window:
  Em main      → context window 1M, uses ~120K
  Investigator → context window 1M, uses ~100K
  Implementer  → context window 1M, uses ~100K
  Reviewer     → context window 1M, uses ~100K
  CICD Monitor → context window 1M, uses ~100K
  
→ Each entity has its own lost-in-the-middle threshold (~700K each)
→ Each entity has its own capacity of ~58 tasks before hitting its hard cap

BUT billing is CUMULATIVE, 520K across all contexts:
  Anthropic bills the total tokens across all 5 windows
  → Hits the weekly cap 4-5× faster than solo Em main

Heavy session token compound effect (Cách A vs lazy)

Without RAG (lazy current — 4 agents spawn):

Em main:
  Blanket: 120K
  Lazy Read on-demand: ~50K
  Reasoning + coordinate: ~30K
  = ~200K subtotal

4 sub-agents (each):
  Spawn blanket: ~100K
  Lazy Read inside agent: ~50K
  Reasoning + work: ~30K
  Each agent: ~180K
  ──────────────
  4 agents subtotal: ~720K cumulative

SendMessage iteration:
  10 round trips × ~30K nominal: 300K nominal
  Cache hit 70%: ~90K effective

TOTAL HEAVY SESSION (lazy):
  200K + 720K + 90K = ~1010K nominal
  After cache discount: ~700K effective billed

With Cách A RAG:

Em main:
  Blanket: 120K (unchanged)
  RAG retrieve replace lazy Read: ~30K (-20K saving)
  Reasoning streamlined: ~25K
  = ~175K subtotal (saving 25K)

4 sub-agents (each):
  Spawn blanket: ~100K (unchanged)
  RAG retrieve (share cache 70-90% common queries): ~15K
  Reasoning streamlined: ~25K
  Each agent: ~140K (saving 40K each)
  ──────────────
  4 agents subtotal: ~560K (saving 160K total)

SendMessage iteration: ~90K effective (unchanged)

TOTAL HEAVY SESSION (Cách A):
  175K + 560K + 90K = ~825K nominal
  After cache discount: ~560K effective billed
  
SAVING: -140K (-20%)
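
The totals above can be re-derived with a few lines of arithmetic. A minimal sketch with all values in K tokens taken from the two breakdowns; `session_total` is a hypothetical helper, and the ~700K / ~560K "effective billed" figures are the plan's own estimates after cache discount, not something this code computes.

```python
AGENTS = 4


def session_total(main: dict[str, int], per_agent: dict[str, int], sendmessage_effective: int) -> int:
    """Nominal K tokens for one heavy session: Em main + 4 sub-agents + SendMessage."""
    return sum(main.values()) + AGENTS * sum(per_agent.values()) + sendmessage_effective


lazy = session_total(
    main={"blanket": 120, "lazy_read": 50, "reasoning": 30},
    per_agent={"spawn_blanket": 100, "lazy_read": 50, "reasoning": 30},
    sendmessage_effective=90,
)
cach_a = session_total(
    main={"blanket": 120, "rag_retrieve": 30, "reasoning": 25},
    per_agent={"spawn_blanket": 100, "rag_retrieve": 15, "reasoning": 25},
    sendmessage_effective=90,
)
print(lazy, cach_a, lazy - cach_a)  # 1010 825 185 (nominal)

# Effective billed estimates from the plan (after cache discount)
billed_lazy, billed_a = 700, 560
print(f"{(billed_lazy - billed_a) / billed_lazy:.0%}")  # 20%
```

Note the two views: the nominal saving is 185K, while the 140K (-20%) figure applies to the estimated effective billed totals.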

Cost saving breakdown

Component	Lazy current	Cách A	Saving
Em main blanket (fixed)	120K	120K	0
Em main lazy Read → RAG retrieve	50K	30K	-20K
Em main reasoning streamlined	30K	25K	-5K
4 agents spawn blanket (fixed)	400K	400K	0
4 agents lazy Read → cached retrieve	200K	60K	-140K
4 agents reasoning	120K	100K	-20K
SendMessage cached	90K	90K	0
TOTAL NOMINAL (sum of rows above)	1010K	825K	-185K
TOTAL EFFECTIVE BILLED (after cache discount)	~700K	~560K	-140K (-20%)

→ ~80% of the saving comes from the 4 agents sharing the retrieve cache (70-90% cache hit on common cross-agent queries).

→ Em main saves only 25K (blanket unchanged; only Read → retrieve is optimized).

Concrete multi-agent leverage example

Task Plan B Contract V2 wire:
  🔵 Inv query "PE V2 schema pattern" → 15K retrieve + cached
  🟡 Imp query same → cache hit 90% → 1.5K effective
  🔴 Rev query same → cache hit 90% → 1.5K effective
  🟢 CICD query same → cache hit 90% → 1.5K effective
  Em main query same → cache hit 90% → 1.5K effective
  
  Cumulative retrieve cost: 15K + 4×1.5K = 21K
  
Compare to lazy:
  Each agent Read PE V2 file separately
  5 entities × 20K Read = 100K cumulative
  
  → Saving 79K just for 1 cross-agent query
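
The same example as arithmetic. A minimal sketch, assuming the first entity pays the full 15K retrieve and every later entity pays only the uncached 10%; `shared_retrieve_cost` is a hypothetical helper and all values are K tokens from the example above.

```python
def shared_retrieve_cost(first_query_k: float, entities: int, cache_hit: float) -> float:
    """First entity pays full price; the others pay only the uncached fraction."""
    return first_query_k + (entities - 1) * first_query_k * (1 - cache_hit)


rag_cost = shared_retrieve_cost(first_query_k=15, entities=5, cache_hit=0.90)
lazy_cost = 5 * 20  # each of the 5 entities Reads the ~20K file separately
print(f"{rag_cost:.1f} {lazy_cost} {lazy_cost - rag_cost:.1f}")  # 21.0 100 79.0
```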

Optimization tips to reduce cumulative cost

Option 1: Spawn fewer agents

Apply the 6-criteria decision gate to every task (per feedback_multi_agent_setup rule)
Solo Em main is enough → do NOT spawn an agent
Spawn only the agents that are REALLY needed
In S20-S21: the 4 agents are seeds-only and have never been spawned → cost is only the ~120K of Em main

Option 2: Tune blanket sub-agent (100K → 80K)

Em main passes a compact spec (~10K instead of 15K)
Em main pastes a common-context excerpt instead of the full text (~20K instead of 50K)
Skills preload only the description (~3K instead of the 21K full SKILL.md) → load the full SKILL.md on semantic match
Per sub-agent: 100K → 80K
4 agents cumulative: 400K → 320K
Heavy session: 560K → 480K (-15%)

Option 3: SendMessage cache aggressive (1h TTL beta)

Anthropic extended cache extended-cache-ttl-2025-04-11
Cache WRITE for static prompts costs a premium of 2× the base input price
Subsequent reads cost 0.1× the base input price
Multiple agents sharing the same cache prefix → large benefit
Saving 10-15% additional
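
Option 3 in code form. A minimal sketch under stated assumptions: `build_cached_request` only builds the keyword arguments for `anthropic.Anthropic().messages.create`, the model name is supplied by the caller, and sending the beta header is an assumption taken from the beta name listed above (it may no longer be required). `relative_cost` applies the 2× write and 0.1× read multipliers.

```python
def build_cached_request(static_prompt: str, user_message: str, model: str) -> dict:
    """Keyword arguments for messages.create with a 1h-cached system prompt."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": static_prompt,
                # 1h TTL instead of the 5-minute default
                "cache_control": {"type": "ephemeral", "ttl": "1h"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
        "extra_headers": {"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    }


def relative_cost(calls: int, write_mult: float = 2.0, read_mult: float = 0.1) -> float:
    """Cost of the static prefix over `calls` requests, in units of one uncached read."""
    return write_mult + read_mult * (calls - 1)
```

Usage would be `client.messages.create(**build_cached_request(...))`. With 5 calls inside the hour the cached prefix costs about 2.4 units versus 5 uncached, so the 1h cache pays off from the third call onward.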

14. 3-layer hybrid RAG upgrade path (Anthropic Contextual Retrieval)

Added S21 turn 2 (2026-05-12) — Anthropic flagship pattern Sept 2024.

Pattern overview

Anthropic Contextual Retrieval = 3 layers compound:

Layer 1: Embeddings (Voyage-3-large)
  → Semantic + synonym + multilingual catch
  
+ Contextual prefix (Haiku-generated context):
  Add chunk-specific context BEFORE embed
  "This chunk discusses... in context of..."
  → Better recall via enriched vector

Layer 2: BM25 (bm25s Python lib free local)
  → Exact identifier + technical terms (function names, error codes, Mig numbers)
  
+ Contextual BM25 (same prefix pattern)

Layer 3: Reranking (Voyage rerank-2)
  → Cross-attention deep relevance
  → Re-score top 30 candidates → return top 5 truly relevant

Performance compound effect

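Recall figures below are illustrative: a ~50% baseline with the retrieval failure-rate reductions reported by Anthropic applied on top.

Baseline (naive vector embeddings):       ~50% recall

+ Contextual embeddings:                  ~67% recall (-35% failure)

+ Hybrid Contextual + BM25:               ~75% recall (-49% failure)

+ Reranking:                              ~85% recall (-67% failure)

📎 Source for the failure-rate reductions: Anthropic Contextual Retrieval Sept 2024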

Phase	Setup	Recall	Cost/month	Effort additional
Phase 1 (Week 1-4)	Layer 1 vector only (Voyage-3-large)	~70%	~$1.50	10-14h initial
Phase 2 (Month 2)	+ Layer 2 BM25 (bm25s free local)	~78%	~$1.50 unchanged	2-3h
Phase 3 (Month 3)	+ Layer 3 Voyage rerank-2 + Contextual prefix	~92%	~$4-5	3-4h

Phase 1 implementation (basic vector RAG)

Already covered in Sections 5-6 of this plan. Bro implements it during the Week 1-4 trial pilot.

Phase 2 upgrade — Add BM25 hybrid

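# scripts/rag-mcp-server.py (Phase 2 upgrade)
# Assumes `mcp`, `voyage`, `qdrant` and COLLECTION are already defined in this file (Section 6.3).
import bm25s

# Pre-built index; load_corpus=True assumes the index was saved together with its corpus
bm25 = bm25s.BM25.load("./rag-data/bm25_index", load_corpus=True)

@mcp.tool()
def rag_retrieve_hybrid(query, scope="all", k=5):
    # Step 1: Vector search
    query_vec = voyage.embed([query], model="voyage-3-large").embeddings[0]
    vector_results = qdrant.search(collection_name=COLLECTION, query_vector=query_vec, limit=20)

    # Step 2: BM25 search (local Python lib); bm25s expects a tokenized query
    bm25_docs, _bm25_scores = bm25.retrieve(bm25s.tokenize(query), k=20)
    bm25_results = list(bm25_docs[0])  # hits for the single query, best first

    # Step 3: Merge + dedup + score combine (RRF, reciprocal rank fusion).
    # reciprocal_rank_fusion is a project helper (not part of bm25s or qdrant-client):
    # it takes both ranked lists and returns the deduplicated union (~30 chunks), best first.
    fused = reciprocal_rank_fusion(vector_results, bm25_results)

    return fused[:k]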
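
The fusion step relies on a `reciprocal_rank_fusion` helper. A minimal sketch of standard RRF, assuming each result list has first been reduced to chunk IDs ordered best first; `rrf_k=60` is the constant commonly used for RRF and is an assumption here, as are the sample IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], rrf_k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs; score = sum over lists of 1 / (rrf_k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; a chunk found by both searches is counted once (dedup)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ids = ["c7", "c2", "c9"]  # hypothetical vector-search hits, best first
bm25_ids = ["c2", "c4", "c7"]    # hypothetical BM25 hits, best first
fused = reciprocal_rank_fusion(vector_ids, bm25_ids)
print([chunk_id for chunk_id, _ in fused])  # ['c2', 'c7', 'c4', 'c9']
```

Chunks found by both searches (`c2`, `c7`) rise above chunks found by only one, which is the behaviour the hybrid step depends on. RRF uses ranks only, so the vector and BM25 scores never need to be put on a common scale.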

Phase 3 upgrade — Full Anthropic Contextual

# scripts/rag-indexer.py (Phase 3 upgrade: contextual prefix)
import anthropic

claude_haiku = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(chunk_content, full_doc_path):
    """Generate context prefix using Claude Haiku (cheap model)."""
    with open(full_doc_path, encoding="utf-8") as f:
        full_doc = f.read()

    response = claude_haiku.messages.create(
        model="claude-haiku-4-5",  # cheap ~$0.0001/chunk
        max_tokens=150,
        messages=[{
            "role": "user",
            # Only the first 5000 characters of the document are sent
            "content": f"""<document>
{full_doc[:5000]}
</document>

<chunk>
{chunk_content}
</chunk>

Give a brief context (50-100 words) explaining what this chunk is about and where it fits in the document. Be specific."""
        }]
    )

    return response.content[0].text

# In indexer pipeline (`chunks` comes from the existing chunking step):
for chunk in chunks:
    context = contextualize_chunk(chunk["content"], chunk["source"])
    chunk["content_enriched"] = f"{context}\n\n{chunk['content']}"
    # Embed enriched version → better recall

# scripts/rag-mcp-server.py (final upgrade: reranking)
import voyageai

voyage = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

@mcp.tool()
def rag_retrieve_full(query, scope="all", k=5):
    # Step 1-3: Same as Phase 2 (vector + BM25 + merge); hybrid_search is a project helper
    candidates = hybrid_search(query, scope, top=30)

    # Step 4: Voyage Rerank
    rerank_response = voyage.rerank(
        query=query,
        documents=[c.content for c in candidates],
        model="rerank-2",  # ~$0.05 per 1000 queries
        top_k=k,
    )

    # Each result carries the index of the candidate it refers to, best first
    return [candidates[r.index] for r in rerank_response.results]

Cost incremental analysis

Phase 1 → Phase 3 incremental cost:

Phase 1 (basic vector):
  Voyage embed: ~$0.36 initial + ~$0.20/mo delta
  = ~$1.50/mo total
  
Phase 2 (+BM25):
  BM25 free local (Python lib)
  Embedding cost same
  = ~$1.50/mo total (unchanged)

Phase 3 (+Reranking + Contextual):
  Voyage rerank-2: ~$0.05 per 1000 queries
  600 queries/mo × $0.05/1K = $0.03/mo
  
  Haiku contextual prefix: ~$0.0001 per chunk
  Initial 5000 chunks × $0.0001 = $0.50 one-time
  Delta ~100 chunks/mo × $0.0001 = $0.01/mo
  
  + Voyage rerank monthly: ~$0.05/mo per 1K queries × 5 projects
  + Re-embed enriched chunks: ~$0.50/mo
  = ~$4-5/mo total

→ Quality jump 70% → 92% recall = +22pp
→ Cost jump $1.50 → $4-5/mo = +$3
→ Worth it after Phase 1 validation

Why incremental rollout (vs all-in Phase 3 immediate)

Validate Layer 1 quality first: if Voyage handles Vietnamese poorly, upgrading to Phase 2-3 is pointless
Measure baseline cost: know the exact Voyage spend before adding rerank/contextual
Identify retrieval miss patterns: the Phase 1 trial reveals weaknesses → Phase 2-3 target the fix
Risk-averse setup: each phase adds 2-3h, easy rollback if it fails
Preserve the §6.5 narrative rule: do NOT over-engineer, build incrementally

When to skip Phase 2-3

Phase 1 recall already > 85% → Phase 2-3 bring only marginal benefit (Vietnamese-specific corpus)
Monthly budget under $5 → staying on Phase 1 is OK
Solo dev without heavy use of exact Vietnamese terms → BM25 is less impactful

When you MUST upgrade to Phase 2-3

Recall < 70% on benchmark → indicates Phase 1 is insufficient
Em main frequently reports "miss exact identifier" → Phase 2 BM25 is critical
Multi-language queries are common → Phase 3 reranker stabilizes results
Production quality target > 90% → Phase 3 required

📚 References + tools

Anthropic official

Tools docs

Project memory

feedback_md_compact_narrative.md (§6.5 rule — KEEP narrative)
feedback_multi_agent_setup.md (4-agent discipline)
feedback_drastic_refactor_scope.md (RAG setup = dedicated session)
feedback_uat_skip_verify.md (Phase 9 UAT mode)

✅ Pre-implementation checklist

☐ Bro confirms 3 items:
  ☐ Paths of the 2 projects (so Investigator can audit the MD inventory pre-flight)
  ☐ Stack of the 2 projects (BE: .NET/Node/Python? FE: React/Vue?)
  ☐ Chosen pilot project (the smaller of the 2)
  
☐ Bro prepares the environment:
  ☐ Disk cleanup 5GB+ free (current 911/954 = 96% full)
  ☐ Voyage AI account signup + API key
  ☐ Python 3.10+ installed
  ☐ Git installed (for the pre-commit hook)
  
☐ Bro schedules a dedicated session:
  ☐ 10-14h block on one weekend day (per memory rule feedback_drastic_refactor_scope)
  ☐ Reserve ~30% of the weekly cap for RAG setup spawn cost
  
☐ Bro reviews the plan:
  ☐ Read this file in full
  ☐ Confirm the blanket vs RAG store scope matches needs
  ☐ Confirm the tool stack is acceptable
  ☐ Approve the Week 1-4 trial timeline

📝 Notes — keep updated

2026-05-12 turn 1: Plan saved after S21 turn 1 finalized cicd-monitor. Cross-project reference for Bro's 2 future projects with > 1M MD tokens. SOLUTION_ERP baseline ~354K MD (RAG not needed yet, deferred).
Status: 📝 PLAN ONLY, not yet implemented
Next trigger: Bro confirms the 3 items → spawn 🔵 Investigator to audit the MD inventory of the 2 projects → fine-tune the blanket list for each project


