[CLAUDE] Docs: finalize Session 21 turn 2 (RAG Hybrid setup planning + Approach A validation)

After S21 turn 1 finalized cicd-monitor, the user clarified that 5 future projects have > 1M MD tokens, which led to a deep discussion (~15 turns) about RAG infrastructure. The main agent worked solo (no SOLUTION_ERP sub-agent spawned) and delegated research on Anthropic and community practice to claude-code-guide × 2.

Decisions:
- Approach A, defensive (keep the main agent's 120K blanket + supplement with RAG retrieval)
- Drop Approach B, aggressive (cut 60-70% of the blanket): it violates the priority of keeping the main agent's control flow strong
- Industry-validated across 4 Anthropic blog posts + 5 community tools (Cursor/Continue/Cline/Aider all hybrid)
- 3-layer pattern, Phase 1-3 incremental rollout (vector → +BM25 → +reranking, recall ~70% → ~92%)
- Stack: Voyage-3-large + Qdrant local + FastMCP Python + Streamlit dashboard

Multi-agent cost reality clarification (post-S21 t2):
- Main agent blanket: ~120K
- 4 sub-agent spawns, cumulative: ~400K
- Total billed for a heavy session: ~560K with Approach A vs ~700K lazy
- Saving of -20% from the multi-agent shared cache (70-90% hit rate)
- Anthropic acknowledges an 8-10× multiplier for multi-agent

Files updated:
- docs/STATUS.md (Last updated S21 turn 2 + Recently Done row top)
- docs/HANDOFF.md (TL;DR Session 21 turn 2 section + Last updated)
- docs/rag-setup-plan.md (+Section 13 multi-agent cost reality + Section 14 3-layer hybrid Phase 1-3, +355 LOC)
- docs/changelog/sessions/2026-05-12-1800-s21-turn2-rag-planning.md (new session log)

User-level memory update (outside repo, separate update):
- feedback_rag_hybrid_pattern.md (NEW reusable cross-project pattern)
- MEMORY.md index (+1 entry pointer)

Plan I (NEW) deferred. Trigger: the user confirms the 5 project paths + stack + pilot + Voyage API + disk cleanup → dedicated 10-14h weekend session (per feedback_drastic_refactor_scope rule).

Stats:
- 17 memory entries (+1 RAG hybrid)
- 1 plan file rag-setup-plan.md (1500 LOC final)
- 4 sub-agents seeds-only, unchanged
- 81 tests unchanged
- 4 commits S21 cumulative (f1c61c9 + 3a34831 + 1f8e9af + this)

CI skipped per path filter (all .md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:50:28 +07:00
parent 1f8e9af66f
commit 0a3b747612
4 changed files with 783 additions and 2 deletions
--- a/docs/rag-setup-plan.md
+++ b/docs/rag-setup-plan.md
@@ -1165,6 +1165,361 @@ Mitigation:

 ---

+## 13. Multi-agent cumulative cost reality (Anthropic 8-10× warning)
+
+> **Added S21 turn 2 (2026-05-12)**: clarification after the user caught the gap "the 120K blanket does NOT include the 4 agents".
+
+### Per-entity blanket breakdown
+
+```
+Main agent blanket:                 ~120K
+  STATUS + HANDOFF top + rules + architecture + 5 agent .md +
+  4 MEMORY.md auto-inject + skills desc + memory critical +
+  auto-inject system reminders
+
+Per sub-agent spawn baseline:       ~80-100K each
+  Agent system prompt (~5K) +
+  3 skills preload SKILL.md full (~21K, trigger semantic) +
+  Auto-inject MEMORY.md 25KB first 200 lines (~7K) +
+  Task spec passed by the main agent (~10-15K) +
+  Common context excerpt pasted by the main agent (~30-50K) +
+  Auto-inject project context (~10K)
+  = ~80-100K per sub-agent spawn (per Anthropic docs)
+
+4 sub-agents cumulative:            ~400K
+  (4 × ~100K each, isolated context windows)
+
+TOTAL cumulative blanket 5 entities: ~520K
+  Main agent + 4 sub-agents combined (isolated windows, cumulative billing)
+```
+
+### Context windows are ISOLATED
+
+```
+The 5 entities do NOT share 520K inside a single 1M context window.
+
+Each entity has its OWN 1M context window:
+  Main agent   → 1M context window, uses ~120K
+  Investigator → 1M context window, uses ~100K
+  Implementer  → 1M context window, uses ~100K
+  Reviewer     → 1M context window, uses ~100K
+  CICD Monitor → 1M context window, uses ~100K
+
+→ Each entity has its own lost-in-the-middle threshold (~700K each)
+→ Each entity has its own capacity of ~58 tasks before hitting its hard cap
+
+BUT billing is CUMULATIVE: 520K across all contexts
+  Anthropic bills the total tokens across all 5 windows
+  → The weekly cap is hit 4-5× faster than with the main agent working solo
+```
+
+### Heavy session token compound effect (Approach A vs lazy)
+
+**Without RAG (current lazy reads, 4 agents spawned):**
+
+```
+Main agent:
+  Blanket: 120K
+  Lazy Read on-demand: ~50K
+  Reasoning + coordinate: ~30K
+  = ~200K subtotal
+
+4 sub-agents (each):
+  Spawn blanket: ~100K
+  Lazy Read inside agent: ~50K
+  Reasoning + work: ~30K
+  Each agent: ~180K
+  ──────────────
+  4 agents subtotal: ~720K cumulative
+
+SendMessage iteration:
+  10 round trips × ~30K nominal: 300K nominal
+  Cache hit 70%: ~90K effective
+
+TOTAL HEAVY SESSION (lazy):
+  200K + 720K + 90K = ~1010K nominal
+  After cache discount: ~700K effective billed
+```
+
+**With Approach A RAG:**
+
+```
+Main agent:
+  Blanket: 120K (unchanged)
+  RAG retrieve replace lazy Read: ~30K (-20K saving)
+  Reasoning streamlined: ~25K
+  = ~175K subtotal (saving 25K)
+
+4 sub-agents (each):
+  Spawn blanket: ~100K (unchanged)
+  RAG retrieve (share cache 70-90% common queries): ~15K
+  Reasoning streamlined: ~25K
+  Each agent: ~140K (saving 40K each)
+  ──────────────
+  4 agents subtotal: ~560K (saving 160K total)
+
+SendMessage iteration: ~90K effective (unchanged)
+
+TOTAL HEAVY SESSION (Approach A):
+  175K + 560K + 90K = ~825K nominal
+  After cache discount: ~560K effective billed
+  
+SAVING: -140K (-20%)
+```
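The two scenario totals above can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the dictionary names (`LAZY`, `WITH_RAG`) and the helper `nominal_total` are hypothetical, and the effective billed figures are copied from the estimates above rather than derived, since the cache discount is not broken down.

```python
# Token estimates (in thousands) copied from the two scenarios above.
LAZY = {"main": 120 + 50 + 30, "per_agent": 100 + 50 + 30, "sendmessage": 90}
WITH_RAG = {"main": 120 + 30 + 25, "per_agent": 100 + 15 + 25, "sendmessage": 90}
N_AGENTS = 4


def nominal_total(scenario: dict, n_agents: int = N_AGENTS) -> int:
    """Main agent + all sub-agents + SendMessage traffic, in thousands of tokens."""
    return scenario["main"] + n_agents * scenario["per_agent"] + scenario["sendmessage"]


lazy_nominal = nominal_total(LAZY)          # 200 + 720 + 90
with_rag_nominal = nominal_total(WITH_RAG)  # 175 + 560 + 90

# Effective billed figures are the estimates stated above, not computed here.
LAZY_BILLED, WITH_RAG_BILLED = 700, 560
saving = LAZY_BILLED - WITH_RAG_BILLED
saving_pct = round(100 * saving / LAZY_BILLED)

print(lazy_nominal, with_rag_nominal, saving, saving_pct)
```

Note that the nominal gap (1010K vs 825K) is larger than the billed gap (700K vs 560K), because the cache discount is applied to both scenarios.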
+
+### Cost saving breakdown
+
+| Component | Lazy current | Approach A | Saving |
+|---|---:|---:|---:|
+| Main agent blanket (fixed) | 120K | 120K | 0 |
+| Main agent lazy Read → RAG retrieve | 50K | 30K | -20K |
+| Main agent reasoning streamlined | 30K | 25K | -5K |
+| 4 agents spawn blanket (fixed) | 400K | 400K | 0 |
+| 4 agents lazy Read → cached retrieve | 200K | 60K | **-140K** |
+| 4 agents reasoning | 120K | 100K | -20K |
+| SendMessage cached | 90K | 90K | 0 |
+| Total nominal (sum of rows above) | ~1010K | ~825K | -185K |
+| **TOTAL EFFECTIVE BILLED** (after cache discount) | **~700K** | **~560K** | **-140K (-20%)** |
+
+→ **Roughly 80% of the saving comes from the 4 agents** sharing the retrieve cache (70-90% cache hit on common cross-agent queries).
+
+→ The main agent saves only 25K (its blanket is unchanged; only Read → retrieve is optimized).
+
+### Concrete multi-agent leverage example
+
+```
+Task Plan B Contract V2 wire:
+  🔵 Inv query "PE V2 schema pattern" → 15K retrieve + cached
+  🟡 Imp query same → cache hit 90% → 1.5K effective
+  🔴 Rev query same → cache hit 90% → 1.5K effective
+  🟢 CICD query same → cache hit 90% → 1.5K effective
+  Main agent query same → cache hit 90% → 1.5K effective
+  
+  Cumulative retrieve cost: 15K + 4×1.5K = 21K
+  
+Compare to lazy:
+  Each agent Read PE V2 file separately
+  5 entities × 20K Read = 100K cumulative
+  
+  → Saving 79K just for 1 cross-agent query
+```
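The cross-agent example above follows a simple pattern: the first query pays full price and every later identical query pays only the cache-miss share. A minimal sketch, assuming a uniform hit rate for every follow-up query (the function name is hypothetical):

```python
def cumulative_retrieve_cost(first_cost_k: float, n_entities: int, cache_hit: float) -> float:
    """Cost in thousands of tokens when n_entities issue the same query.

    The first entity pays first_cost_k; each later entity pays only the
    uncached fraction (1 - cache_hit) of that cost.
    """
    return first_cost_k + (n_entities - 1) * first_cost_k * (1.0 - cache_hit)


rag_cost = cumulative_retrieve_cost(first_cost_k=15, n_entities=5, cache_hit=0.90)
lazy_cost = 5 * 20  # each of the 5 entities reads the ~20K file separately
saving = lazy_cost - rag_cost

print(round(rag_cost, 1), lazy_cost, round(saving, 1))
```

Lowering `cache_hit` to 0.70 (the bottom of the stated 70-90% range) raises the retrieve cost, so the saving is sensitive to how often agents really repeat the same query.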
+
+### Optimization tips to reduce the cumulative total
+
+**Option 1: Spawn fewer agents**
+- Apply the 6-criteria decision gate to every task (per `feedback_multi_agent_setup` rule)
+- If the main agent can handle it solo → do NOT spawn an agent
+- Spawn only the agents that are REALLY needed
+- In S20-S21: the 4 agents are seeds-only and none has been spawned yet → cost is only the main agent's ~120K
+
+**Option 2: Tune the sub-agent blanket (100K → 80K)**
+- Main agent passes a leaner spec (~10K instead of 15K)
+- Main agent pastes a common context excerpt instead of the full text (~20K instead of 50K)
+- Skills preload the description only (~3K instead of the 21K full SKILL.md)
+  → Load the full SKILL.md only on a semantic match
+- Per sub-agent: 100K → 80K
+- 4 agents cumulative: 400K → 320K
+- Heavy session: 560K → 480K (-15%)
+
+**Option 3: Aggressive SendMessage caching (1h TTL beta)**
+- Anthropic extended cache `extended-cache-ttl-2025-04-11`
+- Cache WRITE for static prompts costs a premium of 2× base
+- Subsequent reads are billed at 0.1× base
+- Multiple agents sharing the same cache prefix → large benefit
+- Additional saving of 10-15%
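To see when the 2× write premium in Option 3 pays off, the multipliers can be turned into a break-even check. This is a rough sketch under two assumptions: all uses send an identical prefix, and all of them land inside the cache TTL. The function names are hypothetical.

```python
WRITE_MULT = 2.0  # 1h-TTL cache write, relative to the base input price
READ_MULT = 0.1   # cache read, relative to the base input price


def cached_cost(n_uses: int) -> float:
    """Relative cost of sending the same prefix n_uses times: one write, then reads."""
    return WRITE_MULT + READ_MULT * (n_uses - 1)


def break_even_uses() -> int:
    """Smallest number of uses at which caching is cheaper than sending uncached."""
    n = 1
    while cached_cost(n) >= n:  # uncached cost is n × base
        n += 1
    return n


print(break_even_uses(), round(cached_cost(5), 2))
```

Under these assumptions caching wins from the third use onward, and five entities sharing one prefix pay about 2.4× base instead of 5×.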
+
+---
+
+## 14. 3-layer hybrid RAG upgrade path (Anthropic Contextual Retrieval)
+
+> **Added S21 turn 2 (2026-05-12)** — Anthropic flagship pattern Sept 2024.
+
+### Pattern overview
+
+```
+Anthropic Contextual Retrieval = 3 layers compound:
+
+Layer 1: Embeddings (Voyage-3-large)
+  → Semantic + synonym + multilingual catch
+  
++ Contextual prefix (Haiku-generated context):
+  Add chunk-specific context BEFORE embed
+  "This chunk discusses... in context of..."
+  → Better recall via enriched vector
+
+Layer 2: BM25 (bm25s Python lib free local)
+  → Exact identifier + technical terms (function names, error codes, Mig numbers)
+  
++ Contextual BM25 (same prefix pattern)
+
+Layer 3: Reranking (Voyage rerank-2)
+  → Cross-attention deep relevance
+  → Re-score top 30 candidates → return top 5 truly relevant
+```
+
+### Performance compound effect
+
+```
+Baseline (naive vector embeddings):       ~50% recall
+
++ Contextual embeddings:                  ~67% recall (-35% failure)
+
++ Hybrid Contextual + BM25:               ~75% recall (-49% failure)
+
++ Reranking:                              ~85% recall (-67% failure)
+```
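The recall figures above appear to come from applying each failure-rate reduction to the ~50% baseline. A small sketch of that conversion, assuming "failure" simply means 1 minus recall (the function name is hypothetical):

```python
def recall_after(baseline_recall: float, failure_reduction: float) -> float:
    """Recall after cutting the failure rate (1 - recall) by failure_reduction."""
    failure = 1.0 - baseline_recall
    return 1.0 - failure * (1.0 - failure_reduction)


BASELINE = 0.50  # the ~50% baseline stated above
for label, reduction in [
    ("contextual embeddings", 0.35),
    ("+ contextual BM25", 0.49),
    ("+ reranking", 0.67),
]:
    print(f"{label}: {recall_after(BASELINE, reduction):.1%}")
```

With a 50% baseline this gives 67.5%, 74.5%, and 83.5%, so the ~85% shown for the reranking step is a slight round-up. A different baseline shifts every absolute figure, which is one reason to measure recall on the real corpus during Phase 1.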
+
+📎 Source: [Anthropic Contextual Retrieval Sept 2024](https://www.anthropic.com/news/contextual-retrieval)
+
+### Incremental phase rollout (recommended for the user)
+
+| Phase | Setup | Recall | Cost/month | Effort additional |
+|---|---|---:|---:|---|
+| **Phase 1** (Week 1-4) | Layer 1 vector only (Voyage-3-large) | ~70% | ~$1.50 | 10-14h initial |
+| **Phase 2** (Month 2) | + Layer 2 BM25 (bm25s free local) | ~78% | ~$1.50 unchanged | 2-3h |
+| **Phase 3** (Month 3) | + Layer 3 Voyage rerank-2 + Contextual prefix | ~92% | ~$4-5 | 3-4h |
+
+### Phase 1 implementation (basic vector RAG)
+
+Already covered in Sections 5-6 of the plan. The user implements it as a Week 1-4 trial pilot.
+
+### Phase 2 upgrade — Add BM25 hybrid
+
+```python
+# scripts/rag-mcp-server.py (upgrade)
+# Assumes `mcp`, `voyage`, `qdrant`, and `COLLECTION` already exist in this script
+# (Phase 1), that `reciprocal_rank_fusion` is a helper defined separately, and that
+# BM25 corpus indices match Qdrant point IDs.
+import bm25s
+
+bm25 = bm25s.BM25.load("./rag-data/bm25_index")  # pre-built index
+
+@mcp.tool()
+def rag_retrieve_hybrid(query, scope="all", k=5):
+    # Step 1: Vector search
+    query_vec = voyage.embed(
+        [query], model="voyage-3-large", input_type="query"
+    ).embeddings[0]
+    vector_ids = [hit.id for hit in qdrant.search(COLLECTION, query_vec, limit=20)]
+
+    # Step 2: BM25 search (local Python lib); bm25s expects a tokenized query
+    bm25_hits, _scores = bm25.retrieve(bm25s.tokenize(query), k=20)
+    bm25_ids = [int(i) for i in bm25_hits[0]]
+
+    # Steps 3-4: merge, dedup, and rank in one pass with RRF (reciprocal rank
+    # fusion); the union of both lists gives roughly 30 unique candidates
+    fused = reciprocal_rank_fusion(vector_ids, bm25_ids)
+
+    return fused[:k]
+```
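The `reciprocal_rank_fusion` step called in the code above is not defined in this diff. One possible sketch follows, assuming each input is a list of document IDs ordered best-first; the smoothing constant `c=60` is a commonly used default rather than a value taken from this plan.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings, c=60):
    """Fuse several best-first ID lists into one ranking.

    Each document scores sum(1 / (c + rank)) over the lists it appears in,
    so items ranked well by both retrievers rise to the top. Duplicates are
    merged automatically because scores are keyed by document ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy example: IDs are placeholders, not real chunk IDs.
vector_ids = ["a", "b", "c"]
bm25_ids = ["b", "d", "a"]
fused = reciprocal_rank_fusion(vector_ids, bm25_ids)
print([doc_id for doc_id, _score in fused])
```

RRF uses only ranks, so the cosine scores from the vector search and the BM25 scores never need to be normalized against each other.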
+
+### Phase 3 upgrade — Full Anthropic Contextual
+
+```python
+# scripts/rag-indexer.py (upgrade with contextual prefix)
+import anthropic
+
+claude_haiku = anthropic.Anthropic()
+
+def contextualize_chunk(chunk_content, full_doc_path):
+    """Generate context prefix using Claude Haiku (cheap model)."""
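+    # Read as UTF-8 explicitly (the docs contain Vietnamese) and close the file promptly
+    with open(full_doc_path, encoding="utf-8") as f:
+        full_doc = f.read()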
+    
+    response = claude_haiku.messages.create(
+        model="claude-haiku-4-5",  # cheap ~$0.0001/chunk
+        max_tokens=150,
+        messages=[{
+            "role": "user",
+            "content": f"""<document>
+{full_doc[:5000]}
+</document>
+
+<chunk>
+{chunk_content}
+</chunk>
+
+Give a brief context (50-100 words) explaining what this chunk is about and where it fits in the document. Be specific."""
+        }]
+    )
+    
+    return response.content[0].text
+
+# In indexer pipeline:
+for chunk in chunks:
+    context = contextualize_chunk(chunk["content"], chunk["source"])
+    chunk["content_enriched"] = f"{context}\n\n{chunk['content']}"
+    # Embed enriched version → better recall
+```
+
+```python
+# scripts/rag-mcp-server.py (final upgrade with reranking)
+import voyageai
+
+@mcp.tool()
+def rag_retrieve_full(query, scope="all", k=5):
+    # Step 1-3: Same as Phase 2 (vector + BM25 + merge)
+    candidates = hybrid_search(query, scope, top=30)
+    
+    # Step 4: Voyage Rerank
+    rerank_response = voyage.rerank(
+        query=query,
+        documents=[c.content for c in candidates],
+        model="rerank-2",  # Voyage reranker model ID; ~$0.05 per 1000 queries
+        top_k=k
+    )
+    
+    return [candidates[r.index] for r in rerank_response.results]
+```
+
+### Cost incremental analysis
+
+```
+Phase 1 → Phase 3 incremental cost:
+
+Phase 1 (basic vector):
+  Voyage embed: ~$0.36 initial + ~$0.20/mo delta
+  = ~$1.50/mo total
+  
+Phase 2 (+BM25):
+  BM25 free local (Python lib)
+  Embedding cost same
+  = ~$1.50/mo total (unchanged)
+
+Phase 3 (+Reranking + Contextual):
+  Voyage rerank-2: ~$0.05 per 1000 queries
+  600 queries/mo × $0.05/1K = $0.03/mo
+  
+  Haiku contextual prefix: ~$0.0001 per chunk
+  Initial 5000 chunks × $0.0001 = $0.50 one-time
+  Delta ~100 chunks/mo × $0.0001 = $0.01/mo
+  
+  + Voyage rerank monthly: ~$0.05/mo per 1K queries × 5 projects
+  + Re-embed enriched chunks: ~$0.50/mo
+  = ~$4-5/mo total
+
+→ Quality jump 70% → 92% recall = +22pp
+→ Cost jump $1.50 → $4-5/mo = +$3
+→ Worth it after Phase 1 validation
+```
+
+### Why incremental rollout (vs all-in Phase 3 immediate)
+
+1. **Validate Layer 1 quality first**: if Voyage handles Vietnamese poorly, upgrading to Phase 2-3 is pointless
+2. **Measure baseline cost**: know the exact Voyage spend before adding rerank/contextual
+3. **Identify retrieval miss patterns**: the Phase 1 trial reveals weaknesses → target the Phase 2-3 fixes
+4. **Risk-averse setup**: each later phase adds only 2-4h, and rollback is easy if it fails
+5. **Preserve the §6.5 narrative**: do NOT over-engineer; build incrementally
+
+### When to skip Phase 2-3
+
+- Phase 1 recall already > 85% → Phase 2-3 bring only marginal benefit (Vietnamese-specific corpus)
+- Monthly budget < $5 → staying on Phase 1 is fine
+- Solo dev whose queries rarely depend on exact (Vietnamese) terms → BM25 has less impact
+
+### When you MUST upgrade to Phase 2-3
+
+- Recall < 70% on the benchmark → indicates Phase 1 is insufficient
+- The main agent frequently reports "missed exact identifier" → Phase 2 BM25 is critical
+- Multi-language queries are common → the Phase 3 reranker stabilizes results
+- Production quality target > 90% → Phase 3 required
+
+---
+
 ## 📚 References + tools

 ### Anthropic official