[CLAUDE] Docs: finalize Session 21 turn 2, RAG hybrid setup planning + Cách A validation

After S21 turn 1 finalized cicd-monitor, the user clarified that 5 future projects will exceed 1M Markdown tokens, which led to a deep discussion (~15 turns) on RAG infrastructure. The main agent worked solo (no SOLUTION_ERP sub-agents spawned) and delegated research on Anthropic and community practice to claude-code-guide × 2.

Decisions finalized:
- Cách A (Approach A, defensive): keep the main agent's 120K blanket and add RAG retrieval as a supplement
- Cách B (Approach B, aggressive: cut 60-70% of the blanket) is dropped because it violates the priority of keeping the main agent's control flow strong
- Industry-validated across 4 Anthropic blog posts + 5 community tools (Cursor/Continue/Cline/Aider are all hybrid)
- 3-layer pattern, incremental rollout over Phases 1-3 (vector → +BM25 → +reranking, recall ~70% → ~92%)
- Stack: Voyage-3-large + Qdrant local + FastMCP Python + Streamlit dashboard

Multi-agent cost reality, clarified (post-S21 t2):
- Main agent blanket: ~120K
- 4 sub-agent spawns, cumulative: ~400K
- Total billed for a heavy session: ~560K with Cách A vs ~700K lazy
- Saving of ~20% from the multi-agent shared cache (70-90% hit rate)
- Anthropic acknowledges an 8-10× multiplier for multi-agent setups

Files updated:
- docs/STATUS.md (Last updated S21 turn 2 + Recently Done row top)
- docs/HANDOFF.md (TL;DR Session 21 turn 2 section + Last updated)
- docs/rag-setup-plan.md (+Section 13 multi-agent cost reality + Section 14 3-layer hybrid Phase 1-3, +355 LOC)
- docs/changelog/sessions/2026-05-12-1800-s21-turn2-rag-planning.md (new session log)

Memory user-level update (outside repo, separate update):
- feedback_rag_hybrid_pattern.md (NEW cross-project pattern reusable)
- MEMORY.md index (+1 entry pointer)

Plan I (NEW) is deferred. Trigger: the user confirms the 5 project paths + stack + pilot + Voyage API + disk cleanup → dedicated 10-14h weekend session (per the feedback_drastic_refactor_scope rule).

Stats:
- 17 memory entries (+1 RAG hybrid)
- 1 plan file rag-setup-plan.md (1500 LOC final)
- 4 sub-agents seeds-only unchanged
- 81 tests unchanged
- 4 commits S21 cumulative (f1c61c9 + 3a34831 + 1f8e9af + this)

CI skipped per path filter (all .md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pqhuy1987
2026-05-12 18:50:28 +07:00
parent 1f8e9af66f
commit 0a3b747612
4 changed files with 783 additions and 2 deletions


@ -1165,6 +1165,361 @@ Mitigation:
---
## 13. Multi-agent cumulative cost reality (Anthropic 8-10× warning)
> **Added S21 turn 2 (2026-05-12)**: clarification after the user caught the gap that the "120K blanket does NOT include the 4 agents".
### Per-entity blanket breakdown
```
Main agent blanket: ~120K
STATUS + HANDOFF top + rules + architecture + 5 agent .md +
4 MEMORY.md auto-inject + skills desc + memory critical +
auto-inject system reminders
Per sub-agent spawn baseline: ~80-100K each
Agent system prompt (~5K) +
3 skills preload SKILL.md full (~21K, trigger semantic) +
Auto-inject MEMORY.md 25KB first 200 lines (~7K) +
Main agent passes the task spec (~10-15K) +
Main agent pastes a common context excerpt (~30-50K) +
Auto-inject project context (~10K)
= ~80-100K per sub-agent spawn (per Anthropic docs)
4 sub-agents cumulative: ~400K
(4 × ~100K each, isolated context windows)
TOTAL cumulative blanket 5 entities: ~520K
Main agent + 4 sub-agents combined (isolated windows, cumulative billing)
```
### Context windows are ISOLATED
```
The 5 entities do NOT share 520K inside a single 1M context window.
Each entity has its OWN 1M context window:
Main agent → context window 1M, uses ~120K
Investigator → context window 1M, uses ~100K
Implementer → context window 1M, uses ~100K
Reviewer → context window 1M, uses ~100K
CICD Monitor → context window 1M, uses ~100K
→ Each entity has its own LOST-IN-MIDDLE threshold (~700K each)
→ Each entity has its own capacity of ~58 tasks before hitting its hard cap
BUT billing is CUMULATIVE: 520K across all contexts:
Anthropic bills the total tokens across all 5 windows
→ The weekly cap is hit 4-5× faster than with a solo main agent
```
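The difference between per-window usage and cumulative billing can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures from the breakdown above; the function and constant names are illustrative, not part of the plan's scripts.

```python
# Figures from the breakdown above, in thousands of tokens (K).
MAIN_BLANKET_K = 120
SUB_AGENT_BLANKET_K = 100  # upper end of the ~80-100K per-spawn estimate


def cumulative_blanket_k(num_sub_agents: int) -> int:
    """Billed blanket: the main agent plus every spawned sub-agent (summed)."""
    return MAIN_BLANKET_K + num_sub_agents * SUB_AGENT_BLANKET_K


def largest_window_k(num_sub_agents: int) -> int:
    """Largest single context window in use. Windows are isolated,
    so capacity is a max over entities, not a sum."""
    sub = SUB_AGENT_BLANKET_K if num_sub_agents > 0 else 0
    return max(MAIN_BLANKET_K, sub)


total_k = cumulative_blanket_k(4)
ratio_vs_solo = round(total_k / MAIN_BLANKET_K, 1)
print(total_k, largest_window_k(4), ratio_vs_solo)  # 520 120 4.3
```

No single window comes close to the 1M limit (the largest holds ~120K), while the billed total is about 4.3× the solo main agent, which is where the 4-5× figure comes from.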
### Heavy session token compound effect (Cách A vs lazy)
**Without RAG (lazy current — 4 agents spawn):**
```
Main agent:
Blanket: 120K
Lazy Read on-demand: ~50K
Reasoning + coordinate: ~30K
= ~200K subtotal
4 sub-agents (each):
Spawn blanket: ~100K
Lazy Read inside agent: ~50K
Reasoning + work: ~30K
Each agent: ~180K
──────────────
4 agents subtotal: ~720K cumulative
SendMessage iteration:
10 round trips × ~30K nominal: 300K nominal
Cache hit 70%: ~90K effective
TOTAL HEAVY SESSION (lazy):
200K + 720K + 90K = ~1010K nominal
After cache discount: ~700K effective billed
```
**With Cách A RAG:**
```
Main agent:
Blanket: 120K (unchanged)
RAG retrieve replace lazy Read: ~30K (-20K saving)
Reasoning streamlined: ~25K
= ~175K subtotal (saving 25K)
4 sub-agents (each):
Spawn blanket: ~100K (unchanged)
RAG retrieve (share cache 70-90% common queries): ~15K
Reasoning streamlined: ~25K
Each agent: ~140K (saving 40K each)
──────────────
4 agents subtotal: ~560K (saving 160K total)
SendMessage iteration: ~90K effective (unchanged)
TOTAL HEAVY SESSION (Cách A):
175K + 560K + 90K = ~825K nominal
After cache discount: ~560K effective billed
SAVING: -140K (-20%)
```
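The two session totals can be reproduced from their components. The sketch below recomputes the nominal sums and the saving percentage; the effective billed figures (~700K and ~560K) are taken as given from the blocks above, because the exact cache discount is not specified. The `HeavySession` class is a hypothetical helper for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeavySession:
    main_k: int          # main agent subtotal (K tokens)
    per_agent_k: int     # each sub-agent (K tokens)
    send_message_k: int  # SendMessage, effective after cache (K tokens)
    billed_k: int        # effective billed total, taken from the plan as given
    agents: int = 4

    @property
    def nominal_k(self) -> int:
        """Nominal total before the cache discount is applied."""
        return self.main_k + self.agents * self.per_agent_k + self.send_message_k


lazy = HeavySession(main_k=200, per_agent_k=180, send_message_k=90, billed_k=700)
cach_a = HeavySession(main_k=175, per_agent_k=140, send_message_k=90, billed_k=560)

saving_k = lazy.billed_k - cach_a.billed_k
saving_pct = 100 * saving_k / lazy.billed_k
print(lazy.nominal_k, cach_a.nominal_k, saving_k, saving_pct)  # 1010 825 140 20.0
```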
### Cost saving breakdown
| Component | Lazy (current) | Cách A | Saving |
|---|---:|---:|---:|
| Main agent blanket (fixed) | 120K | 120K | 0 |
| Main agent lazy Read → RAG retrieve | 50K | 30K | -20K |
| Main agent reasoning (streamlined) | 30K | 25K | -5K |
| 4 agents spawn blanket (fixed) | 400K | 400K | 0 |
| 4 agents lazy Read → cached retrieve | 200K | 60K | **-140K** |
| 4 agents reasoning | 120K | 100K | -20K |
| SendMessage (cached) | 90K | 90K | 0 |
| **Nominal subtotal (sum of rows)** | **~1010K** | **~825K** | **-185K** |
| **TOTAL EFFECTIVE BILLED (after cache discount)** | **~700K** | **~560K** | **-140K (-20%)** |
**Most of the saving comes from the 4 agents** (~160K of the ~185K nominal saving), which share the retrieve cache (70-90% cache hit on common cross-agent queries).
→ The main agent saves only 25K (its blanket is unchanged; only Read → retrieve is optimized).
### Concrete multi-agent leverage example
```
Task Plan B Contract V2 wire:
🔵 Inv query "PE V2 schema pattern" → 15K retrieve + cached
🟡 Imp query same → cache hit 90% → 1.5K effective
🔴 Rev query same → cache hit 90% → 1.5K effective
🟢 CICD query same → cache hit 90% → 1.5K effective
Main agent query same → cache hit 90% → 1.5K effective
Cumulative retrieve cost: 15K + 4×1.5K = 21K
Compare to lazy:
Each agent Read PE V2 file separately
5 entities × 20K Read = 100K cumulative
→ Saving 79K just for 1 cross-agent query
```
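The same arithmetic as a small function, assuming (as in the example above) that the first query pays the full retrieve cost and every later identical query gets a 90% cache hit. The function name and constants are illustrative.

```python
FULL_RETRIEVE_TOKENS = 15_000  # first query pays full price
LAZY_READ_TOKENS = 20_000      # each entity reading the file on its own
CACHE_HIT_PCT = 90
ENTITIES = 5                   # main agent + 4 sub-agents


def shared_retrieve_cost(entities: int, full_tokens: int, hit_pct: int) -> int:
    """One full-price retrieve, then (entities - 1) cached repeats."""
    cached_tokens = full_tokens * (100 - hit_pct) // 100
    return full_tokens + (entities - 1) * cached_tokens


rag_cost = shared_retrieve_cost(ENTITIES, FULL_RETRIEVE_TOKENS, CACHE_HIT_PCT)
lazy_cost = ENTITIES * LAZY_READ_TOKENS
print(rag_cost, lazy_cost, lazy_cost - rag_cost)  # 21000 100000 79000
```

The saving scales with how many entities repeat the query: with a single entity there is nothing to share, and the cost is simply the full 15K retrieve.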
### Optimization tips to reduce cumulative cost
**Option 1: Spawn fewer agents**
- Run the 6-criteria decision gate for every task (per `feedback_multi_agent_setup` rule)
- If the main agent alone is enough → do NOT spawn an agent
- Spawn only the agents that are REALLY needed
- In S20-S21: the 4 agents are seeds-only and have never been spawned → cost is just the main agent's ~120K
**Option 2: Tune blanket sub-agent (100K → 80K)**
- Main agent passes a compact spec (~10K instead of 15K)
- Main agent pastes a common context excerpt instead of the full context (~20K instead of 50K)
- Skills preload the description only (~3K instead of the full 21K SKILL.md)
→ Load the full SKILL.md only on a semantic match
- Per sub-agent: 100K → 80K
- 4 agents cumulative: 400K → 320K
- Heavy session: 560K → 480K (-15%)
**Option 3: SendMessage cache aggressive (1h TTL beta)**
- Anthropic extended cache `extended-cache-ttl-2025-04-11`
- Cache WRITES for static prompts cost a premium: 2× the base input price
- Subsequent reads cost 0.1× the base input price
- Multiple agents sharing the same cache prefix → large benefit
- Saving 10-15% additional
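To see when the Option 3 write premium pays off, the sketch below prices one shared prefix in units of its uncached input cost, using the 2× write and 0.1× read multipliers listed above. It assumes every read lands inside the cache TTL and that all readers really share an identical prefix; the function name is illustrative.

```python
def cached_prefix_cost(readers: int, write_mult: float = 2.0,
                       read_mult: float = 0.1) -> float:
    """Cost of one shared prefix, in units of its uncached input price:
    one cache write by the first reader, cache reads for the rest."""
    if readers < 1:
        raise ValueError("readers must be >= 1")
    return write_mult + (readers - 1) * read_mult


for readers in (1, 2, 3, 5):
    uncached = float(readers)  # every reader pays the full prefix price
    cached = cached_prefix_cost(readers)
    print(readers, round(cached, 2), "cheaper" if cached < uncached else "not cheaper")
```

Under these assumptions the premium is only recovered from the third reader onward, so the benefit depends on several agents (or round trips) actually reusing the same prefix within the hour.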
---
## 14. 3-layer hybrid RAG upgrade path (Anthropic Contextual Retrieval)
> **Added S21 turn 2 (2026-05-12)** — Anthropic flagship pattern Sept 2024.
### Pattern overview
```
Anthropic Contextual Retrieval = 3 layers compound:
Layer 1: Embeddings (Voyage-3-large)
→ Semantic + synonym + multilingual catch
+ Contextual prefix (Haiku-generated context):
Add chunk-specific context BEFORE embed
"This chunk discusses... in context of..."
→ Better recall via enriched vector
Layer 2: BM25 (bm25s Python lib free local)
→ Exact identifier + technical terms (function names, error codes, Mig numbers)
+ Contextual BM25 (same prefix pattern)
Layer 3: Reranking (Voyage rerank-2)
→ Cross-attention deep relevance
→ Re-score top 30 candidates → return top 5 truly relevant
```
### Performance compound effect
```
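Recall values are rough estimates; failure reductions are relative to the baseline failure rate, as reported by Anthropic:
Baseline (naive vector embeddings): ~50% recall
+ Contextual embeddings: ~67% recall (-35% failure)
+ Hybrid Contextual + BM25: ~75% recall (-49% failure)
+ Reranking: ~85% recall (-67% failure)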
```
📎 Source: [Anthropic Contextual Retrieval Sept 2024](https://www.anthropic.com/news/contextual-retrieval)
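The -35% / -49% / -67% figures are relative reductions in retrieval failure rate, not absolute recall gains. As a worked check, the sketch below recomputes them from the top-20-chunk failure rates given in the Anthropic post linked above (5.7% baseline, then 3.7%, 2.9%, 1.9%); those four rates come from that post, not from this plan.

```python
BASELINE_FAILURE_PCT = 5.7  # top-20-chunk retrieval failure rate, per the linked post

failure_pct_by_setup = {
    "contextual embeddings": 3.7,
    "contextual embeddings + contextual BM25": 2.9,
    "contextual embeddings + contextual BM25 + reranking": 1.9,
}


def relative_reduction_pct(baseline: float, improved: float) -> int:
    """Relative drop in failure rate, rounded to a whole percent."""
    return round(100 * (baseline - improved) / baseline)


reductions = {
    setup: relative_reduction_pct(BASELINE_FAILURE_PCT, rate)
    for setup, rate in failure_pct_by_setup.items()
}
print(list(reductions.values()))  # [35, 49, 67]
```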
### Incremental phase rollout (recommended for the user)
| Phase | Setup | Recall | Cost/month | Effort additional |
|---|---|---:|---:|---|
| **Phase 1** (Week 1-4) | Layer 1 vector only (Voyage-3-large) | ~70% | ~$1.50 | 10-14h initial |
| **Phase 2** (Month 2) | + Layer 2 BM25 (bm25s free local) | ~78% | ~$1.50 unchanged | 2-3h |
| **Phase 3** (Month 3) | + Layer 3 Voyage rerank-2 + Contextual prefix | ~92% | ~$4-5 | 3-4h |
### Phase 1 implementation (basic vector RAG)
Already covered in Sections 5-6 of this plan. The user implements it as a trial pilot in Weeks 1-4.
### Phase 2 upgrade — Add BM25 hybrid
```python
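# scripts/rag-mcp-server.py (Phase 2 upgrade)
# Assumes `mcp`, `voyage`, `qdrant` and COLLECTION are set up as in Phase 1.
import bm25s

bm25 = bm25s.BM25.load("./rag-data/bm25_index")  # pre-built index


@mcp.tool()
def rag_retrieve_hybrid(query, scope="all", k=5):
    # Step 1: Vector search
    query_vec = voyage.embed(
        [query], model="voyage-3-large", input_type="query"
    ).embeddings[0]
    vector_results = qdrant.search(COLLECTION, query_vec, limit=20)

    # Step 2: BM25 search (local Python lib). bm25s expects a tokenized query
    # and returns (document indices, scores), each shaped (n_queries, k).
    bm25_results, _bm25_scores = bm25.retrieve(bm25s.tokenize(query), k=20)

    # Steps 3-4: merge, dedup and combine scores with RRF (reciprocal rank fusion).
    # reciprocal_rank_fusion is a project helper not defined in this plan; it is
    # expected to map both result lists to shared chunk ids and return a single
    # deduplicated list, best first.
    fused = reciprocal_rank_fusion(vector_results, bm25_results[0])
    return fused[:k]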
```
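The fusion step is the only part of Phase 2 that is not a library call. A minimal sketch of reciprocal rank fusion, operating on plain chunk ids rather than Qdrant or bm25s result objects: each list contributes `1 / (rrf_k + rank)` per chunk, so a chunk found by both searches is scored once and rises to the top. The smoothing constant 60 is a commonly used default; the chunk ids are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings, rrf_k=60):
    """Fuse ranked lists of chunk ids into one deduplicated list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical ids: "c1" and "c3" are returned by both searches.
vector_ids = ["c3", "c1", "c7"]
bm25_ids = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion(vector_ids, bm25_ids)
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

RRF uses ranks only, so the vector similarity scores and BM25 scores never need to be normalized onto a common scale.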
### Phase 3 upgrade — Full Anthropic Contextual
```python
# scripts/rag-indexer.py (Phase 3 upgrade: contextual prefix)
import anthropic

claude_haiku = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def contextualize_chunk(chunk_content, full_doc_path):
    """Generate context prefix using Claude Haiku (cheap model)."""
    with open(full_doc_path, encoding="utf-8") as f:
        full_doc = f.read()
    response = claude_haiku.messages.create(
        model="claude-haiku-4-5",  # cheap ~$0.0001/chunk
        max_tokens=150,
        messages=[{
            "role": "user",
            # Only the first 5000 characters of the document are sent as context
            "content": f"""<document>
{full_doc[:5000]}
</document>
<chunk>
{chunk_content}
</chunk>
Give a brief context (50-100 words) explaining what this chunk is about and where it fits in the document. Be specific.""",
        }],
    )
    return response.content[0].text


# In indexer pipeline (`chunks` comes from the existing chunking step):
for chunk in chunks:
    context = contextualize_chunk(chunk["content"], chunk["source"])
    chunk["content_enriched"] = f"{context}\n\n{chunk['content']}"
    # Embed the enriched version → better recall
```
```python
# scripts/rag-mcp-server.py (final upgrade: reranking)
import voyageai

voyage = voyageai.Client()  # reads VOYAGE_API_KEY from the environment


@mcp.tool()
def rag_retrieve_full(query, scope="all", k=5):
    # Steps 1-3: same as Phase 2 (vector + BM25 + merge).
    # hybrid_search is the Phase 2 pipeline wrapped as a helper (not shown here).
    candidates = hybrid_search(query, scope, top=30)
    # Step 4: Voyage rerank
    rerank_response = voyage.rerank(
        query=query,
        documents=[c.content for c in candidates],
        model="rerank-2",  # ~$0.05 per 1000 queries
        top_k=k,
    )
    return [candidates[r.index] for r in rerank_response.results]
```
### Cost incremental analysis
```
Phase 1 → Phase 3 incremental cost:
Phase 1 (basic vector):
Voyage embed: ~$0.36 initial + ~$0.20/mo delta
= ~$1.50/mo total
Phase 2 (+BM25):
BM25 free local (Python lib)
Embedding cost same
= ~$1.50/mo total (unchanged)
Phase 3 (+Reranking + Contextual):
Voyage rerank-2: ~$0.05 per 1000 queries
600 queries/mo × $0.05/1K = $0.03/mo
Haiku contextual prefix: ~$0.0001 per chunk
Initial 5000 chunks × $0.0001 = $0.50 one-time
Delta ~100 chunks/mo × $0.0001 = $0.01/mo
+ Voyage rerank monthly: ~$0.05/mo per 1K queries × 5 projects
+ Re-embed enriched chunks: ~$0.50/mo
= ~$4-5/mo total
→ Quality jump 70% → 92% recall = +22pp
→ Cost jump $1.50 → $4-5/mo = +$3
→ Worth it after Phase 1 validation
```
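As a sanity check, the itemized Phase 3 lines can be totalled directly. The sketch below uses only the dollar figures from the block above. The recurring items come to roughly $0.79/mo on top of Phase 1, so the ~$4-5/mo figure presumably includes headroom or usage that is not itemized here; that reading is an assumption, not something the plan states.

```python
# All figures are copied from the cost block above (USD).
PHASE_1_MONTHLY = 1.50

phase_3_recurring = {
    "rerank, 600 queries/mo at $0.05 per 1K": 600 / 1000 * 0.05,
    "Haiku prefix, ~100 new chunks/mo": 100 * 0.0001,
    "rerank monthly line, $0.05 x 5 projects": 0.05 * 5,
    "re-embed enriched chunks": 0.50,
}
phase_3_one_time = round(5000 * 0.0001, 2)  # initial contextual prefixes

recurring_extra = round(sum(phase_3_recurring.values()), 2)
itemized_total = round(PHASE_1_MONTHLY + recurring_extra, 2)
print(recurring_extra, itemized_total, phase_3_one_time)
```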
### Why incremental rollout (vs all-in Phase 3 immediately)
1. **Validate Layer 1 quality first**: if Voyage handles Vietnamese poorly → upgrading to Phase 2-3 is pointless
2. **Measure baseline cost**: know the exact Voyage spend before adding rerank/contextual
3. **Identify retrieval miss patterns**: the Phase 1 trial reveals weaknesses → target the Phase 2-3 fixes
4. **Risk-averse setup**: each phase adds only 2-3h, and rollback is easy if it fails
5. **Preserve the §6.5 narrative**: do NOT over-engineer, build incrementally
### When to skip Phase 2-3
- Phase 1 recall is already > 85% → Phase 2-3 bring only marginal benefit (Vietnamese-specific corpus)
- Monthly budget is < $5 → staying on Phase 1 is OK
- Solo dev without heavy use of exact Vietnamese terms → BM25 is less impactful
### When you MUST upgrade to Phase 2-3
- Recall < 70% on the benchmark → Phase 1 is insufficient
- The main agent frequently reports "miss exact identifier" → Phase 2 BM25 is critical
- Multi-language queries are common → the Phase 3 reranker stabilizes results
- Production quality target > 90% → Phase 3 required
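The skip/upgrade criteria above can be read as a small decision gate. The sketch below is one possible encoding, not a rule defined by this plan: the function name and parameters are hypothetical, and the handling of the 70-85% recall band (stay on Phase 1 unless another trigger fires) is an assumption, since the lists do not cover that case explicitly.

```python
def recommend_phase(recall: float, target_recall: float = 0.85,
                    misses_exact_identifiers: bool = False,
                    multilingual_queries: bool = False) -> int:
    """Return the phase (1, 2 or 3) suggested by the criteria above."""
    # Phase 3 triggers: quality target above 90%, or common multi-language queries.
    if target_recall > 0.90 or multilingual_queries:
        return 3
    # Phase 2 triggers: benchmark recall below 70%, or frequent exact-identifier misses.
    if recall < 0.70 or misses_exact_identifiers:
        return 2
    # Otherwise stay on Phase 1 (assumption for the 70-85% band).
    return 1


print(recommend_phase(0.88))                                 # 1
print(recommend_phase(0.65))                                 # 2
print(recommend_phase(0.80, misses_exact_identifiers=True))  # 2
print(recommend_phase(0.80, target_recall=0.92))             # 3
```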
---
## 📚 References + tools
### Anthropic official