[CLAUDE] Docs: finalize Session 21 turn 2 (RAG Hybrid setup planning + Approach A validation)

After S21 turn 1 finalized cicd-monitor, the user clarified that 5 future projects will exceed 1M MD tokens, which led to a deep discussion (~15 turns) about RAG infrastructure. The main agent worked solo (no SOLUTION_ERP sub-agent spawned) and delegated research on Anthropic and community practice to claude-code-guide × 2.

Decisions:
- Approach A, defensive (keep the main agent's 120K blanket + RAG retrieve as a supplement)
- Drop Approach B, aggressive (cut 60-70% of the blanket): it violates the priority of keeping the main agent's control flow strong
- Industry-validated across 4 Anthropic blog posts + 5 community tools (Cursor/Continue/Cline/Aider all hybrid)
- 3-layer pattern, Phase 1-3 incremental rollout (vector → +BM25 → +reranking, recall ~70% → ~92%)
- Stack: Voyage-3-large + Qdrant local + FastMCP Python + Streamlit dashboard

Multi-agent cost reality, clarified (post-S21 t2):
- Main agent blanket: ~120K
- 4 sub-agent spawns, cumulative: ~400K
- Total billed for a heavy session: ~560K with Approach A vs ~700K lazy
- Saving of 20% from the multi-agent shared cache (70-90% hit rate)
- Anthropic acknowledges an 8-10× multiplier for multi-agent setups

Files updated:
- docs/STATUS.md (Last updated S21 turn 2 + Recently Done row at top)
- docs/HANDOFF.md (TL;DR Session 21 turn 2 section + Last updated)
- docs/rag-setup-plan.md (+Section 13 multi-agent cost reality + Section 14 3-layer hybrid Phase 1-3, +355 LOC)
- docs/changelog/sessions/2026-05-12-1800-s21-turn2-rag-planning.md (new session log)

Memory user-level update (outside repo, separate update):
- feedback_rag_hybrid_pattern.md (NEW, reusable cross-project pattern)
- MEMORY.md index (+1 entry pointer)

Plan I (NEW) is deferred. Trigger: the user confirms the 5 project paths + stack + pilot + Voyage API + disk cleanup → dedicated 10-14h weekend session (per the feedback_drastic_refactor_scope rule).

Stats:
- 17 memory entries (+1 RAG hybrid)
- 1 plan file rag-setup-plan.md (1500 LOC final)
- 4 sub-agents seeds-only, unchanged
- 81 tests, unchanged
- 4 commits in S21 cumulative (f1c61c9+3a34831+1f8e9af+ this)

CI skipped per path filter (all .md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ -1165,6 +1165,361 @@ Mitigation:

---

## 13. Multi-agent cumulative cost reality (Anthropic 8-10× warning)

> **Added S21 turn 2 (2026-05-12)**: clarification after the user caught the gap "the 120K blanket does NOT include the 4 agents".

### Per-entity blanket breakdown

```
Main agent blanket: ~120K
  STATUS + HANDOFF top + rules + architecture + 5 agent .md +
  4 MEMORY.md auto-inject + skills desc + memory critical +
  auto-inject system reminders

Per sub-agent spawn baseline: ~80-100K each
  Agent system prompt (~5K) +
  3 skills preload SKILL.md full (~21K, trigger semantic) +
  Auto-inject MEMORY.md 25KB first 200 lines (~7K) +
  Main agent passes the task spec (~10-15K) +
  Main agent pastes a common context excerpt (~30-50K) +
  Auto-inject project context (~10K)
  = ~80-100K per sub-agent spawn (per Anthropic docs)

4 sub-agents cumulative: ~400K
  (4 × ~100K each, isolated context windows)

TOTAL cumulative blanket, 5 entities: ~520K
  Main agent + 4 sub-agents combined (isolated windows, cumulative billing)
```

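The blanket arithmetic above can be checked with a short script. This is a minimal sketch: the dictionary keys and function names are hypothetical, and every figure is an estimate (in thousands of tokens) taken from the breakdown, not a measurement. It shows that the itemized sub-agent list sums to 83-108K, which the breakdown rounds to ~80-100K, and that the 5-entity total is 520K.

```python
# Estimates from the breakdown above, in thousands of tokens: (low, high).
SUB_AGENT_ITEMS = {
    "agent_system_prompt": (5, 5),
    "skills_preload": (21, 21),
    "memory_md_auto_inject": (7, 7),
    "task_spec": (10, 15),
    "common_context_excerpt": (30, 50),
    "project_context": (10, 10),
}
MAIN_BLANKET_K = 120
SUB_AGENT_COUNT = 4
SUB_AGENT_NOMINAL_K = 100  # rounded per-spawn figure used in the breakdown


def sub_agent_range(items):
    """Return the (low, high) sum of the itemized per-spawn estimates."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def cumulative_blanket(main_k, per_agent_k, agents):
    """Total blanket billed across isolated context windows."""
    return main_k + per_agent_k * agents


low, high = sub_agent_range(SUB_AGENT_ITEMS)
total = cumulative_blanket(MAIN_BLANKET_K, SUB_AGENT_NOMINAL_K, SUB_AGENT_COUNT)
multiplier = total / MAIN_BLANKET_K
print(f"per sub-agent: {low}-{high}K, cumulative: {total}K, vs solo: {multiplier:.1f}x")
```

Dividing the cumulative total by the main agent's own blanket gives the multiplier by which a shared usage cap is consumed faster than in a solo session.
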
### Context windows are ISOLATED

```
The 5 entities do NOT share 520K inside a single 1M context window.

Each entity has its OWN 1M context window:
  Main agent   → 1M context window, uses ~120K
  Investigator → 1M context window, uses ~100K
  Implementer  → 1M context window, uses ~100K
  Reviewer     → 1M context window, uses ~100K
  CICD Monitor → 1M context window, uses ~100K

→ Each entity has its own LOST-IN-MIDDLE threshold (~700K each)
→ Each entity has capacity for ~58 tasks before hitting its own hard cap

BUT billing is CUMULATIVE, 520K across all contexts:
  Anthropic bills the total tokens across all 5 windows
  → The weekly cap is hit 4-5× faster than with the main agent solo
```

### Heavy session token compound effect (Approach A vs lazy)

**Without RAG (current lazy setup, 4 agents spawned):**

```
Main agent:
  Blanket:                        120K
  Lazy Read on-demand:            ~50K
  Reasoning + coordination:       ~30K
  = ~200K subtotal

4 sub-agents (each):
  Spawn blanket:                  ~100K
  Lazy Read inside agent:         ~50K
  Reasoning + work:               ~30K
  Each agent:                     ~180K
  ──────────────
  4 agents subtotal:              ~720K cumulative

SendMessage iteration:
  10 round trips × ~30K nominal:  300K nominal
  Cache hit 70%:                  ~90K effective

TOTAL HEAVY SESSION (lazy):
  200K + 720K + 90K = ~1010K nominal
  After cache discount: ~700K effective billed
```

**With Approach A RAG:**

```
Main agent:
  Blanket:                                  120K (unchanged)
  RAG retrieve replaces lazy Read:          ~30K (-20K saving)
  Reasoning streamlined:                    ~25K
  = ~175K subtotal (saving 25K)

4 sub-agents (each):
  Spawn blanket:                            ~100K (unchanged)
  RAG retrieve (shared cache, 70-90% hit on common queries): ~15K
  Reasoning streamlined:                    ~25K
  Each agent:                               ~140K (saving 40K each)
  ──────────────
  4 agents subtotal:                        ~560K (saving 160K total)

SendMessage iteration:                      ~90K effective (unchanged)

TOTAL HEAVY SESSION (Approach A):
  175K + 560K + 90K = ~825K nominal
  After cache discount: ~560K effective billed

SAVING: -140K (-20%)
```

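The two session budgets above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: `session_nominal_k` is a hypothetical helper, the per-item numbers are the estimates from the two blocks above (in thousands of tokens), and the ~700K and ~560K effective figures are taken as given from the text rather than derived.

```python
def session_nominal_k(main_items, agent_items, agents, send_message_k):
    """Sum the main agent, `agents` identical sub-agents, and SendMessage traffic."""
    main = sum(main_items.values())
    per_agent = sum(agent_items.values())
    return main + per_agent * agents + send_message_k


lazy_nominal = session_nominal_k(
    {"blanket": 120, "lazy_read": 50, "reasoning": 30},
    {"blanket": 100, "lazy_read": 50, "reasoning": 30},
    agents=4,
    send_message_k=90,
)
approach_a_nominal = session_nominal_k(
    {"blanket": 120, "rag_retrieve": 30, "reasoning": 25},
    {"blanket": 100, "rag_retrieve": 15, "reasoning": 25},
    agents=4,
    send_message_k=90,
)

# Effective billed figures quoted in the text (after cache discount).
LAZY_EFFECTIVE_K = 700
APPROACH_A_EFFECTIVE_K = 560
effective_saving = (LAZY_EFFECTIVE_K - APPROACH_A_EFFECTIVE_K) / LAZY_EFFECTIVE_K

print(lazy_nominal, approach_a_nominal, lazy_nominal - approach_a_nominal)
print(f"effective saving: {effective_saving:.0%}")
```

The nominal difference (185K) is larger than the effective one (140K) because the cache discount is applied to both totals.
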
### Cost saving breakdown

| Component | Lazy current | Approach A | Saving |
|---|---:|---:|---:|
| Main agent blanket (fixed) | 120K | 120K | 0 |
| Main agent lazy Read → RAG retrieve | 50K | 30K | -20K |
| Main agent reasoning streamlined | 30K | 25K | -5K |
| 4 agents spawn blanket (fixed) | 400K | 400K | 0 |
| 4 agents lazy Read → cached retrieve | 200K | 60K | **-140K** |
| 4 agents reasoning | 120K | 100K | -20K |
| SendMessage cached | 90K | 90K | 0 |
| Nominal total (sum of rows above) | ~1010K | ~825K | -185K |
| **TOTAL EFFECTIVE BILLED** (after cache discount) | **~700K** | **~560K** | **-140K (-20%)** |

→ **About 80% of the saving comes from the 4 agents** sharing the retrieve cache (70-90% cache hit on common cross-agent queries).

→ The main agent saves only 25K (its blanket is unchanged; only Read → retrieve is optimized).

### Multi-agent leverage: a concrete example

```
Task: Plan B Contract V2 wire:
  🔵 Inv queries "PE V2 schema pattern"  → 15K retrieve + cached
  🟡 Imp runs the same query             → cache hit 90% → 1.5K effective
  🔴 Rev runs the same query             → cache hit 90% → 1.5K effective
  🟢 CICD runs the same query            → cache hit 90% → 1.5K effective
  Main agent runs the same query         → cache hit 90% → 1.5K effective

Cumulative retrieve cost: 15K + 4×1.5K = 21K

Compared to lazy:
  Each agent Reads the PE V2 file separately
  5 entities × 20K Read = 100K cumulative

→ Saving 79K for just 1 cross-agent query
```

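The example above generalizes to any number of entities and any cache hit rate. A minimal sketch, assuming the first entity pays the full retrieve cost and every later entity pays only the cache-miss fraction; both function names are hypothetical and the inputs are the estimates from the example (thousands of tokens).

```python
def shared_retrieve_cost_k(first_query_k, entities, cache_hit_rate):
    """First entity pays full price; the others pay only the cache-miss share."""
    repeat_cost = first_query_k * (1 - cache_hit_rate)
    return first_query_k + (entities - 1) * repeat_cost


def lazy_read_cost_k(read_k, entities):
    """Every entity reads the file separately, with no sharing."""
    return read_k * entities


rag_cost = shared_retrieve_cost_k(15, entities=5, cache_hit_rate=0.9)
lazy_cost = lazy_read_cost_k(20, entities=5)
print(f"RAG: {rag_cost:.1f}K, lazy: {lazy_cost}K, saving: {lazy_cost - rag_cost:.1f}K")
```

Lowering `cache_hit_rate` to 0.7 (the bottom of the 70-90% range) is a quick way to see how sensitive the saving is to the cache assumption.
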
### Optimization tips to reduce the cumulative cost

**Option 1: Spawn fewer agents**
- Run the 6-criteria decision gate for every task (per the `feedback_multi_agent_setup` rule)
- If the main agent solo is enough → do NOT spawn an agent
- Spawn only the agents that are REALLY needed
- In S20-S21 the 4 agents were seeds-only and the main agent never spawned one → the cost was only the main agent's ~120K

**Option 2: Tune the sub-agent blanket (100K → 80K)**
- Main agent passes a leaner spec (~10K instead of 15K)
- Main agent pastes a common context excerpt instead of the full text (~20K instead of 50K)
- Skills preload the description only (~3K instead of the 21K full SKILL.md)
  → Load the full SKILL.md on a semantic match
- Per sub-agent: 100K → 80K
- 4 agents cumulative: 400K → 320K
- Heavy session: 560K → 480K (about -14%)

**Option 3: Aggressive SendMessage caching (1h TTL beta)**
- Anthropic extended cache beta header: `extended-cache-ttl-2025-04-11`
- Static prompts: the cache WRITE carries a premium of 2× base
- Subsequent reads are discounted to 0.1× base
- Multiple agents sharing the same cache prefix → large benefit
- Additional saving of 10-15%

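The Option 3 multipliers imply a break-even point for the 1h cache. The sketch below is illustrative only: it uses the 2× write and 0.1× read multipliers listed above, assumes every reuse lands inside the TTL window, and `relative_cost` and `break_even_uses` are hypothetical helper names.

```python
CACHE_WRITE_MULT = 2.0  # 1h-TTL cache write, relative to the base input price
CACHE_READ_MULT = 0.1   # cache read, relative to the base input price


def relative_cost(uses, cached):
    """Cost of sending one static prefix `uses` times, in units of one uncached send."""
    if uses <= 0:
        return 0.0
    if not cached:
        return float(uses)
    # One premium write, then discounted reads for every later use.
    return CACHE_WRITE_MULT + CACHE_READ_MULT * (uses - 1)


def break_even_uses():
    """Smallest number of uses at which caching is strictly cheaper."""
    uses = 1
    while relative_cost(uses, cached=True) >= relative_cost(uses, cached=False):
        uses += 1
    return uses


print(break_even_uses())
print(relative_cost(10, cached=True), relative_cost(10, cached=False))
```

Under these assumptions a prefix that is sent only once or twice costs more with the 1h cache than without it, so the option pays off mainly for prefixes shared by several agents or many round trips.
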
---

## 14. 3-layer hybrid RAG upgrade path (Anthropic Contextual Retrieval)

> **Added S21 turn 2 (2026-05-12)**: Anthropic flagship pattern, Sept 2024.

### Pattern overview

```
Anthropic Contextual Retrieval = 3 compounding layers:

Layer 1: Embeddings (Voyage-3-large)
  → Catches semantic matches, synonyms, and multilingual queries

  + Contextual prefix (Haiku-generated context):
    Add chunk-specific context BEFORE embedding
    "This chunk discusses... in context of..."
    → Better recall via an enriched vector

Layer 2: BM25 (bm25s, free local Python lib)
  → Exact identifiers + technical terms (function names, error codes, Mig numbers)

  + Contextual BM25 (same prefix pattern)

Layer 3: Reranking (Voyage rerank-2)
  → Cross-attention deep relevance
  → Re-score the top 30 candidates → return the top 5 truly relevant
```

### Performance compound effect

```
Baseline (naive vector embeddings):       ~50% recall

+ Contextual embeddings:                  ~67% recall (-35% failure)

+ Hybrid Contextual + BM25:               ~75% recall (-49% failure)

+ Reranking:                              ~85% recall (-67% failure)
```

The failure-rate reductions (-35%, -49%, -67%) are the relative figures reported in the Anthropic post. The recall percentages apply them to a ~50% baseline, so treat them as illustrative until measured on this corpus.

📎 Source: [Anthropic Contextual Retrieval Sept 2024](https://www.anthropic.com/news/contextual-retrieval)

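The recall figures above follow from applying a relative failure-rate reduction to a baseline. A minimal sketch, assuming the ~50% baseline from the block above; `recall_after` is a hypothetical helper name, and the results land within a point or two of the rounded values shown.

```python
def recall_after(baseline_recall, failure_reduction):
    """Recall once the baseline failure rate is cut by `failure_reduction` (0..1)."""
    baseline_failure = 1 - baseline_recall
    return 1 - baseline_failure * (1 - failure_reduction)


BASELINE = 0.50  # assumed naive-vector recall
steps = {
    "contextual embeddings": 0.35,
    "hybrid contextual + BM25": 0.49,
    "reranking": 0.67,
}
for name, reduction in steps.items():
    print(f"{name}: {recall_after(BASELINE, reduction):.1%}")
```

The same function makes it easy to see how the picture changes with a different measured baseline, for example the ~70% Phase 1 estimate used in the rollout table.
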
### Incremental phase rollout (recommended for the user)

| Phase | Setup | Recall | Cost/month | Additional effort |
|---|---|---:|---:|---|
| **Phase 1** (Week 1-4) | Layer 1 vector only (Voyage-3-large) | ~70% | ~$1.50 | 10-14h initial |
| **Phase 2** (Month 2) | + Layer 2 BM25 (bm25s, free local) | ~78% | ~$1.50 (unchanged) | 2-3h |
| **Phase 3** (Month 3) | + Layer 3 Voyage rerank-2 + contextual prefix | ~92% | ~$4-5 | 3-4h |

### Phase 1 implementation (basic vector RAG)

Already covered in Sections 5-6 of this plan. The user implements it as a trial pilot in Week 1-4.

### Phase 2 upgrade: add BM25 hybrid

```python
# scripts/rag-mcp-server.py (Phase 2 upgrade)
# Assumes `mcp`, `voyage`, `qdrant`, and `COLLECTION` already exist from the Phase 1 server.
import bm25s

bm25 = bm25s.BM25.load("./rag-data/bm25_index")  # pre-built index

@mcp.tool()
def rag_retrieve_hybrid(query, scope="all", k=5):
    # Step 1: Vector search
    query_vec = voyage.embed([query], model="voyage-3-large").embeddings[0]
    vector_results = qdrant.search(collection_name=COLLECTION, query_vector=query_vec, limit=20)

    # Step 2: BM25 search (local Python lib); bm25s expects a tokenized query
    # and returns (document indices, scores), one row per query
    bm25_ids, _bm25_scores = bm25.retrieve(bm25s.tokenize(query), k=20)

    # Step 3-4: Merge + dedup, then combine scores with RRF (reciprocal rank fusion).
    # `reciprocal_rank_fusion` is a project helper (not shown here) expected to
    # return the ~30 unique candidates ordered by fused score.
    ranked = reciprocal_rank_fusion(vector_results, bm25_ids[0])

    return ranked[:k]
```

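The score-combine step relies on reciprocal rank fusion. The sketch below shows one common formulation on plain ID lists; it is an assumption, not the project's actual helper. The constant `c=60` is a conventional default, the chunk IDs are made up, and mapping Qdrant hits or BM25 indices to shared IDs is left to the caller.

```python
def reciprocal_rank_fusion(*rankings, c=60):
    """Fuse best-first ID lists: score(id) = sum over lists of 1 / (c + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first; ties are broken by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ids = ["c7", "c2", "c9", "c4"]  # hypothetical chunk IDs from vector search
bm25_ids = ["c2", "c5", "c7", "c1"]    # hypothetical chunk IDs from BM25
fused = reciprocal_rank_fusion(vector_ids, bm25_ids)
print(fused)
```

Chunks that appear in both lists (`c2`, `c7`) rise to the top even though neither list ranks both of them first, and duplicates collapse into a single entry, which is why fusion also covers the merge and dedup step.
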
### Phase 3 upgrade: full Anthropic Contextual Retrieval

```python
# scripts/rag-indexer.py (upgrade with contextual prefix)
from pathlib import Path

import anthropic

claude_haiku = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(chunk_content, full_doc_path):
    """Generate context prefix using Claude Haiku (cheap model)."""
    full_doc = Path(full_doc_path).read_text(encoding="utf-8")

    response = claude_haiku.messages.create(
        model="claude-haiku-4-5",  # cheap ~$0.0001/chunk
        max_tokens=150,
        messages=[{
            "role": "user",
            # Only the first 5000 characters of the document are sent
            "content": f"""<document>
{full_doc[:5000]}
</document>

<chunk>
{chunk_content}
</chunk>

Give a brief context (50-100 words) explaining what this chunk is about and where it fits in the document. Be specific."""
        }]
    )

    return response.content[0].text

# In indexer pipeline:
for chunk in chunks:
    context = contextualize_chunk(chunk["content"], chunk["source"])
    chunk["content_enriched"] = f"{context}\n\n{chunk['content']}"
    # Embed enriched version → better recall
```

```python
# scripts/rag-mcp-server.py (final upgrade with reranking)
# `voyage` is the voyageai.Client() instance created in the Phase 1 server.
import voyageai

@mcp.tool()
def rag_retrieve_full(query, scope="all", k=5):
    # Step 1-3: Same as Phase 2 (vector + BM25 + merge).
    # `hybrid_search` is a project helper wrapping that pipeline (not shown here).
    candidates = hybrid_search(query, scope, top=30)

    # Step 4: Voyage Rerank
    rerank_response = voyage.rerank(
        query=query,
        documents=[c.content for c in candidates],
        model="rerank-2",  # ~$0.05 per 1000 queries
        top_k=k
    )

    return [candidates[r.index] for r in rerank_response.results]
```

### Incremental cost analysis

```
Phase 1 → Phase 3 incremental cost:

Phase 1 (basic vector):
  Voyage embed: ~$0.36 initial + ~$0.20/mo delta
  = ~$1.50/mo total

Phase 2 (+BM25):
  BM25 free local (Python lib)
  Embedding cost same
  = ~$1.50/mo total (unchanged)

Phase 3 (+Reranking + Contextual):
  Voyage rerank-2: ~$0.05 per 1000 queries
    600 queries/mo × $0.05/1K = $0.03/mo

  Haiku contextual prefix: ~$0.0001 per chunk
    Initial 5000 chunks × $0.0001 = $0.50 one-time
    Delta ~100 chunks/mo × $0.0001 = $0.01/mo

  + Voyage rerank monthly: ~$0.05/mo per 1K queries × 5 projects
  + Re-embed enriched chunks: ~$0.50/mo
  = ~$4-5/mo total

→ Quality jump 70% → 92% recall = +22pp
→ Cost jump $1.50 → $4-5/mo = +$3
→ Worth it after Phase 1 validation
```

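The Phase 3 line items above are simple unit-price multiplications, sketched here so they can be re-run with different volumes. The unit prices are the estimates quoted in the block above, not confirmed list prices, and both function names are hypothetical.

```python
def rerank_cost_usd(queries_per_month, usd_per_1k_queries=0.05):
    """Monthly reranking cost at an assumed price per 1000 queries."""
    return queries_per_month / 1000 * usd_per_1k_queries


def contextual_prefix_cost_usd(chunks, usd_per_chunk=0.0001):
    """Cost of generating a context prefix for `chunks` chunks."""
    return chunks * usd_per_chunk


monthly_rerank = rerank_cost_usd(600)
initial_prefix = contextual_prefix_cost_usd(5000)
monthly_prefix = contextual_prefix_cost_usd(100)
print(f"rerank: ${monthly_rerank:.2f}/mo")
print(f"prefix: ${initial_prefix:.2f} one-time + ${monthly_prefix:.2f}/mo")
```

These per-item costs are small next to the ~$4-5/mo Phase 3 total, so the total is worth re-checking against real invoices once Phase 1 has produced actual query and chunk counts.
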
### Why incremental rollout (vs going all-in on Phase 3 immediately)

1. **Validate Layer 1 quality first**: if Voyage handles Vietnamese poorly → upgrading to Phase 2-3 is pointless
2. **Measure baseline cost**: know the exact Voyage spend before adding rerank/contextual
3. **Identify retrieval miss patterns**: the Phase 1 trial reveals weaknesses → Phase 2-3 can target the fix
4. **Risk-averse setup**: each later phase adds only 2-4h and is easy to roll back if it fails
5. **Preserve the §6.5 narrative**: do NOT over-engineer, build incrementally

### When to skip Phase 2-3

- Phase 1 recall already > 85% → Phase 2-3 brings only marginal benefit (Vietnamese-specific corpus)
- Monthly budget under $5 → staying on Phase 1 is fine
- Solo dev whose queries rarely depend on exact Vietnamese terms → BM25 has less impact

### When you MUST upgrade to Phase 2-3

- Recall < 70% on the benchmark → Phase 1 is insufficient
- Main agent frequently reports "miss exact identifier" → Phase 2 BM25 is critical
- Multi-language queries are common → the Phase 3 reranker stabilizes results
- Production quality target > 90% → Phase 3 is required

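The skip and upgrade criteria can be folded into a small decision helper. This is only one possible reading of the two lists: `recommend_phase` and its parameters are hypothetical, the thresholds are the ones stated above, the budget criterion is left out, and recall between 70% and 85% with no other signal is treated as "stay on Phase 1 for now", which is an assumption.

```python
def recommend_phase(recall, misses_exact_identifiers=False,
                    multilingual_queries=False, quality_target=0.0):
    """Return the phase (1, 2, or 3) suggested by the criteria above."""
    # Phase 3 triggers: quality target above 90% or common multi-language queries.
    if quality_target > 0.90 or multilingual_queries:
        return 3
    # Phase 2 triggers: frequent exact-identifier misses or benchmark recall under 70%.
    if misses_exact_identifiers or recall < 0.70:
        return 2
    # Otherwise Phase 1 is enough, clearly so once recall exceeds 85%.
    return 1


print(recommend_phase(0.88))
print(recommend_phase(0.65))
print(recommend_phase(0.80, quality_target=0.92))
```

Keeping the rule in code makes the Phase 1 benchmark result directly actionable: plug in the measured recall and the observed miss patterns, then revisit the thresholds if the recommendation feels wrong.
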
---

## 📚 References + tools

### Anthropic official
