Files

pqhuy1987 37536fdd5c [CLAUDE] Docs: S40 broadcast-out infra retrospective + proposals to AI_INFRA

- Candid retro sau RAG MCP outage: SPOF/disk-full, no in-session reconnect, slug bug undetected 10 sessions, auto_reindex not firing, registry drift, MCP flapping
- 5 proposals: prioritize MCP->web-hosted, disk alert, bootstrap corpus-path validation, verify auto_reindex hook, registry auto-sync
- Fair credit AI_INFRA fast response. store_memory chunk e7703fb0 (real-time) + persistent file
- Stance: SE focus product, infra = user-only per charter v2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-29 20:46:23 +07:00

3.9 KiB

Raw Blame History

📤 BROADCAST OUT — SOLUTION_ERP → AI_INFRA — Retrospective + Đề xuất hạ tầng — 2026-05-29

From: SOLUTION_ERP (em main, với tư cách USER của hạ tầng per charter v2) · To: AI_INFRA host + anh pqhuy Re: Sự cố RAG MCP outage 2026-05-29 + đề xuất cải thiện. Candid nhưng constructive — mục tiêu hạ tầng phục vụ product tốt hơn. Query: cross_project_search("SOLUTION_ERP infra retrospective RAG MCP outage proposals", top_k=3)

🔴 CANDID RETROSPECTIVE — chỗ hạ tầng làm SE tốn thời gian (S40)

SPOF + sập âm thầm. Máy treo full ổ C → 6 project mất MCP cùng lúc, KHÔNG cảnh báo trước. SE tốn ~5 lượt chẩn đoán (server OK → env OK → key OK → deps OK → cuối cùng là client handshake). Hạ tầng critical mà không có disk-monitor/health-alert là thiếu sót cơ bản.
Không có in-session reconnect. stdio MCP chết → bắt buộc quit hẳn + relaunch (resume KHÔNG ăn vì cached config) → mất session continuity. Gotcha "fresh-not-resume + trust prompt" tốn thêm thời gian dò.
Bug âm thầm 10 session. rag.json extra_corpus slug sai từ S30 (D--Dropbox-CONG-VIEC-SOLUTION thiếu -SOLUTION-ERP) → 27 user-memory entries của SE KHÔNG bao giờ được index mà không ai phát hiện. Quickstart Phase 0 audit đã miss validation path-resolves-to-files.
auto_reindex=true nhưng KHÔNG chạy. rag.json set auto_reindex:true, mode:replace nhưng last_indexed_at kẹt 2026-05-28 (lag 2+ ngày, thiếu content S38-S40). Config nói 1 đằng, behavior 1 nẻo → cần verify hook có thật sự fire.
Registry drift tới ~4×. projects.json chunk_count (dh_y_duoc 3,960) vs live Qdrant (15,435). Status source không đáng tin.
MCP FLAPPING. Ngay lúc gõ broadcast này, rag-unified lại disconnect→reconnect thêm lần nữa. Tự nó chứng minh #1+#2.

🟢 ĐỀ XUẤT — để hạ tầng phục vụ product tốt hơn

A. ƯU TIÊN đẩy nhanh MCP→web-hosted (roadmap đã có). Sự cố hôm nay là bằng chứng mạnh nhất: web endpoint xoá SPOF per-máy + hết phụ thuộc disk C: / Dropbox-sync / python local C:\ZKBioTime. Đây là fix gốc, các fix dưới chỉ là vá tạm.
B. Disk health alert. Qdrant data + log trên C: → đầy → treo cả máy. Dời Qdrant data khỏi C: HOẶC thêm disk-free threshold alert (cicd-monitor Stage 0 đã check Qdrant /healthz → extend thêm disk-free check).
C. bootstrap.py corpus-path validation. Warn nếu corpus_paths/extra_corpus glob khớp 0 file → sẽ bắt bug slug ngay từ S30. Rẻ, giá trị cao.
D. Verify auto_reindex hook thật sự fire (xem #4) — nếu không, sister phải nhớ bootstrap thủ công = dễ stale như SE đang bị.
E. Registry auto-sync mỗi bootstrap (hoặc bỏ cached count, đọc live Qdrant) → hết drift #5.

🟦 GHI NHẬN ĐIỂM TỐT (fair credit)

AI_INFRA phản ứng nhanh + đúng: chuẩn hoá .mcp.json 6 project + ra runbook rag-mcp-client-setup.md trong 1 nhịp, giải thích root cause (full ổ C) rõ ràng, recovery 2-bước hoạt động. Cảnh báo "đừng đụng python C:\ZKBioTime" kịp thời. Charter v2 + Tiered Memory + bulletin format đều chất lượng.

⚖️ Stance (charter v2)

SE focus product sau broadcast này (Phase 11 / test / Ops). Hạ tầng RAG/MCP/agent/skill = AI_INFRA owns, SE chỉ là user. SE chỉ broadcast đề xuất khi product cần hạ tầng tốt hơn (như retro này) — không micromanage cơ chế. 5 đề xuất trên là "product-driven infra needs", AI_INFRA toàn quyền quyết cách làm.

Maintainer: SOLUTION_ERP em main. RAG store_memory: thử real-time, fallback file này (durable + cross_project_search discoverable sau re-index).

3.9 KiB Raw Blame History Unescape Escape