🚀 DGX Spark Full Model Deployment Stack

LLM + Embedding + Reranker — Production stack benchmarks (NFCorpus / SciFact / FiQA / ArguAna / CmedQA)

📍 NVIDIA GB10 • DGX Spark 💾 121 GB Unified Memory 📅 2026-05-06 🔧 buun-llama-cpp + FlagEmbedding


① LLM — Qwen3.6-35B vs Nemotron-Nano-Omni

Qwen3.6-35B-A3B

Pure Transformer MoE

Total parameters: 34.66B

Active: ~3B (A3B)

Layers: 40 (all attention)

Quant: Q4_K_M + TBQ3 KV

Context: 131K

Slots: 8 concurrent

Use: TENET translation / chat

Nemotron-Nano-Omni

Hybrid Mamba MoE + Vision

Total parameters: 30B

Active: ~3B (A3B, 6 experts)

Layers: 52 (46 Mamba + 6 attn)

Quant: NVFP4 attn + Q8 hack

Context: 65K (max 1M)

Slots: 4 concurrent

Use: image + text multimodal

Memory item	Qwen3.6-35B	Nemotron-Omni
Model file (on disk)	22.13 GB	18.90 GB + 1.59 GB mmproj
Model buffer (VRAM)	20.58 GB	17.66 GB
KV cache buffer	490 MiB	84 MiB ⚡ (Mamba layers have no KV)
Compute buffer	493 MiB	271 MiB
Total unified memory	~21.6 GB	~19.5 GB (+ 1.59 mmproj)

① Mamba's KV cache is tiny

Only 6 of Nemotron's 52 layers are attention, so its KV cache is just 84 MiB (Qwen's is 490 MiB). Mamba swaps the KV cache for SSM state, which makes cost linear in context length.

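② Qwen's TBQ3 KV quantization is very aggressive

35B × 8 concurrent slots × 131K context, and the KV cache is only 490 MiB (a normal Q8 KV cache would be about 1.5 GB). TBQ3 is the quantization in our own fork; it is about 3× smaller than llama.cpp's native Q8 KV.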
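
The KV-cache figures above scale with two things: how many layers are attention layers, and how many bits each cached element takes. The sketch below is a back-of-the-envelope estimate only. The head count, head dimension, and the 3.0 bits per element are hypothetical placeholders (the real TBQ3 layout is not described here); 8.5 bits approximates a Q8_0-style format, which stores 8-bit values plus a per-block scale.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, bits_per_element):
    """Approximate KV-cache size: one K and one V tensor per attention layer."""
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * bits_per_element / 8


# Hypothetical shape parameters, for illustration only (not read from either model).
shape = dict(n_kv_heads=2, head_dim=128, n_ctx=131_072)

full_attn_q8 = kv_cache_bytes(n_attn_layers=40, bits_per_element=8.5, **shape)
full_attn_low = kv_cache_bytes(n_attn_layers=40, bits_per_element=3.0, **shape)
hybrid_q8 = kv_cache_bytes(n_attn_layers=6, bits_per_element=8.5, **shape)

for label, size in [
    ("40 attention layers, 8.5-bit KV", full_attn_q8),
    ("40 attention layers, 3.0-bit KV (assumed)", full_attn_low),
    ("6 attention layers, 8.5-bit KV", hybrid_q8),
]:
    print(f"{label}: {size / 2**20:.0f} MiB")
```

Under these assumptions the size is linear in both the attention-layer count and the bit width, which is why a 6-attention-layer hybrid and a low-bit KV format both shrink the cache so sharply.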

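③ The price of Mamba: cache invalidation

The Nemotron log keeps warning "forcing full prompt re-processing... hybrid/recurrent memory". Mamba state cannot be shared across slots; each slot holds its own SSM state, so conversational workloads get no benefit from prefix caching.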

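② Embedding Models (Retrieval)

The MemSifter v3.2 production stack fuses three retrieval routes with RRF (reciprocal rank fusion): BGE-M3, Granite-emb-278m, and FTS5. Benchmarking shows that adding Granite-r2 as a fourth RRF route (English only, gated by langid routing) is the most cost-effective upgrade.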
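
For readers unfamiliar with RRF, here is a minimal sketch of the fusion step. It is not the MemSifter code: the function name, the constant `k=60` (a commonly used default), and the toy rankings are all assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal rank fusion: each route adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Toy ranked lists standing in for the three routes (hypothetical doc ids).
bge_ranking = ["d1", "d2", "d3"]
granite_ranking = ["d2", "d1", "d4"]
fts_ranking = ["d2", "d5", "d1"]

fused = rrf_fuse([bge_ranking, granite_ranking, fts_ranking])
print(fused)
```

Because RRF uses only ranks, a fourth route can be added by appending one more ranked list, with no score calibration between models.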

Model	Params	Dim	Multilingual?	NFCorpus (medical)	SciFact (science)	FiQA (finance)	ArguAna (argument)	CmedQA (Chinese)	Status
BGE-M3 dense	568M	1024	✓ multilingual	0.314	0.648	0.408	0.387	0.456	production
BGE-M3 combined (dense + sparse + colbert)	568M	1024 + sparse + multi-vector	✓ multilingual	0.343	n/a	n/a	n/a	n/a	paper level ✓
Granite-emb-278m-multi	278M	768	✓ multilingual	0.289	0.653	0.349	0.398	0.302	production
Granite-emb-english-r2	278M	768	✗ English only	0.375	0.756	0.453	0.421	0.154 ❌	recommended upgrade (English only)
Granite-emb-small-english-r2	30M	384	✗ English only	0.337	n/a	n/a	n/a	0.161 ❌	finetune-friendly

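What the paper number really measures: BGE-M3 NDCG@10 = 0.343 is the combined mode

Dense alone reaches only 0.314; it takes a weighted combination with sparse (0.291) and colbert (0.342) to reach 0.343. Production always goes through SentenceTransformer, which returns only the dense vector, so it can never reach the paper's level. Matching the paper requires switching to the FlagEmbedding package and weighting all three representations.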
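
The weighting step can be sketched as follows, assuming the dense, sparse, and colbert scores for each candidate have already been computed (for example with FlagEmbedding). The weights `(0.4, 0.2, 0.4)` and the toy scores are illustrative assumptions, not the values used in the benchmark above.

```python
def combined_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the three BGE-M3 style relevance scores."""
    w_dense, w_sparse, w_colbert = weights
    total = w_dense + w_sparse + w_colbert
    return (w_dense * dense + w_sparse * sparse + w_colbert * colbert) / total


def rank_combined(candidates, weights=(0.4, 0.2, 0.4)):
    """candidates maps doc_id -> (dense, sparse, colbert); returns ids, best first."""
    return sorted(
        candidates,
        key=lambda doc_id: combined_score(*candidates[doc_id], weights=weights),
        reverse=True,
    )


# Hypothetical scores: "a" wins on dense alone, "b" wins once sparse and colbert count.
candidates = {
    "a": (0.80, 0.10, 0.70),
    "b": (0.75, 0.60, 0.78),
}

print("dense only:", rank_combined(candidates, weights=(1.0, 0.0, 0.0)))
print("combined:  ", rank_combined(candidates))
```

The toy case shows the mechanism behind the gap: a document that is slightly weaker on dense similarity can still rank first when lexical and token-level evidence agree.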

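Biggest finding: on English workloads, Granite-r2 alone beats BGE-M3

FiQA 0.453 / SciFact 0.756 / NFCorpus 0.375, **beating BGE-M3 across the board**. On Chinese (CmedQA 0.154), however, it collapses: a pure-English embedder loses about 0.30 NDCG on Chinese compared with BGE-M3 dense (0.456). The fix is langid routing: English queries go through r2, Chinese queries do not.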
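
A minimal sketch of that routing decision is shown below. To stay dependency-free it uses a crude CJK-character ratio as a stand-in for a real language identifier; the function names, route labels, and the 0.2 threshold are hypothetical, and a production router would also need a policy for languages that are neither English nor Chinese.

```python
def looks_chinese(text, threshold=0.2):
    """Crude stand-in for langid: share of CJK ideographs among alphabetic characters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters) >= threshold


def select_routes(query):
    """Return the retrieval routes to fuse for this query (labels are illustrative)."""
    routes = ["bge-m3", "granite-278m", "fts5"]  # the existing 3-way stack
    if not looks_chinese(query):
        routes.append("granite-r2")  # English-only embedder joins as the 4th route
    return routes


print(select_routes("what are the symptoms of diabetes"))
print(select_routes("糖尿病的症狀"))
```

Skipping the route entirely, rather than down-weighting it, keeps the Chinese path identical to the current 3-way behaviour.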

4-way RRF combination (BGE + 278m + r2 + FTS): beats production on every benchmark

Combination	NFCorpus	SciFact	FiQA	ArguAna	Gain vs production (NFCorpus / SciFact / FiQA / ArguAna)
BGE+278m+FTS (current production)	0.337	0.671	0.359	0.410	baseline
BGE+r2+FTS (English upgrade)	0.357	0.712	0.395	0.422	+0.020 / +0.041 / +0.036 / +0.012
BGE+278m+r2+FTS 4-way (recommended)	0.357	0.710	0.402	0.424	+0.020 / +0.039 / +0.043 / +0.014

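③ Reranker Models

Three kinds of reranker were tried: cross-encoders (small, finetuned), LLM-as-judge (zero-shot prompt), and a finetuned proxy (MemSifter-4B-Thinking). Conclusion: the cross-encoders give marginal gains at best, none of the zero-shot LLM judges helps (two hurt, one ties), and the proxy is too slow.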

Reranker	Type	Params	NFCorpus Δ NDCG@10	Latency	Verdict
bge-reranker-v2-m3	Cross-encoder	568M	+0.002	470 ms/q	not worth it
bge-reranker-base	Cross-encoder	278M	-0.038	150 ms/q	hurts
granite-4.0-h-micro Q4	LLM yes/no judge	3B (hybrid Mamba)	-0.098	6.7 s/q	hurts (base model, not finetuned)
granite-4.1-8B Q4	LLM 1-5 score judge	8B (hybrid Mamba)	0.000	3.3 s/q	ties, no effect
granite-h-micro BF16 (mamba-ssm fast)	LLM yes/no judge	3B	-0.056	4.3 s/q	speed fixed, still hurts
SenseNova-U1-8B Q8 (Qwen3 backbone)	LLM yes/no judge	8B (Qwen3)	n/a	n/a	originally meant for image generation
MemSifter-4B-Thinking	RL-trained proxy (listwise)	4B (Qwen3)	not fully tested	115 s/q ❌	too slow, sanity check abnormal

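Why did every zero-shot LLM-as-judge fail?

The prompt "Is this passage relevant?" is too coarse for NFCorpus's specialist medical queries. The model tends to give relevant docs 0.95, borderline docs 0.7 as well, and irrelevant docs 0.05, so noise scrambles the ranking. The MemSifter paper trains its proxy with an RL outcome reward (reward = how well the working LLM completes the task) rather than prompting a base model.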

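Why does the bge-reranker-base cross-encoder hurt?

The 278M model is an English-trained baseline that handles NFCorpus's polysemous terms poorly and pushes correct docs down. bge-reranker-v2-m3 is the larger multilingual version (568M); it is more conservative and yields a slight +0.002, but at 470 ms/q the latency is not worth it.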

④ Full Bench Comparison (5 datasets × 4 stacks)

Stack	NFCorpus (medical)	SciFact (science)	FiQA (finance)	ArguAna (argument)	CmedQA (Chinese medical)	Average gain
Production v3.2 (BGE+278m+FTS)	0.337	0.671	0.359	0.410	0.399	baseline
v3.3 4-way RRF (+ r2, langid)	0.357	0.712	0.402	0.424	0.399	+0.030 (English) / +0.000 (Chinese)
v3.2 + bge-reranker-v2-m3	0.338	n/a	n/a	n/a	n/a	+0.002
v3.2 + LLM-as-judge (3B-8B)	0.287-0.302	n/a	n/a	n/a	n/a	-0.05 ~ -0.10
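
For reference, the metric reported throughout is NDCG@10. The sketch below shows one common way to compute it for a single query, using linear gain and a log2 rank discount. It is an assumption that the benchmark harness uses this exact variant, and the doc ids and relevance labels are made up.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps doc_id -> graded relevance (0 if absent)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical judgments for one query.
qrels = {"d1": 2, "d2": 1}

perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)
demoted = ndcg_at_k(["d3", "d1", "d2"], qrels)
print(f"perfect ordering: {perfect:.3f}")
print(f"relevant docs pushed down one place: {demoted:.3f}")
```

A dataset-level score is the mean of this value over all queries, so a reranker that pushes a few correct documents down shows up directly as a negative Δ NDCG@10.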

⑤ DGX Spark Full RAM Ledger (121 GB)

[Memory bar chart: Qwen 22 GB, Nemotron 21 GB, ASR, other, buff/cache 18 GB, ~46 GB free]

Service	Port	RAM usage	Purpose
Qwen3.6-35B-A3B	:8083	~22 GB	Production LLM (TENET translation)
Nemotron-Nano-Omni	:8094	~21 GB	Vision + LLM (multimodal)
Qwen3-ASR v2	:8100	~3 GB	Speech recognition
VibeVoice realtime	:8099	~3 GB	Voice synthesis
MemSifter v2 (BGE+Granite)	:8200	~1 GB	Retrieval (BGE-M3 + Granite-278m)
FitMatch VLM	:9100	~1 GB	Food image VLM
7 Hermes bot gateways	n/a	~1.5 GB	Discord agent runtime
Chrome / agent proxy / other	n/a	~3 GB	Overhead
buff/cache (kernel)	n/a	~18 GB	OS file cache
Total usage	n/a	~73-75 GB	~46 GB left for bench / dev
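
The total row can be sanity-checked by summing the rounded per-service figures from the table. This is plain arithmetic on the numbers above; the dictionary keys are just labels.

```python
TOTAL_UNIFIED_GB = 121

# Approximate per-service usage in GB, copied from the ledger table.
usage_gb = {
    "Qwen3.6-35B-A3B": 22,
    "Nemotron-Nano-Omni": 21,
    "Qwen3-ASR v2": 3,
    "VibeVoice realtime": 3,
    "MemSifter": 1,
    "FitMatch VLM": 1,
    "Hermes bot gateways": 1.5,
    "Chrome / agent proxy / other": 3,
    "buff/cache (kernel)": 18,
}

used_gb = sum(usage_gb.values())
free_gb = TOTAL_UNIFIED_GB - used_gb
print(f"used ≈ {used_gb} GB, free ≈ {free_gb} GB")
```

The rounded figures sum to the low end of the quoted ~73-75 GB range, so the ~46 GB headroom corresponds to the high end of that range.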

⑥ Production Upgrade Recommendations

🎯 Recommended: MemSifter v3.3 (4-way RRF + langid routing)

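A patch is already written (`/home/waynehsu/.memsifter/_scripts/memsifter_v3_3_hybrid.py`). Benchmarks confirm +0.030 NDCG@10 on English workloads with no effect on Chinese.

Adds Granite-emb-english-r2 as the fourth RRF route
Built-in langid detection: English query → 4-way; Chinese query → stays 3-way (no harm to Chinese)
Adds a /reindex_r2 endpoint so existing namespaces can upgrade smoothly (without re-embedding BGE + 278m)
Storage +50% (one extra .npz file), latency +20-30 ms / query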

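⚠️ Not recommended: adding a reranker

None of the rerankers tested gives a meaningful gain on NFCorpus, and several hurt. The cross-encoder's +0.002 is not worth 470 ms/q, and zero-shot LLM-as-judge failed across the board. An LLM judge would need an RL-finetuned proxy (collect outcome labels from hermes conversations; roughly 1 week of training).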