witcheer PRO

witcheer

https://x.com/witcheer

AI & ML interests

Local AI Maxxing.

Recent Activity

updated a dataset about 18 hours ago

witcheer/rtx-5090-benchmarks

updated a dataset 3 days ago

witcheer/agentic-score-leaderboard

updated a bucket 14 days ago

gemma-challenge/gemma-witcheer

View all activity

Organizations

Posts 3

Post

195

new dataset: which local LLM best *drives an agent*? benchmarked 4 models for pairing with Hermes Agent (@NousResearch ) - a CodeAct agent that writes python to call its tools. RTX 5090, llama.cpp. two phases, hybrid:

>>> phase A (synthetic): scored 4 axes — code-as-action, long-context, instruction-following under Hermes' real ~3.5K-token prompt, multi-step loops. top was a near-tie (within noise): an 18B frankenmerge (Qwopus) edged Qwen3.6-27B, and Hermes' own 36B came LAST.

>>> phase B (real harness): installed Hermes, ran the top 3 through 14 multi-step tasks x3 repeats. the tie broke — and an efficiency gap appeared:

Qwen3.6-27B    100%   | 3.0 turns | 364 tok
Qwopus-18B     85.7%  | 3.6 turns | 870 tok
Nemotron-30B   85.7%  | 4.4 turns | 1334 tok

Qwen is perfect AND 2.4-3.7x more token-efficient — something a synthetic test can't see (only the real agent loop can). verdict: Qwen3.6-27B for local Hermes.

dataset: witcheer/hermes-pairing-bench
collection: witcheer/rtx-5090-benchmark-rig-6a17e365b534abb474250e11

Post

258

new dataset: turboquant KV cache benchmarks for qwen3.6-35B-A3B on RTX 4060 Ti 8GB.

>>> 18 structured runs covering turboquant turbo2/turbo3/turbo4 vs standard q4_0 V cache, context depths 3.5K to 50K, checkpoint modes, and two agent harnesses (hermes vs pi).

>>> novel finding: llama.cpp default context checkpoints (every 8192 tokens, ~63 MiB each) silently accumulate in VRAM and trigger the 8GB cliff at ~26K context. disabling with --checkpoint-every-n-tokens -1 gives smooth degradation instead.

turbo3, 64K ctx: 35.17 tok/s (62 graph splits)
turbo2, 64K ctx: 35.13 tok/s (62 graph splits)
turbo4, 64K ctx: 13.93 tok/s (decompression cliff)
std q4_0, 64K ctx: 31.22 tok/s (82 splits, broken in agents)