Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe


Friday, 19 June 2026

TL;DR (AI)

Qwen3.6-27B on 24GB VRAM: vLLM is configured with a 131k context, prefix caching, and max_num_seqs=1; Hermes disables child agents and speculative decoding for stable multi-turn inference. The setup targets multi-hour agent sessions rather than peak throughput, a practical trade-off for local coding agents that need to avoid KV-cache thrashing and OOM on consumer GPUs.


If you want to reproduce my current local Hermes Agent + Qwen3.6-27B setup, this is the shape I would start from.

Target

One local coding agent.

One 24GB GPU.

Long context.
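
As a rough illustration of how that target maps onto vLLM, here is a minimal sketch that assembles a `vllm serve` command line from the settings named in the summary (131k context, prefix caching, max_num_seqs=1). Several things here are assumptions rather than part of the recipe: the model identifier is a placeholder, 131k is read as 131072 tokens, and the GPU memory fraction is just vLLM's usual default. Quantization is left out because this section does not say which format is used.

```python
import shlex


def build_vllm_serve_args(
    model: str,
    max_model_len: int = 131072,            # assumption: "131k" means 131072 tokens
    gpu_memory_utilization: float = 0.9,    # assumption: not stated in the article
) -> list[str]:
    """Build a `vllm serve` argument list for one agent on one GPU."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),   # long context window
        "--max-num-seqs", "1",                   # one sequence at a time: one agent
        "--enable-prefix-caching",               # reuse the KV cache across turns
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    # "Qwen/Qwen3.6-27B" is a hypothetical model id used only for illustration.
    print(shlex.join(build_vllm_serve_args("Qwen/Qwen3.6-27B")))
```

Keeping max_num_seqs at 1 follows the single-agent target: with only one sequence in flight, the KV-cache budget goes to one long conversation instead of being split across concurrent requests.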

Related reading

Qwen 3.6 35B-A3B for Local AI in 2026: The 24GB VRAM Line That Gets You 120…

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

How to Build a Persistent AI Agent with Hermes in 15 Minutes

Hermes Agent: First Contact

Automating My Content and Dev Pipeline with Local Hermes Agents & Qwen 35B

We Tried 6 Memory Providers for Hermes Agent — Here's What We Learned
