Running OpenAI's gpt-oss-20b with 128k Context on a Single L4 GPU


Tuesday, 19 May 2026

3,158 words, ~14 min read

Alexey Nizhegolenko

DevOps Engineer, AgentOps Engineer, AI Infrastructure Engineer

This is the second article in my series on self-hosting LLMs on GKE. In the first article I covered deploying Gemma 4 26B with a 28,000-token context window. This time I'll show you something more impressive: openai/gpt-oss-20b running with a 128,000-token context on the same single L4 GPU.

The setup has been running in production since November 2025, for about 6 months, with no major incidents. That's the kind of track record worth writing about.

Why gpt-oss-20b?
