Self-hosting LLMs in Europe: real costs, 5 models

APIs like Claude and GPT already solved the cold start problem. They keep models warm, amortize GPU costs across users, and give you sub-second latency. So why would anyone self-host?

Data residency. No rate limits. Full choice of model. And the ability to modify the serving stack itself, like adding KV cache compression to fit more concurrent users on the same GPU. These matter for production workloads. The question is at what volume the economics start working.

I tested five models on Verda GPU cloud in Helsinki to find that threshold. All numbers are from actual test runs.

The models

I tested across a range of sizes, from 1.7B to 235B parameters. All deployed as Verda serverless containers using stock or lightly customized vLLM Docker images. The model weights are cached on a persistent volume so they survive container restarts.

Gemma 4 26B MoE (google/gemma-4-26B-A4B-it). Google's recently released open model. 26 billion total parameters but only 3.8 billion active per token thanks to Mixture-of-Experts architecture. Top-tier on Arena-style evaluations. Apache 2.0. Needs A100 80GB because the weights alone require ~48.5 GiB in BF16.

gpt-oss-20b (openai/gpt-oss-20b). OpenAI's open-weight MoE model, released August 2025. 21 billion total, 3.6 billion active. Ships pre-quantized in 4-bit format (MXFP4), which keeps the on-disk size small. Apache 2.0. Runs on the cheaper L40S GPU.

Qwen3-8B. A solid 8B dense model from the Qwen family. Good baseline for comparison.

Qwen3-1.7B. The smallest model tested. Fast to load, but quality drops noticeably.

Qwen3-235B AWQ. The quality champion from our earlier benchmarks (4.75/5 across 20 multi-turn scenarios). 235 billion parameters, AWQ-quantized to ~120GB on disk. Needs an H200 141GB GPU.

Why serverless GPU does not replace APIs

Serverless GPU containers scale to zero when idle and spin up on demand. In theory, this gives you API-like economics without the API provider. In practice, the cold start kills it.

Cold start is the time from zero replicas running to the first token generated. This includes container image pull, model weight loading from disk to GPU memory, and framework initialization. Even when model weights are cached on a persistent volume, loading tens of gigabytes into GPU memory takes minutes.

Gemma 4 26B MoE on A100 80GB

First boot (image pull + model download + CUDA warm-up): 9.5 minutes
Second boot (model cached, loading 48.5 GiB from disk to GPU): 7.75 minutes

gpt-oss-20b on L40S 48GB

First boot: 3.3 minutes
Second boot (cached): 2.7 minutes

Qwen3-8B on L40S 48GB

First boot: 3 minutes

Qwen3-235B AWQ on H200 141GB

First boot (120GB model download): 13 minutes
Second boot (cached): 5.5 minutes

Even mid-size models take 2-3 minutes. The large ones take 5-13 minutes. An API gives you sub-second latency. This is not a marginal difference. Serverless GPU is a fundamentally different product from an API: it trades latency for cost at low volumes, but it cannot replace an API for interactive use.

Sanity check: response quality

This is not a benchmark. I sent the same three prompts to each model as a functional check: a factual question, a math problem, and a product copywriting task.

Gemma 4 26B MoE

"The capital of Finland is Helsinki."
"17 * 23 = 391"
"Empower your sales team to close more deals with real-time insights and actionable data intelligence that turns complex pipelines into predictable revenue."

gpt-oss-20b

"The capital of Finland is Helsinki."
"17 × 23 = 391."
"Our cloud-based analytics platform delivers real-time, actionable insights through intuitive dashboards, seamless data integration, and AI-powered predictive modeling, empowering teams to make data-driven decisions."

Both produce polished, correct output on these prompts. Gemma 4 is more concise. gpt-oss-20b is more detailed. Both are well above what you might expect from models with under 4 billion active parameters.

I also tested with TurboQuant+ KV cache compression (asymmetric K4/V3). In these tests, the output showed no observable difference from uncompressed. TQ+ compresses the per-conversation memory, which means more concurrent conversations on the same GPU.

Three options, three tradeoffs

There are three ways to run an LLM. Each makes a different tradeoff between latency, cost, and control.

1. Managed APIs (Claude, GPT, Nebius)

Sub-second latency. No infrastructure. Pay per token. For personal use, Claude Pro at $20/month is hard to beat. But monthly subscriptions do not allow serving to third parties. If you are building a product that serves customers, you pay API pricing (per token), which scales linearly with usage. At production volumes, self-hosting starts competing on cost, not just control.

2. Serverless GPU (what I tested)

Scale to zero when idle, pay per 10-minute block. No cost when not in use. Prices observed in April 2026:

Gemma 4 26B on A100: $1.29/hr = $0.22 per session
gpt-oss-20b on L40S: $0.90/hr = $0.15 per session
Qwen3-235B AWQ on H200: $3.39/hr = $0.57 per session

The tradeoff: 2-13 minute cold starts. This is not an API replacement. It is a different product: batch processing, development, internal tools, async workloads where minutes of startup are acceptable. You get full data control and EU residency, but not real-time latency.

3. Always-warm GPU (production self-hosting)

Keep the model loaded and ready. Instant response, same as an API. The tradeoff: you pay for idle time.

Gemma 4 26B on A100 (8hr/day): $310/month
gpt-oss-20b on L40S (8hr/day): $216/month
Qwen3-235B AWQ on H200 (8hr/day): $814/month

Where the break-even is

Always-warm on A100 costs $10.32/day. A serverless session costs $0.22. That is $10.32 / $0.22 = roughly 47 sessions to break even. On L40S: $7.20 / $0.15 = 48 sessions. At around 50 conversations per day, always-warm becomes cheaper than serverless.

For personal use, managed APIs at $20/month are unbeatable. But for products that serve customers, the comparison is against per-token API pricing, not subscription plans. At production volumes, always-warm self-hosting at $216-310/month with unlimited throughput can be cheaper than per-token APIs, while also giving you data residency and full model control.

What I learned getting this working

Use the stock vLLM Docker image. I spent a day building custom Dockerfiles with bash entrypoint scripts. They all crashed silently on Verda. The fix: use docker.io/vllm/vllm-openai:v0.19.0 directly and pass model config as command overrides. The vLLM image handles GPU detection, model downloading, and health checks correctly out of the box.

Build container images on amd64. Podman on macOS Apple Silicon builds arm64 images by default, even when the base image is amd64. These crash silently on x86 servers. I ended up using a $0.06/hour Verda CPU instance as a build server. Builds complete in seconds.

Gemma 4 needs a newer transformers library. The vLLM v0.19.0 pip package ships with transformers 4.57.6, which does not support Gemma 4. A one-line Dockerfile extension fixes it: RUN pip install "transformers>=5.5.0". Also needs python3-dev for CUDA utility compilation.

Use the instruction-tuned model variant. I initially deployed gemma-4-26B-A4B (the base model) and got empty responses. The correct name is gemma-4-26B-A4B-it (instruction-tuned). The base model has no chat template.

The persistent /data volume is essential. Without it, every cold start re-downloads the full model. With it, subsequent boots skip the download. The bottleneck then becomes loading weights from disk into GPU memory, which is still minutes for large models.

The deploy script

The deployment tool is a single Python file. It handles any model configuration:

python containers/deploy.py deploy gemma4-26b-it     # A100, $1.29/hr
python containers/deploy.py deploy gpt-oss-20b       # L40S, $0.90/hr
python containers/deploy.py deploy qwen3-235b-awq    # H200, $3.39/hr
python containers/deploy.py pause gemma4-26b-it       # stop billing
python containers/deploy.py resume gemma4-26b-it      # restart

Adding a new model is one dictionary entry with the model ID, GPU type, and vLLM arguments.

TurboQuant+ KV cache compression

All models were also tested with TurboQuant+ KV cache compression (asymmetric K4/V3). TQ+ compresses the per-conversation memory (the KV cache) by 2-4x. In these tests, the compressed output showed no observable difference from uncompressed responses. More concurrent users on the same GPU at the same cost.

The compression uses Walsh-Hadamard rotation and a precomputed Gaussian codebook. No calibration data needed. It works as a drop-in patch for vLLM. Full write-up: KV cache compression. Weight compression: 60GB to 17GB.

What comes next

Quantizing Gemma 4 26B to AWQ 4-bit would reduce the weight memory from ~48.5 GiB to ~13 GiB. That would cut load time significantly and potentially allow it to run on cheaper GPUs like the L40S, narrowing the cold start gap.

But the fundamental picture is unlikely to change soon. Loading billions of parameters into GPU memory takes time. APIs solve this by keeping models warm and spreading the cost. Self-hosting means you carry that cost yourself. The question is whether data residency, rate limits, or model control justify $216-310/month over $20/month. For some workloads, they do.

All testing on Verda GPU cloud (Helsinki, Finland). EU data residency, 100% renewable energy, ISO 27001. Prices observed in April 2026.
TurboQuant+ library: turboquant-vllm.
Total testing budget: ~$15 across all experiments.

KV cache compression Weight compression turboquant-vllm Verda GPU cloud

More writing →