Switching Open WebUI from Ollama to llama.cpp for Qwen 3.5

22 Mar 2026

I finally got to a local LLM setup that feels pretty usable within a 16 GB VRAM constraint (around 40 tok/s).

As far as I can tell, Open WebUI is still the best open source chat interface for local models. I am running it on a Ryzen 5 5600 box with an RTX 5060 Ti 16 GB card, with Qwen3.5-9B GGUF at Q8_0 as the main local model. I started with Ollama because it is the popular default. For Qwen 3.5 though, I hit a few open issues at the time that made llama.cpp easier for my setup: much slower inference than llama.cpp with the same model, long stalls on later turns in a conversation, and broken /no_think handling for Qwen. It is a very new model, and I am sure the Ollama project will get those fixed.

llama-server exposes an OpenAI-compatible API, so Open WebUI mostly treated it like a config swap.

The more interesting part is that Qwen 3.5’s newer hybrid architecture seems to help a lot here. The 9B model feels much better than I would have expected for the size, and this is the first local setup I have had on this machine that felt worth keeping around.

From what I had been reading, thinking mode also did not seem worth the wait for a setup like this, and that matched what I was seeing. Qwen 3.5 defaults to thinking, but for my use it was usually slower, often added 30-60 seconds, and sometimes got stuck in long thinking loops without noticeably improving the answer. What finally worked reliably here was a newer llama.cpp build with LLAMA_ARG_JINJA=1 and LLAMA_ARG_THINK_BUDGET=0.

Most of the other tuning was about using the remaining VRAM well once the 9B model itself had already taken roughly 9 GB. LLAMA_ARG_N_PARALLEL=1, LLAMA_ARG_FLASH_ATTN=1, q8_0 KV cache, LLAMA_ARG_BATCH_SIZE=4096, and LLAMA_ARG_UBATCH_SIZE=1024 were what got the whole setup to around 14 GB and let me spend the rest on context instead of spilling into system memory. The next size up looked much more likely to spill into RAM. I wanted to see what I could get done with this Nvidia card first, since it still seemed like the safest compatibility bet.

For serious work I would still use frontier hosted models. The local setup is useful enough to keep around for narrower cases, especially redacting or cleaning text before sending it to a cloud model.

Full example gist with the commented minimal two-container compose file:

services: llama-server: image: ghcr.io/ggml-org/llama.cpp:server-cuda-b8369 container_name: llama-server restart: unless-stopped environment: # Model - auto-downloads from Hugging Face on first run. - LLAMA_ARG_HF_REPO=unsloth/Qwen3.5-9B-GGUF:Q8_0 - LLAMA_ARG_N_GPU_LAYERS=99 # VRAM tuning for a 16 GB card. - LLAMA_ARG_CTX_SIZE=131072 - LLAMA_ARG_N_PARALLEL=1 - LLAMA_ARG_FLASH_ATTN=1 - LLAMA_ARG_CACHE_TYPE_K=q8_0 - LLAMA_ARG_CACHE_TYPE_V=q8_0 - LLAMA_ARG_BATCH_SIZE=4096 - LLAMA_ARG_UBATCH_SIZE=1024 # Qwen 3.5 non-thinking mode. - LLAMA_ARG_JINJA=1 - LLAMA_ARG_THINK_BUDGET=0 # Qwen-recommended sampling defaults for non-thinking mode. - LLAMA_ARG_TEMP=0.6 - LLAMA_ARG_TOP_K=20 - LLAMA_ARG_TOP_P=0.95 - LLAMA_ARG_MIN_P=0 # Shared gaming PC - unload after 5 minutes idle. - LLAMA_ARG_SLEEP_IDLE_SECONDS=300 - LLAMA_ARG_PORT=11434 - LLAMA_ARG_HOST=0.0.0.0 volumes: - ./models:/root/.cache/llama.cpp ports: - "11434:11434" deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] open-webui: image: ghcr.io/open-webui/open-webui:v0.8.9 container_name: open-webui restart: unless-stopped ports: - "3000:8080" volumes: - ./open-webui:/app/backend/data environment: - ENABLE_OLLAMA_API=false # llama-server exposes an OpenAI-compatible API. - OPENAI_API_BASE_URLS=http://llama-server:11434/v1 - OPENAI_API_KEYS=no-key - WEBUI_AUTH=true depends_on: - llama-server