Use Coding Agents with On-Premise Inference Services

Introduction

Coding agents such as opencode, Codex CLI, and Claude Code are terminal-based assistants that read your repository, plan changes, edit files, and run commands on your behalf. They normally talk to a hosted model provider over the internet.

This document shows how to point those agents at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise InferenceService that you deploy for any other workload can back an interactive coding agent, as long as it exposes an OpenAI-compatible API and has tool (function) calling enabled. opencode and Codex CLI can call that endpoint directly; Claude Code speaks the Anthropic Messages API (/v1/messages) and needs a lightweight translation proxy (see Claude Code).

This page builds directly on the deployment how-tos. It does not repeat how to create or expose an InferenceService; instead it links to them and focuses on the agent-specific configuration and tuning.

WARNING

Coding agents and their configuration formats evolve quickly. The config snippets below are correct starting points for the versions available at the time of writing. Always confirm field names against the current upstream documentation of the agent you use.

Prerequisites

  • A running, ready InferenceService that serves an OpenAI-compatible API. See Create Inference Service using CLI.
  • Network access from the machine running the agent to the service endpoint. For access from a developer laptop outside the cluster, see Configure External Access for Inference Services.
  • A model with tool/function calling support, served with the matching vLLM parser enabled (see Enable tool calling on the runtime). Without this, agents can chat but cannot edit files or run commands.
  • The agent CLI installed locally (opencode, codex, or claude).
  • For Claude Code, a translation proxy (LiteLLM or claude-code-router) to bridge Claude Code's Anthropic Messages API to the OpenAI-compatible endpoint (see Claude Code).

How the pieces fit together

  opencode / Codex CLI
        │  OpenAI Chat Completions API  (POST /v1/chat/completions)
        ▼
  External access / Load Balancer  ──►  KServe InferenceService (vLLM)

  Claude Code
        │  Anthropic Messages API  (POST /v1/messages)
        ▼
  Translation proxy (LiteLLM / claude-code-router)
        │  OpenAI Chat Completions API  (POST /v1/chat/completions)
        ▼
  same InferenceService endpoint

  • opencode and Codex CLI speak the OpenAI Chat Completions API natively, so they can call the InferenceService endpoint directly.
  • Claude Code speaks the Anthropic Messages API, which vLLM does not serve. It requires a small translation proxy in front of the OpenAI-compatible endpoint (see Claude Code).
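
To make the proxy's role concrete, the following is a minimal sketch of the request translation it performs, assuming text-only messages. The function name anthropic_to_openai is hypothetical and is not part of LiteLLM or claude-code-router; real proxies also translate tool definitions, tool results, streaming events, and the response direction, none of which is shown here.

```python
from typing import Any


def anthropic_to_openai(body: dict[str, Any]) -> dict[str, Any]:
    """Convert a text-only Anthropic Messages request into a Chat Completions request."""
    messages: list[dict[str, str]] = []

    # The Messages API carries the system prompt as a top-level field;
    # Chat Completions expects it as the first entry in `messages`.
    system = body.get("system")
    if isinstance(system, str) and system:
        messages.append({"role": "system", "content": system})

    for msg in body["messages"]:
        content = msg["content"]
        if isinstance(content, list):
            # Content may be a list of typed blocks; this sketch keeps only text blocks.
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})

    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body["max_tokens"],
    }
```

The sketch only illustrates why opencode and Codex CLI need no proxy while Claude Code does: the two request shapes differ even for a plain text prompt.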

Step 1: Deploy and smoke-test the endpoint

Deploy your model as an InferenceService following Create Inference Service using CLI, and if the agent runs outside the cluster, expose it following Configure External Access for Inference Services.

Before wiring up any agent, confirm the endpoint answers a chat request. Coding agents fail in confusing ways if the base URL, model name, or auth is wrong, so validate with curl first:

# BASE_URL must end at /v1
BASE_URL="https://your-inference-service-domain.com/v1"
MODEL="qwen-2"        # must match --served-model-name in the InferenceService
API_KEY="sk-local"    # any non-empty value if the server does not enforce auth

curl -sS "${BASE_URL}/chat/completions" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"${MODEL}"'",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "max_tokens": 16
  }'

A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: base URL (ending in /v1), model name (the --served-model-name), and API key.
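
If you prefer to script the same check, the sketch below mirrors the curl request using only the Python standard library. The function names build_smoke_request and smoke_test are hypothetical helpers for this page, and the sketch assumes the endpoint returns the standard Chat Completions response shape.

```python
import json
import urllib.request


def build_smoke_request(base_url: str, model: str, api_key: str) -> urllib.request.Request:
    """Build the same POST /chat/completions request as the curl command."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 16,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",  # base_url must end at /v1
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def smoke_test(base_url: str, model: str, api_key: str, timeout: float = 30.0) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_smoke_request(base_url, model, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling smoke_test with the three values noted above should return a short reply; an HTTPError or URLError raised by urlopen points to the base URL, model name, or key.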

TIP

For reasoning models (DeepSeek R1, QwQ, Qwen3, etc.), also add the matching --reasoning-parser to the vLLM launch flags. See Configure reasoning models and reasoning effort.

Step 2: Enable tool calling on the runtime

Coding agents work by calling tools (read file, write file, run shell). This requires the model to emit tool calls and vLLM to parse them. Add the following flags to the vLLM launch command in your InferenceService (in the sample from Create Inference Service using CLI, they go on the python3 -m vllm.entrypoints.openai.api_server line):

--enable-auto-tool-choice \
--tool-call-parser hermes        # match the parser to your model family
  • The parser must match the model. For example, Qwen2.5 and QwQ-32B commonly use hermes; Qwen3-Coder uses qwen3_xml; Llama 3.x models use llama3_json; Mistral models use mistral. Check the vLLM tool calling documentation for the current parser list and the value that matches your model.
  • Some models need a specific chat template to emit tool calls correctly; pass --chat-template if the model card calls for it.
  • If you serve a reasoning model, also enable the matching --reasoning-parser so the agent receives clean assistant content separated from reasoning traces.

Verify tool calling end-to-end by asking the agent to perform a trivial file operation (for example, "create hello.txt containing the word hi"). If the model replies in prose instead of editing the file, tool calling is not wired up correctly — recheck the parser and model.
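
You can also check tool calling at the API level before involving an agent. The sketch below builds a Chat Completions request that offers one tool and inspects the response for parsed tool calls. The write_file tool and both helper names are hypothetical and exist only for this probe; real agents define their own tools.

```python
from typing import Any

# Hypothetical tool definition used only for this probe.
WRITE_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Write text content to a file path.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}


def build_tool_probe(model: str) -> dict[str, Any]:
    """Request body that gives the model one tool and a task that needs it."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Create hello.txt containing the word hi."}],
        "tools": [WRITE_FILE_TOOL],
        "tool_choice": "auto",
        "max_tokens": 256,
    }


def called_tools(response: dict[str, Any]) -> list[str]:
    """Return the names of tools the model called, or [] if it replied in prose."""
    message = response["choices"][0]["message"]
    return [call["function"]["name"] for call in (message.get("tool_calls") or [])]
```

POST the body from build_tool_probe to /chat/completions and pass the decoded JSON response to called_tools. An empty list suggests that the parser flag or the chat template does not match the model.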

Step 2b (optional): Configure reasoning models and reasoning effort

Some models (for example, DeepSeek R1, QwQ, Hunyuan, or Cohere Command A Reasoning) emit chain-of-thought reasoning before their final answer. vLLM separates the reasoning traces from the assistant content so your agent receives clean output — but you must enable the matching flags.

Server-side flags

Add --reasoning-parser to your vLLM launch command. If the same model also needs agent tool calls, pair it with the appropriate --tool-call-parser:

--enable-auto-tool-choice \
--tool-call-parser <parser> \
--reasoning-parser <reasoning-parser>

The table below shows common model families and their required parsers. Confirm against the vLLM tool calling documentation for the current list.

| Model family | --tool-call-parser | --reasoning-parser | Notes |
| --- | --- | --- | --- |
| DeepSeek R1 (deepseek-ai/DeepSeek-R1-*) | deepseek_v3 for DeepSeek-R1 tool calling | deepseek_r1 | DeepSeek-R1-0528 tool calling also needs --chat-template examples/tool_chat_template_deepseekr1.jinja |
| QwQ (Qwen/QwQ-32B) | hermes | deepseek_r1 | QwQ uses Hermes-style tool calls and DeepSeek-style reasoning tags |
| Qwen3 reasoning (Qwen/Qwen3-*) | Check the current vLLM docs for the exact Qwen variant | qwen3 | Qwen3 reasoning is enabled by default; disable it with chat_template_kwargs if needed |
| Hunyuan-A13B-Instruct | hunyuan_a13b | hunyuan_a13b | Use both parsers when serving the reasoning mode with tool calls |
| Cohere Command A Reasoning | cohere_command3 | cohere_command3 | Requires the optional cohere_melody package |

For model families not listed above, check the model card for reasoning instructions and the vLLM tool calling documentation for the matching parser pair.
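
If you template your launch command, the parser pairs can be kept in one place. The following sketch copies the pairs from the table above into a lookup and assembles the corresponding flags. The dictionary keys and the parser_flags helper are hypothetical labels for this page, not vLLM identifiers, and the parser values should be re-checked against your vLLM version.

```python
from typing import Optional

# (tool-call parser, reasoning parser) per family, copied from the table above.
PARSERS: dict[str, tuple[Optional[str], str]] = {
    "deepseek-r1": ("deepseek_v3", "deepseek_r1"),
    "qwq": ("hermes", "deepseek_r1"),
    "qwen3": (None, "qwen3"),  # tool parser depends on the exact Qwen variant
    "hunyuan-a13b": ("hunyuan_a13b", "hunyuan_a13b"),
    "command-a-reasoning": ("cohere_command3", "cohere_command3"),
}


def parser_flags(family: str) -> list[str]:
    """Return the vLLM launch flags for one model family."""
    tool_parser, reasoning_parser = PARSERS[family]
    flags = ["--reasoning-parser", reasoning_parser]
    if tool_parser is not None:
        # Tool calling needs both the auto-choice switch and a matching parser.
        flags = ["--enable-auto-tool-choice", "--tool-call-parser", tool_parser] + flags
    return flags
```

For example, " ".join(parser_flags("qwq")) yields the flag string to append to the vLLM launch line for QwQ.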

Configuring reasoning effort and thinking behavior

Reasoning effort controls how much the model "thinks" before answering. For coding agents you typically want low reasoning effort to keep interactive latency acceptable — many short, low-reasoning turns beat a single long, high-reasoning one.

Server-side defaults

vLLM does not expose a generic --reasoning-effort launch flag. Server-wide control is achieved through the model's chat template: you can supply a custom Jinja template that disables thinking by default, then pass it with --chat-template. Alternatively, some models and vLLM versions expose per-model template kwargs; check the vLLM release notes for the specific key.

Request-time controls

Do not assume every vLLM-backed InferenceService accepts reasoning_effort. Support depends on the vLLM version, OpenAI-compatible server implementation, model, and chat template. If the service rejects unknown request fields, reasoning_effort can fail even when the model itself supports reasoning.

Prefer model-specific controls that your deployed vLLM service documents. For example, Qwen3-style templates commonly use chat_template_kwargs to enable or disable thinking:

{
  "model": "Qwen/Qwen3-8B",
  "messages": [{"role": "user", "content": "..."}],
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}

When using the OpenAI Python client, pass vLLM-specific request fields through extra_body:

client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "..."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

For parsers that support an explicit thinking budget, you can also cap reasoning tokens per request:

{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "..."}],
  "thinking_token_budget": 256
}

When using a translation proxy (LiteLLM or claude-code-router), confirm the proxy version passes through these vLLM/OpenAI extension fields before relying on them.

Only use reasoning_effort after you verify that your exact vLLM image and model template accept it. On supported deployments, it can be sent as a top-level Chat Completions field such as "reasoning_effort": "low"; on unsupported deployments, use chat_template_kwargs, thinking_token_budget, or max_tokens instead.
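
The choice between these controls can be sketched as a small helper that applies exactly one of them to a request body. The helper name, the mode labels, and the 1024-token cap are assumptions made for illustration; which mode works depends on your vLLM image and chat template, as described above.

```python
from typing import Any


def with_reasoning_control(body: dict[str, Any], *, mode: str) -> dict[str, Any]:
    """Return a copy of `body` with one reasoning control applied.

    `mode` is a hypothetical label chosen for this sketch:
      "template_kwargs"  -> disable thinking through chat_template_kwargs
      "reasoning_effort" -> send reasoning_effort="low" (verified deployments only)
      "max_tokens"       -> fall back to capping total output tokens
    """
    out = dict(body)  # shallow copy so the caller's dict is left untouched
    if mode == "template_kwargs":
        out["chat_template_kwargs"] = {"enable_thinking": False}
    elif mode == "reasoning_effort":
        out["reasoning_effort"] = "low"
    elif mode == "max_tokens":
        # 1024 is an arbitrary illustrative cap, not a recommended value.
        out["max_tokens"] = min(body.get("max_tokens", 1024), 1024)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```

Keeping the control in one function makes it easy to switch modes after you have confirmed what the deployed service accepts.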

Step 3: Connect your coding agent

opencode

opencode reads configuration from opencode.json in the project root or ~/.config/opencode/opencode.json. Define a custom OpenAI-compatible provider that points at your endpoint:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "onprem": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "On-Prem Alauda AI",
      "options": {
        "baseURL": "https://your-inference-service-domain.com/v1",
        "apiKey": "{env:ONPREM_API_KEY}"
      },
      "models": {
        "qwen-2": {
          "name": "Qwen2.5-Coder (on-prem)"
        }
      }
    }
  }
}
  • The model key (qwen-2) must match the --served-model-name of the InferenceService.
  • Export the key the config references, then select the model: export ONPREM_API_KEY=sk-local and choose onprem/qwen-2 with the /models command inside opencode.

Codex CLI

Codex CLI reads ~/.codex/config.toml. Register your endpoint as a model provider and select it:

model = "qwen-2"
model_provider = "onprem"

[model_providers.onprem]
name = "On-Prem Alauda AI"
base_url = "https://your-inference-service-domain.com/v1"
env_key = "ONPREM_API_KEY"
wire_api = "chat"
  • base_url must end at /v1; model must match the --served-model-name.
  • env_key names the environment variable that holds the API key: export ONPREM_API_KEY=sk-local.
  • Use wire_api = "chat" for vLLM's OpenAI Chat Completions API.
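
Because opencode and Codex CLI share the same base URL and model name, both config files can be rendered from one set of values. The sketch below reproduces the two snippets above as strings. The function names are hypothetical, and the TOML rendering assumes the values contain no quotes or backslashes.

```python
import json


def opencode_config(base_url: str, model: str, display_name: str) -> str:
    """Render opencode.json with a custom OpenAI-compatible provider."""
    config = {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "onprem": {
                "npm": "@ai-sdk/openai-compatible",
                "name": "On-Prem Alauda AI",
                "options": {"baseURL": base_url, "apiKey": "{env:ONPREM_API_KEY}"},
                # The model key must match --served-model-name.
                "models": {model: {"name": display_name}},
            }
        },
    }
    return json.dumps(config, indent=2)


def codex_config(base_url: str, model: str) -> str:
    """Render ~/.codex/config.toml (values must not contain quotes or backslashes)."""
    lines = [
        f'model = "{model}"',
        'model_provider = "onprem"',
        "",
        "[model_providers.onprem]",
        'name = "On-Prem Alauda AI"',
        f'base_url = "{base_url}"',
        'env_key = "ONPREM_API_KEY"',
        'wire_api = "chat"',
    ]
    return "\n".join(lines) + "\n"
```

Write the returned strings to opencode.json and ~/.codex/config.toml respectively; generating both from the same inputs avoids the two files drifting apart.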

Claude Code

Claude Code communicates over the Anthropic Messages API (/v1/messages), while your InferenceService exposes an OpenAI-compatible endpoint (/v1/chat/completions). Bridge the two by running a translation proxy in front of your endpoint. Two common options:

  • LiteLLM proxy, which exposes an Anthropic-compatible /v1/messages endpoint and routes to any backend model.
  • claude-code-router, a proxy built specifically to point Claude Code at OpenAI-compatible and other backends.

Both approaches handle the API translation for you. Pick whichever fits your workflow — LiteLLM is more general-purpose, while claude-code-router is tailored to Claude Code's needs.

Option 1: LiteLLM proxy

Start the LiteLLM proxy, pointing it at your InferenceService endpoint:

litellm --model openai/qwen-2 \
  --api_base https://your-inference-service-domain.com/v1 \
  --port 4000

This exposes http://localhost:4000/v1/messages (Anthropic format) and forwards requests to your OpenAI-compatible backend.

Then point Claude Code at the proxy:

export ANTHROPIC_BASE_URL="http://127.0.0.1:4000"
export ANTHROPIC_AUTH_TOKEN="not_set"
export ANTHROPIC_API_KEY="not_set_either!"
export ANTHROPIC_MODEL="qwen-2"

export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000

claude

Option 2: claude-code-router

Create a config file at ~/.claude-code-router/config.json with your InferenceService as a provider:

{
  "Providers": [
    {
      "name": "onprem",
      "api_base_url": "https://your-inference-service-domain.com/v1/chat/completions",
      "api_key": "sk-local",
      "models": ["qwen-2"]
    }
  ],
  "Router": {
    "default": "onprem,qwen-2"
  }
}

Then start Claude Code through the router:

ccr code

The router automatically sets the required ANTHROPIC_BASE_URL and other environment variables — no manual export needed. The model is selected by the Router.default field in the config (format: provider_name,model_name). You can also activate the router in your shell first with eval "$(ccr activate)" and then run claude directly. Inside a running session, switch models with /model provider_name,model_name.

Notes for on-premise operation

  • The ANTHROPIC_AUTH_TOKEN / ANTHROPIC_API_KEY values (used with the LiteLLM option) must be non-empty but their content does not matter if your proxy and endpoint do not check them; gate access at the endpoint or proxy (see Manage gateways for adding auth via Envoy AI Gateway).
  • The CLAUDE_CODE_* variables shown in the LiteLLM option are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still send non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) that the on-prem model cannot honor. claude-code-router sets some of these automatically.
  • ANTHROPIC_MODEL must match the model name your InferenceService exposes (the --served-model-name).
  • Optionally set ANTHROPIC_SMALL_FAST_MODEL to an on-prem model so background/low-cost requests stay on-prem too.
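
For the LiteLLM option, the environment can also be assembled in a launcher script instead of being exported by hand. The sketch below builds the same variables as the export block in that option, plus the optional ANTHROPIC_SMALL_FAST_MODEL; the helper name claude_code_env is hypothetical.

```python
import os
from typing import Mapping, Optional


def claude_code_env(
    proxy_url: str,
    model: str,
    base: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Return an environment for launching Claude Code against a local proxy."""
    env = dict(os.environ if base is None else base)
    env.update({
        "ANTHROPIC_BASE_URL": proxy_url,
        # Placeholder credentials: non-empty, content ignored if nothing checks them.
        "ANTHROPIC_AUTH_TOKEN": "not_set",
        "ANTHROPIC_API_KEY": "not_set_either",
        "ANTHROPIC_MODEL": model,            # must match --served-model-name
        "ANTHROPIC_SMALL_FAST_MODEL": model,  # optional: keeps background requests on-prem
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
        "CLAUDE_CODE_ENABLE_TELEMETRY": "0",
        "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
        "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "64000",
    })
    return env
```

A launcher could then run subprocess.run(["claude"], env=claude_code_env("http://127.0.0.1:4000", "qwen-2")), which leaves the calling shell's environment unchanged.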

Claude Code's agentic quality depends heavily on the served model's tool-calling fidelity — prefer a strong instruction- and tool-tuned model, and confirm tool calls round-trip end-to-end before relying on it.

Best practices

Qwen3.6 and Gemma 4 are the two model families we currently recommend for on-premise coding agents. Both have strong instruction tuning and a wide range of sizes and quantization formats available; verify tool-calling parser support against the vLLM version you run.

| Family | Why it works for coding agents | vLLM --tool-call-parser |
| --- | --- | --- |
| Qwen3.6 (Qwen team) | Strong code generation and instruction following. MoE variants (35B-A3B) activate only ~3B parameters per token, giving high throughput at low per-token compute cost. | Check vLLM tool calling docs |
| Gemma 4 (Google) | Clean instruction tuning, compact sizes (E2B, E4B) that fit on consumer GPUs. Verify tool-calling support in the vLLM version you run; Gemma's parser assignment may vary by vLLM release. | Check vLLM tool calling docs |
| Qwen3-Coder (Qwen team) | Code-specialized; the MoE variants (30B-A3B, 480B-A35B) are powerful but require more hardware. | qwen3_xml |

Choose a model that fits your hardware

Start from the GPU memory you have, then pick the largest capable model that leaves headroom for the KV cache. A rough weight-size estimate is parameters × bytes-per-parameter — FP16 ≈ 2 bytes, FP8/INT8 ≈ 1 byte, INT4 ≈ 0.5 bytes per parameter — on top of which the KV cache and runtime overhead consume more memory. Leave 15–25% headroom.
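
The estimate above is simple enough to script. The sketch below applies parameters × bytes-per-parameter and checks the result against a GPU size with headroom reserved. The function names and the 20% default are illustrative choices within the 15–25% range; the estimate covers weights only and ignores KV cache and runtime overhead, which the headroom is meant to absorb.

```python
# Bytes per parameter for the precisions named above.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB: parameters x bytes-per-parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(params_billion: float, dtype: str, gpu_gb: float, headroom: float = 0.20) -> bool:
    """True if the weights fit while leaving `headroom` of GPU memory free."""
    return weight_gb(params_billion, dtype) <= gpu_gb * (1.0 - headroom)


print(weight_gb(27, "fp16"))   # 54.0
print(fits(27, "fp16", 80))    # True:  54 GB <= 64 GB usable
print(fits(27, "fp16", 48))    # False: 54 GB >  38.4 GB usable
print(fits(27, "int4", 24))    # True:  13.5 GB <= 19.2 GB usable
```

For a 27B-parameter model this gives roughly 54 GB at FP16 and 13.5 GB at INT4, which is why quantization changes which GPU tier a model fits on.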

Quantized models from Unsloth on HuggingFace

Unsloth publishes GGUF-quantized versions of many recent models on Hugging Face. GGUF support in vLLM is experimental, so test a GGUF build against your vLLM version before depending on it. The table below lists the most useful ones for coding agents:

| Model | Format | Active params | VRAM (approx.) | Notes |
| --- | --- | --- | --- | --- |
| unsloth/gemma-4-E2B-it-qat-GGUF | GGUF (QAT) | 2B | ~4 GB | Fastest option; fits on any GPU |
| unsloth/gemma-4-E4B-it-qat-GGUF | GGUF (QAT) | 4B | ~8 GB | Strong tool-calling at low cost |
| unsloth/gemma-4-12b-it-GGUF | GGUF | 12B | ~16 GB | Good balance of speed and quality |
| unsloth/gemma-4-26B-A4B-it-GGUF | GGUF (MoE) | 4B active | ~12 GB | MoE: high quality, low active VRAM |
| unsloth/gemma-4-31B-it-GGUF | GGUF | 31B | ~40 GB | Largest Gemma 4 dense model |
| unsloth/Qwen3.6-27B-GGUF | GGUF | 27B | ~36 GB | Strong general-purpose coding |
| unsloth/Qwen3.6-27B-MTP-GGUF | GGUF (MTP) | 27B | ~36 GB | Multi-token prediction for faster decode |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF | GGUF (MoE+MTP) | 3B active | ~12 GB | Best quality/cost ratio; MoE + MTP |

Note: GGUF-quantized models load in vLLM via --quantization gguf. For AWQ or GPTQ INT4 variants, search huggingface.co/models for qwen3.6 AWQ or gemma-4 GPTQ to find community-quantized versions. Unsloth's QAT (quantization-aware training) models typically retain higher quality at aggressive bit-widths than models quantized after training.

Hardware fit guide

| GPU memory (single GPU) | Example GPUs | Recommended model |
| --- | --- | --- |
| 8–16 GB | T4, RTX 4070 | gemma-4-E2B or gemma-4-E4B (QAT GGUF) |
| 16–24 GB | L4, A10, A30 (24G), RTX 4090 | gemma-4-12B or Qwen3.6-35B-A3B (MoE, 3B active) |
| 40–48 GB | A40, L40S, A6000 | Qwen3.6-27B or gemma-4-31B (GGUF) |
| 80 GB | A100-80G, H100, H800 | Qwen3.6-27B at FP16, or gemma-4-31B at FP16 |
| Multi-GPU (2–8×) | 2–8 × 80 GB | Qwen3-Coder-480B-A35B (MoE, tensor-parallel) |

Additional selection guidance:

  • Prefer code-specialized, instruction-tuned models that natively support tool/function calling. If the model card does not mention tool calling, the agent will not be able to edit files reliably.
  • Confirm a matching vLLM parser exists for the model (see Enable tool calling on the runtime) before committing to it. Qwen3-Coder models use qwen3_xml; verify Qwen3.6 and Gemma 4 parser support in the vLLM docs for your version.
  • Budget for context length. Coding agents send large prompts (system prompt + file and repo context). Pick a model whose context window covers your largest expected prompt, and remember that a longer --max-model-len consumes more KV cache per request, reducing concurrency.
  • Quantization is a force multiplier on-premise. INT4 (AWQ/GPTQ) or GGUF quantization lets you fit a noticeably more capable model in the same VRAM, which usually matters more for agent quality than raw FP16 precision.
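  • MoE models are especially efficient at inference time. Qwen3.6-35B-A3B and Gemma 4-26B-A4B activate only 3–4B parameters per token while carrying a larger knowledge base, giving near-dense quality at a fraction of the per-token compute. All expert weights must still be resident in GPU memory, so size VRAM for the total parameter count, not the active one.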
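
To put numbers on the context-length bullet, the sketch below estimates KV-cache memory for a standard attention layout, where each token stores one key and one value vector per layer and per KV head. The layer count, KV head count, and head dimension used in the example are hypothetical values, not those of any model named on this page; read the real ones from the model's config.json. Models with sliding-window, latent, or hybrid attention will differ.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """Bytes of KV cache per token: 2 (K and V) x layers x kv_heads x head_dim x element size."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(
    context_tokens: int,
    concurrent_requests: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_elem: float = 2.0,
) -> float:
    """KV cache in GiB needed to hold `concurrent_requests` full-length contexts."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * context_tokens * concurrent_requests / 1024**3


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
print(kv_bytes_per_token(32, 8, 128))        # 131072.0 bytes (128 KiB) per token
print(kv_cache_gib(32768, 1, 32, 8, 128))    # 4.0 GiB for one 32K-token request
print(kv_cache_gib(32768, 4, 32, 8, 128))    # 16.0 GiB for four concurrent requests
```

With these assumed dimensions, each 32K-token request reserves about 4 GiB, so four concurrent full-length requests need about 16 GiB on top of the weights.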

Tune inference service performance

Coding-agent traffic has a distinctive shape: long, highly repetitive prompts (the same system prompt and repo context resent every turn), bursts of short interactive requests, and sensitivity to first-token latency. Tune for it:

  • Enable prefix caching (--enable-prefix-caching). This is the single highest-impact flag for coding agents: the shared prompt prefix is reused across turns instead of being recomputed, cutting prefill cost and latency dramatically. See Automatic Prefix Caching — vLLM.
  • Raise --gpu-memory-utilization toward 0.90–0.95 to enlarge the KV cache, which increases concurrency and the context length you can sustain.
  • Right-size --max-model-len. Set it to the largest context the agent actually needs, not the model's theoretical maximum — every extra token of capacity costs KV-cache memory.
  • Enable chunked prefill (--enable-chunked-prefill) when long prompts cause latency spikes under concurrency, so decode steps are not starved by a large prefill. Note that the CLI sample disables it.
  • Allow CUDA graphs for steady-state latency. The CLI sample sets ENFORCE_EAGER=True (eager mode, which starts faster but runs slower). Once the service is stable, turn eager mode off so that vLLM captures CUDA graphs, at the cost of a longer startup.
  • Tune batching with --max-num-seqs and --max-num-batched-tokens to balance throughput against per-request latency for your concurrency level.
  • Use FP8 KV cache (--kv-cache-dtype fp8) to stretch context length and concurrency when memory is tight.
  • Shard large models across GPUs with --tensor-parallel-size when a model does not fit on one card.
  • Consider speculative decoding for lower interactive latency on agent loops — see Speculative Decoding for vLLM Inference Services.
  • Mind autoscaling and cold starts. For interactive single-user agent use, keep minReplicas: 1 — scaling from zero adds a multi-minute cold start that is painful mid-task. For bursty multi-developer usage, configure autoscaling deliberately; see Configure Scaling for Inference Services and Set Up Autoscaling for Inference Services with KEDA.
  • Allow long requests. Agent turns can be long-running; size the Knative serving.knative.dev/progress-deadline annotation and your client timeouts accordingly. If requests are cut off, see Inference timeout troubleshooting.
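
Several of the flags in this list can be collected into one launch profile. The sketch below assembles them into an argument list. The helper name and its defaults are illustrative assumptions rather than recommended values, and it covers only the flags named above.

```python
def agent_tuning_flags(
    max_model_len: int,
    gpu_memory_utilization: float = 0.90,
    tensor_parallel_size: int = 1,
    fp8_kv_cache: bool = False,
    chunked_prefill: bool = False,
) -> list[str]:
    """Build vLLM launch flags for coding-agent traffic from the options above."""
    flags = [
        "--enable-prefix-caching",  # reuse the shared prompt prefix across turns
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        "--max-model-len", str(max_model_len),
    ]
    if chunked_prefill:
        flags.append("--enable-chunked-prefill")
    if fp8_kv_cache:
        flags += ["--kv-cache-dtype", "fp8"]
    if tensor_parallel_size > 1:
        flags += ["--tensor-parallel-size", str(tensor_parallel_size)]
    return flags
```

For example, agent_tuning_flags(32768, fp8_kv_cache=True) returns the prefix-caching, memory, context-length, and FP8 KV-cache flags, ready to append to the vLLM launch command in the InferenceService.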

Getting started with vibe coding

"Vibe coding" — iterating quickly by describing intent and letting the agent write the code — works well with a self-hosted model once the basics are right:

  1. Start with a Qwen3.6 or Gemma 4 model that fits comfortably on your GPU with headroom; a responsive smaller model beats a sluggish larger one for interactive flow. For 24 GB GPUs, Qwen3.6-35B-A3B (MoE) is an excellent starting point.
  2. Set a low temperature (around 0–0.2) for code generation to keep edits deterministic and reduce flailing.
  3. Validate tool calling with one trivial task ("create a file and run it") before attempting anything real.
  4. Keep prompts focused — open or reference only the relevant files so the agent's context stays on-topic and prefill stays cheap.
  5. Work in small, reviewable steps and read each diff before accepting it. Commit often so you can roll back a bad suggestion cleanly.

Getting started with MLOps

Because the model runs inside your cluster, a coding agent backed by an on-premise InferenceService is a good fit for operating the platform itself — your manifests, configs, and proprietary code never leave the environment, which matters in regulated settings. Productive starting tasks:

  • Generate or modify InferenceService YAML — for example, "write an InferenceService for model X targeting a 24 GB GPU with prefix caching and tool calling enabled."
  • Add autoscaling, scheduling, or resource configuration — KEDA/KPA autoscaling, CUDA-version-aware scheduling, or Kueue/Volcano queueing.
  • Author and adjust pipelines and monitoring for your model lifecycle.
  • Close the loop: deploy a model with the agent, then use that same on-premise model to drive further platform operations.

For detailed MLOps workflows — managing InferenceServices, configuring gateways, tuning performance iteratively, and planning fine-tuning runs — see Run MLOps with Coding Agents and On-Premise LLMs.

Troubleshooting

  • Agent chats but never edits files or runs commands. Tool calling is not enabled or the parser does not match the model — see Enable tool calling on the runtime.
  • model not found / 404. The model name in the agent config does not match the --served-model-name, or the base URL does not end in /v1.
  • 401 / 403. The agent is sending the wrong (or no) API key for what the endpoint or gateway expects.
  • Requests time out on long tasks. Increase the Knative progress-deadline annotation and the client timeout — see Inference timeout troubleshooting.
  • First request after idle is very slow. The service scaled to zero and is cold-starting; set minReplicas: 1 for interactive use.
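
When you script the smoke test, the first three symptoms above can be mapped to hints automatically. The sketch below is a minimal lookup from an HTTP status or a timeout to the corresponding advice; the function name and messages are illustrative and paraphrase the list above.

```python
from typing import Optional

# Hints paraphrased from the troubleshooting list above.
HINTS = {
    404: "Check that the model name matches --served-model-name and the base URL ends in /v1.",
    401: "Check the API key the endpoint or gateway expects.",
    403: "Check the API key the endpoint or gateway expects.",
}


def diagnose(status: Optional[int] = None, timed_out: bool = False) -> str:
    """Map an HTTP status or a client timeout to a troubleshooting hint."""
    if timed_out:
        return "Increase the Knative progress-deadline annotation and the client timeout."
    if status in HINTS:
        return HINTS[status]
    if status == 200:
        return "Endpoint reachable; if the agent still cannot edit files, check tool calling."
    return "Unrecognized status; inspect the InferenceService logs."
```

For example, diagnose(404) returns the hint about the model name and base URL, and diagnose(timed_out=True) returns the timeout hint.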

References