Client sends `thinking: true` to enable reasoning tokens. Default remains
disabled for instant streaming.
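
A minimal sketch of how the opt-in flag could map onto the upstream request, assuming `thinking` is translated into `enable_thinking` inside `chat_template_kwargs`. The type and function names (`ClientRequest`, `toUpstreamRequest`) are illustrative, not Axon's actual code.

```typescript
// Hypothetical request shapes; only `thinking`, `enable_thinking` and
// `chat_template_kwargs` are taken from the commit messages.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ClientRequest {
  model: string;
  messages: ChatMessage[];
  stream?: boolean;
  thinking?: boolean;
}

interface UpstreamRequest {
  model: string;
  messages: ChatMessage[];
  stream?: boolean;
  chat_template_kwargs: { enable_thinking: boolean };
}

// Reasoning is enabled only when the client explicitly sends `thinking: true`;
// anything else (missing, false) keeps it disabled.
function toUpstreamRequest(req: ClientRequest): UpstreamRequest {
  const { thinking, ...rest } = req;
  return {
    ...rest,
    chat_template_kwargs: { enable_thinking: thinking === true },
  };
}
```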
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Qwen3-coder generates hundreds of `reasoning` tokens before `content`
tokens, causing 10+ second perceived delay. The reasoning tokens stream
through Axon but the ChatWidget only renders `delta.content`, so users
see a long pause then a burst. Passing `enable_thinking: false` via
chat_template_kwargs skips the reasoning phase entirely.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3-turn conversations passed at ~9120 chars but 4-turn conversations failed at
~10640, so the WAF anomaly threshold lies between those values. Lowered all
limits to keep multi-turn conversations well under the threshold.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAF anomaly scoring accumulates across the entire request body. After 2-3 turns,
assistant responses containing infrastructure terms (security, scanning, etc.)
push the total past the threshold. Added per-assistant trim (1500 chars) and a
12000-char sliding window that drops oldest messages.
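
The trimming could look roughly like this. It is a sketch, not the actual Axon code: the limits are the ones quoted in this message, while `fitToWindow`, keeping system messages out of the drop order, and always keeping the newest message are assumptions made for illustration.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

const MAX_ASSISTANT_CHARS = 1500; // per-assistant trim
const MAX_TOTAL_CHARS = 12000; // sliding window over the whole request

function totalChars(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + m.content.length, 0);
}

function fitToWindow(messages: ChatMessage[]): ChatMessage[] {
  // Step 1: cap every assistant message.
  const trimmed = messages.map((m) =>
    m.role === "assistant" && m.content.length > MAX_ASSISTANT_CHARS
      ? { ...m, content: m.content.slice(0, MAX_ASSISTANT_CHARS) }
      : m,
  );

  // Step 2: drop the oldest non-system messages until the total fits.
  // Assumption: system messages are never dropped and the newest message stays.
  const system = trimmed.filter((m) => m.role === "system");
  const rest = trimmed.filter((m) => m.role !== "system");
  while (rest.length > 1 && totalChars(system) + totalChars(rest) > MAX_TOTAL_CHARS) {
    rest.shift();
  }
  return [...system, ...rest];
}
```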
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vLLM requires system messages to be at the beginning. When Axon merges
conversation history with new messages, duplicate system messages cause
a 400 error. Strip all but the first system message.
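
A sketch of the de-duplication step. The helper name is hypothetical, and moving the surviving system message to the front is an assumption based on vLLM's ordering requirement described above.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep only the first system message and place it at the start;
// all other messages keep their relative order.
function keepFirstSystemMessage(messages: ChatMessage[]): ChatMessage[] {
  const firstSystem = messages.find((m) => m.role === "system");
  const others = messages.filter((m) => m.role !== "system");
  return firstSystem ? [firstSystem, ...others] : others;
}
```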
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The vLLM backend at Bank Dhofar runs behind an Istio/Envoy WAF with
ModSecurity-style anomaly scoring. The ChatWidget's 41KB system prompt
accumulates enough infrastructure/security keywords to trigger a 403.
Trim system messages to 6000 chars (70% head + 30% tail) before
forwarding to vLLM — preserves identity/behavior instructions at the
start and FAQ/response guidelines at the end.
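
The head/tail split can be sketched as follows. The function name is illustrative, and the sketch simply concatenates the two slices without a truncation marker, which may differ from the real implementation.

```typescript
const MAX_SYSTEM_CHARS = 6000;
const HEAD_RATIO = 0.7; // 70% from the start, remaining 30% from the end

function trimSystemPrompt(text: string, maxChars: number = MAX_SYSTEM_CHARS): string {
  if (text.length <= maxChars) return text;
  const headLen = Math.round(maxChars * HEAD_RATIO);
  const tailLen = maxChars - headLen;
  // Keep the opening instructions and the closing guidelines, drop the middle.
  return text.slice(0, headLen) + text.slice(text.length - tailLen);
}
```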
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clients (e.g. ChatWidget) send OpenAI model names like gpt-4o-mini which
vLLM doesn't recognize. The provider now queries available models on
startup and remaps any unrecognized name to the configured default.
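
A sketch of the remapping, assuming the available models are read from an OpenAI-style model list response (`{ data: [{ id }] }`). `modelIds` and `resolveModel` are hypothetical names.

```typescript
// Assumed shape of the model list returned by the backend.
interface ModelList {
  data: { id: string }[];
}

function modelIds(list: ModelList): Set<string> {
  return new Set(list.data.map((m) => m.id));
}

// Pass known names through; remap anything unrecognized to the default.
function resolveModel(
  requested: string,
  available: ReadonlySet<string>,
  fallback: string,
): string {
  return available.has(requested) ? requested : fallback;
}
```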
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces a provider abstraction so Axon can proxy to either Claude SDK
(existing behavior) or a vLLM-compatible endpoint. Toggled via
AXON_PROVIDER env var ("claude" | "vllm"). With "vllm", requests pass
through as-is (no prompt translation), and the session pool and OAuth are skipped.
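
A sketch of the toggle. Defaulting to "claude" when the variable is unset is an assumption (it matches "existing behavior"), and `resolveProvider` is a hypothetical helper.

```typescript
type ProviderName = "claude" | "vllm";

// `env` would typically be process.env; it is passed in to keep the sketch testable.
function resolveProvider(env: Record<string, string | undefined>): ProviderName {
  const raw = env.AXON_PROVIDER ?? "claude"; // assumed default
  if (raw === "claude" || raw === "vllm") return raw;
  throw new Error(`Unsupported AXON_PROVIDER: ${raw}`);
}
```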
Closes openova-io/openova#36
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip refresh gracefully when .credentials.json doesn't exist (e.g. CI
smoke test with no Claude auth mounted).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Claude Agent SDK does not refresh OAuth tokens. Axon now:
1. Refreshes the token on startup before creating session pool
2. Runs a periodic refresh every 4 hours
3. Writes refreshed credentials to disk so session subprocesses use them
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Claude Agent SDK does not handle OAuth token refresh. Adds a CronJob
(every 4h) that refreshes the token via Anthropic's OAuth endpoint and
updates the K8s secret. Disabled by default.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Claude Agent SDK does not handle OAuth token refresh — it reads the
accessToken from .credentials.json and uses it directly. When the token
expires (~8h), Axon returns 401 until manually refreshed.
Adds a CronJob (every 4h by default) that:
1. Reads the refreshToken from the K8s secret
2. Calls Anthropic's OAuth token endpoint to get a fresh accessToken
3. Updates the K8s secret with the new credentials
4. Restarts the Axon deployment to pick up the new token
Includes ServiceAccount, Role, and RoleBinding for least-privilege access.
Disabled by default (axon.tokenRefresh.enabled: false).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The credentials were mounted as a read-only K8s secret subPath. When the
Claude SDK refreshed the OAuth token, it couldn't persist the new token
back to disk. On pod restart, the stale expired token was loaded again,
causing 401 auth failures.
Fix: initContainer copies credentials from secret to a writable emptyDir
volume. The SDK can now refresh tokens and persist them within the pod
lifecycle. Also creates the debug/ directory the SDK requires.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Valkey was crash-looping (372 restarts) because the 521MB RDB exceeded
the 512Mi memory limit. Adds maxmemory and maxmemory-policy args to
the valkey deployment template with configurable defaults.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add thinking, effort, profile fields to ChatCompletionRequest
- Add chatV1() and chatV1Stream() using query() with persistSession=false
- Route to V1 when thinking/effort params present or profile='deep'
- V2 session pool unchanged; V1 runs stateless with native systemPrompt
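
The routing rule can be sketched as below. The field types are not stated in this message, so they are left as `unknown`, and `pickRoute` is an illustrative name.

```typescript
// Only the three fields relevant to routing are modelled here.
interface ChatCompletionRequest {
  thinking?: unknown;
  effort?: unknown;
  profile?: string;
}

type Route = "v1" | "v2";

// V1 (stateless) when thinking/effort are present or profile is 'deep';
// otherwise the V2 session pool handles the request.
function pickRoute(req: ChatCompletionRequest): Route {
  const wantsV1 =
    req.thinking !== undefined ||
    req.effort !== undefined ||
    req.profile === "deep";
  return wantsV1 ? "v1" : "v2";
}
```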
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous retryStrategy returned null once times > 5, permanently destroying
the ioredis client after 5 failed reconnects. After an idle period the TCP
connection drops, all 5 retries fail, and every subsequent command throws
'Connection is closed'.
Changes:
- retryStrategy now retries indefinitely (max 30s interval), so the connection
  is always restored when Valkey comes back
- 'end' event handler restarts the client if ioredis somehow stops retrying
- getValkey() returns null when client.status is 'end'/'close' so callers
  skip persistence gracefully instead of throwing
- maxRetriesPerRequest: 3 kept: commands fail fast, background reconnect
  handles recovery
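
A sketch of the two pure pieces of this change. ioredis keeps reconnecting as long as `retryStrategy` returns a number (the delay in ms). The 30s cap comes from this message; the 500ms step and the `usableClient` helper are assumptions.

```typescript
const MAX_RETRY_DELAY_MS = 30_000; // cap from the commit message
const RETRY_STEP_MS = 500; // assumed backoff step

// Always returns a delay, never null, so the client keeps reconnecting.
function retryStrategy(times: number): number {
  return Math.min(times * RETRY_STEP_MS, MAX_RETRY_DELAY_MS);
}

// Mirrors the getValkey() guard: hand back null for a closed client so that
// callers can skip persistence instead of throwing.
function usableClient<T extends { status: string }>(client: T): T | null {
  return client.status === "end" || client.status === "close" ? null : client;
}
```

These would be wired in through the ioredis `retryStrategy` option alongside `maxRetriesPerRequest: 3`.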
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sessions whose Claude CLI subprocess has exited (idle > MAX_IDLE_MS) are
recycled in acquire() rather than returned. This prevents all-stale-pool
scenarios that caused WriteRecsActivity/ExtractIntentActivity to fail with
'Connection is closed' after Axon sits idle overnight.
- Added lastUsed: number to PoolEntry, set on warmup and release
- acquire() skips idle entries older than 5 min, recycles each one
- release() stamps lastUsed so the TTL resets on every successful use
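
A simplified, synchronous sketch of the idle check. `PoolEntry.lastUsed` and the 5-minute limit come from this message. The `busy` flag, the injected clock, and recycling a stale entry in place before handing it out are assumptions for illustration.

```typescript
const MAX_IDLE_MS = 5 * 60 * 1000;

interface PoolEntry<S> {
  session: S;
  busy: boolean;
  lastUsed: number;
}

class SessionPool<S> {
  private readonly entries: PoolEntry<S>[] = [];
  private readonly create: () => S;
  private readonly now: () => number;

  constructor(create: () => S, size: number, now: () => number = () => Date.now()) {
    this.create = create;
    this.now = now;
    for (let i = 0; i < size; i++) {
      // Warmup stamps lastUsed.
      this.entries.push({ session: create(), busy: false, lastUsed: now() });
    }
  }

  acquire(): PoolEntry<S> | null {
    for (const entry of this.entries) {
      if (entry.busy) continue;
      if (this.now() - entry.lastUsed > MAX_IDLE_MS) {
        // Stale: the subprocess is assumed dead, so replace the session.
        entry.session = this.create();
        entry.lastUsed = this.now();
      }
      entry.busy = true;
      return entry;
    }
    return null; // every entry is in use
  }

  release(entry: PoolEntry<S>): void {
    entry.busy = false;
    entry.lastUsed = this.now(); // idle timer resets on every successful use
  }
}
```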
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-add 2-3 word chunk splitting with 25-60ms delays that was lost during
the includePartialMessages refactor. Fixes the "10s wait then dump" UX.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Set includePartialMessages: true on SDK sessions so stream() emits
SDKPartialAssistantMessage (stream_event) carrying content_block_delta
events. chatStream() now yields actual token text as it is generated
instead of waiting for the complete response and fake-streaming it
with word-splits and delays.
This gives true token-by-token TTFT (~200ms first token) rather than
the previous 3-8s wait for the full response before any text appeared.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Claude Agent SDK reuses sessions across conversations. When the
full system prompt was re-sent on subsequent turns wrapped in
[System instructions] tags, Claude flagged it as a prompt injection
attempt. Switch to XML-style tags (<context>, <conversation>) that
Claude recognises as structured prompt sections. Add <new_conversation/>
boundary marker to isolate reused sessions from prior context.
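
One possible layout for the tagged prompt. Only the tag names are taken from this message. Their order, the `role: content` line format, and the `formatPrompt` signature are assumptions.

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

function formatPrompt(systemPrompt: string, history: Turn[], userMessage: string): string {
  const conversation = history.map((t) => `${t.role}: ${t.content}`).join("\n");
  return [
    "<new_conversation/>", // boundary marker for a reused session
    `<context>\n${systemPrompt}\n</context>`,
    `<conversation>\n${conversation}\n</conversation>`,
    userMessage,
  ].join("\n\n");
}
```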
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Claude Agent SDK yields complete assistant messages rather than
individual token deltas. This change splits the full text into 2-3
word groups and yields them as separate SSE chunks with small random
delays (25-60ms), giving a natural typing experience on the client.
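
A sketch of the splitting and pacing. The 2-3 word groups and 25-60ms delays come from this message. The function names and the injectable random source are illustrative.

```typescript
// Split text into groups of 2 or 3 words, keeping trailing whitespace so that
// joining the groups reproduces the original text.
function splitIntoWordGroups(text: string, rng: () => number = Math.random): string[] {
  const words = text.match(/\S+\s*/g) ?? [];
  const groups: string[] = [];
  let i = 0;
  while (i < words.length) {
    const size = rng() < 0.5 ? 2 : 3;
    groups.push(words.slice(i, i + size).join(""));
    i += size;
  }
  return groups;
}

// Random delay in the inclusive range 25..60 ms.
function randomDelayMs(rng: () => number = Math.random): number {
  return 25 + Math.floor(rng() * 36);
}

// Yield each group, pausing briefly between them.
async function* streamInChunks(text: string): AsyncGenerator<string> {
  for (const group of splitIntoWordGroups(text)) {
    yield group;
    await new Promise<void>((resolve) => setTimeout(resolve, randomDelayMs()));
  }
}
```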
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SDK examples (Python, Node.js), API reference, model aliases,
streaming, conversations, self-hosting instructions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Traces: convLookup, formatPrompt, acquire, send, firstMsg,
stream, release, convStore — logged per request for profiling.
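
A sketch of a per-request tracer that records the time spent in each stage. The stage names are the ones listed above. The `RequestTrace` class and its injectable clock are hypothetical.

```typescript
class RequestTrace {
  private readonly durations: Record<string, number> = {};
  private readonly now: () => number;
  private last: number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.last = now();
  }

  // Record the time elapsed since the previous mark under the given stage name.
  mark(stage: string): void {
    const t = this.now();
    this.durations[stage] = t - this.last;
    this.last = t;
  }

  // Lets the trace be passed straight to a JSON logger.
  toJSON(): Record<string, number> {
    return { ...this.durations };
  }
}
```

Usage would be one `mark()` call after each stage (`convLookup`, `formatPrompt`, `acquire`, and so on), followed by a single log line per request.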
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K8s doesn't set HOME from Dockerfile USER directive. Mount
credential file at subpath to preserve debug/ directory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K8s runAsNonRoot requires numeric UID. Pin to 1001 in both
Containerfile and Helm chart deployment template.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Agent SDK writes debug logs to ~/.claude/debug/ which must
exist before session creation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Helm chart for deploying Axon LLM gateway with Valkey backing store,
Traefik ingress with TLS, and Claude auth volume mount.
CI workflow builds container image on push to products/axon/ and pushes
SHA-pinned tags to GHCR.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite Axon SaaS LLM gateway with three core changes:
1. Session pool acquire/release pattern — sessions stay alive and are
reused across requests instead of killed after one use. Turn counting
with automatic recycling after 200 turns.
2. Valkey-backed conversation store — all conversation state (messages,
metadata, TTL) lives in Valkey, not filesystem. Sessions are stateless
workers; any session can serve any conversation.
3. 100% OpenAI /v1/chat/completions compatibility — accepts every OpenAI
request parameter (temperature, top_p, stop, frequency_penalty,
presence_penalty, logit_bias, logprobs, seed, tools, tool_choice,
response_format, stream_options, max_completion_tokens, user, store,
metadata). Response shape matches OpenAI exactly: chatcmpl-* id,
system_fingerprint, logprobs:null, refusal:null, usage chunk in
streaming. OpenAI model names (gpt-4o, gpt-4) auto-mapped to Claude.
Axon extension: conversation_id field for multi-turn conversations
backed by Valkey with 7-day TTL. GET /v1/conversations/:id for history.
Includes E2E test suite (67 tests, scripts/e2e-test.sh).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>