# Live Model Testing
This guide describes how to verify WebTest AI's model router, session workflows, discovery, and guarded maintenance against real model providers.
Unit tests use fake transports so they are deterministic and safe. Live model testing is a separate smoke layer for provider compatibility and model behavior.
## Provider Options
- **Ollama** - Local model runner. WebTest AI talks to it through the `ollama` adapter at `http://127.0.0.1:11434`. This is the best first path for open-source and local-first testing.
- **Moonshot API** - Official hosted Kimi API from Moonshot AI. Use this when you want full hosted Kimi performance without local hardware. Test it through an OpenAI-compatible or future dedicated Moonshot profile, depending on endpoint support.
- **OpenRouter** - Hosted gateway for many model families. Use this for quick cross-model comparison with one API key. WebTest AI should use the `openrouter-compatible` adapter profile.
## Recommended Real-Model Matrix
Start with these Chinese/open-source reasoning and agentic models:
| Model | Provider path | Why test it |
|---|---|---|
| `qwen3:8b` | Ollama | Strong open Chinese reasoning and agentic baseline. |
| `deepseek-r1:8b` | Ollama | Reasoning-focused Chinese open model. |
| `kimi-k2:1t-cloud` | Ollama cloud model | Agentic/coding-oriented Kimi K2 compatibility check. |
| `kimi-k2-thinking:cloud` | Ollama cloud model | Thinking/agentic Kimi path; most relevant for discovery/maintenance sessions. |
| `glm4` | Ollama | Chinese multilingual/general reasoning compatibility check. |
| `yi:9b` | Ollama | Chinese/English bilingual model compatibility check. |
Useful model references:
- Qwen3: https://ollama.com/library/qwen3
- DeepSeek-R1: https://ollama.com/library/deepseek-r1
- Kimi K2: https://ollama.com/library/kimi-k2
- Kimi K2 Thinking: https://ollama.com/library/kimi-k2-thinking
- GLM4: https://ollama.com/library/glm4
- Yi: https://ollama.com/library/yi
## Ollama Setup
Install and start Ollama, then pull the models you want to test:
```shell
ollama serve
ollama pull qwen3:8b
ollama pull deepseek-r1:8b
ollama pull kimi-k2:1t-cloud
ollama pull kimi-k2-thinking:cloud
ollama pull glm4
ollama pull yi:9b
```
Large/cloud-tagged models may require network access, an Ollama account, or provider-side availability. If a pull fails, skip that row and record the provider error.
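Before starting a run, you can check which matrix models are actually pulled. Ollama's HTTP API lists installed models at `GET /api/tags`; the sketch below works against that response shape offline (`MATRIX`, `normalizeTag`, and `missingModels` are illustrative helpers, not part of WebTest AI):

```javascript
// Sketch: compare the recommended matrix against an Ollama /api/tags response,
// which has the shape { models: [{ name: "qwen3:8b", ... }, ...] }.
const MATRIX = [
  "qwen3:8b",
  "deepseek-r1:8b",
  "kimi-k2:1t-cloud",
  "kimi-k2-thinking:cloud",
  "glm4",
  "yi:9b"
];

// Ollama lists untagged pulls as "<model>:latest".
function normalizeTag(name) {
  return name.endsWith(":latest") ? name.slice(0, -":latest".length) : name;
}

function missingModels(tagsResponse, wanted = MATRIX) {
  const installed = new Set((tagsResponse.models || []).map((m) => normalizeTag(m.name)));
  return wanted.filter((name) => !installed.has(normalizeTag(name)));
}

// Example with only two matrix models pulled:
console.log(missingModels({ models: [{ name: "qwen3:8b" }, { name: "glm4:latest" }] }));
// → [ 'deepseek-r1:8b', 'kimi-k2:1t-cloud', 'kimi-k2-thinking:cloud', 'yi:9b' ]
```

Fetch the live listing with `curl http://127.0.0.1:11434/api/tags` and pass the parsed JSON in; any model reported missing maps to a skipped matrix row.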
## Local Config Template
For the recommended first local smoke test, use the checked-in Qwen3 config at `examples/config/ollama-qwen3.config.json`.
The `ollama` adapter disables model thinking for structured JSON calls when a profile declares `"reasoning": true`. This keeps Qwen3-style reasoning traces out of machine-readable responses while still documenting that the model is reasoning-capable.
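The adapter's internals are not shown here, but the problem it solves is easy to illustrate: reasoning models such as Qwen3 and DeepSeek-R1 often emit a `<think>…</think>` trace before the JSON payload. A minimal post-processing sketch, assuming that trace format (`extractJson` is an illustrative helper, not the adapter's API):

```javascript
// Sketch: recover strict JSON from a reasoning-model response by stripping
// <think>...</think> traces, then parsing the outermost object. Illustrative
// fallback only, not the ollama adapter's actual implementation.
function extractJson(raw) {
  const stripped = raw.replace(/<think>[\s\S]*?<\/think>/g, "");
  const start = stripped.indexOf("{");
  const end = stripped.lastIndexOf("}");
  if (start === -1 || end <= start) return null; // non-JSON response
  try {
    return JSON.parse(stripped.slice(start, end + 1));
  } catch (error) {
    return null;
  }
}

console.log(extractJson('<think>Plan the answer.</think>{"ok":true}'));
// → { ok: true }
```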
To compare other models, copy `examples/config/ollama-qwen3.config.json` to a temporary local file and change only the profile model value.
Generic temporary config shape:
```json
{
  "models": {
    "activeProfile": "live-model",
    "profiles": {
      "live-model": {
        "provider": "ollama",
        "model": "qwen3:8b",
        "endpoint": "http://127.0.0.1:11434",
        "apiKeyEnv": null,
        "capabilities": {
          "structuredJson": true,
          "reasoning": true,
          "toolCalling": false,
          "streaming": false,
          "vision": false
        },
        "limits": {
          "timeoutMs": 240000,
          "retries": 0,
          "maxInputBytes": 120000,
          "maxOutputTokens": 1024,
          "maxSessionTurns": 3
        }
      }
    },
    "writePolicy": {
      "roots": ["specs", "artifacts", ".webtest-ai"],
      "extensions": [".md", ".json", ".js"]
    }
  }
}
```
For each model in the matrix, change only `models.profiles.live-model.model`.
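Instead of hand-editing copies, the swap can be scripted. A minimal sketch, assuming the generic config shape above with a `live-model` profile (`withModel` is an illustrative helper, not part of WebTest AI):

```javascript
// Sketch: derive a per-model live config without mutating the base config.
function withModel(baseConfig, model) {
  const copy = JSON.parse(JSON.stringify(baseConfig)); // deep copy
  copy.models.profiles["live-model"].model = model;
  return copy;
}

const base = {
  models: {
    activeProfile: "live-model",
    profiles: { "live-model": { provider: "ollama", model: "qwen3:8b" } }
  }
};

console.log(withModel(base, "deepseek-r1:8b").models.profiles["live-model"].model);
// → deepseek-r1:8b
console.log(base.models.profiles["live-model"].model); // base is unchanged
// → qwen3:8b
```

Write the derived object to a temporary file with `fs.writeFileSync` and pass that path via `--config`.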
## Smoke 1: Router JSON Completion
This verifies the adapter can call the model and parse structured JSON.
```shell
node - <<'NODE'
const { complete } = require("./src/models/router");
const config = require("./examples/config/ollama-qwen3.config.json");

complete({
  config,
  purpose: "live.router_smoke",
  messages: [
    { role: "system", content: "Return strict JSON only." },
    { role: "user", content: "Return {\"ok\":true,\"steps\":[\"Open \\\"/\\\"\"]}." }
  ],
  modelCalls: []
}).then((result) => {
  console.log(JSON.stringify({
    success: result.success,
    status: result.status,
    provider: result.provider,
    model: result.model,
    output: result.output,
    warnings: result.warnings,
    error: result.error
  }, null, 2));
}).catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
NODE
```
Pass criteria:
- `success: true`
- `status: "ok"`
- `output.ok === true`
Common failures:
- `disabled`: `activeProfile` is missing or null.
- `error` with a provider message: Ollama is not running, the model is not pulled, or the provider timed out.
- `empty`: the model returned non-JSON. Increase prompt strictness, timeout, or retry count.
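When running the whole matrix, these outcomes can be triaged mechanically. A sketch against the result shape printed by the smoke script, assuming the failure mode surfaces via `status`, `error`, or a missing `output` (`triageRouterResult` is an illustrative helper):

```javascript
// Sketch: map a router smoke result to the outcomes described above.
function triageRouterResult(result) {
  if (result.status === "disabled") return "disabled";
  if (result.error) return "error: " + result.error;
  if (!result.output) return "empty";
  return result.success && result.output.ok === true ? "pass" : "fail";
}

console.log(triageRouterResult({ success: true, status: "ok", output: { ok: true } }));
// → pass
console.log(triageRouterResult({ status: "ok", output: null }));
// → empty
```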
## Smoke 2: Discovery Workflow
This verifies the session layer and the `webtest-ai discover` command.
Start the demo site in another terminal:
```shell
npm run demo:site
```
Then run:
```shell
node ./src/cli/index.js discover \
  --url http://127.0.0.1:4010 \
  --config ./examples/config/ollama-qwen3.config.json \
  --dry-run
```
Pass criteria:
- `Discovery Status: proposal` with one or more proposed flows, or
- `Discovery Status: no-op` without provider errors.
`no-op` means the integration worked but the model did not propose a useful flow.
## Smoke 3: Guarded Maintenance
This verifies that model-proposed writes stay policy-gated. Use a safe target under `artifacts` first:
```shell
node - <<'NODE'
const path = require("path");
const { runMaintenanceWorkflow } = require("./src/autonomy/maintenance");
const config = require("./examples/config/ollama-qwen3.config.json");

runMaintenanceWorkflow({
  config,
  baseDir: process.cwd(),
  targetPaths: ["artifacts/live-maintenance/proposal.md"],
  apply: false,
  context: {
    instruction: "Propose a tiny markdown test note only. Do not include secrets."
  }
}).then((result) => {
  console.log(JSON.stringify({
    status: result.status,
    success: result.success,
    proposedWrites: result.proposedWrites.map((write) => ({
      path: write.path,
      allowed: write.allowed,
      reason: write.reason
    })),
    blockedWrites: result.blockedWrites,
    warnings: result.warnings
  }, null, 2));
}).catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
NODE
```
Pass criteria:
- `status: "proposal"` or `status: "no-op"`
- no write is applied, because `apply: false`
- proposed writes include `allowed: true` only for paths covered by `models.writePolicy`
Do not use `apply: true` until proposal behavior has been reviewed.
## Recording Results
Track each model with:
- model name
- provider path
- router smoke status
- discovery status and number of proposed flows
- maintenance status and blocked/applied write counts
- average latency or timeout behavior
- JSON reliability notes
Example:
| Model | Router | Discovery | Maintenance | Notes |
|---|---|---|---|---|
| `qwen3:8b` | ok | proposal, 2 flows | proposal, 1 allowed write | Good JSON after one retry. |
| `deepseek-r1:8b` | empty | not run | not run | Returned reasoning text before JSON. |
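Rows like these can be generated from per-model tracking records. A small sketch (`toMatrixRow` and the record fields are illustrative, not a WebTest AI API):

```javascript
// Sketch: render one tracking record as a markdown row of the results table.
function toMatrixRow(record) {
  return `| ${record.model} | ${record.router} | ${record.discovery} | ${record.maintenance} | ${record.notes} |`;
}

console.log(toMatrixRow({
  model: "qwen3:8b",
  router: "ok",
  discovery: "proposal, 2 flows",
  maintenance: "proposal, 1 allowed write",
  notes: "Good JSON after one retry."
}));
// → | qwen3:8b | ok | proposal, 2 flows | proposal, 1 allowed write | Good JSON after one retry. |
```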
## Safety Notes
- Do not store API keys in config files. Use `apiKeyEnv`.
- Do not commit temporary live-model configs unless they contain no secrets.
- Keep `apply: false` for first maintenance runs.
- Use narrow `writePolicy.roots` for auto-maintenance experiments.
- Prompts and model responses are intentionally excluded from `modelCalls` telemetry.