# Live Model Testing
This guide describes how to verify WebTest AI's model router, session workflows, discovery, and guarded maintenance against real model providers.
Unit tests use fake transports so they are deterministic and safe. Live model testing is a separate smoke layer for provider compatibility and model behavior.
## Provider Options
- **Ollama** - Local model runner. WebTest AI talks to it through the `ollama` adapter at `http://127.0.0.1:11434`. This is the best first path for open-source and local-first testing.
- **Moonshot API** - Official hosted Kimi API from Moonshot AI. Use this when you want full hosted Kimi performance without local hardware. Test it through an OpenAI-compatible or future dedicated Moonshot profile, depending on endpoint support.
- **OpenRouter** - Hosted gateway for many model families. Use this for quick cross-model comparison with one API key. WebTest AI should use the `openrouter-compatible` adapter profile.
## Recommended Real-Model Matrix
Start with these Chinese/open-source reasoning and agentic models:
| Model | Provider path | Why test it |
|---|---|---|
| `qwen3:8b` | Ollama | Strong open Chinese reasoning and agentic baseline. |
| `deepseek-r1:8b` | Ollama | Reasoning-focused Chinese open model. |
| `kimi-k2:1t-cloud` | Ollama cloud model | Agentic/coding-oriented Kimi K2 compatibility check. |
| `kimi-k2-thinking:cloud` | Ollama cloud model | Thinking/agentic Kimi path; most relevant for discovery/maintenance sessions. |
| `glm4` | Ollama | Chinese multilingual/general reasoning compatibility check. |
| `yi:9b` | Ollama | Chinese/English bilingual model compatibility check. |
Useful model references:
- Qwen3: https://ollama.com/library/qwen3
- DeepSeek-R1: https://ollama.com/library/deepseek-r1
- Kimi K2: https://ollama.com/library/kimi-k2
- Kimi K2 Thinking: https://ollama.com/library/kimi-k2-thinking
- GLM4: https://ollama.com/library/glm4
- Yi: https://ollama.com/library/yi
## Ollama Setup
Install and start Ollama, then pull the models you want to test:
```shell
ollama serve
ollama pull qwen3:8b
ollama pull deepseek-r1:8b
ollama pull kimi-k2:1t-cloud
ollama pull kimi-k2-thinking:cloud
ollama pull glm4
ollama pull yi:9b
```
Large/cloud-tagged models may require network access, an Ollama account, or provider-side availability. If a pull fails, skip that row and record the provider error.
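Before starting a run, you can check which matrix models are actually pulled. Ollama's HTTP API lists installed models at `GET /api/tags`; the sketch below works against that response shape offline (`MATRIX`, `normalizeTag`, and `missingModels` are illustrative helpers, not part of WebTest AI):

```javascript
// Sketch: compare the recommended matrix against an Ollama /api/tags response,
// which has the shape { models: [{ name: "qwen3:8b", ... }, ...] }.
const MATRIX = [
  "qwen3:8b",
  "deepseek-r1:8b",
  "kimi-k2:1t-cloud",
  "kimi-k2-thinking:cloud",
  "glm4",
  "yi:9b"
];

// Ollama lists untagged pulls as "<model>:latest".
function normalizeTag(name) {
  return name.endsWith(":latest") ? name.slice(0, -":latest".length) : name;
}

function missingModels(tagsResponse, wanted = MATRIX) {
  const installed = new Set((tagsResponse.models || []).map((m) => normalizeTag(m.name)));
  return wanted.filter((name) => !installed.has(normalizeTag(name)));
}

// Example with only two matrix models pulled:
console.log(missingModels({ models: [{ name: "qwen3:8b" }, { name: "glm4:latest" }] }));
// → [ 'deepseek-r1:8b', 'kimi-k2:1t-cloud', 'kimi-k2-thinking:cloud', 'yi:9b' ]
```

Fetch the live listing with `curl http://127.0.0.1:11434/api/tags` and pass the parsed JSON in; any model reported missing maps to a skipped matrix row.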
## Local Config Template
For the recommended first local smoke test, use the checked-in Qwen3 config at `examples/config/ollama-qwen3.config.json`.
The `ollama` adapter disables model thinking for structured JSON calls when a profile declares `"reasoning": true`. This keeps Qwen3-style reasoning traces out of machine-readable responses while still documenting that the model is reasoning-capable.
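The adapter's internals are not shown here, but the problem it solves is easy to illustrate: reasoning models such as Qwen3 and DeepSeek-R1 often emit a `<think>…</think>` trace before the JSON payload. A minimal post-processing sketch, assuming that trace format (`extractJson` is an illustrative helper, not the adapter's API):

```javascript
// Sketch: recover strict JSON from a reasoning-model response by stripping
// <think>...</think> traces, then parsing the outermost object. Illustrative
// fallback only, not the ollama adapter's actual implementation.
function extractJson(raw) {
  const stripped = raw.replace(/<think>[\s\S]*?<\/think>/g, "");
  const start = stripped.indexOf("{");
  const end = stripped.lastIndexOf("}");
  if (start === -1 || end <= start) return null; // non-JSON response
  try {
    return JSON.parse(stripped.slice(start, end + 1));
  } catch (error) {
    return null;
  }
}

console.log(extractJson('<think>Plan the answer.</think>{"ok":true}'));
// → { ok: true }
```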
To compare other models, copy `examples/config/ollama-qwen3.config.json` to a temporary local file and change only the profile model value.
Generic temporary config shape:
```json
{
  "models": {
    "activeProfile": "live-model",
    "profiles": {
      "live-model": {
        "provider": "ollama",
        "model": "qwen3:8b",
        "endpoint": "http://127.0.0.1:11434",
        "apiKeyEnv": null,
        "capabilities": {
          "structuredJson": true,
          "reasoning": true,
          "toolCalling": false,
          "streaming": false,
          "vision": false
        },
        "limits": {
          "timeoutMs": 240000,
          "retries": 0,
          "maxInputBytes": 120000,
          "maxOutputTokens": 1024,
          "maxSessionTurns": 3
        }
      }
    },
    "writePolicy": {
      "roots": ["specs", "artifacts", ".webtest-ai"],
      "extensions": [".md", ".json", ".js"]
    }
  }
}
```
For each model in the matrix, change only `models.profiles.live-model.model`.
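Instead of hand-editing copies, the swap can be scripted. A minimal sketch, assuming the generic config shape above with a `live-model` profile (`withModel` is an illustrative helper, not part of WebTest AI):

```javascript
// Sketch: derive a per-model live config without mutating the base config.
function withModel(baseConfig, model) {
  const copy = JSON.parse(JSON.stringify(baseConfig)); // deep copy
  copy.models.profiles["live-model"].model = model;
  return copy;
}

const base = {
  models: {
    activeProfile: "live-model",
    profiles: { "live-model": { provider: "ollama", model: "qwen3:8b" } }
  }
};

console.log(withModel(base, "deepseek-r1:8b").models.profiles["live-model"].model);
// → deepseek-r1:8b
console.log(base.models.profiles["live-model"].model); // base is unchanged
// → qwen3:8b
```

Write the derived object to a temporary file with `fs.writeFileSync` and pass that path via `--config`.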
## Smoke 1: Router JSON Completion
This verifies the adapter can call the model and parse structured JSON.
```shell
node - <<'NODE'
const { complete } = require("./src/models/router");
const config = require("./examples/config/ollama-qwen3.config.json");

complete({
  config,
  purpose: "live.router_smoke",
  messages: [
    { role: "system", content: "Return strict JSON only." },
    { role: "user", content: "Return {\"ok\":true,\"steps\":[\"Open \\\"/\\\"\"]}." }
  ],
  modelCalls: []
}).then((result) => {
  console.log(JSON.stringify({
    success: result.success,
    status: result.status,
    provider: result.provider,
    model: result.model,
    output: result.output,
    warnings: result.warnings,
    error: result.error
  }, null, 2));
}).catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
NODE
```
Pass criteria:
- `success: true`
- `status: "ok"`
- `output.ok === true`
Common failures:
- `disabled`: `activeProfile` is missing or null.
- `error` with a provider message: Ollama is not running, the model is not pulled, or the provider timed out.
- `empty`: the model returned non-JSON. Increase prompt strictness, timeout, or retry count.
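When running the whole matrix, these outcomes can be triaged mechanically. A sketch against the result shape printed by the smoke script, assuming the failure mode surfaces via `status`, `error`, or a missing `output` (`triageRouterResult` is an illustrative helper):

```javascript
// Sketch: map a router smoke result to the outcomes described above.
function triageRouterResult(result) {
  if (result.status === "disabled") return "disabled";
  if (result.error) return "error: " + result.error;
  if (!result.output) return "empty";
  return result.success && result.output.ok === true ? "pass" : "fail";
}

console.log(triageRouterResult({ success: true, status: "ok", output: { ok: true } }));
// → pass
console.log(triageRouterResult({ status: "ok", output: null }));
// → empty
```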
## Smoke 2: Discovery Workflow
This verifies the session layer and the `webtest-ai discover` command.
Start the demo site in another terminal:
```shell
npm run demo:site
```
Then run:
```shell
node ./src/cli/index.js discover \
  --url http://127.0.0.1:4010 \
  --config ./examples/config/ollama-qwen3.config.json \
  --dry-run
```
Pass criteria:
- `Discovery Status: proposal` with one or more proposed flows, or
- `Discovery Status: no-op` without provider errors.
`no-op` means the integration worked but the model did not propose a useful flow.
## Smoke 3: Guarded Maintenance
This verifies that model-proposed writes stay policy-gated. Use a safe target under `artifacts` first:
```shell
node - <<'NODE'
const path = require("path");
const { runMaintenanceWorkflow } = require("./src/autonomy/maintenance");
const config = require("./examples/config/ollama-qwen3.config.json");

runMaintenanceWorkflow({
  config,
  baseDir: process.cwd(),
  targetPaths: ["artifacts/live-maintenance/proposal.md"],
  apply: false,
  context: {
    instruction: "Propose a tiny markdown test note only. Do not include secrets."
  }
}).then((result) => {
  console.log(JSON.stringify({
    status: result.status,
    success: result.success,
    proposedWrites: result.proposedWrites.map((write) => ({
      path: write.path,
      allowed: write.allowed,
      reason: write.reason
    })),
    blockedWrites: result.blockedWrites,
    warnings: result.warnings
  }, null, 2));
}).catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
NODE
```
Pass criteria:
- `status: "proposal"` or `status: "no-op"`
- no write is applied, because `apply: false`
- proposed writes include `allowed: true` only for paths covered by `models.writePolicy`
Do not use `apply: true` until proposal behavior has been reviewed.
## Recording Results
Track each model with:
- model name
- provider path
- router smoke status
- discovery status and number of proposed flows
- maintenance status and blocked/applied write counts
- average latency or timeout behavior
- JSON reliability notes
Example:
| Model | Router | Discovery | Maintenance | Notes |
|---|---|---|---|---|
| `qwen3:8b` | ok | proposal, 2 flows | proposal, 1 allowed write | Good JSON after one retry. |
| `deepseek-r1:8b` | empty | not run | not run | Returned reasoning text before JSON. |
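Rows like these can be generated from per-model tracking records. A small sketch (`toMatrixRow` and the record fields are illustrative, not a WebTest AI API):

```javascript
// Sketch: render one tracking record as a markdown row of the results table.
function toMatrixRow(record) {
  return `| ${record.model} | ${record.router} | ${record.discovery} | ${record.maintenance} | ${record.notes} |`;
}

console.log(toMatrixRow({
  model: "qwen3:8b",
  router: "ok",
  discovery: "proposal, 2 flows",
  maintenance: "proposal, 1 allowed write",
  notes: "Good JSON after one retry."
}));
// → | qwen3:8b | ok | proposal, 2 flows | proposal, 1 allowed write | Good JSON after one retry. |
```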
## Safety Notes
- Do not store API keys in config files. Use `apiKeyEnv`.
- Do not commit temporary live-model configs unless they contain no secrets.
- Keep `apply: false` for first maintenance runs.
- Use narrow `writePolicy.roots` for auto-maintenance experiments.
- Prompts and model responses are intentionally excluded from `modelCalls` telemetry.