```shell
python -m gripprobe.cli --root . validate
python -m gripprobe.cli --root . run --shell gptme --model local/qwen2.5:7b --backend ollama
```
`--backend` is selected explicitly at runtime and defaults to `ollama`.
This avoids an ambiguous backend choice when a model spec defines multiple backends.
You can run GripProbe itself inside Docker while keeping Ollama outside the container.
Build:
```shell
docker compose build
```
Validate:
```shell
docker compose run --rm gripprobe python3 -m gripprobe.cli --root . validate
```
Run the default suite against an external Ollama endpoint:
```shell
OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run-suite
```
If runtime probes should be collected from the Ollama host over SSH, also set:
```shell
GRIPPROBE_OLLAMA_SSH_TARGET=ollama-host
```
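Conceptually, such a probe is a short non-interactive SSH invocation against the target host. The sketch below is illustrative only; `probe_argv` and `remote_probe` are hypothetical names, not GripProbe internals:

```python
import subprocess

# Hedged sketch: collecting a runtime probe over SSH from the host named by
# GRIPPROBE_OLLAMA_SSH_TARGET. BatchMode avoids hanging on a password prompt
# in unattended runs.
def probe_argv(target: str, command: str) -> list[str]:
    return ["ssh", "-o", "BatchMode=yes", target, command]

def remote_probe(target: str, command: str, timeout: int = 30) -> str:
    result = subprocess.run(probe_argv(target, command),
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout
```

For example, `remote_probe(target, "cat /proc/loadavg")` would fetch a load-average snapshot, assuming key-based SSH access is already configured.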
The compose file mounts:
- `/work`
- `~/.continue` read-only for continue-cli
- `~/.config/opencode` read-only for opencode
- `~/.config/gptme` read-only for gptme
- `~/.ssh` read-only for remote host probes

The compose service also exports:

- `GRIPPROBE_CONTINUE_CONFIG=/tmp/gripprobe-home/.continue/config.yaml`
- `GRIPPROBE_OPENCODE_CONFIG=/tmp/gripprobe-home/.config/opencode/opencode.json`
- `HOME=/tmp/gripprobe-home` inside the container

By default, the service runs as `${UID}:${GID}` (falling back to `1000:1000`), so files in the mounted `results/` directory are created as your host user.
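As a rough picture of how these mounts and exports fit together, a compose service along these lines would match the description above. This is an illustrative sketch only; consult the repository's actual compose file for the real definition:

```yaml
# Illustrative sketch -- not the repository's actual docker-compose.yml.
services:
  gripprobe:
    build: .
    user: "${UID:-1000}:${GID:-1000}"
    environment:
      HOME: /tmp/gripprobe-home
      GRIPPROBE_CONTINUE_CONFIG: /tmp/gripprobe-home/.continue/config.yaml
      GRIPPROBE_OPENCODE_CONFIG: /tmp/gripprobe-home/.config/opencode/opencode.json
    volumes:
      - .:/work
      - ~/.continue:/tmp/gripprobe-home/.continue:ro
      - ~/.config/opencode:/tmp/gripprobe-home/.config/opencode:ro
      - ~/.config/gptme:/tmp/gripprobe-home/.config/gptme:ro
      - ~/.ssh:/tmp/gripprobe-home/.ssh:ro
```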
Examples:
```shell
OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run --shell gptme --model local/qwen2.5:7b --backend ollama --formats tool

OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run --shell opencode --model local/qwen2.5:7b --backend ollama --formats tool

OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run --shell continue-cli --model local/qwen2.5:7b --backend ollama --formats tool
```
GripProbe executes benchmarks as many short isolated case sessions, not as one long shared agent conversation.
Short form:
```
run()
  -> matrix point
    -> case workspace
      -> warmup subprocess
      -> measured subprocess
      -> validators
      -> case.json
```
In practice, one matrix point is usually two short shell sessions: warmup and measured, each with separate stdout and stderr logs.
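The two-session flow above can be sketched as follows. This is a conceptual illustration, not GripProbe's actual implementation; `run_case` and the log file names are hypothetical:

```python
import subprocess
from pathlib import Path

# Hedged sketch of one matrix point: two short subprocess sessions (warmup,
# then measured), each writing its own stdout and stderr logs into the
# per-case workspace.
def run_case(workspace: Path, argv: list[str], timeout: int = 300) -> None:
    for phase in ("warmup", "measured"):
        with open(workspace / f"{phase}.stdout.log", "w") as out, \
             open(workspace / f"{phase}.stderr.log", "w") as err:
            subprocess.run(argv, stdout=out, stderr=err,
                           cwd=workspace, timeout=timeout)
```

Keeping the warmup and measured runs as separate processes means the measured session starts from a cold agent state while model weights and caches are already warm.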
A live end-to-end test is available in tests/e2e/test_real_model.py.
It is opt-in and uses a real shell plus a real local model, without mocks.
Default target:
- shell: `gptme`
- model: `local/qwen2.5:7b`
- backend: `ollama`
- test: `shell_pwd`
- format: `markdown`

Run it explicitly:
```shell
GRIPPROBE_RUN_REAL_E2E=1 python -m pytest tests/e2e/test_real_model.py -q
```
You can override the target with environment variables such as `GRIPPROBE_REAL_MODEL`, `GRIPPROBE_REAL_SHELL`, `GRIPPROBE_REAL_BACKEND`, `GRIPPROBE_REAL_TIMEOUT_SECONDS`, and `GRIPPROBE_OPENAI_BASE_URL`. No local endpoint is stored in the repository. Shell binaries are expected to be available on `PATH`.
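The override scheme amounts to environment lookups with the defaults listed above. The sketch below is illustrative, not the test's actual code; in particular, the 300-second timeout default is an assumed placeholder, not a value taken from the repository:

```python
import os

# Hedged sketch: resolve the e2e target from the documented env vars,
# falling back to the documented default target. Function name is illustrative.
def real_target(env=os.environ) -> dict:
    return {
        "shell": env.get("GRIPPROBE_REAL_SHELL", "gptme"),
        "model": env.get("GRIPPROBE_REAL_MODEL", "local/qwen2.5:7b"),
        "backend": env.get("GRIPPROBE_REAL_BACKEND", "ollama"),
        # 300 is an assumed placeholder default, not from the repository
        "timeout": float(env.get("GRIPPROBE_REAL_TIMEOUT_SECONDS", "300")),
    }
```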
If a benchmark session crashes mid-run but the case artifacts already exist, you can rebuild summaries and HTML case pages from the run directory:
```shell
python -m gripprobe.cli rebuild-reports --run-dir results/runs/<run_id>
```
This command recreates `summary.md`, `summary.html`, and per-case HTML detail pages from the saved artifacts.
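Conceptually, the rebuild only needs the `case.json` artifacts already on disk. The sketch below illustrates that idea with hypothetical helper names; it is not GripProbe's actual report code:

```python
import json
from pathlib import Path

# Hedged sketch: walk a run directory, reload every saved case.json, and
# derive a status breakdown from those artifacts alone -- no live sessions.
def collect_cases(run_dir: Path) -> list[dict]:
    return [json.loads(p.read_text()) for p in sorted(run_dir.rglob("case.json"))]

def summarize(cases: list[dict]) -> dict:
    counts: dict[str, int] = {}
    for case in cases:
        status = case.get("status", "UNKNOWN")
        counts[status] = counts.get(status, 0) + 1
    return counts
```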
This file lists the primary run metadata keys used in reports.
Score: normalized weighted pass ratio across tests in a row. Sanity tests carry a lower weight (0.8) than non-sanity tests (1.0); the weighted ratio is then normalized back to 0..100%.

Typical Time: median measured time across representative results in the row.

Outliers: number of tests in the row whose representative time exceeds the baseline median for that test by a factor of 2.5, reported as count/total_tests_in_row.

Reports also include:

- case statuses (PASS, FAIL, TIMEOUT, NO_TOOL_CALL, TOOL_UNSUPPORTED, SHELL_ERROR, HARNESS_ERROR, SKIPPED)
- resume support (`--resume-suite` resumes by case key: shell+model+backend+format+test)
- run metadata (generated at, git commit, suite id, test tags, shell set, model set, format set, hardware profile id)
- test documentation (docs/tests.md) when available
- hardware profiles (specs/hardware_profiles.yaml) when available
- extra metadata (`--metadata key=value`)

Metadata keys:

- `hardware_profile_id`: profile id from `specs/hardware_profiles.yaml`. Used by aggregate HTML for hardware cards and row grouping.
- `suite`: optional marker for grouping related runs (for example: `aggregate_full_passed_matrix`).
- `run_note`: optional free-form label for experiment context.
- `shell_executable`
- `shell_executable_path` (sanitized to `$HOME/...`)
- `shell_version`
- `shell_version_exit_code`
- `ollama_context_length` (from `OLLAMA_CONTEXT_LENGTH`, if set)
- `ollama_num_parallel` (from `OLLAMA_NUM_PARALLEL`, if set)
- `ollama_flash_attention` (from `OLLAMA_FLASH_ATTENTION`, if set)
- `ollama_kv_cache_type` (from `OLLAMA_KV_CACHE_TYPE`, if set)
- `runtime_snapshots` (loadavg/meminfo/nvidia-smi and ollama `/api/ps` probe payloads)

Example:

```shell
python3 -m gripprobe.cli --root . run-suite \
  --suite default_cli_matrix \
  --metadata hardware_profile_id=unspecified
```
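The Score and Outliers definitions above can be sketched numerically. This is an illustrative reading of the description, not GripProbe's actual metric code; the function names are hypothetical:

```python
from statistics import median

# Hedged sketch of the row metrics: sanity tests weigh 0.8, others 1.0,
# and the weighted pass ratio is normalized back to 0..100%.
def row_score(tests) -> float:
    # tests: iterable of (passed: bool, is_sanity: bool)
    total = passed = 0.0
    for ok, sanity in tests:
        weight = 0.8 if sanity else 1.0
        total += weight
        passed += weight if ok else 0.0
    return 100.0 * passed / total if total else 0.0

def outliers(rep_times: dict, baselines: dict, factor: float = 2.5) -> int:
    # rep_times: test -> representative time; baselines: test -> baseline samples
    return sum(1 for test, t in rep_times.items()
               if t > factor * median(baselines[test]))
```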
- `--resume-suite` works per case (shell+model+backend+format+test) and skips already completed cases found in `results/runs/...`.
- `default_cli_matrix` is sanity-first and currently runs with `formats: tool`.
- Aggregated outputs live under `results/aggregate/...`; keep `results/runs/...` as internal diagnostic artifacts.
- `gptme` and `cn` are resolved from `PATH`.
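The resume rule can be sketched as set subtraction on case keys. This is a conceptual illustration with hypothetical names, not GripProbe's actual resume logic:

```python
# Hedged sketch: a case is identified by shell+model+backend+format+test,
# and resuming simply skips keys already present among completed cases.
def case_key(shell: str, model: str, backend: str, fmt: str, test: str) -> tuple:
    return (shell, model, backend, fmt, test)

def pending(all_cases: list[tuple], completed_keys: set) -> list[tuple]:
    return [c for c in all_cases if case_key(*c) not in completed_keys]
```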
For continue-cli, provide the config path from the outside when needed:
```shell
GRIPPROBE_CONTINUE_CONFIG=/path/to/config.yaml python -m gripprobe.cli --root . run --shell continue-cli --model local/qwen2.5:7b --backend ollama
```