GripProbe Reports

Usage

Quick Start

python -m gripprobe.cli --root . validate
python -m gripprobe.cli --root . run --shell gptme --model local/qwen2.5:7b --backend ollama

--backend is chosen at runtime and defaults to ollama. Passing it explicitly avoids an ambiguous backend choice when a model spec defines multiple backends.
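
For example, because ollama is the default, the quick-start run above can omit the flag and still use the same backend:

python -m gripprobe.cli --root . run --shell gptme --model local/qwen2.5:7b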

Docker

You can run GripProbe itself inside Docker while keeping Ollama outside the container.

Build:

docker compose build

Validate:

docker compose run --rm gripprobe python3 -m gripprobe.cli --root . validate

Run the default suite against an external Ollama endpoint:

OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run-suite

If runtime probes should be collected from the Ollama host over SSH, also set:

GRIPPROBE_OLLAMA_SSH_TARGET=ollama-host
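
Combining the two variables, a suite run that also collects SSH runtime probes might look like this (ollama-host is a placeholder, and this assumes the compose service passes both variables through to the container):

OLLAMA_HOST=http://ollama-host:11434 GRIPPROBE_OLLAMA_SSH_TARGET=ollama-host \
  docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run-suite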

The compose file mounts:

The compose service also exports:

By default, the service runs as ${UID}:${GID} (falling back to 1000:1000 when those variables are unset), so files written to the mounted results/ directory are owned by your host user.
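
If UID and GID are not exported by your shell (bash, for instance, defines UID as a read-only, unexported variable), one way to pass them through is with env:

env UID=$(id -u) GID=$(id -g) docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . validate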

Examples:

OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run --shell gptme --model local/qwen2.5:7b --backend ollama --formats tool
OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run --shell opencode --model local/qwen2.5:7b --backend ollama --formats tool
OLLAMA_HOST=http://ollama-host:11434 docker compose run --rm gripprobe \
  python3 -m gripprobe.cli --root . run --shell continue-cli --model local/qwen2.5:7b --backend ollama --formats tool
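
Since results/ is mounted from the host, run output can be inspected directly once the container exits (assuming the per-run layout under results/runs/ described later for rebuild-reports):

ls results/runs/                # one directory per run
ls results/runs/<run_id>/       # summaries and per-case artifacts for that run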

Execution Model

GripProbe executes benchmarks as many short, isolated case sessions rather than as one long shared agent conversation.

Short form:

run()
  -> matrix point
  -> case workspace
  -> warmup subprocess
  -> measured subprocess
  -> validators
  -> case.json

In practice, one matrix point is usually two short shell sessions: warmup and measured, each with separate stdout and stderr logs.
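
As a rough illustration, the artifacts for one case might look like the sketch below; the exact file names are hypothetical, so check a real run directory for the actual layout:

results/runs/<run_id>/<case>/
  warmup.stdout.log        # warmup subprocess output
  warmup.stderr.log
  measured.stdout.log      # measured subprocess output
  measured.stderr.log
  case.json                # per-case record written after the validators run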

Real E2E Test

A live end-to-end test is available in tests/e2e/test_real_model.py. It is opt-in and uses a real shell plus a real local model, without mocks.

Default target:

Run it explicitly:

GRIPPROBE_RUN_REAL_E2E=1 python -m pytest tests/e2e/test_real_model.py -q

You can override the target with environment variables such as GRIPPROBE_REAL_MODEL, GRIPPROBE_REAL_SHELL, GRIPPROBE_REAL_BACKEND, GRIPPROBE_REAL_TIMEOUT_SECONDS, and GRIPPROBE_OPENAI_BASE_URL. No local endpoint is stored in the repository. Shell binaries are expected to be available on PATH.
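
For example, to point the test at a specific shell, model, and backend (the values below are placeholders, not defaults):

GRIPPROBE_RUN_REAL_E2E=1 \
GRIPPROBE_REAL_SHELL=gptme \
GRIPPROBE_REAL_MODEL=local/qwen2.5:7b \
GRIPPROBE_REAL_BACKEND=ollama \
GRIPPROBE_REAL_TIMEOUT_SECONDS=300 \
python -m pytest tests/e2e/test_real_model.py -q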

If a benchmark session crashes mid-run but the case artifacts already exist, you can rebuild summaries and HTML case pages from the run directory:

python -m gripprobe.cli rebuild-reports --run-dir results/runs/<run_id>

This command recreates summary.md, summary.html, and per-case HTML detail pages from the saved artifacts.
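
For example, assuming each run has its own directory under results/runs/, the most recent run can be rebuilt with:

python -m gripprobe.cli rebuild-reports --run-dir "results/runs/$(ls -t results/runs | head -n 1)"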

This section lists the primary run metadata keys used in reports.

Aggregate report metrics

Aggregate report methodology and reproducibility

User-provided keys (--metadata key=value)

Automatically captured runtime keys

python3 -m gripprobe.cli --root . run-suite \
  --suite default_cli_matrix \
  --metadata hardware_profile_id=unspecified

Current Notes

Shell Configuration

The gptme and cn binaries are resolved from PATH.
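
A quick way to confirm both binaries are visible before starting a run:

command -v gptme cn    # prints a path for each binary found on PATH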

For continue-cli, provide the config path via an environment variable when needed:

GRIPPROBE_CONTINUE_CONFIG=/path/to/config.yaml python -m gripprobe.cli --root . run --shell continue-cli --model local/qwen2.5:7b --backend ollama