Open Source · Apache-2.0

Prompt version control for LLM engineers

Every prompt change gets a SHA, a parent, and a commit message. Every eval run is persistent. Compare any two versions. Gate your CI pipeline on pass rate.

$ pip install pressmark
View on GitHub →
pressmark — Prompt: sentiment-classifier
Navigation
Prompts
Datasets
Eval Runs
Versions
v3 a3f91b
v2 7c2d18
v1 2f8a44
sentiment-classifier  ·  v2 — 7c2d18
stable passing
SYSTEM
You are a sentiment classifier. Classify the input text as
POSITIVE, NEGATIVE, or NEUTRAL. Reply with the label only.

USER
{{text}}

# model_params
temperature: 0.0   max_tokens: 8   model: gpt-4o-mini
EVAL PASS RATE
92% 48 / 52 rows

What it does
Built for prompt engineers who ship

Not another playground. Pressmark plugs into how you already work — CLI, CI, and a zero-config web UI for your team.

01
Content-addressed versioning
Every commit produces a 12-char SHA from its exact content. Identical prompts get the same SHA — no duplicate storage, no false diffs.
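One way such a content address could be derived (a sketch, not Pressmark's actual implementation): serialize the prompt to canonical JSON so key order can't change the hash, then truncate a SHA-256 digest.

```python
import hashlib
import json

def content_sha(system: str, user: str, model: str, params: dict) -> str:
    """Derive a 12-char content address from a prompt's exact content.
    Canonical JSON (sorted keys, fixed separators) guarantees that
    byte-identical prompts always hash to the same SHA."""
    canonical = json.dumps(
        {"system": system, "user": user, "model": model, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Because the hash is deterministic, committing the same prompt twice yields the same SHA, which is what makes deduplicated storage and "no false diffs" fall out for free.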
02
Unified diffs
Compare any two versions with a colorized unified diff — additions, removals, and context. Available in the CLI and side-by-side in the web UI.
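The shape of such a diff can be sketched with Python's standard-library `difflib` (illustrative only; the labels here are made up):

```python
import difflib

def diff_versions(old: str, new: str, old_label: str, new_label: str) -> str:
    """Produce a unified diff between two prompt texts, with the
    familiar ---/+++ header and -/+ change lines."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(lines)
```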
03
Dataset management
Import JSONL files or create rows manually. One dataset can run against multiple prompt versions. Results link back to the exact row and version.
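The import itself is simple in principle, a sketch (function name and return shape are assumptions, not Pressmark's API):

```python
import json
from io import StringIO

def import_jsonl(fp):
    """Parse a JSONL dataset: one JSON object per non-empty line.
    Each row's keys map to template variables; an 'expected' key
    supplies ground truth for scorers that need one."""
    rows = []
    for line in fp:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows

sample = StringIO(
    '{"text": "I loved it", "expected": "POSITIVE"}\n'
    '{"text": "Meh.", "expected": "NEUTRAL"}\n'
)
rows = import_jsonl(sample)
```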
04
Async eval runner
Configurable concurrency with asyncio.Semaphore. Each row result is written immediately — a crash mid-run doesn't lose completed work.
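The pattern described can be sketched like this (`call_model` and `save_result` are hypothetical stand-ins for the model call and the database write):

```python
import asyncio

async def run_eval(rows, call_model, save_result, concurrency=5):
    """Bounded-concurrency eval loop: at most `concurrency` model calls
    in flight at once; each row's result is persisted as soon as it
    finishes, so a crash mid-run keeps all completed work."""
    sem = asyncio.Semaphore(concurrency)

    async def one(row):
        async with sem:
            output = await call_model(row)
        save_result(row, output)  # write immediately, not at the end

    await asyncio.gather(*(one(r) for r in rows))

# Minimal demo with a stubbed model call:
results = []

async def fake_model(row):
    await asyncio.sleep(0)
    return row["text"].upper()

asyncio.run(run_eval(
    [{"text": "good"}, {"text": "bad"}],
    fake_model,
    lambda row, out: results.append((row["text"], out)),
    concurrency=2,
))
```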
05
Eval comparison
Compare two eval runs: pass-rate delta, per-scorer breakdown, Chart.js bar chart. Rows are flagged as gained or regressed for instant triage.
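In essence the comparison is set arithmetic over per-row pass/fail results, a sketch (the dict-of-booleans shape is an assumption):

```python
def compare_runs(baseline: dict, candidate: dict):
    """Compare two eval runs keyed by row id -> passed (bool).
    Returns the pass-rate delta plus the rows that flipped either way."""
    delta = (sum(candidate.values()) / len(candidate)
             - sum(baseline.values()) / len(baseline))
    gained = [k for k in candidate if candidate[k] and not baseline.get(k)]
    regressed = [k for k in candidate if not candidate[k] and baseline.get(k)]
    return delta, gained, regressed
```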
06
Zero-config web UI
FastAPI + Jinja2 + HTMX. No npm, no webpack, no build step. SSE streams eval progress live. Works from a single pip install.

Evaluation
Five built-in scorers

Configure per-eval in JSON or TOML. Mix and match — an eval run can use multiple scorers, and each is aggregated independently.

Type | Config keys | How it scores | Output
--- | --- | --- | ---
exact_match | case_sensitive | Strip whitespace, compare to expected | 0 or 1
contains | substring · all_of · any_of | Substring presence check, case-folded | 0 or 1
regex_match | pattern · flags | re.search on output | 0 or 1
llm_judge | criteria · model · threshold | LLM grades output against criteria 0–10 | 0.0 – 1.0
semantic_sim | threshold · model | Cosine similarity vs expected embedding | 0.0 – 1.0
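The three binary scorers in the table reduce to one-liners; a sketch of their semantics (plain functions for illustration, not Pressmark's config-driven scorer objects):

```python
import re

def exact_match(output: str, expected: str, case_sensitive: bool = False) -> int:
    """Strip whitespace, compare to expected -> 0 or 1."""
    a, b = output.strip(), expected.strip()
    if not case_sensitive:
        a, b = a.casefold(), b.casefold()
    return 1 if a == b else 0

def contains(output: str, substring: str) -> int:
    """Case-folded substring presence check -> 0 or 1."""
    return 1 if substring.casefold() in output.casefold() else 0

def regex_match(output: str, pattern: str, flags: int = 0) -> int:
    """re.search on the output -> 0 or 1."""
    return 1 if re.search(pattern, output, flags) else 0
```

llm_judge and semantic_sim differ in that they produce fractional scores (a 0–10 grade normalized to 0.0–1.0, and a cosine similarity respectively), so they need a threshold to decide pass/fail.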

Get started
From install to first eval in minutes
1
Install and initialize
Creates a SQLite database at ~/.pressmark/pressmark.db. No setup wizard, no migrations to run.
2
Commit your first prompt
System prompt, user template with {{variable}} slots, model, and params. Produces a SHA and version number.
3
Import a dataset
JSONL with one row per line. Each row's keys map to template variables. The expected key feeds scorers that need a ground truth.
4
Run an eval
Results stream to the terminal with a Rich progress bar. Pass --min-pass-rate to exit non-zero in CI when quality drops.
5
Open the web UI
Browse prompt history, run comparisons, inspect per-row results, and diff versions in the browser.
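A dataset file like the data.jsonl imported in step 3 might contain rows of this shape (the values are illustrative; only the text and expected keys come from the steps above):

```jsonl
{"text": "Absolutely loved the battery life.", "expected": "POSITIVE"}
{"text": "The screen cracked within a week.", "expected": "NEGATIVE"}
{"text": "It arrived on Tuesday.", "expected": "NEUTRAL"}
```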
terminal
$ pip install pressmark

$ pressmark init
✓ Database created at ~/.pressmark/pressmark.db

$ pressmark prompt commit sentiment \
    --system "Classify as POSITIVE, NEGATIVE, or NEUTRAL." \
    --user "{{text}}" \
    --model openai/gpt-4o-mini \
    --message "initial version"
✓ v1 · sha: 2f8a44

$ pressmark dataset import sentiment-test data.jsonl
✓ 52 rows imported

$ pressmark eval run sentiment sentiment-test \
    --scorer '{"type": "exact_match"}' \
    --min-pass-rate 0.90
Running eval  ━━━━━━━━━━━━━━━━━━━━ 100%  52/52
Pass rate: 92.3% ✓

$ pressmark ui
→ http://127.0.0.1:7820
CI pipeline integration
Drop one command into your GitHub Actions workflow. Exits with code 1 when pass rate drops below your threshold — no post-processing needed.
pressmark eval run … --min-pass-rate 0.90
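In a GitHub Actions workflow that could look like the following (a sketch; the step name and secret name are placeholders, not required by Pressmark):

```yaml
- name: Prompt quality gate
  env:
    OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
  run: |
    pip install pressmark
    pressmark eval run sentiment sentiment-test \
      --scorer '{"type": "exact_match"}' \
      --min-pass-rate 0.90
```

Because the command exits non-zero below the threshold, the job fails and blocks the merge with no extra scripting.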

Configuration
One file, no surprises

Copy pressmark.example.toml to pressmark.toml or use environment variables. Both work.

pressmark.toml
[pressmark]
db_path          = "~/.pressmark/pressmark.db"
default_model    = "openai/gpt-4o-mini"
eval_concurrency = 5

[openrouter]
api_key          = "sk-or-v1-..."

[web]
host             = "127.0.0.1"
port             = 7820
environment variables
# Same settings, env-var style

PRESSMARK_DB_PATH=~/.pressmark/pressmark.db
PRESSMARK_DEFAULT_MODEL=openai/gpt-4o-mini
PRESSMARK_EVAL_CONCURRENCY=5

OPENROUTER_API_KEY=sk-or-v1-...

PRESSMARK_HOST=127.0.0.1
PRESSMARK_PORT=7820