```
agentmark run-experiment <filepath> [options]

Options:
  --server <url>         Webhook server URL (default: http://localhost:9417)
  --skip-eval            Skip running evals even if they exist
  --format <format>      Output format: table, csv, json, or jsonl (default: table)
  --threshold <percent>  Fail if pass percentage is below threshold (0-100)

Dataset Sampling (pick at most one):
  --sample <percent>     Run on a random N% of rows (1-100)
  --rows <spec>          Select specific rows by index or range (e.g., 0,3-5,9)
  --split <spec>         Train/test split (e.g., train:80 or test:80)
  --seed <number>        Seed for reproducible sampling/splitting
```
The `--server` flag defaults to the `AGENTMARK_WEBHOOK_URL` environment variable if set, otherwise `http://localhost:9417`.
Exits with a non-zero code if the pass rate falls below the threshold. Requires evaluations that return a `passed` field. The sampling flags are covered under Dataset Sampling below.
Run experiments on a subset of your dataset without modifying the dataset file. The three sampling modes are mutually exclusive: use only one per run.

Random sample (`--sample <percent>`): Run on a random N% of rows. Useful for quick smoke tests against large datasets.
```
# Run on ~20% of rows (random, non-reproducible)
agentmark run-experiment agentmark/test.prompt.mdx --sample 20

# Reproducible: same 20% every time
agentmark run-experiment agentmark/test.prompt.mdx --sample 20 --seed 42
```
Specific rows (`--rows <spec>`): Select individual rows by zero-based index. Supports comma-separated indices and ranges (e.g., `0,3-5,9`).
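For example, `agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9` runs rows 0, 3 through 5, and 9. The spec format can be pictured as a small parser (an illustrative helper, not the CLI's actual implementation):

```typescript
// Parse a --rows spec such as "0,3-5,9" into sorted zero-based indices.
// Illustrative sketch only; the real CLI parser may differ.
function parseRowsSpec(spec: string): number[] {
  const indices = new Set<number>();
  for (const part of spec.split(',')) {
    const match = part.trim().match(/^(\d+)(?:-(\d+))?$/);
    if (!match) throw new Error(`Invalid rows spec: ${part}`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    for (let i = start; i <= end; i++) indices.add(i);
  }
  return [...indices].sort((a, b) => a - b);
}

console.log(parseRowsSpec('0,3-5,9')); // selects rows 0, 3, 4, 5, 9
```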
Train/test split (`--split <spec>`): Split the dataset into train and test portions, then run only the train portion or only the test portion.
```
# Run on the first 80% (train portion), positional split
agentmark run-experiment agentmark/test.prompt.mdx --split train:80

# Run on the remaining 20% (test portion), positional split
agentmark run-experiment agentmark/test.prompt.mdx --split test:80

# Seeded split: random assignment, reproducible across runs
agentmark run-experiment agentmark/test.prompt.mdx --split train:80 --seed 42
agentmark run-experiment agentmark/test.prompt.mdx --split test:80 --seed 42
```
Without `--seed`, `--split` uses positional assignment: the first N% of rows are "train" and the rest are "test". With `--seed`, each row is assigned to train or test by a deterministic hash, so the order of rows in the file does not matter.
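One way such a hash-based assignment can work (an illustrative sketch using an FNV-style hash; the CLI's actual hash function is not documented here):

```typescript
// Hash the row's content together with the seed; the hash (mod 100)
// decides the bucket. The same row with the same seed always lands in
// the same bucket, no matter where it appears in the file.
function hashRow(row: string, seed: number): number {
  let h = (2166136261 ^ seed) >>> 0; // FNV-1a style, seeded
  for (let i = 0; i < row.length; i++) {
    h = Math.imul(h ^ row.charCodeAt(i), 16777619) >>> 0;
  }
  return h;
}

function assignSplit(row: string, seed: number, trainPercent: number): 'train' | 'test' {
  return hashRow(row, seed) % 100 < trainPercent ? 'train' : 'test';
}
```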
Reproducibility with `--seed`: The `--seed` flag guarantees that the same rows are selected every time, in both the TypeScript and Python runtimes. Pass the same seed to get identical results on any machine or language runtime.
```
# These two runs always process the exact same rows
agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
```
Use `--seed` in CI/CD pipelines to prevent flaky results from random row selection.
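Reproducible sampling can be pictured as a seeded pseudo-random draw per row (a sketch using the well-known mulberry32 PRNG; not the CLI's documented algorithm):

```typescript
// mulberry32: a tiny deterministic PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Keep roughly `percent`% of rows; the same seed always keeps the same rows.
function sampleRows<T>(rows: T[], percent: number, seed: number): T[] {
  const rand = mulberry32(seed);
  return rows.filter(() => rand() * 100 < percent);
}
```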
The CLI supports both `.mdx` source files and pre-built `.json` files (from `agentmark build`). Media outputs (images, audio) are saved to `.agentmark-outputs/` with clickable file paths.
1. Develop prompts - Iterate on your prompt design
2. Create datasets - Add test cases covering your scenarios
3. Write evaluations - Define success criteria
4. Run experiments - Test against your dataset
Run experiments programmatically using `formatWithDataset()`:
```typescript
import { client } from './agentmark-client';
import { generateText } from 'ai'; // Or your adapter's generation function

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

// Returns a stream of formatted inputs from the dataset
const datasetStream = await prompt.formatWithDataset();

// Process each test case
for await (const item of datasetStream) {
  const { dataset, formatted, evals } = item;

  // Run the prompt with your AI SDK
  const result = await generateText(formatted);

  // Check results
  const passed = result.text === dataset.expected_output;
  console.log(`Input: ${JSON.stringify(dataset.input)}`);
  console.log(`Expected: ${dataset.expected_output}`);
  console.log(`Got: ${result.text}`);
  console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
}
```
The stream returns objects with:
- `dataset` - The test case (`input` and `expected_output`)
- `formatted` - The formatted prompt, ready for your AI SDK
- `evals` - List of evaluation names to run
- `type` - Always `"dataset"`
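Put together, each item can be described with a TypeScript type like the following (field names from the list above; the concrete types are assumptions, since the real types come from your adapter):

```typescript
// Assumed shape of one streamed item; `formatted` is adapter-specific.
type DatasetStreamItem = {
  type: 'dataset';
  dataset: { input: unknown; expected_output: unknown };
  formatted: unknown; // ready to pass to your AI SDK's generate function
  evals: string[];    // names of evaluations to run on the result
};
```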
Options (`FormatWithDatasetOptions`):

- `datasetPath?: string` - Override the dataset from frontmatter
- `format?: 'ndjson' | 'json'` - Buffer all rows (`'json'`) or stream rows as they become available (`'ndjson'`, the default)