```
datasets/
├── development.jsonl   # Use during iteration (60-70%)
├── validation.jsonl    # Check progress periodically (15-20%)
└── held-out.jsonl      # Final test before production (15-20%)
```
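If it helps to see the split made concrete, here is a minimal sketch of producing it, assuming Node.js, a simple array of test cases, and the file names above; the helper names and the exact 70/15/15 ratios are illustrative, not prescriptive:

```javascript
// Sketch only: shuffle once, then write a 70/15/15 split to the three JSONL files.
const fs = require("fs");

// Fisher-Yates shuffle so the split is random but stable once written to disk
function shuffle(cases) {
  const arr = [...cases];
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

function writeJsonl(path, cases) {
  fs.writeFileSync(path, cases.map(c => JSON.stringify(c)).join("\n") + "\n");
}

function splitDataset(allCases) {
  const shuffled = shuffle(allCases);
  const devEnd = Math.floor(shuffled.length * 0.70);
  const valEnd = Math.floor(shuffled.length * 0.85);

  writeJsonl("datasets/development.jsonl", shuffled.slice(0, devEnd));
  writeJsonl("datasets/validation.jsonl", shuffled.slice(devEnd, valEnd));
  writeJsonl("datasets/held-out.jsonl", shuffled.slice(valEnd));
}
```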
Critical rules:
Never iterate on held-out data
Don’t peek at held-out results during development
If you look at held-out results and make changes, create a new held-out set
Example workflow:
```
Week 1-2: Iterate on development set
├─ Test prompt v1 → 75% pass rate
└─ Test prompt v2 → 82% pass rate

Week 3: Check validation set
└─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

Before deploy: Test held-out set
└─ Test prompt v3 → 81% pass rate
   → Deploy if meets requirements
```
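This workflow assumes some harness that runs a prompt version against one of the dataset files and reports a pass rate. A minimal sketch of what that might look like; `evaluateCase` is a placeholder for however you judge a single output (exact match, rubric, LLM grader, etc.), and is not defined here:

```javascript
// Sketch of the evaluation loop assumed by the workflow above.
const fs = require("fs");

function loadJsonl(path) {
  return fs.readFileSync(path, "utf8").trim().split("\n").map(line => JSON.parse(line));
}

async function passRate(promptVersion, datasetPath) {
  const cases = loadJsonl(datasetPath);
  let passed = 0;
  for (const tc of cases) {
    // evaluateCase() is your own pass/fail judgment for one test case
    if (await evaluateCase(promptVersion, tc)) passed++;
  }
  return passed / cases.length;
}

// Weeks 1-2: iterate against the development set only, e.g.
// console.log(await passRate("v2", "datasets/development.jsonl"));
```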
Statistical rigor: ~250 cases (95% confidence, 5% margin of error)
Production deployment: 100-300 cases minimum
High-stakes systems: 300+ cases
Why size matters: With 10 cases, a single failure moves the pass rate by 10 percentage points; with 100 cases, by only 1. Research shows that datasets with N ≤ 300 often overestimate performance.
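As a quick sanity check on those numbers, the standard normal-approximation margin of error at 95% confidence is roughly 1.96 × √(p(1−p)/n); a small sketch (the function name is just for illustration):

```javascript
// Normal-approximation margin of error for an observed pass rate.
// With p = 0.80 and n = 250 this gives roughly ±5%, matching the "~250 cases" guidance above.
function marginOfError(passRate, n, z = 1.96) {
  return z * Math.sqrt(passRate * (1 - passRate) / n);
}

console.log(marginOfError(0.80, 10));   // ~0.25 -> one failure swings the result wildly
console.log(marginOfError(0.80, 250));  // ~0.05 -> 5% margin of error
```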
Confidence intervals - Report uncertainty:

✅ “Pass rate: 85% [CI: 77%-91%]”
❌ “Pass rate: 85%”
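One common way to get such an interval for a pass/fail metric is the Wilson score interval; a minimal sketch (the function name is ours, not from the original):

```javascript
// Wilson score interval for a pass rate (95% confidence by default).
// For 85 passes out of 100 this gives roughly [77%, 91%], as in the example above.
function wilsonInterval(passed, total, z = 1.96) {
  const p = passed / total;
  const denom = 1 + (z * z) / total;
  const center = (p + (z * z) / (2 * total)) / denom;
  const spread = (z * Math.sqrt(p * (1 - p) / total + (z * z) / (4 * total * total))) / denom;
  return [center - spread, center + spread];
}

console.log(wilsonInterval(85, 100)); // ~[0.767, 0.907]
```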
Comparing prompts - Use paired comparisons on the same dataset:

```javascript
// For each test case, record whether the new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});

const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests a real improvement
```
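If you want something firmer than that rule of thumb, a sign test on the discordant cases (those where exactly one of the two prompts passed) gives an approximate two-sided p-value; a sketch building on the `improvements` array above:

```javascript
// Iterative binomial coefficient, adequate for a few hundred test cases
function binomialCoefficient(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// Sign test on discordant pairs: under the null hypothesis that neither prompt
// is better, each discordant case favors either prompt with probability 0.5.
function signTestPValue(newWins, oldWins) {
  const n = newWins + oldWins;
  const k = Math.max(newWins, oldWins);
  let tail = 0;
  for (let i = k; i <= n; i++) tail += binomialCoefficient(n, i) * Math.pow(0.5, n);
  return Math.min(1, 2 * tail); // two-sided, capped at 1
}

const newWins = improvements.filter(x => x === 1).length;  // new passed, old failed
const oldWins = improvements.filter(x => x === -1).length; // old passed, new failed
console.log(signTestPValue(newWins, oldWins)); // p < 0.05 suggests a real difference
```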
Power analysis - Determine how many samples you need before creating your dataset.

Power analysis answers: “How many test cases do I need to reliably detect a meaningful improvement?”

Key parameters:
Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
Significance level (α): Probability of false positive (typically 0.05 = 5%)
Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
```
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
```
Practical rules of thumb:
| Minimum detectable difference | Required sample size (per group) |
|---|---|
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |
Why this matters: If you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size based on the smallest improvement that matters to your application.

Practical approach:
```javascript
// 1. Define the minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate the required sample size
const alpha = 0.05;        // 5% false positive rate
const power = 0.80;        // 80% chance to detect a real improvement
const baselineRate = 0.80; // current pass rate

const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments
```
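`calculateSampleSize` isn't defined in the snippet above; one possible implementation, following the simplified two-proportion formula from earlier (the z-value lookup only covers a couple of common α and power choices, a real implementation would use an inverse-normal function):

```javascript
// Possible implementation of calculateSampleSize(), based on
// n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²
function calculateSampleSize(alpha, power, baselineRate, minImprovement) {
  const zAlpha = { 0.05: 1.96, 0.01: 2.576 }[alpha]; // two-sided significance level
  const zBeta = { 0.80: 0.84, 0.90: 1.28 }[power];   // statistical power
  const p = baselineRate;
  const n = Math.pow(zAlpha + zBeta, 2) * 2 * p * (1 - p) / Math.pow(minImprovement, 2);
  return Math.ceil(n);
}

console.log(calculateSampleSize(0.05, 0.80, 0.80, 0.05)); // ~1,004, matching the worked example
```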