```
datasets/
├── development.jsonl   # Use during iteration (60-70%)
├── validation.jsonl    # Check progress periodically (15-20%)
└── held-out.jsonl      # Final test before production (15-20%)
```
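If it helps to see the split made concrete, here is a minimal sketch of producing it, assuming Node.js, a simple array of test cases, and the file names above; the helper names and the exact 70/15/15 ratios are illustrative, not prescriptive:

```javascript
// Sketch only: shuffle once, then write a 70/15/15 split to the three JSONL files.
const fs = require("fs");

// Fisher-Yates shuffle so the split is random but stable once written to disk
function shuffle(cases) {
  const arr = [...cases];
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

function writeJsonl(path, cases) {
  fs.writeFileSync(path, cases.map(c => JSON.stringify(c)).join("\n") + "\n");
}

function splitDataset(allCases) {
  const shuffled = shuffle(allCases);
  const devEnd = Math.floor(shuffled.length * 0.70);
  const valEnd = Math.floor(shuffled.length * 0.85);

  writeJsonl("datasets/development.jsonl", shuffled.slice(0, devEnd));
  writeJsonl("datasets/validation.jsonl", shuffled.slice(devEnd, valEnd));
  writeJsonl("datasets/held-out.jsonl", shuffled.slice(valEnd));
}
```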
Critical rules:
Never iterate on held-out data
Don’t peek at held-out results during development
If you look at held-out results and make changes, create a new held-out set
Example workflow:
```
Week 1-2: Iterate on development set
├─ Test prompt v1 → 75% pass rate
└─ Test prompt v2 → 82% pass rate

Week 3: Check validation set
└─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

Before deploy: Test held-out set
└─ Test prompt v3 → 81% pass rate
   → Deploy if meets requirements
```
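This workflow assumes some harness that runs a prompt version against one of the dataset files and reports a pass rate. A minimal sketch of what that might look like; `evaluateCase` is a placeholder for however you judge a single output (exact match, rubric, LLM grader, etc.), and is not defined here:

```javascript
// Sketch of the evaluation loop assumed by the workflow above.
const fs = require("fs");

function loadJsonl(path) {
  return fs.readFileSync(path, "utf8").trim().split("\n").map(line => JSON.parse(line));
}

async function passRate(promptVersion, datasetPath) {
  const cases = loadJsonl(datasetPath);
  let passed = 0;
  for (const tc of cases) {
    // evaluateCase() is your own pass/fail judgment for one test case
    if (await evaluateCase(promptVersion, tc)) passed++;
  }
  return passed / cases.length;
}

// Weeks 1-2: iterate against the development set only, e.g.
// console.log(await passRate("v2", "datasets/development.jsonl"));
```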
Statistical rigor: ~250 cases (95% confidence, 5% margin of error)
Production deployment: 100-300 cases minimum
High-stakes systems: 300+ cases
Why size matters: With 10 cases, a single failure moves the pass rate by 10 percentage points; with 100 cases, by only 1. Research shows that datasets with N ≤ 300 often overestimate performance.
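As a quick sanity check on those numbers, the standard normal-approximation margin of error at 95% confidence is roughly 1.96 × √(p(1−p)/n); a small sketch (the function name is just for illustration):

```javascript
// Normal-approximation margin of error for an observed pass rate.
// With p = 0.80 and n = 250 this gives roughly ±5%, matching the "~250 cases" guidance above.
function marginOfError(passRate, n, z = 1.96) {
  return z * Math.sqrt(passRate * (1 - passRate) / n);
}

console.log(marginOfError(0.80, 10));   // ~0.25 -> one failure swings the result wildly
console.log(marginOfError(0.80, 250));  // ~0.05 -> 5% margin of error
```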
Confidence intervals - Report uncertainty:

✅ “Pass rate: 85% [CI: 77%-91%]”
❌ “Pass rate: 85%”
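One common way to get such an interval for a pass/fail metric is the Wilson score interval; a minimal sketch (the function name is ours, not from the original):

```javascript
// Wilson score interval for a pass rate (95% confidence by default).
// For 85 passes out of 100 this gives roughly [77%, 91%], as in the example above.
function wilsonInterval(passed, total, z = 1.96) {
  const p = passed / total;
  const denom = 1 + (z * z) / total;
  const center = (p + (z * z) / (2 * total)) / denom;
  const spread = (z * Math.sqrt(p * (1 - p) / total + (z * z) / (4 * total * total))) / denom;
  return [center - spread, center + spread];
}

console.log(wilsonInterval(85, 100)); // ~[0.767, 0.907]
```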
Comparing prompts - Use paired comparisons on the same dataset:

```javascript
// For each test case, record whether the new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});

const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests a real improvement
```
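If you want something firmer than that rule of thumb, a sign test on the discordant cases (those where exactly one of the two prompts passed) gives an approximate two-sided p-value; a sketch building on the `improvements` array above:

```javascript
// Iterative binomial coefficient, adequate for a few hundred test cases
function binomialCoefficient(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// Sign test on discordant pairs: under the null hypothesis that neither prompt
// is better, each discordant case favors either prompt with probability 0.5.
function signTestPValue(newWins, oldWins) {
  const n = newWins + oldWins;
  const k = Math.max(newWins, oldWins);
  let tail = 0;
  for (let i = k; i <= n; i++) tail += binomialCoefficient(n, i) * Math.pow(0.5, n);
  return Math.min(1, 2 * tail); // two-sided, capped at 1
}

const newWins = improvements.filter(x => x === 1).length;  // new passed, old failed
const oldWins = improvements.filter(x => x === -1).length; // old passed, new failed
console.log(signTestPValue(newWins, oldWins)); // p < 0.05 suggests a real difference
```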
Power analysis - Determine how many samples you need before creating your dataset.

Power analysis answers: “How many test cases do I need to reliably detect a meaningful improvement?”

Key parameters:
Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
Significance level (α): Probability of false positive (typically 0.05 = 5%)
Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
```
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
```
Practical rules of thumb:
| Minimum detectable difference | Required sample size (per group) |
|---|---|
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |
Why this matters: If you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size based on the smallest improvement that matters to your application.

Practical approach:
```javascript
// 1. Define the minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate the required sample size
const alpha = 0.05;        // 5% false positive rate
const power = 0.80;        // 80% chance to detect a real improvement
const baselineRate = 0.80; // current pass rate

const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments
```
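`calculateSampleSize` isn't defined in the snippet above; one possible implementation, following the simplified two-proportion formula from earlier (the z-value lookup only covers a couple of common α and power choices, a real implementation would use an inverse-normal function):

```javascript
// Possible implementation of calculateSampleSize(), based on
// n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²
function calculateSampleSize(alpha, power, baselineRate, minImprovement) {
  const zAlpha = { 0.05: 1.96, 0.01: 2.576 }[alpha]; // two-sided significance level
  const zBeta = { 0.80: 0.84, 0.90: 1.28 }[power];   // statistical power
  const p = baselineRate;
  const n = Math.pow(zAlpha + zBeta, 2) * 2 * p * (1 - p) / Math.pow(minImprovement, 2);
  return Math.ceil(n);
}

console.log(calculateSampleSize(0.05, 0.80, 0.80, 0.05)); // ~1,004, matching the worked example
```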