
Testing in AgentMark

AgentMark provides robust testing capabilities to help you validate and improve your prompts through:
  • Datasets: Test prompts against diverse inputs with known expected outputs
  • LLM as Judge Evaluations: Automated quality assessment of prompt outputs using language models
  • Annotations: Manual labeling and scoring of traces for human-in-the-loop evaluation

Datasets

Datasets enable bulk testing of prompts against a collection of input/output pairs. This allows you to:
  • Validate prompt behavior across many test cases
  • Ensure consistency of outputs
  • Catch regressions when modifying prompts
  • Generate performance metrics
Each dataset item contains an input to test, along with its expected output for comparison. You can create and manage datasets through the UI or as JSON files.
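For illustration, a single dataset item could be modeled as shown below. This is only a TypeScript sketch: the field names (input, expectedOutput) and the overall shape are assumptions, not AgentMark's documented JSON schema.

    // Hypothetical shape of one dataset item; field names are illustrative
    // and may differ from AgentMark's actual JSON format.
    interface DatasetItem {
      input: Record<string, unknown>; // values passed to the prompt under test
      expectedOutput: string;         // the output to compare the run against
    }

    // A tiny sentiment-classification dataset as it might appear in a JSON file.
    const dataset: DatasetItem[] = [
      { input: { message: "My order arrived two weeks late." }, expectedOutput: "negative" },
      { input: { message: "The refund came through within an hour, thanks!" }, expectedOutput: "positive" },
    ];

During a run, each item's input is fed to the prompt and the actual output is compared against the expected output, which is what makes regression checks and performance metrics possible.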

LLM as Judge Evaluations

Coming soon! LLM evaluations will provide automated assessment of your prompt outputs using language models as judges (see the sketch after this list for the general pattern). Key features will include:
  • Real-time evaluation of prompt outputs
  • Batch evaluation of datasets
  • Customizable scoring criteria (numeric, boolean, classification, etc.)
  • Detailed reasoning for each evaluation
  • Aggregated quality metrics across runs
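Until this ships, the general pattern is easy to sketch. The TypeScript below is not the AgentMark API: judgeOutput, the injected callModel client, and the JSON verdict format are all assumptions used purely to illustrate how an LLM judge scores an output against a criterion.

    // Generic LLM-as-judge sketch (not the AgentMark API). The model call is
    // injected as a function so no particular SDK is assumed.
    interface Verdict {
      score: number;     // e.g. a 0-1 quality score
      reasoning: string; // the judge model's explanation
    }

    async function judgeOutput(
      callModel: (prompt: string) => Promise<string>, // any LLM client
      criteria: string,
      input: string,
      output: string
    ): Promise<Verdict> {
      const prompt = [
        "You are grading a prompt's output against the criteria below.",
        `Criteria: ${criteria}`,
        `Input: ${input}`,
        `Output: ${output}`,
        'Reply with JSON only: {"score": <0 to 1>, "reasoning": "<why>"}',
      ].join("\n");
      const raw = await callModel(prompt); // ask the judge model for a verdict
      return JSON.parse(raw) as Verdict;   // assumes the judge returns valid JSON
    }

A batch evaluation over a dataset would simply map judgeOutput across its items and aggregate the returned scores into run-level metrics.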

Annotations

Annotations provide a way to manually evaluate and label traces with human judgment (a possible record shape is sketched after this list). This enables:
  • Human-in-the-loop quality assessment
  • Creation of training datasets from production data
  • Edge case documentation and debugging
  • Complementary insights to automated evaluations
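For illustration, an annotation attached to a trace might carry fields like the ones below; these names are assumptions rather than the dashboard's actual schema.

    // Hypothetical shape of a manual annotation attached to a trace;
    // the fields shown in AgentMark's dashboard may differ.
    interface Annotation {
      traceId: string;   // the trace being reviewed
      score: number;     // reviewer's quality score, e.g. 0-1
      label: string;     // e.g. "correct", "hallucination", "needs-review"
      reasoning: string; // free-form explanation from the reviewer
      annotator: string; // who added the annotation
    }
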
Team members can add annotations directly to traces in the dashboard, providing scores, labels, and detailed reasoning for their assessments.

This combination of datasets, automated evaluations, and manual annotations gives you comprehensive tools to test, validate, and improve your prompts systematically.
