The Relevance AI MCP server includes 19 tools for managing evaluations programmatically. These tools cover the complete evaluation lifecycle: creating test sets and test cases, configuring evaluator rules and tool simulations, running evaluations, and monitoring batch results. They enable CI/CD integration, automated testing frameworks, and bulk operations that would be impractical through the UI.
This page covers the MCP tools for programmatic eval management. For the UI-based workflow, see Evals.
Rollout Status: Evals is currently being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see this feature in your account yet, reach out to your account manager to discuss access.
Prerequisites
You need the Relevance AI MCP server connected to your AI coding assistant before using these tools. See the MCP Server page for setup instructions. For better results, also clone the agent skills repository — it gives your assistant the knowledge to use the MCP tools correctly.
Managing test sets
Test sets (also called Test Suites in the UI) are containers for test cases that you run together as a group.
What you can do
- Create a new test set for an agent
- List all test sets for an agent
- Get the details of a specific test set
- Update a test set’s name or configuration
- Delete a test set
Example prompts
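For example, you might ask your assistant prompts along these lines (the agent and test set names are illustrative):
- "Create a test set called 'Refund flow' for my customer support agent."
- "List all the test sets for my onboarding agent and show me their names and IDs."
- "Rename the 'Refund flow' test set to 'Refunds and returns' and delete the old 'Scratch tests' set."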
Managing test cases
Test cases are individual scenarios within a test set. Each test case defines a simulated user persona, an opening message, conversation limits, and its own evaluator rules.
What you can do
- Create a test case within a test set
- List all test cases in a test set
- Get the details of a specific test case
- Update a test case’s scenario, persona, or configuration
- Delete a test case
Example prompts
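For example (the personas, scenarios, and names are illustrative):
- "Add a test case to the 'Refund flow' test set where the user is a frustrated customer asking for a refund on a damaged item, capped at five conversation turns."
- "List the test cases in 'Refund flow' and show each one's persona and opening message."
- "Update the 'angry customer' test case so the persona insists on speaking to a human."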
Configuring evaluator rules
Evaluator rules define the criteria used to assess whether an agent’s response passes or fails a test case. You can add, update, and remove evaluator rules on individual test cases.
Evaluator rule types
| Type | What it checks |
|---|---|
| LLM Judge | Evaluates the conversation against a prompt you write, using an LLM to score the result |
| String Contains | Checks whether the agent’s response includes specific text |
| String Equals | Checks whether the agent’s response exactly matches an expected value |
| Tool Usage | Checks whether a specific tool was used, and how many times or in what position |
What you can do
- Add an evaluator rule to a test case
- Update an existing evaluator rule
- Remove an evaluator rule from a test case
- List all evaluator rules on a test case
Example prompts
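For example (the test case names and tool names are illustrative):
- "Add an LLM Judge rule to the 'refund request' test case that checks the agent apologises before asking for the order number."
- "Add a Tool Usage rule requiring the order lookup tool to be called exactly once."
- "List the evaluator rules on the 'greeting' test case and remove the String Contains rule."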
Configuring tool simulation
Tool simulation lets you emulate tool responses during evaluations without actually calling the real tools. This is useful for testing how your agent handles specific tool outputs without incurring real API calls or side effects. Tool simulations are configured at the test case level. You specify the tool to simulate and a prompt describing the fake response the tool should return.
Example prompts
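For example (the tool names and simulated outputs are illustrative):
- "For the 'refund request' test case, simulate the order lookup tool so it returns an order that is outside the 30-day return window."
- "Simulate the CRM tool returning an error so I can check how the agent recovers."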
Running evaluations
You can trigger evaluation runs programmatically against a test set. This is the same operation as clicking Run in the UI, but callable from scripts, CI pipelines, and automated workflows.
What you can do
- Run a test set (runs all test cases in the set)
- Run an individual test case
- Include or exclude global evaluators from a run
Example prompts
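For example (the test set and test case names are illustrative):
- "Run the 'Refund flow' test set and include the global evaluators."
- "Run just the 'angry customer' test case, excluding the global evaluators."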
Monitoring batch results
After triggering a run, you can retrieve the results programmatically — including per-test-case scores, evaluator verdicts, and conversation logs.
What you can do
- List all evaluation runs for a test set
- Get the detailed results for a specific run, including scores and evaluator verdicts
- Check whether a run is still in progress or complete
Example prompts
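For example (the test set names are illustrative):
- "List the recent runs of the 'Refund flow' test set with their pass rates."
- "Get the detailed results of the latest run, including which evaluator rules failed and the conversation logs for the failing cases."
- "Is the run I started ten minutes ago still in progress?"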
CI/CD integration
Because evaluation runs are fully programmable via MCP, you can integrate them into automated pipelines:
- Trigger a test set run as part of a pre-deployment check
- Poll for completion and parse pass/fail status
- Block deployment if scores fall below a threshold (see the sketch after this list)
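If your pipeline exports the run results to a file, the gating logic itself is only a few lines of scripting. The sketch below is illustrative: the results.json file, its field names, and the 90% threshold are assumptions, not the actual MCP output format.

```python
# deploy_gate.py: minimal sketch of the "block deployment" step.
# Assumes an earlier pipeline step has exported the eval run results to
# results.json; the field names below (test_cases, passed) are illustrative,
# not the actual MCP output schema.
import json
import sys

THRESHOLD = 0.9  # minimum fraction of passing test cases required to deploy


def main() -> None:
    with open("results.json") as f:
        results = json.load(f)

    cases = results["test_cases"]
    passed = sum(1 for case in cases if case["passed"])
    pass_rate = passed / len(cases) if cases else 0.0

    print(f"{passed}/{len(cases)} test cases passed ({pass_rate:.0%})")
    if pass_rate < THRESHOLD:
        print("Pass rate below threshold, blocking deployment.", file=sys.stderr)
        sys.exit(1)  # non-zero exit fails the CI job


if __name__ == "__main__":
    main()
```

In a CI system, the non-zero exit status is what fails the job and blocks the deployment.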
Example CI/CD workflow using an AI coding assistant
Ask your AI coding assistant to run the workflow in a single prompt. Your assistant will use the MCP eval tools to carry out each step and return a structured report you can act on.
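For example (the test set name, agent, and threshold are illustrative):
- "Run the 'Pre-deployment checks' test set for my support agent, wait for the run to complete, then report the overall pass rate and list every test case where an evaluator failed. If the pass rate is below 90%, say clearly that the deployment should be blocked."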
Learn more
- Evals (UI workflow) — create and manage evaluations through the Relevance AI interface
- MCP Server — connect your AI coding assistant to Relevance AI
- Agent Skills — give your assistant built-in knowledge of Relevance AI tools

