Documentation Index
Fetch the complete documentation index at: https://relevanceai-docs-merge-mcp.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.

Programmatic access
You can manage the full evaluation lifecycle programmatically using the Relevance AI MCP server. This covers creating test sets and test cases, configuring evaluator rules and tool simulations, triggering runs, and retrieving results — enabling CI/CD integration and automated testing workflows. See Programmatic evals via MCP for details.
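As a rough sketch of how that might look from a script or CI job, the example below uses the official Python MCP SDK to connect to an MCP server, list its tools, and call one. The server URL, the SSE transport, the tool name `trigger_eval_run`, and its argument shape are illustrative assumptions, not the documented Relevance AI interface; use `session.list_tools()` to discover the real tool names.

```python
# Illustrative only: the URL, tool name, and argument shape below are
# assumptions, not the documented Relevance AI MCP interface.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client


async def main() -> None:
    # Placeholder endpoint; use the URL from your own Relevance AI MCP setup.
    async with sse_client("https://example-relevance-mcp.invalid/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover what the server actually exposes.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Hypothetical tool call: trigger a run for a Test Suite.
            result = await session.call_tool(
                "trigger_eval_run",  # assumed name; confirm via list_tools()
                {"test_suite_id": "YOUR_TEST_SUITE_ID"},
            )
            print(result.content)


asyncio.run(main())
```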
What you can do with Evals
- Conduct Tests
- Create Evaluators
- Monitor Performance
Evals sections
The Evals section contains five main sections, accessible from the left sidebar:
- Test Suites — Create and manage groups of Test scenarios for your Agent. Each Test Suite can contain multiple scenarios with different prompts and evaluation criteria.
- Evaluators — Configure global evaluation criteria that can be applied across any Test Suite or scenario without needing to set them up each time.
- Runs — View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, credit spend, and creation dates for all past runs.
- Publish Checks — Configure which Test Suites must pass before your Agent can be published. Set a pass threshold and optionally block publishing if evaluations fail.
- Performance — Automatically evaluate live Agent conversations by selecting a global Evaluator, setting a sample rate, and filtering by conversation status.
Understanding Evaluators
Evaluators are evaluation criteria that automatically assess Agent conversations. There are two types of Evaluators:
Scenario Evaluators
Scenario Evaluators are created within individual Test scenarios. They evaluate the specific conversation generated by that scenario’s prompt.
- Created inside a Test scenario
- Only apply to the scenario they’re defined in
- Scenario-specific evaluation criteria
Global Evaluators
Global Evaluators are configured in the Evaluators tab. They can be selected to run on any Test Suite or scenario without needing to configure them each time — think of them as reusable defaults.
- Created in the Evaluators tab (separate from Test Suites)
- Can be selected when running any Test Suite or individual scenario
- Useful for standard criteria you want checked across scenarios, such as professional tone, no hallucinations, or brand voice compliance
- Also used in the Performance tab to automatically evaluate live conversations
Evaluator types
When creating an Evaluator (either scenario-level or global), you choose from the following types:
LLM Judge
| Field | Description |
|---|---|
| Evaluation Prompt | Describe the criteria for passing |
| Judge model | Select which model evaluates the conversation |
| Truncate long conversations | When enabled, conversations that exceed the judge model’s context window are trimmed from the oldest messages first, and the eval runs on the remaining portion. When disabled, oversized conversations fail with an error instead. Note that trimming removes early context, which can affect score accuracy if your evaluation criteria depend on the beginning of the conversation. |
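To illustrate the oldest-first trimming behaviour described above, here is a minimal sketch. It assumes a crude per-message token estimate and a fixed budget; the platform's actual tokenizer and truncation logic are not specified in this documentation.

```python
# Conceptual sketch of oldest-first truncation; the token estimate and budget
# are assumptions, not the judge model's real accounting.
def trim_oldest_first(messages: list[str], max_tokens: int) -> list[str]:
    """Drop the earliest messages until the rough token count fits the budget."""
    def rough_tokens(text: str) -> int:
        # Crude stand-in for a real tokenizer: ~1 token per word.
        return len(text.split())

    kept = list(messages)
    while kept and sum(rough_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # remove the oldest message first
    return kept


conversation = [
    "Hi, I need help with my bill.",
    "Sure, can you share your account ID?",
    "It's 12345.",
]
print(trim_oldest_first(conversation, max_tokens=12))
```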
String Contains
| Field | Description |
|---|---|
| Required text | The text that must appear in the response |
String Equals
| Field | Description |
|---|---|
| Expected value | The exact message the Agent should have sent |
Tool Usage
| Field | Description |
|---|---|
| Tool | Select the tool to check for |
| Position | Whether the tool was used anywhere, used first, or used last |
| Comparison | Check if the tool was used at least, exactly, or at most X times |
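As a conceptual illustration of how the three deterministic types behave, the sketch below implements the checks in plain Python. This is not the platform's code; details such as case sensitivity and whitespace handling are assumptions.

```python
# Conceptual behaviour of the deterministic Evaluator types; case and
# whitespace handling here are assumptions, not documented behaviour.
def string_contains(response: str, required_text: str) -> bool:
    return required_text in response


def string_equals(response: str, expected_value: str) -> bool:
    return response == expected_value


def tool_usage(tool_calls: list[str], tool: str, position: str = "anywhere",
               comparison: str = "at_least", count: int = 1) -> bool:
    """Check where and how many times a tool was used in a conversation."""
    if position == "first":
        return bool(tool_calls) and tool_calls[0] == tool
    if position == "last":
        return bool(tool_calls) and tool_calls[-1] == tool
    uses = tool_calls.count(tool)
    if comparison == "exactly":
        return uses == count
    if comparison == "at_most":
        return uses <= count
    return uses >= count  # "at_least"


print(string_contains("Your refund has been processed.", "refund"))          # True
print(tool_usage(["lookup_order", "issue_refund"], "issue_refund", "last"))  # True
```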
Creating a global Evaluator
- Go to the Monitor tab and select Evals, then select Evaluators
- Click + New Evaluator
- Select a Type (LLM Judge, String Contains, String Equals, or Tool Usage)
- Enter a Name for the Evaluator (e.g., “Professional Tone”)
- Configure the type-specific settings (see table above)
- Click Create Evaluator
Creating a Test Suite with a scenario
- Open your Agent in the builder and click the Monitor tab (next to the Run tab). Select Evals from the left sidebar, then select Test Suites.
- Click the + New test suite button. Enter a name for your Test Suite and click Create.
- Click on the Test Suite you just created to open it.
- Click the + New Test button to create a scenario within your Test Suite.
- Fill in the scenario details (a conceptual sketch of a complete scenario definition follows these steps):

| Field | Description | Example |
|---|---|---|
| Scenario name | A descriptive name for this Test case | “Response Empathy” |
| Persona & situation | The persona or situation the simulated user will adopt | “You are an impatient customer who wants quick answers about their bill.” |
| First message | A fixed message the simulated user sends to your Agent as the opening message (optional) | “Hi, I need help with my bill.” |
| Max turns | Maximum conversation turns (1-50) | 10 |
| Number of runs | How many times this scenario should be executed | 3 |

- Add Evaluators to define how this specific scenario should be evaluated:

| Field | Description | Example |
|---|---|---|
| Type | The Evaluator type | LLM Judge |
| Name | Name of the evaluation criterion | “Empathy Shown” |
| Type-specific config | Settings based on the chosen type (see Evaluator types) | Evaluation Prompt: “Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions?” |

Click Create Evaluator to save it. You can then create additional Evaluators to add more evaluation criteria to the scenario.
- (Optional) Add Tool simulations to emulate tool usage without actually calling the tools. Tool simulations are configured at the scenario level:
- Select a tool to simulate
- Provide a prompt describing what the tool should return (a fake response is generated based on your prompt)
- In the Advanced dropdown, you can select a Simulation model to control which model generates the simulated response
- Click Save Test scenario to save your configuration.
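Pulling the fields above together, a complete scenario definition might be represented like this. The structure and key names are illustrative assumptions, not the platform's schema.

```python
# Illustrative representation of a Test scenario; key names are assumptions,
# not the platform's actual schema.
scenario = {
    "name": "Response Empathy",
    "persona_and_situation": (
        "You are an impatient customer who wants quick answers about their bill."
    ),
    "first_message": "Hi, I need help with my bill.",  # optional fixed opener
    "max_turns": 10,        # allowed range is 1-50
    "number_of_runs": 3,
    "evaluators": [
        {
            "type": "LLM Judge",
            "name": "Empathy Shown",
            "evaluation_prompt": (
                "Did the Agent acknowledge the customer's frustration and "
                "express empathy before offering solutions?"
            ),
        }
    ],
    "tool_simulations": [
        {
            "tool": "billing_lookup",  # hypothetical tool name
            "prompt": "Return a plausible invoice summary for the customer.",
        }
    ],
}
```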
Example scenarios
Here are some example Test scenarios you might create:
Customer Support - Empathy test
- Evaluation Prompt: Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.
Sales - Product knowledge test
- Evaluation Prompt: Did the Agent provide accurate product information without making claims that cannot be verified? Responses should be factual, reference actual product capabilities, and acknowledge when information needs to be confirmed by a sales representative.
Support - Escalation handling
- Evaluation Prompt: Did the Agent acknowledge the severity of the situation, validate the customer’s need for escalation, and initiate a handoff to a human representative while maintaining a professional and empathetic tone throughout?
Running evaluations
You can run an entire Test Suite or an individual Test scenario from within a Test Suite by clicking the Run button on either. You can select specific Test scenarios within a Test Suite to run several at once, or run all scenarios in the Test Suite together. Note that you cannot bulk select and run multiple Test Suites at the same time. (For CI/CD use, a programmatic gating sketch follows these steps.)
- Enter a name for the evaluation run (e.g., “Scenario Run - Jan 14, 12:14 PM”). A default name with timestamp is provided.
- Select which global Evaluators to include in the run — you can add or remove global Evaluators before starting. Scenario-level Evaluators are always included automatically.
- Click Run to begin. The system will simulate conversations with your Agent based on your scenario prompts and evaluate them with your selected Evaluators.
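Because runs and results are also exposed through the MCP server (see Programmatic access above), a CI job can fail the build when a run's average score drops below a threshold. The results structure and the 80% threshold below are assumptions for illustration, not a documented payload.

```python
# CI gating sketch: fail the pipeline if the run's average score is too low.
# The results structure below is an assumed shape, not a documented payload.
import sys

PASS_THRESHOLD = 0.8  # assumed project-specific threshold

run_results = {
    "average_score": 0.75,
    "scenarios": [
        {"name": "Response Empathy", "score": 1.0},
        {"name": "Escalation handling", "score": 0.5},
    ],
}

if run_results["average_score"] < PASS_THRESHOLD:
    failing = [s["name"] for s in run_results["scenarios"] if s["score"] < PASS_THRESHOLD]
    print(f"Eval run below threshold ({run_results['average_score']:.0%} < {PASS_THRESHOLD:.0%}); "
          f"failing scenarios: {failing}")
    sys.exit(1)

print("Eval run passed the threshold.")
```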
Understanding results
After running an evaluation, you’ll see a detailed results screen.
Run summary
The top of the results page shows key metrics:
| Metric | Description |
|---|---|
| Average Score | Overall pass rate across all scenarios and Evaluators |
| Number of Conversations | How many Test conversations were evaluated |
| Agent Version | The version of the Agent that was tested |
Scenario results
Each scenario displays:
| Column | Description |
|---|---|
| Status | Running, Completed, or Failed |
| Name | The scenario name |
| Score | Percentage of Evaluators that passed (shown with progress bar) |
| Evaluators | Pass/fail count (e.g., “1/1 passed”) |
| Credits | Credits consumed for this scenario |
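For example, if a scenario ran with four Evaluators and three of them passed, its Score would read 75% and its Evaluators column would show “3/4 passed”; the run’s Average Score is the same pass-rate calculation taken across all scenarios and Evaluators in the run.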
Viewing conversation details
Click View Conversation on any scenario to see:
- The full conversation between the simulated user and your Agent
- Evaluator verdicts from all Evaluators included in the run, with detailed explanations of why each Evaluator passed or failed
For example, an LLM Judge verdict might read:
Pass: The Agent demonstrated strong empathy throughout the conversation. Key examples include: acknowledging the customer’s frustration with being transferred multiple times (“I completely understand how upsetting it must be to feel like you’re not getting the help you need”), validating her experience with the double charge (“I truly understand how frustrating it is to be charged twice”), and directly addressing her skepticism by saying “I completely understand your concerns, especially given your previous experience.”
Performance tab
The Performance tab lets you automatically evaluate live Agent conversations without manually running Test Suites. This is useful for ongoing quality monitoring.
Setting up Performance monitoring
- Go to the Monitor tab, select Evals, then select Performance.
- Select a global Evaluator you’ve created in the Evaluators tab.
- Set a Sample rate — the percentage of conversations to evaluate.
- (Optional) Set a Conversation status filter to only evaluate conversations with specific statuses (e.g., completed, escalated). Leave blank to evaluate all conversations.
- Save your settings.
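As a rough illustration of what the Sample rate from step 3 means, the sketch below evaluates a random subset of conversations at a given percentage. Whether the platform samples randomly or by another scheme is not stated here, so treat this purely as a conceptual example.

```python
# Conceptual sampling sketch; random selection is an assumption, not the
# platform's documented sampling scheme.
import random

SAMPLE_RATE = 0.25  # evaluate roughly 25% of live conversations

def should_evaluate(sample_rate: float = SAMPLE_RATE) -> bool:
    """Return True for roughly `sample_rate` of conversations."""
    return random.random() < sample_rate

conversations = [f"conv-{i}" for i in range(10)]
sampled = [c for c in conversations if should_evaluate()]
print(f"Evaluating {len(sampled)} of {len(conversations)} conversations: {sampled}")
```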
Viewing Performance insights
After the Performance evaluator has processed conversations, you can view:
| Metric | Description |
|---|---|
| Overall Score | Aggregate score across all evaluated conversations |
| Total Runs | Number of conversations evaluated |
| Evaluators | Which Evaluators are active |
- Data points for the overall score over time
- Evaluator breakdown showing individual scoring per Evaluator
- Graphs visualizing Evaluator performance trends
- List of evaluation runs with score, name, and the ability to view the full conversation
Publish Checks
Publish Checks let you choose which Test Suites to run before your Agent is published. If the results don’t meet your threshold, publishing can be blocked. You can configure Publish Checks from the Publish Checks section in Evals.
Test sets to run
Select which Test Suites to run before publishing. Click Add test sets to choose them — all scenarios in the selected Test Suites will be evaluated.
Publish settings
Configure how evaluations affect the publish process:
| Setting | Description |
|---|---|
| Pass threshold (%) | The minimum score percentage required for the evaluation to pass (e.g., 100%) |
| Block publish if evaluation fails | When checked, the Agent will only be published if the evaluation score meets or exceeds the pass threshold. If unchecked, the Agent will be published even if the evaluation fails the threshold. |
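The gating rule in the table amounts to a simple comparison; here is a minimal sketch, assuming the score is expressed as a percentage.

```python
# Minimal sketch of the publish-gate rule: publishing is allowed when the
# evaluation score meets or exceeds the pass threshold, or when blocking is off.
def can_publish(score_pct: float, pass_threshold_pct: float,
                block_on_failure: bool) -> bool:
    if not block_on_failure:
        return True  # publish proceeds even if the evaluation fails
    return score_pct >= pass_threshold_pct

print(can_publish(score_pct=92.0, pass_threshold_pct=100.0, block_on_failure=True))   # False
print(can_publish(score_pct=92.0, pass_threshold_pct=100.0, block_on_failure=False))  # True
```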
Best practices
- Start simple
- Be specific with Evaluators
- Test edge cases
- Use Performance monitoring
Frequently asked questions (FAQs)
How many scenarios can I have in a Test Suite?
How are credits calculated for evaluations?
Credits for an evaluation run are consumed by:
- The Agent task run (the conversation with your Agent)
- The simulator (the persona/user simulation) - uses an LLM to simulate the user persona
- The Evaluator evaluations (both scenario Evaluators and global Evaluators) - each Evaluator uses an LLM to evaluate the conversation
Can I rerun a previous evaluation?
What's the difference between scenario Evaluators and global Evaluators?
Can the LLM Judge evaluate long conversations?
What happens when a conversation is truncated?
I don't see the Evals section. How do I get access?

