Agentic AI evaluation
Last updated: Jul 07, 2025
The agentic AI evaluation module computes metrics to measure the performance of agentic AI tools to help you streamline your workflows and manage risks for your use case.
Agentic AI evaluation is a module in the ibm-watsonx-gov Python SDK. You can use the agentic AI evaluation module to automate and accelerate tasks to help streamline your workflows and manage regulatory compliance risks by measuring performance with quantitative metrics.
The agentic AI evaluation module uses the following evaluators to measure performance for agentic RAG use cases (a minimal usage sketch follows the list):
- evaluate_context_relevance: Computes the context relevance metric for your content retrieval tool.
- evaluate_faithfulness: Computes the faithfulness metric for your answer generation tool. This metric does not require ground truth.
- evaluate_answer_similarity: Computes the answer similarity metric for your answer generation tool. This metric requires ground truth for computation.
- evaluator.evaluate_retrieval_quality: An evaluation decorator for computing retrieval quality metrics on an agentic tool. Retrieval Quality metrics include Context Relevance, Retrieval Precision, Average Precision, Hit Rate, Reciprocal Rank, and NDCG.
- evaluator.evaluate_retrieval_precision: An evaluation decorator for computing the retrieval precision metric on an agentic tool. This metric uses context relevance values, so the context relevance metric is computed as a prerequisite.
- evaluator.evaluate_average_precision: An evaluation decorator for computing the average precision metric on an agentic tool. This metric uses context relevance values, so the context relevance metric is computed as a prerequisite.
- evaluator.evaluate_hit_rate: An evaluation decorator for computing the hit rate metric on an agentic tool. This metric uses context relevance values, so the context relevance metric is computed as a prerequisite.
- evaluator.evaluate_reciprocal_rank: An evaluation decorator for computing the reciprocal rank metric on an agentic tool. This metric uses context relevance values, so the context relevance metric is computed as a prerequisite.
- evaluator.evaluate_ndcg: An evaluation decorator for computing the NDCG metric on an agentic tool. This metric uses context relevance values, so the context relevance metric is computed as a prerequisite.
- evaluator.evaluate_answer_relevance: An evaluation decorator for computing the answer relevance metric on an agentic tool.
- evaluator.evaluate_unsuccessful_requests: An evaluation decorator for computing the unsuccessful requests metric on an agentic tool.
- evaluator.evaluate_tool_call_syntactic_accuracy: An evaluation decorator for computing the tool call syntactic accuracy metric on an agentic tool.
- evaluator.evaluate_answer_quality: An evaluation decorator for computing answer quality metrics on an agentic tool. Answer Quality metrics include Answer Relevance, Faithfulness, Answer Similarity, and Unsuccessful Requests.
- evaluator.evaluate_hap: An evaluation decorator for computing the HAP metric on an agentic tool.
- evaluator.evaluate_pii: An evaluation decorator for computing the PII metric on an agentic tool.
- evaluator.evaluate_harm: An evaluation decorator for computing harm risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_social_bias: An evaluation decorator for computing social bias risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_profanity: An evaluation decorator for computing profanity risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_sexual_content: An evaluation decorator for computing sexual content risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_unethical_behavior: An evaluation decorator for computing unethical behavior risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_violence: An evaluation decorator for computing violence risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_harm_engagement: An evaluation decorator for computing harm engagement risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_evasiveness: An evaluation decorator for computing evasiveness risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_jailbreak: An evaluation decorator for computing jailbreak risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_content_safety: An evaluation decorator for computing content safety metrics on an agentic tool. Content Safety metrics include HAP, PII, Harm, Social Bias, Profanity, Sexual Content, Unethical Behavior, Violence, Harm Engagement, Evasiveness and Jailbreak.
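The metric-group evaluators are applied the same way as the single-metric evaluators: decorate the tool whose inputs and outputs should be scored. The following is a minimal sketch (the tool name and body are placeholders; the evaluator instance is created as shown in the Examples section below):
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator

evaluator = AgenticEvaluator()

@evaluator.evaluate_content_safety  # computes HAP, PII, Harm, Social Bias, and the other content safety metrics
def generate_node(state, config):
    # Placeholder: call a model and store the answer on the state
    pass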
To use the agentic AI evaluation module, you must install the ibm-watsonx-gov Python SDK with specific settings:
pip install "ibm-watsonx-gov[agentic]"
Examples
You can evaluate agentic AI tools with the agentic AI evaluation module as shown in the following examples:
Set up the state
The ibm-watsonx-gov Python SDK provides a pydantic-based state class that you can extend:
from ibm_watsonx_gov.entities.state import EvaluationState
class AppState(EvaluationState):
pass
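Because EvaluationState is a pydantic model, you can declare any additional fields that your tools read and write. The following is a minimal sketch; the extra field names are illustrative assumptions, not part of the SDK:
from typing import Optional

from ibm_watsonx_gov.entities.state import EvaluationState

class AppState(EvaluationState):
    # Illustrative extra fields for a RAG flow; the names are assumptions.
    web_context: Optional[str] = None      # passages returned by the retrieval tool
    generated_text: Optional[str] = None   # answer produced by the generation tool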
Set up the evaluator
To evaluate agentic AI applications, you must instantiate the AgenticEvaluator class to define evaluators that compute different metrics:
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator
evaluator = AgenticEvaluator()
You can also run an advanced version of the evaluator:
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator
from ibm_watsonx_gov.config import AgenticAIConfiguration
from ibm_watsonx_gov.entities.agentic_app import (AgenticApp, MetricsConfiguration, Node)
from ibm_watsonx_gov.metrics import AnswerRelevanceMetric, ContextRelevanceMetric
from ibm_watsonx_gov.entities.enums import MetricGroup
# Define the metrics to be computed at the agentic app (interaction) level in metrics_configuration under AgenticApp;
# these metrics use the agent input and output fields.
# The node-level metrics to be computed after the graph invocation can be specified in the nodes parameter of AgenticApp.
retrieval_quality_config_web_search_node = {
"input_fields": ["input_text"],
"context_fields": ["web_context"]
}
nodes = [Node(name="Web \nSearch \nNode",
metrics_configurations=[MetricsConfiguration(configuration=AgenticAIConfiguration(**retrieval_quality_config_web_search_node),
metrics=[ContextRelevanceMetric()])])]
agent_app = AgenticApp(name="Rag agent",
metrics_configuration=MetricsConfiguration(metrics=[AnswerRelevanceMetric()],
metric_groups=[MetricGroup.CONTENT_SAFETY]),
nodes=nodes)
evaluator = AgenticEvaluator(agentic_app=agent_app)
Add your evaluators
Compute the context relevance metric by defining the retrieval_node tool and decorating it with the evaluate_context_relevance evaluator:
from langchain_core.runnables import RunnableConfig

@evaluator.evaluate_context_relevance
def retrieval_node(state: AppState, config: RunnableConfig):
    # Retrieve context passages for the question and store them on the state
    pass
You can also stack evaluators to compute multiple metrics with a tool. The following example shows the generate_node tool decorated with the evaluate_faithfulness and evaluate_answer_similarity evaluators to compute answer quality metrics:
@evaluator.evaluate_faithfulness
@evaluator.evaluate_answer_similarity
def generate_node(state: AppState, config: RunnableConfig):
    # Generate an answer from the retrieved context and store it on the state
    pass
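The invocation examples that follow call a compiled graph named rag_app, which is not constructed in this topic. One way to assemble it from the decorated tools is sketched below, assuming LangGraph (which the RunnableConfig signatures above suggest); the node names are chosen to match the result tables that follow:
from langgraph.graph import StateGraph, END

# Wire the decorated tools into a simple retrieval -> generation graph.
graph = StateGraph(AppState)
graph.add_node("Retrieval Node", retrieval_node)
graph.add_node("Generation Node", generate_node)
graph.set_entry_point("Retrieval Node")
graph.add_edge("Retrieval Node", "Generation Node")
graph.add_edge("Generation Node", END)
rag_app = graph.compile()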
Make an invocation
When you invoke an application for a row of data, an interaction_id key is added to the inputs to track individual rows and associate metrics with each row:
evaluator.start_run()
result = rag_app.invoke({"input_text": "What is concept drift?", "ground_truth": "Concept drift occurs when the statistical properties of the target variable change over time, causing a machine learning model’s predictions to become less accurate."})
evaluator.end_run()
eval_result = evaluator.get_result()
eval_result.to_df()
The invocation generates a result as shown in the following example:
| interaction_id | Generation Node.answer_similarity | Generation Node.faithfulness | Generation Node.latency | Retrieval Node.context_relevance | Retrieval Node.latency | interaction.cost | interaction.duration | interaction.input_token_count | interaction.output_token_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eb1167b367a9c3787c65c1e582e2e662 | 0.924013 | 0.300423 | 3.801389 | 0.182579 | 1.652945 | 0.000163 | 5.575077 | 608 | 121 |
Invoke the graph on multiple rows
To run a batch invocation, you can define a dataframe with questions and the ground truths for those questions:
import pandas as pd
question_bank_df = pd.read_csv("https://raw.githubusercontent.com/IBM/ibm-watsonx-gov/refs/heads/samples/notebooks/data/agentic/medium_question_bank.csv")
question_bank_df["interaction_id"] = question_bank_df.index.astype(str)
evaluator.start_run()
result = rag_app.batch(inputs=question_bank_df.to_dict("records"))
evaluator.end_run()
eval_result = evaluator.get_result()
eval_result.to_df()
The dataframe index is used as the interaction_id to uniquely identify each row.
The invocation generates a result as shown in the following example:
| interaction_id | Generation Node.answer_similarity | Generation Node.faithfulness | Generation Node.latency | Retrieval Node.context_relevance | Retrieval Node.latency | interaction.cost | interaction.duration | interaction.input_token_count | interaction.output_token_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12f175ffae3b16ec9a27d85888c132ad | 0.914762 | 0.762620 | 1.483254 | 0.434709 | 1.639955 | 0.000131 | 3.147790 | 701 | 44 |
| 31d0b6640589f8779b0252440950fd13 | 0.356361 | 0.584075 | 4.864134 | 0.525792 | 1.353179 | 0.000258 | 6.243586 | 623 | 276 |
| 6d16ee18552116dd2ba4b180cb69ca38 | 0.896585 | 0.889639 | 3.266545 | 0.707973 | 1.686493 | 0.000203 | 4.983225 | 670 | 172 |
| 7aaf0e891fb797fab7d6467b2f5a522a | 0.774119 | 0.735871 | 3.533067 | 0.715336 | 1.849011 | 0.000187 | 5.404923 | 608 | 161 |
| a25b59fd92e8e269d12ecbc40b9475b1 | 0.857428 | 0.875609 | 6.110012 | 0.763275 | 1.374762 | 0.000154 | 7.512924 | 502 | 133 |
| ade9b2b4efdd35f80fa34266ccfdba9b | 0.891241 | 0.786779 | 3.674506 | 0.669930 | 1.050648 | 0.000177 | 4.750497 | 642 | 137 |
| d480865f9b38fe803042e325a28f5ab0 | 0.935062 | 0.267500 | 3.108228 | 0.182579 | 1.640975 | 0.000163 | 4.776831 | 608 | 121 |
| d576d4155ec17dbe176ea1b164264cd5 | 0.861390 | 0.893529 | 2.277618 | 0.838808 | 4.941034 | 0.000144 | 7.247118 | 636 | 83 |
| d5fdb76a19fbeb1d9edfa3da6cf55b15 | 0.661731 | 0.684596 | 2.075541 | 0.680110 | 1.632314 | 0.000128 | 3.730348 | 633 | 57 |
| daf66c5f2577bffac87a746319c16a0d | 0.890937 | 0.808881 | 2.250932 | 0.706106 | 1.515383 | 0.000141 | 3.797323 | 608 | 86 |
For more information, see the sample notebook.
Parent topic: Metrics computation using Python SDK