This sample application demonstrates how to use the Genkit Evaluators Plugin with all 7 metric types available for evaluating AI model outputs.
The Evaluators Plugin provides a comprehensive set of evaluation metrics for assessing AI-generated content.

LLM-based metrics (scored by a judge model):
| Metric | Description |
|---|---|
| FAITHFULNESS | Evaluates if the answer is faithful to the provided context |
| ANSWER_RELEVANCY | Evaluates if the answer is relevant to the question |
| ANSWER_ACCURACY | Evaluates if the answer matches the reference answer |
| MALICIOUSNESS | Detects harmful or malicious content in the output |
Programmatic metrics (scored deterministically, no LLM required):

| Metric | Description |
|---|---|
| REGEX | Pattern matching evaluation using regular expressions |
| DEEP_EQUAL | JSON deep equality comparison |
| JSONATA | JSONata expression evaluation for complex JSON queries |
- Java 21 or higher
- Maven 3.x
- OpenAI API key (for LLM-based evaluators)
- Set your OpenAI API key:

  ```bash
  export OPENAI_API_KEY=your-api-key-here
  ```

- Build the plugin (from project root):

  ```bash
  cd ../..
  mvn clean install -DskipTests
  ```
Run the sample:

```bash
./run.sh
```

or run it directly with Maven:

```bash
mvn clean compile exec:java
```

Once the application starts, you'll have access to:
- Dev UI: http://localhost:3100
- API: http://localhost:8080
The application creates three sample datasets automatically:
- qa_evaluation - Q&A pairs for testing LLM-based evaluators
- regex_validation - Pattern matching test cases
- json_comparison - JSON equality test cases
Test the `answerQuestion` flow:

```bash
curl -X POST http://localhost:8080/api/flows/answerQuestion \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "What is the capital of France?",
    "context": "France is a country in Europe. Paris is the capital of France."
  }'
```

Test the programmatic evaluators with the `testEvaluators` flow:

```bash
curl -X POST http://localhost:8080/api/flows/testEvaluators \
  -H 'Content-Type: application/json' \
  -d '{
    "output": "This is a successful response!",
    "regexPattern": ".*successful.*"
  }'
```

**Important:** Evaluators are not designed to be invoked directly from the "Actions" tab like flows. Instead, use the Evaluations tab:
- Open http://localhost:3100 in your browser
- Navigate to the Evaluations section (not the Actions section)
- Create or select a dataset (the sample creates `qa_evaluation`, `regex_validation`, and `json_comparison`)
- Select a target action (e.g., `/flow/answerQuestion`)
- Choose evaluators to run (e.g., `genkitEval/faithfulness`, `genkitEval/answer_relevancy`)
- Click "Run Evaluation" to see results
Why evaluators show "No input variables" in the Actions tab:
Evaluators expect an EvalRequest with a dataset array containing multiple test cases. They're designed for batch evaluation, not individual testing. The proper workflow is:
- Test your flows individually in the Actions tab
- Run evaluations using the Evaluations tab with datasets
The most reliable way to run evaluations is via the API endpoint:
```bash
# Run faithfulness evaluation on qa_evaluation dataset
curl -X POST http://localhost:3100/api/runEvaluation \
  -H 'Content-Type: application/json' \
  -d '{
    "dataSource": {"datasetId": "qa_evaluation"},
    "targetAction": "/flow/answerQuestion",
    "evaluators": ["genkitEval/faithfulness"]
  }'
```

This will:

- Load the dataset from `.genkit/datasets/qa_evaluation.json`
- Run the `/flow/answerQuestion` flow for each data point
- Evaluate the output using the faithfulness metric
- Save results to `.genkit/evals/`
Example response:
```json
{
  "actionRef": "/flow/answerQuestion",
  "datasetId": "qa_evaluation",
  "evalRunId": "7588512e-d124-4472-a608-911ca9b8d81c",
  "createdAt": "2025-12-25T21:04:20.549340Z"
}
```

View evaluation results:

```bash
# List saved evaluations
ls .genkit/evals/

# View specific evaluation run
cat .genkit/evals/<evalRunId>.json | jq .
```

You can also run evaluations programmatically using the Genkit evaluation API:
```java
// Create evaluation request
RunEvaluationRequest.DataSource dataSource = new RunEvaluationRequest.DataSource();
dataSource.setDatasetId("qa_evaluation");

RunEvaluationRequest request = RunEvaluationRequest.builder()
    .dataSource(dataSource)
    .targetAction("/flow/answerQuestion")
    .evaluators(Arrays.asList(
        "genkitEval/faithfulness",
        "genkitEval/answer_relevancy",
        "genkitEval/answer_accuracy"
    ))
    .build();

EvalRunKey result = genkit.evaluate(request);
```

The Evaluators Plugin supports various configuration options:
```java
EvaluatorsPluginOptions options = EvaluatorsPluginOptions.builder()
    // Use specific metrics only
    .metricTypes(GenkitMetric.FAITHFULNESS, GenkitMetric.REGEX)
    // Or use all metrics
    .useAllMetrics()
    // Configure judge model (for LLM-based metrics)
    .judge("openai/gpt-4o-mini")
    // Configure embedder (for answer relevancy)
    .embedder("openai/text-embedding-3-small")
    // Per-metric configuration (for overriding judge/embedder per metric)
    .metricConfig(GenkitMetric.FAITHFULNESS, MetricConfig.withJudge(
        GenkitMetric.FAITHFULNESS,
        "openai/gpt-4o" // Use a more powerful model for faithfulness
    ))
    .build();
```

**Note:** For programmatic metrics like REGEX, DEEP_EQUAL, and JSONATA, the pattern/expression is provided in the `reference` field of each test case, not in the configuration.
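For example, a REGEX test case might carry its pattern like this (a purely illustrative sketch; the field names other than `reference` are assumptions, so check the generated sample datasets under `.genkit/datasets/` for the exact schema):

```json
{
  "input": "Run the flow",
  "output": "This is a successful response!",
  "reference": ".*successful.*"
}
```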
**FAITHFULNESS**

Evaluates whether the generated answer is faithful to the provided context. Uses a two-step process:
- Extract statements from the answer
- Verify each statement against the context using NLI (Natural Language Inference)
Score: Ratio of faithful statements to total statements (0.0 - 1.0)
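The final scoring arithmetic can be sketched as follows (a minimal illustration with hypothetical names, not the plugin's actual code; treating zero extracted statements as a 0.0 score is an assumption):

```java
// Sketch of the faithfulness scoring step: fraction of extracted
// statements that were verified against the context.
public class FaithfulnessScore {
    public static double score(int faithfulStatements, int totalStatements) {
        if (totalStatements == 0) {
            return 0.0; // no statements extracted: treat as unfaithful (assumption)
        }
        return (double) faithfulStatements / totalStatements;
    }

    public static void main(String[] args) {
        // 3 of 4 extracted statements verified against the context
        System.out.println(FaithfulnessScore.score(3, 4)); // 0.75
    }
}
```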
**ANSWER_RELEVANCY**

Evaluates if the answer is relevant to the question. Optionally uses embedding similarity.
Score: Based on LLM judgment and optional cosine similarity (0.0 - 1.0)
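The embedding-similarity part boils down to cosine similarity between the question and answer embedding vectors. A minimal sketch (hypothetical helper, not the plugin's implementation):

```java
// Cosine similarity between two embedding vectors: dot product
// divided by the product of the vector norms, in [-1, 1].
public class CosineSimilarity {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0; // zero vector: undefined, score 0
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] question = {1.0, 0.0};
        double[] answer = {1.0, 0.0};
        // identical directions give maximum similarity
        System.out.println(CosineSimilarity.cosine(question, answer)); // 1.0
    }
}
```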
**ANSWER_ACCURACY**

Evaluates if the generated answer matches a reference answer. Uses bidirectional comparison with harmonic mean.
Score: Harmonic mean of forward and backward accuracy (0.0 - 1.0)
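The harmonic-mean combination can be sketched as follows (hypothetical helper names). The harmonic mean punishes disagreement: one weak direction drags the combined score down more than an arithmetic mean would:

```java
// Combine forward (answer vs. reference) and backward (reference vs.
// answer) accuracy scores with a harmonic mean: 2ab / (a + b).
public class AnswerAccuracyScore {
    public static double harmonicMean(double forward, double backward) {
        if (forward + backward == 0) return 0.0; // avoid division by zero
        return 2.0 * forward * backward / (forward + backward);
    }

    public static void main(String[] args) {
        // A perfect forward score with a mediocre backward score lands
        // closer to the weaker of the two (arithmetic mean would be 0.75).
        System.out.println(AnswerAccuracyScore.harmonicMean(1.0, 0.5));
    }
}
```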
**MALICIOUSNESS**

Detects harmful, unethical, or malicious content in the output.
Score: 1.0 if safe, 0.0 if malicious
**REGEX**

Matches the output against a regular expression pattern.
Score: 1.0 if matches, 0.0 if doesn't match
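A minimal sketch of this pass/fail scoring (hypothetical helper; whether the real evaluator requires a full match or a partial match is an assumption here, and the sample pattern's `.*` anchors make it work either way):

```java
import java.util.regex.Pattern;

// Binary regex scoring: 1.0 if the output matches the pattern taken
// from the test case's reference field, 0.0 otherwise.
public class RegexScore {
    public static double score(String output, String pattern) {
        return Pattern.compile(pattern).matcher(output).matches() ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(RegexScore.score("This is a successful response!", ".*successful.*")); // 1.0
        System.out.println(RegexScore.score("This request failed", ".*successful.*"));            // 0.0
    }
}
```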
**DEEP_EQUAL**

Compares two JSON objects for deep equality.
Score: 1.0 if equal, 0.0 if different
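Conceptually this is recursive structural equality, independent of key order. A sketch over already-parsed JSON-like values (Maps, Lists, primitives), relying on the fact that Java's `Map.equals` and `List.equals` already recurse element by element (hypothetical helper, not the plugin's code):

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Binary deep-equality scoring over parsed JSON-like structures.
public class DeepEqualScore {
    public static double score(Object output, Object reference) {
        return Objects.equals(output, reference) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        Object a = Map.of("city", "Paris", "tags", List.of("capital", "europe"));
        Object b = Map.of("tags", List.of("capital", "europe"), "city", "Paris");
        // JSON object key order is irrelevant to equality
        System.out.println(DeepEqualScore.score(a, b)); // 1.0
    }
}
```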
**JSONATA**

Evaluates a JSONata expression against the output and checks whether the result is truthy.
Score: Based on JSONata expression result (normalized to 0.0 - 1.0)
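A sketch of one plausible truthiness-to-score mapping for the expression's result (the plugin's exact normalization rules are an assumption here):

```java
// Normalize a JSONata evaluation result to a 0.0/1.0 score using
// JSON-style truthiness (assumed rules, not the plugin's actual code).
public class JsonataScore {
    public static double score(Object result) {
        if (result == null) return 0.0;                                  // missing result
        if (result instanceof Boolean b) return b ? 1.0 : 0.0;           // boolean as-is
        if (result instanceof Number n) return n.doubleValue() != 0 ? 1.0 : 0.0;
        if (result instanceof String s) return s.isEmpty() ? 0.0 : 1.0;  // empty string is falsy
        return 1.0; // any other value (object, array) counts as truthy (assumption)
    }

    public static void main(String[] args) {
        System.out.println(JsonataScore.score(Boolean.TRUE)); // 1.0
        System.out.println(JsonataScore.score(""));           // 0.0
    }
}
```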
Evaluation data is stored under the project's `.genkit` directory:

- Datasets: `./.genkit/datasets/`
- Evaluation Runs: `./.genkit/evals/`
This is expected behavior! Evaluators are not meant to be run directly from the Actions tab. Use the Evaluations tab instead to:
- Create/select a dataset
- Run evaluations against a flow
This usually means:
- Missing output: Make sure your flow returns an object with an `answer` key, or the evaluator can't find the output to evaluate
- Missing context: For the FAITHFULNESS metric, ensure the input has a `context` array or the output includes context
- Wrong input format: LLM evaluators expect:
  - Input: `{"question": "..."}` or `{"question": "...", "context": [...]}`
  - Output: `{"answer": "..."}` or a plain string
Make sure your OPENAI_API_KEY is set correctly:
```bash
echo $OPENAI_API_KEY
```

Ensure the evaluators plugin is built first:

```bash
cd ../../plugins/evaluators
mvn clean install
```

If port 8080 or 3100 is in use, you can modify the ports in the sample code or stop the existing process.