This sample demonstrates how to use Genkit's evaluation framework to assess AI output quality with custom evaluators and datasets.
- Custom Evaluators - Define evaluators for length, keywords, sentiment
- LLM-Based Evaluators - Use AI to evaluate AI outputs
- Datasets - Create and manage evaluation datasets
- Evaluation Runs - Execute evaluations and view results
- Dev UI Integration - View evaluations in the Genkit Dev UI

Prerequisites:
- Java 21+
- Maven 3.6+
- OpenAI API key

```shell
# Set your OpenAI API key
export OPENAI_API_KEY=your-api-key-here
# Navigate to the sample directory
cd java/samples/evaluations
# Run the sample
./run.sh
# Or: mvn compile exec:java
```

To run with the Genkit CLI and Dev UI instead:

```shell
# Set your OpenAI API key
export OPENAI_API_KEY=your-api-key-here
# Navigate to the sample directory
cd java/samples/evaluations
# Run with Genkit CLI
genkit start -- ./run.sh
```

The Dev UI will be available at http://localhost:4000

Important: Run `genkit start` from the same directory where the Java app is running. This ensures the Dev UI can find the datasets stored in `.genkit/datasets/`.

| Flow | Input | Output | Description |
|---|---|---|---|
| describeFood | String (food) | String | Generate appetizing food descriptions |

This sample defines several custom evaluators:
| Evaluator | Description |
|---|---|
| custom/length | Checks whether the output length is between 50 and 500 characters (inclusive) |
| custom/keywords | Checks for food-related descriptive keywords |
| custom/sentiment | Evaluates positive/appetizing sentiment |
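
To illustrate what a sentiment check of this kind might compute, here is a minimal sketch in plain Java with no Genkit dependencies. It is not the sample's actual `custom/sentiment` implementation: the class name `SentimentSketch`, the word lists, and the neutral score of 0.5 are all assumptions made for illustration.

```java
// Hypothetical sketch; not the sample's custom/sentiment evaluator.
final class SentimentSketch {
    // Assumed word lists, chosen for illustration only.
    private static final String[] POSITIVE = {"delicious", "mouthwatering", "rich", "fresh", "crispy"};
    private static final String[] NEGATIVE = {"bland", "stale", "soggy", "burnt"};

    /** Returns the share of matched sentiment words that are positive, or 0.5 if none match. */
    static double score(String output) {
        String text = output.toLowerCase();
        int positive = countMatches(text, POSITIVE);
        int negative = countMatches(text, NEGATIVE);
        int total = positive + negative;
        return total == 0 ? 0.5 : (double) positive / total;
    }

    /** Counts how many of the given words occur in the text as substrings. */
    private static int countMatches(String text, String[] words) {
        int count = 0;
        for (String word : words) {
            if (text.contains(word)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(score("A rich, delicious chocolate cake"));
        System.out.println(score("A stale and bland sandwich"));
    }
}
```

A substring match like this is crude (it ignores negation such as "not bland"), which is one reason to consider an LLM-based evaluator for sentiment.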

```shell
# Call the describeFood flow
curl -X POST http://localhost:8080/describeFood \
  -H 'Content-Type: application/json' \
  -d '"chocolate cake"'
```

```java
// custom/length: scores 1.0 (PASS) when the output is 50 to 500 characters long,
// otherwise 0.5 (FAIL)
Evaluator<Void> lengthEvaluator = genkit.defineEvaluator(
    "custom/length",
    "Output Length",
    "Evaluates whether the output has an appropriate length",
    (dataPoint, options) -> {
        String output = dataPoint.getOutput().toString();
        int length = output.length();
        double score = (length >= 50 && length <= 500) ? 1.0 : 0.5;
        EvalStatus status = score == 1.0 ? EvalStatus.PASS : EvalStatus.FAIL;
        return EvalResponse.builder()
            .testCaseId(dataPoint.getTestCaseId())
            .evaluation(Score.builder()
                .score(score)
                .status(status)
                .reasoning("Output length: " + length)
                .build())
            .build();
    });
```

```java
// custom/keywords: score is the number of keywords found divided by 3, capped at 1.0;
// the status is PASS when at least 2 keywords are found
Evaluator<Void> keywordEvaluator = genkit.defineEvaluator(
    "custom/keywords",
    "Food Keywords",
    "Checks for food-related descriptive keywords",
    (dataPoint, options) -> {
        String output = dataPoint.getOutput().toString().toLowerCase();
        List<String> keywords = Arrays.asList(
            "delicious", "tasty", "flavor", "savory", "sweet");
        int foundCount = 0;
        for (String keyword : keywords) {
            if (output.contains(keyword)) foundCount++;
        }
        double score = Math.min(1.0, foundCount / 3.0);
        return EvalResponse.builder()
            .testCaseId(dataPoint.getTestCaseId())
            .evaluation(Score.builder()
                .score(score)
                .status(foundCount >= 2 ? EvalStatus.PASS : EvalStatus.FAIL)
                .reasoning("Found " + foundCount + " keywords")
                .build())
            .build();
    });
```

Datasets are stored in `.genkit/datasets/` and can be managed via the Dev UI or programmatically:

```java
// Create a dataset
List<DatasetItem> items = Arrays.asList(
    new DatasetItem("test-1", "pizza", null),
    new DatasetItem("test-2", "sushi", null),
    new DatasetItem("test-3", "tacos", null)
);
// Run evaluation
EvalRunKey result = genkit.evaluate(
    RunEvaluationRequest.builder()
        .datasetId("food-dataset")
        .evaluators(List.of("custom/length", "custom/keywords"))
        .actionRef("/flow/describeFood")
        .build());
```

When running with `genkit start`, access the Dev UI at http://localhost:4000 to:
- Create and manage datasets
- Run evaluations on flows
- View evaluation results and scores
- Compare evaluation runs
- Inspect individual test cases
Evaluation results include:
- Score: Numeric value (0.0 - 1.0)
- Status: PASS, FAIL, or UNKNOWN
- Reasoning: Explanation of the score
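
As a rough illustration of how these three fields could be aggregated across test cases, the sketch below models a result with hypothetical stand-in types and computes a pass rate and mean score. `EvalSummarySketch`, `Result`, and `Status` are invented for this example and are not Genkit classes; the range check on the score simply mirrors the 0.0 to 1.0 range stated above.

```java
// Hypothetical stand-ins for illustration; these are not Genkit classes.
final class EvalSummarySketch {
    enum Status { PASS, FAIL, UNKNOWN }

    /** One evaluation result: a score in [0.0, 1.0], a status, and the reasoning behind it. */
    record Result(double score, Status status, String reasoning) {
        Result {
            if (!(score >= 0.0 && score <= 1.0)) {
                throw new IllegalArgumentException("score must be in [0.0, 1.0]: " + score);
            }
        }
    }

    /** Fraction of results whose status is PASS; 0.0 for an empty input. */
    static double passRate(Result... results) {
        if (results.length == 0) {
            return 0.0;
        }
        int passed = 0;
        for (Result result : results) {
            if (result.status() == Status.PASS) {
                passed++;
            }
        }
        return (double) passed / results.length;
    }

    /** Arithmetic mean of the scores; 0.0 for an empty input. */
    static double meanScore(Result... results) {
        if (results.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (Result result : results) {
            sum += result.score();
        }
        return sum / results.length;
    }

    public static void main(String[] args) {
        Result inRange = new Result(1.0, Status.PASS, "Output length: 120");
        Result tooShort = new Result(0.5, Status.FAIL, "Output length: 12");
        System.out.println("Pass rate: " + passRate(inRange, tooShort));
        System.out.println("Mean score: " + meanScore(inRange, tooShort));
    }
}
```

Reporting the pass rate alongside the mean score is useful because the two can diverge: in the length evaluator above, a failing output still receives a score of 0.5.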