90 lines
2.8 KiB
Markdown
90 lines
2.8 KiB
Markdown
# Evals — Genkit Python
|
|
|
|
## Two types of evaluators
|
|
|
|
1. **Built-in** — ship with `genkit-plugin-evaluators`, register with `register_genkit_evaluators(ai)`
|
|
2. **BYO (LLM-based)** — define your own scoring logic with `ai.define_evaluator()`
|
|
|
|
## Install
|
|
|
|
```bash
|
|
uv add genkit-plugin-evaluators
|
|
```
|
|
|
|
## Dataset format
|
|
|
|
A JSON file, one object per test case:
|
|
```json
|
|
[
|
|
{"testCaseId": "case1", "input": "x", "output": "banana", "reference": "ba?a?a"},
|
|
{"testCaseId": "case2", "input": "x", "output": "apple", "reference": "ba?a?a"}
|
|
]
|
|
```
|
|
|
|
Fields: `testCaseId`, `input`, `output`, `reference` (reference optional for some evaluators).
|
|
|
|
## Built-in evaluators
|
|
|
|
```python
|
|
from genkit.plugins.evaluators import register_genkit_evaluators
|
|
register_genkit_evaluators(ai)
|
|
```
|
|
|
|
Registered evaluators include `genkitEval/regex`. Run via CLI:
|
|
```bash
|
|
genkit eval:run datasets/my_dataset.json --evaluators=genkitEval/regex
|
|
```
|
|
|
|
## BYO evaluator
|
|
|
|
```python
|
|
from genkit.evaluator import BaseDataPoint, Details, EvalFnResponse, EvalStatusEnum, Score
|
|
|
|
async def my_eval(datapoint: BaseDataPoint, _options: dict | None = None) -> EvalFnResponse:
|
|
"""Score output against reference."""
|
|
output = str(datapoint.output or '')
|
|
reference = str(datapoint.reference or '')
|
|
passed = output.strip() == reference.strip()
|
|
return EvalFnResponse(
|
|
test_case_id=datapoint.test_case_id or '',
|
|
evaluation=Score(
|
|
score=1.0 if passed else 0.0,
|
|
status=EvalStatusEnum.PASS if passed else EvalStatusEnum.FAIL,
|
|
details=Details(reasoning='Exact match check'),
|
|
),
|
|
)
|
|
|
|
ai.define_evaluator(
|
|
name='byo/my_eval',
|
|
display_name='My Eval',
|
|
definition='Checks exact match of output vs reference.',
|
|
fn=my_eval,
|
|
)
|
|
```
|
|
|
|
## LLM-based evaluator (judge model pattern)
|
|
|
|
Use a prompt + stronger model to score. See `py/samples/evaluators/src/main.py` for full examples (`byo/maliciousness`, `byo/answer_accuracy`).
|
|
|
|
Core pattern:
|
|
```python
|
|
async def llm_eval(datapoint: BaseDataPoint, _options: dict | None = None) -> EvalFnResponse:
|
|
prompt = ai.prompt('my_judge_prompt')
|
|
rendered = await prompt.render(input={'output': str(datapoint.output), 'reference': str(datapoint.reference)})
|
|
response = await ai.generate(model='googleai/gemini-flash-latest', messages=rendered.messages)
|
|
score = float(response.text.strip())
|
|
return EvalFnResponse(
|
|
test_case_id=datapoint.test_case_id or '',
|
|
evaluation=Score(score=score, status=EvalStatusEnum.PASS if score >= 0.5 else EvalStatusEnum.FAIL),
|
|
)
|
|
```
|
|
|
|
## Run evals via CLI
|
|
|
|
```bash
|
|
genkit eval:run datasets/my_dataset.json --evaluators=byo/my_eval
|
|
genkit eval:run datasets/my_dataset.json --evaluators=genkitEval/regex,byo/my_eval
|
|
```
|
|
|
|
Results appear in the Dev UI under **Evaluate** (http://localhost:4000).
|