Eval Runner
LLM-AS-JUDGE
Run
History
Configuration
Feature / Test Name
System Prompt
Model
claude-haiku-4-5 (fast)
claude-sonnet-4-6 (quality)
Test Cases
+ Add case
Run Eval
No past evals yet
Results
–
◈
No eval running
configure prompt + test cases, then run
#
Input
Output
Score
Status
Metrics
Pass Rate
–
awaiting results
Avg Score
–
0-100 scale
Distribution
pass
border
fail
Tests Run
0
0 pending