Build custom benchmarks you can consistently rely on
Power evals with stable Pi Rubrics, instead of unstable system prompts. Compare models, prompts, and frameworks using Pi Scorer.
Clarity
Is the summary written in clear and understandable language suitable for a general audience?
System Prompt v3: 0.90
System Prompt v2: 0.50
Gemini 2.5 Pro: 0.70
GPT-4o: 0.60
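As a rough illustration of how a comparison like this can be consumed, the sketch below ranks the candidates by their Clarity score. It assumes each number on the page belongs to the label that follows it; `rank_candidates` is a hypothetical helper, not part of any Pi library.

```python
# Scores copied from the comparison above, reading each number as belonging
# to the label that follows it (an assumption about the page layout).
clarity_scores = {
    "System Prompt v3": 0.90,
    "System Prompt v2": 0.50,
    "Gemini 2.5 Pro": 0.70,
    "GPT-4o": 0.60,
}


def rank_candidates(scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(clarity_scores)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because every candidate is scored against the same rubric question, the ranking compares like with like.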
Start using rubrics to control your AI
Pi's quality platform defines criteria with rubrics instead of prompts so you can optimize and measure your AI consistently and predictably.
Requires fewer than 30 examples to align
Is stable when editing
Consistent scores
5x more efficient than LLM-as-a-judge
Pi Score: 0.90
Avoids Legal Jargon: 0.70
Does the summary avoid unnecessary legal jargon while retaining essential legal terminology?
Clarity: 0.90
Is the summary written in clear and understandable language suitable for a general audience?
Flexible natural language criteria are evaluated with Pi Scorer, our state-of-the-art scoring model designed for judging data.
Transform data into rubrics
Not sure what to measure? Pi figures it out for you. Feed it any or all of your prompts, your PRDs or your user feedback, and Pi Studio generates an aligned rubric for your application.
Score quickly & consistently
Our foundation model, Pi Scorer, scores more accurately than DeepSeek and GPT-4.1, but runs at the size and speed of GPT Mini and Gemini Flash. You can score 20+ custom dimensions in under 100 ms.
Framework agnostic
A single Pi Scorer can be used in every part of your AI stack and existing tools: offline evals, online observability, training data quality, model optimization, agent control flows and more. Easily plug Pi into Google Spreadsheets, Promptfoo, CrewAI, or any other tool you might be using.
Aligned with your users & experts.
Continuously improve your rubric by calibrating it on your own labels, preferences, and user data to create a powerful feedback loop that matches your team's expertise and actual user behavior.
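One simple way to check how well a rubric matches your own labels, before any calibration, is to measure how often a thresholded score agrees with a human thumbs-up or thumbs-down. The sketch below is only that check, not Pi's calibration method. `agreement_rate`, the 0.5 threshold and the sample values are all illustrative assumptions.

```python
def agreement_rate(rubric_scores, human_labels, threshold=0.5):
    """Fraction of examples where (score >= threshold) matches the human label."""
    if len(rubric_scores) != len(human_labels):
        raise ValueError("scores and labels must have the same length")
    if not rubric_scores:
        raise ValueError("need at least one example")
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(rubric_scores, human_labels)
    )
    return matches / len(rubric_scores)


# Made-up illustrative values, not real Pi Scorer output.
example_scores = [0.9, 0.2, 0.7, 0.4]
example_labels = [True, False, False, False]
rate = agreement_rate(example_scores, example_labels)
print(rate)
```

A low agreement rate suggests the rubric questions, or the threshold, do not yet reflect what your reviewers care about.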
Combines diverse signals
To achieve the best results, Pi’s holistic rubric uses the right signals for the right tasks, such as code correctness for precise tasks and natural language for flexible ones.
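To make the idea of mixing signals concrete, here is a minimal sketch that averages a code-based check with natural-language question scores. It is not Pi's implementation: `combined_score`, `is_valid_python` and the plain average are assumptions for illustration, and `fake_nl_scorer` is a stand-in for a real scoring model so the example runs offline.

```python
import ast


def is_valid_python(text):
    """Code-based signal: 1.0 if the text parses as Python, else 0.0."""
    try:
        ast.parse(text)
        return 1.0
    except SyntaxError:
        return 0.0


def combined_score(output, code_checks, questions, nl_scorer):
    """Average code-check results and natural-language question scores.

    nl_scorer is any callable (question, output) -> float in [0, 1];
    in practice it would wrap a scoring model.
    """
    results = {name: check(output) for name, check in code_checks.items()}
    for question in questions:
        results[question] = nl_scorer(question, output)
    if not results:
        raise ValueError("need at least one check or question")
    total = sum(results.values()) / len(results)
    return total, results


def fake_nl_scorer(question, output):
    """Stand-in scorer (hypothetical): always returns a fixed value."""
    return 0.5


total, per_dimension = combined_score(
    "print('hello')",
    code_checks={"parses_as_python": is_valid_python},
    questions=["Is the code easy to read?"],
    nl_scorer=fake_nl_scorer,
)
print(total, per_dimension)
```

Keeping the deterministic check separate from the natural-language questions means a precise property such as "does it parse" is never left to a model's judgment.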
5x cheaper than LLM judges.
When you maintain large model performance on a smaller model, you can afford to measure everything you care about. Reinvest your savings to measure more dimensions, more frequently.
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
    llm_input="Pi Labs",
    llm_output="Score anything with Pi Labs today!",
    scoring_spec=[{"question": "Is there a strong call to action?"}],
)
print(scores.total_score)
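To score several dimensions at once, the rubric questions can be collected into the same list-of-dicts shape that the snippet above passes as `scoring_spec`. The helper below is a hypothetical convenience, not part of `withpi`. It only builds the list and makes no API call, and it assumes the spec accepts more than one `{"question": ...}` entry.

```python
def build_scoring_spec(questions):
    """Turn plain question strings into the list-of-dicts shape used above."""
    return [{"question": q} for q in questions]


spec = build_scoring_spec([
    "Does the summary avoid unnecessary legal jargon while retaining essential legal terminology?",
    "Is the summary written in clear and understandable language suitable for a general audience?",
])
print(spec)
```

The resulting `spec` could then be passed as the `scoring_spec` argument in the call shown above.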