Eval Design Guide
Best practices for designing effective evals on Portex.
This guide covers how to design evals that produce meaningful, reproducible measurements of AI model capability. It is intended for domain experts authoring evals on the Portex Datalab.
Task Design
Real work, not trivia
Good Portex tasks ask models to produce real work products: tasks with longer time horizons, more complexity, and more realistic references and prompts. Avoid questions that can be answered via web search or brute-force memorization.
Think about what a competent professional in your field would be asked to do, then translate that into a prompt.
Be specific and self-contained
Each task prompt should contain everything the model needs to attempt a response (or point to an attached reference file). Avoid ambiguous phrasing, and state the expected output format explicitly (e.g., "report the total to two decimal places," "list all underlying calculations").
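For illustration, here is a sketch of a self-contained task whose prompt states the output format and names its attachment (the field names are hypothetical, not the exact Datalab schema):

```python
# Hypothetical task definition -- field names are illustrative, not the Datalab schema.
task = {
    "prompt": (
        "Using the attached Q3 media plan (CSV), calculate the total campaign cost "
        "across all channels. Report the total in USD to two decimal places and "
        "list the per-channel subtotals used in the calculation."
    ),
    "reference_file": "q3_media_plan.csv",  # one attached reference file per task
}
```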
Use reference files when needed
You can attach one reference file per task. Accepted formats: PDF, images (JPG, PNG, WebP, GIF), CSV, TXT, JSON, Markdown, and HTML.
Reference files are useful for tasks that require analyzing a document, spreadsheet, or image. The model receives the file alongside the prompt.
Aim for quality over quantity
A few well-crafted tasks with deep rubrics that capture the nuance of quality are better than 20+ shallow ones. That said, more tasks produce more robust scores, so add tasks as you iterate. Portex maintains a leaderboard for each eval showing how frontier models perform, so you can calibrate and adjust your rubrics over time.
Answer Keys (Reference Solution)
The answer key for each task is kept private. It is never shown to models or buyers (unless you license the Core Dataset or open-source it).
Answers do not need to be in a rigid format. They can be numerical, textual, or structured. The only requirement is that the answer key does not contain information that was not asked for in the task prompt.
Detailed rationales are optional but recommended. Including a rationale (how the answer was derived) helps the LLM jury grade more accurately, especially for complex multi-step tasks.
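A minimal sketch of an answer key with a rationale for the task above, assuming a free-form structure (Portex does not mandate a particular schema):

```python
# Hypothetical answer key -- the structure is up to you; the rationale aids the LLM jury.
answer_key = {
    "answer": 48250.00,  # total campaign cost in USD
    "rationale": (
        "Sum of the per-channel subtotals in the media plan: "
        "search 18,500.00 + social 21,750.00 + display 8,000.00 = 48,250.00."
    ),
}
```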
Grading Criteria (Rubric)
Each task should have one or more grading criteria. Criteria define what the judge checks for when scoring a model's response.
Criteria fields
Name: a short label (e.g., "Calculates total campaign cost")
Type: how the criterion is evaluated (see below)
Weight: relative importance as a percentage of the task score
Description: what you're checking for (e.g., correctness, knowledge of X)
Semantic prompt (for semantic criteria): the instruction given to the LLM judge for what constitutes a correct response for this criterion
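Putting those fields together, a single criterion might look like the sketch below (a hypothetical representation that mirrors the fields above, not the exact Eval Builder format):

```python
# Hypothetical criterion -- mirrors the Name / Type / Weight / Description / Semantic prompt fields.
criterion = {
    "name": "Calculates total campaign cost",
    "type": "semantic",
    "weight": 40,  # percent of the task score
    "description": "Correctness of the final cost figure and the calculation behind it",
    "semantic_prompt": (
        "The response states a total campaign cost of 48,250.00 USD to two decimal "
        "places and shows how the per-channel subtotals combine to that total."
    ),
}
```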
If your task is exact-match or multiple-choice, a rubric is optional. If you do specify a rubric, make sure any exact figures are stated clearly in the Explicit Grading Criterion.
Grading types
Semantic: The LLM jury evaluates the model's response against the criterion using natural language understanding and reasoning. Use this for open-ended, reasoning-heavy answers where exact matches are not feasible and subjective "taste" is best codified with a rubric.
Lexical: Exact string or regex matching. Use this for answers that must contain specific values, keywords, or patterns.
Binary: Pass/fail on a single condition.
Ordinal: Ranked scoring (e.g., 1-5 scale).
Numeric: Comparison against a target number with optional tolerance.
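As a rough mental model, the non-semantic types behave like simple programmatic checks. The sketch below is illustrative only, not Portex's grading implementation:

```python
import re

def grade_lexical(response: str, pattern: str) -> bool:
    """Lexical: pass if the response contains an exact string or regex match."""
    return re.search(pattern, response) is not None

def grade_numeric(value: float, target: float, tolerance: float = 0.0) -> bool:
    """Numeric: pass if the value is within the tolerance of the target number."""
    return abs(value - target) <= tolerance

def grade_binary(condition_met: bool) -> bool:
    """Binary: pass/fail on a single condition."""
    return condition_met

response = "Total campaign cost: $48,250.00"
print(grade_lexical(response, r"48,?250\.00"))                  # True
print(grade_numeric(48251.00, target=48250.00, tolerance=5.0))  # True
```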
How many criteria?
Aim for 5-20 criteria per task. More granular criteria produce richer reports and make it easier to identify specific weaknesses in model responses. Each criterion should check for one distinct aspect of a correct answer.
Weighting
Assign weights that reflect each criterion's importance. Weights should sum to 100% across a task. If they don't, the Eval Builder offers an auto-normalize function to redistribute them.
For multi-step tasks, consider giving higher weight to the final answer and lower weight to intermediate calculations that a model might arrive at through a different (but valid) method.
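The auto-normalize step is just a proportional rescale. A sketch, assuming it rescales every weight by the same factor so the total becomes 100%:

```python
def normalize_weights(weights: list[float]) -> list[float]:
    """Rescale weights proportionally so they sum to 100."""
    total = sum(weights)
    return [w * 100 / total for w in weights]

# Weights of 50 / 30 / 40 sum to 120, so they become roughly 41.7 / 25.0 / 33.3.
print(normalize_weights([50, 30, 40]))
```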
Pass threshold
Set the pass threshold as the minimum weighted score for a task to be considered passed. A common starting point is 70%, but this depends on your domain. For tasks where partial credit is appropriate, a lower threshold works. For tasks that are all-or-nothing (e.g., the final answer must be correct), consider 90-100%.
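Conceptually, the pass decision is a weighted average of criterion scores compared against the threshold. An illustrative sketch (not Portex's scoring code):

```python
def task_passed(scores: list[float], weights: list[float], threshold: float) -> bool:
    """scores in [0, 1], weights in percent summing to 100, threshold in [0, 1]."""
    weighted_score = sum(s * w for s, w in zip(scores, weights)) / 100
    return weighted_score >= threshold

# Criteria scored 1.0, 0.5, 1.0 with weights 40/30/30 give a weighted score of 0.85.
print(task_passed([1.0, 0.5, 1.0], [40, 30, 30], threshold=0.70))  # True
```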
Pricing Considerations
You set two prices when publishing an eval:
Per-eval run: what a model builder pays each time they submit responses and receive a graded report
Core Dataset (optional): a one-time purchase price (and/or minimum bid) for the full bundle of tasks, answers, criteria, and reference files
Core Datasets are valuable to model builders for reinforcement learning and fine-tuning. Including answer keys and detailed rationales supports a higher price point.
Iteration
Evals are not static. After publishing, monitor the leaderboard. If frontier models are scoring above 90% consistently, consider adding harder tasks. You can edit your eval and version it at any time.