Eval Design Guide

Best practices for designing effective evals on Portex.

This guide covers how to design evals that produce meaningful, reproducible measurements of AI model capability. It is intended for domain experts authoring evals on the Portex Datalab.

Task Design

Real work, not trivia

Good Portex tasks ask models to produce real work products: longer-horizon, more complex deliverables grounded in realistic references and prompts. Avoid questions that can be answered via web search or brute-force memorization.

Think about what a competent professional in your field would be asked to do, then translate that into a prompt.

Be specific and self-contained

Each task prompt should contain everything the model needs to attempt a response (or point to an attached reference file). Avoid ambiguous phrasing. State the expected output format explicitly (e.g., "report values to two decimal places," "list all underlying calculations").
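
For instance, a hypothetical task from a marketing-analytics eval (illustrative only; the field names below are not Portex's actual schema) might pin down both the inputs and the expected output format:

```python
# Illustrative only: a hypothetical task with a specific, self-contained prompt.
# Field names are for illustration and do not reflect Portex's actual schema.
task = {
    "prompt": (
        "Using the attached Q3 ad-spend CSV, calculate the blended cost per "
        "acquisition (CPA) for each channel. Report each value to two decimal "
        "places and list all underlying calculations."
    ),
    "reference_file": "q3_ad_spend.csv",  # hypothetical attachment
}
```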

Use reference files when needed

You can attach one reference file per task. Accepted formats: PDF, images (JPG, PNG, WebP, GIF), CSV, TXT, JSON, Markdown, and HTML.

Reference files are useful for tasks that require analyzing a document, spreadsheet, or image. The model receives the file alongside the prompt.

Aim for quality over quantity

A few well-crafted tasks with deep rubrics that capture nuanced judgments of quality are better than 20+ shallow ones. That said, more tasks produce more robust scores, so add tasks as you iterate. Portex maintains a leaderboard for each eval showing how frontier models perform, so you can calibrate and adjust your rubrics over time.

Answer Keys (Reference Solution)

The answer key for each task is kept private. It is never shown to models or buyers (unless you license the Core Dataset or open-source it).

Answers do not need to be in a rigid format. They can be numerical, textual, or structured. The only requirement is that the answer key does not contain information that was not asked for in the task prompt.

Detailed rationales are optional but recommended. Including a rationale (how the answer was derived) helps the LLM jury grade more accurately, especially for complex multi-step tasks.
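
As an illustration (the field names are hypothetical, not Portex's schema), an answer key with a rationale for the CPA task sketched earlier might look like:

```python
# Illustrative only: a hypothetical answer key entry with an optional rationale.
answer_key = {
    "answer": {"search_cpa": 42.17, "social_cpa": 55.03},
    "rationale": (
        "Search CPA = $12,651 spend / 300 conversions = $42.17; "
        "Social CPA = $27,515 spend / 500 conversions = $55.03."
    ),
}
```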

Grading Criteria (Rubric)

Each task should have one or more grading criteria. Criteria define what the judge checks for when scoring a model's response.

Criteria fields

  • Name: a short label (e.g., "Calculates total campaign cost")

  • Type: how the criterion is evaluated (see below)

  • Weight: relative importance as a percentage of the task score

  • Description: what you're checking for (e.g., correctness, knowledge of X)

  • Semantic prompt (for semantic criteria): the instruction given to the LLM judge for what constitutes a correct response for this criterion
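
To make the fields concrete, here is a sketch of a single semantic criterion for the hypothetical CPA task above (field names mirror the list above, not Portex's actual schema):

```python
# Illustrative only: one semantic criterion, using the fields described above.
criterion = {
    "name": "Calculates blended CPA per channel",
    "type": "semantic",
    "weight": 40,  # percent of the task score
    "description": "Checks correctness of the per-channel CPA figures.",
    "semantic_prompt": (
        "The response reports a CPA for each channel that matches the answer "
        "key to within $0.01 and shows the underlying division."
    ),
}
```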

Grading types

  • Semantic: The LLM jury evaluates the model's response against the criterion using natural language understanding and reasoning. Use this for open-ended, reasoning-heavy answers where exact matches are not feasible and subjective "taste" is best codified with a rubric.

  • Lexical: Exact string or regex matching. Use this for answers that must contain specific values, keywords, or patterns.

  • Binary: Pass/fail on a single condition.

  • Ordinal: Ranked scoring (e.g., 1-5 scale).

  • Numeric: Comparison against a target number with optional tolerance.
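
As a rough mental model, the non-semantic types behave approximately like the checks below. This is an illustrative sketch, not Portex's grading implementation:

```python
import re

# Illustrative only: rough equivalents of the non-semantic grading types.

def lexical_pass(response: str, pattern: str) -> bool:
    # Lexical: exact string or regex match anywhere in the response.
    return re.search(pattern, response) is not None

def binary_pass(condition_met: bool) -> bool:
    # Binary: pass/fail on a single condition.
    return condition_met

def ordinal_score(rank: int, scale_max: int = 5) -> float:
    # Ordinal: ranked scoring on a fixed scale, normalized to 0-1.
    return rank / scale_max

def numeric_pass(value: float, target: float, tolerance: float = 0.0) -> bool:
    # Numeric: comparison against a target number with optional tolerance.
    return abs(value - target) <= tolerance
```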

How many criteria?

Aim for 5-20 criteria per task. More granular criteria produce richer reports and make it easier to identify specific weaknesses in model responses. Each criterion should check for one distinct aspect of a correct answer.

Weighting

Assign weights that reflect each criterion's importance. Weights should sum to 100% across a task. If they don't, the Eval Builder offers an auto-normalize function to redistribute them.
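
Conceptually, auto-normalization presumably rescales each weight in proportion to its share of the current total; a minimal sketch (the Eval Builder's exact behavior may differ):

```python
# Illustrative only: one plausible way weights could be normalized to sum to 100%.
def normalize_weights(weights: list[float]) -> list[float]:
    total = sum(weights)
    return [w * 100 / total for w in weights]

print(normalize_weights([30, 30, 30]))  # -> [33.33, 33.33, 33.33] (approximately)
```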

For multi-step tasks, consider giving higher weight to the final answer and lower weight to intermediate calculations that a model might arrive at through a different (but valid) method.

Pass threshold

Set the pass threshold as the minimum weighted score for a task to be considered passed. A common starting point is 70%, but this depends on your domain. For tasks where partial credit is appropriate, a lower threshold works. For tasks that are all-or-nothing (e.g., the final answer must be correct), consider 90-100%.
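
A worked example of how the weighted score compares against the threshold (illustrative only; the criterion names and scoring scale here are assumptions, not Portex's implementation):

```python
# Illustrative only: weighted task score vs. pass threshold.
def task_passes(scores: dict[str, float], weights: dict[str, float],
                threshold: float = 70.0) -> bool:
    # scores: criterion -> score in [0, 1]; weights: criterion -> percent of task score.
    weighted_score = sum(scores[name] * weights[name] for name in weights)
    return weighted_score >= threshold

scores = {"final_answer": 1.0, "intermediate_steps": 0.5}
weights = {"final_answer": 70, "intermediate_steps": 30}
print(task_passes(scores, weights))  # 70*1.0 + 30*0.5 = 85 >= 70 -> True
```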

Pricing Considerations

You set two prices when publishing an eval:

  • Per-eval run: what a model builder pays each time they submit responses and receive a graded report

  • Core Dataset (optional): a one-time purchase price (and/or minimum bid) for the full bundle of tasks, answers, criteria, and reference files

Core Datasets are valuable to model builders for reinforcement learning and fine-tuning. Including answer keys and detailed rationales supports a higher price point.

Iteration

Evals are not static. After publishing, monitor the leaderboard. If frontier models are scoring above 90% consistently, consider adding harder tasks. You can edit your eval and version it at any time.
