Eval Design Guide

Best practices for designing effective evals on Portex.

This guide covers how to design evals that produce meaningful, reproducible measurements of AI model capability. It is intended for domain experts authoring evals on the Portex Datalab.

Choosing an Eval Type

Before writing tasks, decide whether your eval should be Q&A (single-turn) or Agentic (multi-turn). The right choice depends on what you are measuring.

When to use Q&A

Choose Q&A when the task can be answered in a single response without executing code or interacting with an environment. Q&A evals test knowledge, reasoning, and the ability to produce written work products.

Examples: solving a difficult mathematics problem, or interpreting a single image.

When to use Agentic

Choose Agentic when the task requires the model to act, not just answer. Agentic evals test a model's ability to plan, execute code, use tools, handle errors, and iterate toward a solution in a sandboxed environment.

Examples: writing and running code to process a dataset, performing multi-step research using web tools, extracting information from a complex document using OCR and scripting, debugging a failing program.

Agentic evals run on the Harbor framework. Each task gets its own container with the packages, tools, and resources you specify.

Can I mix types?

No. Each eval is either Q&A or Agentic. If you need both types of tasks, create separate evals.

Task Design

Real work, not trivia

Good Portex tasks ask models to produce real work products: tasks with longer time horizons, greater complexity, and realistic references and prompts. Avoid questions that can be answered via web search or brute-force memorization.

Think about what a competent professional in your field would be asked to do, then translate that into a prompt.

Be specific and self-contained

Each task prompt should contain everything the model needs to attempt a response (or point to an attached reference file). Avoid ambiguous phrasing. State the expected output format explicitly (e.g., "report to two decimal places," "list all underlying calculations").

Use reference files when needed

You can attach reference files to provide documents, images, or datasets that the task requires. Accepted formats: PDF, images (JPG, PNG, WebP, GIF), CSV, TXT, JSON, Markdown, HTML.

For Q&A evals, you can attach one reference file per task. For agentic evals, you can attach multiple reference files. These files are mounted into the agent's container and accessible at known paths.

For agentic evals, the total size of all reference files for a task must fit within the container's memory allocation. Keep this in mind when choosing resource presets.
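Because reference files are mounted into the agent's container at known paths, an agent can read them directly. A minimal sketch: the mount path and file contents below are purely illustrative (check your task's environment details for the real path), and the sketch writes a stand-in CSV first so it runs anywhere.

```python
import csv
import os
import tempfile

# In the real container the reference file is already mounted at a known
# path; here we create a stand-in file so the sketch is self-contained.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "transactions.csv")  # illustrative path
with open(data_path, "w", newline="") as f:
    csv.writer(f).writerows([["id", "amount"], ["1", "19.99"], ["2", "5.00"]])

# Agent-side logic: load the mounted dataset and compute a verifiable value.
with open(data_path, newline="") as f:
    rows = list(csv.DictReader(f))
total = sum(float(r["amount"]) for r in rows)
print(f"{len(rows)} rows, total = {total:.2f}")
```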

Aim for quality over quantity

Starting with 5-10 well-crafted tasks with deep rubrics is better than 50 shallow ones. That said, more tasks produce more robust scores, so add tasks as you iterate. Portex maintains leaderboards for each eval showing how frontier models perform, so you can calibrate and adjust your rubrics over time.

Designing Agentic Tasks

Agentic tasks require additional design considerations beyond the prompt and answer.

Define the environment

Each agentic task runs in a container. You control:

  • Base image: The Docker image the container starts from (e.g., python:3.11-slim, ubuntu:24.04)

  • System packages: OS-level packages installed via apt-get

  • Python packages: Pip packages available to the agent

  • Resources: CPU, memory, storage, and GPU allocations

  • Timeouts: How long the agent can run before being stopped

Use environment presets as a starting point for your domain, then customize as needed.
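Put together, a task environment might look like the following sketch. The field names are assumptions chosen to mirror the list above, not the Eval Builder's actual schema; you configure these values in the UI.

```python
# Illustrative agentic-task environment -- field names are assumptions,
# not Portex's actual schema.
environment = {
    "base_image": "python:3.11-slim",        # Docker image the container starts from
    "system_packages": ["poppler-utils"],    # OS-level packages installed via apt-get
    "python_packages": ["pandas", "numpy"],  # pip packages available to the agent
    "resources": {"cpu": 2, "memory_gb": 4, "storage_gb": 10, "gpu": None},
    "timeouts": {"agent_seconds": 900},      # default agent timeout: 15 minutes
}
```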

Specify tools

List the tools available to the agent (e.g., python3, bash). The tool list tells the agent what it can use to solve the task. Keep the list focused: only include tools relevant to the task.

Write verifiable tasks

Agentic tasks work best when they have a clearly verifiable output. Good patterns include:

  • Produce an exact numeric or string value

  • Generate an answer that can be checked programmatically

This lets you use Exact Match criteria alongside LLM-as-a-Judge criteria for more reliable grading.

Grading currently works only with written text outputs; deliverables such as Excel files or PDFs are not yet accepted.
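An exact-match check is straightforward to sketch. The normalization below (trimming whitespace, lowercasing) is an assumption for illustration; Portex's grader may compare strings differently.

```python
def exact_match(response: str, expected: str) -> bool:
    """Compare a model's response against an expected value.

    Trims whitespace and lowercases both sides; a real grader may
    apply different (or no) normalization.
    """
    return response.strip().lower() == expected.strip().lower()

# A computed hash as the verifiable output of an agentic task:
print(exact_match("  3f7A9c  ", "3f7a9c"))
```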

Set appropriate timeouts

Consider how long a competent agent should take. The default agent timeout is 15 minutes (900 seconds). For complex tasks that involve installing packages, running computations, or making network requests, you may need to increase this. For simpler tasks, a shorter timeout prevents wasted compute.

Include metadata

Agentic tasks support metadata fields that help with organization and analysis:

  • Difficulty: easy, medium, or hard

  • Category: a free-text label (e.g., "software-engineering", "data-analysis")

  • Tags: multiple tags for filtering and grouping

  • Time estimates: how long a domain expert and a junior practitioner would take, in minutes

Answer Keys

The answer key for each task is kept private. It is never shown to models or buyers unless you license the Core Dataset or open-source the eval.

Answers do not need to be in a rigid format. They can be numerical, textual, or structured. The only requirement is that the answer key does not contain information that was not asked for in the task prompt.

Detailed rationales are optional but recommended. Including a rationale (how the answer was derived) helps the LLM jury grade more accurately, especially for complex multi-step tasks.

Grading Criteria

Each task should have one or more grading criteria. Criteria define what the judge checks for when scoring a model's response.

Criteria fields

  • Name: a short label (e.g., "Calculates total campaign cost")

  • Type: how the criterion is evaluated (see below)

  • Weight: relative importance as a percentage of the task score

  • Description: what constitutes a correct response for this criterion

  • Semantic prompt (for LLM-as-a-Judge criteria): the instruction given to the LLM jury
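A single criterion might be expressed as follows. The field names mirror the list above but are illustrative, not Portex's actual schema, and the cost figure is invented for the example.

```python
# Illustrative criterion definition -- field names and values are
# assumptions, not Portex's actual schema.
criterion = {
    "name": "Calculates total campaign cost",
    "type": "llm_judge",   # or "exact_match" (agentic evals only)
    "weight": 25,          # percent of the task score
    "description": "Response reports the total cost to two decimal places.",
    "semantic_prompt": (
        "Check that the response states a total campaign cost of "
        "$12,480.00 and shows the underlying calculation."
    ),
}
```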

Criteria types

LLM-as-a-Judge (semantic): The LLM jury evaluates the model's response against the criterion using natural language understanding. Use this for open-ended, reasoning-heavy answers where exact matches are not feasible. Available for both Q&A and agentic evals.

Exact Match: The model's response must exactly match a specified string. Use this for answers that must be a specific value. Available for agentic evals only.

For imported eval JSON, Portex also supports lexical, binary, ordinal, numeric, and regex-style criteria when you need more structured checks.

Agentic evals often combine both types: Exact Match for verifiable outputs (like a computed hash or numeric result) and LLM-as-a-Judge for qualitative aspects (like code quality or reasoning approach).

How many criteria?

Aim for 5-20 criteria per task. More granular criteria produce richer reports and make it easier to identify specific weaknesses in model responses. Each criterion should check for one distinct aspect of a correct answer.

Weighting

Assign weights that reflect each criterion's importance. Weights should sum to 100% across a task. If they don't, the Eval Builder offers an auto-normalize function to redistribute them.

For multi-step tasks, consider giving higher weight to the final answer and lower weight to intermediate calculations that a model might arrive at through a different (but valid) method.
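Assuming auto-normalize redistributes weights proportionally (an assumption about the Eval Builder's behavior, not confirmed), it can be sketched as:

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale criterion weights proportionally so they sum to 100."""
    total = sum(weights.values())
    return {name: round(100 * w / total, 2) for name, w in weights.items()}

# Example: criteria whose weights sum to 80, not 100.
criteria = {"final_answer": 50, "method": 20, "intermediate": 10}
print(normalize_weights(criteria))
```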

Pass threshold

Set the pass threshold as the minimum weighted score for a task to be considered passed. A common starting point is 70%, but this depends on your domain. For tasks where partial credit is appropriate, a lower threshold works. For tasks that are all-or-nothing (e.g., the final answer must be correct), consider 90-100%.
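The pass decision reduces to comparing a weighted sum against the threshold. A minimal sketch with invented criteria and scores:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    # scores: per-criterion score in [0, 1]; weights: percentages summing to 100
    return sum(scores[name] * weights[name] for name in weights)

weights = {"final_answer": 60, "method": 40}
scores = {"final_answer": 1.0, "method": 0.5}

total = weighted_score(scores, weights)
print(total, total >= 70)  # passes a 70% threshold, fails a 90% one
```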

Pricing Considerations

You set two prices when publishing an eval:

  • Per-eval run: what a model builder pays each time they submit responses and receive a graded report

  • Core Dataset (optional): a one-time purchase price (and/or minimum bid) for the full bundle of tasks, answers, criteria, and reference files

Core Datasets are valuable to model builders for reinforcement learning and fine-tuning. Including answer keys and detailed rationales supports a higher price point.

Alternatively, you can open-source your eval at no cost to maximize community adoption and public benchmarking.

Iteration

Evals are not static. After publishing, monitor the leaderboard. If frontier models are scoring above 90% consistently, consider adding harder tasks. You can edit your eval and version it at any time.
