Creating an Eval

How to create an eval using the Eval Builder.

The Eval Builder is the primary way to create evals on Portex. It lets you write tasks, answers, and grading criteria directly in the platform without preparing files externally.

If you prefer working with JSON files in an editor or IDE, see Importing an Eval (Q&A evals only).

Choose an Eval Type

From the Data Studio, navigate to Evals > Eval Builder. You will see a type selection screen with two options:

  • Agentic, Multi-Turn: Task-driven evals with configurable tools and iterative execution. The agent runs in a sandboxed container and can execute code, use tools, and iterate across multiple turns. Built on the Harbor framework.

  • Q&A, Single-Turn: Prompt-driven evals focused on direct model reasoning. The model receives a prompt and produces a single response.

Select the type that matches your eval. See Eval Design Guide for guidance on when to use each type. The type applies to the entire eval and cannot be mixed.

Open the Eval Builder

After selecting a type, the Eval Builder opens. At the top, give your eval a name. The builder auto-saves as you work.


You can view an example eval by toggling "View Example" in the Eval Builder header. For agentic evals, the example shows a TerminalBench task with environment configuration and Exact Match criteria.

Write a Task

The left panel shows your task list. Click "+ Add Task" to create a new task.

In the right panel, the Task tab provides a markdown editor with LaTeX support. Write your task prompt here. Be specific about the expected output format and any constraints (see the Eval Design Guide for best practices).

Attach reference files

If your task requires the model to analyze a document, image, or dataset, click "Attach Reference File" below the editor.

  • Q&A evals: one reference file per task. Accepted formats: PDF, images (JPG, PNG, WebP, GIF), CSV, TXT, JSON, Markdown, HTML.

  • Agentic evals: multiple reference files per task. Files are mounted into the agent's container. The total size of all reference files must fit within the container's memory allocation. Agentic tasks accept most file extensions.

Write the Answer (Golden Reference Solution)

Switch to the Answer tab and enter the correct answer for the task. This field is required and serves as the golden reference solution. Think: what is the most important, salient output from this task?

The answer key is kept private by default. Models and buyers do not see it unless you sell the Core Dataset.

Define Grading Criteria (Rubric)

Switch to the Criterion tab. Here you define a rubric of criteria.


Click "+ Add Grading Criterion" to add a criterion. For each criterion, fill in:

  • Criterion Name (required): a short label describing what is being checked

  • Grader Type: choose between LLM-as-a-Judge (semantic evaluation by an LLM jury) or Exact Match (string comparison, agentic evals only)

  • Weight: the percentage weight for this criterion

  • Description (optional): context for what you are checking

  • Explicit Grading Criterion (required): the specific details the judge should check. For LLM-as-a-Judge, this is the prompt sent to the jury. For Exact Match, this is the expected string value. Include exact numbers and figures when relevant. Supports markdown and LaTeX.
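To make the Exact Match grader concrete, here is a minimal sketch of the comparison it performs. The function name and the whitespace-trimming behavior are assumptions for illustration; the platform's actual matching rules (case sensitivity, whitespace handling) are not specified here.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Hypothetical sketch of an Exact Match check: compare the
    expected string from the criterion against the agent's output.
    Trimming surrounding whitespace is an assumed normalization."""
    return expected.strip() == actual.strip()

# An output that differs only in surrounding whitespace still matches.
print(exact_match("42", " 42\n"))  # True
print(exact_match("42", "forty-two"))  # False
```

By contrast, an LLM-as-a-Judge criterion sends the Explicit Grading Criterion text to the jury as a prompt, so it tolerates semantically equivalent phrasings that a string comparison would reject.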

Weight distribution

The bottom of the Criterion tab shows a weight distribution bar and the total weight sum. Click "Auto-normalize" if your weights do not sum to 100%.
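Auto-normalize presumably rescales your weights proportionally so they sum to 100% while preserving their ratios. A sketch of that assumed behavior:

```python
def auto_normalize(weights: list[float]) -> list[float]:
    """Assumed behavior of the Auto-normalize button: scale each
    weight by 100 / total so the weights sum to 100 while keeping
    their relative proportions."""
    total = sum(weights)
    return [w * 100 / total for w in weights]

# Three criteria entered as 30, 30, 60 (sum 120) become 25, 25, 50.
print(auto_normalize([30, 30, 60]))  # [25.0, 25.0, 50.0]
```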

Pass threshold

Set the pass threshold as a percentage (e.g., 70%). A task is marked as passed if the weighted score across all criteria meets or exceeds this value.
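The pass decision can be sketched as a weighted sum. Assuming each criterion is scored 0–100 and weights are percentages summing to 100:

```python
def task_passed(scores: list[float], weights: list[float],
                threshold: float = 70.0) -> bool:
    """Sketch of the assumed pass rule: the weighted score is the
    sum of each criterion score (0-100) times its percentage weight;
    the task passes if that total meets or exceeds the threshold."""
    weighted_score = sum(s * w / 100 for s, w in zip(scores, weights))
    return weighted_score >= threshold

# Two criteria weighted 60% and 40%, scored 100 and 50:
# weighted score = 60 + 20 = 80, which clears a 70% threshold.
print(task_passed([100, 50], [60, 40]))  # True
```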

Configure the Environment (Agentic Only)

For agentic evals, the Environment tab appears in the task editor. This is where you define the sandboxed container that the agent will run in.

Environment presets

Start by selecting an environment preset that matches your domain. Presets pre-configure the base image, packages, resources, and timeouts for common use cases:

  • General Researcher: Web research, document parsing, light analysis

  • Finance Analyst: Tabular analytics, stats, spreadsheets

  • Legal Review: OCR, PDF extraction, legal document workflows

  • Chemistry: Cheminformatics with RDKit and Open Babel

  • Data Science: General ML and analysis

See Harbor: Environment Presets for full details on each preset.

Container configuration

You can configure the container in two modes:

Simple mode: Select a base image from the dropdown (e.g., python:3.11-slim, ubuntu:24.04, node:20-slim) and add system packages (via apt-get) and Python packages (via pip). You can pin package versions (e.g., pandas==2.2.3).

Dockerfile mode: Write a custom Dockerfile for full control over the build process. The Dockerfile must contain a FROM instruction and has a 10,000 character limit.
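As an illustration of Dockerfile mode, the fragment below satisfies the stated constraints (it contains a FROM instruction and is well under the 10,000 character limit). The base image and package choices are illustrative, not prescribed by the platform:

```dockerfile
# Illustrative Dockerfile for an agentic eval container.
# Base image and packages are example choices, not requirements.
FROM python:3.11-slim

# System packages via apt-get (equivalent to Simple mode's
# system-package list).
RUN apt-get update && apt-get install -y --no-install-recommends \
        git curl \
    && rm -rf /var/lib/apt/lists/*

# Python packages via pip, with pinned versions as in Simple mode.
RUN pip install --no-cache-dir pandas==2.2.3 numpy

WORKDIR /workspace
```

Dockerfile mode is worth the extra effort when you need build steps Simple mode cannot express, such as multi-step installs or custom environment variables.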

Resources

Set the compute resources for the container:

  • CPUs: Number of CPU cores

  • Memory: RAM in MB

  • Storage: Disk space in MB

  • GPUs: Number of GPUs (0 for CPU-only tasks)

Resource presets are available: Standard (1 CPU, 2 GB RAM), Boosted (2 CPU, 4 GB), Heavy (4 CPU, 8 GB).

Timeouts

Configure how long each phase of execution can run:

  • Agent timeout: How long the agent can execute (default: 900 seconds / 15 minutes)

  • Verifier timeout: How long the verifier can run to check output (default: 900 seconds)

  • Build timeout: How long the container build can take (default: 600 seconds / 10 minutes)

Task metadata

For agentic evals, you can also set task metadata (helpful for open sourcing and for contributing to benchmarks):

  • Difficulty: easy, medium, or hard

  • Category: a free-text label (e.g., "software-engineering")

  • Tags: multiple tags for organization

  • Time estimates: expected completion time for an expert and a junior practitioner, in minutes

Preview

The Preview tab renders the task as it will appear to model builders, including formatted markdown, reference files, tools, and (for agentic evals) environment configuration.

Create the Eval Dataset

Once you have written all your tasks with answers and criteria, click "Create Eval Dataset" in the top right. This packages your work into an eval dataset on the Datalab.

From here, you can publish a listing to make the eval available to model builders.
