# How Evals Work

## What is a Portex Eval?

A Portex Eval is a private, expert-authored test that measures how well AI models perform on real-world, domain-specific work or frontier knowledge. Unlike public benchmarks (which are prone to data contamination), Portex Evals keep answer keys hidden from models and use rubric-driven grading to produce standardized scores. Tasks can be completed *agentically* (multi-turn, e.g. with Codex or Claude Code) or in *Q\&A style* (single-turn, vanilla model reasoning without external tools).

Each eval consists of:

* A set of tasks (prompts, optionally with reference files like PDFs, images, or CSVs)
* A private answer key with a rubric that defines how each task is scored

## Eval Types

Portex supports two types of evals, each suited to different kinds of model capability measurement.

### Q\&A (Single-Turn)

Q\&A evals are prompt-driven. The model receives a task prompt (and optionally a reference file), produces a single response, and that response is graded. This is the standard eval format for measuring knowledge, reasoning, and analysis.

Q\&A evals are well-suited to tasks like:

* Answering domain-specific questions without tools, coding, or web search

### Agentic (Multi-Turn)

Agentic evals test a model's ability to operate autonomously over multiple turns in a sandboxed environment. Instead of producing a single response, the model acts as an agent: it can execute code, read and write files, use tools, and iterate toward a solution.

Each agentic task runs inside a container with a configurable environment (base image, installed packages, resource limits, and timeouts). The eval author defines what tools are available and what the environment looks like. The agent interacts with this environment across multiple turns until it produces a final output.

Agentic evals are well-suited to tasks like:

* Writing and executing code to solve a problem
* Performing multi-step research or data analysis
* Working with files, APIs, or command-line tools
* Tasks that require planning, error recovery, and iteration

Agentic evals on Portex are built on the [Harbor](https://harborframework.com/) framework, which provides the sandboxed execution environment and agent orchestration.

## Key Concepts

**Task** — A single prompt given to a model. Tasks are procedural: they ask the model to produce a work product (e.g., a legal memo, a financial calculation, a code review) rather than answer trivia. In Q\&A evals, tasks are self-contained prompts. In agentic evals, tasks also include an environment specification and available tools.

**Environment** — The container configuration for an agentic task: base image, installed packages, resource limits, timeouts, and available tools. Portex integrates with the [Harbor](https://harborframework.com/) framework for agentic evals.

**Answer Key** — A golden reference answer for each task, accompanied by the rubric and kept private by default. Only the grading system sees it.

**Grading Criteria** — A rubric of objective criteria, each with a weight (summing to 100%) and a scoring method. Supported types:

* Semantic (LLM-as-a-Judge): An LLM jury evaluates whether the model's response satisfies the criterion
* Exact Match: Exact string matching against an expected value
* Lexical: Exact string or regex matching
* Binary, ordinal, and numeric types for structured answers
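As a rough illustration of how the deterministic criterion types differ, the sketch below contrasts exact matching with regex-based lexical matching. The function names and the whitespace trimming are assumptions made for this example; they are not taken from the Portex grader.

```python
import re


def exact_match(response: str, expected: str) -> bool:
    """Hypothetical exact-match check: equal after trimming outer whitespace."""
    return response.strip() == expected.strip()


def lexical_match(response: str, pattern: str) -> bool:
    """Hypothetical lexical check: the regex occurs anywhere in the response."""
    return re.search(pattern, response) is not None


print(exact_match(" 42 ", "42"))                               # True
print(exact_match("42.0", "42"))                               # False
print(lexical_match("The ratio equals 1.05.", r"\b1\.05\b"))   # True
```

Exact matching suits short, canonical answers; a regex tolerates surrounding prose but still pins down the value being checked.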

**Pass Threshold** — A per-task percentage. If the weighted score across all criteria meets or exceeds this threshold, the task is marked as passed.
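The weighted score and pass check can be sketched as follows. This is a minimal illustration, assuming each criterion yields a score between 0.0 and 1.0; `CriterionResult`, `task_score`, and `task_passed` are hypothetical names, and the actual Portex grader may aggregate differently.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    weight: float  # percentage points; weights for a task sum to 100
    score: float   # assumed fraction satisfied, from 0.0 to 1.0


def task_score(results: list[CriterionResult]) -> float:
    """Weighted score for one task, on a 0-100 scale."""
    total_weight = sum(r.weight for r in results)
    if abs(total_weight - 100) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(r.weight * r.score for r in results)


def task_passed(results: list[CriterionResult], pass_threshold: float) -> bool:
    """A task passes when its weighted score meets or exceeds the threshold."""
    return task_score(results) >= pass_threshold


results = [
    CriterionResult("States the optimization problem", 50, 1.0),
    CriterionResult("Derives the Euler equation", 30, 0.5),
    CriterionResult("Names the concept", 20, 0.0),
]
print(task_score(results))       # 65.0
print(task_passed(results, 70))  # False
```

With a threshold of 70, this task fails even though the most heavily weighted criterion is fully satisfied, which is why weights and thresholds are best chosen together.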

**Core Dataset** — The complete bundle (tasks, answers, reference files, expert notes, and Harbor-compliant file artifacts), which can be licensed separately. Buyers use Core Datasets for model improvement, including reinforcement learning.

**LLM Jury** — A private, configurable ensemble of language models that applies the rubric to grade model responses. Models under evaluation never see the answer key or rubric.

**O\*NET-SOC Taxonomy** — Evals can be tagged to occupations in the BLS O\*NET-SOC system (nearly 1,000 occupational titles aligned to the 2018 SOC). This enables benchmarking by job family and real-world task type. Experts assign an occupation tag when creating a listing; if left unselected, Portex assigns one.

## Lifecycle

### For experts

1. Choose an eval type (Q\&A or Agentic)
2. Author tasks with answer keys and grading criteria
3. For agentic evals, configure the execution environment (packages, resources, timeouts)
4. Upload to the Datalab via the Eval Builder or file import
5. Choose a publishing model (commercial or [open source](https://github.com/portex-ai/portex-documentation/blob/master/for-experts/open-sourcing-an-eval.md))
6. Create and publish a listing. Model builders can then discover the eval, download its tasks, and submit runs.

### For model builders

1. Browse evals on the Datalab. Filter by occupation, difficulty, modality, or model performance.
2. Download the Task Bundle (`tasks.json` and reference files)
3. Run a model locally to produce responses (for agentic evals, this means running an agent against the task environment via Harbor)
4. Upload `model_responses.json` and purchase a run
5. Receive a results report with per-task scores, grader notes, and summary statistics
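For a Q\&A eval, step 3 might look like the sketch below. The input fields (`prompts`, `task_id`, `task_prompt`) follow the example task file on this page; the output shape, a list of `task_id`/`response` pairs, is an assumption, since the `model_responses.json` schema is not specified here. `build_responses` and `stub_model` are hypothetical names.

```python
import json
from typing import Callable


def build_responses(bundle: dict, run_model: Callable[[str], str]) -> list[dict]:
    """Run the model once per task and collect responses keyed by task_id."""
    return [
        {"task_id": task["task_id"], "response": run_model(task["task_prompt"])}
        for task in bundle["prompts"]
    ]


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; returns a fixed string."""
    return "Consumption smoothing under the life-cycle model."


# Stand-in for a parsed tasks.json, using the fields from the example task file.
bundle = {
    "version": 1,
    "prompts": [
        {
            "task_id": "HfQM4Y",
            "task_prompt": "Explain consumption smoothing.",
            "reference_file": "",
        }
    ],
}

responses = build_responses(bundle, stub_model)
# In a real run this text would be saved as model_responses.json.
print(json.dumps(responses, indent=2))
```

Keeping the model call behind a plain function makes it easy to swap the stub for a real client without touching the file handling.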

## Example Eval

The following is an illustrative Q\&A task in economics at the advanced undergraduate or early graduate level.

{% code overflow="wrap" expandable="true" %}

```json
// Example tasks.json in Behavioral Economics
{
  "version": 1,
  "prompts": [
    {
      "task_id": "HfQM4Y",
      "task_prompt": "Describe the problem of retirement savings under classic standard economic theory. First, state the problem clearly as a utility optimization problem for a simplified two-period consumption model with a gross interest rate $1+R$, total wealth $W$, and a discount factor $\\\\delta$.\\n\\nExplain each parameter. Explain the conditions under which a rational agent decides to balance consumption today (period 1) versus consumption tomorrow (period 2). Assume a log utility function and obtain the Euler equation. How do the parameters $1+R$ and $\\\\delta$ impact the decision to consume today vs. tomorrow?\\n\\nExtend the model to an arbitrary number of periods $T$. Does the same intuition hold? If we still assume log utility, what is the ratio of consumption in the next period $t+1$ to consumption in the current period $t$? What is the name of this consumption model/economic concept under the standard theory?\\n\\nFinally, what happens if we introduce a $\\\\beta$ parameter to obtain beta-delta discounting? What are the Euler conditions in this beta-delta model?",
      "reference_file": ""
    }
  ]
}
```

{% endcode %}

{% code overflow="wrap" expandable="true" %}

```json
// Example answers.json in Behavioral Economics
[
  {
    "task_id": "HfQM4Y",
    "answer": "The standard theory describes the concept of consumption smoothing: rational agents do not suddenly drop consumption once they retire and instead would like to balance consumption over time rather than favoring consumption today versus consumption tomorrow.",
    "reference_file": "",
    "tools": [],
    "criteria": [
      {
        "id": "459ae047-6ec7-491b-8f9c-2cc9dccf4ace",
        "name": "Defines consumption model",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response includes a formula for lifetime utility $U$ depending on how much the individual consumes today and how much he/she consumes in the future. For example, for a simple two-period model where $t = \\\\{1, 2\\\\}$\\nthe individual chooses consumption in period 1, $c_1$, and consumption in period 2, $c_2$, this is the formula: $$U = u(c_1) + \\\\delta u(c_2)$$"
      },
      {
        "id": "bfbd849b-a137-462b-9270-c91ab9900ef9",
        "name": "Explanation of parameter $\\\\delta$",
        "type": "semantic",
        "description": "",
        "weight": 5,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "States that the parameter $\\\\delta$ captures the weight that the individual places on the future relative to today."
      },
      {
        "id": "75f5d6df-f0d6-4761-98dd-a02983b9e841",
        "name": "Optimization problem",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response states the consumption optimization problem as:\\n \\n$$\\\\max_{c_1, c_2} \\\\ u(c_1) + \\\\delta u(c_2)$$\\n$$s.t. \\\\ c_1 + \\\\frac{c_2}{1 + R} = W$$\\n\\nWhere $R$ is the interest rate, $W$ is total wealth."
      },
      {
        "id": "180b6d3c-022c-4b9d-8f91-882cb9a52fad",
        "name": "First order conditions/Euler equation",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "From the first order conditions (taking derivatives with respect to each consumption period), response derives the Euler equation as:\\n\\n$$u'(c_1) = \\\\delta(1 + R)u'(c_2)$$."
      },
      {
        "id": "91a407e0-c920-41b8-85c5-4523252ce69c",
        "name": "Explanation of Euler equation",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response explains both sides of the Euler equation as the marginal utility of consuming one unit of $c$ today equal to the marginal benefit of saving one unit today.\\n\\nAn accepted response might also say that for each unit the agent saves today, he gets to consume $1 + R$ units tomorrow, each giving him $u'(c)$ extra units of utility (derivative). In other words, the individual must be indifferent between consuming one more unit today and saving that unit and consuming it in the future."
      },
      {
        "id": "78d21a06-74a7-4608-b985-a4fd782186f8",
        "name": "Defines consumption smoothing over T periods",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response defines the problem over $$T$$ periods and solves the following optimization problem, noting the final ratio of future consumption to current consumption as indicated below:\\n\\n$$\\\\max_{c_1, ..., c_T} \\\\ u(c_1) + \\\\delta u(c_2) + ... + \\\\delta^{T-1} u(c_T)$$\\n\\n$$s.t. \\\\ c_1 + \\\\frac{c_2}{1 + R} + ... + \\\\frac{c_T}{(1 + R)^{T-1}} = W$$\\n\\nfor all $t$ and $t+1$, the Euler equation can be written as:\\n\\n$$u'(c_t) = \\\\delta(1 + R)u'(c_{t+1})$$\\n\\nand if we assume log utility, we obtain:\\n\\n$$\\\\frac{c_{t+1}}{c_t} = \\\\delta(1 + R)$$"
      },
      {
        "id": "b708ae8e-7843-4b9b-86b0-f806f0ffdff5",
        "name": "Intuition of consumption smoothing over T periods",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Answer concludes that the intuition is the very same for any number of $$T$$ periods."
      },
      {
        "id": "36802c9d-1d79-4e78-9be2-0df0216862b9",
        "name": "Euler conditions in beta-delta model for two periods:  $$\\\\beta <1$$",
        "type": "semantic",
        "description": "",
        "weight": 5,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response states that individuals place more weight on the present if $\\\\beta < 1$ and they consume relatively less in period 2 and more in period 1 (future vs. present)."
      },
      {
        "id": "5f104be0-aefd-41b9-8244-a21c5f3bebaa",
        "name": "Euler conditions in beta-delta model for two periods: two periods in the future",
        "type": "semantic",
        "description": "",
        "weight": 5,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response states that when trading off consumption between two *different* points in the future with the beta-delta model, the present-bias parameter does not play a role, so the individual appears to be more patient when trading off consumption and savings in future periods"
      },
      {
        "id": "8128d7bd-d045-438c-860b-67f77ae40d21",
        "name": "Self-control problem in beta-delta model",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response states that the introduction of the beta discounting gives rise to a self-control problem whereby individuals plan to save more tomorrow, but when tomorrow arrives, they succumb to the temptation to consume more than they planned to."
      },
      {
        "id": "dbfc5303-6ac8-4d48-87a8-fff1f1b3db21",
        "name": "Explanation of consumption smoothing wrt $R$ and $\\\\delta$",
        "type": "semantic",
        "description": "",
        "weight": 5,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Assuming we have a log utility function: $u(c) = \\\\log(c)$ the Euler equation can be written as:\\n\\n$$\\\\frac{c_2}{c_1} = \\\\delta(1 + R)$$\\n\\nThe higher $R$ and the higher $\\\\delta$, the more the individual will consume tomorrow relative to today."
      },
      {
        "id": "64344fa1-0264-42cf-ae0c-ceaa0ee4c71f",
        "name": "Names Consumption Smoothing",
        "type": "semantic",
        "description": "",
        "weight": 10,
        "rationale": "",
        "examples": [],
        "semanticPrompt": "Response calls this the Life Cycle theory or Consumption Smoothing under the standard model."
      }
    ],
    "passThreshold": 70
  }
]
```

{% endcode %}
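Before uploading an answer key like the one above, an author might run a quick consistency check. The sketch below is a hypothetical helper, not a Portex tool: it uses only the field names visible in the example (`criteria`, `weight`, `type`, `semanticPrompt`, `passThreshold`) and assumes thresholds are percentages between 0 and 100. The `//` comment line in the example would need to be removed before parsing the file with a strict JSON parser.

```python
def validate_answers(answers: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for entry in answers:
        task_id = entry.get("task_id", "<missing task_id>")
        criteria = entry.get("criteria", [])

        # Criterion weights are percentages and should sum to 100.
        total = sum(c.get("weight", 0) for c in criteria)
        if total != 100:
            problems.append(f"{task_id}: weights sum to {total}, expected 100")

        # Assumed valid range for the per-task pass threshold.
        threshold = entry.get("passThreshold")
        if threshold is None or not 0 <= threshold <= 100:
            problems.append(f"{task_id}: passThreshold {threshold!r} is outside 0-100")

        # A semantic criterion needs a prompt for the jury to apply.
        for c in criteria:
            if c.get("type") == "semantic" and not c.get("semanticPrompt"):
                problems.append(f"{task_id}: criterion {c.get('name')!r} has no semanticPrompt")
    return problems


# Weights copied from the example answer key, with placeholder prompts.
example_weights = [10, 5, 10, 10, 10, 10, 10, 5, 5, 10, 5, 10]
answers = [
    {
        "task_id": "HfQM4Y",
        "criteria": [
            {"name": f"criterion {i}", "type": "semantic", "weight": w, "semanticPrompt": "placeholder"}
            for i, w in enumerate(example_weights)
        ],
        "passThreshold": 70,
    }
]
print(validate_answers(answers))  # []
```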

## Compatibility

Portex agentic tasks are compatible with the [Harbor](https://harborframework.com/) framework.
