How Evals Work

A conceptual overview of Portex Evals and how they work.

What is a Portex Eval?

A Portex Eval is a private, expert-authored test that measures how well AI models perform on real-world, domain-specific work or frontier knowledge. Unlike public benchmarks (which are prone to data contamination), Portex Evals keep answer keys hidden from models and use rubric-driven grading to produce standardized scores. Tasks can be completed agentically (multi-turn, e.g. with Codex or Claude Code) or in Q&A style (single-turn, vanilla model reasoning without external tools).

Each eval consists of:

  • A set of tasks (prompts, optionally with reference files like PDFs, images, or CSVs)

  • A private answer key with a rubric that defines how each task is scored
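
To make the shape of that bundle concrete, here is a minimal sketch of one task and its private answer-key entry in Python. All field names (task_id, prompt, reference_files, golden_answer, rubric) are illustrative assumptions, not Portex's published schema.

    # Illustrative only: field names are assumptions, not Portex's published schema.
    task = {
        "task_id": "econ-001",
        "prompt": "Using the attached CSV of quarterly GDP figures, estimate ...",
        "reference_files": ["gdp_quarterly.csv"],  # optional PDFs, images, or CSVs
    }

    # Kept private: only the grading system ever sees this entry.
    answer_key_entry = {
        "task_id": "econ-001",
        "golden_answer": "Annualized real GDP growth of roughly 2.1%, because ...",
        "rubric": [],  # weighted grading criteria; see Grading Criteria below
    }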

Key Concepts

Task — A single prompt given to a model. Tasks are procedural: they ask the model to produce a work product (e.g., a legal memo, a financial calculation, a code review) rather than answer trivia.

Environment — A set of configurations for agentic tasks. Portex integrates with the Harbor framework for agentic evals.

Answer Key — A golden reference answer for each task, kept private by default. Only the grading system sees it. Accompanies the rubric.

Grading Criteria — A rubric of objective criteria, each with a weight (weights sum to 100%) and a scoring method. Supported types (an example rubric follows this list):

  • Semantic: An LLM jury evaluates whether the model's response satisfies the criterion

  • Lexical: Exact string or regex matching

  • Binary, ordinal, and numeric types for structured answers
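
For example, a rubric for a single task might look like the sketch below. The criterion fields and type names here are assumptions for illustration; the constraint that matters is the one stated above: each criterion carries a weight, and the weights sum to 100%.

    # Illustrative rubric; criterion fields and type names are assumptions.
    rubric = [
        {"id": "explains-tradeoff", "type": "semantic", "weight": 40,
         "description": "Response explains the key tradeoff and its implications."},
        {"id": "cites-source-exactly", "type": "lexical", "weight": 20,
         "pattern": r"Form 10-K"},             # exact string / regex match
        {"id": "final-figure", "type": "numeric", "weight": 30,
         "expected": 2.1, "tolerance": 0.05},  # structured numeric answer
        {"id": "states-assumptions", "type": "binary", "weight": 10},
    ]

    assert sum(c["weight"] for c in rubric) == 100  # weights must sum to 100%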

Pass Threshold — A per-task percentage. If the weighted score summed across all criteria meets or exceeds this threshold, the task is marked as passed.
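
As a sketch of the arithmetic this implies (not Portex's actual grading code): each criterion's score is scaled by its weight, the weighted scores are summed, and the total is compared against the task's pass threshold.

    # Sketch of the pass/fail arithmetic; not Portex's grader.
    def task_passed(criterion_scores, weights, pass_threshold):
        """criterion_scores are in [0, 1]; weights sum to 100; threshold is a percentage."""
        total = sum(s * w for s, w in zip(criterion_scores, weights))
        return total >= pass_threshold

    # Scores of 1.0, 0.5, 1.0 on criteria weighted 40/40/20 give 40 + 20 + 20 = 80,
    # which clears a 75% threshold.
    print(task_passed([1.0, 0.5, 1.0], [40, 40, 20], pass_threshold=75))  # True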

Core Dataset — The complete bundle (tasks, answers, reference files, expert notes, and Harbor-compliant file artifacts) that can be licensed separately. Buyers use Core Datasets for model improvement, including reinforcement learning.

LLM Jury — A private, configurable ensemble of language models that applies the rubric to grade model responses. Models under evaluation never see the answer key or the rubric.
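
How the jury aggregates its verdicts is configurable and private, so the sketch below shows only one plausible design: each juror model scores a criterion independently, and the jury score is the mean of those verdicts.

    # One plausible aggregation, assumed for illustration; the real jury configuration is private.
    from statistics import mean
    from typing import Callable, Sequence

    # A juror maps (criterion, response, golden_answer) to a score in [0, 1].
    Juror = Callable[[str, str, str], float]

    def jury_score(jurors: Sequence[Juror], criterion: str,
                   response: str, golden_answer: str) -> float:
        """Average the independent verdicts of the juror models for one criterion."""
        return mean(juror(criterion, response, golden_answer) for juror in jurors)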

O*NET-SOC Taxonomy — Evals can be tagged to occupations in the O*NET-SOC taxonomy (nearly 1,000 occupational titles aligned to the 2018 BLS Standard Occupational Classification). This enables benchmarking by job family and real-world task type. Experts assign an occupation tag when creating a listing; if left unselected, Portex assigns one.

Lifecycle

For experts

  1. Author tasks with answer keys and grading criteria

  2. Upload to the Datalab via the Eval Builder or file import

  3. Create an Eval Listing with per-run pricing (and optionally a Core Dataset price), or publish it as Open Source.

  4. Publish. Model builders can now discover the eval, download its tasks, and submit runs.

For model builders

  1. Browse evals on the Datalab. Filter by occupation, difficulty, modality, or model performance.

  2. Download the Task Bundle (tasks.json and reference files)

  3. Run a model locally to produce responses

  4. Upload model_responses.json and purchase a run (see the sketch after these steps)

  5. Receive a results report with per-task scores, grader notes, and summary statistics
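
For steps 2–4, here is a minimal sketch of turning a downloaded tasks.json into a model_responses.json ready for upload. The file schemas and the generate() placeholder are assumptions; substitute whatever model or agent harness you run locally.

    # Sketch only: file schemas and generate() are assumptions, not Portex's published format.
    import json

    def generate(prompt: str) -> str:
        """Placeholder for your local model or agent harness."""
        return "model response for: " + prompt

    with open("tasks.json") as f:
        tasks = json.load(f)  # assumed shape: a list of {"task_id": ..., "prompt": ...}

    responses = [{"task_id": t["task_id"], "response": generate(t["prompt"])} for t in tasks]

    with open("model_responses.json", "w") as f:
        json.dump(responses, f, indent=2)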

Example Eval

An example task at the high-undergraduate/low-graduate level in economics, shown for illustrative purposes.

Compatibility

Portex tasks are compatible with the Harbor framework for agentic evals.
