How Evals Work

A conceptual overview of Portex Evals and how they work.

What is a Portex Eval?

A Portex Eval is a private, expert-authored test that measures how well AI models perform on real-world, domain-specific work or frontier knowledge. Unlike public benchmarks (which are prone to data contamination), Portex Evals keep answer keys hidden from models and use rubric-driven grading to produce standardized scores. Tasks can be completed agentically (multi-turn, e.g. with Codex or Claude Code) or in Q&A style (single-turn, vanilla model reasoning without external tools).

Each eval consists of:

  • A set of tasks (prompts, optionally with reference files like PDFs, images, or CSVs)

  • A private answer key with a rubric that defines how each task is scored
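
To make the shape of that bundle concrete, here is a minimal sketch of one task and its private answer-key entry in Python. All field names (task_id, prompt, reference_files, golden_answer, rubric) are illustrative assumptions, not Portex's published schema.

    # Illustrative only: field names are assumptions, not Portex's published schema.
    task = {
        "task_id": "econ-001",
        "prompt": "Using the attached CSV of quarterly GDP figures, estimate ...",
        "reference_files": ["gdp_quarterly.csv"],  # optional PDFs, images, or CSVs
    }

    # Kept private: only the grading system ever sees this entry.
    answer_key_entry = {
        "task_id": "econ-001",
        "golden_answer": "Annualized real GDP growth of roughly 2.1%, because ...",
        "rubric": [],  # weighted grading criteria; see Grading Criteria below
    }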

Key Concepts

Task — A single prompt given to a model. Tasks are procedural: they ask the model to produce a work product (e.g., a legal memo, a financial calculation, a code review) rather than answer trivia.

Environment — A set of configurations for agentic tasks. Portex integrates with the Harbor framework for agentic evals.

Answer Key — A golden reference answer for each task, kept private by default. Only the grading system sees it. Accompanies the rubric.

Grading Criteria — A rubric of objective criteria, each with a weight (weights sum to 100%) and a scoring method. Supported types (an example rubric follows this list):

  • Semantic: An LLM jury evaluates whether the model's response satisfies the criterion

  • Lexical: Exact string or regex matching

  • Binary, ordinal, and numeric types for structured answers
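
For example, a rubric for a single task might look like the sketch below. The criterion fields and type names here are assumptions for illustration; the constraint that matters is the one stated above: each criterion carries a weight, and the weights sum to 100%.

    # Illustrative rubric; criterion fields and type names are assumptions.
    rubric = [
        {"id": "explains-tradeoff", "type": "semantic", "weight": 40,
         "description": "Response explains the key tradeoff and its implications."},
        {"id": "cites-source-exactly", "type": "lexical", "weight": 20,
         "pattern": r"Form 10-K"},             # exact string / regex match
        {"id": "final-figure", "type": "numeric", "weight": 30,
         "expected": 2.1, "tolerance": 0.05},  # structured numeric answer
        {"id": "states-assumptions", "type": "binary", "weight": 10},
    ]

    assert sum(c["weight"] for c in rubric) == 100  # weights must sum to 100%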

Pass Threshold — A per-task percentage. If the weighted score summed across all criteria meets or exceeds this threshold, the task is marked as passed.
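
As a sketch of the arithmetic this implies (not Portex's actual grading code): each criterion's score is scaled by its weight, the weighted scores are summed, and the total is compared against the task's pass threshold.

    # Sketch of the pass/fail arithmetic; not Portex's grader.
    def task_passed(criterion_scores, weights, pass_threshold):
        """criterion_scores are in [0, 1]; weights sum to 100; threshold is a percentage."""
        total = sum(s * w for s, w in zip(criterion_scores, weights))
        return total >= pass_threshold

    # Scores of 1.0, 0.5, 1.0 on criteria weighted 40/40/20 give 40 + 20 + 20 = 80,
    # which clears a 75% threshold.
    print(task_passed([1.0, 0.5, 1.0], [40, 40, 20], pass_threshold=75))  # True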

Core Dataset — The complete bundle (tasks, answers, reference files, expert notes, and Harbor-compliant file artifacts) that can be licensed separately. Buyers use Core Datasets for model improvement, including reinforcement learning.

LLM Jury — A private, configurable ensemble of language models that applies the rubric to grade model responses. Models under evaluation never see the answer key or the rubric.
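
How the jury aggregates its verdicts is configurable and private, so the sketch below shows only one plausible design: each juror model scores a criterion independently, and the jury score is the mean of those verdicts.

    # One plausible aggregation, assumed for illustration; the real jury configuration is private.
    from statistics import mean
    from typing import Callable, Sequence

    # A juror maps (criterion, response, golden_answer) to a score in [0, 1].
    Juror = Callable[[str, str, str], float]

    def jury_score(jurors: Sequence[Juror], criterion: str,
                   response: str, golden_answer: str) -> float:
        """Average the independent verdicts of the juror models for one criterion."""
        return mean(juror(criterion, response, golden_answer) for juror in jurors)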

O*NET-SOC Taxonomy — Evals can be tagged to occupations in the O*NET-SOC taxonomy (nearly 1,000 occupational titles aligned to the 2018 BLS Standard Occupational Classification). This enables benchmarking by job family and real-world task type. Experts assign an occupation tag when creating a listing; if left unselected, Portex assigns one.

Lifecycle

For experts

  1. Author tasks with answer keys and grading criteria

  2. Upload to the Datalab via the Eval Builder or file import

  3. Create an Eval Listing with per-run pricing (and optionally a Core Dataset price), or publish it as Open Source.

  4. Publish. Model builders can now discover the eval, download its tasks, and submit runs.

For model builders

  1. Browse evals on the Datalab. Filter by occupation, difficulty, modality, or model performance.

  2. Download the Task Bundle (tasks.json and reference files)

  3. Run a model locally to produce responses

  4. Upload model_responses.json and purchase a run (see the sketch after these steps)

  5. Receive a results report with per-task scores, grader notes, and summary statistics
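
For steps 2–4, here is a minimal sketch of turning a downloaded tasks.json into a model_responses.json ready for upload. The file schemas and the generate() placeholder are assumptions; substitute whatever model or agent harness you run locally.

    # Sketch only: file schemas and generate() are assumptions, not Portex's published format.
    import json

    def generate(prompt: str) -> str:
        """Placeholder for your local model or agent harness."""
        return "model response for: " + prompt

    with open("tasks.json") as f:
        tasks = json.load(f)  # assumed shape: a list of {"task_id": ..., "prompt": ...}

    responses = [{"task_id": t["task_id"], "response": generate(t["prompt"])} for t in tasks]

    with open("model_responses.json", "w") as f:
        json.dump(responses, f, indent=2)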

Example Eval

An example task at the high-undergraduate/low-graduate level in economics, shown for illustrative purposes.

Compatibility

Portex tasks are compatible with the Harbor framework for agentic evals.
