How Evals Work
A conceptual overview of Portex Evals and how they work.
What is a Portex Eval?
A Portex Eval is a private, expert-authored test that measures how well AI models perform on real-world, domain-specific work or frontier knowledge. Unlike public benchmarks (which are prone to data contamination), Portex Evals keep answer keys hidden from models and use rubric-driven grading to produce standardized scores. Tasks can be completed agentically (multi-turn, e.g., with Codex or Claude Code) or in Q&A style (single-turn, vanilla model reasoning without external tools).
Each eval consists of:
A set of tasks (prompts, optionally with reference files like PDFs, images, or CSVs)
A private answer key with a rubric that defines how each task is scored
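To make that split concrete, here is a minimal sketch of the two halves as data. The field names (task_id, prompt, reference_files, rubric, pass_threshold) are assumptions for illustration, not the actual Portex schema.

```python
# Illustrative shapes only; field names are assumptions, not the actual Portex schema.

# Public half: what a model builder downloads (the tasks plus any reference files).
task = {
    "task_id": "task-001",
    "prompt": "Produce the requested work product ...",
    "reference_files": ["brief.pdf", "data.csv"],  # optional PDFs, images, CSVs
    "mode": "qa",                                  # "qa" (single-turn) or "agentic"
}

# Private half: seen only by the grading system, never by the model under evaluation.
answer_key = {
    "task_id": "task-001",
    "golden_answer": "...",
    "rubric": [
        {"criterion": "states the key conclusion",  "type": "semantic", "weight": 0.6},
        {"criterion": "quotes the governing clause", "type": "lexical",  "weight": 0.4},
    ],                                             # weights sum to 1.0, i.e. 100%
    "pass_threshold": 0.7,                         # per-task pass bar (70%)
}
```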
Key Concepts
Task — A single prompt given to a model. Tasks are procedural: they ask the model to produce a work product (e.g., a legal memo, a financial calculation, a code review) rather than answer trivia.
Environment — A set of configurations for agentic tasks. Portex integrates with the Harbor framework for agentic evals.
Answer Key — A golden reference answer for each task, kept private by default. Only the grading system sees it. Accompanies the rubric.
Grading Criteria — A rubric of objective criteria, each with a weight (weights sum to 100%) and a scoring method. Supported types:
Semantic: An LLM jury evaluates whether the model's response satisfies the criterion
Lexical: Exact string or regex matching
Binary, ordinal, and numeric types for structured answers
Pass Threshold — A per-task percentage. If the weighted score summed across all criteria meets or exceeds this threshold, the task is marked as passed.
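Concretely, a task's grade is just the weighted sum of its criterion scores compared against this threshold. The helper below is a minimal sketch of that arithmetic, assuming criterion scores and weights are normalized to the 0-1 range; it is not the platform's grading code.

```python
def grade_task(criterion_scores: dict[str, float],
               weights: dict[str, float],
               pass_threshold: float) -> tuple[float, bool]:
    """Weighted-sum grading sketch (not the actual Portex grader).

    criterion_scores: per-criterion scores in [0, 1], produced by the
                      semantic / lexical / binary / ordinal / numeric checks.
    weights:          per-criterion weights that sum to 1.0 (i.e. 100%).
    pass_threshold:   per-task bar, e.g. 0.7 for 70%.
    """
    total = sum(criterion_scores[name] * weight for name, weight in weights.items())
    return total, total >= pass_threshold


# Example: two criteria weighted 60/40 against a 70% pass threshold.
score, passed = grade_task(
    criterion_scores={"states_key_conclusion": 1.0, "quotes_governing_clause": 0.5},
    weights={"states_key_conclusion": 0.6, "quotes_governing_clause": 0.4},
    pass_threshold=0.7,
)
# score == 0.8, passed == True
```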
Core Dataset — The complete bundle (tasks, answers, reference files, expert notes, and Harbor-compliant file artifacts) that can be licensed separately. Buyers use Core Datasets for model improvement, including reinforcement learning.
LLM Jury — A private, configurable ensemble of language models that applies the rubric to grade model responses. Models under evaluation never see the answer key or rubric.
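As a rough picture of how such a jury might score a single semantic criterion, the sketch below assumes a hypothetical ask_judge callable and plain averaging; the actual jury composition and aggregation logic are private Portex configuration.

```python
from statistics import mean
from typing import Callable

# Hypothetical signature for one judge call: given a judge model name, the rubric
# criterion, the private golden answer, and the candidate response, return a score in [0, 1].
JudgeFn = Callable[[str, str, str, str], float]


def judge_semantic_criterion(
    response: str,
    criterion: str,
    golden_answer: str,
    jury: list[str],
    ask_judge: JudgeFn,
) -> float:
    """Sketch of LLM-jury scoring for one semantic criterion (not Portex code).

    The model being evaluated never sees `golden_answer` or `criterion`;
    only the judge models do.
    """
    verdicts = [
        ask_judge(judge_model, criterion, golden_answer, response)
        for judge_model in jury
    ]
    return mean(verdicts)  # simple average; the real aggregation may differ
```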
O*NET-SOC Taxonomy — Evals can be tagged to occupations in the O*NET-SOC taxonomy (nearly 1,000 occupational titles aligned to the BLS 2018 SOC). This enables benchmarking by job family and real-world task type. Experts assign an occupation tag when creating a listing; if left unselected, Portex assigns one.
Lifecycle
For experts
Author tasks with answer keys and grading criteria
Upload to the Datalab via the Eval Builder or file import
Create an Eval Listing with per-run pricing (and optionally a Core Dataset price), or publish it as Open Source.
Publish. Model builders can now discover the eval, download its tasks, and submit runs.
For model builders
Browse evals on the Datalab. Filter by occupation, difficulty, modality, or model performance.
Download the Task Bundle (tasks.json and reference files)
Run a model locally to produce responses
Upload model_responses.json and purchase a run
Receive a results report with per-task scores, grader notes, and summary statistics
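A minimal sketch of the local steps in this workflow, assuming tasks.json is a list of objects with task_id and prompt fields and model_responses.json maps task IDs to response text; the real schemas are defined by Portex, and run_model stands in for whatever model or agent you run locally.

```python
import json


def run_model(prompt: str) -> str:
    """Stand-in for your local model or agent call; replace with your own client."""
    return "model response goes here"


# Assumed shapes, not the official schema:
#   tasks.json            -> [{"task_id": "...", "prompt": "...", ...}, ...]
#   model_responses.json  -> {"<task_id>": "<response text>", ...}
with open("tasks.json") as f:
    tasks = json.load(f)

responses = {task["task_id"]: run_model(task["prompt"]) for task in tasks}

with open("model_responses.json", "w") as f:
    json.dump(responses, f, indent=2)
```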
Example Eval
For illustration, consider a task in economics at the high-undergraduate/low-graduate level.
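A hypothetical task of that kind, with made-up prompt text, rubric, and weights (not an actual Portex eval), might look like this:

```python
# Hypothetical illustration only; not an actual Portex task or answer key.
example_task = {
    "task_id": "econ-demo-01",
    "prompt": (
        "A monopolist faces inverse demand P = 100 - 2Q and constant marginal "
        "cost MC = 20. Derive the profit-maximizing quantity and price, compute "
        "the deadweight loss relative to perfect competition, and explain the "
        "intuition in two or three sentences."
    ),
    "mode": "qa",
}

example_rubric = [
    {"criterion": "Q* = 20 and P* = 60 derived correctly", "type": "numeric",  "weight": 0.4},
    {"criterion": "deadweight loss computed correctly",    "type": "numeric",  "weight": 0.3},
    {"criterion": "intuition for the markup is sound",     "type": "semantic", "weight": 0.3},
]

example_pass_threshold = 0.7  # pass if the weighted score reaches 70%
```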
Compatibility
Portex tasks are compatible with the Harbor framework for agentic evals.