How Evals Work
A conceptual overview of Portex Evals and how they work.
What is a Portex Eval?
A Portex Eval is a private, expert-authored test that measures how well AI models perform on real-world, domain-specific work or frontier knowledge. Unlike public benchmarks (which are prone to data contamination), Portex Evals keep answer keys hidden from models and use rubric-driven grading to produce standardized scores. Tasks can be completed agentically (multi-turn, e.g. with Codex or Claude Code) or in Q&A style (single-turn, vanilla model reasoning without external tools).
Each eval consists of:
A set of tasks (prompts, optionally with reference files like PDFs, images, or CSVs)
A private answer key with a rubric that defines how each task is scored
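As a rough sketch, a task and its private answer key could be represented as below. The field names here are illustrative assumptions, not Portex's actual schema:

```python
# Illustrative sketch only -- field names are hypothetical, not Portex's schema.
task = {
    "task_id": "econ-001",
    "prompt": "Draft a one-page memo analyzing the attached balance sheet ...",
    "reference_files": ["balance_sheet.csv"],  # optional PDFs, images, CSVs
}

answer_key = {  # kept private; only the grading system ever sees it
    "task_id": "econ-001",
    "golden_answer": "The memo should conclude that liquidity is adequate ...",
    "rubric": [
        {"criterion": "Identifies the key ratio", "weight": 0.4, "type": "semantic"},
        {"criterion": "States the correct final value", "weight": 0.6, "type": "exact_match"},
    ],
}
```

Note that the rubric weights sum to 1.0 (100%), which is what makes the weighted pass-threshold calculation well defined.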
Eval Types
Portex supports two types of evals, each suited to measuring different model capabilities.
Q&A (Single-Turn)
Q&A evals are prompt-driven. The model receives a task prompt (and optionally a reference file), produces a single response, and that response is graded. This is the standard eval format for measuring knowledge, reasoning, and analysis.
Q&A evals are well-suited to tasks like:
Answering domain-specific questions without tools/coding/web search
Agentic (Multi-Turn)
Agentic evals test a model's ability to operate autonomously over multiple turns in a sandboxed environment. Instead of producing a single response, the model acts as an agent: it can execute code, read and write files, use tools, and iterate toward a solution.
Each agentic task runs inside a container with a configurable environment (base image, installed packages, resource limits, and timeouts). The eval author defines what tools are available and what the environment looks like. The agent interacts with this environment across multiple turns until it produces a final output.
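A hypothetical environment specification for an agentic task might look like the following; the actual Harbor/Portex configuration format may differ, and every field name here is an assumption:

```python
# Hypothetical environment spec for an agentic task (illustrative only --
# the real Harbor/Portex config format may differ).
environment = {
    "base_image": "python:3.11-slim",       # container base image
    "packages": ["pandas", "requests"],     # installed at container build time
    "resources": {"cpu": 2, "memory_mb": 4096},  # resource limits
    "timeout_seconds": 1800,                # hard cap on the agent's run
    "tools": ["shell", "file_read", "file_write"],  # tools the author exposes
}
```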
Agentic evals are well-suited to tasks like:
Writing and executing code to solve a problem
Performing multi-step research or data analysis
Working with files, APIs, or command-line tools
Tasks that require planning, error recovery, and iteration
Agentic evals on Portex are built on the Harbor framework, which provides the sandboxed execution environment and agent orchestration.
Key Concepts
Task — A single prompt given to a model. Tasks are procedural: they ask the model to produce a work product (e.g., a legal memo, a financial calculation, a code review) rather than answer trivia. In Q&A evals, tasks are self-contained prompts. In agentic evals, tasks also include an environment specification and available tools.
Environment — The execution configuration for an agentic task (base image, installed packages, resource limits, timeouts, available tools). Portex integrates with the Harbor framework for agentic evals.
Answer Key — A golden reference answer for each task, kept private by default. Only the grading system sees it. Accompanies the rubric.
Grading Criteria — A rubric of objective criteria, each with a weight (summing to 100%) and a scoring method. Supported types:
Semantic (LLM-as-a-Judge): An LLM jury evaluates whether the model's response satisfies the criterion
Exact Match: Exact string matching against an expected value
Lexical: String or regex pattern matching
Binary, ordinal, and numeric types for structured answers
Pass Threshold — A per-task percentage. If the weighted score across all criteria meets or exceeds this threshold, the task is marked as passed.
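The pass-threshold logic above can be sketched in a few lines. This is an illustrative reimplementation, not Portex's actual grader:

```python
def task_passed(criterion_scores, weights, pass_threshold):
    """Return True if the weighted score meets or exceeds the threshold.

    criterion_scores: per-criterion scores in [0, 1]
    weights: per-criterion weights summing to 1.0 (i.e., 100%)
    pass_threshold: the per-task threshold as a fraction, e.g. 0.7
    """
    weighted = sum(s * w for s, w in zip(criterion_scores, weights))
    return weighted >= pass_threshold

# Example: two criteria weighted 40%/60%, threshold 70%.
# Full credit on the first, half credit on the second:
# 1.0 * 0.4 + 0.5 * 0.6 = 0.70, which meets the threshold.
print(task_passed([1.0, 0.5], [0.4, 0.6], 0.7))  # -> True
```

Because the threshold is "meets or exceeds," a weighted score exactly equal to the threshold counts as a pass.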
Core Dataset — The complete bundle (tasks, answers, reference files, expert notes, and Harbor-compliant file artifacts) that can be licensed separately. Buyers use Core Datasets for model improvement, including reinforcement learning.
LLM Jury — A private configurable ensemble of language models that applies the rubric to grade model responses. Models under evaluation never see the answer key or rubric.
O*NET-SOC Taxonomy — Evals can be tagged to occupations in the BLS O*NET-SOC system (nearly 1,000 occupational titles aligned to the 2018 SOC). This enables benchmarking by job family and real-world task type. Experts assign an occupation tag when creating a listing; if left unselected, Portex assigns one.
Lifecycle
For experts
Choose an eval type (Q&A or Agentic)
Author tasks with answer keys and grading criteria
For agentic evals, configure the execution environment (packages, resources, timeouts)
Upload to the Datalab via the Eval Builder or file import
Choose a publishing model (commercial or open source)
Create and publish a listing. Model builders can now discover, download tasks, and submit runs.
For model builders
Browse evals on the Datalab. Filter by occupation, difficulty, modality, or model performance.
Download the Task Bundle (tasks.json and reference files)
Run a model locally to produce responses (for agentic evals, this means running an agent against the task environment via Harbor)
Upload model_responses.json and purchase a run
Receive a results report with per-task scores, grader notes, and summary statistics
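As a sketch, the uploaded responses file could be assembled like this. The record shape shown is an assumption for illustration; consult the Portex docs for the actual model_responses.json schema:

```python
import json

# Hypothetical record shape -- the real model_responses.json schema may
# differ; task IDs and field names here are illustrative.
responses = [
    {"task_id": "econ-001", "response": "The memo concludes that ..."},
    {"task_id": "econ-002", "response": "The net present value is ..."},
]

with open("model_responses.json", "w") as f:
    json.dump(responses, f, indent=2)
```

Each record pairs a task ID from the downloaded Task Bundle with the model's response, so the grader can match responses to the private answer key.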
Example Eval
For illustration, consider an example Q&A task in economics, pitched at the high-undergraduate/low-graduate level.
Compatibility
Portex agentic tasks are compatible with the Harbor framework.