Importing an Eval

Import an eval from JSON files for advanced users.

If you prefer authoring evals in an editor or IDE, you can upload structured JSON files directly to the Datalab. This path is intended for advanced users who want full control over the eval schema or already have files ready to go locally.

Upload Flow

From the Data Studio, go to Datasets > Upload a File, then select "Eval Dataset." You will be prompted to upload your files.

Eval Dataset Bundle

An eval dataset consists of three parts:

  1. tasks.json (required)

  2. answers.json (required)

  3. Reference files as a .zip archive (optional)

Every task_id in answers.json must also exist in tasks.json, and each task_id must be unique.

tasks.json

A JSON array of task objects. Each task must include:

  • task_id: a unique identifier (string)

  • task_prompt: the full prompt for the model

  • reference_file: filename of the attached reference file (or empty string if none)

This file is downloadable by eval buyers so they can generate model responses.
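
For illustration, a minimal tasks.json might look like this (the task IDs, prompts, and filenames below are placeholders, not required values):

[
  {
    "task_id": "task-001",
    "task_prompt": "Using the attached CSV, calculate the total number of units sold in Q3 2024.",
    "reference_file": "sales_data.csv"
  },
  {
    "task_id": "task-002",
    "task_prompt": "List three risks a company should consider before adopting a four-day work week.",
    "reference_file": ""
  }
]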

answers.json

A JSON array of answer objects. Each answer must include:

  • task_id: matches the task_id in tasks.json

  • answer: the correct output (text, number, or structured content)

  • reference_file: the associated reference file name (if any)

  • criteria: an array of grading criteria (see below)

  • passThreshold: minimum weighted score (0-100) for the task to be considered passed

Optional fields:

  • tools: array of tool names if the task involves tool use (can be empty)

Criteria schema

Each criterion in the criteria array has the following fields:

  • id (string, required): Unique identifier (UUID recommended)

  • name (string, required): Short label for the criterion

  • type (string, required): One of semantic, lexical, binary, ordinal, numeric, or regex

  • description (string, required): What constitutes a correct response

  • weight (number, required): Percentage weight (all criteria weights for a task should sum to 100)

  • rationale (string, optional): Explanation of how the correct answer is derived

  • examples (array, optional): Example responses (can be empty)

  • semanticPrompt (string, required for semantic criteria): The prompt sent to the LLM jury to evaluate the response (must contain exact figures)
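
Putting the answer and criteria fields together, a single answers.json entry might look like this (the IDs, values, and weights are illustrative only):

[
  {
    "task_id": "task-001",
    "answer": "1,284 units",
    "reference_file": "sales_data.csv",
    "passThreshold": 70,
    "tools": [],
    "criteria": [
      {
        "id": "9b2d6c1e-3f4a-4b5c-8d7e-1f2a3b4c5d6e",
        "name": "Correct total",
        "type": "numeric",
        "description": "The response states a Q3 2024 total of 1,284 units.",
        "weight": 60,
        "rationale": "Summing the Q3 2024 rows of sales_data.csv gives 1,284.",
        "examples": []
      },
      {
        "id": "5a1b2c3d-4e5f-4789-abcd-ef0123456789",
        "name": "Shows working",
        "type": "semantic",
        "description": "The response explains how the total was calculated.",
        "weight": 40,
        "semanticPrompt": "Does the response explain that the total of 1,284 units was obtained by summing the Q3 2024 rows of the attached CSV?",
        "examples": []
      }
    ]
  }
]

Assuming, for this sketch, that a criterion contributes its full weight when satisfied, a response meeting only the first criterion scores 60 and fails the passThreshold of 70, while a response meeting both criteria scores 100 and passes.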

The answer key is private by default. Buyers never see it unless you sell the Core Dataset.

Reference Files (Optional)

Bundle any reference files into a .zip archive. Each task can reference at most one file. Accepted formats: PDF, images (JPG, PNG, WebP, GIF), CSV, TXT.

Video and audio files are not supported.

The archive structure should be flat (files at the root level):
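
For example, a reference archive might contain (filenames are illustrative):

reference_files.zip
  sales_data.csv
  floorplan.png
  contract_2023.pdf

Each filename should correspond to the reference_file values used in tasks.json and answers.json.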

After Upload

Once your eval dataset is created, proceed to Publishing Your Eval to create a listing.
