Running an Eval

Learn how to run an eval on the PortexAI Datalab.

How to run an eval on Portex

Downloading the Task Bundle

Once you've found an eval you'd like to benchmark your model against, start by downloading the task bundle from the Task Bundles tab. The Download All button downloads the tasks along with any reference files your model needs to respond.
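As a rough sketch, here is one way to load a downloaded bundle in Python. The directory layout, the file name tasks.json, and the prompt field are assumptions for illustration; check the contents of the bundle you actually download.

import json
from pathlib import Path

# Assumed layout: the bundle unpacks into a directory containing a
# tasks.json file plus any reference files. Adjust paths and field
# names to match the real bundle.
bundle_dir = Path("task_bundle")

with open(bundle_dir / "tasks.json") as f:
    tasks = json.load(f)

for task in tasks:
    # Each task is assumed to carry a task_id and a prompt field.
    print(task["task_id"], "->", task.get("prompt", "")[:80])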

Eval Checkout

Once you've run your model locally against the task list, you can begin your eval run by clicking Run Eval. This will open the eval checkout window. Here, you can start by uploading your model responses as a JSON file.

Your model_responses.json must contain, at a minimum, one record per task, where each record includes:

  • task_id: the unique identifier for the task

  • model_response: your model's response to the task

Example:

[
  {
    "task_id": "apple_net_margin_2024",
    "model_response": "Net income $93,736M ÷ Revenue $391,035M = 23.97%"
  }
]
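As a minimal sketch, you could generate this file by running your model over the task list from the bundle. Here run_model is a placeholder for your own inference code, and the tasks.json layout is the same assumption as above:

import json

def run_model(prompt: str) -> str:
    # Placeholder: replace with a call to your model
    # (local inference or an API).
    return "model output for: " + prompt

with open("task_bundle/tasks.json") as f:  # assumed bundle layout
    tasks = json.load(f)

responses = [
    {"task_id": t["task_id"], "model_response": run_model(t["prompt"])}
    for t in tasks
]

# Write the upload file in the format shown above.
with open("model_responses.json", "w") as f:
    json.dump(responses, f, indent=2)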

Next, proceed to checkout to request a report graded against the answer key. You can pay with Stripe or USDC; once your payment is processed, your eval job will start.

Getting Your Results

Once you've submitted your model responses for evaluation, you can track the requested run in the Data Studio under the Evals tab. When the eval is complete, you can download a report showing your model's accuracy against the task answers.
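The report format below is an assumption for illustration: a JSON array with one record per task and a boolean correct field. A short Python sketch for summarizing such a report:

import json

# Assumption: the downloaded report is a JSON array with one record per
# task, each carrying a boolean "correct" field. Adapt to the real schema.
with open("eval_report.json") as f:
    report = json.load(f)

correct = sum(1 for r in report if r.get("correct"))
print(f"Accuracy: {correct / len(report):.1%} ({correct}/{len(report)})")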

Purchasing the Core Dataset

Model builders can also choose to purchase the core dataset underpinning each eval, for example to analyze which specific questions the model answered poorly. If the core dataset includes a knowledge reference, model builders can use it to refine their models with reinforcement learning.
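As a sketch of that kind of error analysis, the snippet below joins your responses with a hypothetical answer key from the core dataset. The file names, the answer_key.json schema, and the exact-match comparison are all assumptions; the Datalab's own grading is likely more lenient:

import json

# Assumption: the purchased core dataset includes an answer key keyed by
# task_id. File names and fields here are illustrative only.
with open("core_dataset/answer_key.json") as f:
    answers = {row["task_id"]: row["answer"] for row in json.load(f)}

with open("model_responses.json") as f:
    responses = json.load(f)

# Naive exact-match comparison to surface likely misses.
misses = [
    r for r in responses
    if r["task_id"] in answers
    and r["model_response"].strip() != answers[r["task_id"]].strip()
]
for r in misses:
    print(r["task_id"], "| expected:", answers[r["task_id"]])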

Purchasing the core dataset is similar to purchasing any other dataset on the Datalab.
