ARC
ARC, or the AI2 Reasoning Challenge, is a dataset used to benchmark language models' reasoning abilities. The benchmark consists of 7,787 multiple-choice questions drawn from science exams for grades 3 to 9. The dataset includes two modes: easy and challenge, with the latter featuring more difficult questions that require advanced reasoning.
To learn more about the dataset and its construction, you can read the original paper here.
Arguments
There are three optional arguments when using the ARC benchmark:
- [Optional] n_problems: the number of problems for model evaluation. By default, this is set to all problems available in each benchmark mode.
- [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
- [Optional] mode: an ARCMode enum that selects the evaluation mode. This is set to ARCMode.EASY by default. deepeval currently supports 2 modes: EASY and CHALLENGE.
Both EASY and CHALLENGE modes consist of multiple-choice questions. However, CHALLENGE questions are more difficult and require more advanced reasoning.
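For instance, switching to the harder question set only requires passing the CHALLENGE mode. The minimal sketch below uses the same ARC and ARCMode imports as the example in the next section, with n_problems and n_shots left at their defaults:
from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode

# Evaluate on the harder question set; other arguments keep their defaults
benchmark = ARC(mode=ARCMode.CHALLENGE)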
Example
The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on 100 problems in ARC in EASY mode.
from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode
# Define benchmark with specific n_problems and n_shots in easy mode
benchmark = ARC(
    n_problems=100,
    n_shots=3,
    mode=ARCMode.EASY
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
The overall_score ranges from 0 to 1, signifying the fraction of correct predictions. Performance in both modes is measured using an exact match scorer, which counts a prediction as correct only when it exactly matches the expected answer.
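Conceptually, exact match scoring reduces to comparing each predicted answer choice against the gold label and taking the fraction of matches. The sketch below illustrates that scoring logic only; it is not deepeval's internal implementation, and the prediction and label lists are hypothetical:
# Illustrative only: exact match scoring over hypothetical predictions
predictions = ["A", "C", "B", "D"]  # model's chosen answer letters (hypothetical)
gold_labels = ["A", "B", "B", "D"]  # correct answer letters (hypothetical)

# Fraction of predictions that exactly match the gold label
overall_score = sum(
    pred == gold for pred, gold in zip(predictions, gold_labels)
) / len(gold_labels)
print(overall_score)  # 0.75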