SQuAD

SQuAD (Stanford Question Answering Dataset) is a QA benchmark designed to test a language model's reading comprehension capabilities. It consists of 100K question-answer pairs (including 10K in the validation set), where each answer is a segment of text taken directly from the accompanying reading passage. To learn more about the dataset and its construction, you can read the original SQuAD paper here.

info

SQuAD was constructed by sampling 536 articles from the top 10K Wikipedia articles. A total of 23,215 paragraphs were extracted, and question-answer pairs were manually curated for these paragraphs.

Arguments

There are three optional arguments when using the SQuAD benchmark:

  • [Optional] tasks: a list of tasks (SQuADTask enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of SQuADTask enums can be found here.
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
  • [Optional] evaluation_model: a string specifying which of OpenAI's GPT models to use for scoring, OR any custom LLM of type DeepEvalBaseLLM. This is set to gpt-4o by default.
note

Unlike most benchmarks, DeepEval's SQuAD implementation requires an evaluation_model, using an LLM-as-a-judge to generate a binary score determining if the prediction and expected output align given the context.
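
For instance, the judge can be configured explicitly when constructing the benchmark. The snippet below is a minimal sketch that sets all three optional arguments; "gpt-4o" is already the default evaluation_model and is shown only to make the setting explicit, and any custom model of type DeepEvalBaseLLM could be passed in its place.

from deepeval.benchmarks import SQuAD
from deepeval.benchmarks.tasks import SQuADTask

# Sketch: restrict evaluation to one passage topic, use the maximum 5 shots,
# and name the OpenAI model that acts as the LLM-as-a-judge
benchmark = SQuAD(
    tasks=[SQuADTask.OXYGEN],
    n_shots=5,
    evaluation_model="gpt-4o"
)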

Example

The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on passages about pharmacy and Normans in SQuAD using 3-shot prompting.

from deepeval.benchmarks import SQuAD
from deepeval.benchmarks.tasks import SQuADTask

# Define benchmark with specific tasks and shots
benchmark = SQuAD(
    tasks=[SQuADTask.PHARMACY, SQuADTask.NORMANS],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is computed by the LLM-as-a-judge, which checks whether each predicted answer aligns with the expected output given the passage context.

For example, if the question asks, "How many atoms are present?" and the model predicts "two atoms," the LLM-as-a-judge determines whether this aligns with the expected answer of "2" by assessing semantic equivalence rather than exact text matching.
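
Beyond the single overall_score, you may want a per-task breakdown or the judge's row-level verdicts. The sketch below assumes the benchmark exposes task_scores and predictions attributes, as DeepEval's other benchmarks do; treat these attribute names as assumptions if your version differs.

# Run after benchmark.evaluate(model=mistral_7b) has completed
print(benchmark.task_scores)   # assumed: accuracy broken down by SQuADTask
print(benchmark.predictions)   # assumed: per-question predictions and judge verdicts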

SQuAD Tasks

The SQuADTask enum classifies the diverse range of categories covered in the SQuAD benchmark.

from deepeval.benchmarks.tasks import SQuADTask

squad_tasks = [SQuADTask.PHARMACY]

Below is the comprehensive list of available tasks:

  • PHARMACY
  • NORMANS
  • HUGUENOT
  • DOCTOR_WHO
  • OIL_CRISIS_1973
  • COMPUTATIONAL_COMPLEXITY_THEORY
  • WARSAW
  • AMERICAN_BROADCASTING_COMPANY
  • CHLOROPLAST
  • APOLLO_PROGRAM
  • TEACHER
  • MARTIN_LUTHER
  • ECONOMIC_INEQUALITY
  • YUAN_DYNASTY
  • SCOTTISH_PARLIAMENT
  • ISLAMISM
  • UNITED_METHODIST_CHURCH
  • IMMUNE_SYSTEM
  • NEWCASTLE_UPON_TYNE
  • CTENOPHORA
  • FRESNO_CALIFORNIA
  • STEAM_ENGINE
  • PACKET_SWITCHING
  • FORCE
  • JACKSONVILLE_FLORIDA
  • EUROPEAN_UNION_LAW
  • SUPER_BOWL_50
  • VICTORIA_AND_ALBERT_MUSEUM
  • BLACK_DEATH
  • CONSTRUCTION
  • SKY_UK
  • UNIVERSITY_OF_CHICAGO
  • VICTORIA_AUSTRALIA
  • FRENCH_AND_INDIAN_WAR
  • IMPERIALISM
  • PRIVATE_SCHOOL
  • GEOLOGY
  • HARVARD_UNIVERSITY
  • RHINE
  • PRIME_NUMBER
  • INTERGOVERNMENTAL_PANEL_ON_CLIMATE_CHANGE
  • AMAZON_RAINFOREST
  • KENYA
  • SOUTHERN_CALIFORNIA
  • NIKOLA_TESLA
  • CIVIL_DISOBEDIENCE
  • GENGHIS_KHAN
  • OXYGEN