Building Better Benchmarks: Towards Standardized AI Evaluation
AI is here in full force. As our models grow increasingly sophisticated, our need for reliable benchmarking has never been more critical - yet the current benchmarking landscape resembles a patchwork of disparate approaches. A quick look at today’s AI benchmarks shows a breadth of inconsistencies in reproducibility standards, accessibility protocols, benchmark structure, and safeguards against data contamination. This doesn’t just complicate our ability to meaningfully compare models with each other; it casts doubt on the value proposition of using these benchmarks at all. If we want a shot at understanding these models and what they are truly capable of, we as a field must adopt a unified set of best practices for how we develop and use benchmarks moving forward.
In this post, I discuss what I believe to be the main limitations of today’s benchmarks and compile a list of best practices the field can adopt for a more unified and rigorous approach to AI evaluation.
1. Data Leakage
Models are getting trained on the same questions they’re being tested on. This happens in two ways: direct exposure to benchmark data scraped from the web during pre-training, and indirect exposure through iterative training on user interactions. Recent research paints a troubling picture: these two forms of data leakage have exposed an estimated 4.7 million benchmark questions across 263 benchmarks to models like GPT-3.5 and GPT-4 [Leak, Cheat, Repeat]. This could explain why GPT-3.5 Turbo has been found to reproduce masked incorrect answer choices from the MMLU with 57% accuracy [Benchmark Probing], and to regenerate entire evaluation questions when given a URL hint [Investigating Data Contamination].
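To make that probing setup concrete, here is a minimal sketch of how a masked-answer-choice probe might be constructed. The `query_model` function is a hypothetical stand-in for whatever completion API you use, and the prompt wording is my own; this follows the general idea described above, not any paper’s exact implementation.

```python
# Hypothetical sketch of a masked-answer-choice contamination probe.
# `query_model` is a placeholder for your completion API of choice.

def build_probe_prompt(question: str, choices: list[str], masked_idx: int) -> str:
    """Hide one answer choice and ask the model to reconstruct it verbatim."""
    shown = [
        f"{chr(65 + i)}. {'[MASKED]' if i == masked_idx else c}"
        for i, c in enumerate(choices)
    ]
    return (
        "The following is a benchmark question with one answer choice hidden.\n"
        f"Question: {question}\n" + "\n".join(shown) +
        "\nReproduce the hidden answer choice exactly:"
    )

def is_memorized(question, choices, masked_idx, query_model) -> bool:
    """If the model reproduces the hidden (incorrect) choice verbatim,
    the item was very likely seen during training."""
    completion = query_model(build_probe_prompt(question, choices, masked_idx))
    return completion.strip().lower() == choices[masked_idx].strip().lower()
```

A model that has never seen the benchmark has no way to guess an arbitrary incorrect option verbatim, which is what makes this probe a useful contamination signal.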
The implications of this? Saturated benchmark scores and inflated perceptions of what these models are capable of. A striking example comes from Apple’s recent GSM-Symbolic paper, in which researchers made minor modifications to GSM8K, a widely used reasoning benchmark consisting of grade-school math problems. Despite apparently steady improvements in model performance on GSM8K over time, the researchers found that simply changing character names or adding seemingly relevant but ultimately inconsequential details led to significant performance degradation across models. This vulnerability isn’t unique - other research has shown that simply reordering multiple-choice answers on benchmarks like the MMLU introduces substantial variance in model outputs and can drop a model by up to eight positions on leaderboard rankings [Benchmarks are Targets]. These findings suggest that current static and contaminated benchmarks are testing for all the wrong things: measuring a model’s ability to memorize benchmark-specific patterns rather than execute compositional, generalizable reasoning.
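Here is a minimal sketch of the kind of perturbation these studies apply: reorder the answer choices of a multiple-choice item while tracking where the gold answer moves. A model that reasons about the content should be unaffected; one that has memorized positional patterns will not be. This is an illustrative reconstruction, not the code used in the cited work.

```python
import random

def shuffle_choices(question: str, choices: list[str], answer_idx: int, seed: int = 0):
    """Return the same item with its answer choices reordered,
    keeping track of where the correct answer ends up."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # gold label follows its choice
    return question, new_choices, new_answer_idx
```

Evaluating the same items under several seeds and comparing accuracies gives a cheap robustness check against position-based memorization.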
2. Reproducibility
Reproducing evaluations from scratch is hard. There are many ways things can go wrong in the workflow, from slight variations in model prompting to tweaks in hyperparameter configurations to LM hallucination during answer extraction [Evaluating AI Systems]. This has proven to be a significant problem across the field: one recent study found that up to 76% of papers using ROUGE evaluations reported provably incorrect scores, an unsurprising result given that only 20% of papers implementing the evaluation provided enough information for effective reproduction [Rogue Scores]. The challenge extends far beyond a single metric - across benchmarks and their implementations, researchers frequently omit crucial implementation details, straining the field’s ability to validate and build upon published results [A Systematic Survey and Critical Review of LMs].
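To see how easily scores diverge, here is a small illustration using Google’s `rouge_score` package (my choice of implementation for the example; [Rogue Scores] surveys many different ones). Toggling a single, rarely documented option such as stemming changes the reported number for the exact same prediction and reference.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

prediction = "the cats were sitting on the mats"
reference = "the cat sat on the mat"

# Two "ROUGE-L" configurations that papers rarely distinguish.
for use_stemmer in (False, True):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=use_stemmer)
    score = scorer.score(reference, prediction)["rougeL"]
    print(f"use_stemmer={use_stemmer}: ROUGE-L F1 = {score.fmeasure:.3f}")
```

If a paper reports only “ROUGE-L,” a reader has no way to know which of these numbers they are looking at, which is exactly the reproducibility gap described above.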
Blazing Productive Paths Forward
So are benchmarks a doomed practice? I don’t think so. Done right, they have the potential to serve as a useful yardstick for universally evaluating the capabilities of LMs. But given the challenges discussed above, today’s benchmarks are simply not that compelling. Here’s how I think we can move forward as a field:
On Handling Data Contamination / Memorization:
- Standardized Memorization Scores or Contamination Guarantees from Model Developers: The community has developed robust methods for detecting training data contamination, ranging from n-gram & LM pair scrubbing to Test Slot Guessing to Log Likelihood Probing on QA pairs (see the filtering sketch after this list). By consolidating these techniques into a standardized contamination metric, we can establish a new precedent: requiring frontier model developers to provide comprehensive contamination guarantees with each model release. While this was done for GPT-3 and Llama 2, it hasn’t been done since, and reviving the practice would provide much-needed transparency for the entire field of AI.
- Securing Evaluation Data: The widespread availability of evaluation test data on the internet undermines the validity of benchmarking; evaluation data simply needs to be secured. Microsoft’s recent research charts a promising path forward: cryptographically secure environments in which both model weights and evaluation data remain encrypted during evaluation and are stored privately otherwise. However, this shift toward private evaluation must be balanced with transparency, which creates the need for independent auditing organizations to verify the quality and integrity of these evaluations. Initiatives like Scale AI’s SEAL and AI Explained’s SimpleBench are good examples of this.
- Including GUIDs / Canary Strings in Benchmark Data: BIG-bench introduced an effective approach to preventing data contamination: embedding canary strings, unique global identifiers that act as digital fingerprints, in benchmark data so that it can be easily detected and filtered out when training corpora are scrubbed (the filtering sketch after this list includes such a check).
- Functional / Compositional Benchmarks: We must shift our focus to benchmarks that evaluate true reasoning capabilities rather than rewarding models for exploiting memorized patterns. The key lies in creating evaluations whose questions are sufficiently out-of-distribution from the training data models have already seen. This can manifest in various ways. One approach is to functionalize datasets, as done with GSM-Symbolic, in which surface details like character names, proper nouns, and incidental problem specifics are resampled to create a fresh evaluation instance while keeping the core reasoning task the same (see the templating sketch after this list). Other approaches compose patterns observed during training into new, unseen combinations at evaluation time; benchmarks testing this thesis include CompCMTG and modeLing.
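As referenced in the first and third bullets above, here is a minimal sketch of what benchmark-aware training-data filtering might look like, combining a long-n-gram overlap check with a canary-string check. The 13-token window is an illustrative choice, and the canary GUID shown is a placeholder rather than any benchmark’s real string.

```python
# Illustrative sketch: filtering pretraining documents against benchmark data.
# The canary GUID below is a placeholder, not any benchmark's real string.

BENCHMARK_CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"
NGRAM_SIZE = 13  # window size is an illustrative choice

def ngrams(text: str, n: int = NGRAM_SIZE) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_items: list[str]) -> set[tuple[str, ...]]:
    """Collect every long n-gram that appears in the evaluation set."""
    index = set()
    for item in benchmark_items:
        index |= ngrams(item)
    return index

def is_contaminated(document: str, benchmark_index: set[tuple[str, ...]]) -> bool:
    """Flag a training document if it contains the canary string
    or shares any long n-gram with the benchmark."""
    if BENCHMARK_CANARY in document:
        return True
    return not ngrams(document).isdisjoint(benchmark_index)
```

Documents flagged by either check would be dropped before training, and the flag rate itself is one candidate for the standardized contamination metric proposed above.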
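And as mentioned in the last bullet, here is a toy illustration of the functionalization idea behind GSM-Symbolic: the reasoning structure is fixed in a template while names and numbers are resampled on every run. The template and value ranges are my own invented example, not drawn from the paper.

```python
import random

# Toy template in the spirit of GSM-Symbolic: the reasoning task is fixed,
# the surface details are resampled each run. Entirely invented example.
NAMES = ["Ava", "Noah", "Priya", "Diego"]

def sample_problem(seed: int):
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    apples, price = rng.randint(3, 12), rng.randint(2, 9)
    question = (
        f"{name} buys {apples} apples at ${price} each. "
        f"How much does {name} spend in total?"
    )
    answer = apples * price  # ground truth computed from the sampled values
    return question, answer

# Each seed yields a distinct instance of the same underlying reasoning task.
print(sample_problem(0))
print(sample_problem(1))
```

Because the ground truth is computed from the sampled values, a model can only score well by actually performing the reasoning, not by recalling a memorized answer.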
On Bolstering Reproducibility:
- Standardizing Reproducibility Requirements: The path to reproducible benchmarking demands explicit documentation of every evaluation detail, from hyperparameters to the exact prompts used. This level of transparency cannot be optional; it must become a fundamental requirement both for benchmark releases and for research papers that use these benchmarks. Emerging frameworks like InspectAI help standardize this information, creating accessible implementation pipelines that enable exact benchmark reproduction moving forward (a sketch of the kind of record I have in mind follows below).
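To make that concrete, here is a sketch of an evaluation record that could be published alongside every reported score. The field names are my own suggestion and are not tied to InspectAI’s actual schema or any particular framework.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalConfig:
    """Everything another researcher needs to rerun the evaluation exactly.
    Field names are illustrative, not tied to any particular framework."""
    model: str
    model_revision: str
    benchmark: str
    benchmark_version: str
    prompt_template: str
    num_fewshot: int
    temperature: float
    max_tokens: int
    seed: int
    answer_extraction: str      # e.g. the regex or grader used to parse outputs
    metric: str
    metric_implementation: str  # exact library and version

config = EvalConfig(
    model="example-model", model_revision="2024-06-01",
    benchmark="GSM8K", benchmark_version="main",
    prompt_template="Q: {question}\nA: Let's think step by step.",
    num_fewshot=8, temperature=0.0, max_tokens=512, seed=1234,
    answer_extraction=r"####\s*(-?\d+)", metric="exact_match",
    metric_implementation="my-eval-lib==0.3.1",  # placeholder package name
)
print(json.dumps(asdict(config), indent=2))  # publish alongside the results
```

Shipping a record like this with every result, whether as JSON, a config file, or a framework-native task definition, is what would let a reader rerun the evaluation and land on the same number.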