Code Kaleidoscope

Solving Data Contamination through Dynamic Benchmarking

A comprehensive suite of dynamic benchmarking strategies for accurate and transparent evaluation of code generation Large Language Models, free from data contamination.

About Code Kaleidoscope

Code Kaleidoscope is a comprehensive suite of dynamic benchmarking strategies that offers accurate and transparent evaluation of code generation LLMs. Our approach consists of two main strategies: Programming Problem Merging (PPM) and Context Diversification (DyCodeEval).

PPM

Programming Problem Merging combines two existing benchmark problems into a single new problem, producing a task that is free of contamination because it cannot have appeared verbatim in any training corpus.

Learn More
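To make the idea concrete, here is a minimal sketch of merging two problems by composition. This is an illustration only, not the PPM implementation; the Problem structure, the composition rule, and the example tasks are assumptions chosen for clarity.

```python
# Illustrative sketch only -- not the official PPM algorithm.
# Assumes HumanEval-style problems with a natural-language prompt and a
# reference solution, where problem B's output type matches A's input type.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Problem:
    name: str
    prompt: str                      # natural-language task description
    solution: Callable[[Any], Any]   # reference solution, used to build tests


def merge_problems(a: Problem, b: Problem) -> Problem:
    """Compose two seed problems into a new, previously unseen problem."""
    merged_prompt = (
        f"Step 1: {b.prompt}\n"
        f"Step 2: Using the result of step 1 as input, {a.prompt}"
    )
    return Problem(
        name=f"{a.name}_of_{b.name}",
        prompt=merged_prompt,
        solution=lambda x: a.solution(b.solution(x)),  # oracle for new test cases
    )


# Merging "sort a list" with "drop negative numbers" yields a fresh problem
# whose reference outputs are computed by composing the two oracles.
sort_list = Problem("sort_list", "sort the list in ascending order", sorted)
drop_negatives = Problem("drop_negatives",
                         "remove all negative numbers from the list",
                         lambda xs: [x for x in xs if x >= 0])
merged = merge_problems(sort_list, drop_negatives)
assert merged.solution([3, -1, 2]) == [2, 3]
```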

DyCodeEval

Context Diversification rewrites the surrounding context of an existing problem while preserving its underlying task, yielding a new problem that is free of contamination.

Learn More
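The sketch below shows context diversification in miniature: the surface scenario changes while the underlying task stays the same. The real DyCodeEval pipeline rewrites scenarios far more richly; the scenario pool and function name here are illustrative assumptions.

```python
# Illustrative sketch only -- a fixed substitution table stands in for the
# richer scenario rewriting used by the actual approach.
import random

# Hypothetical scenario pool; each rewrite keeps the task semantics intact.
SCENARIO_POOL = [
    ("a list of numbers", "a list of daily bank-transaction amounts"),
    ("a list of numbers", "a list of patient heart-rate readings"),
    ("a list of numbers", "a list of greenhouse temperature readings"),
]


def diversify_context(prompt: str, seed=None) -> str:
    """Swap the surface story of a prompt while preserving its semantics,
    producing a variant the model cannot have memorized verbatim."""
    original, replacement = random.Random(seed).choice(SCENARIO_POOL)
    return prompt.replace(original, replacement)


prompt = "Given a list of numbers, return the sum of all positive values."
print(diversify_context(prompt, seed=0))
# e.g. "Given a list of daily bank-transaction amounts, return the sum ..."
```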

Comprehensive Survey

Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation

Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, Baishakhi Ray

Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks.

We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap—the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks.

The Data Contamination Problem

[Figure: benchmark accuracy inflation as the fraction of leaked benchmark data increases]

Is data contamination really an issue? Although the data used to train most LLMs is not disclosed, benchmarks are public and easily swept up by data-scraping pipelines. By running experiments in which different percentages of the benchmark are leaked into the training data, we observe that LLMs show increasingly inflated accuracy on those benchmarks, demonstrating a measurable degree of contamination.
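The structure of such a leakage experiment can be sketched as follows. The fine_tune and evaluate callables are hypothetical stand-ins for real training and scoring runs, not functions provided by Code Kaleidoscope.

```python
# Sketch of the leakage experiment's structure only; the training and
# evaluation steps are supplied by the caller and are hypothetical here.
import random
from typing import Callable, Sequence


def contamination_study(
    benchmark: Sequence[dict],
    base_model: object,
    fine_tune: Callable,   # hypothetical: returns a model trained with extra data
    evaluate: Callable,    # hypothetical: returns accuracy (e.g. pass@1) on a benchmark
    leak_ratios=(0.0, 0.25, 0.5, 0.75, 1.0),
    seed: int = 0,
) -> dict:
    """For each leak ratio, inject that fraction of the benchmark into the
    training data, retrain, and re-measure accuracy on the full benchmark."""
    rng = random.Random(seed)
    results = {}
    for ratio in leak_ratios:
        leaked = rng.sample(list(benchmark), int(len(benchmark) * ratio))
        contaminated_model = fine_tune(base_model, extra_data=leaked)
        results[ratio] = evaluate(contaminated_model, benchmark)
    return results
```

A curve that rises with the leak ratio is exactly the accuracy inflation described above.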

Our Solution: Code Kaleidoscope

[Figure: accuracy of contaminated LLMs on original benchmark problems versus problems generated by PPM and DyCodeEval]

To counter this widespread problem of data contamination, we propose Code Kaleidoscope. It provides two main strategies, PPM and DyCodeEval, each of which dynamically generates new benchmark problems for the LLM to solve. The graphs above show the results for each strategy: even when an LLM has been contaminated with leaked benchmark data, its accuracy on the newly generated problems matches its original, uncontaminated accuracy. Code Kaleidoscope therefore enables transparent and accurate benchmarking of LLMs.

Get Started with Code Kaleidoscope

Code Kaleidoscope is available now and is easy to add to your workflow. Using the Python package, you can take an existing set of benchmark problems and generate a new, diverse set of programming problems on which to evaluate your LLM.
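As a rough sketch of that data flow (the file layout and the transform argument are illustrative assumptions, not the package's documented API):

```python
# Sketch of the intended data flow, not the package's actual interface:
# read a seed benchmark, apply a dynamic-benchmarking strategy, and write
# out a fresh set of problems. The JSONL layout and names are assumptions.
import json
from typing import Callable


def regenerate_benchmark(in_path: str, out_path: str,
                         transform: Callable[[dict], dict]) -> None:
    """Apply a per-problem generation strategy to every seed problem."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            problem = json.loads(line)  # e.g. {"prompt": ..., "tests": ...}
            fout.write(json.dumps(transform(problem)) + "\n")
```

Here transform would be a context-diversification step like the one sketched above; a PPM-style pipeline would iterate over pairs of seed problems instead.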