Code Kaleidoscope

Solving Data Contamination through Dynamic Benchmarking

Overview of Code Kaleidoscope

Code Kaleidoscope is a set of dynamic benchmarking strategies for accurate and transparent evaluation of code-generation LLMs. It consists of two main strategies: Programming Problem Merging (PPM) and Context Diversification (DyCodeEval). PPM merges two existing problems into a new, contamination-free problem, while DyCodeEval rewrites the context of an existing problem to produce a new, contamination-free variant.
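To make the two strategies concrete, here is a minimal, self-contained Python sketch. It is an illustration only: the `Problem` class and the `merge_problems` / `diversify_context` helpers are hypothetical and are not part of the Code Kaleidoscope package; they simply mimic the ideas of composing two problems (PPM) and restating a problem in a new scenario (DyCodeEval).

```python
# Toy illustration only: these helpers are hypothetical and are not the
# actual Code Kaleidoscope API; they just sketch the two ideas.
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str       # natural-language description shown to the LLM
    entry_point: str  # name of the function the LLM must implement


def merge_problems(a: Problem, b: Problem) -> Problem:
    """PPM-style idea: compose two existing problems into a new one
    whose solution chains the two original functions."""
    prompt = (
        f"{a.prompt}\n\n{b.prompt}\n\n"
        f"Now write `{a.entry_point}_then_{b.entry_point}(x)` that applies "
        f"`{a.entry_point}` to x and feeds the result into `{b.entry_point}`."
    )
    return Problem(prompt, f"{a.entry_point}_then_{b.entry_point}")


def diversify_context(p: Problem, scenario: str) -> Problem:
    """DyCodeEval-style idea: keep the underlying task but restate it in a
    new surface context (here a simple template; the real method would
    rewrite the scenario more thoroughly)."""
    return Problem(f"In the context of {scenario}: {p.prompt}", p.entry_point)


if __name__ == "__main__":
    sort_p = Problem("Write `sort_list(xs)` that returns xs sorted ascending.", "sort_list")
    dedup_p = Problem("Write `dedup(xs)` that removes duplicates from xs.", "dedup")
    print(merge_problems(sort_p, dedup_p).prompt)
    print(diversify_context(sort_p, "a hospital triage queue").prompt)
```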
Evidence of data contamination

Is data contamination really an issue? Although the data used to train LLMs is undisclosed, benchmarks are public and can easily be scraped into training corpora. In experiments where we leak different percentages of benchmark data into training, LLMs show increasingly inflated accuracy on those benchmarks, demonstrating that contamination does distort evaluation.
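The sketch below outlines, under stated assumptions, how such a leakage experiment can be structured: a chosen fraction of the benchmark is "leaked" into training, and accuracy is then measured on the full benchmark. The `fine_tune` and `evaluate` functions are toy placeholders standing in for the real training and inference pipeline, not actual code from our experiments.

```python
# Hypothetical sketch of a benchmark-leakage experiment; fine_tune and
# evaluate are toy placeholders, not real training or inference code.
import random


def run_leakage_experiment(benchmark, leak_rates=(0.0, 0.25, 0.5, 1.0), seed=0):
    """For each leakage rate, leak that fraction of the benchmark into the
    training mix, then measure benchmark accuracy of the resulting model."""
    rng = random.Random(seed)
    results = {}
    for rate in leak_rates:
        leaked = rng.sample(benchmark, int(rate * len(benchmark)))
        model = fine_tune(base_corpus=["<pretraining data>"], leaked_problems=leaked)
        results[rate] = evaluate(model, benchmark)
    return results


def fine_tune(base_corpus, leaked_problems):
    # Placeholder: a real experiment would continue training an LLM on
    # base_corpus mixed with the leaked benchmark problems.
    return {"memorized": set(leaked_problems)}


def evaluate(model, benchmark):
    # Placeholder: accuracy inflates with the fraction of memorized problems.
    solved = sum(1 for p in benchmark if p in model["memorized"])
    return solved / len(benchmark)


if __name__ == "__main__":
    toy_benchmark = [f"problem_{i}" for i in range(100)]
    print(run_leakage_experiment(toy_benchmark))
```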
Results when Code Kaleidoscope is used

To counter this widespread contamination problem, we propose Code Kaleidoscope. It combines two strategies, PPM and DyCodeEval, each of which dynamically generates new benchmark problems on which the LLM is evaluated. The graphs above show our results for both strategies: even when the LLM has been contaminated with leaked benchmark data, its accuracy on the newly generated problems stays at the original, uncontaminated level. Code Kaleidoscope therefore enables transparent and accurate benchmarking of LLMs!

How to use our package?

Code Kaleidoscope is available now and easy to integrate into your code. Using the Python package, you can take an existing set of benchmark problems and generate a new, diverse set of programming problems on which to evaluate your LLM.
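A usage sketch is shown below. Note that the module name `code_kaleidoscope`, the input file name, and the `ppm.generate` / `dycodeeval.generate` calls are assumptions made for illustration only; consult the package documentation for the actual API.

```python
# Hypothetical usage sketch: the package/module names and the generate()
# functions shown here are assumptions, not the published API.
import json

from code_kaleidoscope import ppm, dycodeeval  # assumed module layout

# Load an existing benchmark (e.g., a list of problems in JSON Lines form).
with open("humaneval.jsonl") as f:
    problems = [json.loads(line) for line in f]

# Generate fresh, contamination-free variants with either strategy.
merged = ppm.generate(problems)              # Programming Problem Merging
diversified = dycodeeval.generate(problems)  # Context Diversification

# Write the new dynamic benchmark for evaluation.
with open("dynamic_benchmark.jsonl", "w") as f:
    for p in merged + diversified:
        f.write(json.dumps(p) + "\n")
```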