Code Kaleidoscope is a comprehensive suite of dynamic benchmarking strategies that offers accurate and transparent evaluation of code generation LLMs. Our approach consists of two main strategies: Programming Problem Merging (PPM) and Context Diversification (DyCodeEval).
Programming Problem Merging focuses on merging two different problems to create a new problem free of contamination.
Context Diversification modifies the context of a problem to create a new problem free of contamination.
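To make the two ideas concrete, here is a toy sketch of what they might look like on a pair of HumanEval-style problems. Everything in it (the `Problem` class, `merge_problems`, `diversify_context`, and the specific rule of chaining one task into the next) is illustrative and hypothetical, not the actual Code Kaleidoscope implementation.

```python
# Toy illustration only; names and merging rules are hypothetical,
# not the Code Kaleidoscope implementation.
from dataclasses import dataclass

@dataclass
class Problem:
    description: str
    signature: str

def merge_problems(a: Problem, b: Problem) -> Problem:
    """Programming Problem Merging (toy version): chain two existing problems
    so the result of the first feeds into the second, producing a new problem
    that cannot appear verbatim in any training corpus."""
    return Problem(
        description=(f"First, {a.description.rstrip('.')}. "
                     f"Then, using that result, {b.description.rstrip('.')}."),
        signature="def merged" + a.signature[a.signature.index("("):],
    )

def diversify_context(p: Problem, scenario: str) -> Problem:
    """Context Diversification (toy version): keep the underlying task but
    restate it in a new scenario, so the surface text no longer matches the
    original benchmark entry."""
    return Problem(
        description=f"In the context of {scenario}: {p.description}",
        signature=p.signature,
    )

sort_problem = Problem("sort a list of integers in ascending order.",
                       "def sort_items(xs):")
dedup_problem = Problem("remove duplicate values while preserving their order.",
                        "def dedup(xs):")

print(merge_problems(sort_problem, dedup_problem).description)
print(diversify_context(sort_problem, "ranking players on a game leaderboard").description)
```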
Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing benchmarking methods, both static and dynamic, aimed at reducing data contamination risks.
We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap—the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks.
Figure: evidence of data contamination; model accuracy under different percentages of leaked benchmark data.
Is data contamination really an issue? Although the exact data used to train LLMs is unknown, benchmarks are public and are easily swept up by the web-scraping pipelines that assemble training corpora. When we run experiments with different percentages of the benchmark leaked into a model's training data, the model shows inflated accuracy on that benchmark, demonstrating a measurable degree of contamination.
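For a feel of how such a leakage experiment is structured, here is a minimal, self-contained sketch. The `run_leakage_experiment` helper, the `mock_evaluate` stand-in, and the numbers are all hypothetical; they are not our actual experimental setup, where evaluation queries a real (possibly contaminated) model.

```python
# Minimal sketch of a benchmark-leakage experiment; helper names and numbers
# are hypothetical, not the paper's actual setup.
import random

def run_leakage_experiment(problems, leak_fraction, evaluate):
    """Split a benchmark into a leaked portion (seen during training) and a
    held-out portion, then compare pass rates on the two splits."""
    random.shuffle(problems)
    cut = int(len(problems) * leak_fraction)
    leaked, held_out = problems[:cut], problems[cut:]
    leaked_acc = sum(evaluate(p, True) for p in leaked) / max(len(leaked), 1)
    held_out_acc = sum(evaluate(p, False) for p in held_out) / max(len(held_out), 1)
    return leaked_acc, held_out_acc

def mock_evaluate(problem, seen_during_training):
    """Toy stand-in for querying a (possibly contaminated) model: memorized
    problems are solved far more often than genuinely novel ones."""
    return random.random() < (0.9 if seen_during_training else 0.4)

if __name__ == "__main__":
    benchmark = [f"problem_{i}" for i in range(200)]
    for frac in (0.0, 0.25, 0.5, 1.0):
        leaked_acc, clean_acc = run_leakage_experiment(list(benchmark), frac, mock_evaluate)
        print(f"leak={frac:.2f}  leaked-split acc={leaked_acc:.2f}  held-out acc={clean_acc:.2f}")
```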
Figure: effectiveness of Code Kaleidoscope; results for the PPM and DyCodeEval strategies.
To counter this widespread issue of data contamination, we propose Code Kaleidoscope. Its two main strategies, PPM and DyCodeEval, each dynamically generate new benchmarking problems on which an LLM can be evaluated. The graphs above show our results for each strategy: even when an LLM has been contaminated with leaked benchmark data, its accuracy on our newly generated problems remains at its original, uncontaminated level. Code Kaleidoscope therefore enables transparent and accurate benchmarking of LLMs!
Code Kaleidoscope is available now and easy to add to your workflow. Using the Python package, you can feed in an existing set of benchmark problems and get back a new, diverse set of programming problems on which to evaluate your LLM.
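A sketch of what that workflow could look like is below. The package, module, and function names (`code_kaleidoscope`, `ppm.merge`, `dycodeeval.diversify`) are assumptions for illustration only; consult the package documentation for the actual API.

```python
# Hypothetical usage sketch; the imports and calls named in comments are
# assumptions, not the package's confirmed API.
import json

# from code_kaleidoscope import ppm, dycodeeval  # hypothetical import

def generate_dynamic_benchmark(seed_path: str, out_path: str) -> None:
    """Read an existing (possibly leaked) benchmark and write out a freshly
    generated, diversified problem set to evaluate an LLM on."""
    with open(seed_path) as f:
        seed_problems = json.load(f)  # e.g. a list of HumanEval-style records

    new_problems = []
    for record in seed_problems:
        # Placeholders for the package's real entry points (names are hypothetical):
        #   new_problems.append(ppm.merge(record, other_record))   # Programming Problem Merging
        #   new_problems.append(dycodeeval.diversify(record))      # Context Diversification
        new_problems.append(record)  # no-op fallback so this sketch runs as-is

    with open(out_path, "w") as f:
        json.dump(new_problems, f, indent=2)

# generate_dynamic_benchmark("humaneval_seed.json", "kaleidoscope_benchmark.json")
```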