[Figure: graphs of benchmarking results with Code Kaleidoscope in use]
To counter this widespread issue of data contamination, we propose Code Kaleidoscope.
Code Kaleidoscope offers two main strategies, PPM and DyCodeEval, each of which
dynamically generates new benchmarking problems on which the LLM can be evaluated.
The graphs above show our results for each strategy. Even when an LLM has been
contaminated with leaked benchmarking data, it still scores at its original accuracy on
the newly generated problems, rather than at an inflated, memorization-driven level.
Code Kaleidoscope therefore enables transparent and accurate benchmarking of LLMs!
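
To make the dynamic-generation idea concrete, here is a minimal sketch of one way a
seed problem could be mutated into a fresh variant. The `Problem` class, the
`negate_predicate` operator, and the seed problem itself are all illustrative
assumptions of ours, not Code Kaleidoscope's actual API or one of PPM's or
DyCodeEval's documented transformations.

```python
# Minimal, illustrative sketch of dynamic benchmark-problem generation.
# The operator below (flipping a boolean predicate) is a hypothetical
# semantics-altering mutation, not necessarily one PPM or DyCodeEval uses.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Problem:
    prompt: str   # natural-language task description
    tests: list   # (input, expected_output) pairs

def negate_predicate(problem: Problem) -> Problem:
    """Flip the expected boolean of every test and reword the prompt,
    producing a variant a contaminated model cannot have memorized."""
    new_tests = [(inp, not out) for inp, out in problem.tests]
    new_prompt = (problem.prompt
                  .replace("Return True", "Return False")
                  .replace("otherwise False", "otherwise True"))
    return replace(problem, prompt=new_prompt, tests=new_tests)

# Hypothetical HumanEval-style seed problem.
seed = Problem(
    prompt="Return True if n is even, otherwise False.",
    tests=[(2, True), (3, False), (10, True)],
)

variant = negate_predicate(seed)
print(variant.prompt)  # "Return False if n is even, otherwise True."
print(variant.tests)   # [(2, False), (3, True), (10, False)]
```

Because the variant's semantics differ from the seed's, a model that merely memorized
the seed's leaked solution will fail the new tests, while a model that genuinely
reasons about the prompt will adapt to it.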