Contaminated Model Evaluation
To assess how data contamination affects benchmark validity, and whether DyCodeEval mitigates it, we ran
experiments with three models in which we simulated contamination by fine-tuning each model on increasing
percentages of the benchmark dataset. The top row shows that every model fine-tuned on a leaked dataset gains
accuracy rapidly when evaluated on that same dataset. In the bottom row, however, the models' accuracy remains
stable even after the dataset has leaked, because DyCodeEval, our dynamic benchmarking method, generates the
evaluation problems at test time rather than reusing the static set.
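As a toy illustration of this contrast (not our actual experimental pipeline), the sketch below models a contaminated evaluator: a fine-tuned model is assumed to answer any memorized (leaked) problem correctly and otherwise succeeds with some base ability. The `base_acc` value, problem counts, and memorization model are all hypothetical assumptions chosen only to show why static accuracy inflates with leakage while dynamic accuracy does not.

```python
import random


def static_eval(n_problems, leaked_fraction, base_acc=0.4, seed=0):
    """Accuracy on the static benchmark: memorized (leaked) problems are
    always solved; the rest are solved with probability base_acc."""
    rng = random.Random(seed)
    leaked = set(rng.sample(range(n_problems), int(leaked_fraction * n_problems)))
    correct = sum(1 for i in range(n_problems)
                  if i in leaked or rng.random() < base_acc)
    return correct / n_problems


def dynamic_eval(n_problems, base_acc=0.4, seed=0):
    """Accuracy under dynamic benchmarking: each problem is freshly generated
    at evaluation time, so memorizing the static set gives no advantage and
    only the base ability matters."""
    rng = random.Random(seed + 1)
    correct = sum(1 for _ in range(n_problems) if rng.random() < base_acc)
    return correct / n_problems


if __name__ == "__main__":
    for frac in (0.0, 0.25, 0.5, 1.0):
        print(f"leak={frac:.2f}  "
              f"static={static_eval(200, frac):.2f}  "
              f"dynamic={dynamic_eval(200):.2f}")
```

Under this toy model, static accuracy climbs toward 1.0 as the leaked fraction grows, while the dynamic score stays flat, mirroring the top and bottom rows of the figure.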