I recently started looking at a Kaggle Challenge about predicting poverty levels in Costa Rica. I used sklearn train_test_split to split the training data into train and validation sets and fit a few models. The first thing I noticed was that my submissions scored significantly lower than my validation sets: 0.36 on the submission vs. .96 on my validation data.
The data consists of information about individuals with the target as their poverty level. The features include both information relating to that individual as well as information for the household they live in. The data includes multiple individuals from the same household, and some exploratory data analysis indicated that most of the features were on a household level rather than the individual level.
This means that doing a random split ends up including data from the same household in both the train and validation datasets, which will result in the leakage that artificially raised my initial validation scores. This also means that my models were all tuned on a validation dataset which was essentially useless.
To fix this I did the split on unique household IDs, so no household would be included in both datasets. After re-tuning the models appropriately, the validation f1 scores had gone down from 0.96 to 0.65. The submissions scores went up to 0.41, which was not a huge increase, but it was much closer to the validation scores.
The moral of this story is never forget to make sure that your training and validation sets don't contain overlap or leakage, or the validation set becomes useless.