Information (or data) leakage is undesired behavior in machine learning in which information that should not be available at training time makes its way into the training data. It inflates the model's apparent performance during training and evaluation, leading to poor performance at prediction time or in production. Models subject to information leakage do not generalize well to unseen data.
There are multiple types of data leakage, including:

- Target leakage: the features contain information that will not be available at prediction time, such as a value derived from the target itself.
- Train–test contamination: information from the validation or test set bleeds into training, for example when preprocessing statistics are computed on the full dataset before splitting.
- Temporal leakage: for time-series data, the model is trained on observations that occur after the ones it is asked to predict.
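Train–test contamination in particular is easy to introduce by accident. The sketch below (a minimal illustration on synthetic data; the variable names are ours, not from any particular library) normalizes the same training rows two ways: once with statistics computed over the full dataset, and once with statistics computed on the training rows only. The two results differ, which shows that the first version let the test rows influence the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=100)
train, test = X[:80], X[80:]

# Leaky: scaling statistics computed on ALL rows, so the test rows
# influence how the training rows are transformed.
leaky_train = (train - X.mean()) / X.std()

# Correct: scaling statistics computed on the training rows only.
clean_train = (train - train.mean()) / train.std()

# The two versions of the "same" training data differ, proving that
# information from the test set leaked into the leaky version.
print(bool(np.allclose(leaky_train, clean_train)))
```

The same principle applies to any preprocessing step fitted to data, such as imputation or encoding: fit it on the training split only, then apply it to the other splits.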
Avoiding or detecting information leakage early is important: it prevents models from learning the wrong signals and overestimating their performance before they go into production.
In addition to following data science best practices, model interpretability is a powerful tool for identifying and fighting information leakage: a feature with implausibly strong predictive power is often a sign that it leaks the target.
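As a minimal sketch of how such an interpretability-style check can flag leakage, the snippet below (synthetic data, hypothetical feature names) compares the target correlation of an honest feature with that of a feature that is effectively a near-copy of the target. The near-perfect correlation of the second feature is the kind of red flag a data scientist would investigate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
y = rng.normal(size=n)

honest_feature = 0.3 * y + rng.normal(size=n)        # weak, legitimate signal
leaked_feature = y + rng.normal(scale=0.01, size=n)  # near-copy of the target

# A simple interpretability check: absolute correlation with the target.
# A feature correlated this closely with the target usually means leakage.
corr_honest = abs(np.corrcoef(honest_feature, y)[0, 1])
corr_leaked = abs(np.corrcoef(leaked_feature, y)[0, 1])
print(round(corr_honest, 2), round(corr_leaked, 2))
```

In practice the same idea is applied with richer tools, such as feature-importance scores from a trained model, but the symptom is identical: one feature explains the target far too well.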
At C3 AI, data scientists are well-versed in information leakage problems and how to detect them. C3 AI carefully splits the data into separate groups – training, validation, and test sets – and holds the test set out, using it only to report final performance after the model has been tuned on the validation set. For time-series data, C3 AI applications always apply a cut-off timestamp or time-series cross-validation.
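The cut-off idea can be sketched in a few lines of plain Python (hypothetical daily records; the dates and variable names are illustrative only): everything strictly before the cut-off timestamp trains the model, everything at or after it evaluates the model, so no future observation can leak into training.

```python
from datetime import datetime, timedelta

# Hypothetical daily time-series records: (timestamp, value).
start = datetime(2023, 1, 1)
records = [(start + timedelta(days=i), float(i)) for i in range(30)]

# Cut-off split: train on the past, evaluate on the future.
cutoff = datetime(2023, 1, 22)
train = [r for r in records if r[0] < cutoff]
test = [r for r in records if r[0] >= cutoff]

# Every training timestamp precedes every test timestamp.
assert max(t for t, _ in train) < min(t for t, _ in test)
print(len(train), len(test))  # prints 21 9
```

Time-series cross-validation generalizes this to several expanding train windows, each evaluated on the period that immediately follows it; scikit-learn's `TimeSeriesSplit` implements that scheme.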