Data Leakage in Biomedical Knowledge Graph Link Prediction: A Benchmark Study

Seminar
Location: IRISA Rennes
Room: Aurigny
Speaker: Marie-Galadriel Brière (Université Aix-Marseille)

In recent years, Biomedical Knowledge Graphs (KGs) have gained significant attention for their potential to infer novel biological interactions through link prediction. KGs are structured representations that organize extensive multi-scale biomedical information into entities, attributes, and relationships. Knowledge Graph Embedding (KGE) models enable efficient exploration of KGs by learning compact data representations, providing systematic frameworks to predict new biological interactions. These models are increasingly used in drug repurposing, helping identify new therapeutic applications for existing drugs.

While numerous KGE models have been developed and benchmarked, existing evaluations have largely neglected the critical issue of data leakage. Data leakage occurs when a model inadvertently learns, during training, from information that should have remained hidden until evaluation. This leakage inflates performance metrics and compromises the validity of benchmark results.

In this study, we systematically explore and benchmark popular KGE methods for link prediction in the biomedical domain, controlling for data leakage to provide a more realistic assessment of model performance. We identify three sources of data leakage: (1) leakage arising from insufficient separation between the training and test sets, (2) leakage due to the model’s use of illegitimate features, and (3) leakage stemming from a misalignment where the test set does not reflect the target data distribution of interest. By accounting for these sources of data leakage, we conduct a more accurate evaluation of the most popular KGE models, offering insights into their true predictive capabilities and limitations.
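As an illustration of the first source only (insufficient separation between training and test sets), here is a minimal Python sketch of one possible check. It is not the method presented in the talk. It assumes triples are stored as (head, relation, tail) tuples and flags any test triple whose entity pair already occurs in training, in either direction and under any relation. The function name and the drug, gene, and disease identifiers are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def find_leaky_test_triples(
    train: Iterable[Triple], test: Iterable[Triple]
) -> List[Triple]:
    """Return test triples whose entity pair already appears in training.

    A pair counts as seen if it occurs in either direction and under any
    relation, which catches exact duplicates, reversed edges, and
    near-duplicate relations between the same two entities.
    """
    # Unordered entity pairs observed in training (direction is ignored).
    seen_pairs: Set[frozenset] = {frozenset((h, t)) for h, _, t in train}
    return [(h, r, t) for h, r, t in test if frozenset((h, t)) in seen_pairs]


# Hypothetical toy data: the first test triple is the inverse of a training edge.
train_triples = [
    ("drugA", "treats", "diseaseX"),
    ("geneG", "associated_with", "diseaseX"),
]
test_triples = [
    ("diseaseX", "treated_by", "drugA"),
    ("drugB", "treats", "diseaseX"),
]

leaky = find_leaky_test_triples(train_triples, test_triples)
print(leaky)
```

A check of this kind is deliberately conservative: it may also flag entity pairs that are legitimately linked by several distinct relations, so flagged triples are candidates for review rather than proven leakage.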

For internal attendees

Symbiose seminars: https://www.cesgo.org/symbiose/seminars/data-leakage-in-biomedical-know…