Recent work has shown that machine learning can provide a reliable tool to classify somatic and rare germline variants in cancer studies where matched-normal samples are not available. Here, we present a workflow that combines an open-source pipeline with three machine-learning models, XGBoost, LightGBM, and TabNet, trained on eight types of features. Our approach substantially improves accuracy across all tested models and gives accurate results irrespective of sample ancestry and tumour type. We build a parsimonious model and demonstrate that training on low-coverage data retains high accuracy when applied to high-coverage data, and vice versa. In contrast to previous findings, our results indicate that XGBoost slightly outperforms LightGBM, achieving high classification accuracy even in the absence of copy-number information and allowing an ancestry-unbiased calculation of the tumour mutational burden for different types of cancer.
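To illustrate the final step mentioned in the abstract, tumour mutational burden is conventionally reported as the number of somatic mutations per megabase of sequenced region. The sketch below is not the authors' implementation. The `Variant` record, the `p_somatic` score (standing in for a classifier's output), the 0.5 decision threshold, and the 2 Mb callable region are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    # Hypothetical classifier output: probability that the variant is somatic.
    p_somatic: float


def tumour_mutational_burden(variants, callable_megabases, threshold=0.5):
    """Return somatic mutations per megabase.

    A variant is counted as somatic when p_somatic >= threshold
    (the threshold is an assumption, not a value from the paper).
    """
    if callable_megabases <= 0:
        raise ValueError("callable_megabases must be positive")
    n_somatic = sum(1 for v in variants if v.p_somatic >= threshold)
    return n_somatic / callable_megabases


# Made-up example calls from a tumour-only sample.
variants = [
    Variant("chr1", 1000, 0.97),
    Variant("chr2", 2000, 0.08),
    Variant("chr7", 3000, 0.81),
    Variant("chr17", 4000, 0.40),
]

# Two of the four variants pass the threshold over an assumed 2 Mb region.
tmb = tumour_mutational_burden(variants, callable_megabases=2.0)
```

Because the count depends directly on which variants are labelled somatic, misclassified germline variants inflate the estimate, which is why classification accuracy matters for this quantity.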
Accurate variant classification in tumour-only genomic data using interpretable tabular models
Submitted to bioRxiv, 12 December 2025
Type:
Report
Date:
2025-12-12
Department:
Data Science
Eurecom Ref:
8547
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was submitted to bioRxiv, 12 December 2025 and is available at :
See also:
PERMALINK : https://www.eurecom.fr/publication/8547