Stochastic kNN-based imputation for recovering missing value distributions: Applications to uncertainty quantification in solar power forecasting

Pashmchi, Parastoo
Thesis

Data quality is a common challenge in data-driven models. One factor that significantly impacts this quality is the presence of missing values. Missing values, which occur for various reasons and across different fields, can substantially alter the statistical properties of the data; thus, ignoring them can introduce biases into the results. A common approach is to ignore or delete observations with missing values, but this reduces the sample size and may alter the dataset. On the other hand, imputation is the standard approach for addressing these gaps by filling them with estimated values. Widely used techniques like kNNImputer typically rely on estimating the conditional mean of the missing response. This thesis shows that such deterministic, regression-based methods fail to accurately recover the true underlying distribution of the missing data, thereby leading to distorted data structures and a considerable underestimation of uncertainty. To address these limitations, this thesis presents kNNSampler, a new stochastic imputation technique designed to maintain the distributional characteristics of missing values. Unlike traditional methods, kNNSampler estimates the conditional distribution of a missing response given a covariate as the empirical distribution of the observed responses of its k-nearest neighbours. By randomly drawing imputations from this distribution, the method captures the natural variability within the data. We establish a theoretical basis for this method by examining the convergence of the mean embedding of the kNN conditional distribution in a Reproducing Kernel Hilbert Space (RKHS). We develop error bounds that establish the estimator’s statistical consistency, showing that it converges to the true conditional distribution as the sample size grows. Empirical tests on synthetic and real datasets show that kNNSampler performs favorably in recovering missing-value distributions, as measured by the energy distance. Lastly, the proposed imputation framework is applied to the industrial challenge of forecasting solar photovoltaic (PV) power.
Motivated by the prevalence of missing data in PV assets at SAP Labs France, we develop a prediction model using the Multiple Imputation (MI) framework supported by kNNSampler. By generating multiple plausible imputed datasets, this approach enables the estimation of reliable prediction intervals that explicitly quantify total uncertainty, incorporating both variability from missing data and residual predictive uncertainty. This framework offers a more reliable basis for energy management systems by providing accurate forecasting even with incomplete historical data. Results show that MI-kNNSampler improves uncertainty calibration relative to kNNImputer, while point prediction accuracy remains similar. The kNNSampler multiple imputation is shown to be a practical method for handling missing data and supporting subsequent downstream models. 
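
The sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`knn_sampler`, `knn_mean`, `energy_distance`), the Euclidean metric, the choice of k, and the toy sine data are all assumptions made for illustration. For each unit with a missing response, the sketch finds the k nearest observed covariates and draws uniformly from their responses, that is, from the empirical kNN conditional distribution. The conditional-mean variant is included only as a rough stand-in for deterministic kNN mean imputation, and the energy distance is written in its squared, plug-in form, which may differ from the definition used in the thesis.

```python
import numpy as np

def _neighbour_indices(x_obs, x_miss, k):
    """Indices of the k nearest observed covariates (Euclidean) per missing unit."""
    d2 = ((x_miss[:, None, :] - x_obs[None, :, :]) ** 2).sum(axis=2)
    return np.argpartition(d2, k - 1, axis=1)[:, :k]

def _as_2d(x_obs, y_obs, x_miss):
    y_obs = np.asarray(y_obs, dtype=float)
    x_obs = np.asarray(x_obs, dtype=float).reshape(len(y_obs), -1)
    x_miss = np.asarray(x_miss, dtype=float).reshape(-1, x_obs.shape[1])
    return x_obs, y_obs, x_miss

def knn_sampler(x_obs, y_obs, x_miss, k=5, n_draws=1, rng=None):
    """Stochastic imputation: draw uniformly from the responses of the k neighbours.

    Returns an array of shape (n_miss, n_draws); each column can serve as one
    imputed dataset in a multiple-imputation workflow.
    """
    rng = np.random.default_rng(rng)
    x_obs, y_obs, x_miss = _as_2d(x_obs, y_obs, x_miss)
    nn_idx = _neighbour_indices(x_obs, x_miss, k)
    pick = rng.integers(0, k, size=(x_miss.shape[0], n_draws))
    return y_obs[np.take_along_axis(nn_idx, pick, axis=1)]

def knn_mean(x_obs, y_obs, x_miss, k=5):
    """Deterministic baseline: average of the k neighbours' responses."""
    x_obs, y_obs, x_miss = _as_2d(x_obs, y_obs, x_miss)
    return y_obs[_neighbour_indices(x_obs, x_miss, k)].mean(axis=1)

def energy_distance(a, b):
    """Squared energy distance between two 1-D samples (plug-in estimate)."""
    a, b = np.ravel(a), np.ravel(b)
    ab = np.abs(a[:, None] - b[None, :]).mean()
    aa = np.abs(a[:, None] - a[None, :]).mean()
    bb = np.abs(b[:, None] - b[None, :]).mean()
    return 2.0 * ab - aa - bb

# Toy data (assumed): noisy sine curve with about 30% of responses missing.
data_rng = np.random.default_rng(0)
x = data_rng.uniform(-3.0, 3.0, size=500)
y = np.sin(x) + data_rng.normal(scale=0.5, size=500)
miss = data_rng.random(500) < 0.3

draws = knn_sampler(x[~miss], y[~miss], x[miss], k=10, n_draws=20, rng=1)
means = knn_mean(x[~miss], y[~miss], x[miss], k=10)

print("variance of true missing values:", y[miss].var())
print("variance of sampled imputations:", draws.var())
print("variance of mean imputations:   ", means.var())
print("energy distance, sampler vs truth:", energy_distance(draws[:, 0], y[miss]))
print("energy distance, mean vs truth:   ", energy_distance(means, y[miss]))
```

Averaging the neighbours' responses discards the spread of the response around its conditional mean, whereas sampling from them keeps it. Each column of `draws` is one plausible completion of the data, so fitting a downstream forecasting model on each column and pooling the resulting predictions is one way to carry imputation uncertainty into prediction intervals.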

Type:
Thesis
Date:
2026-03-09
Department:
Data Science
Eurecom Ref:
8561
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at :

PERMALINK : https://www.eurecom.fr/publication/8561