ORIGINAL RESEARCH
Reconstruction of Missing Values at PM2.5 Monitoring Sites Combining K-Shape Clustering and Conditional Score-Based Diffusion Models for Imputation
School of Geomatics, Anhui University of Science and Technology, Huainan 232001, China
 
 
Submission date: 2024-11-01
 
 
Final revision date: 2025-02-23
 
 
Acceptance date: 2025-03-17
 
 
Online publication date: 2025-05-09
 
 
Corresponding author
Zhen Zhang   

School of Geomatics, Anhui University of Science and Technology, Huainan 232001, China
 
 
 
ABSTRACT
PM2.5 is a major contributor to air pollution, and complete air quality monitoring data are key to its effective prevention and control. However, real-time monitoring data contain many missing values owing to instability of the monitoring system, machine failures, or human error. Taking the Yangtze River Delta (YRD) region as an example, this study compared how well several algorithms fill gaps in ground-based PM2.5 concentration monitoring data, then selected the best-performing algorithm and combined it with K-Shape clustering partitions to impute the missing PM2.5 concentrations. The results showed that Conditional Score-based Diffusion Models for Imputation (CSDI) achieved higher imputation accuracy than Autoregressive Integrated Moving Average (ARIMA), K-Nearest Neighbors (KNN), and Multiple Imputation (MI). When CSDI was applied to historical YRD PM2.5 data within the K-Shape partitions, Partition III had the highest accuracy and Partition II the lowest. This difference was attributed to both the clustering accuracy and the inherent PM2.5 fluctuation characteristics of each partition. Analysis of the daily variation of PM2.5 concentrations in the different partitions showed that CSDI imputation errors in the YRD region were largest at approximately 9 am, 3 pm, and 9 pm. These findings have significant implications for air quality monitoring and PM2.5 concentration prediction.
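The comparison described in the abstract relies on scoring how well an algorithm fills gaps. The sketch below illustrates one common way to do this: hide values that were actually observed, impute them, and measure the error on the hidden points only. It is not the authors' pipeline. The site readings are hypothetical, the neighbour-averaging imputer is a simplified stand-in for the KNN baseline, and the use of MAE and RMSE as error metrics is an assumption, since the abstract does not name the metrics.

```python
import math


def knn_impute(target, neighbours, k=2):
    """Fill None gaps in `target` with the mean of the k most similar sites."""
    def distance(other):
        # Mean absolute difference over hours observed at both sites.
        pairs = [(a, b) for a, b in zip(target, other)
                 if a is not None and b is not None]
        if not pairs:
            return math.inf
        return sum(abs(a - b) for a, b in pairs) / len(pairs)

    nearest = sorted(neighbours, key=distance)[:k]
    filled = []
    for hour, value in enumerate(target):
        if value is not None:
            filled.append(value)
            continue
        donors = [site[hour] for site in nearest if site[hour] is not None]
        filled.append(sum(donors) / len(donors) if donors else None)
    return filled


def mae(truth, estimate, hidden):
    """Mean absolute error over the artificially hidden positions."""
    return sum(abs(truth[i] - estimate[i]) for i in hidden) / len(hidden)


def rmse(truth, estimate, hidden):
    """Root mean square error over the artificially hidden positions."""
    squared = sum((truth[i] - estimate[i]) ** 2 for i in hidden)
    return math.sqrt(squared / len(hidden))


# Hypothetical hourly PM2.5 readings (ug/m3) at one target site.
truth = [35.0, 40.0, 52.0, 60.0, 48.0, 41.0]
hidden = [2, 4]  # hours removed on purpose to simulate missing data
masked = [None if i in hidden else v for i, v in enumerate(truth)]

# Hypothetical readings at three other sites; the third behaves differently.
neighbours = [
    [34.0, 41.0, 50.0, 58.0, 50.0, 40.0],
    [36.0, 39.0, 56.0, 62.0, 47.0, 42.0],
    [80.0, 85.0, 90.0, 95.0, 100.0, 105.0],
]

filled = knn_impute(masked, neighbours, k=2)
mae_value = mae(truth, filled, hidden)
rmse_value = rmse(truth, filled, hidden)
print(filled, mae_value, rmse_value)
```

With these toy numbers the two hidden hours are filled with 53.0 and 48.5 against true values of 52.0 and 48.0, giving an MAE of 0.75. Any other imputer, including ARIMA, MI, or CSDI, could be scored on the same hidden positions for a like-for-like comparison.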
CONFLICT OF INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
eISSN:2083-5906
ISSN:1230-1485