Evaluating the Impact of Data Quality on Predictive Performance of Deep Learning Models in Complex Datasets
Keywords:
Data Quality, Deep Learning, Predictive Performance, Complex Datasets, Noise Robustness, Missing Data, Model Generalization
Abstract
The efficacy of deep learning (DL) models is intrinsically linked to the quality of the data on which they are trained and evaluated. This paper investigates the multifaceted impact of data quality dimensions—including completeness, accuracy, consistency, and relevance—on the predictive performance of DL models applied to complex, real-world datasets. Through a structured review and simulated experiments, we demonstrate that degradations in data quality, such as injected noise, missing values, and label errors, produce significant and non-linear declines in model accuracy, generalization, and robustness. The findings underscore that data quality is not merely a preprocessing concern but a foundational determinant of model success, necessitating rigorous data-centric practices throughout the machine learning pipeline.
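The experimental protocol sketched in the abstract—degrade one data quality dimension while holding the rest of the pipeline fixed, then measure test performance on clean data—can be illustrated with a minimal simulation. The sketch below is an assumption-laden stand-in, not the paper's actual setup: it uses scikit-learn's synthetic data and a logistic regression model as a lightweight proxy for a DL model, and the `accuracy_under_label_noise` helper and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_under_label_noise(noise_rate: float, seed: int = 0) -> float:
    """Train on labels with a fraction `noise_rate` flipped; evaluate on clean test labels."""
    X, y = make_classification(
        n_samples=2000, n_features=20, n_informative=10, random_state=seed
    )
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=seed)

    # Inject symmetric label noise into the training split only.
    rng = np.random.default_rng(seed)
    flip = rng.random(len(y_tr)) < noise_rate
    y_noisy = np.where(flip, 1 - y_tr, y_tr)  # flip binary labels where selected

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    return accuracy_score(y_te, model.predict(X_te))


for rate in (0.0, 0.2, 0.4):
    print(f"label-noise rate {rate:.1f}: clean-test accuracy {accuracy_under_label_noise(rate):.3f}")
```

Evaluating on an uncorrupted test split isolates the effect of training-label quality; sweeping `noise_rate` over a finer grid would expose the non-linear shape of the degradation curve that the paper describes.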