Scalable Machine Learning Algorithms for Processing High-dimensional Data: A Case Study Approach

Oluwafunmilayo Ifeoluwa Somoye; Chimdi Walter Ndubuisi; Samuel Donatus; Felix Amakye

doi:10.9734/ajarr/2025/v19i61064

Published: 2025-06-25

DOI: 10.9734/ajarr/2025/v19i61064

Page: 348-367

Issue: 2025 - Volume 19 [Issue 6]

Oluwafunmilayo Ifeoluwa Somoye *

Department of Computer Science, Khoury College, Northeastern University, United States.

Chimdi Walter Ndubuisi

Department of Electrical and Computer Engineering, University of Missouri-Columbia, United States.

Samuel Donatus

Department of Mechanical Engineering, University of South Florida, USA.

Felix Amakye

University of Tennessee, Knoxville, United States.

*Author to whom correspondence should be addressed.

Abstract

The rapid growth of high-dimensional data in industries such as cybersecurity and finance is driving the need for highly scalable machine learning (ML) algorithms. Traditional models struggle under heavy computational load and suffer reduced predictive performance when handling very large data sets. This study explored scalable ML approaches using a dataset of 100,000 samples and 5,000 features. Preprocessing involved normalization and missing-value imputation, followed by dimensionality reduction using Principal Component Analysis (PCA) and Recursive Feature Elimination (RFE). Deep neural networks (DNNs) and other models such as XGBoost were trained using TensorFlow on distributed frameworks. DNNs attained the highest accuracy, 94.7%, but required 180 minutes of training, while XGBoost balanced performance and efficiency with an accuracy of 91.2% in 72 minutes. RFE maintained accuracy with only a 1% drop, while PCA reduced computation time by about 40% at the cost of a 2-3% reduction in accuracy. A one-way ANOVA (F = 38.52, p < 0.001) indicated significant differences in efficiency across the models. Distributed XGBoost achieved a speedup of about 3.5×, demonstrating practical scalability in a moderately distributed setting. While deep learning models delivered the highest accuracy, tree-based models such as XGBoost offered a better overall trade-off between accuracy and efficiency. Future research should focus on adaptive dimensionality reduction approaches and on hybrid models in order to increase scalability without significantly sacrificing prediction accuracy.
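For readers unfamiliar with the ANOVA F statistic quoted in the abstract, the following is a minimal sketch of how a one-way ANOVA F value is computed from grouped measurements. It is not the authors' analysis code. The function name `one_way_anova_f` and the toy values are assumptions made purely for illustration, and the numbers are not taken from the study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of measurements per group
    (for example, one list of run times per model).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual measurements around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy, hypothetical measurements for three groups (NOT data from the study).
toy_groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]

f_stat, df_between, df_within = one_way_anova_f(toy_groups)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F value means the differences between group means are large relative to the spread within each group. The p-value is then read from the F distribution with the two degrees of freedom shown.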

Keywords: High-dimensional data, scalable machine learning, dimensionality reduction, computational efficiency, model optimization

How to Cite

Somoye, Oluwafunmilayo Ifeoluwa, Chimdi Walter Ndubuisi, Samuel Donatus, and Felix Amakye. 2025. “Scalable Machine Learning Algorithms for Processing High-Dimensional Data: A Case Study Approach”. Asian Journal of Advanced Research and Reports 19 (6):348-67. https://doi.org/10.9734/ajarr/2025/v19i61064.
