Scalable Machine Learning Algorithms for Processing High-dimensional Data: A Case Study Approach
Oluwafunmilayo Ifeoluwa Somoye*
Department of Computer Science, Khoury College, Northeastern University, United States.
Chimdi Walter Ndubuisi
Department of Electrical and Computer Engineering, University of Missouri-Columbia, United States.
Samuel Donatus
Department of Mechanical Engineering, University of South Florida, United States.
Felix Amakye
University of Tennessee, Knoxville, United States.
*Author to whom correspondence should be addressed.
Abstract
The rapid growth of high-dimensional data in industries such as cybersecurity and finance is driving the need for highly scalable machine learning (ML) algorithms. Traditional models struggle under heavy computational load and suffer reduced predictive performance when handling massive datasets. This study explored scalable ML approaches using a dataset of 100,000 samples and 5,000 features. Preprocessing involved normalization and missing-value imputation, followed by dimensionality reduction using Principal Component Analysis (PCA) and Recursive Feature Elimination (RFE). Deep neural networks (DNNs) and other models such as XGBoost were trained using TensorFlow on distributed frameworks. DNNs attained the highest accuracy, 94.7%, but required 180 minutes of training, while XGBoost balanced performance and efficiency with an accuracy of 91.2% in 72 minutes. RFE maintained accuracy with only a 1% drop, while PCA reduced computation time by about 40% at the cost of a modest 2-3% reduction in accuracy. An ANOVA (F = 38.52, p < 0.001) indicated significant differences in efficiency across the models. Distributed XGBoost achieved roughly a 3.5× speedup, demonstrating practical scalability in a moderately distributed setting. Although DNNs achieved the highest accuracy, tree-based models such as XGBoost offered a better overall balance between accuracy and training cost. Future research should focus on adaptive dimensionality reduction approaches and hybrid models that increase scalability without significantly sacrificing prediction accuracy.
Keywords: High-dimensional data, scalable machine learning, dimensionality reduction, computational efficiency, model optimization
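As an illustration of the kind of ANOVA comparison reported in the abstract, the sketch below computes a one-way F statistic across groups of measurements. It assumes a one-way design, which the abstract does not state; the function name `one_way_anova_f` and the toy numbers are hypothetical placeholders, not the study's data or code.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of measurements per model.
    Hypothetical helper for illustration only.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more samples than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each measurement around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy placeholder measurements for three hypothetical models
# (NOT values from the study).
toy_groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]

f_value, df_b, df_w = one_way_anova_f(toy_groups)
print(f"F({df_b}, {df_w}) = {f_value:.2f}")
```

A large F value means the differences between group means are large relative to the spread within each group. Converting F to a p-value requires the F distribution with the returned degrees of freedom, which is left out here to keep the sketch dependency-free.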