Parkinson’s Disease Prediction System in Machine Learning

. Around the globe, thousands of people worldwide are suffering by Parkinson's Disease (PD), a central nervous system degenerative condition. Early detection and diagnosis of PD is crucial for successful treatment and management of the disease. In past few years, Machine learning (ML) algorithms has shown great potential in predicting PD based on various physiological and neurological markers. In this disease prediction system, a system is proposed using ML-based approach to predict the presence of PD in patients. The system employs various machine learning models, including Gradient Boosted Tree, random forest, and logistic regression, to identify key markers and patterns associated with the disease. Overall, this disease prediction system provides a valuable tool for early detection and diagnosis of PD, which can lead to better management and treatment of the disease. The proposed approach can also be extended to other neurological disorders, providing a general framework for disease prediction and diagnosis.


Introduction
Thousands of people globally suffer from Parkinson's disease (PD), a devastating neurological disorder. Dopaminergic neurons gradually disappear in PD, which causes a some of physical and non-physical symptoms include tremors, rigidity, and poor balance. Early detection and diagnosis of PD are critical for successful treatment and disease management.
The use of machine learning technology emerged as a promising approach to predicting PD in patients. With the increasing availability of large-scale datasets and advancements in machine learning algorithms, it is now possible to create reliable prediction algorithms that can identify individuals at high risk of acquiring PD before symptoms even manifest.
In [1] Study, these models are typically based on a range of clinical, genetic, and imaging data that can be used to identify key biomarkers associated with PD. Machine learning algorithms can then be trained to analyze these biomarkers and identify patterns and associations between them and the development of PD. In last few years, machine learning (ML) algorithms has shown great potential in predicting and diagnosing PD based on various physiological and neurological markers.
These ML-based approaches can analyze large and complex datasets, identify patterns and relationships between different variables, and make accurate predictions about the presence of PD. One popular approach to predicting PD using machine learning is through the use of support vector machines (SVMs). An example of a supervised learning algorithm is the SVM, which can be trained to categorize data based on particular features. SVMs can be trained on enormous datasets of patient data in the case of PD prediction in order to pinpoint specific traits that are indicative of PD. In this context, a system is proposed for detection of PD using machine learning. The system [2] employs various machine learning models, including gradient boosted tree, random forest, and logistic regression, to identify key markers and patterns associated with the disease. By analyzing a range of physiological and neurological markers, including clinical records, brain imaging, and genetic information, our system can predict with high accuracy, sensitivity, and specificity.

Related Work
This paper [3] surveys various machine learning algorithms for predicting Parkinson's disease. Among them are Decision tree models, random forests, machines with support vectors, artificial neural networks and other algorithms. The algorithms' accuracy spans from 70% to 99%, with certain algorithms performing better than others.
The study in paper [4] found that SVM showed good accuracy (88.9%) compared to other algorithms, and Random Forest had the highest accuracy of 90.26% while Naïve Bayes had the lowest level accuracy of 69.23%. Hierarchical clustering and SOM were also used, predicting higher numbers of clusters in healthy datasets.
In the paper [5], with 34 support vectors, the Nu-SVM model depending on the Gaussian method was shown to have the maximum sensitivity and overall accuracy. The research presents an ensemble learning approach for utilising machine learning to predict early warning signals of Parkinson's disease. The proposed model surpasses existing approaches such as SVM, KNN, RF, DT, MLP, SC, and LR, with an accuracy of 94.87%.
In this study [6], functional MRI (fMRI) data were used to discover brain activity patterns linked to optimal and nonoptimal deep brain stimulation (DBS) settings in Parkinson's disease (PD) patients. achieving 88% accuracy in forecasting optimal vs. non-optimal circumstances.
In paper [7] The analysis reveals that patients can be classified into three subtypes of PD: slow progressors, moderate progressors, and fast progressors. The approach can aid in the interpretability of clinical features and disease progression. The algorithms used were unsupervised learning and mathematical projection.
This paper [8] presents a study of Parkinson's Disease (PD) diagnosis using voice and tremor data. For tremor data, kNN achieved the highest accuracy of 98.5% for 2-level classification and 90% for 5-level classification. By combining both voice and tremor data, an accuracy of 99.8% was achieved using ensemble averaging of kNN, SVM, and naive Bayes for PD detection. The study uses kNN, SVM and Naive Bayes algorithms for classification. The highest accuracy for male voice samples was found to be 90.3% in kNN, and for female voice samples, it was 95.8% in kNN. In tremor data, the maximum accuracy for PD vs non-PD classification was 98.5% in kNN. This paper [9] presents a multimodal machine learning model for predicting the risk of Parkinson's disease. The model was developed using an open-source auto-ML package called GenoML and was validated in an external cohort. The model outperformed previous efforts with an accuracy of 89.72% and was based on a combination of clinicodemographic, genetic, and transcriptomic data.
The study [10] utilizes a PCA-RF model for detecting Parkinson's disease. It was found that the model's performance without PCA was better than with PCA. Specifically, the model achieved 89.9% and 76.7% accuracy, 70.2% and 55.6% sensitivity, and 96.5% and 80.6% specificity without and with PCA, respectively.
In study [11], the outcomes shown the advantages of the suggested ANFIS+PSOGWO algorithm, which outperformed its rivals by 7.3% and predicted Parkinson's disease with an accuracy of 87.5%. The suggested approach outperformed some recent research on Parkinson's disease prediction that employed PSO, GWO, GA, ACO, and DE, among other optimization techniques.
In paper [12] The accuracy of the research varies according on the quality and amount of the datasets used, as well as the methods used. Nonetheless, sensitivities in the 90%-95% range were reached using today's approaches.
Parkinson's disease detection techniques in paper [13] employs a number of equipment to evaluate the degree of illness. The vocal difficulty is one of the most prevalent symptoms, and most patients have vocal defections in the initial phases of the disease. As an outcome, medical systems driven by voice concerns have assumed the lead in contemporary PD detection research.
With significance set in study [14] at p 0.05, the study used SPSS or R to do statistical and data analysis. Machine learning algorithms to forecast 2-year longitudinal medical findings, models such as elastic-net and random forest were developed based on clinical factors, inflammatory cytokine measures, and demographic information (age and sex).

System Design
Ensemble Classifier, Support Vector Machine, and Decision Tree were employed in the study [15]. The classifiers' performance is measured using accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F-score. The suggested method is developed in MATLAB 2018 with multiple classifier parameter values. The Ensemble Classifier with 30 learners produced the maximum accuracy of 94.7%. The flow of building the Parkinson system is depicted in Fig. (2).

Load Data
Loading data in Spark involves various steps such as creating a SparkSession, specifying the data source format, defining the schema or letting Spark infer it, specifying options like delimiter, header, encoding, etc., and reading the data into a DataFrame. Once the data is loaded, it can be processed and transformed using Spark's distributed computing capabilities. It's important to ensure the data is clean, consistent, and in the right format before loading it into Spark for optimal performance and accurate results.

Data Preprocessing
A format is machine learning systems can easily analyse. Noise, inconsistencies, missing values, and irrelevant characteristics in raw data can all have an influence on the performance of machine learning algorithms. Data preparation is necessary for operations such as purifying the data and preparing it for a machine learning model, which improves the model's accuracy and efficacy. Data preprocessing entails a procedure of cleaning, transforming, and reducing data. Mathematical concepts such as statistics, linear algebra, and probability theory are used to perform these operations. For example, mean, median, mode, standard deviation, correlation matrices, and matrix operations are used to handle missing values, identify outliers, and perform scaling and normalization.

Data Normalization
Data normalization is a common data preprocessing technique used in machine learning to transform the data to a common scale or range. This is important because many machine learning algorithms assume that all input variables are on the same scale. If variables are not on the same scale, some variables may have a greater impact on the algorithm, leading to inaccurate or biased results.
One common technique for data normalization is min-max scaling, which scales the data to a range between 0 and 1. The formula for min-max scaling is: features X^2, X^3and so on. This can be expressed mathematically as: In equation (2) is the new feature set, X is the original feature, and n is the desired degree of polynomial expansion.

Feature Reduction
Feature reduction in machine learning is the process of lowering the amount of features or variables in a dataset in order to make data processing easy to compute and improve the efficacy of models. This can be achieved by eliminating irrelevant or redundant features that do not contribute to the accuracy of the model or may even negatively impact it.
One common technique for feature reduction is principal component analysis (PCA), It entails converting the data into a lower-dimensional space while keeping as much of the data's variance as possible. This can be expressed mathematically as: In equation (3) is the new feature set, X is the original feature set, and W is the matrix of principal components that captures the maximum amount of variance in the data. Another common technique for feature reduction is feature selection, which involves selecting a subset of features that are most relevant to the model. This can be achieved through various methods such as correlation analysis, mutual information, or regularization.

Training Model
Training a machine learning model involves using In equation (1) x is the original data point, is the minimum value of the variable, is the maximum value of the variable, and is the normalized data point. Techniques can be applied to individual variables or entire datasets. Data normalization helps to ensure that all variables are given equal importance during machine learning analysis, leading to more accurate and reliable results.

Feature Generation
Feature generation is a technique of creating additional features or variables with the help of already-existing data which will result in machine learning models performance. This can be achieved by combining, transforming, or extracting features that may not have been originally present in the data. Feature generation is particularly useful when the original features are not sufficient to accurately represent the mathematical algorithms to adjust its parameters for accurate predictions on new data, based on concepts like linear algebra, calculus, probability, and statistics. The model is trained with labeled data to minimize differences between predicted and actual outputs. Once sufficiently accurate, the model can be used to make predictions on new data. The algorithms used to train the Model are: 1) Logistic Regression: It's a statistical technique that's applied in binary classification jobs with the objective of estimating the likelihood of an event happening based on input data. Logistic Regression is a statistical technique used in machine learning to predict the probability of a binary outcome. It represents the relationship between one or more independent variables and a binary dependent variable. The mathematical equation for logistic regression and this value represents the estimated probability of the outcome is: underlying patterns or relationships in the data. predictions made by the model. The confusion matrix aids in determining the model's accuracy, precision, sensitivity and specificity which are describes in Table 1. To produce a final forecast, the system constructs a large number of decision trees and combines their results. The approach builds a forest of decision trees, where each tree is educated using a random subset of input features and training data. During prediction, the forest combines all of the individual trees' predictions to produce the ultimate outcome. Combining decision tree with ensemble learning equations yields the mathematical formula for Random Forest. By averaging the forecasts of each individual tree in the forest, the algorithm's output is produced. The output of a confusion matrix provides information about the true positive (2), true negative (118), false positive (1), and false negative (29) predictions made by the model. The confusion matrix aids in determining the model's accuracy, precision, sensitivity and specificity which are describes in Table 1.

3) Gradient Boosted Tree:
A machine learning algorithm called Gradient Boosted Trees is used for regression and classification tasks. Multiple decision trees are combined in this ensemble learning technique to produce predictions. The algorithm works by training decision trees in a sequential manner, where each subsequent tree learns from the errors of the previous tree. The algorithm minimizes the loss function by adding a new tree at each iteration, with the goal of reducing the residual errors. The mathematical equation for Gradient Boosted Trees involves calculating the sum of the output values of multiple decision trees. The output of each tree is weighted by a learning rate and added to the sum. In summary, Gradient Boosted Trees is a powerful algorithm that can handle both numerical and categorical data and can be used for a variety of machine learning tasks. Its strength lies in its ability to minimize the loss function by combining the output of multiple decision trees. The output of a confusion matrix provides information about the true positive (5), true negative (119), false positive (0), and false negative (26) predictions made by the model. It provides several key metrics that help assess the model's accuracy, precision, sensitivity, specificity and effectiveness in making predictions, which are describes in Table 1.  Evaluating Performance: It's a statistical technique that's applied in binary classification jobs with the objective of estimating the likelihood of an event happening based on input data. Logistic Regression is a statistical technique used in machine learning to predict the probability of a binary outcome. It represents the relationship between one or more independent variables and a binary dependent variable. The mathematical equation for logistic regression is:

Conclusion
In conclusion, the project on Parkinson disease prediction system using machine learning has the ability to enhance illness identification and treatment dramatically. Machine learning algorithms can effectively predict the possibility of an individual having Parkinson's disease by analysing several traits and symptoms of the condition. The method can also assist in determining the stage of the disease and its severity, allowing doctors to give patients with personalised treatment options.
This project can be of great significance in the medical field, as it can help doctors to detect Parkinson's disease at an initial stage when it is most treatable. Additionally, the prediction system can be used to identify potential risk factors and develop preventative measures. The implementation of this system could lead to better patient outcomes, and ultimately, contribute to reducing the overall burden of Parkinson's disease on individuals and healthcare systems.
While this project is a significant step forward, there is still room for improvement. Further research and development could lead to an even more accurate prediction system by incorporating additional data sources and refining the machine learning algorithms used.
In conclusion, the machine learning-based Parkinson disease prediction system has the ability to have a big influence on Parkinson's disease identification and management, and it is an intriguing subject for future study.