EECS – Prelim: Machine Learning-Based Prediction of Bacteriocins via Feature Evaluation, Suraiya Akhter
About the event
Student: Suraiya Akhter
Advisor: Dr. John Miller
Degree: Computer Science MS
Thesis Title: Machine Learning-Based Prediction of Bacteriocins via Feature Evaluation
Abstract: Antibiotic resistance is a major public health concern around the globe. As a result, researchers continually seek new compounds from which to develop antibiotic drugs that combat antibiotic-resistant bacteria. The use of bacteriocins has emerged as a promising strategy in the development of such drugs, given their ability to kill bacteria across both broad and narrow natural spectra of activity. Therefore, there is a strong need for an accurate and efficient computational model to predict novel bacteriocins. Machine learning can learn patterns and features from bacteriocin sequences that are difficult to capture using sequence matching-based methods, which makes it a potentially superior choice for accurate prediction. Our aims are to develop a machine learning-based software tool called BaPreS (Bacteriocin Prediction Software) and a web application called BPAG (Bacteriocin Prediction based on Alternating Decision Tree and Genetic Algorithm), each using an optimal set of features to detect bacteriocin protein sequences with high accuracy. Initially, we extracted candidate features from known bacteriocin and non-bacteriocin sequences by considering the physicochemical and structural properties of the protein sequences. For the BaPreS software tool, we then reduced the feature set using statistical justifications and recursive feature elimination. For the BPAG web application, we evaluated the candidate features using the Pearson correlation coefficient, followed by the Alternating Decision Tree (ADTree) and a Genetic Algorithm (GA), to eliminate unnecessary features. Finally, we constructed random forest (RF) and support vector machine (SVM) models using the reduced feature sets, which achieved accuracies of up to 95.54% for BaPreS and 98.21% for BPAG on the testing dataset. These results indicate that the models can predict highly diverse bacteriocins reliably.
We utilized the best machine learning models to implement the BaPreS software tool and the BPAG web application. We compared the prediction performance of BaPreS and BPAG with that of a popular sequence matching-based tool and a deep learning-based method, and our tools outperformed both. Moreover, BPAG outperformed BaPreS. Currently, both BaPreS and BPAG report classification results with associated probability values and allow new sequences to be added to the training dataset to improve the predictive power of the models.
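The Pearson correlation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the BaPreS or BPAG implementation: the abstract does not list the actual features, so amino acid composition is used here as an assumed stand-in for a physicochemical feature, and the function names and the 0.9 threshold are hypothetical. The later stages (recursive feature elimination, ADTree, GA, and the RF and SVM classifiers) are not shown.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Assumed example feature; the abstract does not specify the real feature set.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)


def drop_correlated_features(rows, threshold=0.9):
    """Greedily keep feature columns that are not highly correlated.

    rows: one feature vector per sequence.
    Returns the indices of columns whose |r| with every previously kept
    column is below the (hypothetical) threshold.
    """
    columns = list(zip(*rows))
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept
```

For example, calling `drop_correlated_features` on a matrix built from `amino_acid_composition` of each training sequence would return the column indices to pass on to a subsequent selection stage. The greedy order means that, of two redundant features, the one appearing first is retained; a different tie-breaking rule could be equally reasonable.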