# 2021 Virtual Undergraduate Research Symposium

# 2021 Virtual Undergraduate Research Symposium

# Scaling Multi-Instance Support Vector Machines for Big Data Classification

# Scaling Multi-Instance Support Vector Machines for Big Data Classification

**PROJECT NUMBER: 32** | **AUTHOR:** Lauren Zoe Baker, Computer Science

**MENTOR: **Hua Wang, Computer Science

**ABSTRACT**

In many applications, data falls naturally into groups. Multi-instance learning is a type of machine learning designed to handle grouped data. These groups of data are called “bags.” Entire bags are then labelled as belonging to a certain class. For example, in a bioinformatics application, a patient admitted to a hospital might have had clinical readings taken at multiple timepoints. These timepoints from a single patient form a bag of instances. We would like to classify that patient as having a certain disease from those multiple instances using a multi-instance classifier. Support Vector Machines (SVMs) are classifiers that can be modified to operate on multi-instance data. SVMs find a hyperplane that separates data into classification boundaries. A common strategy for solving SVMs requires quadratic programming techniques. Quadratic programming can be sped up with heuristic methods. These speedups depend on having a small number of support vectors (data points that fall on the hyperplane). However, as the number of data points increases, so does the number of support vectors. Thus, these heuristic methods become slower as the number of data points increases. In our research, we demonstrate how the runtime of different multi-instance SVM algorithms scale with increasing bags of data, and then propose a novel alternate method of solving the SVM problem using Alternating Direction of Method of Multipliers (ADMM). The speed of this proposed method does not rely on the number of support vectors and scales well to large datasets. Although our method is effective when the number of bags increases, it relies on a least squares calculation which scales poorly with respect to features. To address this limitation, we further extend our method to include an approximation that allows our method to handle data with a large number of features. Finally, we modify our method to include a kernel version so that it can handle the classification of data that is not linearly separable.

**PRESENTATION**

**AUTHOR BIOGRAPHY**

Lauren “Zoe” Baker is a third year undergraduate student double majoring in Computer Science and Applied Mathematics. Zoe does research with the computer science department and has been a member of the Machine Learning, Informatics, and Data Science (MInDS@Mines) research group since 2018. She has been funded by the FIRST and MURF fellowships. In the future, Zoe would like to pursue a research career in the intersection of mathematics and computer science. She is also a member of the cross country and track and field teams at Mines and in her free time enjoys trail running, hiking, and skiing.

This is a really cool project. Do you have an idea of who this would be marketed toward/use by?

Thanks for the comment! Speeding up the computation time of a multi-instance support vector machine means that we can now apply SVM models to very large datasets. Currently, in the age of “big data”, machine learning models are only useful if they can handle large inputs. An example of how this model would be used would be in the bioinformatics world. Consider the problem of classifying a certain image of an eye as having a lesion or not. You could split up the image of the eye into multiple patches, or instances, then use our MISVM algorithm to classify the image (which would be a “bag of instances”) as having a lesion or not. The algorithm could then pinpoint a certain patch as having the lesion. Thus, we now have an automatic imaging method for scanning a picture of a patient’s eye, determining if it is healthy, and if it is not, pinpointing the location where it is not healthy. There could be thousands of images in the dataset, so having a fast MISVM algorithm would be critical for this problem.

I really enjoyed learning about your project! I was wondering how your results would be different if you trained the model with different kernals such as polynomial, radial basis function (RBF) or non-linear functions?

Thank you for the comment, I’m glad you enjoyed learning about our project! We have experimented with using different kernels in our MISVM project and observed the scaling results. We’ve found that the kernel extension requires an update that is computationally expensive and causes our scaling results to go from linear to quadratic. In future work, we would like to make our kernel extension just as fast as our original linear model.

Nice work, Zoe! I recently read about some work using SVMs to classify glycemic control for participants with and without type 2 diabetes. I think your extensions to include a large number of features will be very valuable for this kind of application.

Thank you Dr. Diniz Behn! That’s very interesting that you have read about work with SVMs and glycemic control. We are actually planning on using this work in a diabetes-related application. Specifically, we would like to examine the Messidor 2 dataset, a collection of Diabetic Retinopathy (DR) images, to try to pinpoint certain patches in an eye image as having blood vessel damage due to diabetes.