Introduction to Machine Learning and the Mathematics of Machine Learning#
DSC 340 Week 1 Lecture Notes#
Dr. Julie Butler
Week of August 21 - August 25, 2023
What is Machine Learning?#
Machine learning is the field that occurs at the intersection of artificial intelligence and data science. It uses computer algorithms that can learn to find patterns in, and draw meaningful conclusions from, data sets. The book Hands-On Machine Learning defines machine learning as “the science of programming computers so they can learn from data”. You can also think of machine learning as a collection of computer programs that solve problems without being programmed with task-specific instructions (i.e. the same algorithm can perform well on many problems without any changes).
This section will explore when you would want to use machine learning over traditional types of programming and also the different types and classifications of machine learning algorithms.
There are many great resources on machine learning, but a few for further investigation are listed below.
When is Machine Learning Useful?#
Machine learning can be applied to many problems, but it is particularly useful in certain contexts. Machine learning is typically applied when there are large amounts of data to be analyzed, particularly when the patterns are hard to find. Additionally, machine learning can perform better than traditional programming when the data is expected to fluctuate or vary over time: machine learning can automatically adapt to these changes in the data, while traditional programs may require updates to adjust.
For some examples, image analysis and classification are typically performed by machine learning algorithms, as they can easily find patterns in the images, especially if there is a large number of images to work with. A famous image classification example is the dog and bagel data set, where the goal is to classify images as “dog” or “bagel”. Machine learning is also typically applied when there are very large data sets where the pattern to be found in the data is unknown or complex. Astronomy is a good example of this, as data sets from this field can be terabytes in size, and sorting through them by hand is tedious and very time consuming. Additionally, machine learning is used when the data sets are known to change over time. For example, machine learning is used to study and predict future values in the stock market, whose values can change widely and quickly.
The Machine Learning Workflow#
Though every machine learning problem will use a different algorithm and data set, the analysis shares the same five steps:
Importing your data set and formatting it
Splitting the data into a training set and a test set
Training your machine learning model with the training set
Evaluating the trained model’s performance with the test set
(Optional) Making improvements to your model to increase its performance
These steps will be investigated further next week, but some of the later discussion will make more sense if you have a basic idea of the steps that are normally involved in a machine learning algorithm.
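The five steps can be sketched end-to-end with plain NumPy. This is a minimal illustration only: the synthetic straight-line data and the use of np.polyfit as the “model” are assumptions made for this example, not a recommendation for real problems.

```python
import numpy as np

# Step 1: import and format the data (synthetic y = 2x + 1 data stands in for a real set)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100)
y = 2 * X + 1 + rng.normal(0, 0.1, size=X.shape)

# Step 2: split the data into a training set (80%) and a test set (20%)
shuffled = rng.permutation(len(X))
train, test = shuffled[:80], shuffled[80:]

# Step 3: train the model on the training set (a straight-line fit plays the model here)
slope, intercept = np.polyfit(X[train], y[train], 1)

# Step 4: evaluate the trained model on the held-out test set (mean squared error)
mse = np.mean((slope * X[test] + intercept - y[test]) ** 2)
print(slope, intercept, mse)

# Step 5 (optional) would adjust the model based on the test error
```

Here the fitted slope and intercept come out close to the true values of 2 and 1, and the test error is on the order of the noise variance.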
Types of Machine Learning#
All machine learning algorithms must be given a data set from which to learn. A data set is made up of a set of independent variables, usually called the inputs or the X data set, and a set of dependent variables, the outputs or the y data set. Therefore a data set is made up of points of the form (X, y). In the context of machine learning, the X data are called the features and the y data are called the labels. If both the X and y data are present, then the data set is considered to be labelled, and if only the X component of the data set is present, then the data is called unlabelled. Machine learning algorithms can be split into three (or sometimes four) categories based on how much of the data set they receive (i.e. whether the data is labelled or unlabelled).
Supervised Machine Learning#
Supervised machine learning, or simply supervised learning, is machine learning applied to labelled data sets. By labelled, this means that the data set has both an input (X) component and an output (y) component. Therefore, the task of supervised learning is to learn how to match a given member of X to the correct value from y. Note that both X and y can be one-dimensional or multi-dimensional depending on the data set and application. Also, the lengths of X and y may be finite or infinite, again depending on the application.
Broadly speaking, supervised learning tasks can be split into two categories: classification or regression. The difference between the two is how many unique entries are in the output data set. In classification, y contains only a finite number of values, corresponding to the number of groups the data set can be classified into. For example, a famous data set in machine learning is called the iris data set, which provides data on different iris flowers and the specific species that flower belongs to, one of three types of irises. In this data set the only possible entries in y are 0 (species 1), 1 (species 2), or 2 (species 3). All other y values are invalid since they do not correspond to a known category. Classification is quite common when it comes to analyzing image data sets, as the goal typically is to determine what is being shown in the image (i.e. to classify the image).
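As a toy illustration of classification, the sketch below uses a one-nearest-neighbor rule: a new flower is assigned the species of the closest training point. The measurements and labels here are made-up stand-ins for the iris data, not the real data set.

```python
import numpy as np

# Made-up (petal length, petal width) measurements and their species labels 0, 1, 2
X_train = np.array([[1.4, 0.2], [1.5, 0.2], [4.5, 1.5],
                    [4.7, 1.4], [6.0, 2.5], [5.9, 2.1]])
y_train = np.array([0, 0, 1, 1, 2, 2])

def classify(flower):
    # Return the label of the training point closest to the new flower
    distances = np.linalg.norm(X_train - flower, axis=1)
    return y_train[np.argmin(distances)]

print(classify(np.array([1.3, 0.3])))  # lands in species 0
print(classify(np.array([5.8, 2.2])))  # lands in species 2
```

Note that the output is only ever 0, 1, or 2, a finite set of values, which is what makes this a classification problem.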
Regression, on the other hand, corresponds to problems where the output could take an infinite number of different values. Most regression problems involve trying to reproduce the function that maps X to y. Put another way, the goal of a regression problem is to find the function f such that f(X) = y for every value of X. A famous data set often used in regression problems is known as the Boston housing data set, which provides data on houses around the Boston, MA area and the price each house should sell for. Since house prices do not fall into a finite number of categories (a house could be any numeric price), this data set calls for regression instead of classification.
Some common supervised learning algorithms are k-nearest neighbors, linear regression, logistic regression, support vector machines (SVMs), decision trees, random forests, and neural networks.
Unsupervised Machine Learning#
In contrast to supervised machine learning, unsupervised machine learning (or unsupervised learning) is not given labelled data to learn patterns from. Rather, an unsupervised algorithm is only given X and must draw conclusions about the data without having y.
Like supervised learning, unsupervised learning can also be split into different categories based on the type of task. Common unsupervised learning tasks are clustering and dimensionality reduction. Clustering is the unsupervised learning equivalent of classification in supervised learning. However, instead of sorting data into one of a set number of classes, clustering algorithms determine how many classes (or clusters) are in the data set. If you were given the measurements of the iris flowers from the iris data set, but not the species labels, you could still pretty accurately classify the flowers by first using a clustering algorithm to determine that there are three clusters in the data, which would correspond to three different species of iris flowers.
Examples of unsupervised learning algorithms are principal component analysis (PCA), k-means, and hierarchical cluster analysis (HCA).
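One of these, k-means, is simple enough to sketch in a few lines of NumPy: alternately assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The synthetic two-cluster data, the starting centroids, and the fixed iteration count below are all simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 50 points around (0, 0) and 50 points around (5, 5)
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Start the k = 2 centroids at the first and last data points
centroids = data[[0, -1]]

for _ in range(10):
    # Assign every point to its nearest centroid
    labels = np.argmin(np.linalg.norm(data[:, None] - centroids, axis=2), axis=1)
    # Move each centroid to the mean of the points assigned to it
    centroids = np.array([data[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centroids))  # one centroid near (0, 0), the other near (5, 5)
```

Notice that the algorithm never saw any labels; the two groups emerge from the X data alone.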
Semisupervised Machine Learning#
Semisupervised machine learning algorithms are a combination of supervised and unsupervised machine learning. The data given to a semisupervised algorithm is primarily unlabelled, but does contain some labelled data. Note that this is not typically considered a class of machine learning in most sources, but it is mentioned in Hands-On Machine Learning.
Reinforcement Learning#
The final classification of machine learning is reinforcement learning. In reinforcement learning, an agent learns to perform a set of actions by observing the environment it is in and learning how to maximize its rewards. Reinforcement learning is commonly used in fields like robotics. Reinforcement learning will not be studied in depth in this course. Rather, we will focus primarily on supervised learning, with some unsupervised learning added in.
Offline and Online Learning#
If a machine learning algorithm uses “offline learning”, it is given all of its training data at once, learns from it, and then discards the data. If changes to the training data are made or if new training data is collected, the algorithm will need to be entirely retrained, using the old training data and the new training data combined. An algorithm that uses “online learning”, on the other hand, can be trained on an initial data set and then given new training data points at any time to learn from, without having to start the training from scratch. Though most algorithms you encounter in this class (and in general) will use offline learning, there are some advantages to online learning. Mainly, if the training data set you are using is too large to be imported all at once, online learning allows the machine learning algorithm to be trained using smaller chunks of the data set. This is especially useful when performing machine learning on image data sets.
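The chunked idea can be sketched with a running mean, which is updated one chunk at a time without ever needing the whole data set in memory at once. The data and chunk size below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
full_data = rng.normal(10, 2, size=10_000)  # pretend this is too big to load at once

# "Online" style: visit the data in 100 small chunks, updating a running mean
total, count = 0.0, 0
for chunk in np.array_split(full_data, 100):
    total += chunk.sum()
    count += len(chunk)
    running_mean = total / count  # usable after every chunk, no retraining from scratch

# The chunked result matches the "offline" answer computed on the full array
print(np.isclose(running_mean, full_data.mean()))
```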
Challenges of Machine Learning#
Despite recent advances in machine learning, it does still experience some challenges. These can be divided up into two categories: problems with the data set and computational limitations.
Problems with the Data Sets#
A machine learning algorithm cannot perform well if it is given bad data. This section will go over common problems that occur in data sets and how they can be fixed so that machine learning algorithms can perform better.
The first problem that can occur is that there is not enough training data for the machine learning algorithm to sufficiently learn from. Many machine learning algorithms need a large amount of training data to sufficiently learn, meaning hundreds or thousands of points. If a data set contains few data points, there are two ways to increase the performance of a machine learning algorithm. First, if possible, obtain more data points to add to the data set to increase its size. If this is not possible, then the second option is to choose a machine learning algorithm that can perform well on smaller amounts of training data. For example, if only 50 data points are available for the training data set, a neural network is unlikely to perform well, since neural networks require a large amount of training data to prevent overfitting (which is discussed below). A better choice of machine learning algorithm would be, for example, ridge regression, which can make accurate predictions with only a small amount of training data.
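Ridge regression has a closed-form solution, so the small-data case can be sketched directly in NumPy. The 15-point data set and the regularization strength alpha = 0.1 below are illustrative choices, not recommendations.

```python
import numpy as np

# A small, noisy training set: only 15 points of y = 3x + 2 plus noise
rng = np.random.default_rng(3)
x = np.linspace(0, 5, 15)
y = 3 * x + 2 + rng.normal(0, 0.2, size=x.shape)

# Design matrix with a column of ones so the model can learn an intercept
A = np.column_stack([np.ones_like(x), x])

# Closed-form ridge solution: w = (A^T A + alpha * I)^(-1) A^T y
alpha = 0.1
w = np.linalg.inv(A.T @ A + alpha * np.eye(2)) @ A.T @ y
print(w)  # approximately [2, 3], i.e. the intercept and slope
```

Even with so few points, the recovered intercept and slope land close to the true values, which is the kind of small-data behavior that makes ridge regression a reasonable choice here.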
A second problem that can occur with data sets is that they are of poor quality, meaning that the data set contains errors, has outliers, or is noisy. All of these make it harder for the machine learning algorithm to pick out the patterns that occur in the data. These problems can be fixed with “pre-processing”, meaning that the data is altered before being fed into the machine learning algorithm. Errors (for example, the inclusion of non-numerical data or missing values) can be spotted and removed using a library like Pandas in Python, and outliers can be identified and removed with statistical analysis. Noise can sometimes be removed or lessened with statistical analysis, but it is a harder problem to fix after the data has been collected; rather, it is better to try to collect data with as little noise as possible.
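As one small example of pre-processing, the sketch below drops an outlier by its distance from the mean. The cutoff of 2 standard deviations is an arbitrary choice for this tiny, made-up data set; real pre-processing usually involves more care.

```python
import numpy as np

# Made-up measurements with one obvious recording error (the 500.0)
values = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 500.0, 10.2])

# Keep only the points within 2 standard deviations of the mean
z_scores = np.abs(values - values.mean()) / values.std()
cleaned = values[z_scores < 2]
print(cleaned)  # the 500.0 outlier is removed
```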
Next, there can be problems with the features that are included. Many data sets will contain several features, but not all may be needed or relevant to train the machine learning algorithm. For example, when predicting if a football player will be a good quarterback, the player’s height and weight are likely relevant but the player’s hair color is not. If a feature in the data set can be identified as irrelevant, it can be removed before the data is given to the machine learning algorithm. Several features can also be combined into one, using dimensionality reduction (an unsupervised machine learning technique), before the data is given to a supervised machine learning algorithm. Reducing the number of features the algorithm has to consider can improve its accuracy and reduce its training time. The process of removing irrelevant features and combining other features is called feature engineering.
Finally, the last problem in this category is caused more by the choice of machine learning algorithm than by the data set. If a complex model is chosen to be applied to a data set that is small or noisy, this can result in overfitting. Overfitting means that the model can very accurately reproduce the training data but does not perform well when given new data. If a model can only perform well on its training data but not on new data, it is a useless model. Overfitting can be fixed by choosing a simpler model, or by constraining the complex model using a process called regularization. Overfitting is a major problem in neural networks, and this will be explored in the second half of the course. The opposite of overfitting is underfitting, which occurs when the model selected is too simple to be applied to the data set. Underfitting can be fixed by choosing a more complex model or reducing the number of features (thus reducing the complexity of the data).
Computational Limitations#
The other drawbacks of machine learning are due to computational limitations. Many machine learning algorithms require a large amount of time to train, and often require large computers and/or GPUs to get the best performance. Though the examples of machine learning we will see in this class can easily run on your average laptop, note that this is not necessarily true for most modern machine learning applications. Note also that training a machine learning algorithm and running a trained algorithm have different requirements. This is why you can get ChatGPT and AI art generation programs to run on your computer, even though they were trained on much larger systems.
Mathematical Crash Course with Python#
Linear Algebra#
Linear algebra is the field of mathematics that deals with matrices and vectors. Both of these quantities are the basis of many machine learning algorithms, and thus a basic understanding of linear algebra is necessary to fully understand machine learning. This section of these lecture notes will discuss basic linear algebra operations and how they can be performed computationally. In Python, the easiest way to perform linear algebra calculations is with the common library NumPy. The most common way to import the library is with the following line.
import numpy as np
Vectors#
A vector is a one-dimensional string of numbers. Vectors can be of any length, but generally vectors need to be of the same length for mathematical operations to be performed between them. We can represent a vector of length n generally as:

\[\vec{a} = [a_0, a_1, a_2, ..., a_{n-1}]\]

where \(a_i\) are numbers, either real or complex depending on the application. Note that we will start indexing the elements of a vector with 0 to align with the way Python indexes vectors (which will be discussed later in this section).
In Python vectors can most easily be represented with NumPy arrays. A one dimensional NumPy array can be defined using the following code. Generally (but not always) vectors are represented with lowercase letters, usually from the end of the alphabet.
x = np.array([1, 4, 7, 9])
Note that NumPy arrays are similar to the basic Python data type called lists. Though in many cases lists and NumPy arrays can be interchanged, it is better to stick with NumPy arrays for machine learning applications.
Let’s define two arrays of length 3 to be used for the rest of this section. Typically, vectors will be defined with lowercase letters. In this case, we will call the vectors a and b. The below code defines two NumPy arrays (vectors) of length three called a and b and then prints both arrays.
# Define two arrays of length 3
a = np.array([1,2,3])
b = np.array([4,5,6])
print(a)
print(b)
[1 2 3]
[4 5 6]
A vector can be multiplied or divided by a scalar (a number) by simply multiplying or dividing every element in the vector by the scalar. In equation form, we have for multiplication by a scalar:

\[c\vec{a} = [ca_0, ca_1, ..., ca_{n-1}]\]

and for division by a scalar:

\[\vec{a}/c = [a_0/c, a_1/c, ..., a_{n-1}/c]\]

Computationally, we have:
c = 5
c_times_a = c*a
print(c_times_a)
a_divide_c = a / c
print(a_divide_c)
[ 5 10 15]
[0.2 0.4 0.6]
An important feature of a vector is that it has both a size and a direction. The magnitude of a vector (its size or its length) can be calculated using the following formula, which is similar to the Pythagorean theorem. Note that the magnitude of a vector is a scalar.

\[|\vec{a}| = \sqrt{a_0^2 + a_1^2 + ... + a_{n-1}^2}\]

The direction of a vector is given by its unit vector, which is a vector that points in the same direction as the original vector but has a magnitude of 1. The unit vector of a vector \(\vec{a}\) can be calculated using the following formula.

\[\hat{a} = \frac{\vec{a}}{|\vec{a}|}\]

Computationally, the magnitude of a vector can be found using the function np.linalg.norm, and the unit vector can be found by dividing the original vector by its magnitude.
# Find magnitude and unit vector
a_magnitude = np.linalg.norm(a)
print(a_magnitude)
a_unit_vector = a / a_magnitude
print(a_unit_vector)
3.7416573867739413
[0.26726124 0.53452248 0.80178373]
Vectors can be added or subtracted (if they are the same length) by adding or subtracting the corresponding elements. The result of adding or subtracting vectors is a vector.
Computationally, we can add and subtract vectors using + and -:
# Vector addition and subtraction
a_plus_b = a + b
print(a_plus_b)
a_minus_b = a - b
print(a_minus_b)
[5 7 9]
[-3 -3 -3]
Finally, there are two ways to “multiply” vectors that are the same length: the dot (or scalar) product and the cross product.
A dot product (also known as a scalar product) is one way to “multiply” two vectors together. Finding a dot product is quite easy; you simply multiply the numbers in the same position together and then add everything up. The formula to find the dot product is:

\[\vec{a} \cdot \vec{b} = \sum_{i=0}^{n-1} a_i b_i = a_0b_0 + a_1b_1 + ... + a_{n-1}b_{n-1}\]
The result of performing a dot product is always a scalar. A dot product is a measure of how closely two vectors point in the same direction.
The cross product is the second way to “multiply” two vectors. It is a bit harder to calculate than the dot product. The formula for finding the cross product between two vectors of length three is:

\[\vec{a} \times \vec{b} = [a_1b_2 - a_2b_1,\ a_2b_0 - a_0b_2,\ a_0b_1 - a_1b_0]\]

Note that this equation only works for vectors of length 3. The result of performing a cross product is always a vector, one that is perpendicular to both of the vectors used to calculate it.
In Python, the dot product of two arrays can be calculated using np.dot, and the cross product of two arrays can be calculated using np.cross.
# Dot and cross products
a_dot_b = np.dot(a,b)
print(a_dot_b)
a_cross_b = np.cross(a,b)
print(a_cross_b)
32
[-3 6 -3]
Specific elements of an array can be accessed using array indexing. Each element in an array corresponds to a unique index. The first element in the array has an index of 0, the second element has an index of 1, and so on. This means that the last element in the array has an index of n-1, where n is the number of elements in the array. By adding the index in square brackets after the name of the array, the element corresponding to that index is returned.
print(a[0])
print(a[1])
print(a[2])
1
2
3
Matrices#
Matrices are two-dimensional structures, and can therefore be represented with a two-dimensional NumPy array. Generally (but not always) a matrix is represented with a capital letter, usually from the beginning of the alphabet.
A = np.array([
[1,0,5],
[2,1,6],
[3,4,0]
])
print(A)
[[1 0 5]
[2 1 6]
[3 4 0]]
Let us also define another matrix, B, and a vector, x. The matrix needs to be the same size as A and the vector needs to have the same number of elements as A has rows.
B = np.array([
[9,8,7],
[6,5,4],
[3,2,1],
])
x = np.array([3,4,5])
Matrices can be multiplied or divided by a scalar using the same rules as for vectors (multiply or divide each element in the matrix by the scalar) and matrices of the same size can be added or subtracted also using the vector rules (add or subtract each of the corresponding elements).
Computationally, we can perform all of these operations using the following lines of code.
c = 5
A_times_c = c*A
print(A_times_c)
A_divide_by_c = A/c
print(A_divide_by_c)
A_plus_B = A+B
print(A_plus_B)
A_minus_B = A-B
print(A_minus_B)
[[ 5 0 25]
[10 5 30]
[15 20 0]]
[[0.2 0. 1. ]
[0.4 0.2 1.2]
[0.6 0.8 0. ]]
[[10 8 12]
[ 8 6 10]
[ 6 6 1]]
[[-8 -8 -2]
[-4 -4 2]
[ 0 2 -1]]
Another operation that can be performed on a matrix is called the transpose, where the rows and the columns of the matrix are exchanged. We can calculate the transpose of a two-dimensional array using .T after the name of the array.
A_transpose = A.T
print(A)
print(A_transpose)
[[1 0 5]
[2 1 6]
[3 4 0]]
[[1 2 3]
[0 1 4]
[5 6 0]]
A vector can be multiplied by a matrix if the number of elements in the vector is the same as the number of rows in the matrix. The result will be a vector with the same length as the number of columns in A.
In Python, we can multiply a vector with a matrix using the @ symbol, not the * symbol.
x_times_A = x@A
print(x_times_A)
[26 24 39]
Two matrices can also be multiplied if the number of columns in the first matrix is the same as the number of rows in the second matrix. The elements of the matrix product are given by the following formula:

\[(AB)_{ij} = \sum_{k} A_{ik}B_{kj}\]
Note that in general AB \(\neq\) BA. In Python, two two-dimensional NumPy arrays can be multiplied, again using the @ symbol.
A_times_B = A@B
B_times_A = B@A
print(A_times_B)
print(B_times_A)
[[24 18 12]
[42 33 24]
[51 44 37]]
[[46 36 93]
[28 21 60]
[10 6 27]]
There is another way to multiply matrices and vectors instead of the @ symbol: the function np.matmul. Note that you have to add the arguments in the order you want the multiplication to take place (i.e. AB \(\longrightarrow\) np.matmul(A,B) but BA \(\longrightarrow\) np.matmul(B,A)).
A_times_B = np.matmul(A,B)
print(A_times_B)
[[24 18 12]
[42 33 24]
[51 44 37]]
An important operation that can be performed on only square matrices is called the inverse. For a matrix A, its inverse \(A^{-1}\) is defined such that \(AA^{-1} = \textbf{I}\), where I is the identity matrix (a matrix with ones on the diagonal and zeros everywhere else). In Python, the function np.linalg.inv calculates the inverse of a matrix. Note that due to rounding, the off-diagonal terms of \(AA^{-1}\) may not be exactly zero, but they should be very close.
A_inverse = np.linalg.inv(A)
print(A_inverse)
print(A@A_inverse)
[[-24. 20. -5.]
[ 18. -15. 4.]
[ 5. -4. 1.]]
[[ 1.00000000e+00 -2.66453526e-15 -2.22044605e-16]
[ 0.00000000e+00 1.00000000e+00 -4.44089210e-16]
[ 0.00000000e+00 -7.10542736e-15 1.00000000e+00]]
We can get an identity matrix of a given size using the np.eye function.
identity = np.eye(3)
print(identity)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
Finally, we can calculate the eigenvectors and eigenvalues of a square matrix. The eigenvalues are known as the “characteristic values” of a matrix, and the eigenvectors are the vectors that, when multiplied by the matrix, reproduce the eigenvalue times the eigenvector: \(A\vec{v} = \lambda\vec{v}\). There are as many eigenvalue-eigenvector pairs for a matrix as there are rows. These eigenvalues and eigenvectors have different interpretations depending on the context. In Python, we can calculate the eigenvalues and eigenvectors with one function, np.linalg.eig, which returns both the eigenvalues and the eigenvectors. Note that the eigenvectors are given in a matrix where each column is an eigenvector.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues[0])
print(eigenvectors[:,0])
7.256022422687388
[-0.45291192 -0.68828659 -0.56668542]