Introduction to Machine Learning#
DSC 340 Week 1 Slides#
Dr. Julie Butler
August 21 - August 24, 2023
Plan for the Week#
Monday (August 21)#
Syllabus Day
Lecture: Why should you learn what machine learning is and not just how to use it?
Office hours: 1pm to 3pm
Tuesday (August 22)#
Office hours: 4pm to 6pm (Bracy 107)
Wednesday (August 23)#
Lecture: Introduction to Machine Learning
Thursday (August 24)#
Week 2 Pre-class homework released
Office hours: 11am to 1pm
Friday (August 25)#
In-class Assignment #1: Mathematics of Machine Learning Crash Course
In-class Assignment #2: Introduction to Data Science Libraries
Week 1 Post-class assignment released
Syllabus Review (~15 minutes)#
All material is located at juliebutler.org/classes/dsc340
In small groups or alone:
Review the syllabus
Add any questions you have about the syllabus to the “General” channel on Teams (or email them to me)
I will answer some of the questions in class and the rest of the questions after class
What do you know about machine learning? Why are you taking this class? (~5 minutes)#
Discuss these questions with your group and write (or type) some of your thoughts. We will have a class discussion after the small group discussions.
Why should you learn what machine learning is and not just how to use it?#
Knowing how each machine learning algorithm works will make you a better machine learning engineer
If you know how an algorithm works you will know how to best use it and how to improve it
You will know what type of machine learning task you are trying to do and which algorithms work best
You will learn which machine learning algorithms will work best for different data sets
The end goal of this class is to do a machine learning project from coming up with the idea to “publishing” and presenting the results.#
Final Project Ideas#
You can use ChatGPT to generate ideas for this project, BUT you cannot use it to complete the project, nor can you propose a project based on ChatGPT.
Science and Engineering#
Can you classify a type of animal or plant into different subspecies?
Can you predict the likelihood a person will develop a disease?
Can you use acceleration to predict the position of an object?
Can you solve Schrödinger’s equation with machine learning?
Can you predict properties of different elements/isotopes/molecules?
Can you classify organic molecules based on their chemical structure or their atoms?
Can you classify the type of bridge based on an image or building materials?
Can you predict the future temperatures of a certain city?
Can you use machine learning to solve Navier-Stokes equations?
Business and Finance#
Can you predict the future values of a certain stock?
Can you determine a business’s expected profit at the end of the year given beginning-of-the-year statistics (or a previous year’s)?
Sports#
Can you predict if a college football player will be drafted by the NFL and in what round of the draft?
Can you predict a baseball player’s batting average?
Can you predict what team a soccer player belongs to?
Hobbies#
Can you predict the type or generation of a Pokemon based on certain stats?
Can you predict the color of a Magic: The Gathering card based on its stats and keywords?
Can you predict the challenge rating of a monster in Dungeons and Dragons 5e?
What is Machine Learning?#
There has been much recent interest in machine learning and artificial intelligence (ChatGPT, self-driving cars, etc.), but there is generally not a good explanation of what it is.
Machine learning is the field at the intersection of artificial intelligence and data science; it is a collection of programs that learn from given data
When is Machine Learning Useful?#
Large datasets
Datasets with unknown patterns
Image and video analysis
Text processing (Natural Language Processing)
Predicting future values
The Machine Learning Workflow#
Importing your data set and formatting it
Splitting the data into a training set and a test set
Training your machine learning model with the training set
Evaluating the trained model’s performance with the test set
(Optional) Making improvements to your model to increase its performance
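The workflow steps above can be sketched in plain Python. This is a minimal illustration, not the course's method: the synthetic data (y = 3x + 2 plus noise), the 80/20 split, the least-squares line fit, and the helper names `fit_line` and `mean_squared_error` are all assumptions made for the example. In practice a data science library would usually handle these steps.

```python
import random

# Step 1: import and format the data. Here the data set is synthetic
# (an assumption for illustration): y = 3x + 2 plus a little noise.
random.seed(0)
X = [i / 10 for i in range(100)]
y = [3.0 * x + 2.0 + random.uniform(-0.5, 0.5) for x in X]

# Step 2: split the data into a training set (80%) and a test set (20%).
indices = list(range(len(X)))
random.shuffle(indices)
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

# Step 3: train the model using the training set only.
# The model is a line y = slope * x + intercept fit by ordinary least squares.
def fit_line(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y_i - mean_y) for x, y_i in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(X_train, y_train)

# Step 4: evaluate the trained model on the held-out test set.
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_pred = [slope * x + intercept for x in X_test]
test_mse = mean_squared_error(y_test, y_pred)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MSE={test_mse:.3f}")
```

The test set is never used during fitting, so the test error estimates how the model would perform on data it has not seen.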
Types of Machine Learning#
Machine learning algorithms are classified by the kind of data they take
Data sets have two components:
X: the inputs or the independent variables; features
y: the outputs or the dependent variables; labels
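As a small illustration of the two components, the sketch below stores X as a list of rows (one row per sample, one column per feature) and y as one label per row. The feature names and all values are invented for the example.

```python
# Hypothetical data set: each row of X is one house, each column is a feature.
feature_names = ["area_sq_ft", "bedrooms", "age_years"]
X = [
    [1400, 3, 20],
    [2000, 4, 5],
    [850, 2, 45],
]
# y holds one label per row of X (here, a made-up sale price in dollars).
y = [240_000, 410_000, 135_000]

n_samples = len(X)
n_features = len(X[0])
print(f"{n_samples} samples, {n_features} features, {len(y)} labels")
```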
Supervised Learning#
Takes labelled data (i.e. both X and y) and learns the pattern between the features and the labels
Two types of supervised learning based on the task
Classification
Sorting inputs into a set number of categories
Ex: Given a picture, determine if the picture shows a cat or a dog (two categories)
Regression
Infinite number of possible outputs
Approximating a function f such that \(f(X) \approx y\)
Ex: Given some information on a house, what price should it sell for?
Examples: k-nearest neighbors, linear regression, support vector machines, neural networks
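To make classification concrete, here is a minimal sketch of k-nearest neighbors, one of the algorithms listed above, written in plain Python. The function name `knn_predict`, the two features (weight and height), and the data values are assumptions invented for the example.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = [
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up labelled data: features are (weight in kg, height in cm).
X_train = [(4.0, 25.0), (3.5, 23.0), (5.0, 26.0),
           (20.0, 50.0), (25.0, 60.0), (30.0, 65.0)]
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(knn_predict(X_train, y_train, (4.5, 24.0)))   # a small animal
print(knn_predict(X_train, y_train, (22.0, 55.0)))  # a large animal
```

The model needs both X and y: without the labels in `y_train` there would be nothing to vote on, which is what makes this a supervised method.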
Unsupervised Learning#
Learns patterns from unlabelled data (i.e. it is given only X)
Can also be roughly split into two categories depending on the task
Clustering
Unsupervised version of classification
How many categories can the data be split into?
Ex: Given measurements of many different iris flowers, how many species are present?
Dimensionality Reduction
Reduces the number of features by combining similar features
Ex: When trying to determine if a person is at risk for diabetes, you are given 8 measures of a person’s health. Can the total number of features be reduced by combining similar ones?
Supervised algorithms may perform better on data sets with many features if dimensionality reduction is performed first
Ex: principal component analysis (PCA), k-means, and hierarchical cluster analysis (HCA)
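As a rough sketch of clustering, the code below implements k-means, one of the algorithms listed above, in plain Python. The function name `k_means`, the starting centroids, and the six two-feature points (imagined here as petal length and petal width) are assumptions made for the example. Note that k-means needs the number of clusters chosen in advance; it does not discover that number on its own.

```python
import math

def k_means(points, initial_centroids, n_iterations=10):
    """Alternate between assigning points to centroids and updating the centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments

# Made-up, unlabelled measurements: only X is given, there is no y.
points = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3),
          (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]

# Start both centroids inside the same group to show that they move apart.
centroids, assignments = k_means(points, initial_centroids=[points[0], points[1]])
print("cluster of each point:", assignments)
print("centroids:", [tuple(round(c, 2) for c in centroid) for centroid in centroids])
```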
Semisupervised Learning#
Most data is unlabelled but some is labelled
Not always considered a separate type of machine learning
Ex: Photo apps will identify and group faces together, but you have to provide the name of each person
Reinforcement Learning#
Based around an agent that learns to perform a task by maximizing a reward
Common in fields such as robotics
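One very small instance of this idea is an agent that repeatedly chooses among a few actions, each paying a reward with a hidden probability, and learns which action pays best. The sketch below uses an epsilon-greedy strategy; the function name `run_bandit`, the reward probabilities, and the parameter values are all assumptions chosen for illustration, and real reinforcement learning problems are considerably richer than this.

```python
import random

def run_bandit(reward_probabilities, n_steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent choosing among actions that pay a reward of 0 or 1."""
    rng = random.Random(seed)
    n_actions = len(reward_probabilities)
    pulls = [0] * n_actions              # how often each action was tried
    value_estimates = [0.0] * n_actions  # running average reward per action
    total_reward = 0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the highest estimated reward.
            action = max(range(n_actions), key=lambda a: value_estimates[a])
        # The environment pays 1 with the chosen action's hidden probability.
        reward = 1 if rng.random() < reward_probabilities[action] else 0
        total_reward += reward
        pulls[action] += 1
        # Incremental update of the running average for the chosen action.
        value_estimates[action] += (reward - value_estimates[action]) / pulls[action]
    return value_estimates, pulls, total_reward

value_estimates, pulls, total_reward = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(value_estimates)), key=lambda a: value_estimates[a])
print("estimated values:", [round(v, 2) for v in value_estimates])
print("best estimated action:", best_action, "total reward:", total_reward)
```

The agent is never told which action is correct; it only sees the rewards that follow its own choices.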
Offline vs. Online Learning#
Offline Learning
All training data is given to the algorithm at once to train
If new data is added to the training set, the model must be entirely retrained with both the old and the new data
Online Learning
Training data can be given in batches at any time, so online algorithms can keep improving whenever they are given new data
Most algorithms in this class will use offline learning, but online learning has an advantage when it comes to memory, since the full data set never has to be held in memory at once
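The sketch below illustrates online learning with a one-feature linear model trained by stochastic gradient descent. The class name `OnlineLinearModel`, its `update` method, the learning rate, and the synthetic data (y = 2x + 1) are assumptions made for the example. Each batch is used once and then discarded, so only one batch is in memory at a time.

```python
import random

class OnlineLinearModel:
    """y = weight * x + bias, trained with stochastic gradient descent."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def update(self, X_batch, y_batch):
        """Improve the model using only the new batch; old data is not needed."""
        for x, y_true in zip(X_batch, y_batch):
            error = (self.weight * x + self.bias) - y_true
            # Gradient step on the squared error for this single example.
            self.weight -= self.learning_rate * error * x
            self.bias -= self.learning_rate * error

    def predict(self, x):
        return self.weight * x + self.bias

rng = random.Random(0)
model = OnlineLinearModel(learning_rate=0.05)
for _ in range(100):
    # Each batch is created, used once, and then thrown away.
    X_batch = [rng.uniform(0.0, 5.0) for _ in range(20)]
    y_batch = [2.0 * x + 1.0 for x in X_batch]
    model.update(X_batch, y_batch)

print(f"weight={model.weight:.3f}, bias={model.bias:.3f}")
```

An offline algorithm, by contrast, would need all 2000 points available together and would have to be refit on the full set whenever new points arrive.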
Challenges of Machine Learning#
Problems with Data Sets#
Small training sets
Poor quality data
Too many features or irrelevant features
Overfitting or underfitting
Computational Limitations#
Memory (RAM)
Lack of GPUs and computing clusters