Introduction to Machine Learning#

DSC 340 Week 1 Slides#

Dr. Julie Butler

August 21 - August 24, 2023

Plan for the Week#

Monday (August 21)#

  • Syllabus Day

  • Lecture: Why should you learn what machine learning is and not just how to use it?

  • Office hours: 1pm to 3pm

Tuesday (August 22)#

  • Office hours: 4pm to 6pm (Bracy 107)

Wednesday (August 23)#

  • Lecture: Introduction to Machine Learning

Thursday (August 24)#

  • Week 2 Pre-class homework released

  • Office hours: 11am to 1pm

Friday (August 25)#

  • In-class Assignment #1: Mathematics of Machine Learning Crash Course

  • In-class Assignment #2: Introduction to Data Science Libraries

  • Week 1 Post-class assignment released

Syllabus Review (~15 minutes)#

  • All material is located at juliebutler.org/classes/dsc340

  • In small groups or alone:

    • Review the syllabus

    • Add any questions you have about the syllabus to the “General” channel on Teams (or email them to me)

  • I will answer some of the questions in class and the rest of the questions after class

What do you know about machine learning? Why are you taking this class? (~5 minutes)#

Discuss these questions with your group and write (or type) some of your thoughts. We will have a class discussion after the small group discussions.

Why should you learn what machine learning is and not just how to use it?#

  • Knowing how each machine learning algorithm works will make you a better machine learning engineer

  • If you know how an algorithm works, you will know how to best use it and how to improve it

  • You will know what type of machine learning task you are trying to do and which algorithms work best

  • You will learn which machine learning algorithms will work best for different data sets

The end goal of this class is to do a machine learning project from coming up with the idea to “publishing” and presenting the results.#

Final Project Ideas#

You can use ChatGPT to generate ideas for this project, BUT you cannot use it to complete this project, nor can you propose a project based on ChatGPT.

Science and Engineering#

  • Can you classify a type of animal or plant into different subspecies?

  • Can you predict the likelihood a person will develop a disease?

  • Can you use acceleration to predict the position of an object?

  • Can you solve Schrödinger’s equation with machine learning?

  • Can you predict properties of different elements/isotopes/molecules?

  • Can you classify organic molecules based on their chemical structure or their atoms?

  • Can you classify the type of bridge based on an image or building materials?

  • Can you predict the future temperatures of a certain city?

  • Can you use machine learning to solve Navier-Stokes equations?

Business and Finance#

  • Can you predict the future values of a certain stock?

  • Can you determine a business’s expected profit at the end of the year given beginning-of-the-year statistics (or a previous year’s)?

Sports#

  • Can you predict if a college football player will be drafted by the NFL and in what round of the draft?

  • Can you predict a baseball player’s batting average?

  • Can you predict what team a soccer player belongs to?

Hobbies#

  • Can you predict the type or generation of a Pokémon based on certain stats?

  • Can you predict the color of a Magic: The Gathering card based on its stats and keywords?

  • Can you predict the challenge rating of a monster in Dungeons and Dragons 5e?

What is Machine Learning?#

  • There has been much recent interest in machine learning and artificial intelligence (ChatGPT, self-driving cars, etc.), but generally not a good explanation of what it is.

  • Machine learning is the field at the intersection of artificial intelligence and data science; it is a collection of programs that learn from given data

When is Machine Learning Useful?#

  • Large data sets

  • Data sets with unknown patterns

  • Image and video analysis

  • Text processing (Natural Language Processing)

  • Predicting future values

The Machine Learning Workflow#

  1. Importing your data set and formatting it

  2. Splitting the data into a training set and a test set

  3. Training your machine learning model with the training set

  4. Evaluating the trained model’s performance with the test set

  5. (Optional) Making improvements to your model to increase its performance
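
The steps above can be sketched with scikit-learn. This is only a minimal illustration: the iris data set, the k-nearest neighbors model, the 75/25 split, and the random seed are assumptions chosen for the example, not choices made in these slides.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Import the data set and format it as features X and labels y
X, y = load_iris(return_X_y=True)

# 2. Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 3. Train the model with the training set only
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 4. Evaluate the trained model on data it has never seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# 5. (Optional) Try to improve the model, e.g. with a different n_neighbors
```

The test set is held back until step 4 so that the accuracy reflects how the model handles new data rather than data it has already seen.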

Types of Machine Learning#

  • Machine learning algorithms are classified by the kind of data they take

  • Data sets have two components:

    • X: the inputs or the independent variables; features

    • y: the outputs or the dependent variables; labels
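
As a small illustration of the two components, here is a made-up housing data set written as plain Python lists. The feature names and all of the numbers are hypothetical.

```python
# Hypothetical data: each row of X is one house (one sample) and
# each column is one feature (size in square meters, number of bedrooms)
X = [
    [50, 1],
    [80, 2],
    [120, 3],
]

# y holds one label per row of X (here, the sale price in thousands of dollars)
y = [150, 240, 360]

n_samples = len(X)
n_features = len(X[0])
print(n_samples, n_features, len(y))
```

X is two-dimensional (samples by features), while y has one entry per sample, so the two must always have the same number of rows.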

Supervised Learning#

  • Takes labelled data (i.e. both X and y) and learns the pattern between the features and the labels

  • Two types of supervised learning based on the task

    • Classification

      • Sorting inputs into a set number of categories

      • Ex: Given a picture, determine if the picture shows a cat or a dog (two categories)

    • Regression

      • Outputs are continuous values, so there are infinitely many possible outputs

      • Approximating a function f such that \(f(X) \approx y\)

      • Ex: Given some information about a house, what price should it sell for?

  • Examples: k-nearest neighbors, linear regression, support vector machines, neural networks
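
The difference between the two tasks can be sketched in plain Python. The nearest-neighbor classifier and the least-squares line below are simplified stand-ins for the algorithms listed above, and all of the data is made up for illustration.

```python
def nearest_neighbor_predict(X_train, y_train, x_new):
    """Classification: return the label of the closest training point."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(row, x_new)) for row in X_train
    ]
    return y_train[distances.index(min(distances))]


def fit_line(x, y):
    """Regression: least-squares fit of f(x) = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return slope, mean_y - slope * mean_x


# Classification: hypothetical [weight in kg, height in cm] for cats and dogs
animals = [[4, 25], [5, 23], [30, 60], [25, 55]]
species = ["cat", "cat", "dog", "dog"]
predicted_species = nearest_neighbor_predict(animals, species, [6, 24])

# Regression: hypothetical house sizes (m^2) and prices (thousands of dollars)
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept

print(predicted_species, round(predicted_price, 1))
```

The classifier can only ever answer "cat" or "dog", while the regression can return any number.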

Unsupervised Learning#

  • Learns patterns from unlabelled data (i.e. it is given only X)

  • Can also be roughly split into two categories depending on the task

    • Clustering

      • Unsupervised version of classification

      • How many categories can the data be split into?

      • Ex: Given measurements of many different iris flowers, how many species are present?

    • Dimensionality Reduction

      • Reduces the number of features by combining similar features

      • Ex: When trying to determine if a person is at risk for diabetes, you are given 8 measures of a person’s health. Can the total number of features be reduced by combining similar ones?

      • Supervised algorithms may perform better on data sets with many features if dimensionality reduction is performed first

  • Ex: principal component analysis (PCA), k-means, and hierarchical cluster analysis (HCA)
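
Both tasks can be sketched with scikit-learn on the iris measurements. Note the assumptions: k-means has to be told how many clusters to look for (3 is chosen here), and keeping 2 principal components is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Keep only the features X; the labels are deliberately ignored
X, _ = load_iris(return_X_y=True)

# Clustering: ask k-means to split the flowers into 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: combine the 4 features into 2 new ones
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

The last line reports how much of the variance in the original 4 features is kept by the 2 combined features.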

Semisupervised Learning#

  • Most data is unlabelled but some is labelled

  • Not always considered a separate type of machine learning

  • Ex: Photo apps will identify and group faces together, but you have to provide the name of each person
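
The photo app idea can be imitated with a "cluster, then label" sketch: group everything without labels, then let the few known labels name each group. This is one simple approach among several, and the synthetic 2D points, the names, and the choice of k-means are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for face data: 3 well-separated groups of 2D points
centers = [[0, 0], [6, 0], [3, 6]]
X, true_person = make_blobs(
    n_samples=90, centers=centers, cluster_std=0.5, random_state=0
)

# Only one point per person is labelled; -1 marks an unlabelled point
names = ["Alice", "Bob", "Carol"]  # hypothetical names
y_partial = np.full(len(X), -1)
for person in range(3):
    first_index = np.where(true_person == person)[0][0]
    y_partial[first_index] = person

# Unsupervised step: group all points, labelled or not
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised step: each cluster takes the label of its labelled member
y_filled = y_partial.copy()
for cluster in range(3):
    in_cluster = cluster_ids == cluster
    known = y_partial[in_cluster & (y_partial != -1)]
    if len(known) > 0:
        y_filled[in_cluster] = known[0]

fraction_correct = np.mean(y_filled == true_person)
print([names[i] for i in y_filled[:5]], fraction_correct)
```

Only 3 of the 90 points were labelled by hand; the remaining labels come from the structure found in the unlabelled data.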

Reinforcement Learning#

  • Based around an agent that learns to perform a task by maximizing a reward

  • Common in fields such as robotics
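
A minimal sketch of the reward-maximizing idea is an epsilon-greedy agent choosing between three actions (a "multi-armed bandit"). The reward probabilities, the exploration rate, and the number of steps are hypothetical values chosen for the example; real reinforcement learning problems are usually far more involved.

```python
import random


def run_bandit(reward_probs, n_steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent for a multi-armed bandit."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    estimates = [0.0] * n_actions  # estimated average reward of each action
    counts = [0] * n_actions       # how many times each action was tried

    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: try a random action
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the best estimate so far
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment pays a reward of 1 with the action's probability
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Update the running average reward for the chosen action
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates


estimates = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, [round(e, 2) for e in estimates])
```

The agent is never told the reward probabilities; it learns which action pays best only by trying actions and observing the rewards.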

Offline vs. Online Learning#

  • Offline Learning

    • All training data is given to the algorithm at once to train

    • If new data is added to the training set, the model must be retrained from scratch on both the old and the new data

  • Online Learning

    • Training data can be given in batches at any time, so online algorithms can improve themselves whenever they are given new data

  • Most algorithms in this class will use offline learning, but online learning has an advantage when it comes to memory because the entire data set never has to be held in memory at once
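
Online learning can be sketched with a hand-written linear model that is updated one batch at a time using stochastic gradient descent. The class name, the method name `partial_fit`, the learning rate, and the data are all assumptions made for this example.

```python
import random


class OnlineLinearModel:
    """Fits y ~ w * x + b with stochastic gradient descent, batch by batch."""

    def __init__(self, learning_rate=0.5):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def partial_fit(self, xs, ys):
        # Update the parameters using only the current batch
        for x, y in zip(xs, ys):
            error = (self.w * x + self.b) - y
            self.w -= self.learning_rate * error * x
            self.b -= self.learning_rate * error
        return self


rng = random.Random(0)
model = OnlineLinearModel()

# Data arrives in 10 batches; earlier batches are never stored or revisited
for batch in range(10):
    xs = [rng.random() for _ in range(100)]
    ys = [2.0 * x + 1.0 for x in xs]  # hypothetical true relationship
    model.partial_fit(xs, ys)

print(round(model.w, 3), round(model.b, 3))
```

Only one batch is in memory at a time, and the model keeps improving as batches arrive. An offline algorithm would instead need all 1000 points together and a full retrain whenever new points were added.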

Challenges of Machine Learning#

Problems with Data Sets#

  • Small training sets

  • Poor quality data

  • Too many features or irrelevant features

  • Overfitting or underfitting
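
Overfitting and underfitting can be explored by fitting polynomials of different degrees to a small, noisy training set and comparing training and test errors. The sine-shaped data, the noise level, and the degrees below are assumptions chosen for illustration, and the exact errors depend on the random data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n_points):
    # Hypothetical data: a sine curve plus random noise
    x = rng.uniform(-1, 1, n_points)
    y = np.sin(np.pi * x) + rng.normal(0, 0.2, n_points)
    return x, y


x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # unseen data from the same process

train_error = {}
test_error = {}
for degree in (1, 3, 10):
    model = Polynomial.fit(x_train, y_train, degree)
    train_error[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_error[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(degree, train_error[degree], test_error[degree])
```

The training error can only go down as the degree increases, because a higher-degree polynomial can always reproduce a lower-degree one. The test error has no such guarantee: a model that is too simple misses the pattern (underfitting), and a model that is too flexible can start fitting the noise (overfitting).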

Computational Limitations#

  • Memory (RAM)

  • Lack of GPUs and computing clusters