Lecture Notes: Introduction to Data Science

Definitions

What is Data Science?

Before we define “data science,” we must first define “data” and “science.”

Data can be defined as something given or admitted, especially as a basis for reasoning or inference. Typically, but not always, the term data is used to denote something raw or meaningless; once it has been converted to something that has meaning and purpose, it becomes information. This is a very vague definition. It makes more sense if we think of data physically: data can be numbers, images, videos, sounds, or words that we can use to study, model, and predict some part of the world around us. A group of photos of different dogs is an example of data that may help us determine the differences between dog breeds. A list of the birthdays of every student at Mount Union is also a set of data. It could help us determine the average age of a Mount Union student or the most common birthday.
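To make the birthday example concrete, here is a minimal Python sketch of how such a list could be turned into information. The birthdays, the reference date, and the helper name `age_on` are all made up for illustration; they are not real student data.

```python
from collections import Counter
from datetime import date

# Hypothetical birthdays (invented for illustration, not real student data)
birthdays = [
    date(2004, 3, 14),
    date(2005, 7, 2),
    date(2003, 3, 14),
    date(2004, 11, 30),
    date(2005, 3, 14),
]


def age_on(birthday, today):
    """Return the age in whole years on the date `today`."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


# A fixed reference date (an assumption) keeps the result reproducible
today = date(2024, 9, 1)

ages = [age_on(b, today) for b in birthdays]
average_age = sum(ages) / len(ages)

# Most common birthday, ignoring the year: count (month, day) pairs
month_day_counts = Counter((b.month, b.day) for b in birthdays)
most_common_day, count = month_day_counts.most_common(1)[0]

print(f"Average age: {average_age:.1f}")
print(f"Most common birthday (month, day): {most_common_day}, shared by {count}")
```

The raw list of dates is the data; the average age and the most common birthday are the information we derived from it.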

Science is defined as the systematic study of the structure and behavior of the physical and natural world through observation and experiment. Therefore, we could think of [data science](https://en.wikipedia.org/wiki/Data_science) as the study (science) of converting raw data to something with meaning. In A Hands-On Introduction to Data Science, Chirag Shah defines data science as a field of study and practice involving collecting, storing, and processing data to derive essential insights into a problem or phenomenon. Frank Lo, the director of data science for the online retailer Wayfair, defines data science as a “multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems.”

I like to think of data science as the field that occurs at the intersection of statistics, computer science, and graphic design. In data science, we will take raw data and attempt to give it meaning through statistical analysis and try to create models of the data using programming tools (like machine learning). Data science requires practitioners to have good computer science skills to program these models. Additionally, we need to communicate the results of our statistical analysis and models using graphs, which are created so that they quickly convey our conclusions. This part of data science requires a basic knowledge of graphic design to create valuable and visually pleasing graphs.

Several terms are related (but not necessarily equivalent) to data science. Data analytics is a field that deals with drawing insights from raw data, typically through a statistical analysis. While the field of data science does include data analytics, data science goes beyond just simply analyzing raw data; it also includes creating predictive models of the data and visualizations of the data. Big data is the process of collecting large amounts of data. Once collected, the huge data sets can then be analyzed with data science tools. Data engineering is the process of building systems to collect and analyze data, making data engineering a subset of data science, just like data analytics. Finally, machine learning is a set of algorithms that use artificial intelligence to build predictive data models. Phrased another way, machine learning uses artificial intelligence to do data science!

Data science can sometimes be confused with statistics, but they are two different fields. Statistics was primarily developed to handle data sets and related problems without a computer. On the other hand, data science includes applying statistics to a data set almost exclusively on a computer using programming. Additionally, data science consists of accessing information from large databases, writing computer code to manipulate and analyze the data, and meaningfully visualizing the datasets (not activities usually included in statistics).

Some resources consider data science a subset of computer science, but this view is not quite accurate. Data science and computer science overlap, and the two fields support each other. They both involve programming and developing computational models for real-world systems, but the types of problems each field solves tend to differ. Even so, algorithms originally developed by computer scientists (like neural networks, a form of machine learning) have greatly enhanced the field of data science.

Where is Data?

Data is everywhere! It can be created by humans (such as gathering the ages of everyone in a room) or by a machine (such as a computer that monitors the temperature of a room). Data can also occur in many formats, such as numbers, images, words, videos, and sounds.

In the modern day, about 328,770 petabytes (PB) of data are generated each day by humans and machines across various formats. Note that 1 PB = 1,000 TB (terabytes) and 1 TB = 1,000 GB (gigabytes). An average laptop has a 500 GB hard drive, so the data generated in just one day in the modern era is enough to fill about 658 **MILLION** laptops! By the end of next year (2025), worldwide data is expected to reach 181 zettabytes (ZB). 1 ZB equals 1 billion terabytes, so the total data available by the end of 2025 could fill about 362 BILLION laptops!
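The laptop arithmetic above can be checked with a few lines of Python. This is only a sketch: it uses the decimal unit conversions and the 500 GB laptop size stated in the text, and the variable and function names are chosen here for illustration.

```python
# Decimal (SI) unit conversions, as stated in the text
GB_PER_TB = 1_000
TB_PER_PB = 1_000
TB_PER_ZB = 1_000_000_000  # 1 ZB = 1 billion TB

LAPTOP_GB = 500  # laptop drive size assumed in the text


def laptops_filled(total_gb, laptop_gb=LAPTOP_GB):
    """Number of laptop drives needed to hold `total_gb` gigabytes."""
    return total_gb / laptop_gb


# 328,770 PB generated per day, converted to GB
daily_gb = 328_770 * TB_PER_PB * GB_PER_TB

# 181 ZB expected worldwide by the end of 2025, converted to GB
total_2025_gb = 181 * TB_PER_ZB * GB_PER_TB

daily_laptops = laptops_filled(daily_gb)
total_laptops = laptops_filled(total_2025_gb)

print(f"Laptops filled per day: {daily_laptops / 1e6:.0f} million")
print(f"Laptops filled by end of 2025: {total_laptops / 1e9:.0f} billion")
```

The daily figure works out to 657.54 million laptops, which rounds to the 658 million quoted above, and the 2025 figure works out to 362 billion.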

When discussing the rate at which data is being generated and collected, it is helpful to consider the 3V’s of data. The first v stands for velocity, the speed at which data is accumulated (how fast the data is being generated and collected). The second v stands for volume, the data’s size (amount) and scope (the range of fields and topics it covers). The final v stands for variety, representing the large array of forms and types in which data can occur. Data can be numbers, images, videos, sounds, and more. It can be structured (organized in some easily usable way) or unstructured. All 3V’s of data are increasing drastically in the modern day, making data science a significant, rapidly evolving field.

Data Science as a Career

A person with a career in data science is called a data scientist. Data science is one of the fastest-growing fields, and many jobs are expected to be available in the coming years. Additionally, data scientists across many fields and industries are highly paid for their skills, with the median salary of a data scientist being $103,500. Finally, on the US News and World Report’s list of the 25 best jobs in America for 2024, data science ranks 8th out of 25, with high salaries, rapid growth, and low unemployment (~1%!).

![Salary for Data Scientist](img/salary.png)

Skills Needed by Data Scientists

If data science sounds like an attractive career, then there are several essential skill sets that you need to develop. A data scientist needs to have three things: a willingness to experiment, proficiency in mathematical reasoning, and data literacy.

A data scientist must have the drive, intuition, and curiosity to solve problems as they are presented and to identify and articulate issues independently. This means that when a data scientist is given a data set and a problem to work on, they need to be willing (without direct supervision and prompting) not only to answer the question at hand but also to go further and see what other discoveries can be drawn from the data. Sometimes, this will mean trying something that does not work, but sometimes, you will make an exciting and unexpected discovery!

A data scientist must also be proficient in mathematical reasoning, particularly in statistics and calculus. A data scientist will spend a lot of time creating models of data, which includes creating statistical models and sometimes using calculus to model the rate at which variables change. Additionally, to use machine learning in predictive models, a data scientist must have a good grasp of statistics and calculus (and a little bit of linear algebra).

Finally, data scientists need to have data literacy, meaning they have the skills required (formatting raw data, building predictive models, and visualizing data) to extract meaningful information from a data set.

Many job advertisements for data science positions ask for the following skills. A modern data scientist must have a strong knowledge of statistics and machine learning, as many modern data science applications involve at least some machine learning. Additionally, a data scientist must have good computer science skills, as programming is used to analyze raw data, create predictive models, and visualize the data sets. Finally, a data scientist needs to be able to create visualizations that are visually appealing and convey the meaning of the data set without excessive explanation (a picture is worth a thousand words, after all).

![Data Science Skills](img/skills.png)

Finally, we can discuss the specific software skills that a data scientist needs. The above figure shows the software skills most often required of data scientists. In this course, we will cover the programming languages Python and R, the database management software MySQL (similar to the SQL on the chart), Jupyter Notebooks, and the Unix Shell for writing code (this will make more sense in a few weeks). While the other software is not covered in this course, there are courses available at Mount Union that cover every software listed here!

References and Further Reading

  1. Hands-On Introduction to Data Science. Chirag Shah. Chapter 1.
  2. The Most In-Demand Jobs for 2024
  3. EDA on DATA SCIENCE SALARY
  4. The Best Jobs in America in 2024
  5. How To Get Your Data Scientist Career Started