Lecture Notes: Overview of Tools and Software
Tools for Data Science
In this course we will explore the field of data science using four tools: Excel, Python, R, and MySQL. In addition to these, we will also use Jupyter Notebooks to write our code because it easily integrates code, text, and graphing. Note that these are not the only tools used by data scientists. In fact, there are no correct tools for doing data science, just tools which make doing certain tasks easier.
For example, Python and R are both programming languages that we will learn in this course. Both of these languages allow you to create a graph of a data set in just a few lines of code. You can also create graphs in other programming languages such as Java, but it will take significantly more code and will not be as easy to implement. As another example, we will be learning Excel, which is a spreadsheets implementation, and MySQL, which is a database manager, because both spreadsheets and databases allow us to easily store, view, access, and manipulate data.
Excel
What is Excel?
Excel is an implementation of spreadsheets created by Microsoft Office. There are other popular spreadsheet implementations, such as Google Sheets, [LibreOffice] Calc](https://en.wikipedia.org/wiki/LibreOffice_Calc), and Apple Numbers, and much of what we learn about in Excel will transfer to these as well. A spreadsheet is an electronic file format where the data is arranged in a grid of rows and columns. Each grid square in a spreadsheet is called a cell. A spreadsheet will allow us to view a data set, but also manipulate the data and use it in calculations through a set of commands. There are many uses of spreadsheets in data science and in this class we will use Excel to explore a data set, create graphs of the data set, and calculate statistical reports for the data.
Installing Excel (Option 1)
The first option for using Microsoft Excel is to install it on your computer. Excel is a paid product, but luckily Mount Union pays for you to have access to all Microsoft software. If you have not already installed your Microsoft office apps, go to office.com and log in with your Mount Union email and password. On the first screen you see after logging in, click the “Install and more” menu in the top right corner and then click “Install Microsoft 365 apps”. This will download an installer which you can then launch to get everything installed. This will install Microsoft Excel on your computer along with Word, Powerpoint, OneNote, and other Microsoft products.
Office Online (Option 2)
If you do not want to install anything on your computer, you can also use Microsoft Excel on your web browser. Just note that this web browser version does not have every feature and some of the things we learn to do in this course will look slightly different. Go to office.com and log in with your Mount Union email address and password. On the left sidebar once you have logged in there will be an Excel icon. If you click on that you will be taken to the browser version of Excel. You can click “Blank workbook” to get a new Excel spreadsheet.
Python
What is Python?
One of the most popular programming languages in the world over the past few years is Python. Python programming is one of the top skills needed by modern data scientist (and programmers in many other fields). Python is a scripting language, which means that you can run your Python programs a little bit at a time. This is compared to a compiled language like Java, where you need to compile a completed program and run everything at once.
Python is an easy and introductory language. Once we start learning Python you will be able to start writing useful data science programs very quickly. This is partly do to the simple syntax Python has and partly do to a large collection of libraries we can use with Python to turn complex tasks into just a few lines of code. In data science, we will use a programming language to do many of the same things we can do with spreadsheets, such as exploring a data set, creating graphs of the data, and calculating statistics for the data. But, in addition, we will also learn how to write basic computer programs with loops and conditional statements which will allow us to do slightly more advanced things than we can do with just Excel.
Installing Python and Jupyter Notebooks (Option 1)
The first option you have for using Python is to install it on your computer. Luckily, most modern computers already come with Python pre-installed, so all you need to do is make sure it is installed and up-to-date. You can find the instruction for checking to see if you have Python installed here. If you determine that you do not have Python installed on your computer, you can try to install it using the instructions available on the Python website or come see me in office hours before we start covering Python in the course.
Since Python is just a programming language, we need something to write our Python code in. If you have previous experience in Python and want to do it the way you know, that is totally fine for this course. If you are new to Python, I recommend you also install Jupyter Notebooks as a software for writing and running your Python code. Jupyter Notebooks are cool because they allow you to write both code and text in the same file, so you can leave yourself notes. Your graphs will also appear in the notebook and not pop-up in a different window like with some other software you can use for writing and running Python code. I will typically give you Python code (both for in-class projects and homeworks) in Jupyter Notebooks (which have the file extension .ipynb). You can find instructions for installing Jupyter Notebooks here
Using Google Colab (Option 2)
Like Excel, if you do not want to install anything on your computer, there is a web-based version. Google Colab is a web-based version of Jupyter Notebooks which runs on Google’s cloud computers. You can access Google Colab by going to colab.research.google.com and clicking on “New notebook” on the pop-up window. You do need a a Google account to use Google Colab but if you do not have one you can create one for free.
There are some benefits of Google Colab over Jupyter Notebooks. There is no local installation that you need to install yourself and keep updated. Any Python libraries that you need come pre-installed on Google Colab (you need to install these yourself if you use Jupyter Notebooks). Also, all of your code is automatically saved and backed up to your Google Drive account so you are unlikely to lose any data. However, if you decide to use Google Colab, the instructions for the code will be different, but you can come to my office hours for help getting it set up!
R
What is R?
R is another programming language, which, like Python, is commonly used in data science. R allows you to do common data science tasks like graphing in fewer lines of code than Python, but R is generally less popular than Python (except in some fields such as business). In terms of which language is better to learn, you should know both to be competitive in the data science job market. The language you will use the most will depend on your preferences and what field you are in and the types of data you end up using.
Installing R
Installing R is as simple as downloading the correct installer from r-project.org. Note that some installations, like the one for Mac, are CPU specific so be careful on which one you install. This download, once you do through the download installer, will create a new app on your computer, called R with the R logo shown above. If you click on this it will launch a window where you can write R code. You can run your R code from here, or you can run a few commands in this window to allow you to write and run R code in Jupyter Notebooks. Just follow these instruction to get R running in Jupyter Notebooks. Note that if you have a Mac, you need to run these instructions from the terminal, not the app (see me in office hours!).
Unfortunately, at least at the time I am writing these notes, there is no good, free option to run R code through the web browser. However, if you are having trouble getting R installed on your computer or do not want to install extra apps, come talk to me during office hours.
MySQL
What is MySQL?
In this class (at least until we cover MySQL) you will be given data files formatted as comma separated values (csv). This means that you will be given a table of data, formatted just that the rows are separated by an “Enter” and the columns are separated by a comma. This type of file is very easy to import into Excel, Python, and R. However, a csv file is not necessarily the best choice in all scenarios. For example, if you have a very large data file trying to store and use it as a csv will use a lot of the memory on your computer and slow it down. Also, with a csv file you have to import the entire file, even if you only need part of it. Finally, the csv file has to be stored locally, so only one person has access to it, meaning it is not the best format for collaborative projects.
MySQL, which is a version of SQL (Structured Query Language) is a database software which overcomes many of these problems. With a MySQL database, you can access your data through Python, but without needed to open the entire file. MySQL easily allows you to work with large data sets and easily use just part of the data set as well. Typically, in real life applications, a MySQL database will be running on another computer or on a server (a very big computer used to distribute information to many people), but in the class you will host your own database on your computer.
Installing MySQL
Installing MySQL is easy, but it does require you to have Python and Jupyter Notebooks installed on your computer or to do this in Google Colab. Open Jupyter Notebooks or Google Colab and type the following code in and then run it:
!pip install pymysql
For Jupyter Notebooks you just need to run this once and it is installed on your computer for future uses. For Google Colab, you will need to run this line of code every time you want to use MySQL (since Google Colab runs on cloud computers you get a fresh Python installation every time you access it).
Computational Thinking
Computational thinking referes to the method of thinking about a problem and its solution in the context of programming. The way you think about solving a problem computationally is slightly different than you normally do because you need to develop a step-by-step solution that can be translated into programming. The steps that go into computational thinking are as follows:
- The first step is to break down the problem into small pieces. Each of these pieces should be able to be solved and the solution can then be used in the next piece. For example, a common thing you will be asked to do in this class is to create a graph from data given to you in a file. Therefore, the pieces of this problem are to import the data file into Excel/Python/R, format the data we want to graph, create the graph, and finally make the graph look nice.
- Next you need to formulate a solution to each of the small problems that can be expressed through programming. This is where you start thinking of the tools available in Excel/Python/R that you need to make sure are included to solve the problem and what pieces of code would be useful.
- Finally, you need to implement your solution in code and then test to make sure the solution works. This step also includes debugging which is the process of finding errors, or bugs, in your coding and fixing them.
Data Ethics
Ethics, Bias, and Privacy in Data Science
In this class you will rarely be gathering your own data. Rather, you will be using data either provided by me or that you find already collected on the internet. However, even though you are not collecting the data yourself, you still need to consider the ethics, bias, and privacy that went into the data set. How, where and why was the data collected? Who collected the data and what was its intended purpose? Any data set you use should not have problematic answers to these questions.
Let’s look at an example. In 2018 the data science firm Cambridge Analytica used data from Facebook users to do analysis on political preferences and then provided targeted political ads to the Facebook users. The problem with this is that the Facebook users did not agree to have their data collected for this purpose and did not even know that Facebook was collecting this data. Both Cambridge Analytica and Facebook got into a good deal of trouble over this issue and Cambridge Analytica ended up closing. The data collected by Facebook was collected without the user’s consent and was being used for, arguably, not good reasons.
Another example of how ethics, bias, and privacy comes into data science is in the context of introducing a bias into a data set. If you are surveying a large population, it can be easy to either leave out an entire group (or under-represent it) or to over-represent another group. This can lead you to making conclusions that are not entirely true about the population at large. For example, let’s say that Alliance hires a firm to determine what community projects are most needed by the citizens. The firm does a survey of people around Alliance, but restricts itself to only survey people around Mount Union and the hospital. While they may have surveyed a lot of people this way, there data set is going to be biased towards young adults and white collared workers, and likely does not represent Alliance at large. This may lead them to some incorrect conclusions about what the majority of Alliance residents would want. A biased data set, even if it was introduced unintentionally, will lead to bad predictions.