Data Science#

According to Wikipedia:

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Bringing data to life with graphs and analysis is what makes Jupyter so special. In this lesson you’ll get an introduction to Pandas, a library for importing, processing and graphing data.

Get started with Pandas using the Pandas tutorials:

https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html

Tables#

Data organized into tables (or tabular data) is a convenient and powerful way to represent information about a group of related items. Tables consist of rows that each represent one entity and columns that are attributes of the entity.

For example, let’s consider the MLB Player dataset from a previous lesson:

[ ]:
import pandas
df = pandas.read_csv('files/mlb_players.csv')
df

In the example above each row represents one player and the columns are the pieces of information about that player. Columns have a data type, just like variables. You can have as many columns and rows as you like, within the limit of the computer’s memory.
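As a small sketch of rows, columns and column data types, the cell below builds a tiny table by hand. The three players and their values are made up for illustration and are not taken from mlb_players.csv.

```python
import pandas

# Hypothetical three-player table; these values are invented for
# illustration and do not come from mlb_players.csv.
players = pandas.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Height": [74, 78, 71],            # inches
    "Weight": [180.0, 215.0, 190.0],   # pounds
})

n_rows, n_cols = players.shape   # (number of rows, number of columns)
column_types = players.dtypes    # one data type per column

print(n_rows, n_cols)
print(column_types)
```

Each column reports its own type: whole numbers for Height, floating point numbers for Weight, and text for Name.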

The DataFrame#

The heart of the Pandas library is the DataFrame. The DataFrame represents a table and gives you access to the algorithms you learned in the last lesson without having to write the for loops yourself. The algorithms in Pandas are highly optimized, and much of the underlying work is done in compiled code (C and Cython), so they usually run much faster than equivalent loops written in plain Python.

Data frame

But the algorithms you learned don’t work on tables; they work on lists. Pandas calls its list-like column a Series, and getting one from a table is simple: just select the column you want to turn into a list.

Series

For example, let’s get the height of every player:

height = df['Height']
height

Try the example:

[ ]:

In order to stay fast, Pandas only supports certain algorithms, like taking the sum of a column. You can write your own with a for loop but that will be much slower.
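To make the comparison concrete, here is a minimal sketch that computes the same sum twice, once with a hand-written for loop and once with the built-in method. The heights are made-up values, not data from the lesson's file.

```python
import pandas

# Hypothetical heights in inches (made-up values, not from the dataset).
height = pandas.Series([74, 78, 71, 82])

# Reduction written by hand with a for loop.
total = 0
for h in height:
    total += h

# The same reduction using the built-in, optimized method.
builtin_total = height.sum()

print(total, builtin_total)
```

Both approaches give the same answer. On a Series with millions of values the built-in method is typically the faster of the two.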

Mapping#

In Pandas the mapping operations work on a Series and create a new Series. Most of the time you want to add the new series to the original DataFrame so that you have a derived column. Derived columns can make your data easier to work with.

New column

The DataFrame and Series support all of Python’s basic operators, and each operator is applied to every value at once. That makes it easy to do mapping of one or more columns. For example, to convert the weight of each player from pounds to kilograms:

weight = df['Weight']
weight / 2.205

Or simply:

df['Weight'] / 2.205

Try it:

[ ]:

If you want to create a new column for the mapped series, it’s easy:

df['Weight (kg)'] = df['Weight'] / 2.205
df['Height (m)'] = df['Height'] / 39.37
df

Now the DataFrame has additional columns for metric height and weight.

[ ]:

Mapping functions can use more than one column in the calculation! For example, you can calculate the Body Mass Index (BMI) of each of the players. The BMI is defined to be:

\[\mathrm{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2}\]

To calculate the BMI:

df['BMI'] = df['Weight (kg)'] / (df['Height (m)'] ** 2)
df

Try adding the BMI column:

[ ]:
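As a worked check of the formula, the sketch below runs the same three mapping steps on a single hypothetical player (180 pounds, 74 inches). The numbers are invented so the result can be verified by hand.

```python
import pandas

# One hypothetical player: 180 pounds and 74 inches tall (made-up values).
df = pandas.DataFrame({"Weight": [180.0], "Height": [74.0]})

# Same conversions and formula as in the lesson.
df["Weight (kg)"] = df["Weight"] / 2.205
df["Height (m)"] = df["Height"] / 39.37
df["BMI"] = df["Weight (kg)"] / (df["Height (m)"] ** 2)

print(df.round(2))
```

By hand, that is about 81.6 kg divided by about 1.88 m squared, which gives a BMI of roughly 23.1.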

Not everything works like you would expect when you perform a mapping function. For example, you might try to change the height to feet and inches with an f-string:

f'''{height // 12}'{height % 12}"'''  # Totally broken!

If you want to do something that’s not supported by the built-in mapping operators you can write a function to do it. For example:

def feet_inches(inches):
    """Return a string in feet and inches from inches."""
    return f'''{inches // 12}'{inches % 12}"'''

height.apply(feet_inches)
[ ]:
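The difference between the two approaches can be sketched on a small, made-up Series. The f-string formats the whole Series into a single string, while apply calls the function once per value.

```python
import pandas

def feet_inches(inches):
    """Return a string in feet and inches from inches."""
    return f'''{inches // 12}'{inches % 12}"'''

# Hypothetical heights in inches (made-up values).
height = pandas.Series([74, 80, 69])

# The f-string turns the entire Series into ONE string, not one per player.
broken = f'''{height // 12}'{height % 12}"'''
print(type(broken))

# apply() calls feet_inches once per value and returns a new Series.
formatted = height.apply(feet_inches)
print(formatted.tolist())
```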

Filtering#

Filtering reduces the rows in a DataFrame or Series to just the ones of interest.

Filtering rows

Filtering works on either a Series or a DataFrame. When filtering a Series the algorithm works just like the filters you implemented. For example, to show only the player heights over 80 inches:

height[height > 80]

Try the example:

[ ]:

Look strange? The Pandas library takes advantage of an advanced Python feature called operator overloading, which lets a library redefine what operators like > and [] do. That sometimes makes it hard to read. In the example above the square brackets are the index operator. Inside of them there’s a filtering expression that creates a new Series of True or False values, one for each row. See for yourself:

height > 80

Try it:

[ ]:

Filtering on a Series is fine, but it loses the connection between the height and the other data in the row. Filtering can also be applied to the whole DataFrame. When you filter this way the output is the rows of the DataFrame that match the condition. For example:

df[ df['Height'] > 80 ]

Try that:

[ ]:
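The two steps of a filter can be separated to see what each one does. This is a sketch on a hypothetical three-row table; the variable names mask and tall are chosen here for illustration.

```python
import pandas

# Hypothetical players (made-up values).
df = pandas.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Height": [74, 82, 81],
})

mask = df["Height"] > 80   # Series of True/False, one value per row
tall = df[mask]            # keeps only the rows where the mask is True

print(mask.tolist())
print(tall)
```

Because the whole row is kept, the name stays attached to each height that passed the filter.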

Awesome!

Reduction#

Reduction generates a single value from a Series.

Aggregation

Pandas supports many reduction operations. Here are some examples:

| Reduction function | Example | Description |
|---|---|---|
| sum() | height.sum() | Return the sum of all values. |
| median() | height.median() | Return the median of values. |
| min() | height.min() | Return the minimum value. |
| max() | height.max() | Return the maximum value. |
| len() | len(height) | Return the number of values in the series. |
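Here is a short sketch of these reductions on a made-up Series of four heights, small enough to check each result by hand.

```python
import pandas

# Hypothetical heights in inches (made-up values).
height = pandas.Series([74, 78, 71, 82])

print(height.sum())     # 305
print(height.median())  # 76.0, the average of the two middle values 74 and 78
print(height.min())     # 71
print(height.max())     # 82
print(len(height))      # 4
```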

Try the reduction functions in the next cell:

[ ]:

If you’re taking statistics, or are going to, here’s a great function for you. The describe method computes summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) on all numerical columns:

df.describe()

Try it:

[ ]:

Plotting#

A picture is worth 1,000 words! Pandas makes it easy to plot the data in a Series or multiple series in a DataFrame. There are also many kinds of plots available (too many to cover here).

Plotting

The data we have doesn’t really have an X-axis, so we’ll start with a histogram. A histogram shows us how many people fall into “bins” defined by a range of a certain attribute. Histograms of measures like BMI often resemble a bell curve.

To make a histogram:

df['BMI'].plot.hist()

Is the plot a bell curve?

[ ]:
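To see what the bins in a histogram contain, the sketch below counts values per bin without drawing anything. It uses pandas.cut, which is a different function from plot.hist and may treat values that fall exactly on a bin edge differently. The BMI values and the bin edges are made up for illustration.

```python
import pandas

# Hypothetical BMI values (made up for illustration).
bmi = pandas.Series([21.5, 22.9, 24.2, 24.8, 25.1, 26.7, 28.3])

# Three equal-width bins: 20 to 23, 23 to 26 and 26 to 29 (assumed edges).
bins = [20, 23, 26, 29]

# Label each value with its bin, then count the values in each bin.
counts = pandas.cut(bmi, bins=bins).value_counts().sort_index()

print(counts)
```

The middle bin holds the most values, which is the start of the bell shape the histogram draws as bars.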

So you can see more with plotting, let’s get a new data set. Pandas will load data directly from a URL. This example loads COVID-19 data from the CDC that’s been lightly processed by the New York Times:

covid = pandas.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
)
covid

The download will take a little while. As of this writing there are 2 million rows!

[ ]:

Pandas can’t guess how our data is represented. In this case the date column is the X-axis or index. We need to tell the DataFrame that so our plots come out right:

covid = covid.set_index('date')
covid
[ ]:
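The effect of set_index can be sketched on a few made-up rows. One assumption goes beyond the lesson's code: the text dates are converted to timestamps with pandas.to_datetime first, which lets Pandas treat the index as real dates rather than strings.

```python
import pandas

# Hypothetical rows with the same 'date' and 'cases' column names as the
# county data; the values are made up.
covid = pandas.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "cases": [1, 3, 6],
})

# Assumption: convert the text dates to timestamps before indexing.
# The lesson's own code indexes by the original strings instead.
covid["date"] = pandas.to_datetime(covid["date"])
covid = covid.set_index("date")

print(covid.index)

# Rows can now be looked up by date.
print(covid.loc[pandas.Timestamp("2020-03-02"), "cases"])
```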

The data has a row for every county and day since the first reported COVID-19 case on January 21st 2020. Let’s select just the data from Santa Cruz County:

covid_sc = covid[covid['fips'] == 6087.0]
covid_sc

The covid_sc DataFrame now only contains Santa Cruz cases.

[ ]:

Let’s plot the cumulative cases over time:

covid_sc['cases'].plot()

Try that:

[ ]:

Pandas is huge and there’s enough to learn to fill a whole 16-week class, so I’ll leave you with one more example. What if we wanted to see the new cases every day? That’s the day-to-day change in the cumulative cases (a discrete version of the derivative), and it’s a mapping function that’s built into Pandas. Here’s how to plot the new cases:

covid_sc['cases'].diff().plot()
[ ]:
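To see what diff computes before plotting it, here is a sketch on a short, made-up series of cumulative counts. Each result is the current value minus the previous one.

```python
import pandas

# Hypothetical cumulative case counts for five consecutive days.
cumulative = pandas.Series([1, 3, 6, 6, 10])

# diff() subtracts the previous value from each value.
new_cases = cumulative.diff()

print(new_cases.tolist())
```

The first result is NaN because the first day has no previous day to subtract. The remaining values are 2, 3, 0 and 4 new cases.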