Data Science

According to Wikipedia:

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Bringing data to life with graphs and analysis is what makes Jupyter so special. In this lession you’ll get an introduction to Pandas a library for importing, processing and graphing data.

Get started with Pandas using the Pandas tutorials:

Fetch a Dataset

The command in the next cell isn’t Python. When a cell has a line that starts with a ! the command is a UNIX command. Run the cell and it will download a dataset that is the number of COVID casses and deaths from all US counties as reported by the CDC and collected by the New York Times.

[ ]:
!wget -O us-counties.csv

After executing the cell you should see a file called us-counties.csv in the file view. Open it.

The DataFrame

The heart of the Pandas library is the DataFrame. The DataFrame gives you access to selection, manipulation and visualization of a data set. First, let’s import the pandas library:

[ ]:
import pandas

The next cell loads the CSV file into a data frame.

[ ]:
data = pandas.read_csv('us-counties.csv', index_col=0)

Now let’s display the data:

[ ]:


The index operator lets you pick only the columns you want to see. You can select a single column:

[ ]:

Or you can pick multiple columns using double square brackets:

[ ]:
data[["county", "state"]]


What if you only want to see data from Santa Cruz? Filtering is the process of picking only the rows you want to see. You can filter the data using a somewhat obtuse but effective syntax:

[ ]:
data[data["county"] == 'Santa Cruz']

Oops! There’s a Santa Cruz County, Arizona! We want only California so we have to use the & and operator.

[ ]:
data[(data['county'] == 'Santa Cruz') & (data['state'] == 'California')]

The syntax is not very Pythonic. The parens ( and ) are REQUIRED when you use the & and | operators.


A picture is worth 1,000 words! Pandas makes it easy to plot a DataFrame.

[ ]:

Let’s look at just cases from Santa Cruz:

[ ]:
santa_cruz_cases = data[(data['county'] == 'Santa Cruz') & (data['state'] == 'California')]["cases"]

Numerical Analysis

Pandas has the power to apply functions to rows and colums. Our dataset only count cumulative cases. What if we want to know new cases? It’s easy to compute, the number of new cases in a day is the difference between the total cases that day and the total cases the day before (a.k.a. the derivative of total cases).

[ ]:
derivative = data['cases'].groupby('date').sum().diff()

The derivative data is pretty noisy. Let’s filter it through a running average:

[ ]:
average = derivative.rolling(7).mean().plot(figsize=(10,6))

Data Analysis

Pandas is a powerful tool for analysis. Unfortunately, its syntax can be pretty confusing. This section is a demonstration of the power of Pandas. You’re not expected to know this syntax for class but I wanted to show off some of what’s possible. The next cell fetches two California data sets:

  1. covidvaccines.csv - A list of vaccinations by county (with some extra junk)

  2. income_data.csv - A list of income and other financial metrics by county.

The goal of this analysis is to see if vaccination rates correlate with income. To know if that’s true or not we’ll make a scatter plot. For each county we’ll calculate the percent vaccinated and plot that by the median income for the latest year we have data.

[ ]:
!wget -O covidvaccines.csv
!wget -O income_data.csv

Process the Vaccine Data

The vaccine dataset is indexed by date. We only want the latest date that there is data for.

[ ]:
vaccines = pandas.read_csv('covidvaccines.csv')
max_date = vaccines['administered_date'].max()
print("Latest date:", max_date)

Let’s reduce the table down to only the rows for the maximum date, and only the columns we care about.

[ ]:
vaccines_by_county = vaccines[vaccines['administered_date'] == max_date][
        ["county", "cumulative_fully_vaccinated"]

# Show the data

Process the Income Data

We want two things from the income data, the median income and the total population. We need the total population to calucate the percentage of the vaccinated. Again, this table is indexed by year, we only want to see the latest year’s data.

[ ]:
income = pandas.read_csv('income_data.csv')
max_year = income['Taxable Year'].max()
print('The latest tax year is:', max_year)

Now we make the table we want:

[ ]:
income_by_county = income[(income['Taxable Year'] == max_year)][
        ['County', 'Median Income', 'Population']
# Show the data

Merge Datasets

The data we need has to be combined (or joined) based on the common column: The name of the county. If one table has a county that’s not in the other table it gets ignored. That’s good because the vaccination data has bogus counties, like “Unknown”.

[ ]:
combined = pandas.merge(vaccines_by_county, income_by_county,
                left_on='county', right_on='County').set_index('county')

We also want to compute the percentage of people vaccinated in each county. That’s a row-wise operation. Pandas makes that as easy as simple math (no for loop required).

[ ]:
combined['vaccinated_percent'] = 100 * combined['cumulative_fully_vaccinated'] / combined['Population']

Finally, let’s take just the columns we want in our combined data set.

[ ]:
combined = combined[['Median Income', 'vaccinated_percent']]

Plot the Data

The code in the next cell produces a pretty plot. It also uses Numpy to fit a line to our scatter plot.

[ ]:
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
combined.plot(x=0, y=1, kind='scatter', figsize=(16,10), ax=ax, color='blue')
plt.xlabel("Annual Median Income ($) for Households")
plt.ylabel("Percentage of Population Fully Vaccinated")

#Labels: Use annotate and specify the (labels, (x_coordinate,y_coordinate))
for k, v in combined.iterrows():
    ax.annotate(k,(v["Median Income"],v["vaccinated_percent"]))

# Add a trend line (a polynomial, degreee 1) based on dataframe. x-axis is the first column. y-axis is the second column
z = np.polyfit(combined.iloc[:,0], combined.iloc[:,1], 1)
p = np.poly1d(z)