Explicit indexes
Indexes make subsetting simpler by using the loc
method, but index values do not need to be unique. Multi-level indexes (hierarchical indexes) can also be created by setting multiple columns as the index. However, using indexes can make our code more complicated and violate the concept of tidy data, where each variable has its column.
Setting and removing indexes
# Look at temperatures
print(temperatures)
# Set the index of temperatures to city
temperatures_ind = temperatures.set_index('city')
# Look at temperatures_ind
print(temperatures_ind)
# Reset the temperatures_ind index, keeping its contents
print(temperatures_ind.reset_index())
# Reset the temperatures_ind index, dropping its contents
print(temperatures_ind.reset_index(drop=True))
Subsetting with .loc[]
# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]
# Subset temperatures using square brackets
print(temperatures[temperatures['city'].isin(cities)])
# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])
Setting multi-level indexes
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country', 'city'])
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')]
# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])
Sorting by index values
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())
# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level='city'))
# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=['country', 'city'], ascending=[True, False]))
Slicing and subsetting with .loc and .iloc
Slicing is a technique used to select elements from lists or DataFrames in pandas. We can slice rows and columns by specifying the first and last values to include in the slice. The loc
method is used to slice by index values and the iloc
method is used to slice by row/column number. We can also slice by partial dates. in pandas. It is important to sort the index before slicing in a DataFrame. When slicing inner index levels, it is necessary to pass the first and last positions as tuples.
Slicing index values
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()
# Subset rows from Pakistan to Russia
print(temperatures_srt.loc['Pakistan':'Russia'])
# Try to subset rows from Lahore to Moscow
print(temperatures_srt.loc['Lahore':'Moscow'])
# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[('Pakistan', 'Lahore'):('Russia', 'Moscow')])
Slicing in both directions
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq","Baghdad")])
# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:,"date":"avg_temp_c"])
# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq","Baghdad"), 'date':'avg_temp_c'])
Slicing time series
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures['date'] >= '2010-01-01') & (temperatures['date'] <= '2011-12-31')]
print(temperatures_bool)
# Set date as the index and sort the index
temperatures_ind = temperatures.set_index('date').sort_index()
# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc['2010-01-01':'2011-12-31'])
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc['2010-08-01':'2011-02-28'])
Subsetting by row/column number
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22, 1])
# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])
# Use slicing to get columns 3 to 4
print(temperatures.iloc[:,2:4])
# Use slicing in both directions at once
print(temperatures.iloc[:5,2:4])
Working with pivot tables
In this section, we will perform subsetting and calculations on pivot tables.
Pivot temperature by city and year
# Add a year column to temperatures
temperatures['year'] = temperatures['date'].dt.year
# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(values='avg_temp_c', index=['country', 'city'], columns='year')
# See the result
print(temp_by_country_city_vs_year)
Subsetting pivot tables
# Subset for Egypt to India
print(temp_by_country_city_vs_year.loc['Egypt':'India'])
# Subset for Egypt, Cairo to India, Delhi
print(temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi')])
# Subset for Egypt, Cairo to India, Delhi, and 2005 to 2010
print(temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi'), '2005':'2010'])
Calculating on a pivot table
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()
# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year==mean_temp_by_year.max()])
# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis='columns')
# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city==mean_temp_by_city.min()])