Introduction to DataFrame
Pandas is a popular data science in the Python data science community. Pandas is built on top of NumPy which provides a multidimensional array of objects for easy data manipulation that pandas use to store data and Matplotlib to visualize our data.
Rectangular data, also known as tabular data, is the most common form for data analysis and is represented as a DataFrame object in pandas or a table in SQL.
Example Rectangular Data:
Username | Gender | Created At | |
johndoe | Male | johndoe@email.com | 2023-01-18 |
user3 | Female | user3@email.com | 2023-01-30 |
user1 | Male | user1@email.com | 2023-01-30 |
Inspecting a DataFrame
.head(n=5)
: Returns the first n rows (the “head” of the DataFrame) the default value ofn
is 5.info()
: Print a concise summary of a DataFrame which shows the information on each of the columns, such as the data type and the number of missing values..shape
: Returns the number of rows and columns of the DataFrame in Tuple..describe()
: Generate descriptive statistics. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excludingNaN
values.
# Print the head of the homelessness data
print(homelessness.head())
# Print information about homelessness
print(homelessness.info())
# Print the shape of homelessness
print(homelessness.shape)
# Print a description of homelessness
print(homelessness.describe())
Parts of a DataFrame
.values
: Return a Numpy representation of the DataFrame. this is deprecated and according to documentation we have been suggested to use.to_numpy()
method instead.columns
: Return the column labels of the DataFrame..index
: Return The index (row labels) of the DataFrame.
# Import pandas using the alias pd
import pandas as pd
# Print the values of homelessness
print(homelessness.values)
# Print the column index of homelessness
print(homelessness.columns)
# Print the row index of homelessness
print(homelessness.index)
Sorting and subsetting
The simplest ways to find interesting parts of the DataFrame
Sorting, by sorting DataFrame we can find the most interesting data is at the top of DataFrame.
Subsetting, by extracting a subset of data from a larger dataset we can select specific columns, and filter the dataset with the logical condition.
Sorting rows
We can use sort_values()
method to sorting rows in pandas DataFrame.
Sort on … | Syntax |
one column | df.sort_values("breed") |
multiple columns | df.sort_values(["breed", "weight_kg"]) |
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values(['individuals'])
# Print the top few rows
print(homelessness_ind.head())
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values(['family_members'], ascending = False)
# Print the top few rows
print(homelessness_fam.head())
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])
# Print the top few rows
print(homelessness_reg_fam.head())
Subsetting columns
We can use square brackets ([]
) to select specific columns that make sense to us.
# Select the individuals column
individuals = homelessness['individuals']
# Print the head of the result
print(individuals.head())
# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]
# Print the head of the result
print(state_fam.head())
# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]
# Print the head of the result
print(ind_state.head())
Subsetting rows
This is sometimes known as filtering rows or selecting rows.
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]
# See the result
print(ind_gt_10k)
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == 'Mountain']
# See the result
print(mountain_reg)
# Filter for rows where family_members is less than 1000
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]
# See the result
print(fam_lt_1k_pac)
Subsetting rows by categorical variables
When we are filtering rows we often use "or" (|
) to select rows from multiple categories. This can get tedious when we want to filter more than two categories
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness['region'].isin(['South Atlantic', 'Mid-Atlantic'])]
# See the result
print(south_mid_atlantic)
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]
# See the result
print(mojave_homelessness)
New columns
Adding a column to a DataFrame in pandas involves creating a new column and assigning values to it. This process is commonly referred to as "mutating," "transforming," or "feature engineering."
Adding new columns
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']
# Add p_individuals col as proportion of total that are individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']
# See the result
print(homelessness)
Combo-attack!
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness['state_pop']
# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]
# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values(['indiv_per_10k'], ascending=[False])
# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state', 'indiv_per_10k']]
# See the result
print(result)