Data wrangling in Python

https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/

Course given by Stijn Van Hoey and Joris Vanden Bossche on 25/11 and 02/12/2024 at Ghent University:

https://www.flames-statistics.com/iu-courses/250/

Day 0: preparation

After installing Anaconda, check that it is up to date: open a console:

conda update -n base conda

The environment for the course must be activated: in Anaconda, once it has been created (see below):

conda activate FLAMES-python

Download the file specifying the environment:

wget https://raw.githubusercontent.com/jorisvandenbossche/FLAMES-python-data-wrangling/main/environment.yml

which contains:

name: FLAMES-python
channels:
- conda-forge
dependencies:
- python=3.11
- ipython
- jupyter
- jupyterlab>=3
- numpy
- pandas>=1
- matplotlib>3
- mplleaflet
- ipympl
- seaborn
- plotnine
- pyproj
- requests
- openpyxl

Then, in Anaconda: base (root) > Terminal:

conda install -n base conda-libmamba-solver
conda config --set solver libmamba
conda config --add channels conda-forge
conda config --set channel_priority strict
cd FOLDER_PATH_TO_ENVIRONMENT_FILE
conda env create -f environment.yml

where FOLDER_PATH_TO_ENVIRONMENT_FILE is the folder where environment.yml was downloaded earlier. Then type:

conda activate FLAMES-python
ipython

to open a Python terminal, where you can type:

import pandas
import matplotlib

Day 1

git clone https://github.com/jorisvandenbossche/FLAMES-python-data-wrangling.git
cd FLAMES-python-data-wrangling
conda activate FLAMES-python
python check_environment.py
jupyter lab

Do not close this window. A browser window opens; in the file explorer, select 'Notebooks': that is where the course notebooks live.

  • to create a new cell to write code in: click on an existing cell, press the "Escape" key and then the "b" key
  • to run code: click on the cell (to place the cursor in it), then Shift + "Return"

To get help on an object:

help(object)

To get help on the parameters of a function, press Shift + "Tab" after opening the parenthesis.

Pandas

dataframe : object with rows and columns

Each column in a dataframe is a series.

import pandas as pd

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries

List of the columns (variables):

countries.columns

First/last 2 rows:

countries.head(2)
countries.tail(2)

Sorting:

countries.sort_values("population")

Plot (the 2nd and 3rd lines are equivalent):

countries.plot()
countries["population"].plot(kind="box")
countries["population"].plot.box()

Import data:

pd.read_csv

Export data: method of the dataframe object

countries.to_excel
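As a small sketch (using a reduced version of the countries DataFrame above): when to_csv is called without a path, it returns the CSV text instead of writing a file, which is handy for a quick check. to_excel("countries.xlsx") would write an Excel file instead (it needs openpyxl, which is in the course environment).

```python
import pandas as pd

# reduced version of the countries DataFrame used above
countries = pd.DataFrame({
    "country": ["Belgium", "France"],
    "population": [11.3, 64.3],
})

# with no path argument, to_csv returns the CSV content as a string
csv_text = countries.to_csv(index=False)
```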

Exercise 1

import pandas as pd
# importer les données dans un dataframe
df=pd.read_csv("data/titanic.csv")
# 5 premières lignes
df.head(5)
# nombre de records
len(df)
df.shape[0]
# select Age column
df['Age']
# make a boxplot of Age
df['Age'].plot.box()
# sort by age desc
df.sort_values("Age", ascending=False)
df.sort_values(by="Age", ascending=False)

Exercise 2

# Import pandas as pd and read the Titanic dataset
import pandas as pd
df = pd.read_csv("data/titanic.csv")
# what is the average age of the passengers?
df['Age'].mean()
# Plot the age distribution of the titanic passengers
df['Age'].plot.hist()
# What is the survival rate? (the relative number of people that survived)
df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()
# What is the maximum Fare? And the median?
df['Fare'].max()
df['Fare'].median()
# Calculate the 75th percentile (quantile) of the Fare price 
help(df.quantile)
df['Fare'].quantile(0.75)
df.quantile(0.75, numeric_only=True)
# Calculate the scaled Fares (scaled relative to its mean), and add this as a new column ('Fare_scaled') to the DataFrame
df['Fare_scaled'] = df['Fare'] / df['Fare'].mean()
# Calculate the log of the Fares
import numpy as np
fare_log = np.log(df['Fare'])
print(fare_log)

"Magics": not part of Python itself, but of the notebook environment (they start with '%')

#liste des variables définies
%whos
a = np.arange(1000)
%timeit a**2

Subsetting data

# select columns:
countries[['area', 'population']] # double [[]]

Double brackets because we pass a list inside [].

# select rows:
mask = countries['population'] > 50
mask
countries[mask]
countries[0:100:10] # 1st, 11th, 21st, …
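The row selection above can be sketched end to end; the countries DataFrame is rebuilt here (a reduced version, population only) so the snippet stands alone:

```python
import pandas as pd

countries = pd.DataFrame({
    "country": ["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
    "population": [11.3, 64.3, 81.3, 16.9, 64.9],
})

# boolean mask: True where the condition holds
mask = countries["population"] > 50
large = countries[mask]          # keep only the True rows

# positional slicing also works: start:stop:step
every_other = countries[0:5:2]   # rows 0, 2 and 4
```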

Conditions (return True or False = boolean)

  • ==, !=, <, >=
  • & (and), | (or)

.isin() checks membership in a list of values; .between() checks whether values lie between two boundaries.

→ boolean mask

to select data:

data_filtered = data[boolean_mask]
countries['country'] == "Belgium"
countries['country'].isin(["Belgium", "France"])
capital = countries['capital']
capital == "Brussels"
s = "Brussels"
s.startswith("A")
s.startswith("B")
capital.str.startswith("B")
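To make the distinction concrete, a small sketch contrasting .isin() (membership in a list) with .between() (two boundaries, inclusive by default), on toy data:

```python
import pandas as pd

s = pd.Series([10, 25, 50, 80])

# isin: True where the value appears in the given list
in_list = s.isin([10, 80])

# between: True where lower <= value <= upper (inclusive by default)
in_range = s.between(20, 60)
```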

Exercise 3

# calculate mean age of males
df_males = df[df['Sex'] == 'male']
df_males['Age'].mean()
# Number of passenger aged > 70
(df['Age']>70).sum()
# select the passenger between 30 and 40 years old
aged30_40 = df[(df['Age'] >=30) & (df['Age'] <= 40)]
# For a single string name = 'Braund, Mr. Owen Harris', split this string (check the split() method of a string) and get the first element of the resulting list
name = 'Braund, Mr. Owen Harris'
name.split(",")[0]
# Convert the solution of the previous exercise to all strings of the Name column at once. Split the 'Name' column on the ,, extract the first part (the surname), and add this as new column 'Surname'
df['Surname'] = df['Name'].str.split(",").str.get(0)
# Select all passenger that have a surname starting with 'Williams'
df[df['Surname'].str.startswith('Williams')]
# Select all rows for the passengers with a surname of more than 15 characters
df[df['Surname'].str.len() > 15]

Dates and times in pandas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
plt.style.use('ggplot')
 
s = pd.Series(['2016-12-09 10:00:00', '2016-12-09 11:00:00', '2016-12-09 12:00:00'])
# string > date
ts = pd.to_datetime(s)
# get years
ts.dt.year
ts.dt.weekday
ts.dt.hour
 
#single row : no need to call "dt"
ts[0].dayofweek
 
# to avoid ambiguity about the format:
pd.to_datetime("2021-12-01", dayfirst=True)
pd.to_datetime("09/12/2016", format="%d/%m/%Y")

A detailed overview of how to specify the format string, see the table in the python documentation: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
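A short sketch of the round trip with those format codes: strptime-style codes parse a string into a Timestamp, and strftime formats it back:

```python
import pandas as pd

# parse with an explicit format (%d = day, %m = month, %Y = 4-digit year)
ts = pd.to_datetime("09/12/2016", format="%d/%m/%Y")

# format the Timestamp back into a string
text = ts.strftime("%Y-%m-%d")
```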

# add an amount of time to each element of the series
ts + pd.Timedelta('6 days')
#create time series
pd.date_range(start="2024-01-01", periods=10, freq="D")
pd.date_range(start="2024-01-01", end="2024-06-01", freq="MS") # start of the month

Set time as index in a dataframe:

data["Time"] = pd.to_datetime(data['Time'])
data = data.set_index("Time")
data["L06_347"] # show time as index
data.index
data.index.dayofyear

Select data (a slice)

data[pd.to_datetime("2012-01-05"):pd.to_datetime("2012-01-10")]
data["2012-01-05":"2012-05-10"] # the same
data["2012-01":"2012-02"] # from start of Jan to the end of Feb
data["2012-01":"2012-01"] # whole Jan month
data["2012":"2012"] # whole year 2012
data["2012":] # until the end
data[:"2012"] # from start until the end of 2012
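The partial-string slicing above can be tried on a self-contained sketch; the frame here is synthetic daily data standing in for the CSV used in the course:

```python
import pandas as pd

# synthetic daily data for the first 90 days of 2012
idx = pd.date_range("2012-01-01", periods=90, freq="D")
data = pd.DataFrame({"flow": range(90)}, index=idx)

# partial-string slice: all of January and February 2012
janfeb = data["2012-01":"2012-02"]
```

Note that 2012 is a leap year, so the slice covers 31 + 29 = 60 days.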

Exercise 4

data = pd.read_csv("data/vmm_flowdata.csv", index_col=0, parse_dates=True)
# select dates from start of 2012 to the end
data["2012":]
# Select all data in January for all different years
data[data.index.month == 1]
# Select all data in April, May and June for all different years
data[(data.index.month == 4) | (data.index.month == 5) | (data.index.month == 6)]
data[data.index.month.isin([4, 5, 6])]
# Select all 'daytime' data (between 8h and 20h) for all days
data[(data.index.hour>8) & (data.index.hour<20)]

Resample

Taking the mean per day/year(end) (all data falling in the same period are aggregated together):

data.resample('D').mean()
data.resample('YE').mean()
data.resample('YE').max()

List of the offset aliases: http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
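A minimal sketch of resampling with one of those aliases ('MS' = month start, which works across pandas versions; 'ME'/'YE' need pandas >= 2.2). The data here are synthetic:

```python
import pandas as pd

# synthetic daily series covering January and February 2024
idx = pd.date_range("2024-01-01", periods=60, freq="D")
s = pd.Series(range(60), index=idx)

# one mean per month, labelled with the month start ('MS')
monthly = s.resample("MS").mean()
```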

Exercise 4 (continued)

#...
# Plot the monthly mean and median values for the years 2011-2012 for 'L06_347'. Use the last day of each month to represent the month.
data[data.index.year.isin([2011, 2012])]['L06_347'].resample("ME").agg(["mean", "median"]).plot()
subset = data['2011':'2012']['L06_347']
subset.resample('ME').agg(['mean', 'median']).plot()
# Plot the monthly mininum and maximum daily average value of the 'LS06_348' column. Use the first day of each month to represent the month.
daily_avg = data["LS06_348"].resample("D").mean()
daily_avg.resample("ME").agg(["min", "max"])
# Make a bar plot of the mean of the stations in year of 2013
data["2013":"2013"].mean().plot.barh()

Data processing

Exercise 5: bike 1

# read the file
df = pd.read_csv("data/fietstelpaal-coupure-links-2022-gent.zip", sep=';')
#Convert the 'Ordening' column into a pandas Timestamp Series, and assign the data to a new column named "timestamp". Make sure to read the data as UTC
df["timestamp"] = pd.to_datetime(df["Ordening"], format="%Y-%m-%dT%H:%M:%S%z", utc=True)
# Set the resulting "timestamp" column as the index of the df DataFrame
df = df.set_index("timestamp")
# drop columns
df2022 = df.drop(columns=['Datum', 'Uur5Minuten', 'Ordening', 'Code'])
# rename columns
df2022 = df2022.rename(columns={"Tegenrichting":"direction_centre", "Hoofdrichting":"direction_mariakerke",
                          "Locatie": "location", "Totaal": "total"})
# method chaining
df2022 = (
    df
    .set_index("timestamp")
    .drop(columns=['Datum', 'Uur5Minuten', 'Ordening', 'Code'])
    .rename(columns={"Tegenrichting": "direction_centre", "Hoofdrichting": "direction_mariakerke",
                     "Locatie": "location", "Totaal": "total"})
)

Group by

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
 
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
 
# calculate mean by key
df["data"][df["key"] == "A"].mean()
df["data"][df["key"] == "B"].mean()
df["data"][df["key"] == "C"].mean()
 
# faster:
df.groupby("key")["data"].mean()
df.groupby("key")["data"].agg("mean")

Exercise 6

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
 
df = pd.read_csv("data/titanic.csv")
 
# Using groupby(), calculate the average age for each sex
df.groupby("Sex")["Age"].mean()
# Calculate the average survival ratio for all passengers
df['Survived'].mean()
# Calculate this survival ratio for all passengers younger than 25
df[(df['Age'] < 25)]["Survived"].mean() 
# What is the difference in the survival ratio between the sexes?
df.groupby("Sex")["Survived"].mean()
# Make a bar plot of the survival ratio for the different classes ('Pclass' column)
df.groupby("Pclass")["Survived"].mean().plot.bar()
# Make a bar plot to visualize the average Fare payed by people depending on their age. The age column is divided in separate classes using the pd.cut() function as provided below
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))
df.groupby('AgeClass', observed=False)['Fare'].mean().plot.bar(rot=0)

groupby on several variables:

df.groupby(["Pclass", "Sex"])["Survived"].mean()
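Grouping on two keys yields a Series with a MultiIndex; .unstack() pivots the inner level into columns, which gives a readable table. A sketch with toy data standing in for the Titanic columns:

```python
import pandas as pd

# toy stand-in for the Titanic data
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Sex": ["female", "male", "female", "male"],
    "Survived": [1, 0, 1, 1],
})

# MultiIndex Series: one mean per (Pclass, Sex) combination
rates = df.groupby(["Pclass", "Sex"])["Survived"].mean()

# unstack the inner index level ('Sex') into columns
table = rates.unstack()
```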

Day 2

Seaborn

  • built on matplotlib
  • works well with pandas
import seaborn as sns

Then type sns. and press Tab to list the available functions.

titanic[['Age', 'Fare']].plot()

OK for a first glance.

# For each class on the Titanic and each gender, what was the average age?
titanic.groupby(['Pclass', 'Sex'])[['Age']].mean()
type(titanic.groupby(['Pclass', 'Sex'])[['Age']].mean()) # the result has a MultiIndex
# add a new index
age_stat = titanic.groupby(['Pclass', 'Sex'])[['Age']].mean().reset_index() # .reset_index() = output a dataframe instead of a series
 
# how to visualize it?
titanic.groupby(['Pclass', 'Sex'])[['Age']].mean().plot() # too limited
 
sns.catplot(data=age_stat, x="Sex", y="Age", col="Pclass", kind="bar")
#with colours for classes
sns.catplot(data=age_stat, x="Sex", y="Age", hue="Pclass", kind="bar")
#another way of seeing it
sns.catplot(data=age_stat, hue="Sex", y="Age", x="Pclass", kind="bar")

Types of plots (3 figure-level functions, relplot, catplot and displot, each with several kinds, e.g. kind="point"):

  • relational plot (relplot): scatterplot, lineplot
  • distribution plot (displot): histplot, kdeplot, ecdfplot, rugplot
  • categorical plot (catplot): stripplot, swarmplot, boxplot, violinplot, pointplot, barplot
sns.relplot
sns.catplot
sns.displot(data=titanic, x='Age', kind="hist")
# col = 2 separate plots (hue = colour for survival)
age_fare = sns.relplot(data=titanic, x="Age", y="Fare",
                       hue="Survived",
                       col="Sex")
 
# to see that more females survived, this is better:
age_fare = sns.relplot(data=titanic, x="Age", y="Fare",
                       col="Survived",
                       hue="Sex")

col= and row= split the figure into several subplots; this works with the figure-level functions (relplot, catplot, displot), but not with the axes-level ones (scatterplot, boxplot, …).

Exercises

#Make a histogram of the age, split up in two subplots by the Sex of the passengers.
#Put both subplots underneath each other.
#Use the height and aspect arguments of the plot function to adjust the size of the figure.
sns.displot(data=titanic, x="Age", kind="hist", row="Sex", height=4, aspect=2)
#more bins:
sns.displot(data=titanic, x="Age", kind="hist", row="Sex", height=4, aspect=2, bins=60)
 
# Make a violin plot showing the Age distribution in each of the Pclass categories comparing for Sex
# Figure based
sns.catplot(data=titanic, x="Pclass", y="Age", 
            hue="Sex", split=True,
            palette="Set2", kind="violin")
sns.despine(left=True)
# Axes based
sns.violinplot(data=titanic, x="Pclass", y="Age", 
               hue="Sex", split=True,
               palette="Set2")
sns.despine(left=True)

Also jointplot and pairplot, which show the distributions along the axes (jointplot) or a grid of pairwise plots (pairplot).

#linear regression plot
sns.lmplot()
sns.regplot()

Exercises

casualties["gender"].unique()
# all the values of the gender variable
 
Create a barplot with the number of victims ("n_victims") for each hour of the day for each category in the gender column. Before plotting, calculate the total number of victims for each hour of the day and each gender with Pandas and assign it to the variable victims_gender_hour_of_day.
 
Create a separate subplot for each gender category in a separate row.
 
Make sure to include the NaN values of the "gender" column as a separate subplot, called "unknown" without changing the casualties DataFrame data.
victims_gender_hour_of_day = casualties.groupby([casualties["datetime"].dt.hour, "gender"], 
                                                dropna=False)["n_victims"].sum().reset_index()
victims_gender_hour_of_day.head()
 
sns.catplot(data=victims_gender_hour_of_day.fillna("unknown"), x='datetime', y='n_victims', row='gender', kind='bar', height=3, aspect=3)

Tidy data

In a tidy dataset:

  • each variable is a column
  • each observation is a row
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
# create dataset
data = pd.DataFrame({
   'WWTP': ['Destelbergen', 'Landegem', 'Dendermonde', 'Eeklo'],
   'Treatment A': [8.0, 7.5, 8.3, 6.5],
   'Treatment B': [6.3, 5.2, 6.2, 7.2]
})
data
 
# go from a wide to a long data format
pd.melt(data, id_vars=["WWTP"])
 
data_long = pd.melt(data, id_vars=["WWTP"], 
                    value_name="pH", var_name="Treatment")
data_long
 
data_long.groupby("Treatment")["pH"].mean()
data_long.groupby("WWTP")["pH"].mean()
 
sns.catplot(data=data_long, x="WWTP", y="pH", 
            hue="Treatment", kind="bar")
sns.catplot(data=data_long, x="Treatment", y="pH", 
            hue="WWTP", kind="bar")
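The inverse of melt is pivot: long back to wide. A sketch using the same long-format data as above, rebuilt here so the snippet stands alone:

```python
import pandas as pd

data_long = pd.DataFrame({
    "WWTP": ["Destelbergen", "Destelbergen", "Landegem", "Landegem"],
    "Treatment": ["A", "B", "A", "B"],
    "pH": [8.0, 6.3, 7.5, 5.2],
})

# pivot: one row per WWTP, one column per Treatment (no aggregation)
wide = data_long.pivot(index="WWTP", columns="Treatment", values="pH")
```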

Exercises

#Read the "verbruiksgegevens-per-maand.xlsx" file (in the "data/" directory) into a DataFrame df.
df = pd.read_excel("data/verbruiksgegevens-per-maand.xlsx")
#Drop the "Regio" column (this column has a constant value "Regio 1" and thus is not that interesting).
df = df.drop(columns=["Regio"])
df

The actual data (consumption numbers) is spread over multiple columns: one column per month. Make a tidy version of this dataset with a single “consumption” column, and an additional “time” column.

Make sure to keep the "Hoofdgemeente", "Energie" and "SLP" columns in the data set. The "SLP" column contains additional categories about the type of electricity or gas consumption (e.g. household vs non-household consumption).

Use pd.melt() to create a long or tidy version of the dataset, and call the result df_tidy.

df_tidy = pd.melt(df, id_vars=["Hoofdgemeente", "Energie", "SLP"], var_name="time", value_name="consumption")
df_tidy
#Convert the "time" column to a column with a datetime data type using pd.to_datetime.
df_tidy["time"] = pd.to_datetime(df_tidy["time"], format="%Y%m")
df_tidy.head()

Calculate the total consumption of electricity and gas over all municipalities ("Hoofdgemeente") for each month.

Assign the result to a dataframe called df_overall.

Using df_overall, make a line plot of the consumption of electricity vs gas over time.

Create a separate subplot for electricity and for gas, putting them next to each other.

Ensure that the y-limit starts at 0 for both subplots.

df_overall = df_tidy.groupby(["time", "Energie"])[["consumption"]].sum() # or with .reset_index()
df_overall.head()
 
facet = sns.relplot(x="time", y="consumption", col="Energie",
                    data=df_overall, kind="line")
facet.set(ylim=(0, None)) # (min, max)

Pivot table

  • .pivot : no recalculations, only rearrange
  • .pivot_table : recalculations = common way to summarize the data
df = pd.read_csv("data/titanic.csv")
df.pivot_table(index='Sex', columns='Pclass', values='Fare')
 
df.pivot_table(columns="Pclass", index="Sex", values="Fare")
df.pivot_table(columns="Pclass", index="Sex", values="Fare", aggfunc="mean") # mean = default
df.pivot_table(columns="Pclass", index="Sex", values="Fare", aggfunc="min") # or max, std, count, etc.
# function crosstab = pivot_table with aggfunc="count"

For columns and index, you can pass lists [] to use more than one variable.
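A sketch with toy data standing in for the Titanic columns: passing a list to index= groups the rows on two variables at once, producing MultiIndex rows:

```python
import pandas as pd

# toy stand-in for the Titanic data
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Pclass": [1, 2, 1, 2, 1, 2],
    "Underaged": [True, False, False, True, False, False],
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# two row variables -> MultiIndex rows, one column per Pclass
table = df.pivot_table(index=["Sex", "Underaged"], columns="Pclass",
                       values="Fare", aggfunc="mean")
```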

Exercises

# Make a pivot table with the survival rates for Pclass vs Sex.
df = pd.read_csv("data/titanic.csv")
df.pivot_table(index='Pclass', columns='Sex', 
               values='Survived', aggfunc='mean')
# extra : graph
fig, ax1 = plt.subplots()
(df.pivot_table(index='Pclass', columns='Sex', 
               values='Survived', aggfunc='mean')
   .plot.bar(rot=0, ax=ax1)
)
ax1.set_ylabel('Survival ratio')
 
# Make a table of the median Fare paid by underaged vs adult passengers and by Sex.
df['Underaged'] = df['Age'] <= 18 # boolean variable
(df.pivot_table(
    index='Underaged', columns='Sex', values='Fare', aggfunc='median')
    .rename(index={False: 'adult', True: 'underaged'})
)
 

A pivot table aggregates values for each combination of the new row index and column values. That is reminiscent of the "groupby" operation.

Can you mimick the pivot table of the first exercise (a pivot table with the survival rates for Pclass vs Sex) using groupby()?

df_survival = df.groupby(["Pclass", "Sex"])["Survived"].mean().reset_index()
df_survival
df_survival.pivot(index="Pclass", columns="Sex", values="Survived")

Case study 3 : bacterial resistance

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Seaborn custom color palettes: see the documentation "choosing color palettes" and
https://colorbrewer2.org/
3 categories:
  • sequential: a continuous variable that increases
  • diverging: e.g. temperatures (central values, then other values on both sides, + and -)
  • qualitative: categories

sns.catplot(..., palette="colorblind")
# selecting a pre-built palette globally does not always work; better passed as a parameter:
sns.color_palette("icefire", as_cmap=True)

Matplotlib: visualisation with Python

  • very adaptable (e.g. plot within a plot)
  • difficult to master
  • output for a paper/report
  • add a line at some specific threshold to a figure generated with Seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
ax = plt.axes() #x- and y-axis drawn
plt.subplots()
#output = 2 objects:
fig, ax = plt.subplots()
 
#you can then work on these 2 objects
ax.set_title("title 1")
 
fig, ax = plt.subplots(nrows=1, ncols=2)
print(ax) # actually axes
 
ax[0].set_title("LEFT")
ax[1].set_title("RIGHT")
 
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2)
 
ax0.set_title("LEFT")
ax1.set_title("RIGHT")
 
fig.savefig() #save the figure in file
 
plt.plot(x, y, "-")
plt.title("title 2")
# the plt.* interface is not object-oriented; better to avoid it -> assign to objects instead
fig, ax = plt.subplots()
ax.plot(x, y, "-")
ax.set_title("title2")
# gives the same result
 
# avoid 1e7 notation on the y-axis
from matplotlib import ticker
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter("%.1f"))

Reference graph with objects that can be customised :

  • major tick
  • minor tick
  • line
  • grid
  • title
  • legend
  • spine
  • x-label
  • etc.

different styles

plt.style.available
 
#define at the script level
plt.style.use("tableau-colorblind10")
 
#set up several plots
axd = plt.figure(constrained_layout=True).subplot_mosaic(
    """
    ABD
    CCD
    """
)
axd
axd["A"].set_title("A it is")

Graphs generated with pandas are actually generated by matplotlib → assign to an object that you can then customize with matplotlib:

my_output = flowdata.plot.line()
my_output.set_title("My title")
my_output.set_ylabel(r'Discharge $m^3/s$')

In pandas, you can pass an axis created with matplotlib as a parameter:

fig, (ax0, ax1) = plt.subplots(1, 2)
flowdata.plot(ax=ax0)
histplot = sns.displot(data=tidy_experiment, x="optical_density", color="grey", edgecolor="white")
histplot.axes.shape # axes = a 2D array
histplot.axes[0, 0].axvline(0.4, color="blue") # vertical blue line at x=0.4
histplot.axes[0, 0].axhline(600, color="red") # horizontal red line at y=600
# cf. arrays:
my_array = np.array([[1, 2], [3, 4]])
my_array[1, 1]
 
#seaborn returns an array, even if only one figure :
my_array = np.array([[1]])
my_array[0, 0]

Data cleaning in Pandas

  • drop, rename columns
  • replace, explode values

https://excalidraw.com/ : draw figures online

Not covered in the course; explore it yourself.

Missing values in Pandas

  • NaN (not a number) = missing value
  • NaN is a float (not an integer)
  • NaN = np.nan
import numpy as np
import pandas as pd
 
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [4, np.nan, np.nan],
                   'C': [7, 8, 9]})
df
df.dtypes # columns without NaN = int, those with NaN = float

By default, .mean() skips NaN (missing values). This can be avoided:

df["A"].mean(skipna=False) # if there are NaN, the mean will also be NaN
 
df["A"] + 10 # add 10 to each number from the column - N.B.: NaN+10 = NaN
 
df["A"] > 1 # comparing with NaN always returns False : when selecting, NaN values will be excluded (always False)
df["A"].isna()
df[df["A"].isna()] # select missing values
df[df["A"].notna()] # select non missing values
 
# count missing values in a column
df["A"].isna().sum()
 
df.dropna() # drop rows where at least one NaN
df.dropna(subset="A") # drop rows where NaN in column "A"
df.dropna(subset=["A", "B"], how="all")
 
df.fillna(0) # replace NaN values by 0
df.fillna(df.median()) # replace NaN values by the median of the non-missing values of each column
 
pd.Series.interpolate(method='linear') # to fill in the NaN values
 
df.ffill() #forward fill: propagate the last non-missing value forward
df.bfill() #backward fill
df["B"].interpolate(method="nearest") # nearest value
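The filling strategies above, side by side on a tiny series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = s.fillna(0)                      # replace NaN by a constant
forward = s.ffill()                       # propagate the last known value
linear = s.interpolate(method="linear")   # evenly spaced values in between
```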

Conclusion

  • some extra chapters not seen
    • Pandas : combining data (concatenate, join, merge several dataframes)
  • some extra case studies to discover (case 4: missing values; case 2: fishing = groupby, duplicates…)