Data wrangling in Python
https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/
Course taught by Stijn Van Hoey and Joris Vanden Bossche on 25/11 and 02/12/2024 at Ghent University:
Day 0: preparation
After installing Anaconda, check that it is up to date by opening a console:
conda update -n base conda
The environment for the course must be activated: in Anaconda, after installing it:
conda activate FLAMES-python
Download the file that specifies the environment:
wget https://raw.githubusercontent.com/jorisvandenbossche/FLAMES-python-data-wrangling/main/environment.yml
which contains:
name: FLAMES-python
channels:
  - conda-forge
dependencies:
  - python=3.11
  - ipython
  - jupyter
  - jupyterlab>=3
  - numpy
  - pandas>=1
  - matplotlib>3
  - mplleaflet
  - ipympl
  - seaborn
  - plotnine
  - pyproj
  - requests
  - openpyxl
Then, in Anaconda: base (root) > Terminal:
conda install -n base conda-libmamba-solver
conda config --set solver libmamba
conda config --add channels conda-forge
conda config --set channel_priority strict
cd FOLDER_PATH_TO_ENVIRONMENT_FILE
conda env create -f environment.yml
where FOLDER_PATH_TO_ENVIRONMENT_FILE is the folder where environment.yml was downloaded earlier. Then type:
conda activate FLAMES-python
ipython
to open a Python prompt, where you can check the installation by typing:
import pandas
import matplotlib
Day 1
git clone https://github.com/jorisvandenbossche/FLAMES-python-data-wrangling.git
cd FLAMES-python-data-wrangling
conda activate FLAMES-python
python check_environment.py
jupyter lab
Do not close this window. A browser window opens; in the file explorer, select 'Notebooks' — that is where the course material lives.
- to create a new cell in which to write code: click an existing cell, press the "Escape" key and then the "b" key
- to run code: click the cell (to put the cursor in it), then Shift + "Enter"
To get help on an object:
help(object)
To get help on the parameters of a function, press Shift + "Tab" after opening the parenthesis.
Pandas
dataframe : object with rows and columns
Each column in a dataframe is a series.
import pandas as pd

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
List of the variables (columns):
countries.columns
First/last 2 rows:
countries.head(2)
countries.tail(2)
Sorting:
countries.sort_values("population")
Plotting (the 2nd and 3rd lines are equivalent):
countries.plot()
countries["population"].plot(kind="box")
countries["population"].plot.box()
Import data:
pd.read_csv

Export data (a method of the DataFrame object):
countries.to_excel
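As a sketch of the read/write round trip (using an in-memory string instead of a real file path, so the snippet is self-contained):

```python
import io

import pandas as pd

# read a CSV; here from an in-memory string, but a file path works the same way
csv_text = "country,population\nBelgium,11.3\nFrance,64.3\n"
countries = pd.read_csv(io.StringIO(csv_text))

# export: .to_csv() / .to_excel() are methods of the DataFrame object
out = countries.to_csv(index=False)
```

`.to_excel("file.xlsx")` works the same way but needs the openpyxl package (included in the course environment).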
Exercise 1
import pandas as pd

# read the data into a dataframe
df = pd.read_csv("data/titanic.csv")

# first 5 rows
df.head(5)

# number of records
len(df)
df.shape[0]

# select the Age column
df['Age']

# make a boxplot of Age
df['Age'].plot.box()

# sort by age, descending
df.sort_values("Age", ascending=False)
df.sort_values(by="Age", ascending=False)
Exercise 2
# Import pandas as pd and read the Titanic dataset
df = pd.read_csv("data/titanic.csv")

# What is the average age of the passengers?
df['Age'].mean()

# Plot the age distribution of the Titanic passengers
df['Age'].plot.hist()

# What is the survival rate? (the relative number of people that survived)
df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()

# What is the maximum Fare? And the median?
df['Fare'].max()
df['Fare'].median()

# Calculate the 75th percentile (quantile) of the Fare price
help(df.quantile)
df['Fare'].quantile(0.75)
df.quantile(0.75, numeric_only=True)

# Calculate the scaled Fares (scaled relative to its mean),
# and add this as a new column ('Fare_scaled') to the DataFrame
df['Fare_scaled'] = df['Fare'] / df['Fare'].mean()

# Calculate the log of the Fares
import numpy as np
fare_log = np.log(df['Fare'])
print(fare_log)
"Magics": not part of Python itself, but of the notebook environment (they start with '%'):
# list the defined variables
%whos

a = np.arange(1000)
%timeit a**2
Subsetting data
# select columns (double [[]] because we pass a list inside [])
countries[['area', 'population']]

# select rows:
mask = countries['population'] > 50
mask
countries[mask]
countries[0:100:10]  # 1st, 11th, 21st… row
Conditions (return True or False = boolean)
- ==, !=, <, <=, >, >=
- & (and), | (or)
.between() = values between 2 boundaries; .isin() = membership in a list of values
→ boolean mask
to select data:
data_filtered = data[boolean_mask]
countries['country'] == "Belgium"
countries['country'].isin(["Belgium", "France"])

capital = countries['capital']
capital == "Brussels"

s = "Brussels"
s.startswith("A")
s.startswith("B")
capital.str.startswith("B")
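A small illustration of both kinds of selectors on made-up data:

```python
import pandas as pd

age = pd.Series([5, 17, 30, 42, 65])

# .between() tests against two boundaries (inclusive on both sides by default)
in_range = age.between(18, 40)

# .isin() tests membership in a list of values
is_member = age.isin([5, 65])

# a boolean mask selects the matching rows
selected = age[in_range]
```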
Exercise 3
# calculate the mean age of the males
df_males = df[df['Sex'] == 'male']
df_males['Age'].mean()

# number of passengers aged > 70
(df['Age'] > 70).sum()

# select the passengers between 30 and 40 years old
aged30_40 = df[(df['Age'] >= 30) & (df['Age'] <= 40)]

# For a single string name = 'Braund, Mr. Owen Harris', split this string
# (check the split() method of a string) and get the first element of the resulting list
name = 'Braund, Mr. Owen Harris'
name.split(",")[0]

# Convert the solution of the previous exercise to all strings of the Name column at once.
# Split the 'Name' column on the ",", extract the first part (the surname),
# and add this as a new column 'Surname'
df['Surname'] = df['Name'].str.split(",").str.get(0)

# Select all passengers that have a surname starting with 'Williams'
df[df['Surname'].str.startswith('Williams')]

# Select all rows for the passengers with a surname of more than 15 characters
df[df['Surname'].str.len() > 15]
Dates and times in pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

s = pd.Series(['2016-12-09 10:00:00', '2016-12-09 11:00:00', '2016-12-09 12:00:00'])

# string -> datetime
ts = pd.to_datetime(s)

# extract components
ts.dt.year
ts.dt.weekday
ts.dt.hour

# single element: no need for the .dt accessor
ts[0].dayofweek

# to avoid ambiguity, be explicit:
pd.to_datetime("2021-12-01", dayfirst=True)
pd.to_datetime("09/12/2016", format="%d/%m/%Y")
For a detailed overview of how to specify the format string, see the table in the Python documentation: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
# add an amount of time to each element of the series
ts + pd.Timedelta('6 days')

# create a time series
pd.date_range(start="2024-01-01", periods=10, freq="D")
pd.date_range(start="2024-01-01", end="2024-06-01", freq="MS")  # MS = start of the month
Set time as index in a dataframe:
data["Time"] = pd.to_datetime(data['Time'])
data = data.set_index("Time")
data["L06_347"]

# show time as index
data.index
data.index.dayofyear
Select data (a slice)
data[pd.to_datetime("2012-01-05"):pd.to_datetime("2012-01-10")]
data["2012-01-05":"2012-01-10"]  # the same
data["2012-01":"2012-02"]  # from the start of Jan to the end of Feb
data["2012-01":"2012-01"]  # the whole month of January
data["2012":"2012"]  # the whole year 2012
data["2012":]  # until the end
data[:"2012"]  # from the start until the end of 2012
Exercise 4
data = pd.read_csv("data/vmm_flowdata.csv", index_col=0, parse_dates=True)

# select dates from the start of 2012 until the end
data["2012":]

# select all data in January for all different years
data[data.index.month == 1]

# select all data in April, May and June for all different years
data[(data.index.month == 4) | (data.index.month == 5) | (data.index.month == 6)]
data[data.index.month.isin([4, 5, 6])]

# select all 'daytime' data (between 8h and 20h) for all days
data[(data.index.hour > 8) & (data.index.hour < 20)]
Resample
Taking the mean over each period: all values within a day/month/year are aggregated together:
data.resample('D').mean()   # daily mean
data.resample('YE').mean()  # yearly (year-end) mean
data.resample('YE').max()
List of the frequency aliases: http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
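A self-contained resample sketch on synthetic hourly data (the values are made up, just to show the mechanics):

```python
import numpy as np
import pandas as pd

# hourly values over two days
idx = pd.date_range("2024-01-01", periods=48, freq="h")
ts = pd.Series(np.arange(48.0), index=idx)

daily_mean = ts.resample("D").mean()  # one value per calendar day
daily_max = ts.resample("D").max()
```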
Exercise 4 (continued)
# ...
# Plot the monthly mean and median values for the years 2011-2012 for 'L06_347'.
# Use the last day of each month to represent the month.
data[data.index.year.isin([2011, 2012])]['L06_347'].resample("ME").agg(["mean", "median"]).plot()
subset = data['2011':'2012']['L06_347']
subset.resample('ME').agg(['mean', 'median']).plot()

# Plot the monthly minimum and maximum daily average value of the 'LS06_348' column.
# Use the first day of each month to represent the month.
daily_avg = data["LS06_348"].resample("D").mean()
daily_avg.resample("ME").agg(["min", "max"])

# Make a bar plot of the mean of the stations in the year 2013
data["2013":"2013"].mean().plot.barh()
Data processing
Exercise 5: bike 1
# read the file
df = pd.read_csv("data/fietstelpaal-coupure-links-2022-gent.zip", sep=';')

# Convert the 'Ordening' column into a pandas Timestamp series, and assign it to a new
# column named "timestamp". Make sure to read the data as UTC.
df["timestamp"] = pd.to_datetime(df["Ordening"], format="%Y-%m-%dT%H:%M:%S%z", utc=True)

# set the resulting "timestamp" column as the index of the df DataFrame
df = df.set_index("timestamp")

# drop columns
df2022 = df.drop(columns=['Datum', 'Uur5Minuten', 'Ordening', 'Code'])

# rename columns
df2022 = df2022.rename(columns={"Tegenrichting": "direction_centre",
                                "Hoofdrichting": "direction_mariakerke",
                                "Locatie": "location",
                                "Totaal": "total"})

# the same with method chaining
df2022 = (df
          .set_index("timestamp")
          .drop(columns=['Datum', 'Uur5Minuten', 'Ordening', 'Code'])
          .rename(columns={"Tegenrichting": "direction_centre",
                           "Hoofdrichting": "direction_mariakerke",
                           "Locatie": "location",
                           "Totaal": "total"}))
Group by
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

# calculate the mean by key, one key at a time
df["data"][df["key"] == "A"].mean()
df["data"][df["key"] == "B"].mean()
df["data"][df["key"] == "C"].mean()

# faster:
df.groupby("key")["data"].mean()
df.groupby("key")["data"].agg("mean")
Exercise 6
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

df = pd.read_csv("data/titanic.csv")

# Using groupby(), calculate the average age for each sex
df.groupby("Sex")["Age"].mean()

# Calculate the average survival ratio for all passengers
df['Survived'].mean()

# Calculate this survival ratio for all passengers younger than 25
df[df['Age'] < 25]["Survived"].mean()

# What is the difference in the survival ratio between the sexes?
df.groupby("Sex")["Survived"].mean()

# Make a bar plot of the survival ratio for the different classes ('Pclass' column)
df.groupby("Pclass")["Survived"].mean().plot.bar()

# Make a bar plot to visualize the average Fare paid by people depending on their age.
# The Age column is divided into separate classes using the pd.cut() function, as provided below
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0, 90, 10))
df.groupby('AgeClass', observed=False)['Fare'].mean().plot.bar(rot=0)
groupby on several variables:
df.groupby(["Pclass", "Sex"])["Survived"].mean()
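The result of a multi-column groupby is a Series with a MultiIndex; `.unstack()` moves one level into the columns to get a small table. A sketch on a tiny made-up dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Sex": ["female", "male", "female", "male"],
    "Survived": [1, 0, 1, 1],
})

# grouping on several columns returns a Series with a (Pclass, Sex) MultiIndex
rates = df.groupby(["Pclass", "Sex"])["Survived"].mean()

# .unstack() pivots the 'Sex' level into the columns
table = rates.unstack("Sex")
```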
Day 2
Seaborn
- built on matplotlib
- works well with pandas

import seaborn as sns
Then type sns. followed by Tab to list the available functions.
titanic[['Age', 'Fare']].plot()
Fine for a first quick look.
# for each class on the Titanic and each gender, what was the average age?
titanic.groupby(['Pclass', 'Sex'])[['Age']].mean()
type(titanic.groupby(['Pclass', 'Sex'])[['Age']].mean())  # result has a MultiIndex

# .reset_index() = flat DataFrame with the group keys as ordinary columns
age_stat = titanic.groupby(['Pclass', 'Sex'])[['Age']].mean().reset_index()

# how to visualize it?
titanic.groupby(['Pclass', 'Sex'])[['Age']].mean().plot()  # too limited

sns.catplot(data=age_stat, x="Sex", y="Age", col="Pclass", kind="bar")
# with colours for the classes
sns.catplot(data=age_stat, x="Sex", y="Age", hue="Pclass", kind="bar")
# another way of looking at it
sns.catplot(data=age_stat, hue="Sex", y="Age", x="Pclass", kind="bar")
Types of plots (3 figure-level functions: relplot, catplot & displot, each with several kinds via kind=, e.g. kind="bar"):
- relation plot (relplot) : scatterplot, lineplot
- distribution plot (displot) : histplot, kdeplot, ecdfplot, rugplot
- categorical plot (catplot) : stripplot, swarmplot, boxplot, violinplot, pointplot, barplot
sns.relplot
sns.catplot
sns.displot(data=titanic, x='Age', kind="hist")

# col= gives 2 separate plots (hue= colours by survival)
age_fare = sns.relplot(data=titanic, x="Age", y="Fare", hue="Survived", col="Sex")

# to see that more females survived, this is better:
age_fare = sns.relplot(data=titanic, x="Age", y="Fare", col="Survived", hue="Sex")
col= and row= split the figure into several subplots; this works with all three figure-level functions (relplot, catplot, displot), but not with the axes-level functions (scatterplot, boxplot, …).
Exercices
# Make a histogram of the age, split up in two subplots by the Sex of the passengers.
# Put both subplots underneath each other.
# Use the height and aspect arguments of the plot function to adjust the size of the figure.
sns.displot(data=titanic, x="Age", kind="hist", row="Sex", height=4, aspect=2)

# more bins:
sns.displot(data=titanic, x="Age", kind="hist", row="Sex", height=4, aspect=2, bins=60)

# Make a violin plot showing the Age distribution in each of the Pclass categories, comparing by Sex

# figure-level
sns.catplot(data=titanic, x="Pclass", y="Age", hue="Sex", split=True, palette="Set2", kind="violin")
sns.despine(left=True)

# axes-level
sns.violinplot(data=titanic, x="Pclass", y="Age", hue="Sex", split=True, palette="Set2")
sns.despine(left=True)
There are also jointplot and pairplot, which show the distributions along the axes (jointplot) or a grid of pairwise plots (pairplot).
# linear regression plots
sns.lmplot()   # figure-level
sns.regplot()  # axes-level
Exercices
casualties["gender"].unique()  # all values of the gender variable

Create a barplot with the number of victims ("n_victims") for each hour of the day for each category in the gender column. Before plotting, calculate the total number of victims for each hour of the day and each gender with pandas and assign it to the variable victims_gender_hour_of_day. Create a separate subplot for each gender category in a separate row. Make sure to include the NaN values of the "gender" column as a separate subplot, called "unknown", without changing the casualties DataFrame.

victims_gender_hour_of_day = casualties.groupby(
    [casualties["datetime"].dt.hour, "gender"], dropna=False)["n_victims"].sum().reset_index()
victims_gender_hour_of_day.head()

sns.catplot(data=victims_gender_hour_of_day.fillna("unknown"),
            x='datetime', y='n_victims', row='gender',
            kind='bar', height=3, aspect=3)
Tidy data
In a tidy dataset:
- each variable is a column
- each observation is a row
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create a dataset
data = pd.DataFrame({
    'WWTP': ['Destelbergen', 'Landegem', 'Dendermonde', 'Eeklo'],
    'Treatment A': [8.0, 7.5, 8.3, 6.5],
    'Treatment B': [6.3, 5.2, 6.2, 7.2]
})
data

# go from a wide to a long data format
pd.melt(data, id_vars=["WWTP"])
data_long = pd.melt(data, id_vars=["WWTP"], value_name="pH", var_name="Treatment")
data_long

data_long.groupby("Treatment")["pH"].mean()
data_long.groupby("WWTP")["pH"].mean()

sns.catplot(data=data_long, x="WWTP", y="pH", hue="Treatment", kind="bar")
sns.catplot(data=data_long, x="Treatment", y="pH", hue="WWTP", kind="bar")
Exercices
# Read the "verbruiksgegevens-per-maand.xlsx" file (in the "data/" directory) into a DataFrame df.
df = pd.read_excel("data/verbruiksgegevens-per-maand.xlsx")

# Drop the "Regio" column (this column has a constant value "Regio 1" and thus is not that interesting).
df = df.drop(columns=["Regio"])
df
The actual data (consumption numbers) is spread over multiple columns: one column per month. Make a tidy version of this dataset with a single “consumption” column, and an additional “time” column.
Make sure to keep the "Hoofdgemeente", "Energie" and "SLP" columns in the data set. The "SLP" column contains additional categories about the type of electricity or gas consumption (e.g. household vs non-household consumption).
Use pd.melt() to create a long or tidy version of the dataset, and call the result df_tidy.
df_tidy = pd.melt(df, id_vars=["Hoofdgemeente", "Energie", "SLP"],
                  var_name="time", value_name="consumption")
df_tidy

# Convert the "time" column to a column with a datetime data type using pd.to_datetime.
df_tidy["time"] = pd.to_datetime(df_tidy["time"], format="%Y%m")
df_tidy.head()
Calculate the total consumption of electricity and gas over all municipalities ("Hoofdgemeente") for each month.
Assign the result to a dataframe called df_overall.
Using df_overall, make a line plot of the consumption of electricity vs gas over time.
Create a separate subplot for electricity and for gas, putting them next to each other.
Ensure that the y-limit starts at 0 for both subplots.
df_overall = df_tidy.groupby(["time", "Energie"])[["consumption"]].sum()  # or with .reset_index()
df_overall.head()

facet = sns.relplot(x="time", y="consumption", col="Energie", data=df_overall, kind="line")
facet.set(ylim=(0, None))  # (min, max)
Pivot table
- .pivot : no aggregation, only rearranges the data
- .pivot_table : aggregates values = common way to summarize the data
df = pd.read_csv("data/titanic.csv")
df.pivot_table(index='Sex', columns='Pclass', values='Fare')
df.pivot_table(columns="Pclass", index="Sex", values="Fare")
df.pivot_table(columns="Pclass", index="Sex", values="Fare", aggfunc="mean")  # mean is the default
df.pivot_table(columns="Pclass", index="Sex", values="Fare", aggfunc="min")   # or max, std, count, etc.

# the crosstab function = pivot_table with aggfunc="count"
For columns and index, you can pass lists [] to use more than one variable.
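A sketch with a tiny stand-in dataset (the real exercise uses the Titanic data); a list for index= produces a row MultiIndex:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female"],
    "Pclass": [1, 2, 1, 2],
    "Embarked": ["S", "C", "S", "C"],
    "Fare": [50.0, 20.0, 80.0, 30.0],
})

# two row levels (Sex, Pclass), one column level (Embarked)
wide = df.pivot_table(index=["Sex", "Pclass"], columns="Embarked",
                      values="Fare", aggfunc="mean")
```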
Exercises
# Make a pivot table with the survival rates for Pclass vs Sex.
df = pd.read_csv("data/titanic.csv")
df.pivot_table(index='Pclass', columns='Sex', values='Survived', aggfunc='mean')

# extra: plot it
fig, ax1 = plt.subplots()
(df.pivot_table(index='Pclass', columns='Sex', values='Survived', aggfunc='mean')
   .plot.bar(rot=0, ax=ax1)
)
ax1.set_ylabel('Survival ratio')

# Make a table of the median Fare paid by adult/underaged vs Sex.
df['Underaged'] = df['Age'] <= 18  # boolean variable
(df.pivot_table(index='Underaged', columns='Sex', values='Fare', aggfunc='median')
   .rename(index={False: 'adult', True: 'underaged'})
)
A pivot table aggregates values for each combination of the new row index and column values. That is reminiscent of the groupby operation.
Can you mimic the pivot table of the first exercise (a pivot table with the survival rates for Pclass vs Sex) using groupby()?
df_survival = df.groupby(["Pclass", "Sex"])["Survived"].mean().reset_index()
df_survival
df_survival.pivot(index="Pclass", columns="Sex", values="Survived")
Case study 3: bacterial resistance
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Seaborn custom color palettes: see the documentation page "Choosing color palettes"
https://colorbrewer2.org/
3 categories:
* sequential: a continuous variable that increases
* diverging: e.g. temperatures (central values, then other values on both sides, + and -)
* qualitative: categories
sns.catplot(..., palette="colorblind")

# selecting a pre-built palette globally does not always work; better to pass it as a parameter
sns.color_palette("icefire", as_cmap=True)
Matplotlib: visualisation with Python
- very adaptable (e.g. plot within a plot)
- difficult to master
- output for a paper/report
- add a line at some specific threshold to a figure generated with Seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

ax = plt.axes()  # x- and y-axis drawn

plt.subplots()  # output = 2 objects:
fig, ax = plt.subplots()  # you can then work on these 2 objects
ax.set_title("title 1")

fig, ax = plt.subplots(nrows=1, ncols=2)
print(ax)  # actually an array of axes
ax[0].set_title("LEFT")
ax[1].set_title("RIGHT")

fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2)
ax0.set_title("LEFT")
ax1.set_title("RIGHT")

fig.savefig("figure.png")  # save the figure to a file

x = np.linspace(0, 10, 50)
y = np.sin(x)
plt.plot(x, y, "-")
plt.title("title 2")  # not object-oriented, try to avoid -> assign to an object instead

fig, ax = plt.subplots()
ax.plot(x, y, "-")
ax.set_title("title 2")  # gives the same result

# avoid 1e7 notation on the y-axis
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter("%.1f"))
Reference graph with objects that can be customised:
- major tick
- minor tick
- line
- grid
- title
- legend
- spine
- x-label
- etc.
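A short sketch touching several of these objects (titles, labels and values are illustrative; the Agg backend is only there so the snippet runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label="data")   # line

ax.set_title("Anatomy demo")                   # title
ax.set_xlabel("x")                             # x-label
ax.grid(True, which="major")                   # grid
ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())  # minor ticks
ax.legend()                                    # legend
ax.spines["top"].set_visible(False)            # hide the top spine
```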
Different styles:
plt.style.available

# define at the script level
plt.style.use("tableau-colorblind10")

# set up several plots with a mosaic layout
axd = plt.figure(constrained_layout=True).subplot_mosaic(
    """
    ABD
    CCD
    """
)
axd
axd["A"].set_title("A it is")
Graphs generated with pandas are actually generated by matplotlib → assign to an object that you can then customize with matplotlib:
my_output = flowdata.plot.line()
my_output.set_title("My title")
my_output.set_ylabel(r'Discharge $m^3/s$')
In pandas, you can pass an axis created with matplotlib as a parameter:
fig, (ax0, ax1) = plt.subplots(1, 2)
flowdata.plot(ax=ax0)
histplot = sns.displot(data=tidy_experiment, x="optical_density", color="grey", edgecolor="white")
histplot.axes.shape  # axes = a 2D array of subplots
histplot.axes[0, 0].axvline(0.4, color="blue")  # vertical blue line at x=0.4
histplot.axes[0, 0].axhline(600, color="red")   # horizontal red line at y=600

# cf. numpy arrays:
my_array = np.array([[1, 2], [3, 4]])
my_array[1, 1]

# seaborn returns an array of axes, even if there is only one subplot:
my_array = np.array([[1]])
my_array[0, 0]
Data cleaning in Pandas
- drop, rename columns
- replace, explode values
https://excalidraw.com/ : draw figures online
Not covered in the course; explore on your own.
Missing values in Pandas
- NaN (not a number) = missing value
- NaN is a float (not an integer)
- NaN = np.nan
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [4, np.nan, np.nan],
                   'C': [7, 8, 9]})
df
df.dtypes  # columns without NaN = int, those with NaN = float
By default, .mean() skips NaN (missing values). This can be changed:
df["A"].mean(skipna=False)  # if there are NaN, the mean will also be NaN
df["A"] + 10   # add 10 to each number of the column - N.B.: NaN + 10 = NaN
df["A"] > 1    # comparing with NaN always returns False: when selecting, NaN values are excluded

df["A"].isna()
df[df["A"].isna()]   # select rows with missing values
df[df["A"].notna()]  # select rows without missing values

# count missing values in a column
df["A"].isna().sum()

df.dropna()                              # drop rows with at least one NaN
df.dropna(subset="A")                    # drop rows with NaN in column "A"
df.dropna(subset=["A", "B"], how="all")  # drop rows with NaN in both "A" and "B"

df.fillna(0)            # replace NaN values by 0
df.fillna(df.median())  # replace NaN values by the median of the non-missing values of each column

df["B"].interpolate(method='linear')   # fill in the NaN values by interpolation
df.ffill()  # forward fill: propagate the last non-missing value forward
df.bfill()  # backward fill
df["B"].interpolate(method="nearest")  # nearest value
Conclusion
- some extra chapters not covered
- pandas: combining data (concatenating, joining, merging several dataframes)
- some extra case studies to discover (case 4: missing values; case 2: fishing = groupby, duplicates…)
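Not covered in the course, but as a minimal pointer, a concat/merge sketch on made-up frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B"], "x": [1, 2]})
right = pd.DataFrame({"key": ["B", "C"], "y": [3, 4]})

stacked = pd.concat([left, left])                      # stack rows on top of each other
joined = pd.merge(left, right, on="key", how="inner")  # SQL-style join on 'key'
```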
