Data Analysis with Python¶

Pandas, Jupyter, and Friends¶

Andreas Herten, 4 May 2017

»The data analyst's three foundations in Python«

Matplotlib • Pandas • Jupyter Notebook

Matplotlib¶

Standard for plotting with Python
Recently v. 2.0.0 released
→ https://matplotlib.org/index.html

Using the global API¶

Using the MATLAB-like interface
Everything works through plt.…

In [1]:

import matplotlib.pyplot as plt
x = range(10)
y = [i**2 for i in range(10)]

In [3]:

plt.plot(x, y)
plt.show()

Option Showcase¶

In [4]:

import numpy as np
x = np.arange(0, 100, 0.2)
y = np.sin(np.sqrt(x))
plt.plot(x, y, color="green")
plt.ylim([-0.6,1.1])
plt.xlabel("Numbers")
plt.ylabel("$\sin(\sqrt{Numbers})$")
plt.show()

Object API¶

Instead of operation on global objects with plt, rather use Figure and Axis (axes ≈ plots)
Cleaner approach (IMHO)
Used under the hood of global API by leveraging plt.gca().… (get current axis)

In [5]:

x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)

In [7]:

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Use like this')
ax.set_xlabel("Numbers again")

Out[7]:

<matplotlib.text.Text at 0x112c8fb38>

Multiple Plots¶

In [8]:

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
ax1.plot(x, y)
ax1.set_title('Default Plot Style')
ax2.scatter(x, y, marker="D")
ax2.set_title('Scattered (Diamonds)')
fig.suptitle("Two Plots in One!")

Out[8]:

<matplotlib.text.Text at 0x112dddf60>

Pandas¶

Introduction

Introduction¶

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Most important feature: DataFrames and operations with them
→ http://pandas.pydata.org/

In [9]:

import pandas as pd

Creating a DataFrame¶

Using a dictionary as an input

In [10]:

frame = pd.DataFrame({
    "A": 1.2,
    "B": pd.Timestamp('20170503'),
    "C": [(-1)**i * np.sqrt(i) + np.e * (-1)**(i-1) for i in range(5)],
    "D": pd.Categorical(["This", "column", "has", "entries", "entries"]),
    "E": "Same"
})
frame

Out[10]:

	A	B	C	D	E
0	1.2	2017-05-03	-2.718282	This	Same
1	1.2	2017-05-03	1.718282	column	Same
2	1.2	2017-05-03	-1.304068	has	Same
3	1.2	2017-05-03	0.986231	entries	Same
4	1.2	2017-05-03	-0.718282	entries	Same

Also available: .read_csv and .read_excel

Popular Functions on Frames¶

In [11]:

frame.describe()

Out[11]:

	A	C
count	5.0	5.000000
mean	1.2	-0.407224
std	0.0	1.781963
min	1.2	-2.718282
25%	1.2	-1.304068
50%	1.2	-0.718282
75%	1.2	0.986231
max	1.2	1.718282

In [12]:

frame.head(2)

Out[12]:

	A	B	C	D	E
0	1.2	2017-05-03	-2.718282	This	Same
1	1.2	2017-05-03	1.718282	column	Same

Popular Functions on Frames II¶

In [13]:

frame.transpose()

Out[13]:

	0	1	2	3	4
A	1.2	1.2	1.2	1.2	1.2
B	2017-05-03 00:00:00	2017-05-03 00:00:00	2017-05-03 00:00:00	2017-05-03 00:00:00	2017-05-03 00:00:00
C	-2.71828	1.71828	-1.30407	0.986231	-0.718282
D	This	column	has	entries	entries
E	Same	Same	Same	Same	Same

In [14]:

frame.sort_values("C")

Out[14]:

	A	B	C	D	E
0	1.2	2017-05-03	-2.718282	This	Same
2	1.2	2017-05-03	-1.304068	has	Same
4	1.2	2017-05-03	-0.718282	entries	Same
3	1.2	2017-05-03	0.986231	entries	Same
1	1.2	2017-05-03	1.718282	column	Same

Popular Functions on Frames III¶

In [15]:

round(frame,2)
frame.round(2)

Out[15]:

	A	B	C	D	E
0	1.2	2017-05-03	-2.72	This	Same
1	1.2	2017-05-03	1.72	column	Same
2	1.2	2017-05-03	-1.30	has	Same
3	1.2	2017-05-03	0.99	entries	Same
4	1.2	2017-05-03	-0.72	entries	Same

In [16]:

frame.sum()

Out[16]:

A    6.000000
C   -2.036119
dtype: float64

In [17]:

frame.round(2).sum()

Out[17]:

A    6.00
C   -2.03
dtype: float64

Popular Functions on Frames IV¶

In [18]:

print(frame.round(2).to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &    A &          B &     C &        D &     E \\
\midrule
0 &  1.2 & 2017-05-03 & -2.72 &     This &  Same \\
1 &  1.2 & 2017-05-03 &  1.72 &   column &  Same \\
2 &  1.2 & 2017-05-03 & -1.30 &      has &  Same \\
3 &  1.2 & 2017-05-03 &  0.99 &  entries &  Same \\
4 &  1.2 & 2017-05-03 & -0.72 &  entries &  Same \\
\bottomrule
\end{tabular}

Index, Columns¶

In [19]:

frame["NewIdx"] = pd.date_range('20170504', periods=5)
frame.head(3)

Out[19]:

	A	B	C	D	E	NewIdx
0	1.2	2017-05-03	-2.718282	This	Same	2017-05-04
1	1.2	2017-05-03	1.718282	column	Same	2017-05-05
2	1.2	2017-05-03	-1.304068	has	Same	2017-05-06

Index, Columns II¶

In [20]:

frame = frame.set_index("NewIdx")  # Also: inplace=True
frame.head(3)

Out[20]:

	A	B	C	D	E
NewIdx
2017-05-04	1.2	2017-05-03	-2.718282	This	Same
2017-05-05	1.2	2017-05-03	1.718282	column	Same
2017-05-06	1.2	2017-05-03	-1.304068	has	Same

In [21]:

frame.index

Out[21]:

DatetimeIndex(['2017-05-04', '2017-05-05', '2017-05-06', '2017-05-07',
               '2017-05-08'],
              dtype='datetime64[ns]', name='NewIdx', freq=None)

In [22]:

frame.columns

Out[22]:

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

Slicing¶

Select only column "A"

In [23]:

frame["A"]

Out[23]:

NewIdx
2017-05-04    1.2
2017-05-05    1.2
2017-05-06    1.2
2017-05-07    1.2
2017-05-08    1.2
Name: A, dtype: float64

Select columns "A" and "C"

In [24]:

frame[["A", "C"]].sort_values("C")

Out[24]:

	A	C
NewIdx
2017-05-04	1.2	-2.718282
2017-05-06	1.2	-1.304068
2017-05-08	1.2	-0.718282
2017-05-07	1.2	0.986231
2017-05-05	1.2	1.718282

Slicing II¶

In [25]:

frame[1:3]

Out[25]:

	A	B	C	D	E
NewIdx
2017-05-05	1.2	2017-05-03	1.718282	column	Same
2017-05-06	1.2	2017-05-03	-1.304068	has	Same

In [26]:

frame.loc["2017-05-06"]

Out[26]:

A                    1.2
B    2017-05-03 00:00:00
C               -1.30407
D                    has
E                   Same
Name: 2017-05-06 00:00:00, dtype: object

In [27]:

frame.iloc[2]

Out[27]:

A                    1.2
B    2017-05-03 00:00:00
C               -1.30407
D                    has
E                   Same
Name: 2017-05-06 00:00:00, dtype: object

Slicing III¶

In [28]:

frame[frame["C"] > 0]

Out[28]:

	A	B	C	D	E
NewIdx
2017-05-05	1.2	2017-05-03	1.718282	column	Same
2017-05-07	1.2	2017-05-03	0.986231	entries	Same

In [29]:

frame[(frame["C"] > 0) & (frame["D"] == "has")]

Out[29]:

	A	B	C	D	E
NewIdx

Plotting¶

In [30]:

frame[["A", "C"]].head(3)

Out[30]:

	A	C
NewIdx
2017-05-04	1.2	-2.718282
2017-05-05	1.2	1.718282
2017-05-06	1.2	-1.304068

In [31]:

frame[["A", "C"]].plot()

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x114187160>

Plotting II¶

In [32]:

frame[["A", "C"]].plot(
        color=["red", "green"], 
        style=[".--","*"], 
        grid=True, 
        secondary_y=["C"]
    )

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x1141c75c0>

Plotting III¶

In [33]:

frame[["A", "C"]].plot(kind="bar")

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x11433d5f8>

Plotting III (2)¶

In [34]:

frame[["A", "C"]].plot(kind="bar", stacked=True)

Out[34]:

<matplotlib.axes._subplots.AxesSubplot at 0x1143d89b0>

Plotting III (3)¶

In [35]:

frame[["A", "C"]].reset_index().plot(kind="bar", subplots=True, figsize=(6,2))

Out[35]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x1144b3438>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x114593668>], dtype=object)

Further kinds: barh, box, hist, kde (a better histogram!), scatter; more: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
Instead of .plot(kind="bar"), also possible: .plot.bar()

Advanced Plotting¶

Combine Pandas & Matplotlib¶

Combine Pandas and Matplotlib by letting Pandas draw to an axis with ax

In [36]:

fig, ax = plt.subplots()
frame[["A", "C"]].plot(kind="bar", ax=ax)
ax.set_xlabel("Datetime")
ax.set_ylabel("Value")
fig.savefig("barplot.pdf")

Combination II¶

In [38]:

fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, nrows=1, figsize=(12,3))
ax1 = frame["A"].plot.line(ax=ax1)
ax2 = frame["C"].plot.box(ax=ax2)
ax3 = frame["C"].plot.hist(ax=ax3, color="orange")
fig.suptitle("Stupid plots")

Out[38]:

<matplotlib.text.Text at 0x1148029b0>

Seaborn¶

Seaborn is a library for making attractive and informative statistical graphics in Python

Provides plotting interfaces
And sets nice defaults
Also: Colormaps
→ http://seaborn.pydata.org/

In [70]:

import seaborn as sns
sns.set(rc={"figure.figsize": (5, 3)})
frame["C"].plot(marker="s", linestyle="--")

Out[70]:

<matplotlib.axes._subplots.AxesSubplot at 0x117fae240>

Seaborn Color Palette¶

In [71]:

frame["G"] = [(-1)**i * np.sqrt(i) + np.pi * (-1)**(i-1) for i in range(len(frame.index))]
frame["H"] = [(-1)**i * np.sqrt(i) + np.pi * (-1.1)**(i-1) for i in range(len(frame.index))]

In [72]:

with sns.color_palette("hls", 2):
    fig, ax = plt.subplots()
    sns.regplot(x="C", y="G", data=frame, ax=ax)
    sns.regplot(x="C", y="H", data=frame, ax=ax)

Seaborn Color Palette II¶

In [73]:

sns.palplot(sns.color_palette())

In [74]:

sns.palplot(sns.color_palette("hls", 10))

In [75]:

sns.palplot(sns.color_palette("hls", 20))

In [76]:

sns.palplot(sns.color_palette("Paired", 10))

Seaborn Color Palette III / KDE Plot¶

In [77]:

x, y = np.random.multivariate_normal([0, 0], [[1, -.5], [-.5, 1]], size=300).T
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
sns.kdeplot(x, y, cmap=cmap, shade=True);

Seaborn Color Palette IV / Jointplot¶

In [78]:

sns.jointplot(x=x, y=y, kind="reg")

Out[78]:

<seaborn.axisgrid.JointGrid at 0x1188c15f8>

Complex Data¶

Some real data…¶

Some PAPI counters for different number of particles (=program run lengths), compiled with different compilers

In [79]:

dfCounters = pd.read_csv("juron-jube-add_one_to_list.csv")
dfCounters.head(2)

Out[79]:

	modules	compiler	n_particles	hwc	HWC
0	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_INS	32809671
1	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_CYC	21246423

In [80]:

dfCounters = dfCounters.rename(columns={
    "modules": "Modules",
    "compiler": "Compiler",
    "n_particles": "Number of Particles",
    "hwc": "Counter Name",
    "HWC": "Counter Value"
})
dfCounters.head(2)

Out[80]:

	Modules	Compiler	Number of Particles	Counter Name	Counter Value
0	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_INS	32809671
1	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_CYC	21246423

Massaging¶

I want some relative values…

In [81]:

dfCounters["Counter Value (rel.)"] = dfCounters["Counter Value"] / dfCounters["Number of Particles"]
dfCounters.head(2)

Out[81]:

	Modules	Compiler	Number of Particles	Counter Name	Counter Value	Counter Value (rel.)
0	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_INS	32809671	328.09671
1	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_CYC	21246423	212.46423

Some Values¶

Plot relative values of PAPI_TOT_CYC for gfortoran

In [82]:

dfCounters[
    (dfCounters["Compiler"] == "gfortran")
    & 
    (dfCounters["Counter Name"] == "PAPI_TOT_CYC")
]["Counter Value (rel.)"]\
.plot(marker="P")

Out[82]:

<matplotlib.axes._subplots.AxesSubplot at 0x1185e66d8>

More Values¶

Plot same relativ values, but also those of counter PAPI_TOT_INS

In [83]:

dfCounters[
    (dfCounters["Compiler"] == "gfortran")
    & 
    ((dfCounters["Counter Name"] == "PAPI_TOT_CYC") | (dfCounters["Counter Name"] == "PAPI_TOT_INS"))
]["Counter Value (rel.)"]\
.plot(marker="P")

Out[83]:

<matplotlib.axes._subplots.AxesSubplot at 0x1186db4a8>

More Values¶

Plot same relativ values, but also those of counter PAPI_TOT_INS

Nope! Because

In [84]:

dfCounters[
    (dfCounters["Compiler"] == "gfortran")
    & 
    ((dfCounters["Counter Name"] == "PAPI_TOT_CYC") | (dfCounters["Counter Name"] == "PAPI_TOT_INS"))
].head(3)

Out[84]:

	Modules	Compiler	Number of Particles	Counter Name	Counter Value	Counter Value (rel.)
0	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_INS	32809671	328.096710
1	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_CYC	21246423	212.464230
5	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	1000000	PAPI_TOT_INS	328081236	328.081236

Workaround¶

Create a canvas with matplotlib and explicitly draw to it

Care about legend etc.

In [85]:

fig, ax = plt.subplots()
ax = dfCounters[(dfCounters["Compiler"] == "gfortran") & (dfCounters["Counter Name"] == "PAPI_TOT_INS")]["Counter Value (rel.)"].plot(marker="P", ax=ax, label="PAPI_TOT_INS")
ax = dfCounters[(dfCounters["Compiler"] == "gfortran") &  (dfCounters["Counter Name"] == "PAPI_TOT_CYC")]["Counter Value (rel.)"].plot(marker="o", ax=ax, label="PAPI_TOT_CYC")
ax.legend(loc="best", frameon=True, fontsize=15, framealpha=0.5)
ax.set_xlabel("Measurement number")
ax.set_ylabel("Counter Value (rel.)")

Out[85]:

<matplotlib.text.Text at 0x118817c50>

Wouldn't be cool if Pandas could do this for us?

Pivoting!¶

Basically: Combine similar categorial data in a DataFrame

In [86]:

dfCounters.head(2)

Out[86]:

	Modules	Compiler	Number of Particles	Counter Name	Counter Value	Counter Value (rel.)
0	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_INS	32809671	328.09671
1	gcc/5.4.0 openmpi/2.0.2-gcc_5.4.0 PAPI/5.5.0-cuda	gfortran	100000	PAPI_TOT_CYC	21246423	212.46423

Some data massaging: I want to remove Modules column; but to prevent double-entries, I want to rename all mpifort Compiler entries run with module openmpi/1.10.2-pgi_16.10 loaded to PGI+MPI

In [87]:

dfCounters.loc[
    dfCounters["Modules"].str.contains("openmpi/1.10.2-pgi_16.10") 
    & 
    (dfCounters["Compiler"] == "mpifort"),
    "Compiler"
]  = "PGI+MPI"

In [88]:

dfCounters = dfCounters.drop("Modules", axis=1)

In [89]:

dfCounters.head(2)

Out[89]:

	Compiler	Number of Particles	Counter Name	Counter Value	Counter Value (rel.)
0	gfortran	100000	PAPI_TOT_INS	32809671	328.09671
1	gfortran	100000	PAPI_TOT_CYC	21246423	212.46423

Pivoting, Actually¶

index: What should be my new index? If array → hierarchical multi-index
values: What value should be printed in the cells
columns: What should be the new columns? If array → hierarchical

In [90]:

dfPivot = dfCounters.pivot_table(
    index="Number of Particles",
    values="Counter Value (rel.)", 
    columns=["Compiler", "Counter Name"]
)
dfPivot.head(3)

Out[90]:

Compiler	PGI+MPI					gfortran					pgfortran
Counter Name	PAPI_L1_DCM	PAPI_L2_DCM	PAPI_STL_ICY	PAPI_TOT_CYC	PAPI_TOT_INS	PAPI_L1_DCM	PAPI_L2_DCM	PAPI_STL_ICY	PAPI_TOT_CYC	PAPI_TOT_INS	PAPI_L1_DCM	PAPI_L2_DCM	PAPI_STL_ICY	PAPI_TOT_CYC	PAPI_TOT_INS
Number of Particles
100000	3.032350	0.010760	479.309470	747.119030	780.156460	5.305490	0.002150	137.864120	212.464230	328.096710	1.088945	0.006175	232.514715	436.386840	672.144140
1000000	3.039885	0.008920	479.860810	747.309137	780.122863	2.744860	0.001581	137.640959	212.281606	328.081236	5.081417	0.005163	233.430218	436.841017	672.114449
2500000	3.419826	0.008527	479.873831	746.905123	780.120623	6.243841	0.001350	137.736993	214.782047	328.080074	9.485501	0.005273	233.018239	436.255372	672.112393

Pivot and Stack¶

Maybe getting the counters to the index side is more useful?

In [91]:

dfPivot.stack().head(6)

Out[91]:

	Compiler	PGI+MPI	gfortran	pgfortran
Number of Particles	Counter Name
100000	PAPI_L1_DCM	3.032350	5.30549	1.088945
	PAPI_L2_DCM	0.010760	0.00215	0.006175
	PAPI_STL_ICY	479.309470	137.86412	232.514715
	PAPI_TOT_CYC	747.119030	212.46423	436.386840
	PAPI_TOT_INS	780.156460	328.09671	672.144140
1000000	PAPI_L1_DCM	3.039885	2.74486	5.081417

… which is the same as

In [92]:

dfCounters.pivot_table(
    index=["Number of Particles", "Counter Name"],
    values="Counter Value (rel.)", 
    columns="Compiler"
).head(6)

Out[92]:

	Compiler	PGI+MPI	gfortran	pgfortran
Number of Particles	Counter Name
100000	PAPI_L1_DCM	3.032350	5.30549	1.088945
	PAPI_L2_DCM	0.010760	0.00215	0.006175
	PAPI_STL_ICY	479.309470	137.86412	232.514715
	PAPI_TOT_CYC	747.119030	212.46423	436.386840
	PAPI_TOT_INS	780.156460	328.09671	672.144140
1000000	PAPI_L1_DCM	3.039885	2.74486	5.081417

Plotting Pivoted DataFrames¶

In [93]:

dfPivot.plot(kind="bar", figsize=(12,5))

Out[93]:

<matplotlib.axes._subplots.AxesSubplot at 0x1188ff2b0>

Plotting Pivoted DataFrames II¶

In [94]:

dfPivot.stack().plot(kind="bar", figsize=(11,5))

Out[94]:

<matplotlib.axes._subplots.AxesSubplot at 0x118bedd68>

Plotting Pivoted DataFrames III¶

Focus on four counters, plot them next to each other

In [95]:

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(ncols=2, nrows=2, sharex=True, figsize=(12,5))
for (ax, counter) in zip([ax1, ax2, ax3, ax4], ["PAPI_TOT_INS", "PAPI_TOT_CYC", "PAPI_L1_DCM", "PAPI_STL_ICY"]):
    ax = dfPivot.stack().loc[(slice(None), counter),:].plot(kind="bar", ax=ax, legend=False)
    labels = [int(label.get_text().split(",")[0][1:-1]) for label in ax.get_xticklabels()]
    ax.set_title(counter)
    ax.set_xlabel("Number of Particles")
    ax.set_ylabel("Counter Value per Particle")
    ax.set_xticklabels(labels)

Jupyter Notebooks¶

Introduction¶

Use Python in your browser, interactively

Jupyter Notebook

Inline Magic I¶

Inline Matplotlib

Inline Magic II¶

In [96]:

%timeit np.sin(range(1000))

1000 loops, best of 3: 698 µs per loop

In [97]:

%ls .

Pandas-Analysis.ipynb             juron-jube-add_one_to_list.csv
Pandas-Analysis.slides.html       notebook-screenshot--inline1.png
convertNotebookToHtmlSlides.sh*   notebook-screenshot.png
convertNotebookToPdfDocument.sh*  reveal.js/
convertSlidesToPdf.sh*            serveSlidesForPresentation.sh*
custom.css

In [98]:

!pip install something

Collecting something
  Could not find a version that satisfies the requirement something (from versions: )
No matching distribution found for something

In [99]:

%lsmagic

Out[99]:

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

Converting¶

Notebooks are rendered at Github and our Gitlab server
Can be converted to static HTML
Can be converted to PDF, Markdown, reST, Slides

This presentation is one large Jupyter Notebook

Data Analysis with Python¶

Pandas, Jupyter, and Friends¶

Matplotlib¶

Using the global API¶

Option Showcase¶

Object API¶

Multiple Plots¶

Pandas¶

Introduction¶

Creating a DataFrame¶

Popular Functions on Frames¶

Popular Functions on Frames II¶

Popular Functions on Frames III¶

Popular Functions on Frames IV¶

Index, Columns¶

Index, Columns II¶

Slicing¶

Slicing II¶

Slicing III¶

Plotting¶

Plotting II¶

Plotting III¶

Plotting III (2)¶

Plotting III (3)¶

Advanced Plotting¶

Combine Pandas & Matplotlib¶

Combination II¶

Seaborn¶

Seaborn Color Palette¶

Seaborn Color Palette II¶

Seaborn Color Palette III / KDE Plot¶

Seaborn Color Palette IV / Jointplot¶

Complex Data¶

Some real data…¶

Massaging¶

Some Values¶

More Values¶

More Values¶

Workaround¶

Pivoting!¶

Pivoting, Actually¶

Pivot and Stack¶

Plotting Pivoted DataFrames¶

Plotting Pivoted DataFrames II¶

Plotting Pivoted DataFrames III¶

Jupyter Notebooks¶

Introduction¶

Inline Magic I¶

Inline Magic II¶

Converting¶

The End¶