12. Pandas package

Let’s start by creating a simple Pandas DataFrame. One way to do this is from a dict of NumPy arrays:

In [1]:
import numpy as np
import pandas as pd

data_dict = {
    "col1": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
}

dataf = pd.DataFrame(data_dict)
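
A quick way to check what was built is to look at the shape, the column labels, and the column dtypes. The sketch below repeats the construction above so that it runs on its own; the names n_rows, n_cols and column_names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same construction as in the cell above: each dict key becomes a column name.
data_dict = {
    "col1": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
}
dataf = pd.DataFrame(data_dict)

n_rows, n_cols = dataf.shape        # (number of rows, number of columns)
column_names = list(dataf.columns)  # column labels, in dict order
print(n_rows, n_cols, column_names)
print(dataf.dtypes)                 # one dtype per column
```

All arrays in the dict must have the same length, since each one becomes a column of the same table.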

Here’s a way to select one or more columns, and to rename them:

In [2]:
#Select a single column
dataf["col1"]
dataf.col1

#Select multiple columns
dataf[["col1", "col2"]]

#Rename the columns of dataf
dataf.columns = ['new_name', 'col2', 'col3']

#Now let's use the new column name
dataf["new_name"]
Out[2]:
0     10
1     11
2     12
3     13
4     14
5     15
6     16
7     17
8     18
9     19
10    20
11    21
12    22
13    23
14    24
15    25
16    26
17    27
18    28
19    29
20    30
21    31
22    32
23    33
24    34
Name: new_name, dtype: int64
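
The type of the result depends on the key: a single label gives a Series, while a list of labels gives a DataFrame, even if the list has only one element. The sketch below illustrates this on a small hypothetical frame, and also shows the rename method as an alternative to assigning to dataf.columns when only some of the names should change.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, used only for this illustration.
dataf = pd.DataFrame({"col1": np.arange(3), "col2": np.arange(3, 6)})

single = dataf["col1"]    # a single label returns a Series
double = dataf[["col1"]]  # a list of labels returns a DataFrame
print(type(single).__name__, type(double).__name__)

# rename returns a new DataFrame and only touches the columns in the mapping.
renamed = dataf.rename(columns={"col1": "new_name"})
print(list(renamed.columns))
```

Note that rename leaves dataf itself unchanged unless the result is assigned back.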

With the iloc indexer you can select rows and columns by integer position, much as you would index a 2-d NumPy array:

In [3]:
dataf.iloc[0:10, 0:1]
dataf.iloc[0:10, 1]
dataf.iloc[0:10, :]
dataf.iloc[0:10]
dataf.iloc[4:10, 1:3]
dataf.iloc[4:10:2, 1:3]
dataf.iloc[:, 1:3]

#Equivalent to dataf.iloc[:5]
dataf.head()

#Equivalent to dataf.iloc[-5:]
dataf.tail()
Out[3]:
new_name col2 col3
20 30 995 0.236813
21 31 996 0.170865
22 32 997 0.689839
23 33 998 0.832092
24 34 999 0.149977
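
As with NumPy, a slice keeps the dimension and a plain integer drops it. The following sketch, which rebuilds a frame like the one above so that it is self-contained, shows the shapes that some of these expressions return; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({
    "new_name": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
})

as_frame = dataf.iloc[0:10, 0:1]   # slice on the column axis: DataFrame
as_series = dataf.iloc[0:10, 1]    # integer on the column axis: Series
strided = dataf.iloc[4:10:2, 1:3]  # rows 4, 6, 8 and columns 1, 2

print(as_frame.shape, as_series.shape, strided.shape)
```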

You can also give names to your rows using the method set_index:

In [4]:
dfindx = "a b c d e f g h i j k l m n o p q r s t u v x w y".split()
dataf.set_index(np.array(dfindx), inplace=True)

Note that inplace=True modifies dataf directly; the result is the same as reassigning the DataFrame that set_index returns:

In [5]:
dataf = dataf.set_index(np.array(dfindx))
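
To see the difference, the sketch below calls set_index without inplace on a small hypothetical frame (small, labels and relabeled are illustrative names). The original frame keeps its default integer index; only the returned frame carries the new labels.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"col2": np.arange(975, 978)})  # hypothetical 3-row frame
labels = np.array(["a", "b", "c"])

relabeled = small.set_index(labels)  # returns a new DataFrame

print(list(small.index))      # the original index is untouched
print(list(relabeled.index))  # the new frame carries the labels
```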

We can then use those labels with the loc indexer:

In [6]:
dataf.loc[["a", "q"]]
dataf.loc[["a", "q"], "col2"]
dataf.loc[:, "col2"]
Out[6]:
a    975
b    976
c    977
d    978
e    979
f    980
g    981
h    982
i    983
j    984
k    985
l    986
m    987
n    988
o    989
p    990
q    991
r    992
s    993
t    994
u    995
v    996
x    997
w    998
y    999
Name: col2, dtype: int64
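
One difference from iloc is worth keeping in mind: a label slice with loc includes both endpoints, whereas a positional slice with iloc excludes the stop position. A minimal sketch on a hypothetical five-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with row labels "a" to "e".
small = pd.DataFrame({"col2": np.arange(975, 980)}, index=list("abcde"))

by_label = small.loc["b":"d"]  # label slice: includes both "b" and "d"
by_position = small.iloc[1:3]  # positional slice: excludes position 3

print(list(by_label.index))
print(list(by_position.index))
```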

We can also use iloc and loc to set new values for the cells of the data frame:

In [7]:
dataf.iloc[5, 0] = 42.3
dataf.iloc[4:7, 1] = [3.2, 2, 3]
dataf.iloc[1, :2] = np.array([3., 2])
dataf.iloc[1] = 2
dataf.loc[["a", "q"]] = np.random.random((2, 3))
dataf.col2 = np.random.random(25)
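
Some of the assignments above write floating-point values into columns that were created from integer arrays. Depending on the Pandas version, this may silently convert the column to float, emit a warning, or be rejected. One way to avoid relying on that behaviour, sketched here on a hypothetical frame, is to cast the column explicitly before assigning:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"new_name": np.arange(10, 15)})  # integer column

# Cast explicitly so that the column can hold floating-point values.
small["new_name"] = small["new_name"].astype(float)

small.iloc[2, 0] = 42.3  # now the value and the column dtype agree
print(small["new_name"].tolist())
```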

Note: for technical reasons that are beyond the scope of this introduction, you should avoid chained assignment, as in the examples below:

In [8]:
#ATTENTION: avoid this!!
dataf.col2.iloc[5] = 42.3
dataf.loc[["a", "q"]].iloc[:, :2] = np.random.random((2, 2))

The problem is that an expression such as dataf.loc[["a", "q"]] might return a copy of dataf (instead of a view of it), so the assignment changes this copy, not dataf itself. If you try either of the two examples above, Pandas should give you a warning with a link to more details.
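
The usual alternative is to select the rows and the columns in a single loc (or iloc) call, so that the assignment acts on the original frame. A minimal sketch, using a hypothetical three-row frame with the same column names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame of zeros, so that the changes are easy to spot.
small = pd.DataFrame(
    {"new_name": np.zeros(3), "col2": np.zeros(3), "col3": np.zeros(3)},
    index=["a", "b", "q"],
)

# One indexing step: row label and column label together.
small.loc["b", "col2"] = 42.3

# One indexing step: two rows and two columns at once.
small.loc[["a", "q"], ["new_name", "col2"]] = np.ones((2, 2))

print(small)
```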

12.1. CSV files

Pandas has a function called read_csv, which you can use to read CSV files from local folders or even directly from a URL:

In [9]:
dataf = pd.read_csv(
    'https://raw.githubusercontent.com/openmundi/world.csv/master/countries(249)_num3.csv'
)
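
read_csv also accepts file-like objects, which is handy for trying it out without a file or a network connection. The sketch below parses a small in-memory CSV; the text and its column names are made up for illustration and are not the contents of the file above.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line holds the column names.
csv_text = "name,code\nAlpha,1\nBeta,2\n"

sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.dtypes)  # read_csv infers a dtype for each column
```

After loading a real file, dataf.head() is a convenient first look at what was read.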