6. Pandas

Let’s start by creating a simple Pandas DataFrame. One way to do this is to use a dict of NumPy arrays:

[1]:

import numpy as np
import pandas as pd

data_dict = {
"col1": np.arange(10, 35),
"col2": np.arange(975, 1000),
"col3": np.random.random(25),
}

dataf = pd.DataFrame(data_dict)
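As a quick sanity check, the sketch below rebuilds the same DataFrame and inspects what the constructor produced. The variable names shape, columns, and col3_dtype are only illustrative.

```python
import numpy as np
import pandas as pd

data_dict = {
    "col1": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
}
dataf = pd.DataFrame(data_dict)

# Each dict key becomes a column label; all arrays must have the same length.
shape = dataf.shape              # (number of rows, number of columns)
columns = list(dataf.columns)    # column labels, in dict insertion order
col3_dtype = dataf["col3"].dtype # dtype inherited from the NumPy array

print(shape, columns, col3_dtype)
```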

Here’s a way to select one or more columns, and to rename them:

[2]:

# Select a single column (both forms return a Series)
dataf["col1"]
dataf.col1

# Select multiple columns (returns a DataFrame)
dataf[["col1", "col2"]]

# Rename the columns of dataf
dataf.columns = ['new_name', 'col2', 'col3']

# Now let's use the new column name
dataf["new_name"]

[2]:

0     10
1     11
2     12
3     13
4     14
5     15
6     16
7     17
8     18
9     19
10    20
11    21
12    22
13    23
14    24
15    25
16    26
17    27
18    28
19    29
20    30
21    31
22    32
23    33
24    34
Name: new_name, dtype: int64
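The difference between single and double brackets is easy to miss, so here is a minimal sketch on a fresh two-column frame. It also shows rename as an alternative to assigning dataf.columns when you only want to change some of the names. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({"col1": np.arange(10, 35), "col2": np.arange(975, 1000)})

single = dataf["col1"]              # one label -> Series
multiple = dataf[["col1", "col2"]]  # list of labels -> DataFrame
one_col_frame = dataf[["col1"]]     # list with one label -> still a DataFrame

# rename returns a new DataFrame and only touches the listed columns
renamed = dataf.rename(columns={"col1": "new_name"})

print(type(single).__name__, type(multiple).__name__, one_col_frame.shape)
print(list(renamed.columns), list(dataf.columns))
```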

Using the iloc indexer, you can select rows and columns by integer position, much as you would slice a 2-D NumPy array:

[3]:

dataf.iloc[0:10, 0:1]
dataf.iloc[0:10, 1]
dataf.iloc[0:10, :]
dataf.iloc[0:10]
dataf.iloc[4:10, 1:3]
dataf.iloc[4:10:2, 1:3]
dataf.iloc[:, 1:3]

#Equivalent to dataf.iloc[:5]
dataf.head()

#Equivalent to dataf.iloc[-5:]
dataf.tail()

[3]:

	new_name	col2	col3
20	30	995	0.236813
21	31	996	0.170865
22	32	997	0.689839
23	33	998	0.832092
24	34	999	0.149977
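To make the result of a few of these selections concrete, the sketch below rebuilds a frame like the one above and checks the shape and type of each result. Note that a column slice such as 0:1 keeps the result two-dimensional, while a plain integer returns a Series.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({
    "col1": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
})

as_frame = dataf.iloc[0:10, 0:1]   # column slice -> DataFrame with one column
as_series = dataf.iloc[0:10, 1]    # column integer -> Series
stepped = dataf.iloc[4:10:2, 1:3]  # rows 4, 6, 8 and columns 1, 2

# head() and tail() default to five rows
same_head = dataf.head().equals(dataf.iloc[:5])
same_tail = dataf.tail().equals(dataf.iloc[-5:])

print(as_frame.shape, as_series.shape, list(stepped.index))
```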

You can also give names to your rows using the method set_index:

[4]:

dfindx = "a b c d e f g h i j k l m n o p q r s t u v x w y".split()
dataf.set_index(np.array(dfindx), inplace=True)

Note that inplace=True modifies dataf directly instead of returning a new DataFrame, so it is roughly equivalent to:

[5]:

dataf = dataf.set_index(np.array(dfindx))
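The two forms can be compared side by side. This is a small sketch on a three-row frame (the labels array is made up for the example): without inplace, set_index leaves the original untouched and returns a new frame; with inplace=True it returns None and changes the original.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({"col1": np.arange(3), "col2": np.arange(3, 6)})
labels = np.array(["a", "b", "c"])  # illustrative row labels

# Without inplace: a new DataFrame is returned, dataf keeps its old index
reindexed = dataf.set_index(labels)
original_index = list(dataf.index)

# With inplace=True: dataf itself is modified and the call returns None
result = dataf.set_index(labels, inplace=True)

print(original_index, list(reindexed.index), result, list(dataf.index))
```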

We can then use those row labels with the loc indexer:

[6]:

dataf.loc[["a", "q"]]
dataf.loc[["a", "q"], "col2"]
dataf.loc[:, "col2"]

[6]:

a    975
b    976
c    977
d    978
e    979
f    980
g    981
h    982
i    983
j    984
k    985
l    986
m    987
n    988
o    989
p    990
q    991
r    992
s    993
t    994
u    995
v    996
x    997
w    998
y    999
Name: col2, dtype: int64
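One difference from iloc is worth a short sketch: a label slice with loc includes both endpoints, whereas a positional slice with iloc excludes the stop position. The five-row frame below is made up for the example.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({"col2": np.arange(975, 980)}, index=list("abcde"))

by_label = dataf.loc["b":"d", "col2"]  # rows "b", "c" and "d" (stop included)
by_position = dataf.iloc[1:3, 0]       # positions 1 and 2 (stop excluded)

print(by_label.tolist(), by_position.tolist())
```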

We can also use iloc and loc to assign new values to cells of the DataFrame:

[7]:

dataf.iloc[5, 0] = 42.3
dataf.iloc[4:7, 1] = [3.2, 2, 3]
dataf.iloc[1, :2] = np.array([3., 2])
dataf.iloc[1] = 2
dataf.loc[["a", "q"]] = np.random.random((2, 3))
dataf.col2 = np.random.random(25)
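Some of the assignments above write floating-point values into columns that were created from integer arrays. Depending on your Pandas version, that may upcast the column silently, emit a warning, or be rejected. A minimal sketch of one way to sidestep the question, assuming you are happy for the columns to hold floats, is to convert them explicitly first:

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({"col1": np.arange(10, 15), "col2": np.arange(975, 980)})

# Convert the integer columns to float before storing fractional values
dataf = dataf.astype({"col1": float, "col2": float})

dataf.iloc[2, 0] = 42.3           # single cell
dataf.iloc[0:3, 1] = [3.2, 2, 3]  # three cells of the second column

print(dataf.dtypes.astype(str).tolist())
print(dataf)
```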

Note: for technical reasons that are beyond the scope of this introduction, you should avoid chained assignment, as in the examples below:

[8]:

#ATTENTION: avoid this!!
dataf.col2.iloc[5] = 42.3
dataf.loc[["a", "q"]].iloc[:, :2] = np.random.random((2, 2))

The problem is that dataf.loc[["a", "q"]] might return a copy of the data in dataf (instead of a view of it), so you would be changing that copy, not dataf itself. If you try either of the two examples above, Pandas should give you a warning with a link to more details.
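The usual way around this is to do the selection and the assignment in a single loc or iloc call. The sketch below shows one possible rewrite of each of the two examples on a small made-up frame, using float columns so that the assigned values fit the column dtype.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame(
    {
        "col1": np.arange(10.0, 15.0),
        "col2": np.arange(975.0, 980.0),
        "col3": np.zeros(5),
    },
    index=list("abcde"),
)

# Instead of dataf.col2.iloc[3] = 42.3, address row and column together
dataf.loc["d", "col2"] = 42.3

# Instead of dataf.loc[["a", "e"]].iloc[:, :2] = ..., give loc both the
# row labels and the column labels in one call
new_values = np.array([[1.0, 2.0], [3.0, 4.0]])
dataf.loc[["a", "e"], ["col1", "col2"]] = new_values

print(dataf)
```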

6.1. CSV files

Pandas has a function called read_csv, which you can use to read CSV files from a local folder or even directly from a URL:

[9]:

dataf = pd.read_csv(
'https://raw.githubusercontent.com/openmundi/world.csv/master/countries(249)_num3.csv'
)
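If you want to try read_csv without a network connection, it also accepts file-like objects. The sketch below reads a tiny in-memory CSV; the text and its column names (name, code) are made up for the example and are not taken from the file above.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only to illustrate the call
csv_text = "name,code\nAlpha,1\nBeta,2\n"

# read_csv treats the first line as the header by default
small = pd.read_csv(io.StringIO(csv_text))

print(small.shape, list(small.columns))
print(small)
```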