6. Pandas
Let’s start by creating a simple Pandas DataFrame. One way to do this is to use a dict of NumPy arrays:
[1]:
import numpy as np
import pandas as pd
data_dict = {
    "col1": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
}
dataf = pd.DataFrame(data_dict)
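Before going further, it can help to check what was just built. The sketch below rebuilds the same DataFrame and inspects its shape, column names and column types; the variable names `shape`, `column_names` and `dtypes` are only illustrative.

```python
import numpy as np
import pandas as pd

data_dict = {
    "col1": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
}
dataf = pd.DataFrame(data_dict)

shape = dataf.shape                  # (number of rows, number of columns)
column_names = list(dataf.columns)   # column labels, in order
dtypes = dataf.dtypes                # one dtype per column

print(shape)
print(column_names)
print(dtypes)
```

Each key of the dict becomes a column, and each array supplies that column's values, so all arrays must have the same length.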
Here’s a way to select one or more columns:
[2]:
# Select a single column (both forms return the same Series)
dataf["col1"]
dataf.col1
# Select multiple columns by passing a list of names
dataf[["col1", "col2"]]
# Rename the columns of dataf
dataf.columns = ['new_name', 'col2', 'col3']
# Now let's use the new column name
dataf["new_name"]
[2]:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
21 31
22 32
23 33
24 34
Name: new_name, dtype: int64
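Assigning to `dataf.columns` replaces every column name at once. If you only want to change some of them, the `rename` method is an alternative. The sketch below rebuilds the original DataFrame and assumes we only want to rename `col1`; the variable name `renamed` is illustrative.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({
    "col1": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
})

# Map old names to new names; columns not listed keep their names.
# rename returns a new DataFrame and leaves dataf untouched.
renamed = dataf.rename(columns={"col1": "new_name"})

print(list(renamed.columns))
print(list(dataf.columns))
```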
Using iloc, you can select rows and columns by position, in much the same way as you index 2-d NumPy arrays:
[3]:
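dataf.iloc[0:10, 0:1]    # first 10 rows, first column (as a DataFrame)
dataf.iloc[0:10, 1]      # first 10 rows, second column (as a Series)
dataf.iloc[0:10, :]      # first 10 rows, all columns
dataf.iloc[0:10]         # same as the line above
dataf.iloc[4:10, 1:3]    # rows 4 to 9, columns 1 and 2
dataf.iloc[4:10:2, 1:3]  # rows 4, 6 and 8, columns 1 and 2
dataf.iloc[:, 1:3]       # all rows, columns 1 and 2
# Equivalent to dataf.iloc[:5]
dataf.head()
# Equivalent to dataf.iloc[-5:]
dataf.tail()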
[3]:
| | new_name | col2 | col3 |
|---|---|---|---|
| 20 | 30 | 995 | 0.236813 |
| 21 | 31 | 996 | 0.170865 |
| 22 | 32 | 997 | 0.689839 |
| 23 | 33 | 998 | 0.832092 |
| 24 | 34 | 999 | 0.149977 |
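One detail of the iloc examples is easy to miss: selecting a column with a slice (`0:1`) gives a DataFrame, while selecting it with a single integer (`1`) gives a Series. The sketch below rebuilds a DataFrame with the same column names and checks this; the names `as_frame`, `as_series` and `stepped` are illustrative.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({
    "new_name": np.arange(10, 35),
    "col2": np.arange(975, 1000),
    "col3": np.random.random(25),
})

as_frame = dataf.iloc[0:10, 0:1]    # column slice -> DataFrame with one column
as_series = dataf.iloc[0:10, 1]     # single integer -> Series
stepped = dataf.iloc[4:10:2, 1:3]   # rows 4, 6, 8; columns at positions 1 and 2

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
print(list(stepped.index))
```

As with NumPy, the end of a positional slice is excluded, so `4:10:2` stops at row 8.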
You can also give names (labels) to your rows using the method set_index:
[4]:
dfindx = "a b c d e f g h i j k l m n o p q r s t u v x w y".split()
dataf.set_index(np.array(dfindx), inplace=True)
Note that inplace=True modifies dataf directly instead of returning a new DataFrame, so the call above is roughly equivalent to:
[5]:
dataf = dataf.set_index(np.array(dfindx))
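To see the difference between the two forms, here is a small sketch on a 5-row DataFrame (a shortened, hypothetical stand-in for dataf). Without inplace=True, set_index returns a new DataFrame and the original keeps its default integer index.

```python
import numpy as np
import pandas as pd

labels = list("abcde")
small = pd.DataFrame({"col1": np.arange(5), "col2": np.arange(5, 10)})

# The labels are passed as a NumPy array: a plain list of strings
# would be interpreted as column names to use as the index.
indexed = small.set_index(np.array(labels))

print(list(small.index))    # original is unchanged
print(list(indexed.index))  # new DataFrame carries the labels
```

Reassigning the result (`small = small.set_index(...)`) then gives the same outcome as the inplace=True call.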
We can then use those names with the method loc:
[6]:
dataf.loc[["a", "q"]]
dataf.loc[["a", "q"], "col2"]
dataf.loc[:, "col2"]
[6]:
a 975
b 976
c 977
d 978
e 979
f 980
g 981
h 982
i 983
j 984
k 985
l 986
m 987
n 988
o 989
p 990
q 991
r 992
s 993
t 994
u 995
v 996
x 997
w 998
y 999
Name: col2, dtype: int64
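loc also accepts slices of labels. Unlike iloc, a label slice includes its end point. The sketch below illustrates this on a 5-row DataFrame (a shortened, hypothetical stand-in for dataf); the names `by_label` and `by_position` are illustrative.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"col2": np.arange(975, 980)}, index=list("abcde"))

by_label = small.loc["b":"d", "col2"]   # label slice: "d" is included
by_position = small.iloc[1:3, 0]        # positional slice: position 3 is excluded

print(list(by_label.index))
print(list(by_position.index))
```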
We can also use iloc and loc to set new values for the cells of the data frame:
[7]:
dataf.iloc[5, 0] = 42.3
dataf.iloc[4:7, 1] = [3.2, 2, 3]
dataf.iloc[1, :2] = np.array([3., 2])
dataf.iloc[1] = 2
dataf.loc[["a", "q"]] = np.random.random((2, 3))
dataf.col2 = np.random.random(25)
Note: for technical reasons that are beyond the scope of this introduction, you should avoid chained assignment, as in the examples below:
[8]:
#ATTENTION: avoid this!!
dataf.col2.iloc[5] = 42.3
dataf.loc[["a", "q"]].iloc[:, :2] = np.random.random((2, 2))
The problem is that dataf.loc[["a", "q"]] may return a copy of part of dataf (instead of a view of it), and therefore you will be changing this copy, not dataf itself. If you try either of the two examples above, Pandas should give you a warning with a link to more details.
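The usual way around this is to select rows and columns in a single loc (or iloc) call, so the assignment acts on the DataFrame itself. Here is a minimal sketch on a 5-row DataFrame (a shortened, hypothetical stand-in for dataf, with float columns so the new values fit the column types):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "new_name": np.arange(5.0),
        "col2": np.arange(5.0, 10.0),
        "col3": np.zeros(5),
    },
    index=list("abcde"),
)

# One indexing step: row labels and column labels together.
new_values = np.array([[0.1, 0.2], [0.3, 0.4]])
small.loc[["a", "d"], ["new_name", "col2"]] = new_values

# Same idea for a single cell.
small.loc["c", "col2"] = 42.3

print(small)
```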
6.1. CSV files
Pandas has a function called read_csv, which you can use to read CSV files from local folders or even directly from a URL:
[9]:
dataf = pd.read_csv(
'https://raw.githubusercontent.com/openmundi/world.csv/master/countries(249)_num3.csv'
)
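read_csv also accepts file-like objects, which is handy for trying it out without a file or a network connection. In the sketch below, the CSV text and its column names (`name`, `code`) are made up for illustration and are not the contents of the file above.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first line is used as the header.
csv_text = "name,code\nAlpha,1\nBeta,2\n"

sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```

By default the first row supplies the column names and the rows get an integer index starting at 0.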