Welcome back!
In the last two workshops, we talked about Python basics, and now we are finally going to use it for data analysis! (Yeah!!!)
Please feel free to refer to material from the last 2 workshops:
Workshop I : Intro to Python
Workshop II : Intro to Data Structures
Workshop Best Practices
Data Cleaning / Data Manipulation:
Data Visualization:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# directly shows graph without plt.show()
Basic Data Types | Example | Meaning/ Usage |
---|---|---|
Integer (int) | 1, 2, 3 | Integer numbers |
String (str) | "apple" | Text |
Float (float) | 5.55 | Floating/ decimal point numbers |
Boolean (bool) | True, False | True/False (Boolean) values |
Collection Data Types | Example | Properties |
---|---|---|
List (list) | ["apple", "orange", 5] | Changeable, Ordered, Duplicate allowed |
Tuple (tuple) | ("apple", "orange", 5) | Unchangeable, Ordered, Duplicate allowed |
Set (set) | {"apple", "orange", 5} | Changeable (items can be added/removed), Unordered, No duplicates |
Dictionary (dict) | {"apple": 5, "orange": 10} | Key:Value pairs, Changeable, Ordered (Python 3.7+), No duplicate keys |
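As a quick refresher on the Changeable/Unchangeable column above, here is a small sketch you can run (the variable names and values are just made up for illustration):
# lists and dictionaries are changeable (mutable)
my_list = ["apple", "orange", 5]
my_list[0] = "banana"            # works: lists can be modified in place
my_dict = {"apple": 5, "orange": 10}
my_dict["apple"] = 7             # works: dictionary values can be updated
my_tuple = ("apple", "orange", 5)
# my_tuple[0] = "banana"         # would raise a TypeError: tuples are unchangeable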
Array module (documentation)
import array as arr
# first arg: the type code; 'i' specifies that the array holds (signed) integers
array_1 = arr.array('i',[3, 6, 9, 12])
NumPy package (documentation | what we'll use today)
import numpy as np
array_2 = np.array(['one', 2, 3, 4])
NumPy is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
NumPy's array class is called ndarray; it is the main object in the NumPy library.
It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.
Again, numpy.array (shorthand: np.array) is not the same as the Standard Python Library class array.array (shorthand: arr.array), which only handles one-dimensional arrays and offers less functionality.
a = np.array(['one', 2, 3, 4])
print(type(a))
<class 'numpy.ndarray'>
There are several ways to create arrays.
For example, you can create an array from a regular Python list or tuple using the np.array function. The type of the resulting array is deduced from the type of the elements in the sequences.
Below are some other important attributes of an ndarray object: ndarray.ndim (the number of axes), ndarray.shape (the dimensions), ndarray.size (the total number of elements), and ndarray.dtype (the data type of the elements).
We can only see/perceive up to 3 dimensions, but NumPy ndarrays allow us to work with data in many more dimensions.
1-D Arrays
# homogeneous
a = np.array([2, 3, 4])
print(a)
print("Number of axes (dimensions) of a: ", a.ndim)
print("Dimensionlity of a (if there is only 1 row, the output will ignore it): ", a.shape)
print("Number of elements in a: ", a.size)
print("Data type in a: ", a.dtype)
[2 3 4] Number of axes (dimensions) of a: 1 Dimensionality of a (a 1-D array has a one-element shape tuple): (3,) Number of elements in a: 3 Data type in a: int64
# mixed
b = np.array([1.2, 3.5, "blue"])
print("Number of axes (dimensions) of b: ", b.ndim)
print("Dimensionlity of b (if there is only 1 row, the output will ignore it): ", b.shape)
print("Number of elements in b: ", b.size)
print("Data types in b: ", b.dtype) # mixed data type in an array
Number of axes (dimensions) of b: 1 Dimensionality of b (a 1-D array has a one-element shape tuple): (3,) Number of elements in b: 3 Data types in b: <U32
Read more about what <U32 (a Unicode string of at most 32 characters) means here.
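If you would rather keep the mixed Python types than have everything cast to strings, one option (just a sketch; b_obj is a made-up name) is to pass dtype = object, the same trick we will use for 2-D arrays below:
# keep the original Python objects instead of casting everything to <U32 strings
b_obj = np.array([1.2, 3.5, "blue"], dtype = object)
print(b_obj.dtype)    # object
print(b_obj[0] * 2)   # 2.4 -- the float stays a float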
2-D Arrays
# homogeneous
c = np.array([(1.5, 2, 3),
(4, 5, 6)])
print(c)
print("Number of axes (dimensions) of c: ", c.ndim)
print("Dimensionlity of c: ", c.shape)
print("Number of elements in c: ", c.size)
print("Data type in c: ", c.dtype)
[[1.5 2. 3. ] [4. 5. 6. ]] Number of axes (dimensions) of c: 2 Dimensionality of c: (2, 3) Number of elements in c: 6 Data type in c: float64
# specify the type to complex number
d = np.array([[1, 2],
[3, 4]],
dtype = complex)
print(d)
print("Number of axes (dimensions) of c: ", d.ndim)
print("Dimensionlity of c (row x col): ", d.shape)
[[1.+0.j 2.+0.j] [3.+0.j 4.+0.j]] Number of axes (dimensions) of d: 2 Dimensionality of d (row x col): (2, 2)
# if you want to create n-D arrays (n>1) with mixed data types
# you must specify the dtype to "object"
e = np.array([["hi",1,2,3],
["bye",4,5,6]],
dtype = object)
PRACTICE: What is the number of axes, dimensionality, and number of elements in e?
# Write your code here!
Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation.
The function zeros creates an array full of zeros, the function ones creates an array full of ones, and the function empty creates an array whose initial content is random and depends on the state of the memory. By default, the dtype of the created array is float64, but it can be specified via the key word argument dtype.
np.zeros((3, 4))
array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]])
np.ones((2, 3, 4), dtype = np.int16)
array([[[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]], [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]], dtype=int16)
np.empty((2, 3))
array([[0.e+000, 0.e+000, 5.e-324], [0.e+000, 0.e+000, 5.e-324]])
To create sequences of numbers, NumPy provides the arange function which is analogous to the Python built-in range, but returns an array.
# np.arange(start, end, increment)
# the end value is not included
np.arange(10, 31, 5)
array([10, 15, 20, 25, 30])
np.arange(0, 2, 0.3) # it accepts float arguments
array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])
PRACTICE: Create an array that starts at 0 and ends at 100 (inclusive), incrementing by 10
# (you should end up with the output below)
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
Arithmetic operators on arrays apply elementwise. A new array is created and filled with the result.
a = np.array([20, 30, 40, 50])
print(a)
b = np.arange(4)
print(b)
c = a - b
print(c)
[20 30 40 50] [0 1 2 3] [20 29 38 47]
# squaring
b = np.arange(4)
print(b)
b**2
[0 1 2 3]
array([0, 1, 4, 9])
# the sine function, applied elementwise
10 * np.sin(a)
array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854])
# run a logical statement through an array
a < 35
array([ True, True, False, False])
Now that we have learned about NumPy ndarrays, let's pivot to Pandas DataFrames, which build on ndarrays.
In this task, we are going to work with a dataset of cars.
Datasets are commonly stored as CSV files or Excel files.
3 Ways to Load CSV files into Colab: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92
Step 1: Import the library, authenticate, and create the interface to csv files.
# Mount your Google Drive in the Colab runtime
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
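Once Drive is mounted, an alternative to the PyDrive steps below is to read the csv straight from the mounted folder with pd.read_csv. The path here is hypothetical, so adjust it to wherever your file actually lives:
# read the csv directly from the mounted Drive (hypothetical path -- change it to your own)
# cars = pd.read_csv('/content/drive/MyDrive/data/cars.csv')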
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
# Connect to Google Cloud SDK
# Tools and libraries for interacting with Google Cloud products and services.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
Step 2: Import the data from csv file as a Pandas DataFrame object
*When working with your own csv file stored on Google Drive, you simply need to get the shareable link and substitute it below.
**Also, replace 'cars.csv' with the file name of your choice.
# The shareable link to our csv file,
# "https://drive.google.com/file/d/1PTzHOzz_ecK0aeyn2ZyCBG-S8IyrqUsr/view?usp=sharing"
id = "1PTzHOzz_ecK0aeyn2ZyCBG-S8IyrqUsr"
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('cars.csv')
cars = pd.read_csv('cars.csv') # --> reads the csv, turns it into Pandas DataFrame
# for full DataFrame
# cars
# for first 5 rows in the DataFrame
cars.head()
YEAR | Make | Model | Size | (kW) | Unnamed: 5 | TYPE | CITY (kWh/100 km) | HWY (kWh/100 km) | COMB (kWh/100 km) | CITY (Le/100 km) | HWY (Le/100 km) | COMB (Le/100 km) | (g/km) | RATING | (km) | TIME (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
1 | 2012 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
2 | 2013 | FORD | FOCUS ELECTRIC | COMPACT | 107 | A1 | B | 19.0 | 21.1 | 20.0 | 2.1 | 2.4 | 2.2 | 0 | NaN | 122 | 4 |
3 | 2013 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
4 | 2013 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
type(cars)
pandas.core.frame.DataFrame
Pandas is a library that builds on NumPy, providing more functionality, especially for data science.
Pandas DataFrames (documentation) are two-dimensional, size-mutable, potentially heterogeneous tabular data.
data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.
index: Index or array-like
Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
columns: Index or array-like
Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
dtype: dtype, default None Data type to force. Only a single dtype is allowed. If None, infer.
copy: bool or None, default None Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2d ndarray input, the default of None behaves like copy=False.
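For example, here is a minimal sketch (with made-up fruit data) that combines the data, columns (via dict keys), and index parameters:
# dict keys become the column labels; index supplies the row labels
fruit_df = pd.DataFrame(
    {"Fruit": ["Apple", "Pear", "Berry"],
     "Cost": [1, 2, 3]},
    index = ["a", "b", "c"])
print(fruit_df)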
array1 = np.array([["Fruit","Cost","Number of Items"],
["Apple",1,5],
["Pear",2,6],
["Berry",3,7]])
print(array1)
[['Fruit' 'Cost' 'Number of Items'] ['Apple' '1' '5'] ['Pear' '2' '6'] ['Berry' '3' '7']]
# make array into table-like dataframe with the pd.DataFrame() constructor
df = pd.DataFrame(array1)
print(df)
0 1 2 0 Fruit Cost Number of Items 1 Apple 1 5 2 Pear 2 6 3 Berry 3 7
# change first row into column name
df.columns = df.iloc[0]
# drop the first row, so it won't appear twice
df = df[1:]
df
Fruit | Cost | Number of Items | |
---|---|---|---|
1 | Apple | 1 | 5 |
2 | Pear | 2 | 6 |
3 | Berry | 3 | 7 |
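Alternatively, you can skip the promote-the-first-row-to-header step above by passing the column names straight to the constructor; a sketch with the same fruit data:
# keep the header out of the data and pass it as the columns argument instead
array2 = np.array([["Apple", 1, 5],
                   ["Pear", 2, 6],
                   ["Berry", 3, 7]])
df2 = pd.DataFrame(array2, columns = ["Fruit", "Cost", "Number of Items"])
print(df2)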
To refresh your memory, our dataset has been transformed from a csv file into a pd.DataFrame, and its variable name is cars.
Now, let's take a look at the dataset!
# .head() lets you look at the first 5 rows of the data by default
# but you can also pass in an argument to change the number of rows you want to view
cars.head()
YEAR | Make | Model | Size | (kW) | Unnamed: 5 | TYPE | CITY (kWh/100 km) | HWY (kWh/100 km) | COMB (kWh/100 km) | CITY (Le/100 km) | HWY (Le/100 km) | COMB (Le/100 km) | (g/km) | RATING | (km) | TIME (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
1 | 2012 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
2 | 2013 | FORD | FOCUS ELECTRIC | COMPACT | 107 | A1 | B | 19.0 | 21.1 | 20.0 | 2.1 | 2.4 | 2.2 | 0 | NaN | 122 | 4 |
3 | 2013 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
4 | 2013 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
Column names are ideally descriptive.
Column names are case sensitive.
# 1. Rename column "Make" into "Brands" for more clarity
cars = cars.rename({"Make":'Brands'}, axis=1)
# axis = 1 refers to the column names, whereas axis = 0 refers to the row indices
# 2. Unify all column names to capitalized words
cars.columns = cars.columns.str.capitalize()
cars.head()
Year | Brands | Model | Size | (kw) | Unnamed: 5 | Type | City (kwh/100 km) | Hwy (kwh/100 km) | Comb (kwh/100 km) | City (le/100 km) | Hwy (le/100 km) | Comb (le/100 km) | (g/km) | Rating | (km) | Time (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
1 | 2012 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
2 | 2013 | FORD | FOCUS ELECTRIC | COMPACT | 107 | A1 | B | 19.0 | 21.1 | 20.0 | 2.1 | 2.4 | 2.2 | 0 | NaN | 122 | 4 |
3 | 2013 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
4 | 2013 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
Carefully check the columns and rows. Understand what each column/row stands for. Each dataframe has several attributes. Exploring some of the attributes can help us quickly get to know the data. Here, the attributes are simply the properties of an object.
Some useful attributes
Attribute | Explanation |
---|---|
DataFrame.shape | Return a tuple representing the dimensionality of the DataFrame. |
DataFrame.columns | returns the column names of the data |
DataFrame.dtypes | returns the data type of each column |
When accessing an attribute of a dataframe, simply replace DataFrame with the name of your own DataFrame.
The process is very similar to getting attributes of NumPy arrays.
cars.columns = cars.columns.str.capitalize()
#Tuple: (Rows, Columns)
cars.shape
(53, 17)
cars.columns
Index(['Year', 'Brands', 'Model', 'Size', '(kw)', 'Unnamed: 5', 'Type', 'City (kwh/100 km)', 'Hwy (kwh/100 km)', 'Comb (kwh/100 km)', 'City (le/100 km)', 'Hwy (le/100 km)', 'Comb (le/100 km)', '(g/km)', 'Rating', '(km)', 'Time (h)'], dtype='object')
cars.dtypes
Year int64 Make object Model object Size object (kw) int64 Unnamed: 5 object Type object City (kwh/100 km) float64 Hwy (kwh/100 km) float64 Comb (kwh/100 km) float64 City (le/100 km) float64 Hwy (le/100 km) float64 Comb (le/100 km) float64 (g/km) int64 Rating float64 (km) int64 Time (h) int64 dtype: object
The following table shows the common Pandas dtypes and their corresponding Python data types more clearly:
Pandas dtype | Corresponding Python type | Usage |
---|---|---|
object | str | Text |
int64 | int | Integer numbers |
float64 | float | Floating point numbers |
bool | bool | True/False (Boolean) values |
datetime64 | NA | Date and time values |
category | NA | Finite list of text values |
int64 indicates that data in the column are integer numbers
float64 indicates that data in the column are floating point numbers
object indicates that data in the column are strings
datetime64 is a date-and-time value, which is not a default Python data type, but can be useful in analytical work
Each row and column in the dataframe has its own index and label
Index (Numeral)
(1) Indexing of a row
Each row has its unique index, starting from zero.
(2) Indexing of a column
Each column has its unique index, starting from zero as well.
Label
(1) Label of a row
Each row has its unique label. It is given in the left-most column in bold font. Usually, it is the same as the index.
The DataFrame.index attribute returns the labels of the rows.
(2) Label of a column
Each column has its unique label. It is given in the first row of the dataframe.
The DataFrame.columns attribute returns the labels of the columns.
#print(cars)
print(cars.index) # index for the rows, numerals
print(cars.columns) # index for the columns, which are labels in text
RangeIndex(start=0, stop=53, step=1) Index(['Year', 'Make', 'Model', 'Size', '(kw)', 'Unnamed: 5', 'Type', 'City (kwh/100 km)', 'Hwy (kwh/100 km)', 'Comb (kwh/100 km)', 'City (le/100 km)', 'Hwy (le/100 km)', 'Comb (le/100 km)', '(g/km)', 'Rating', '(km)', 'Time (h)'], dtype='object')
We might be interested in keeping only certain columns.
The direct way is to use
DataFrame[["column1","column3",...]]
where "column1", "column3", etc are the labels of the columns you want to select.
Note: If you use single brackets to get a column, you will get a Pandas Series. A Pandas Series is a one-dimensional data structure in Pandas, very similar to the lists/arrays we have seen before, except that each element has its own label.
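For example, compare selecting the Year column with single vs. double brackets:
# single brackets --> a one-dimensional Pandas Series
print(type(cars["Year"]))
# double brackets --> a DataFrame with a single column
print(type(cars[["Year"]]))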
# let's get ONLY the 'Year' and 'Brands' information
cars[['Year','Brands']]
Year | Brands | |
---|---|---|
0 | 2012 | MITSUBISHI |
1 | 2012 | NISSAN |
2 | 2013 | FORD |
3 | 2013 | MITSUBISHI |
4 | 2013 | NISSAN |
5 | 2013 | SMART |
6 | 2013 | SMART |
7 | 2013 | TESLA |
8 | 2013 | TESLA |
9 | 2013 | TESLA |
10 | 2013 | TESLA |
11 | 2014 | CHEVROLET |
12 | 2014 | FORD |
13 | 2014 | MITSUBISHI |
14 | 2014 | NISSAN |
15 | 2014 | SMART |
16 | 2014 | SMART |
17 | 2014 | TESLA |
18 | 2014 | TESLA |
19 | 2014 | TESLA |
20 | 2015 | BMW |
21 | 2015 | CHEVROLET |
22 | 2015 | FORD |
23 | 2015 | KIA |
24 | 2015 | MITSUBISHI |
25 | 2015 | NISSAN |
26 | 2015 | SMART |
27 | 2015 | SMART |
28 | 2015 | TESLA |
29 | 2015 | TESLA |
30 | 2015 | TESLA |
31 | 2015 | TESLA |
32 | 2015 | TESLA |
33 | 2015 | TESLA |
34 | 2016 | BMW |
35 | 2016 | CHEVROLET |
36 | 2016 | FORD |
37 | 2016 | KIA |
38 | 2016 | MITSUBISHI |
39 | 2016 | NISSAN |
40 | 2016 | NISSAN |
41 | 2016 | SMART |
42 | 2016 | SMART |
43 | 2016 | TESLA |
44 | 2016 | TESLA |
45 | 2016 | TESLA |
46 | 2016 | TESLA |
47 | 2016 | TESLA |
48 | 2016 | TESLA |
49 | 2016 | TESLA |
50 | 2016 | TESLA |
51 | 2016 | TESLA |
52 | 2016 | TESLA |
PRACTICE: Get only the "Brands", "Size", and "Rating" information
# you should get the output below
Brands | Size | Rating | |
---|---|---|---|
0 | MITSUBISHI | SUBCOMPACT | NaN |
1 | NISSAN | MID-SIZE | NaN |
2 | FORD | COMPACT | NaN |
3 | MITSUBISHI | SUBCOMPACT | NaN |
4 | NISSAN | MID-SIZE | NaN |
5 | SMART | TWO-SEATER | NaN |
6 | SMART | TWO-SEATER | NaN |
7 | TESLA | FULL-SIZE | NaN |
8 | TESLA | FULL-SIZE | NaN |
9 | TESLA | FULL-SIZE | NaN |
10 | TESLA | FULL-SIZE | NaN |
11 | CHEVROLET | SUBCOMPACT | NaN |
12 | FORD | COMPACT | NaN |
13 | MITSUBISHI | SUBCOMPACT | NaN |
14 | NISSAN | MID-SIZE | NaN |
15 | SMART | TWO-SEATER | NaN |
16 | SMART | TWO-SEATER | NaN |
17 | TESLA | FULL-SIZE | NaN |
18 | TESLA | FULL-SIZE | NaN |
19 | TESLA | FULL-SIZE | NaN |
20 | BMW | SUBCOMPACT | NaN |
21 | CHEVROLET | SUBCOMPACT | NaN |
22 | FORD | COMPACT | NaN |
23 | KIA | STATION WAGON - SMALL | NaN |
24 | MITSUBISHI | SUBCOMPACT | NaN |
25 | NISSAN | MID-SIZE | NaN |
26 | SMART | TWO-SEATER | NaN |
27 | SMART | TWO-SEATER | NaN |
28 | TESLA | FULL-SIZE | NaN |
29 | TESLA | FULL-SIZE | NaN |
30 | TESLA | FULL-SIZE | NaN |
31 | TESLA | FULL-SIZE | NaN |
32 | TESLA | FULL-SIZE | NaN |
33 | TESLA | FULL-SIZE | NaN |
34 | BMW | SUBCOMPACT | 10.0 |
35 | CHEVROLET | SUBCOMPACT | 10.0 |
36 | FORD | COMPACT | 10.0 |
37 | KIA | STATION WAGON - SMALL | 10.0 |
38 | MITSUBISHI | SUBCOMPACT | 10.0 |
39 | NISSAN | MID-SIZE | 10.0 |
40 | NISSAN | MID-SIZE | 10.0 |
41 | SMART | TWO-SEATER | 10.0 |
42 | SMART | TWO-SEATER | 10.0 |
43 | TESLA | FULL-SIZE | 10.0 |
44 | TESLA | FULL-SIZE | 10.0 |
45 | TESLA | FULL-SIZE | 10.0 |
46 | TESLA | FULL-SIZE | 10.0 |
47 | TESLA | FULL-SIZE | 10.0 |
48 | TESLA | FULL-SIZE | 10.0 |
49 | TESLA | FULL-SIZE | 10.0 |
50 | TESLA | FULL-SIZE | 10.0 |
51 | TESLA | SUV - STANDARD | 10.0 |
52 | TESLA | SUV - STANDARD | 10.0 |
Remember our old friend Index?
You can also select certain columns (and rows) in the dataframe by position using the .iloc indexer
.iloc is indexed with two position-based selections: [rows, columns]
# all rows, from columns 1 to 3
selected_df = DataFrame.iloc[: , 1:4]
# all rows, only column 1 and 3
selected_df = DataFrame.iloc[: , [1,3]]
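For instance, on our cars DataFrame, a quick sketch that selects the first five rows of columns 1 through 3 by position:
# first 5 rows (positions 0-4), columns at positions 1, 2, and 3
cars.iloc[:5, 1:4]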
PRACTICE: Get a dataframe that includes the first column and the last column
Data cleaning can be a tedious task, and analysts spend a large share of their time preparing data. The often-quoted 80/20 rule says an analyst spends around 80% of the time cleaning data before even moving on to analytics!
Some of the important tasks related to data cleaning include renaming columns, dropping uninformative columns, and handling missing values.
Pandas provides many powerful methods to help us perform data cleaning very efficiently.
For a quick reference, click here for a nice cheatsheet.
Let's illustrate some of the basic tasks using this sample data.
cars.head()
Year | Brands | Model | Size | (kw) | Unnamed: 5 | Type | City (kwh/100 km) | Hwy (kwh/100 km) | Comb (kwh/100 km) | City (le/100 km) | Hwy (le/100 km) | Comb (le/100 km) | (g/km) | Rating | (km) | Time (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
1 | 2012 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
2 | 2013 | FORD | FOCUS ELECTRIC | COMPACT | 107 | A1 | B | 19.0 | 21.1 | 20.0 | 2.1 | 2.4 | 2.2 | 0 | NaN | 122 | 4 |
3 | 2013 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
4 | 2013 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
# .nunique() --> Check the number of unique values in "Unnamed: 5"
cars["Unnamed: 5"].nunique()
# .unique() --> Get the unique value(s) in "Unnamed: 5"
cars["Unnamed: 5"].unique()
array(['A1'], dtype=object)
# .value_counts() --> Count number of rows with each unique value of variable
print(cars["Unnamed: 5"].value_counts())
A1 53 Name: Unnamed: 5, dtype: int64
# See the number of unique values in each column for the whole dataframe
print(cars.nunique())
# when the sample size is small,
# it is very likely that the dataset does not have all the categories of a specific column.
Year 5 Make 8 Model 23 Size 7 (kw) 15 Unnamed: 5 1 Type 1 City (kwh/100 km) 19 Hwy (kwh/100 km) 19 Comb (kwh/100 km) 19 City (le/100 km) 10 Hwy (le/100 km) 6 Comb (le/100 km) 8 (g/km) 1 Rating 1 (km) 19 Time (h) 7 dtype: int64
cars.columns
Index(['Year', 'Make', 'Model', 'Size', '(kw)', 'Unnamed: 5', 'Type', 'City (kwh/100 km)', 'Hwy (kwh/100 km)', 'Comb (kwh/100 km)', 'City (le/100 km)', 'Hwy (le/100 km)', 'Comb (le/100 km)', '(g/km)', 'Rating', '(km)', 'Time (h)'], dtype='object')
# Suppose we decided to drop the column "Unnamed: 5"
# because it only contains one unique value
cars.drop(columns = ["Unnamed: 5"], inplace = True)
# inplace = True means that we directly make changes to the specified object, without creating new objects
cars.columns
Index(['Year', 'Brands', 'Model', 'Size', '(kw)', 'Type', 'City (kwh/100 km)', 'Hwy (kwh/100 km)', 'Comb (kwh/100 km)', 'City (le/100 km)', 'Hwy (le/100 km)', 'Comb (le/100 km)', '(g/km)', 'Rating', '(km)', 'Time (h)'], dtype='object')
Pandas stores missing values in a DataFrame as NumPy NaN values.
# previously, we have 17 columns
# and the shape of our data frame was 53 rows X 17 columns
# use the .shape attribute to see the dimensionality now
cars.shape
(53, 16)
One thing we immediately notice is that there are missing values in the dataset.
Pandas provides many methods we can use to work with missing values. A good tutorial can be found here.
Some commonly used methods are listed below.
method | explanation |
---|---|
DataFrame.isna() | Returns a Boolean for each cell indicating whether the value is missing (True) or not (False) |
DataFrame.fillna() | Fill in missing values with a specified value or method (e.g., forward fill, backward fill, or a computed statistic such as the mean or median) |
DataFrame.interpolate() | Fill in missing values using interpolation methods |
DataFrame.dropna() | Drop missing values |
Let's give DataFrame.isna() a shot
#cars.isna()
#cars.head().isna()
How do we proceed?
Maybe we are interested in understanding how many missing values we have. This is also a good place to talk about some basic Pandas calculations.
Function | Explanation |
---|---|
DataFrame.sum() | Sum all the values column-wise; add axis=1 for row-wise. |
DataFrame.cumsum() | Perform a cumulative sum column-wise; add axis=1 for row-wise. |
DataFrame.prod() | Multiply all the values column-wise; add axis=1 for row-wise. |
DataFrame.cumprod() | Perform a cumulative product column-wise; add axis=1 for row-wise. |
cars.sum()
# not every sum makes sense
Year 106781 Brands MITSUBISHINISSANFORDMITSUBISHINISSANSMARTSMART... Model i-MiEVLEAFFOCUS ELECTRICi-MiEVLEAFFORTWO ELECT... Size SUBCOMPACTMID-SIZECOMPACTSUBCOMPACTMID-SIZETWO... (kw) 10103 Type BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... City (kwh/100 km) 1041.3 Hwy (kwh/100 km) 1146.6 Comb (kwh/100 km) 1088.7 City (le/100 km) 117 Hwy (le/100 km) 128.4 Comb (le/100 km) 122 (g/km) 0 Rating 190 (km) 12676 Time (h) 449 dtype: object
# However, by combining isna with sum, we can get the number of missing values in each column
cars.isna().sum()
Year 0 Brands 0 Model 0 Size 0 (kw) 0 Type 0 City (kwh/100 km) 0 Hwy (kwh/100 km) 0 Comb (kwh/100 km) 0 City (le/100 km) 0 Hwy (le/100 km) 0 Comb (le/100 km) 0 (g/km) 0 Rating 34 (km) 0 Time (h) 0 dtype: int64
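Rating is the only column with missing values (34 of them). What to do with them depends on your analysis; two common options are sketched below (neither is the single right answer, and the fill value 0 is purely illustrative):
# option 1: fill the missing Rating values with a placeholder value
cars_filled = cars.fillna({"Rating": 0})
print(cars_filled["Rating"].isna().sum())   # 0 missing values left
# option 2: drop the rows whose Rating is missing
cars_dropped = cars.dropna(subset = ["Rating"])
print(cars_dropped.shape)                   # fewer rows than the original 53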
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
cars.head()
Year | Make | Model | Size | (kw) | Type | City (kwh/100 km) | Hwy (kwh/100 km) | Comb (kwh/100 km) | City (le/100 km) | Hwy (le/100 km) | Comb (le/100 km) | (g/km) | Rating | (km) | Time (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
1 | 2012 | NISSAN | LEAF | MID-SIZE | 80 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
2 | 2013 | FORD | FOCUS ELECTRIC | COMPACT | 107 | B | 19.0 | 21.1 | 20.0 | 2.1 | 2.4 | 2.2 | 0 | NaN | 122 | 4 |
3 | 2013 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | NaN | 100 | 7 |
4 | 2013 | NISSAN | LEAF | MID-SIZE | 80 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | NaN | 117 | 7 |
# summary statistics (for numerical values)
cars.describe()
Year | (kw) | City (kwh/100 km) | Hwy (kwh/100 km) | Comb (kwh/100 km) | City (le/100 km) | Hwy (le/100 km) | Comb (le/100 km) | (g/km) | Rating | (km) | Time (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 53.000000 | 53.000000 | 53.00000 | 53.000000 | 53.000000 | 53.000000 | 53.000000 | 53.000000 | 53.0 | 19.0 | 53.000000 | 53.000000 |
mean | 2014.735849 | 190.622642 | 19.64717 | 21.633962 | 20.541509 | 2.207547 | 2.422642 | 2.301887 | 0.0 | 10.0 | 239.169811 | 8.471698 |
std | 1.227113 | 155.526429 | 3.00100 | 1.245753 | 1.979455 | 0.344656 | 0.143636 | 0.212576 | 0.0 | 0.0 | 141.426352 | 2.991036 |
min | 2012.000000 | 35.000000 | 15.20000 | 18.800000 | 16.800000 | 1.700000 | 2.100000 | 1.900000 | 0.0 | 10.0 | 100.000000 | 4.000000 |
25% | 2014.000000 | 80.000000 | 17.00000 | 20.800000 | 18.700000 | 1.900000 | 2.300000 | 2.100000 | 0.0 | 10.0 | 117.000000 | 7.000000 |
50% | 2015.000000 | 107.000000 | 19.00000 | 21.700000 | 20.000000 | 2.100000 | 2.400000 | 2.200000 | 0.0 | 10.0 | 135.000000 | 8.000000 |
75% | 2016.000000 | 283.000000 | 22.40000 | 22.500000 | 22.100000 | 2.500000 | 2.500000 | 2.500000 | 0.0 | 10.0 | 402.000000 | 12.000000 |
max | 2016.000000 | 568.000000 | 23.90000 | 23.300000 | 23.600000 | 2.700000 | 2.600000 | 2.600000 | 0.0 | 10.0 | 473.000000 | 12.000000 |
Note: all the object (non-numeric) columns are excluded from the summary table.
Let's take a look at the columns one by one first.
# Year
cars["Year"].value_counts()
2016 19 2015 14 2014 9 2013 9 2012 2 Name: Year, dtype: int64
# Brands
cars["Brands"].value_counts()
TESLA 23 SMART 8 NISSAN 6 MITSUBISHI 5 FORD 4 CHEVROLET 3 BMW 2 KIA 2 Name: Brands, dtype: int64
PRACTICE: Create a function to print out value_counts() results for all the object variables
# you should get the output below
Year Brands Model Size (kw) Type City (kwh/100 km) Hwy (kwh/100 km) Comb (kwh/100 km) City (le/100 km) Hwy (le/100 km) Comb (le/100 km) (g/km) Rating (km) Time (h) 2016 TESLA MODEL X P90D SUV - STANDARD 568 B 23.6 23.3 23.5 2.7 2.6 2.6 0 10.0 402 12 1 SMART FORTWO ELECTRIC DRIVE COUPE TWO-SEATER 35 B 17.2 22.5 19.6 1.9 2.5 2.2 0 10.0 109 8 1 CHEVROLET SPARK EV SUBCOMPACT 104 B 16.0 19.6 17.8 1.8 2.2 2.0 0 10.0 131 7 1 FORD FOCUS ELECTRIC COMPACT 107 B 19.0 21.1 20.0 2.1 2.4 2.2 0 10.0 122 4 1 KIA SOUL EV STATION WAGON - SMALL 81 B 17.5 22.7 19.9 2.0 2.6 2.2 0 10.0 149 4 1 MITSUBISHI i-MiEV SUBCOMPACT 49 B 16.9 21.4 18.7 1.9 2.4 2.1 0 10.0 100 7 1 NISSAN LEAF (24 kWh battery) MID-SIZE 80 B 16.5 20.8 18.4 1.9 2.3 2.1 0 10.0 135 5 1 LEAF (30 kWh battery) MID-SIZE 80 B 17.0 20.7 18.6 1.9 2.3 2.1 0 10.0 172 6 1 SMART FORTWO ELECTRIC DRIVE CABRIOLET TWO-SEATER 35 B 17.2 22.5 19.6 1.9 2.5 2.2 0 10.0 109 8 1 TESLA MODEL S (60 kWh battery) FULL-SIZE 283 B 22.2 21.7 21.9 2.5 2.4 2.5 0 10.0 335 10 1 MODEL X 90D SUV - STANDARD 386 B 23.2 22.2 22.7 2.6 2.5 2.6 0 10.0 414 12 1 MODEL S (70 kWh battery) FULL-SIZE 283 B 23.8 23.2 23.6 2.7 2.6 2.6 0 10.0 377 12 1 MODEL S (85/90 kWh battery) FULL-SIZE 283 B 23.8 23.2 23.6 2.7 2.6 2.6 0 10.0 426 12 1 MODEL S 70D FULL-SIZE 386 B 20.8 20.6 20.7 2.3 2.3 2.3 0 10.0 386 12 1 MODEL S 85D/90D FULL-SIZE 386 B 22.0 19.8 21.0 2.5 2.2 2.4 0 10.0 435 12 1 MODEL S 90D (Refresh) FULL-SIZE 386 B 20.8 19.7 20.3 2.3 2.2 2.3 0 10.0 473 12 1 MODEL S P85D/P90D FULL-SIZE 568 B 23.4 21.5 22.5 2.6 2.4 2.5 0 10.0 407 12 1 MODEL S P90D (Refresh) FULL-SIZE 568 B 22.9 21.0 22.1 2.6 2.4 2.5 0 10.0 435 12 1 BMW i3 SUBCOMPACT 125 B 15.2 18.8 16.8 1.7 2.1 1.9 0 10.0 130 4 1 dtype: int64
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
Sample plots: https://matplotlib.org/stable/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py
list1 = [1,2,3,4]
plt.plot(list1)
plt.ylabel('some numbers')
# because we have written %matplotlib inline
# we don't need to write plt.show() every time
# but you could add it as well
plt.show()
PRACTICE: plt.plot also takes NumPy ndarrays as data. Define a NumPy ndarray and plot it!
You may be wondering why the x-axis ranges from 0 to 3 and the y-axis from 1 to 4. If you provide a single list or array to plot, matplotlib assumes it is a sequence of y values and automatically generates the x values for you. Since Python ranges start with 0, the default x vector has the same length as y but starts with 0. Hence the x data are [0, 1, 2, 3].
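If you want different x values, pass them explicitly as the first argument to plt.plot; for example:
# explicit x and y values
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y)
plt.ylabel('some squares')
plt.show()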
# histogram
plt.hist(cars["Year"])
plt.show()
However, a histogram is most commonly used to visualize the distribution of a continuous variable.
In this case, since Year behaves like a categorical variable, a bar plot is more appropriate.
cars["Year"].value_counts()
# left column is index
# right column is values
2016 19 2015 14 2014 9 2013 9 2012 2 Name: Year, dtype: int64
# using a bar plot for categorical variables
year = cars["Year"].value_counts().index
print(year)
values = cars["Year"].value_counts().values
print(values)
Int64Index([2016, 2015, 2014, 2013, 2012], dtype='int64') [19 14 9 9 2]
# plt.bar takes two arguments: the categories and their counts
plt.bar(year, values)
plt.xlabel("Year")
plt.ylabel("Count")
Text(0, 0.5, 'Count')
fig1 = plt.pie(values)
plt.show()
# customize labels
mylabels = year
# send argument into plt.pie()
plt.pie(values, labels = mylabels)
plt.show()
Colors are customizable as well; you can specify them by name or by hex value.
# be as creative as you want, but the goal is to increase readability
# you can create a list of color names,
mycolors_names = ['lightseagreen', 'mediumpurple', 'orange', 'skyblue', 'khaki']
# or a list of their corresponding hex codes
mycolors_hex = ['#20B2AA','#9370DB','#FFA500','#87CEEB','#F0E68C']
# send argument into plt.pie()
plt.pie(values, labels = mylabels, colors = mycolors_names)
plt.show()
# add legend with plt.legend(title = "")
plt.pie(values, labels = mylabels, colors = mycolors_names)
plt.legend(title = 'Years')
<matplotlib.legend.Legend at 0x7f5e71eafc10>
More on pie chart customization: click here
# Sometimes, you want to compare different plots
# We do that by creating subplots
names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]
plt.figure(figsize = (9, 3))
#plt.figure(figsize = (12, 3))
plt.subplot(1,3,1)
plt.bar(names, values)
plt.subplot(1,3,2)
plt.scatter(names, values)
plt.subplot(1,3,3)
plt.plot(names, values)
plt.suptitle('Categorical Plotting')
Text(0.5, 0.98, 'Categorical Plotting')
# groupby function
df = cars[["Brands", "Size", "(kw)"]].groupby(["Brands", "Size"]).mean()
df
(kw) | ||
---|---|---|
Brands | Size | |
BMW | SUBCOMPACT | 125.000000 |
CHEVROLET | SUBCOMPACT | 104.000000 |
FORD | COMPACT | 107.000000 |
KIA | STATION WAGON - SMALL | 81.000000 |
MITSUBISHI | SUBCOMPACT | 49.000000 |
NISSAN | MID-SIZE | 80.000000 |
SMART | TWO-SEATER | 35.000000 |
TESLA | FULL-SIZE | 332.952381 |
SUV - STANDARD | 477.000000 |
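If you would rather have the grouped result as a flat table (handy for plotting), one sketch is to aggregate and then reset the index; power_by_brand is just an illustrative name:
# mean and max power per brand, flattened back into ordinary columns
power_by_brand = cars.groupby("Brands")["(kw)"].agg(["mean", "max"]).reset_index()
print(power_by_brand)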
Basic Data Types | Example | Meaning/ Usage |
---|---|---|
Integer (int) | 1, 2, 3 | Integer numbers |
String (str) | "apple" | Text |
Float (float) | 5.55 | Floating/ decimal point numbers |
Boolean (bool) | True, False | True/False (Boolean) values |
Collection Data Types | Example | Properties |
---|---|---|
List (list) | ["apple", "orange", 5] | Changeable, Ordered, Duplicate allowed |
Tuple (tuple) | ("apple", "orange", 5) | Unchangeable, Ordered, Duplicate allowed |
Set (set) | {"apple", "orange", 5} | Changeable (items can be added/removed), Unordered, No duplicates |
Dictionary (dict) | {"apple": 5, "orange": 10} | Key:Value pairs, Changeable, Ordered (Python 3.7+), No duplicate keys |
Pandas/ NumPy dtype | Corresponding Python type | Usage |
---|---|---|
object | str | Text |
int64 | int | Integer numbers |
float64 | float | Floating point numbers |
bool | bool | True/False (Boolean) values |
datetime64 (only Pandas) | NA | Date and time values |
category (only Pandas) | NA | Finite list of text values |
When to use lists/ NumPy ndarrays/ Pandas DataFrame?
Explore more datasets in the drive.
Tasks can include:
If you need help with any material in this notebook, please contact NYU Shanghai Library at shanghai.library@nyu.edu
Ending credits
Tutorial framework:
https://www.w3schools.com/python/default.asp
Images:
https://towardsdatascience.com/python-list-numpy-and-pandas-3a32f1aee948
https://predictivehacks.com/tips-about-numpy-arrays/
Modified and organized by: Pamela Pan, Jie Chen