Welcome back!

In the last two workshops, we talked about Python basics, and now we are finally going to use them for data analysis! (Yeah!!!)

Please feel free to refer to material from the last 2 workshops:

Workshop I : Intro to Python

Workshop II : Intro to Data Structures

Getting Started

Workshop Best Practices

Task 0: Installing Python Packages

Data Cleaning / Data Manipulation:

Data Visualization:

Task 1: Understanding N-Dimensional Arrays (ndarray)

Recap: Previous Data Types

  1. Workshop 1

| Basic Data Types | Example | Meaning / Usage |
| --- | --- | --- |
| Integer (int) | 1, 2, 3 | Integer numbers |
| String (str) | "apple" | Text |
| Float (float) | 5.55 | Floating / decimal point numbers |
| Boolean (bool) | True, False | True/False (Boolean) values |

  1. Workshop 2

| Collection Data Types | Example | Properties |
| --- | --- | --- |
| List (list) | ["apple", "orange", 5] | Changeable, Ordered, Duplicates allowed |
| Tuple (tuple) | ("apple", "orange", 5) | Unchangeable, Ordered, Duplicates allowed |
| Set (set) | {"apple", "orange", 5} | Items can be added or removed but not changed in place, Unordered, No duplicates |
| Dictionary (dict) | {"apple": 5, "orange": 10} | Key:Value pairs, Changeable, Ordered by insertion (Python 3.7+), No duplicate keys |

New Data Type: Arrays

Array module (documentation)

    import array as arr
    # first argument: the type code; 'i' means the array holds signed integers
    array_1 = arr.array('i',[3, 6, 9, 12])

NumPy package (documentation | what we'll use today)

    import numpy as np
    # mixed inputs are coerced to a single dtype; here every element becomes a string
    array_2 = np.array(['one', 2, 3, 4])

Use NumPy to Create N-Dimensional Arrays (ndarrays)

NumPy is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

NumPy’s array class is called ndarray, and it is the main object in the NumPy library.

Array Creation & Properties

There are several ways to create arrays.

For example, you can create an array from a regular Python list or tuple using the np.array function. The type of the resulting array is deduced from the type of the elements in the sequences.
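As a small sketch of the point above (the values are made up for illustration), the following creates arrays from a list and from a tuple and prints the dtype NumPy deduces for each:

```python
import numpy as np

# Build arrays from a regular list and from a tuple (illustrative values).
from_list = np.array([1, 2, 3])
from_tuple = np.array((1.5, 2.5, 3.5))

# The dtype is deduced from the elements of the sequence.
print(from_list.dtype)   # an integer dtype (the exact width depends on the platform)
print(from_tuple.dtype)  # float64

# Mixing integers and floats upcasts every element to float.
mixed = np.array([1, 2.5])
print(mixed.dtype)       # float64
```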

Below are some other important attributes of an ndarray object:

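  1. ndarray.ndim: the number of axes (dimensions) of the array
  1. ndarray.shape: a tuple giving the size of the array along each axis
  1. ndarray.size: the total number of elements in the array
  1. ndarray.dtype: the data type of the elements in the array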
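To see these attributes in action, here is a minimal sketch on a small 2-D array. The array name grid and its values are only for illustration:

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns (illustrative values).
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)   # 2 -> number of axes
print(grid.shape)  # (2, 3) -> 2 rows, 3 columns
print(grid.size)   # 6 -> total number of elements
print(grid.dtype)  # an integer dtype
```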


We can only see/perceive up to 3 dimensions, but NumPy ndarrays allow us to deal with data with many more dimensions.

1-D Arrays

Read more about what <U32 means here.

2-D Arrays


PRACTICE: What are the number of axes (ndim), the shape, and the number of elements (size) of e?


Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation.

The function zeros creates an array full of zeros, the function ones creates an array full of ones, and the function empty creates an array whose initial content is random and depends on the state of the memory. By default, the dtype of the created array is float64, but it can be specified via the keyword argument dtype.
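A quick sketch of the three placeholder functions, using an arbitrary 2x3 shape chosen for illustration:

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2x3 array of 0.0, float64 by default
ones = np.ones((2, 3), dtype=int)   # dtype overridden with the dtype keyword
scratch = np.empty((2, 3))          # uninitialized: contents are whatever was in memory

print(zeros.dtype)    # float64
print(ones.dtype)     # an integer dtype
print(scratch.shape)  # (2, 3); do not rely on the values until you fill them in
```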

To create sequences of numbers, NumPy provides the arange function which is analogous to the Python built-in range, but returns an array.
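For instance, a minimal sketch of arange with illustrative start, stop, and step values. As with range, the stop value itself is not included:

```python
import numpy as np

first_five = np.arange(5)       # start defaults to 0; stop (5) is excluded
stepped = np.arange(1, 10, 3)   # start=1, stop=10 (excluded), step=3

print(first_five)  # [0 1 2 3 4]
print(stepped)     # [1 4 7]
```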


PRACTICE: Create an array that starts at 0 and ends at 100 (inclusive), incrementing by 10


Basic Operations

Arithmetic operators on arrays apply elementwise. A new array is created and filled with the result.
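A short sketch of elementwise arithmetic on two small arrays (names and values are illustrative). Each operation returns a new array and leaves the inputs unchanged:

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

difference = a - b   # [ 9 18 27]
product = a * b      # elementwise, not a matrix product: [10 40 90]
squared = b ** 2     # [1 4 9]
above_15 = a > 15    # comparisons are elementwise too: [False  True  True]

print(difference, product, squared, above_15)
print(a)             # the original array is unchanged: [10 20 30]
```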

Task 2: From ndarray in NumPy to Pandas DataFrame

Now that we have learned about NumPy ndarrays, let's pivot to Pandas DataFrames, which build on ndarrays.

In this task, we are going to work with a dataset of cars.

Reading / Writing Dataset

Datasets are commonly stored as CSV files or Excel files.

3 Ways to Load CSV files into Colab: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92

Step 1: Import the library, authenticate, and create the interface to csv files.

Step 2: Import the data from csv file as a Pandas DataFrame object

*When working with your own csv file stored on Google drive, you simply need to get the sharable link and substitute the link.

**Also, rename myfile to the file name of your choice.

New Data Type: Pandas DataFrame

Pandas is a library that builds on NumPy, providing more functionality, especially for data science.

Pandas DataFrames (documentation) are two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data Parameter: ndarray, and more


Change a np.array into pd.DataFrame
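One way this conversion can look is sketched below. The array values and the column names "a" and "b" are made up for illustration:

```python
import numpy as np
import pandas as pd

# A 2-D ndarray with 3 rows and 2 columns (illustrative values).
values = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

# Pass the ndarray as the data and give each column a label.
df = pd.DataFrame(values, columns=["a", "b"])

print(df)
print(df.shape)  # (3, 2), the same shape as the ndarray
```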

Basic Data Manipulation

To refresh your memory, our dataset has been transformed from a csv file into a pd.DataFrame, and its variable name is cars.

Now, let's take a look at the dataset!

Cleaning up Column Names

Data attributes

Carefully check the columns and rows. Understand what each column/row stands for. Each dataframe has several attributes. Exploring some of the attributes can help us quickly get to know the data. Here, the attributes are simply the properties of an object.

Some useful attributes

| Attribute | Explanation |
| --- | --- |
| DataFrame.shape | Returns a tuple representing the dimensionality of the DataFrame |
| DataFrame.columns | Returns the column names of the data |
| DataFrame.dtypes | Returns the data type of each column |

When accessing an attribute of a dataframe, simply replace DataFrame with the name of your own dataframe.

The process is very similar to getting the attributes of NumPy arrays.
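As a sketch, here are the three attributes on a small hypothetical dataframe named toy. Its column names are invented for illustration and are not the columns of the cars dataset:

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    "brand": ["A", "B", "C"],
    "price": [19.9, 25.0, 31.5],
    "year": [2019, 2020, 2021],
})

print(toy.shape)    # (3, 3): 3 rows, 3 columns
print(toy.columns)  # the column labels
print(toy.dtypes)   # one dtype per column
```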

The following table shows the common Pandas dtypes and their corresponding Python data types

| Pandas dtype | Corresponding Python type | Usage |
| --- | --- | --- |
| object | str | Text |
| int64 | int | Integer numbers |
| float64 | float | Floating point numbers |
| bool | bool | True/False (Boolean) values |
| datetime64 | NA | Date and time values |
| category | NA | Finite list of text values |

Table Index and label

Each row and column in the dataframe has its own index and label

Index (Numeral)

(1) Indexing of a row

Each row has its unique index, starting from zero.

(2) Indexing of a column

Each column has its unique index, starting from zero as well.

Label

(1) Label of a row

Each row has its unique label. It is given in the left-most column in bold font. By default, it is the same as the index.

The DataFrame.index attribute returns the labels of the rows

(2) Label of a column

Each column has its unique label. It is given in the first row of the spreadsheet.

The DataFrame.columns attribute returns the labels of the columns
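To make the difference between positions and labels concrete, here is a minimal sketch with two hypothetical dataframes: one with the default row labels and one with custom row labels. All names and values are illustrative:

```python
import pandas as pd

# Default row labels are 0, 1, 2, ... and match the positional index.
default_labels = pd.DataFrame({"price": [10, 20, 30]})

# Custom row labels: positions are still 0, 1, 2, but the labels differ.
custom_labels = pd.DataFrame({"price": [10, 20, 30]}, index=["x", "y", "z"])

print(list(default_labels.index))   # [0, 1, 2]
print(list(custom_labels.index))    # ['x', 'y', 'z']
print(list(custom_labels.columns))  # ['price']
print(custom_labels.iloc[0, 0])     # 10, the value at row position 0, column position 0
```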

Get columns based on labels

We might be interested in keeping only certain columns.

The direct way is to use

DataFrame[["column1","column3",...]]

where "column1", "column3", etc are the labels of the columns you want to select.

Note: If you use single brackets to get a single column, you get a Pandas Series. A Pandas Series is a one-dimensional data structure in Pandas. It is very similar to the lists/arrays we have seen before, except that each element has its own label.
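The difference between single and double brackets can be sketched on a hypothetical dataframe (the column names below are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "brand": ["A", "B", "C"],
    "price": [19.9, 25.0, 31.5],
    "year": [2019, 2020, 2021],
})

one_column = toy["price"]          # single brackets -> Series
as_frame = toy[["price"]]          # double brackets -> DataFrame with one column
subset = toy[["brand", "year"]]    # double brackets with several labels

print(type(one_column).__name__)   # Series
print(type(as_frame).__name__)     # DataFrame
print(list(subset.columns))        # ['brand', 'year']
```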


PRACTICE: Get only the "Brands", "Size", and "Rating" information


Get columns based on index

Remember our old friend Index?

You can also select certain columns in the dataframe by position, using indices and the .iloc indexer

.iloc is used with square brackets and takes up to 2 positions: [rows, columns]

    # all rows, from columns 1 to 3 (the end of the slice, 4, is excluded)
    selected_df = DataFrame.iloc[:, 1:4]
    # all rows, only columns 1 and 3
    selected_df = DataFrame.iloc[:, [1, 3]]
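A runnable sketch of the same idea on a hypothetical four-column dataframe (column names "a" to "d" are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 4 columns at positions 0..3.
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "d": [10, 11, 12],
})

middle = toy.iloc[:, 1:3]      # all rows, columns at positions 1 and 2
picked = toy.iloc[:, [0, 2]]   # all rows, columns at positions 0 and 2
top_rows = toy.iloc[0:2, :]    # first two rows, all columns

print(list(middle.columns))    # ['b', 'c']
print(list(picked.columns))    # ['a', 'c']
print(top_rows.shape)          # (2, 4)
```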

PRACTICE: Get a dataframe that includes the first column and the last column


Data Cleaning

Data cleaning can be a tedious task, and many business analysts spend much of their time preparing data. A commonly cited rule of thumb (the 80/20 rule) is that an analyst spends around 80% of the time on data cleaning before even moving on to analytics!

Some of the important tasks related to data cleaning include

Pandas provides many powerful methods to help us perform data cleaning very efficiently.

For a quick reference, click here for a nice cheatsheet.

Let's illustrate some of the basic tasks using this sample data.

When the data is loaded into a Pandas DataFrame, missing values are represented as NumPy NaN values.

Missing values

One thing we immediately notice is that there are missing values in the dataset.

Pandas provides many methods we can use to work with missing values. A good tutorial can be found here.

Some commonly used methods are listed below.

| Method | Explanation |
| --- | --- |
| DataFrame.isna() | Returns a Boolean value for each cell indicating whether the cell is a missing value (True) or not (False) |
| DataFrame.fillna() | Fills in the missing values with a given value, for example a constant or the column mean or median; forward and backward fill are available through DataFrame.ffill() and DataFrame.bfill() |
| DataFrame.interpolate() | Fills in the missing values by interpolation (linear by default) |
| DataFrame.dropna() | Drops the rows (or columns) that contain missing values |
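Here is a minimal sketch of these methods on a small hypothetical dataframe with two missing values. The column names and numbers are invented for illustration and are not taken from the workshop's sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "rating": [4.0, 5.0, np.nan],
})

mask = df.isna()                     # True where a value is missing
filled_zero = df.fillna(0)           # replace missing values with a constant
filled_mean = df.fillna(df.mean())   # replace with each column's mean
interpolated = df.interpolate()      # linear interpolation between neighbors
dropped = df.dropna()                # keep only rows without missing values

print(mask)
print(filled_mean)
print(interpolated["price"].tolist())  # [10.0, 20.0, 30.0]
print(len(dropped))                    # 1
```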

Let's give DataFrame.isna() a shot

How do we proceed?

Maybe we are interested in how many missing values we have. This is a good place to talk about some basic Pandas calculations.

| Function | Explanation |
| --- | --- |
| DataFrame.sum() | Sums all the values column-wise. Add axis=1 for row-wise. |
| DataFrame.cumsum() | Performs a cumulative sum column-wise. Add axis=1 for row-wise. |
| DataFrame.prod() | Multiplies all the values column-wise. Add axis=1 for row-wise. |
| DataFrame.cumprod() | Performs a cumulative multiplication column-wise. Add axis=1 for row-wise. |
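A brief sketch of these calculations on hypothetical data, including one way to count missing values by summing the Boolean result of isna(), since True counts as 1. All names and values are illustrative:

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

print(numbers.sum())         # column-wise: a -> 6, b -> 60
print(numbers.sum(axis=1))   # row-wise: 11, 22, 33
print(numbers.cumsum())      # running totals: a -> 1, 3, 6; b -> 10, 30, 60
print(numbers.prod())        # column-wise product: a -> 6, b -> 6000

# Counting missing values per column: isna() gives True/False, sum() adds them up.
with_gaps = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 30.0]})
missing_per_column = with_gaps.isna().sum()
print(missing_per_column)    # a -> 1, b -> 2
```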

Task 3: Exploratory Data Analysis (EDA)

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

Note: Notice that all the object variables are eliminated from the summary table.
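Assuming the summary table here comes from DataFrame.describe() (an assumption, since the original cell is not shown), the sketch below illustrates the note on a hypothetical dataframe: by default only the numeric columns appear in the result.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
toy = pd.DataFrame({
    "brand": ["A", "B", "C"],
    "price": [10.0, 20.0, 30.0],
})

summary = toy.describe()      # default: numeric columns only
print(summary)
print(list(summary.columns))  # ['price'] -> the text column is left out
```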

Single variable

Let's take a look at the columns one by one first.


PRACTICE: Create a function to print out value_counts() results for all the object variables


Plotting & Intro to pyplot

matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

Sample plots: https://matplotlib.org/stable/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py


PRACTICE: plt.plot also takes NumPy ndarrays as data. Define a NumPy ndarray and plot it!


You may be wondering why the x-axis ranges from 0 to 3 and the y-axis from 1 to 4. If you provide a single list or array to plot, matplotlib assumes it is a sequence of y values, and automatically generates the x values for you. Since Python ranges start with 0, the default x vector has the same length as y but starts with 0. Hence the x data are [0, 1, 2, 3].
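The default x values can be checked directly, as in this small sketch. The Agg backend line is only there so the code runs without a display; in a notebook you would leave it out:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Pass only y values; matplotlib generates the x values itself.
(line,) = plt.plot([1, 2, 3, 4])

print(line.get_xdata())  # the generated x values: 0, 1, 2, 3
print(line.get_ydata())  # the y values we passed in: 1, 2, 3, 4
```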

Histogram

However, a histogram is commonly used to visualize the distribution of a continuous variable.

In this case, since Year is a categorical variable, a bar plot is more appropriate.
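To contrast the two plot types, here is a sketch on hypothetical data. The Year values and the Price column are invented for illustration, and the Agg backend line is only needed when running without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "Year": [2019, 2020, 2020, 2021, 2021, 2021],
    "Price": [10.5, 12.0, 18.3, 21.7, 25.1, 30.9],
})

# Count how many rows fall in each year, sorted by year.
counts = toy["Year"].value_counts().sort_index()

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(toy["Price"], bins=3)                   # continuous variable -> histogram
ax_hist.set_title("Price (histogram)")
ax_bar.bar(counts.index.astype(str), counts.values)  # categorical variable -> bar plot
ax_bar.set_title("Year (bar plot)")
```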

Bar Plot

Pie chart

Colors are customizable as well, and you can specify them by name or by hex value.
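A minimal sketch of both ways to specify colors, with made-up shares and labels. The Agg backend line is only needed when running without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Illustrative values and labels.
shares = [50, 30, 20]
labels = ["A", "B", "C"]

# Colors can be given by name ("gold") or as hex values ("#1f77b4").
colors = ["gold", "#1f77b4", "lightgreen"]

fig, ax = plt.subplots()
ax.pie(shares, labels=labels, colors=colors)
ax.set_title("Pie chart with custom colors")
```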

More on pie chart customization: click here

Subplots

Multiple variables

More interesting visualization websites

Summary about Data Types

  1. Workshop 09-10 : Intro to Python

| Basic Data Types | Example | Meaning / Usage |
| --- | --- | --- |
| Integer (int) | 1, 2, 3 | Integer numbers |
| String (str) | "apple" | Text |
| Float (float) | 5.55 | Floating / decimal point numbers |
| Boolean (bool) | True, False | True/False (Boolean) values |

  1. Workshop 09-17 : Intro to Data Structures

| Collection Data Types | Example | Properties |
| --- | --- | --- |
| List (list) | ["apple", "orange", 5] | Changeable, Ordered, Duplicates allowed |
| Tuple (tuple) | ("apple", "orange", 5) | Unchangeable, Ordered, Duplicates allowed |
| Set (set) | {"apple", "orange", 5} | Items can be added or removed but not changed in place, Unordered, No duplicates |
| Dictionary (dict) | {"apple": 5, "orange": 10} | Key:Value pairs, Changeable, Ordered by insertion (Python 3.7+), No duplicate keys |

  1. Workshop 09-24 : Intro to Data Analysis with Python

| Pandas / NumPy dtype | Corresponding Python type | Usage |
| --- | --- | --- |
| object | str | Text |
| int64 | int | Integer numbers |
| float64 | float | Floating point numbers |
| bool | bool | True/False (Boolean) values |
| datetime64 | NA | Date and time values |
| category (only Pandas) | NA | Finite list of text values |

When to use lists/ NumPy ndarrays/ Pandas DataFrame?


Exercise

Explore more datasets in the drive.


Tasks can include:

If you need help with any material in this notebook, please contact NYU Shanghai Library at shanghai.library@nyu.edu

Ending credits

Tutorial framework:

https://www.w3schools.com/python/default.asp

Images:

https://towardsdatascience.com/python-list-numpy-and-pandas-3a32f1aee948

https://predictivehacks.com/tips-about-numpy-arrays/

Modified and organized by: Pamela Pan, Jie Chen