NumPy and Pandas Crash Course for Data Analysts

Introduction

In data analytics, you rarely work with single numbers. You almost always work with a collection of values, which in the simplest case is a collection of numbers or strings. Python provides the list structure to hold a collection. Python lists are heterogeneous, meaning each element can be of a different data type. Although this feature adds flexibility, it comes at a performance cost. If your data collection is of a single type, a Python list consumes more memory and takes more time for simple operations, such as calculating the mean and standard deviation, than the array-based data structures in Python's numerical libraries.
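The cost described above can be observed directly. The following is a minimal sketch, assuming NumPy is installed; the sample data and variable names are made up for illustration, and the exact sizes and timings will differ from machine to machine.

```python
import statistics
import sys
import timeit

import numpy as np

# Hypothetical sample data: the same 100,000 floats stored two ways.
n = 100_000
values_list = [float(i) for i in range(n)]
values_array = np.array(values_list)  # every element stored as float64

# Memory: a list holds references to separate float objects,
# while the array stores the raw 8-byte values contiguously.
list_bytes = sys.getsizeof(values_list) + sum(sys.getsizeof(v) for v in values_list)
array_bytes = values_array.nbytes
print(f"list: {list_bytes:,} bytes, array: {array_bytes:,} bytes")

# Same statistics computed both ways. np.std defaults to the population
# standard deviation, which matches statistics.pstdev.
list_mean, list_std = statistics.fmean(values_list), statistics.pstdev(values_list)
array_mean, array_std = values_array.mean(), values_array.std()

# Timing: results vary by machine, so no expected numbers are shown.
list_time = timeit.timeit(lambda: statistics.pstdev(values_list), number=3)
array_time = timeit.timeit(lambda: values_array.std(), number=3)
print(f"list: {list_time:.4f} s, array: {array_time:.4f} s")
```

On most machines the array should come out both smaller and faster, but treat the printed numbers as indicative rather than as a benchmark.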

When dealing with a large amount of homogeneous data, Python lists are not ideal. Instead, you should use a NumPy array or a Pandas Series. A Pandas Series is also the building block of a Pandas DataFrame, which is used for tabular data. Both NumPy and Pandas are open-source libraries designed for data analysis. In this eBook, you will learn the important aspects of using these libraries. To succeed with this eBook, you should have basic Python programming knowledge. If you do not, please read our Python eBook before starting this one.
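As a first look at how these three structures relate, here is a short sketch, assuming NumPy and Pandas are installed. The weather values and the names temps, temp_series, rain_series, and weather are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
temps = np.array([21.5, 23.0, 19.8, 22.1])
print(temps.dtype)  # float64: all elements share one type

# A Series is a one-dimensional array with a label (index) for each value.
temp_series = pd.Series(temps, index=["Mon", "Tue", "Wed", "Thu"], name="temp_c")
print(temp_series["Tue"])  # 23.0

# A DataFrame is a table whose columns are Series sharing the same index.
rain_series = pd.Series([0.0, 1.2, 5.4, 0.3], index=temp_series.index, name="rain_mm")
weather = pd.DataFrame({"temp_c": temp_series, "rain_mm": rain_series})
print(weather.shape)  # (4, 2)

# Selecting a single column gives back a Series.
print(type(weather["temp_c"]).__name__)  # Series
```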

No analysis is complete without diagrams like bar charts, scatter plots, or histograms. In this eBook, you will also learn to create simple diagrams using the Matplotlib and Seaborn libraries.

If you have followed our Python eBook, you should have already installed Anaconda on your machine. In that case, NumPy and Pandas are already installed along with JupyterLab, so you can start creating a notebook and using the packages immediately.

Alternatively, you can install Python, NumPy, Pandas, Matplotlib, Seaborn, and JupyterLab separately. However, we recommend downloading Anaconda so that all the necessary packages are available and you can focus on analyzing data.
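Whichever installation route you choose, you can check that the packages are available before you start. The sketch below uses only the Python standard library; the helper name find_missing and the list of module names are assumptions made for this example.

```python
import importlib.util

# Import names of the packages used in this eBook
# ("jupyterlab" is the import name of JupyterLab).
required = ["numpy", "pandas", "matplotlib", "seaborn", "jupyterlab"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Run it in the same environment (or notebook kernel) you plan to work in, since each environment has its own set of installed packages.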

Another convenient option is to create your notebook on Colab, Google's hosted Jupyter notebook service. Details are on this page: https://ebooks.mobibootcamp.com/python/cloudnotebook.html

Google's Colab environment includes many of the libraries commonly used by data scientists and analysts. However, if you need a package that is not available in the Colab environment, you can install it by prefixing a shell command with !, as shown below:

!apt install proj-bin libproj-dev libgeos-dev
!pip install https://github.com/matplotlib/basemap/archive/v1.1.0.tar.gz

Updating Libraries: To update an installed library, add the -U flag to the install command. For example, to update the Plotly library on Colab to the latest version, run the following command:

!pip install -U plotly
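To confirm which version is installed after an update, you can query the package metadata from Python. This is a small sketch using the standard library's importlib.metadata module; the helper name installed_version and the package list are assumptions chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the version of each package, or None when it is not installed.
for package in ["numpy", "pandas", "plotly"]:
    print(package, installed_version(package))
```

If the library was already imported in the running session, you typically need to restart the runtime before the updated version is loaded.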

Using bash

If all the statements in a code cell are command-line directives, you can put the %%bash cell magic on the first line of the cell instead of prefixing each line with !. Here is an example:

%%bash
ls
pwd
apt install -y proj-bin libproj-dev libgeos-dev

**Package Managers (pip and apt) and Jupyter Notebook Plugins**

pip downloads and installs Python packages from the Python Package Index (PyPI), which is hosted by the Python Software Foundation.
apt downloads and installs packages from the Ubuntu repositories, which are hosted by Canonical. These repositories include system-level libraries, such as libgeos-dev in the earlier example, but only a selected set of Python modules. In contrast, PyPI hosts a much broader range of Python modules.
There are some useful plugins for Jupyter Notebook that can improve your productivity. Refer to this article: Jupyter Notebook Extensions