Pandas
While NumPy is known for its efficient N-dimensional arrays, the Pandas library is famous for its DataFrame data structure, which is used for manipulating data in a tabular format. DataFrames are not only efficient when working with large amounts of data but also provide useful and powerful functions to slice, dice, and manipulate data. Pandas has been used extensively in financial applications. The name Pandas is derived from the term 'panel data'.
Pandas uses NumPy structures internally, leveraging its N-dimensional arrays to provide higher-level, more powerful methods for manipulation.
Just like NumPy, Pandas is also an open source project and here is the GitHub link: https://github.com/pandas-dev/pandas
How exactly does Pandas help?
The Pandas library solves routine data manipulation tasks with easy-to-use and high-performance API calls. Here are a few examples:
- It is easy to handle missing data (represented as NaN) in both floating-point and non-floating-point data.
- DataFrame columns can be inserted and/or deleted by either returning a new DataFrame or manipulating the existing DataFrame in-place.
- The groupby functionality for datasets helps in creating a variety of sub-collections. This is very similar to functions seen in relational databases.
- It offers label-based slicing and custom indexing capabilities.
- It can merge and join datasets.
- API calls are available for loading data from flat files (CSV and delimited), Excel files, databases, etc.
- It provides a variety of time series-specific functionalities.
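A few of the capabilities listed above can be sketched in a handful of lines. The sample data, column names, and variable names below are made up purely for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data; the column names are illustrative only.
sales = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "amount": [100.0, np.nan, 250.0, 75.0],
    }
)

# Missing data: count the NaN values, then fill them with 0.
missing_count = int(sales["amount"].isna().sum())
filled = sales.fillna({"amount": 0.0})

# groupby: total amount per region, similar to GROUP BY in SQL.
totals = filled.groupby("region")["amount"].sum()

# Merge: join against a second (hypothetical) table on a shared column.
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ann", "Bob"]})
merged = filled.merge(managers, on="region")

print(missing_count)
print(totals)
print(merged)
```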
To use the Pandas module, you first have to install and then import it in your notebook. Since you have downloaded Anaconda, you already have Pandas installed on your computer, so you can import the module directly with import pandas as pd. Although you can import Pandas with any other alias, it is common practice to import it as pd.
Now, let us understand the two main data structures of Pandas: Series and DataFrames.
Series
The Series is the primary building block of Pandas. A Series represents a one-dimensional, labeled, indexed array based on the NumPy ndarray. Therefore, a Series has only one axis (axis 0), called the 'index'. Like an array, a Series can hold only homogeneous data. A Series can be created and initialized by passing a NumPy ndarray, a Python list, or a Python dictionary as a parameter to the Series constructor. Here are examples of defining a Series:
Series built with list values
import pandas as pd
a = pd.Series([1, 25, 65])
print(a)
print(a[0])
Output
0 1
1 25
2 65
dtype: int64
1
In the example above, the index values for the Series are assigned automatically, similar to a NumPy array or a Python list. Accessing index 0 will get you the value of the first element, which is 1 in this example.
Note: If you construct a Series with heterogeneous data, the dtype of the resulting Series will be chosen to accommodate all of the data types involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype. If only integers are involved, the dtype will be int.
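The dtype rules in the note can be checked with a short sketch. It also shows the other two constructor inputs mentioned earlier, a NumPy array and a dictionary. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])           # only integers
mixed = pd.Series([1, 2.5, 3])        # integers and floats
with_text = pd.Series([1, 2.5, "x"])  # a string is involved

# A Series can also be built from a NumPy array or a dictionary;
# dictionary keys become the index labels.
from_array = pd.Series(np.array([10, 20, 30]))
from_dict = pd.Series({"a": 1, "b": 2})

print(ints.dtype, mixed.dtype, with_text.dtype)
print(list(from_dict.index))
```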
Series built with index values
You can also build a Series with your own specific index values. You can provide any combination of alphanumeric characters as the index values.
import numpy as np
b = pd.Series([1, 25, 65, np.nan], index=["a", "Bx", "c", "d"])
print(b)
print(b["Bx"])
Output
a 1.0
Bx 25.0
c 65.0
d NaN
dtype: float64
25.0
In this example, note that we created a Series with a NaN (Not a Number) value by referencing np.nan from NumPy, which must be imported separately with import numpy as np. Older tutorials reach it through pd.np.nan, but the pd.np alias was deprecated in pandas 1.0 and later removed. Because the Series contains a NaN value, the dtype was automatically assigned as float.
To access the second element in the Series, we use the index label 'Bx', as we replaced the default integer index with our own custom labels by passing them as the second parameter to the Series constructor.
Slicing works similarly to an ndarray or a list, except you can also use your custom index. Notice that integer slices such as b[0:2] still select by position, starting from 0, even when a custom index is present.
import numpy as np
import pandas as pd
b = pd.Series([1, 25, 65, np.nan], index=["a", "Bx", "c", "d"])
print(b[0:2])
print(b["c":])
Output:
a 1.0
Bx 25.0
dtype: float64
c 65.0
d NaN
dtype: float64
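One difference between the two slicing styles is easy to miss: a positional slice excludes its end position, while a label-based slice includes its end label. The sketch below makes this explicit with iloc and loc. The data values are illustrative.

```python
import pandas as pd

b = pd.Series([1, 25, 65, 80], index=["a", "Bx", "c", "d"])

by_position = b.iloc[0:2]  # positions 0 and 1; position 2 is excluded
by_label = b.loc["a":"c"]  # labels "a" through "c"; "c" is included

print(by_position)
print(by_label)
```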
Top or Bottom n entries
Using nlargest(n) on a Series object will return the top 'n' largest values, where 'n' can be any number. The complement to this is nsmallest(n).
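A minimal sketch of both calls, using made-up values:

```python
import pandas as pd

scores = pd.Series([10, 2, 65, 35, 64], index=["c", "aa", "ac", "a", "e"])

top_two = scores.nlargest(2)      # the two largest values, largest first
bottom_two = scores.nsmallest(2)  # the two smallest values, smallest first

print(top_two)
print(bottom_two)
```

Both calls return a new Series and keep the original index labels attached to the selected values.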
Sorting values and indexes
You can sort the values of a Series using sort_values() and sort the index using sort_index(). Here is an example:
a = pd.Series([10, 2, 65, 35, 64], index=["c", "aa", "ac", "a", "e"])
sorted_index = a.sort_index()
sorted_values = a.sort_values()
print(sorted_index)
print(sorted_values)
Output:
a 35
aa 2
ac 65
c 10
e 64
dtype: int64
aa 2
c 10
a 35
e 64
ac 65
dtype: int64
By default, the sort is in ascending order. You can change this by adding the parameter ascending=False.
Once sorted, you can also get the n smallest or n largest values by slicing. Because the default sort is ascending, the first three entries are the 3 smallest values and the last three are the 3 largest:
a.sort_values()[:3] and a.sort_values()[-3:]
Note: By default, the sort functions used above return a new Series object with the sorted values. If you want to sort the original Series in place, you must use the inplace=True keyword argument.
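The difference between the default behavior and inplace=True can be sketched as follows. The variable names are illustrative.

```python
import pandas as pd

a = pd.Series([10, 2, 65, 35, 64], index=["c", "aa", "ac", "a", "e"])

# Descending sort; this returns a new Series and leaves `a` untouched.
descending = a.sort_values(ascending=False)
first_before = a.iloc[0]  # `a` still starts with its original first value

# In-place sort; `a` itself is reordered and the call returns None.
a.sort_values(inplace=True)
first_after = a.iloc[0]

print(descending)
print(first_before, first_after)
```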
Map Function on Series
The standard functions to calculate mean(), max(), min(), etc., that you learned for NumPy are also available for Pandas Series. In addition, a very handy function that you will see quite often is the map function. map is used for transforming data from one format to another. The map function expects you to pass a function (like a lambda function) that will be applied to each value in the Series, returning a new Series with the transformed values. For example, suppose you want to strip the '$' sign from a price value. You would use map with a lambda function. The example below shows this:
import pandas as pd
prices = pd.Series(["$1.5", "$25", "$65"])
print(prices)
prices = prices.map(lambda price_value: price_value.lstrip("$"))
print(prices)
Output:
0 $1.5
1 $25
2 $65
dtype: object
0 1.5
1 25
2 65
dtype: object
Note, however, that map returns a new Series object and does not modify the existing one. Hence, the prices variable is reassigned to the value returned from map.
Although we used map for the above scenario, you do not need to use map for such simple operations. You can directly manipulate the string value in this case by applying string functions. Here is an example showing the same functionality derived from directly applying the str.lstrip() string function:
import pandas as pd
prices = pd.Series(["$1.5", "$25", "$65"])
print(prices)
prices = prices.str.lstrip("$")
print(prices)
Output:
0 $1.5
1 $25
2 $65
dtype: object
0 1.5
1 25
2 65
dtype: object
map is typically used for more complex transformations.
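As a rough sketch of such a transformation, the example below maps each price string to a coarse price band and then translates the bands with a dictionary, which map also accepts. The function name, band names, and thresholds are hypothetical.

```python
import pandas as pd

prices = pd.Series(["$1.5", "$25", "$65"])


def to_price_band(price_text):
    """Convert a '$'-prefixed string to a coarse price band."""
    value = float(price_text.lstrip("$"))
    if value < 10:
        return "low"
    if value < 50:
        return "medium"
    return "high"


# Apply the function to every element; a new Series is returned.
bands = prices.map(to_price_band)

# map also accepts a dictionary that translates one value into another.
codes = bands.map({"low": "L", "medium": "M", "high": "H"})

print(bands)
print(codes)
```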
Also, all the broadcasting rules that you learned for NumPy are applicable to Series objects.
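A small sketch of scalar broadcasting on a Series follows. It also shows one behavior that goes beyond NumPy: when two Series are combined, values are matched by index label rather than by position. The data is illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

doubled = s * 2    # the scalar is applied to every element
shifted = s + 0.5  # the result is upcast to float

# Series-to-Series arithmetic aligns on index labels, not positions.
t = pd.Series([10, 20, 30], index=["c", "b", "a"])
total = s + t      # "a" pairs 1 with 30, "c" pairs 3 with 10

print(doubled)
print(shifted)
print(total)
```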
Queries
You can run queries to filter certain values. Here are some examples:
You can use relational and logical operators to filter records.
In the example below, you are getting values that are greater than or equal to 10:
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s2 = s1[s1 >= 10]
print(s2, type(s2))
Note that the returned value is also a Series object.
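It can help to look at the intermediate step. The comparison itself produces a Series of booleans (a mask), and indexing with that mask keeps only the entries where it is True. A minimal sketch with illustrative values:

```python
import pandas as pd

s1 = pd.Series([5, 14, 32, 8])

mask = s1 >= 10  # boolean Series, one True/False per element
s2 = s1[mask]    # keeps the elements where the mask is True

print(mask)
print(s2)
```

The filtered Series keeps the original index labels of the surviving elements, so its index is no longer a contiguous 0, 1, 2 sequence.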
In the example below, you get all the values that are greater than 14 and less than 50 by using the & operator between the two relational expressions.
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s2 = s1[(s1 > 14) & (s1 < 50)]
s2
The logical OR operator can be applied using the | notation.
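Note: Always enclose each relational expression in parentheses, as shown in the example. In Python, & and | bind more tightly than comparison operators such as > and <, so an expression like s1 > 14 & s1 < 50 is not evaluated the way it reads and raises an error.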
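The sketch below shows the OR operator, the NOT operator (~), and what happens when the parentheses are left out. The variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([20, 14, 32, 50, 49, 100])

# OR: values below 20 or above 50.
low_or_high = s1[(s1 < 20) | (s1 > 50)]

# NOT: invert a combined condition with ~.
not_between = s1[~((s1 > 14) & (s1 < 50))]

# Without parentheses, 14 & s1 is evaluated first and the chained
# comparison that remains cannot be reduced to a single True/False.
try:
    s1[s1 > 14 & s1 < 50]
    raised = False
except ValueError:
    raised = True

print(low_or_high)
print(not_between)
print(raised)
```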
Selecting Elements
There are multiple ways to select an element in a Series. Let's understand them one by one.
Using index number
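A Series with the default integer index supports the standard indexing mechanism to get an element, with 0 being the starting index. This is very similar to Python lists or NumPy arrays that you have dealt with earlier. Unlike a list, however, s1[-1] does not count from the end here: with the default index, the brackets look up the label -1, which does not exist, so a KeyError is raised. Use iloc to go in the reverse direction starting from -1.
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1[0] # gets you the first element of the Series (the label 0)
s1.iloc[-1] # gets you the last element of the Series by position
Note: With a custom non-integer index, older pandas versions let s1[-1] fall back to positional access, but that fallback is deprecated in recent releases. Prefer iloc for any position-based access, including negative positions.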
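The behavior of negative numbers with the default integer index can be verified with a short sketch. The flag name is illustrative.

```python
import pandas as pd

s1 = pd.Series([20, 14, 32, 50, 49, 100])  # default index 0..5

# With the default index, square brackets look up the label -1,
# which does not exist.
try:
    s1[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# iloc always works by position, so negative positions are fine.
last = s1.iloc[-1]

print(lookup_failed, last)
```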
Using iloc
You can also use the iloc accessor when you are dealing with the index number. Here is an example:
a = pd.Series([100, 200, 300], index=["a", "b", "c"])
a.iloc[0] # gets you the first element of the Series
Output:
100
Retrieving multiple values
You can also retrieve multiple values with iloc. Here is an example:
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1.iloc[[0, 3]]
Output:
0 20
3 50
dtype: int64
With iloc, you can also get a range of values similar to slicing. Here is an example:
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1.iloc[0:3] # gets all values from index 0 to index 3 excluding 3
Traditional methods using positional index numbers
Of course, the regular slice function and retrieval using the positional index also work on a Series.
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1[0:3] # This gives the same result as iloc[0:3]
s1[0] # this retrieves the first element of the Series
Updating values
A Series object is mutable, so you can update or delete any element. Here are some examples of using the positional index and iloc to get this done:
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1[0] = 100 # changes the first element to 100
s1.iloc[2] = 200 # updates the third element to 200
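The examples above cover updates. Deleting an element can be sketched in two ways: drop returns a new Series without the given label, while del removes the label from the existing Series. The variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([20, 14, 32, 50, 49, 100])

# drop returns a new Series; s1 itself is unchanged.
without_first = s1.drop(0)
length_after_drop = len(s1)

# del removes the entry with label 5 from s1 itself.
del s1[5]

print(without_first)
print(s1)
```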
If iloc and the regular index give the same results, which one should you use?
Well, if you are retrieving or updating only one value, either one is fine. But if you are retrieving or updating multiple values, then .iloc would be a better fit. Here is an example:
s1 = pd.Series([20, 14, 32, 50, 49, 100])
# You can update both the first and fourth elements with one statement using `iloc`.
s1.iloc[[0, 3]] = [100, 200]
Using loc property
You can also use loc with the default integer index, because the labels are then the integers themselves. However, if you have a non-numerical index, you must use loc to retrieve values using the non-numerical index labels. Here is an example:
s1 = pd.Series([20, 14, 32, 50, 49, 100], index=["a", "b", "z", "d", "e", "f"])
print(s1.loc[["a", "d"]])
print(s1.iloc[[0, 3]])
Both of the print statements above retrieve the first and fourth values in the Series, but iloc retrieves them using the numerical index, and loc uses the custom index labels.
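If a custom index is not provided, the labels are the integers 0, 1, 2, and so on, so loc and iloc return the same element for a single non-negative integer or a list of them. They still differ in two cases: a loc slice includes its end label while an iloc slice excludes its end position, and negative numbers count from the end only with iloc.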
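One caveat worth checking, even with the default index, is slicing. The sketch below compares the same slice through both accessors. The variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([20, 14, 32, 50, 49, 100])

single_loc = s1.loc[3]    # label 3
single_iloc = s1.iloc[3]  # position 3; same element with the default index

by_position = s1.iloc[0:3]  # positions 0, 1, 2
by_label = s1.loc[0:3]      # labels 0, 1, 2, 3 (end label included)

print(single_loc, single_iloc)
print(len(by_position), len(by_label))
```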