NumPy

Over the years, statisticians and programmers have built many libraries that make routine calculations like mean, median, and standard deviation on a set of numbers, and data operations like matrix addition/multiplication, not only simple but also highly scalable and efficient. One such popular library is NumPy.

NumPy Array

NumPy is famous for its N-dimensional array data structure and the ease with which complex operations can be performed on these arrays. It provides powerful functions that efficiently operate on this array object to calculate statistical and mathematical metrics.

By using these structures and their methods, you not only eliminate the for-loops needed to iterate through a basic Python list but can also process considerably more data on the same hardware. In this lesson, you will learn some common operations that can be performed on NumPy arrays.

While Python lists can hold heterogeneous data, NumPy arrays can only hold homogeneous data (data of one type). Here is an example of constructing a 1-dimensional array:

import numpy as np

a = np.array([15, 35, 55])
print(type(a))
print(a)

Output:

<class 'numpy.ndarray'>
[15 35 55]

In the above example, you first import the NumPy module by adding the statement:

import numpy as np

This statement directs the Python interpreter to load the NumPy module, which must already be installed in your environment, and to make it available under the alias np. You can use any alias for NumPy, but the standard convention is np. Using this alias is also standard practice, in preference to writing out numpy each time.

Now that NumPy is imported, you can use the dot operator on np to call the array function, which creates a NumPy array that is then assigned to the variable a. If you print the type of a, you will see that it is numpy.ndarray. You can also print all elements of an array by passing the ndarray to the print function, just as you would with a Python list.

Important attributes of ndarray

ndarray.ndim: The number of axes (dimensions) of the array.
ndarray.shape: A tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, the shape will be (n, m). The length of the shape tuple is, therefore, the number of axes, ndim.
ndarray.size: The total number of elements of the array. This is equal to the product of the elements of shape.
ndarray.dtype: Provides the data type of the elements in the array. You can specify standard Python data types or use additional data types provided by NumPy, e.g., numpy.int32, numpy.int16, and numpy.float64.

print(a.dtype)  # Prints the datatype of items stored
print(a.ndim)  # Prints the array dimension
print(a.shape)  # Prints the array shape
print(a.size)  # Prints the array size

Output:

int64
1
(3,)
3
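
To make the (n, m) shape rule concrete, here is a small sketch with a two-dimensional array. The variable name m is chosen only for this illustration.

```python
import numpy as np

# A matrix with 2 rows and 3 columns (illustrative values)
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)   # 2 -> two axes: rows and columns
print(m.shape)  # (2, 3) -> 2 rows, 3 columns
print(m.size)   # 6 -> 2 * 3 elements in total
```

Note that len(m.shape) equals m.ndim, and the product of the entries in m.shape equals m.size.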

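Notice that, by default, NumPy assigned int64 as the data type for integers. This default depends on the platform and NumPy version; for example, Windows builds of NumPy before 2.0 default to int32. To change the data type to complex numbers, you would specify dtype=complex, as shown below: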

a = np.array([15, 35, 55], dtype=complex)
print(a)

Output:

[ 15.+0.j 35.+0.j 55.+0.j]
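
The same dtype argument accepts the NumPy-specific types mentioned earlier. The sketch below also uses astype, which is not covered in this lesson, to convert an existing array to another type; the variable names are illustrative.

```python
import numpy as np

# Request a 16-bit integer type explicitly instead of the default
small = np.array([15, 35, 55], dtype=np.int16)
print(small.dtype)     # int16

# astype returns a converted copy; the original array is unchanged
as_float = small.astype(np.float64)
print(as_float.dtype)  # float64
print(as_float)        # [15. 35. 55.]
```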

Important functions for statistics

So far, you have constructed arrays, which hold numbers in less memory than Python lists, but you have not yet applied any functions to them. Now, you will see some useful statistical functions that you can apply to an array of numbers:

scores = np.array([92, 34, 88, 80, 73, 100, 100])
print(scores.mean())  # Prints the mean of the array elements
print(scores.max())  # Prints the max value in the array
print(scores.min())  # Prints the min value in the array
print(scores.argmin())  # Prints the index number of the min value
print(scores.std())  # Prints the standard deviation of the array elements

Output:

81.0
100
34
1
21.2670099987

If there are NaN values in the dataset, you can use np.nanmax(scores), np.nanstd(scores), etc., which ignore the NaN values while calculating the metrics. Otherwise, if even a single NaN value is present, the result is always NaN.
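
As a small sketch of the difference, the array below (made up for this example) contains one NaN. NaN is a floating-point value, so the array has a float data type.

```python
import numpy as np

# Illustrative data with one missing value
scores_with_gap = np.array([92.0, np.nan, 88.0])

print(scores_with_gap.max())        # nan -> the NaN propagates
print(np.nanmax(scores_with_gap))   # 92.0 -> NaN is ignored
print(np.nanmean(scores_with_gap))  # 90.0 -> mean of 92.0 and 88.0
```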

Although you could have used Python's built-in functions to find the max and min values in a list, NumPy provides these functions as part of the array itself, along with functions for mean and standard deviation.

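To find the median, you can use the np.median function. For the mode, you can use SciPy's stats module. The exact form of the ModeResult output depends on the SciPy version; recent versions report scalars rather than one-element arrays. Here is an example:

from scipy import stats

print(np.median(scores))
print(stats.mode(scores))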

Output:

88.0
ModeResult(mode=array([100]), count=array([2]))

Consider how many loops we avoided by using NumPy. You will see more of these convenience methods that eliminate multiple loops in the next lesson.
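
For comparison, here is a rough sketch of the loops that scores.mean() and scores.std() replace. The helper variable names are invented for this example. The loop version computes the population standard deviation, which is what ndarray.std() returns by default.

```python
import math
import numpy as np

scores = np.array([92, 34, 88, 80, 73, 100, 100])
values = scores.tolist()  # plain Python list for the loop version

# Loop 1: add up every element, then divide by the count
total = 0
for v in values:
    total += v
loop_mean = total / len(values)

# Loop 2: average the squared distances from the mean, then take the root
squared_diffs = 0.0
for v in values:
    squared_diffs += (v - loop_mean) ** 2
loop_std = math.sqrt(squared_diffs / len(values))

# Each pair of numbers should agree
print(loop_mean, scores.mean())
print(loop_std, scores.std())
```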

Arrays of different data types

Here are some more examples of creating homogeneous arrays of different types.

print(np.array([1.0, 1.5, 2.0, 2.5]).dtype)  # Float type
print(np.array([True, False, True]).dtype)  # Boolean type
print(np.array(["AL", "AK", "AZ", "AR", "CA"]).dtype)  # Unicode 2 characters

Output:

float64
bool
<U2

Array accessing and slicing

Array element access and slicing are similar to Python lists. Here are some examples:

countries = np.array(["US", "CA", "MX"])
print(countries[0])  # Prints out the first element of the array
print(countries[1:])  # Get elements from index 1 (included) to the end of the array
print(countries[:1])  # Get elements from the beginning to index 1 (excluded)
print(countries[0:2])  # Get elements from 0 index to 2 index (excluded)
print(countries[:])  # Get all elements

Output:

US
['CA' 'MX']
['US']
['US' 'CA']
['US' 'CA' 'MX']
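
Because the indexing rules follow Python lists, negative indices and step values should work the same way. A brief sketch, reusing the countries array from above:

```python
import numpy as np

countries = np.array(["US", "CA", "MX"])

print(countries[-1])    # MX -> last element
print(countries[::-1])  # ['MX' 'CA' 'US'] -> reversed
print(countries[::2])   # ['US' 'MX'] -> every second element
```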

Slicing in multidimensional arrays

You can convert a 1D array to a 2D array by using the reshape function. Here is an example:

x = np.reshape(np.arange(0, 12), (3, 4))
x

Output:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
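
The np.reshape function has a method counterpart on the array itself. The sketch below uses hypothetical names (flat, grid) to show that both forms give the same result.

```python
import numpy as np

flat = np.arange(0, 12)

# Method form: call reshape on the array itself
grid = flat.reshape(3, 4)

print(grid.shape)                                      # (3, 4)
print(np.array_equal(grid, np.reshape(flat, (3, 4))))  # True
```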

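You can also reshape an existing array in place by assigning to its shape attribute, as shown below. A separate variable, y, is used here so that x keeps the 3 x 4 values used in the slicing examples that follow. The NumPy documentation discourages setting shape this way and recommends reshape instead.

y = np.array([2, 3, 5, 8, 0, 1])
y.shape = (3, 2)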

With multidimensional arrays, you can slice by providing the index values for each dimension. Here are some examples, using the 3 x 4 array x built from np.arange(0, 12):

Example	Output	Comments
`x[1]`	array([4, 5, 6, 7])	gets the second row completely
`x[1:]`	array([[ 4, 5, 6, 7], [ 8, 9, 10, 11]])	gets all the rows starting from second
`x[:2]`	array([[0, 1, 2, 3], [4, 5, 6, 7]])	gets all the rows until 2 - excluding 2
`x[:,2]`	array([ 2, 6, 10])	gets all the rows of the third column
`x[:,(1,3)]`	array([[ 1, 3], [ 5, 7], [ 9, 11]])	gets all the rows of the given two column values after the :
`x[(2):,(1,3)]`	array([[ 9, 11]])	gets given two (1,3) column values of the third row (2)

Note: Adding a comma after the : changes the selection completely! For example, x[:2] selects the first two rows, while x[:,2] selects the third column of every row.
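
A short sketch of that difference, assuming x is the 3 x 4 array built from np.arange(0, 12); the result names are illustrative.

```python
import numpy as np

x = np.reshape(np.arange(0, 12), (3, 4))

first_two_rows = x[:2]  # no comma: the slice applies to rows
third_column = x[:, 2]  # comma: all rows, then column index 2

print(first_two_rows.shape)  # (2, 4)
print(third_column)          # [ 2  6 10]
```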

Using an ellipsis to select all dimensions except the innermost one

In NumPy, you can use an ellipsis (...) to select full slices (:) of all the outer dimensions and select only a few indices of the inner dimension after the ellipsis. Here is an example:

a = np.arange(16).reshape(2, 2, 2, 2)
a[..., 0]

Output:

array([[[ 0,  2],
        [ 4,  6]],

       [[ 8, 10],
        [12, 14]]])

The above slice selects all indices of the outer dimensions and only the 0th index of the innermost dimension.
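
To check that reading, the sketch below compares the ellipsis with the equivalent slice written out using one : per outer dimension.

```python
import numpy as np

a = np.arange(16).reshape(2, 2, 2, 2)

with_ellipsis = a[..., 0]
with_colons = a[:, :, :, 0]  # one full slice per outer dimension

print(np.array_equal(with_ellipsis, with_colons))  # True
print(with_ellipsis.shape)                         # (2, 2, 2)
```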

Logical Operators

You can apply logical expressions to array data. Here are some examples:

Example	Output	Comments
`x[x>9]`	array([10, 11])	gets all array elements meeting the given expression
`np.any([x>2])`	True	True if any element of the entire array meets the expression
`np.all([x>2], axis=1)`	array([[False, False, False, True]])	wrapping x>2 in a list adds a leading axis, so axis=1 runs down the rows of x and gives one result per column
`np.any([x>2], axis=0)`	array([[False, False, False, True], [ True, True, True, True], [ True, True, True, True]])	axis=0 is the length-1 axis added by the list, so you get the result for each element
`x[np.logical_and(x > 2, x < 9)]`	array([3, 4, 5, 6, 7, 8])	'and' operators between multiple logical operators
`np.logical_or(x < 4, x > 6)`	array([[ True, True, True, True], [False, False, False, True], [ True, True, True, True]])	'or' operator that returns boolean by applying the expression on each element

Similarly, you have logical_not and logical_xor that apply the not and xor operators, respectively.
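
A brief sketch of these two functions, assuming x is the 3 x 4 array built from np.arange(0, 12):

```python
import numpy as np

x = np.reshape(np.arange(0, 12), (3, 4))

# 'not': keep the elements that do NOT satisfy x > 2
print(x[np.logical_not(x > 2)])         # [0 1 2]

# 'xor': keep the elements that satisfy exactly one of the two conditions
print(x[np.logical_xor(x > 2, x < 9)])  # [ 0  1  2  9 10 11]
```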

Note:

You cannot combine element-wise conditions with Python's and and or keywords, as in the example below:

x[(x > 9) or (x < 2)]

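This raises a ValueError because Python's or keyword needs a single truth value for each operand, and the truth value of an array with more than one element is ambiguous. For an element-wise result, combine the conditions with the | operator or np.logical_or. To reduce the conditions to a single boolean instead, use np.any or np.all, as shown below: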

np.any([x > 9, x < 2])
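
The sketch below reproduces the ValueError and then shows the element-wise operators |, & and ~, which are the operator counterparts of logical_or, logical_and and logical_not. It assumes x is the 3 x 4 array built from np.arange(0, 12). The parentheses around each condition are required because these operators bind more tightly than the comparisons.

```python
import numpy as np

x = np.reshape(np.arange(0, 12), (3, 4))

# Python's 'or' asks each array for a single truth value and fails
try:
    x[(x > 9) or (x < 2)]
except ValueError as err:
    print("ValueError:", err)

# Element-wise operators build a boolean mask instead
print(x[(x > 9) | (x < 2)])  # [ 0  1 10 11]
print(x[(x > 2) & (x < 9)])  # [3 4 5 6 7 8]
print(x[~(x > 2)])           # [0 1 2]
```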

NumPy arrays can only be homogeneous

With a NumPy array, you cannot construct an array with heterogeneous data as you can with Python lists. Here is an attempt to construct one, showing how the data types are converted:

mixed_array = np.array(["1", 2, np.nan])
print(mixed_array)
print(type(mixed_array[1]))

Output:

['1' '2' 'nan']
<class 'numpy.str_'>

Note that all values are converted to strings.
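
The same kind of conversion happens with numeric types. As a sketch, the example below also shows one escape hatch not covered in this lesson: passing dtype=object, which stores references to the original Python objects at the cost of most of NumPy's speed advantages.

```python
import numpy as np

# Mixing an int and a float promotes everything to float
promoted = np.array([1, 2.5])
print(promoted.dtype)  # float64
print(promoted)        # [1.  2.5]

# dtype=object keeps each element as its original Python object
obj = np.array(["1", 2, np.nan], dtype=object)
print([type(v).__name__ for v in obj])  # ['str', 'int', 'float']
```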

Convenience functions commonly used

Example	Output	Comments
`np.arange(2)`	array([0, 1])	similar to Python range, except that you get an ndarray
`np.arange(0, 1, 0.3)`	array([0. , 0.3, 0.6, 0.9])	same as above, but with explicit start, stop (excluded), and step values
`np.zeros(2)`	array([0., 0.])	get an array of zeroes of the given size
`np.zeros((2,3))`	array([[0., 0., 0.], [0., 0., 0.]])	get a multidimensional array of zeroes of the given size
`np.ones(2)`	array([1., 1.])	same as np.zeros, but filled with ones; multidimensional shapes work here too
`x = np.array([3, 4])` `np.any(x>0)`	True	returns True if the expression is true for any element
`np.random.default_rng().random(2)`	array([0.xxxxxxx, 0.xxxxxx])	returns an ndarray of the given size holding uniformly distributed random floats from 0 (included) to 1 (excluded)
`np.random.default_rng().integers(2, 10)`	x	returns 'x', a random integer from 2 (included) to 10 (excluded)
`np.random.default_rng().standard_normal(10)`	array(10 floats)	returns an ndarray of the given size holding floats drawn from the standard normal distribution (mean 0, standard deviation 1)

Note: The legacy random.random implementation should be avoided in favor of the default_rng method. Refer: https://numpy.org/doc/stable/reference/random/index.html
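
In practice, you would usually create one generator and reuse it instead of calling default_rng() each time. The sketch below passes a seed (42, an arbitrary value chosen for this example) so that repeated runs draw the same numbers. No output is shown because the values depend on the generator.

```python
import numpy as np

# One generator, seeded so that repeated runs draw the same numbers
rng = np.random.default_rng(seed=42)

uniform = rng.random(2)            # 2 floats in [0, 1)
dice = rng.integers(1, 7, size=5)  # 5 integers from 1 to 6 (7 is excluded)
normal = rng.standard_normal(10)   # 10 draws with mean 0, std 1

print(uniform)
print(dice)
print(normal)
```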

Reference

Reference for more math functions: https://numpy.org/doc/stable/reference/routines.math.html