NumPy
Over the years, statisticians and programmers have built many libraries that make routine calculations like mean, median, and standard deviation on a set of numbers, and data operations like matrix addition/multiplication, not only simple but also highly scalable and efficient. One such popular library is NumPy.
NumPy Array
NumPy is famous for its N-dimensional array data structure and the ease with which complex operations can be performed on these arrays. It provides powerful functions that efficiently operate on this array object to calculate statistical and mathematical metrics.
By using these structures and their methods, you can replace explicit for-loops over a basic Python list with single array operations and, because the elements are stored compactly, work on considerably more data with the same hardware. In this lesson, you will learn some common operations that can be performed on NumPy arrays.
While Python lists can hold heterogeneous data, NumPy arrays can only hold homogeneous data (data of one type). Here is an example of constructing a 1-dimensional array:
import numpy as np
a = np.array([15, 35, 55])
print(type(a))
print(a)
Output:
<class 'numpy.ndarray'>
[15 35 55]
In the above example, you first import the NumPy module by adding the statement:
import numpy as np
This statement directs the Python interpreter to load the NumPy module, which must already be installed on your computer. It also directs the interpreter to refer to NumPy using the np alias. Although you can use any alias for NumPy, the standard convention is to use np.
It is also standard practice to use this alias rather than typing the full module name each time.
Now that NumPy is imported, you can use the dot operator on np to call the array function, which creates a NumPy array that is then assigned to the variable a. If you print the type of a, you will see that it is numpy.ndarray.
You can also print all elements of an array by passing the ndarray to the print function, similar to a Python list.
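To make the earlier point about avoiding for-loops concrete, here is a minimal sketch that doubles every element of the same three numbers, once with a list comprehension and once with a single array expression. The variable names are illustrative only.

```python
import numpy as np

values = [15, 35, 55]      # a plain Python list
arr = np.array(values)     # the same numbers as an ndarray

# With a list, doubling every element needs an explicit loop or comprehension
doubled_list = [v * 2 for v in values]

# With an ndarray, the multiplication is applied to every element at once
doubled_arr = arr * 2

print(doubled_list)           # [30, 70, 110]
print(doubled_arr.tolist())   # [30, 70, 110]
```

Note that values * 2 on the list would instead repeat the list, which is one reason the two types are not interchangeable.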
Important attributes of ndarray
- ndarray.ndim: The number of axes (dimensions) of the array.
- ndarray.shape: A tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, the shape will be (n, m). The length of the shape tuple is, therefore, the number of axes, ndim.
- ndarray.size: The total number of elements of the array. This is equal to the product of the elements of shape.
- ndarray.dtype: Provides the data type of the elements in the array. You can specify standard Python data types or use additional data types provided by NumPy, e.g., numpy.int32, numpy.int16, and numpy.float64.
print(a.dtype) # Prints the datatype of items stored
print(a.ndim) # Prints the array dimension
print(a.shape) # Prints the array shape
print(a.size) # Prints the array size
Output:
int64
1
(3,)
3
Notice that NumPy chose int64 as the default data type for these integers (the default integer type is platform dependent, so some systems report int32 instead). To store the values as complex numbers, you would specify dtype=complex, as shown below:
a = np.array([15, 35, 55], dtype=complex)
print(a)
Output:
[ 15.+0.j 35.+0.j 55.+0.j]
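The attribute definitions above are easier to see on a two-dimensional array. The following sketch uses a small hypothetical 2 x 3 matrix m, plus an int16 array to show one of the NumPy-specific data types mentioned earlier.

```python
import numpy as np

# A hypothetical matrix with 2 rows and 3 columns
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)                   # 2
print(m.shape)                  # (2, 3)
print(m.size)                   # 6, the product of the entries of shape
print(len(m.shape) == m.ndim)   # True

# Requesting a smaller integer type explicitly
small = np.array([15, 35, 55], dtype=np.int16)
print(small.dtype)              # int16
```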
Important functions for statistics
So far, you have only constructed arrays, which hold numbers in less memory than Python lists, and inspected their attributes. Now, you will see some useful statistical functions that you can apply to an array of numbers:
scores = np.array([92, 34, 88, 80, 73, 100, 100])
print(scores.mean()) # Prints the mean of the array elements
print(scores.max()) # Prints the max value in the array
print(scores.min()) # Prints the min value in the array
print(scores.argmin()) # Prints the index number of the min value
print(scores.std()) # Prints the standard deviation of the array elements
Output:
81.0
100
34
1
21.2670099987
If there are NaN values in the dataset, you can use np.nanmax(scores), np.nanstd(scores), etc., which ignore the NaN values while calculating the metrics. Otherwise, if even a single NaN value is present, the result is always NaN.
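As a small illustration of the NaN behavior described above, the sketch below uses a hypothetical array with one missing score and compares the regular functions with their NaN-aware counterparts.

```python
import numpy as np

# Hypothetical scores with one missing value
scores_with_gap = np.array([92.0, np.nan, 88.0, 80.0])

plain_max = np.max(scores_with_gap)        # NaN propagates into the result
nan_max = np.nanmax(scores_with_gap)       # NaN is ignored
nan_mean = np.nanmean(scores_with_gap)     # mean of 92, 88, and 80 only

print(plain_max)   # nan
print(nan_max)     # 92.0
print(nan_mean)
```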
Although you could have used Python's built-in functions to find the max and min values in a list, NumPy provides these functions as part of the array itself, along with functions for mean and standard deviation.
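To find the median and mode, you can use the np.median function and SciPy's stats module. Here is an example:
from scipy import stats
print(np.median(scores))
print(stats.mode(scores))
Output:
88.0
ModeResult(mode=array([100]), count=array([2]))
The exact form of the ModeResult output depends on the SciPy version; recent releases report the mode and count as scalars rather than one-element arrays.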
Consider how many loops we avoided by using NumPy. You will see more of these convenience methods that eliminate multiple loops in the next lesson.
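If SciPy is not available, the mode can also be found with NumPy alone. This is an alternative sketch, not the approach used in the lesson above: it counts the occurrences of each distinct value with np.unique and picks the most frequent one.

```python
import numpy as np

scores = np.array([92, 34, 88, 80, 73, 100, 100])

# np.unique returns the sorted distinct values and, on request, their counts
values, counts = np.unique(scores, return_counts=True)

mode_value = values[counts.argmax()]   # value with the highest count
mode_count = counts.max()

print(mode_value, mode_count)   # 100 2
```

When several values tie for the highest count, argmax returns the first of them, which is the smallest value because np.unique sorts its output.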
Arrays of different data types
Here are some more examples of creating homogeneous arrays of different types.
print(np.array([1.0, 1.5, 2.0, 2.5]).dtype) # Float type
print(np.array([True, False, True]).dtype) # Boolean type
print(np.array(["AL", "AK", "AZ", "AR", "CA"]).dtype) # Unicode 2 characters
Output:
float64
bool
<U2
Array accessing and slicing
Array element access and slicing are similar to Python lists. Here are some examples:
countries = np.array(["US", "CA", "MX"])
print(countries[0]) # Prints out the first element of the array
print(countries[1:]) # Get elements from index 1 (included) to the end of the array
print(countries[:1]) # Get elements from the beginning to index 1 (excluded)
print(countries[0:2]) # Get elements from 0 index to 2 index (excluded)
print(countries[:]) # Get all elements
Output:
US
['CA' 'MX']
['US']
['US' 'CA']
['US' 'CA' 'MX']
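Negative indices and steps also work as they do with lists. One difference worth knowing is that a basic slice of an ndarray is a view of the original data rather than a copy, so writing through the slice changes the original array. A minimal sketch with a hypothetical numbers array:

```python
import numpy as np

countries = np.array(["US", "CA", "MX"])
print(countries[-1])              # MX, the last element
print(countries[::-1].tolist())   # ['MX', 'CA', 'US'], reversed order

numbers = np.array([10, 20, 30, 40])
view = numbers[1:3]        # a view into numbers, not a copy
view[0] = 99               # writes through to the original array
print(numbers.tolist())    # [10, 99, 30, 40]

independent = numbers[1:3].copy()   # an explicit copy is independent
independent[0] = -1
print(numbers.tolist())    # [10, 99, 30, 40], unchanged
```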
Slicing in multidimensional arrays
You can convert a 1D array to a 2D array with NumPy's reshape function. Here is an example:
x = np.reshape(np.arange(0, 12), (3, 4))
x
Output:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
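You can also reshape by setting the shape property of an existing array, as shown below. A separate variable y is used here so that x keeps the 3 x 4 shape used in the slicing examples that follow:
y = np.array([2, 3, 5, 8, 0, 1])
y.shape = (3, 2)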
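Arrays also have a reshape method, which returns a reshaped array and leaves the original untouched. The sketch below uses a hypothetical flat array and shows that one dimension can be given as -1 so that NumPy infers it from the total size.

```python
import numpy as np

flat = np.arange(12)

by_method = flat.reshape(3, 4)    # same result as np.reshape(flat, (3, 4))
inferred = flat.reshape(-1, 4)    # -1 asks NumPy to work out the row count

print(by_method.shape)   # (3, 4)
print(inferred.shape)    # (3, 4)
print(flat.shape)        # (12,), the original array keeps its shape
```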
With multidimensional arrays, you can slice by providing the index values for each dimension. Here are some examples:
| Example | Output | Comments |
|---|---|---|
| x[1] | array([4, 5, 6, 7]) | gets the second row completely |
| x[1:] | array([[ 4, 5, 6, 7], [ 8, 9, 10, 11]]) | gets all the rows starting from the second |
| x[:2] | array([[0, 1, 2, 3], [4, 5, 6, 7]]) | gets all the rows up to index 2, excluding 2 |
| x[:,2] | array([ 2, 6, 10]) | gets all the rows of the third column |
| x[:,(1,3)] | array([[ 1, 3], [ 5, 7], [ 9, 11]]) | gets all the rows of the two columns listed after the : |
| x[(2):,(1,3)] | array([[ 9, 11]]) | gets the two given column values (1,3) of the third row (2) |
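Note: Adding a comma after the : changes the selection completely! For example, x[:2] selects the first two rows, whereas x[:,2] selects the third column of every row.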
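The effect of the comma can be checked directly by comparing the shapes of the two selections. This sketch rebuilds the same 3 x 4 array so that it runs on its own.

```python
import numpy as np

x = np.reshape(np.arange(0, 12), (3, 4))

first_two_rows = x[:2]     # one index: slices along the rows only
third_column = x[:, 2]     # two indices: all rows, column 2

print(first_two_rows.shape)    # (2, 4)
print(third_column.shape)      # (3,)
print(third_column.tolist())   # [2, 6, 10]
print(x[1, 2])                 # 6, a single element at row 1, column 2
```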
Using an ellipsis to select all dimensions except the innermost one
In NumPy, you can use an ellipsis (...) to select full slices (:) of all the outer dimensions and select only a few indices of the inner dimension after the ellipsis. Here is an example:
a = np.arange(16).reshape(2, 2, 2, 2)
a[..., 0]
Output:
array([[[ 0,  2],
        [ 4,  6]],

       [[ 8, 10],
        [12, 14]]])
The above slice selects all indices of the outer dimensions and only the 0th index of the innermost dimension.
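One way to confirm what the ellipsis stands for is to compare it with the fully written-out slice. A minimal sketch on the same four-dimensional array:

```python
import numpy as np

a = np.arange(16).reshape(2, 2, 2, 2)

with_ellipsis = a[..., 0]
with_colons = a[:, :, :, 0]    # the ellipsis expands to these three full slices

print(np.array_equal(with_ellipsis, with_colons))   # True
print(with_ellipsis.shape)                          # (2, 2, 2)

# The ellipsis can also follow an index, covering the remaining dimensions
print(a[0, ...].shape)                              # (2, 2, 2)
```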
Logical Operators
You can apply logical expressions to array data. Here are some examples:
| Example | Output | Comments |
|---|---|---|
| x[x>9] | array([10, 11]) | gets all array elements meeting the given expression |
| np.any([x>2]) | True | gets a single result for the entire array |
| np.all([x>2], axis=1) | array([[False, False, False, True]]) | the list wrapper adds a leading axis, so axis=1 runs over the rows of x and gives one result per column |
| np.any([x>2], axis=0) | array([[False, False, False, True], [ True, True, True, True], [ True, True, True, True]]) | axis=0 is the length-1 axis added by the list wrapper, so the result is the expression applied to each element |
| x[np.logical_and(x > 2, x < 9)] | array([3, 4, 5, 6, 7, 8]) | 'and' operator applied between two conditions on each element |
| np.logical_or(x < 4, x > 6) | array([[ True, True, True, True], [False, False, False, True], [ True, True, True, True]]) | 'or' operator that returns a boolean array by applying the expression to each element |
Similarly, you have logical_not and logical_xor that apply the not and xor operators, respectively.
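A short sketch of these two functions on the same 3 x 4 array, using conditions chosen only for illustration:

```python
import numpy as np

x = np.reshape(np.arange(0, 12), (3, 4))

# logical_not inverts a condition element by element (here: same as x >= 4)
not_small = np.logical_not(x < 4)

# logical_xor is True where exactly one of the two conditions holds
exactly_one = np.logical_xor(x > 2, x < 9)

print(x[not_small].tolist())     # [4, 5, 6, 7, 8, 9, 10, 11]
print(x[exactly_one].tolist())   # [0, 1, 2, 9, 10, 11]
```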
Note:
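You cannot combine element-wise conditions with Python's own or and and keywords, as in the example below:
x[(x > 9) or (x < 2)]
This raises a ValueError because Python's or needs a single truth value for each operand, and NumPy cannot determine whether you want a vector of booleans or a single boolean.
To combine the conditions element by element, use np.logical_or (or the | operator on the parenthesized conditions). If you only need a single boolean, use np.any to test whether any condition holds anywhere in the array, or np.all to test whether every condition holds everywhere, as shown below:
np.any([x > 9, x < 2])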
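For element-wise combinations, NumPy also overloads the | and & operators on boolean arrays. The sketch below shows them next to the failing or expression; the parentheses around each comparison are required because | and & bind more tightly than the comparison operators.

```python
import numpy as np

x = np.reshape(np.arange(0, 12), (3, 4))

# Element-wise 'or' and 'and' on boolean arrays
either = x[(x > 9) | (x < 2)]
both = x[(x > 2) & (x < 9)]
print(either.tolist())   # [0, 1, 10, 11]
print(both.tolist())     # [3, 4, 5, 6, 7, 8]

# Python's 'or' asks each array for a single truth value and fails
raised = False
try:
    x[(x > 9) or (x < 2)]
except ValueError as err:
    raised = True
    print("ValueError:", err)
```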
NumPy arrays can only be homogeneous
With a NumPy array, you cannot construct an array with heterogeneous data as you can with Python lists. Here is an attempt to construct one, showing how the data types are converted:
mixed_array = np.array(["1", 2, np.nan])
print(mixed_array)
print(type(mixed_array[1]))
Output:
['1' '2' 'nan']
<class 'numpy.str_'>
Note that all values are converted to strings.
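The same conversion happens with purely numeric input: NumPy picks one type that can represent every element. A minimal sketch with two hypothetical arrays:

```python
import numpy as np

# Integers mixed with a float are all stored as floats
int_and_float = np.array([1, 2.5, 3])
print(int_and_float.dtype)      # float64
print(int_and_float.tolist())   # [1.0, 2.5, 3.0]

# A boolean mixed with integers is stored as an integer
bool_and_int = np.array([True, 2, 3])
print(bool_and_int.dtype)       # the platform's default integer type
print(bool_and_int.tolist())    # [1, 2, 3]
```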
Convenience functions commonly used
| Example | Output | Comments |
|---|---|---|
| np.arange(2) | array([0, 1]) | similar to Python's range, except that you get an ndarray instead of a range object |
| np.arange(0, 1, 0.3) | array([0. , 0.3, 0.6, 0.9]) | same as above, but with explicit start, stop (excluded), and step values |
| np.zeros(2) | array([0., 0.]) | returns an array of zeros of the given size |
| np.zeros((2,3)) | array([[0., 0., 0.], [0., 0., 0.]]) | returns a multidimensional array of zeros of the given shape |
| np.ones(2) | array([1., 1.]) | same as np.zeros, but filled with ones; multidimensional shapes work here too |
| x = np.array([3, 4]); np.any(x>0) | True | returns True if the expression is true for any element |
| np.random.default_rng().random(2) | array([0.xxxxxxx, 0.xxxxxx]) | returns an ndarray of the given size with uniformly distributed random floats from 0 (included) to 1 (excluded) |
| np.random.default_rng().integers(2, 10) | x | returns a random integer x from 2 (included) to 10 (excluded) |
| np.random.default_rng().standard_normal(10) | array(10 floats) | returns an ndarray of the given size with floats drawn from the standard normal distribution (mean 0, standard deviation 1) |
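In practice, a generator is usually created once and reused, optionally with a seed so that results can be reproduced. The sketch below assumes an arbitrary seed value of 42; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is an arbitrary choice

uniform = rng.random(2)                # floats in the interval [0, 1)
integer = rng.integers(2, 10)          # one integer from 2 to 9
normal = rng.standard_normal(10)       # mean 0, standard deviation 1; not bounded

print(uniform)
print(integer)
print(normal.shape)                    # (10,)

# Two generators built from the same seed produce the same numbers
same = np.array_equal(np.random.default_rng(42).random(2),
                      np.random.default_rng(42).random(2))
print(same)                            # True
```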
Note:
The legacy np.random.random function should be avoided in favor of a generator created with default_rng. Refer: https://numpy.org/doc/stable/reference/random/index.html
- Reference for more math functions: https://numpy.org/doc/stable/reference/routines.math.html