DataFrames
A DataFrame is a two-dimensional, labeled data structure whose columns are Series objects that share one index. DataFrames are used to represent tabular data. Similar to a table, a DataFrame has columns, column names, and rows with index labels. Here is a simple example:
import pandas as pd

# Each dict key becomes a column label; each list holds that column's values.
df = pd.DataFrame(
    {
        "marks": [70, 66, 100, 88],
        "age": [29, 32, 31, 28],
        "sex": ["F", "M", "F", "F"],
        "name": ["Jane", "John", "Sally", "Sandy"],
        "ssn": ["1234", "3456", "4567", "5678"],
    }
)
print(df)
Output:
   age  marks   name sex   ssn
0   29     70   Jane   F  1234
1   32     66   John   M  3456
2   31    100  Sally   F  4567
3   28     88  Sandy   F  5678

In the example above, name, ssn, age, marks, and sex are the column names (labels). Rows have index numbers from 0 to 3. The columns appear in alphabetical order here because older pandas releases sorted the keys of the dict. With pandas 0.23 or later on Python 3.6 or later, the columns keep the order in which the dict lists them (marks, age, sex, name, ssn).

You can get the values of an entire row or column by its label. Whenever you retrieve row or column values, you receive a Series object with the respective values.

## Get Column values from column name (label)

Using a column label, you can retrieve all the values of that column. Here is the code to get all the values of the age column:
col_values = df["age"]
print(type(col_values))
Output:
<class 'pandas.core.series.Series'>
Notice that the retrieved value is a Series object. Let's now print the values of the Series object.
print(col_values)
Output:
0    29
1    32
2    31
3    28
Name: age, dtype: int64

## nlargest and nsmallest

The `nlargest` and `nsmallest` methods that you used on a Series object also work on DataFrames. You have to provide the number of largest records to retrieve, along with the column(s) that should be used for sorting to find the largest values. Here is an example:
import pandas as pd
df = pd.DataFrame({"marks": [70, 66, 100, 100, 88], "age": [29, 32, 30, 31, 28]})
df.nlargest(3, "marks")
Output:
   marks  age
2    100   30
3    100   31
4     88   28
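The text names `nsmallest` but only demonstrates `nlargest`, so here is a minimal sketch of the counterpart on the same data. It also shows the `keep` argument, which decides which row wins when values tie. The variable names `lowest` and `top_last` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"marks": [70, 66, 100, 100, 88], "age": [29, 32, 30, 31, 28]})

# The two rows with the lowest marks, smallest first.
lowest = df.nsmallest(2, "marks")
print(lowest)

# Rows 2 and 3 tie at 100 marks. The default keep="first" would pick row 2;
# keep="last" picks the later row instead.
top_last = df.nlargest(1, "marks", keep="last")
print(top_last)
```

Both calls return new DataFrames and leave `df` unchanged.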
You can specify multiple columns as well. If you want to get the `nlargest` based on both the `marks` and `age` columns, you would change the `columns` value to a list of column names, as shown below:
import pandas as pd
df = pd.DataFrame({"marks": [70, 66, 100, 100, 88], "age": [29, 32, 30, 31, 28]})
df.nlargest(3, ["marks", "age"])
Output:
   marks  age
3    100   31
2    100   30
4     88   28

## Change index

Another useful feature of DataFrames is that you can easily change the current index, or reindex, to make lookups easier. In the first example (the DataFrame with the `name` and `ssn` columns), the rows carry index numbers 0 through 3. However, real-life tables often have a natural primary key, and you may be constantly querying against a specific value of that key. With a DataFrame, you can easily replace the numerical index with any column's values. Let's say we want to replace the row index of that first DataFrame with the `ssn` column. Here is the code to do just that:
df.set_index("ssn", inplace=True)
print(df)
Output:
      age  marks   name sex
ssn
1234   29     70   Jane   F
3456   32     66   John   M
4567   31    100  Sally   F
5678   28     88  Sandy   F

Notice that the existing index is replaced with the values from the ssn column, and the ssn column is no longer part of the DataFrame's data. If you want to keep the ssn column, you can add the keyword argument `drop=False`.
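As a minimal sketch of the `drop=False` variant, the snippet below builds a cut-down version of the example DataFrame (only the `name` and `ssn` columns, an assumption made to keep it short) and calls `set_index` without `inplace=True`, so the result comes back as a new object. The name `indexed` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Jane", "John", "Sally", "Sandy"],
        "ssn": ["1234", "3456", "4567", "5678"],
    }
)

# Without inplace=True, set_index returns a new DataFrame and leaves df untouched.
# drop=False keeps ssn as a regular column in addition to using it as the index.
indexed = df.set_index("ssn", drop=False)

print(indexed.loc["3456", "name"])  # look up a row by its ssn label
print("ssn" in indexed.columns)     # the ssn column is still present
print(list(df.index))               # the original DataFrame keeps its 0..3 index
```

Once the index holds the key, `.loc` can fetch a row directly by that key value.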
Adding the `inplace=True` keyword argument will ensure this index change happens on the existing object instead of creating a new one.

__Alternative way__

You can achieve the same result by using the statements below as well. First, just replace the index with the new column, and then drop the existing ssn column:
df.index = df["ssn"]
df.drop("ssn", axis=1, inplace=True)
print(df)
Output:
      age  marks   name sex
ssn
1234   29     70   Jane   F
3456   32     66   John   M
4567   31    100  Sally   F
5678   28     88  Sandy   F

If you do not add the `inplace=True` argument to the drop command, you will receive a new DataFrame object with the required column dropped. However, if you add this parameter, the existing DataFrame object is modified in place.

Replacing the existing index is one of the most common operations in the data wrangling step.

## Sorting
A DataFrame can be sorted by its index or by one or more columns.
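A minimal sketch of both kinds of sorting follows. It reuses the marks and age data from the `nlargest` example and adds hypothetical string index labels, deliberately out of order, so that sorting by index has a visible effect.

```python
import pandas as pd

df = pd.DataFrame(
    {"marks": [70, 66, 100, 100, 88], "age": [29, 32, 30, 31, 28]},
    index=["d", "b", "e", "a", "c"],  # hypothetical labels, chosen out of order
)

# Sort rows by index label, ascending.
by_index = df.sort_index()
print(by_index)

# Sort by marks (descending), then by age (ascending) to break ties.
# Passing a list to `ascending` sets the direction per column.
by_marks_age = df.sort_values(by=["marks", "age"], ascending=[False, True])
print(by_marks_age)
```

Both calls return new DataFrames, so `df` keeps its original row order.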
Here are the methods:

* `df.sort_index()` - sorts the entire DataFrame in ascending order based on the index.
* `df.sort_values(by=['col1'])` - sorts the entire DataFrame in ascending order based on col1.
* `df.sort_values(by=['col1', 'col2'])` - sorts the entire DataFrame in ascending order based on col1, using col2 to order rows that tie on col1. You can add more columns separated by a comma.

You can add the optional keyword argument `ascending=False` to sort the DataFrame in descending order. Also, using `inplace=True` will sort the existing DataFrame instead of returning a new one.

There are a few more variations in the argument list. Please refer to the complete list here: pandas.DataFrame.sort_values

## Change Column ordering

You can change the column positions using the `reindex` function and setting `axis=1`, as shown below.
df.reindex(df.columns.sort_values(), axis=1)
In the example above, you get a new DataFrame in which all the columns are sorted in ascending order of their labels.
* More reference: [pandas.DataFrame.set_index](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.set_index.html)
- It is important to note that `inplace=True` is an argument you can pass to many DataFrame methods; it makes the method modify the existing DataFrame instead of returning a new one.
- For extra-large datasets, `inplace=True` avoids keeping a second, almost duplicate DataFrame object bound to a name, which can help when memory is tight. pandas may still build a temporary copy internally, so the saving is not guaranteed.
- An index can have duplicate values. To find out whether the index has duplicates, you can use `df.index.duplicated()`, which returns a boolean array with `True` for each index label that repeats an earlier one and `False` otherwise. Refer to: pandas.Index.duplicated
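A short sketch of the duplicate check, using a hypothetical three-row DataFrame whose index repeats the label "1234". The names `mask` and `deduped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"marks": [70, 66, 100]}, index=["1234", "3456", "1234"])

# Quick yes/no check for duplicate labels.
print(df.index.is_unique)

# Boolean mask: True marks a label that already appeared earlier (keep="first").
mask = df.index.duplicated()
print(mask)

# Keep only the first row for each index label.
deduped = df[~mask]
print(deduped)
```

Passing `keep="last"` or `keep=False` to `duplicated()` changes which occurrences are flagged.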