Reading and Writing Files Using Pandas

Python's standard library provides classes and functions for reading from and writing to the file system. A data analyst who works mostly with tabular data rarely needs them, however, because Pandas can build a DataFrame directly from a file in a tabular format.

In this lesson, you will see some examples. Before we dive into reading files, let's understand the most common file formats:

csv - comma-separated values
xml - eXtensible Markup Language
json - JavaScript Object Notation

Construct a DataFrame using the contents of the file

When tabular data is easier to manipulate as a DataFrame, use one of the functions below, depending on the file format.

Reading a CSV file

The most common file format is CSV, which stands for comma-separated values. In this file format, all the column values are separated by a comma. An optional header may be present, in which case the column labels are also separated by a comma. Download the Titanic dataset from here: https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/titanic.csv

Open the downloaded file using a text editor. You will see that all the columns are separated by a comma. A row ends with a newline character, so every row has its own line.
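
To see how the comma and newline conventions map onto columns and rows, here is a minimal sketch that parses a small hand-written CSV string. The sample text and its column names are made up for illustration and are not taken from the Titanic file. io.StringIO wraps the string so that read_csv can treat it like a file.

```python
import io

import pandas as pd

# Hypothetical CSV content: one header line followed by two data rows.
csv_text = "name,age,fare\nAlice,29,7.25\nBob,41,71.28\n"

# read_csv accepts any file-like object, so wrap the string in StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # labels taken from the header line
```

Each newline in the text becomes a row, and each comma separates one column value from the next.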

The Titanic dataset you have been working with so far ships with Seaborn, so you did not have to download it to your machine. To work with the downloaded copy instead, load it with read_csv:

import pandas as pd

titanic_df = pd.read_csv('datasets/titanic.csv')
titanic_df.head(1)

The head(1) method used on the titanic_df DataFrame object returns the first row of the DataFrame. You will learn more about head in the next lesson.

Note: Ensure that the downloaded file is moved to a datasets folder. If you do not have one, create a datasets folder as a child of the folder in which your .ipynb file is present.

You can also create the DataFrame by directly reading the CSV file from its source. Here is the code to do that:

import pandas as pd

titanic_df = pd.read_csv(
    "https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/titanic.csv"
)
titanic_df.head()

Titanic Column Information

If you are curious about what the column labels and values mean in this dataset, here is a rundown:

survival - Survival (0 = No; 1 = Yes)
class - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name - Passenger Name
sex - Passenger Gender
age - Age of the Passenger
sibsp - Number of Siblings/Spouses Aboard with the Passenger
parch - Number of Parents/Children Aboard with the Passenger
ticket - Ticket Number
fare - Passenger Fare
cabin - Cabin
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Dump the contents of a DataFrame to a file

You can conveniently save the contents of a DataFrame to a file on the filesystem using one of the methods below, depending on the desired file format.

Writing to a CSV file

To write a DataFrame to a CSV file, call the to_csv() method on the DataFrame and pass it the file path. A new file is created from the contents of the DataFrame. Here is an example that saves the records of all travelers older than 50 to a separate file:

above50 = titanic_df[titanic_df.age > 50]
above50.to_csv("datasets/above50.csv", index=False)

Output: There is no output in the notebook, but a file named above50.csv will be created in the datasets folder without the index column.
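
To check what index=False changes, the sketch below writes the same DataFrame twice into a temporary folder and reads both files back. The DataFrame contents and the file names are hypothetical stand-ins, not the Titanic data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a filtered DataFrame with a non-default index.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [58, 63]}, index=[10, 20])

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    df.to_csv(with_index)                  # index saved as an extra first column
    df.to_csv(without_index, index=False)  # index left out

    # Read both files back while the temporary folder still exists.
    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

When the index is written, it comes back as an extra unnamed column on the next read, which is usually not what you want for a default integer index.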

Reading a JSON file

To read a JSON file, all you have to do is use read_json instead of read_csv. Here is an example:

import pandas as pd
df = pd.read_json("url-to-json-file")
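
Since the snippet above uses a placeholder instead of a real URL, here is a small sketch you can run without a network connection. It assumes the JSON is a list of records (one object per row), and the sample data is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical JSON text: a list of objects, one object per row.
json_text = '[{"name": "Alice", "age": 29}, {"name": "Bob", "age": 41}]'

# Wrap the string in StringIO so read_json treats it like a file.
people_df = pd.read_json(io.StringIO(json_text))

print(people_df.shape)
print(list(people_df.columns))
```

Each key in the objects becomes a column label, and each object becomes a row.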

Other file formats

While CSV and JSON are the most commonly used file formats, Pandas also provides functions for other data sources, such as HTML, XML, and relational databases. Refer to the official documentation for details: https://pandas.pydata.org/pandas-docs/stable/io.html

Construct a Dictionary object using the contents of a file in JSON format

You can convert the content of a JSON file into a dictionary object using the json module. Here is an example:

from urllib.request import urlopen
import json

my_dictionary = json.load(
    urlopen(
        "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json"
    )
)
type(my_dictionary)

You will see that my_dictionary is a dictionary object.

The urllib.request module provides the urlopen function, which opens a resource given its URL. It returns an http.client.HTTPResponse stream, which is then passed to json.load(). That function parses the JSON read from the response and returns the corresponding Python object, here a dictionary.

Opened resources like the HTTPResponse should be closed even when an exception occurs, which you could do by hand with a try...finally block. Python also provides a convenient with statement that closes opened resources automatically. Here is a better implementation of the above code:

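from urllib.request import urlopen
import json

# The with statement closes the response automatically, even if json.load raises.
with urlopen(
    "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json"
) as response:
    print(type(response))  # http.client.HTTPResponse
    my_dictionary = json.load(response)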
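
To check that the with statement really releases the resource when parsing fails, the sketch below swaps the network response for an in-memory io.StringIO stream holding deliberately invalid JSON. The stream and its contents are stand-ins chosen so the example runs offline.

```python
import io
import json

# Stand-in for a network response: a file-like object with invalid JSON.
stream = io.StringIO("{ this is not valid JSON }")

try:
    with stream:
        json.load(stream)  # raises json.JSONDecodeError
except json.JSONDecodeError as error:
    parse_failed = True
    print("Could not parse:", error.msg)
else:
    parse_failed = False

# The with block closed the stream on the way out, despite the exception.
print(stream.closed)
```

The same pattern applies to the HTTPResponse returned by urlopen, because it also supports the with statement.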

Construct a Dictionary object using a string, bytes, or bytearray that represents JSON

The json module has another convenience function, loads, that can convert a string, bytes object, or bytearray into a dictionary. Here is an example:

learning_center = '{ "school":"mbcc", "country":"USA", "city":"Troy"}'

my_school = json.loads(learning_center)
type(my_school)

In this example, learning_center is a JSON string that is converted to a dictionary using the loads function.
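
The example above passes a string. As a short sketch of the other two input types, the code below feeds the same JSON text to loads as str, bytes, and bytearray. The variable names are chosen for illustration.

```python
import json

text = '{"school": "mbcc", "city": "Troy"}'

from_str = json.loads(text)
from_bytes = json.loads(text.encode("utf-8"))          # bytes input
from_bytearray = json.loads(bytearray(text, "utf-8"))  # bytearray input

# All three calls produce the same dictionary.
print(from_str == from_bytes == from_bytearray)
```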

The complementary function to convert a dictionary to a JSON string is dumps. Here is an example:

my_dictionary = {"school": "mbcc", "country": "USA", "city": "Troy"}

my_school = json.dumps(my_dictionary)
type(my_school)

You will see that my_school is a JSON string.
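
As a quick check that dumps and loads are complementary, this sketch converts a dictionary to JSON text and back. The indent and sort_keys arguments are optional and are used here only to make the printed text easier to read.

```python
import json

original = {"school": "mbcc", "country": "USA", "city": "Troy"}

# Serialize to a formatted JSON string, then parse it back.
as_text = json.dumps(original, indent=2, sort_keys=True)
restored = json.loads(as_text)

print(as_text)
print(restored == original)
```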