Data-Driven Stories

So far, you have learned the techniques of deriving metrics or visualizations in bite-sized solutions. You are also advised to read our EDA eBook to get a basic understanding of the concepts of Exploratory Data Analysis (EDA). In this lesson, you will learn the art of weaving a data-driven story by putting it all together. Although the procedure is given as a series of steps, your data analysis may or may not exactly align with these steps. As long as you derive meaningful insights from your dataset, your analysis is complete, with or without the steps given below.

Start every step with an initial question. Plot self-explanatory diagrams with meaningful labels. End the step with your own conclusion, drawn from the results you observe.

Note: Steps 4 to 8 can be in any order. You may omit some steps altogether if there is no relevant data for that step. The data cleaning in Step 3 may be revisited again and again as you discover more dirty values during the analysis in other steps. You may also add more steps based on your dataset's unique makeup. If you have data from multiple sources, you may consider merging the data and then continuing with your analysis.

Step 1. Data Ingestion

For the most part, open datasets are available in a tabular format. You can load a .csv, .json, or other supported file format into a DataFrame with one of the Pandas reader functions, as described in the File Reading and Writing Chapter.
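As a minimal sketch of ingestion, the example below reads a small CSV and the same table as JSON. The file contents and column names are hypothetical and are held in memory so that the example is self-contained. With a real dataset, you would pass a file path or URL instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded open-data file.
csv_text = """City,Population,Founded
Springfield,116250,1821
Shelbyville,98000,1835
Ogdenville,45100,1902
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# The same table serialized as JSON records, then read back with read_json.
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df.shape)
print(df_from_json.head())
```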

Step 2. Bird's-Eye View

When a dataset is given to you to derive insights, the first step is to get an overview of your data.

Some of the fundamental questions for tabular data could be:

How many rows are present?
How many columns are present?
What is the data type of each column?

You can get the answers to all the questions above by applying the following methods on a Pandas DataFrame, as detailed in the chapter Summary Statistics:

info()
describe()
head()
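The three questions above can be answered roughly as follows. This is a sketch on a small hypothetical DataFrame. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; one missing value is included on purpose.
df = pd.DataFrame({
    "city": ["Springfield", "Shelbyville", "Ogdenville", "Capital City"],
    "population": [116250, 98000, 45100, 512000],
    "area_km2": [85.5, 60.2, None, 310.0],
})

# Row and column counts come straight from the shape attribute.
n_rows, n_cols = df.shape
print(n_rows, "rows,", n_cols, "columns")

# info() prints each column's data type and non-null count.
df.info()

# describe() summarizes the numerical columns (count, mean, quartiles, ...).
summary = df.describe()
print(summary)

# head() shows the first rows so you can eyeball the actual values.
print(df.head(2))
```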

Step 3. Cleaning the Data

Now it is time to start one round of cleaning by fixing the column names, filling in null or empty values, etc., as detailed in the chapter Data Cleaning.
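A first cleaning round might look like the sketch below. The column names and the fill strategy (a placeholder label for text, the median for numbers) are assumptions chosen for illustration. The right choice depends on your dataset.

```python
import pandas as pd

# Hypothetical raw data with untidy column names and missing values.
raw = pd.DataFrame({
    " Product Name ": ["Widget", "Gadget", None, "Widget"],
    "Unit Price": [9.99, None, 4.50, 9.99],
})

# Fix the column names: trim spaces, lower-case, and use underscores.
clean = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Fill missing text with a placeholder label (an assumed convention).
clean["product_name"] = clean["product_name"].fillna("unknown")

# Fill missing numbers with the column median (one common, simple choice).
clean["unit_price"] = clean["unit_price"].fillna(clean["unit_price"].median())

print(clean)
```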

For the next set of steps, you can follow these guidelines:

Ask some initial questions (1-2 max) and start analyzing the data to answer those specific questions. Asking questions is the hardest part. It comes with practice. Start by first plotting diagrams based on data types, as that may open doors for some questions.
Have some possible hunches in mind for your answers, and then find the answers from the data. Look at multiple independent variables (variables could be other column values, or you may have to gather additional data to supplement your initial data) that could be contributing to the answers to your questions.
Find some initial results and assess how confident you are with them. What would make you more confident with the results? Try to think if there are other lurking variables that you did not consider.
Present your findings using single-variable (1D) and multi-variable (2D) visualization charts.
Finally, reflect on your prior hunches and what the actual answers were. In almost all cases, you will now have more questions, even though you answered your initial ones.
Repeat the cycle until you have reasonably understood the data on hand and are able to show your findings through visualizations along the way.
List the assumptions you made along the way, the limitations of your findings, and other extraneous factors that could influence your findings.

Step 4. Numerical Data Analysis

If your dataset has any numerical values, this step is essential. Note, however, that some columns might be ingested into the DataFrame with a non-numeric data type even though they should be numeric. In that case, go back to the previous step and convert that column's values to numeric. For example, a price that contains a currency symbol and/or a comma (such as $1,200) might be ingested as a non-numeric type, and it needs to be converted to numeric before you can perform arithmetic operations on it. If you don't have any numerical values to consider, move to the next step.
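The conversion could be sketched as below, assuming the price strings contain a dollar sign and thousands-separator commas. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical price column that was ingested as text.
df = pd.DataFrame({"price": ["$1,200.50", "$85", "n/a", "$3,000"]})

# Strip the dollar sign and commas so that only digits and the decimal point remain.
stripped = df["price"].str.replace(r"[$,]", "", regex=True)

# Convert to numbers; values that still cannot be parsed become NaN.
df["price_num"] = pd.to_numeric(stripped, errors="coerce")

print(df)
print(df["price_num"].sum())
```

Using errors="coerce" keeps the conversion from failing on dirty values such as "n/a", which you can then handle in another cleaning pass.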

At this point, you will get a fair idea of which numerical column you would like to explore more. Find the outliers, mean, median, distribution, etc., by plotting a Box and Whisker plot or a Histogram, as detailed in the chapter Simple Visualization.
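The sketch below computes the same quantities that a Box and Whisker plot displays, using hypothetical delivery times. The 1.5 x IQR rule for flagging outliers is a common convention, not the only possible definition.

```python
import pandas as pd

# Hypothetical numerical column with one suspiciously large value.
values = pd.Series([12, 15, 14, 13, 16, 15, 14, 90], name="delivery_minutes")

mean = values.mean()
median = values.median()

# Quartiles and the interquartile range (IQR).
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences: points beyond 1.5 * IQR from the quartiles are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("mean:", mean, "median:", median)
print("outliers:", outliers.tolist())
```

A large gap between the mean and the median, as in this example, is itself a hint that the distribution is skewed or contains outliers. If matplotlib is installed, values.plot.box() and values.plot.hist() draw the corresponding charts.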

You may also want to segment any continuous data with some criteria, if applicable, as detailed in the chapter Simple Queries.
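One way to segment continuous data is to bin it with pd.cut and then query a segment, as sketched below. The age boundaries and labels are hypothetical criteria chosen for illustration.

```python
import pandas as pd

# Hypothetical continuous column.
df = pd.DataFrame({"age": [5, 17, 18, 34, 52, 70]})

# Assumed segmentation criteria: [0, 18), [18, 65), [65, 120).
bins = [0, 18, 65, 120]
labels = ["minor", "adult", "senior"]

# right=False makes each bin include its left edge, so 18 counts as "adult".
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Count the rows per segment, in bin order.
counts = df["age_group"].value_counts().sort_index()

# Select a single segment with a boolean mask.
adults = df[df["age_group"] == "adult"]

print(counts)
print(adults)
```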

Step 5. Categorical Data Analysis

If any of the column values are categorical, you can create bar charts using the Seaborn library. If the category labels are too big, consider a horizontal bar chart. If you see part-to-whole scenarios, consider a stacked bar chart. Refer to the chapter Visualization with Seaborn.
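The sketch below draws a horizontal bar chart with Seaborn and a stacked bar chart for a part-to-whole view. The data is hypothetical. Note that the stacked chart uses Pandas plotting on a cross-tabulation rather than a Seaborn function, which is one of several possible ways to build it.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical service requests with long category labels.
df = pd.DataFrame({
    "department": ["Parks and Recreation", "Public Works", "Parks and Recreation",
                   "Public Library", "Public Works", "Parks and Recreation"],
    "status": ["open", "closed", "closed", "open", "open", "open"],
})

fig, (ax_count, ax_stack) = plt.subplots(1, 2, figsize=(12, 4))

# Horizontal bars: putting the category on the y-axis keeps long labels readable.
sns.countplot(data=df, y="department", ax=ax_count)
ax_count.set(title="Requests per department",
             xlabel="Number of requests", ylabel="Department")

# Part-to-whole: count each status within each department, then stack the bars.
shares = pd.crosstab(df["department"], df["status"])
shares.plot(kind="barh", stacked=True, ax=ax_stack)
ax_stack.set(title="Request status within each department",
             xlabel="Number of requests", ylabel="Department")

fig.tight_layout()
print(shares)
```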

Step 6. Time Series Plots

If you see any date/time-related column values, plot a line chart as explained in the chapter Time Series.
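A time series line chart could be sketched as follows. The dates and sales figures are hypothetical, and monthly totals are just one possible aggregation.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical transactions with dates ingested as text.
df = pd.DataFrame({
    "date": ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-25", "2023-03-11"],
    "sales": [100, 150, 120, 80, 200],
})

# Convert the text column to real timestamps before doing any time-based work.
df["date"] = pd.to_datetime(df["date"])

# Aggregate to monthly totals ("MS" labels each bucket by its month start).
monthly = df.set_index("date")["sales"].resample("MS").sum()

# Line chart of the monthly totals with labelled axes.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(monthly.index, monthly.values, marker="o")
ax.set(title="Monthly sales", xlabel="Month", ylabel="Sales")
fig.autofmt_xdate()

print(monthly)
```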

Step 7. Geospatial Diagrams

If you see any latitude and longitude values, or county, state, or country values worth investigating, then apply the techniques as detailed in the chapter Geospatial Diagrams.

Step 8. Multivariate Relationships

When a tabular dataset contains several numerical variables per observation, it is a good idea to check for possible correlations between them. You can first select a subset of columns of interest and create a correlation heatmap, as explained in the chapter Visualization with Seaborn.

Using the heatmap as a guide, pick two variables that show a strong correlation and consider plotting them with a Seaborn lmplot or regplot to show the positive or negative relationship.
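The heatmap and regression plot might be sketched as below. The data and column names are hypothetical. The sketch uses regplot because it draws onto an existing axes, whereas lmplot creates its own figure.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical observations with four numerical variables.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 68, 70, 79],
    "hours_gaming": [6, 5, 5, 3, 2, 1],
    "shoe_size": [40, 43, 38, 42, 39, 41],
})

# Pairwise Pearson correlation between all numerical columns.
corr = df.corr()

fig, (ax_heat, ax_reg) = plt.subplots(1, 2, figsize=(11, 4))

# Heatmap fixed to the [-1, 1] range so the colors are comparable across datasets.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax_heat)
ax_heat.set_title("Correlation heatmap")

# Scatter plot with a fitted regression line for one strongly correlated pair.
sns.regplot(data=df, x="hours_studied", y="exam_score", ax=ax_reg)
ax_reg.set(title="Exam score vs. hours studied",
           xlabel="Hours studied", ylabel="Exam score")

fig.tight_layout()
print(corr.round(2))
```

Keep in mind that a strong correlation only shows that two variables move together. It does not show that one causes the other, so look out for lurking variables before drawing conclusions.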

Step 9. Insights!

Congratulations! If you have made it this far, you have tamed the beast!
By creating many different diagrams and charts, you should now have a fair idea of your dataset and, hopefully, have made some interesting discoveries about the data.

However, not all diagrams and analyses will be meaningful. You started every analysis with a simple question and concluded it with your observation. In this last step, explain only the meaningful insights that you derived and leave out the irrelevant ones.

In essence, you have arrived at the Explanatory Phase by taking the help of the Exploratory Phase. In this phase, provide a final conclusion summarizing the salient features of your findings that are worthy of weaving into a story, thereby adding your own unique insights to the data!

If nothing interesting emerges from the data, revisit the steps above by asking more questions, getting more data, and digging deeper into your analysis, which might take you on another unique discovery path. Repeat the steps as needed until you are satisfied with your findings and are ready to share them and/or your hypothesis with the whole world!

Remember, however, that many EDA exercises only lead to more analysis and more data collection, after which a hypothesis is tested with inferential statistical techniques. Only then should a business decision be made.

Examples

Every dataset is unique, and hence the steps you take for your analysis will be unique as well. Some datasets are already fairly clean and need little cleaning, while others need significant cleaning before a detailed analysis can be done.

Every analyst's train of thought and the questions they come up with are different as well. Hence, even if two analysts are given the same dataset, they typically come up with their own unique charts and plots and derive their own unique insights. Even after an analyst completes an analysis, another analyst may see opportunities to enhance the analysis further and derive even more insights. A good analyst can almost always enhance any analysis!

Our students have created amazing analytics on several open datasets. You can find them at: Mbcc Talent Website

Open Datasets

Many governments all over the world are making their data 'open', thereby contributing to the public domain. Anyone can freely use and analyze 'open data' without any restrictions, except for, in some cases, adding an attribution statement when published on the web. US Govt Data, UK Govt Data, and India Govt Data are some examples of open data.

The reason for the open data movement is broadly twofold. First, in many countries, government organizations that receive taxpayers' money are obligated to make non-sensitive information public. Second, people are realizing that there is a wealth of information hidden in the massive amount of data that organizations have collected over many years. By making it public, governments allow innovative private companies and citizens to derive insights from the data and share them back, which helps governments make better-informed decisions for their citizens and the rest of the world.

Below, you will find a few other open datasets that you can use for your analysis:

Other API References

To extract data from an HTML file (for screen scraping applications), refer to the Beautiful Soup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
To extract PDF data, refer to: https://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Using_Python_to_Extract_Tables_From_PDFs.php
To automatically generate all kinds of diagrams that are possible on your dataset, refer to: https://github.com/AutoViML. Note, however, that it is important to understand what you are trying to generate. Libraries like these make plotting simple and are a useful starting point, but you should dig deeper to truly understand your data.

Exercise

Challenge yourself to derive insights from any open dataset of your choice by applying all the techniques you have learned so far (a dataset outside of the list given above is also fine). Then, take it a step further: refer to the internet and enhance your analysis by going above and beyond what you have learned in these eBooks, and come up with your own unique story!