Case Study: Example of a Brief Data Science Project

To conclude this map, I’d like to walk you through an actual data science project. I’ll be using my work from an article posted at visualizecuriosity.com, my personal website. There you’ll find many other articles that describe the data science process behind charts and results like the ones you’ll see below. Keep in mind that there are many possible applications, and many languages to use! This example focuses on data visualization and uses the programming language R. There are many other data science projects out there that also offer step-by-step insights - check out these ones to start.


In this project, I began with the question: with the COVID-19 pandemic grinding life to a halt in March 2020, what data was available to show its economic impact day by day? Once you have your question in mind, the first step for most data science projects is to find and gather the data. Since my goal was to find the most up-to-date numbers available, I had to search for data that was updated on a weekly or even daily cycle. With some googling and exploring, I found two excellent public datasets: Homebase data on hours worked by employees and OpenTable data on the number of customers visiting restaurants.

Snippet of the OpenTable data


Once you’ve collected the data you think you’ll need, it’s time to clean it. In my case, I knew I wanted to plot the data over time, comparing the pre-lockdown trend to the post-lockdown one, and by state, to see if the different lockdown rules were affecting business activity. So after I downloaded the datasets made available on each company’s website (as Excel spreadsheets), I imported them into R to start cleaning. I wrote some lines that reshaped the data into the form those plots needed, and also added a few commands to clean up the variable names and rescale some variables.
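To make the reshaping step concrete, here is a minimal sketch in base R. It is not the script from this project: the toy data frame, the column names (`State.Name`, `pct_change`, `change`), and the assumption that the spreadsheet arrives with one column per date are all made up for illustration.

```r
# Toy stand-in for a wide spreadsheet: one row per state, one column per date.
# All names and values here are hypothetical, not the real OpenTable data.
wide <- data.frame(
  State.Name = c("New York", "Texas"),
  `2020-03-01` = c(-2, 1),
  `2020-03-15` = c(-48, -35),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Clean the identifier column name.
names(wide)[names(wide) == "State.Name"] <- "state"

# Reshape to long form: one row per state and date, which is easier to plot over time.
date_cols <- setdiff(names(wide), "state")
long <- data.frame(
  state = rep(wide$state, times = length(date_cols)),
  date = as.Date(rep(date_cols, each = nrow(wide))),
  pct_change = unlist(wide[date_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Rescale from percentage points to a proportion (for example, -48 becomes -0.48).
long$change <- long$pct_change / 100

# Sort so each state's rows run in date order.
long <- long[order(long$state, long$date), ]
```

With real spreadsheets you would first read the file into R (for example with a package such as readxl) and then apply the same kind of steps; packages like tidyr offer shorter ways to do the wide-to-long reshape.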

My R code to import and start investigating and cleaning the OpenTable data


With your cleaned data in hand, you can now start visualizing! I had come into this question with several charts already in mind, such as line charts that simply plotted business traffic over time, and geographic maps of the US to allow for comparison among all the states. As I spent time “up close with the data” - gathering it and cleaning it so that I learned what it actually looked like - I had new ideas for visualizations I wanted to make. I realized that stock market performance was another public source of data that updated daily, and that if I normalized all my various trends to be relative to their pre-pandemic level, I could compare a stock market index directly against the business traffic measures. Often a data science project is shaped by what data is actually available, what form it’s in, and what it has to say. As the saying goes, you don’t want to strangle the data; you want to let it speak for itself. As you explore the data, you may be inspired with new questions or data analysis ideas - this is where having a curious mindset comes in handy.
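The normalization idea can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the code behind the article: the numbers are invented, the series names are hypothetical, and the choice of March 1, 2020 as the end of the "pre-pandemic" baseline is mine.

```r
# Toy data: two series on very different scales (all values are made up).
trends <- data.frame(
  date = rep(as.Date(c("2020-02-03", "2020-02-10", "2020-03-23", "2020-03-30")), times = 2),
  series = rep(c("stock_index", "hours_worked"), each = 4),
  value = c(3200, 3300, 2275, 2600, 40, 40, 18, 22),
  stringsAsFactors = FALSE
)

# Assumed cutoff separating "pre-pandemic" observations from the rest.
baseline_end <- as.Date("2020-03-01")

# Baseline = mean of each series before the cutoff.
pre <- trends[trends$date < baseline_end, ]
baseline <- tapply(pre$value, pre$series, mean)

# Express every value as a percentage of its own series' baseline (baseline = 100).
trends$index <- 100 * trends$value / as.numeric(baseline[trends$series])
```

After this step both series sit on the same scale, so a value of 70 means "70% of the pre-pandemic level" whether the underlying measure is an index level or hours worked, and the two lines can share one chart.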

From R code....


....to data visualizations!


The final step was writing about my charts and sharing my results. The end product is the article you can read here. You can also view the entirety of my R script used for this project on my GitHub page. Altogether, it probably took me about a week, with several hours spent each day, to get from a question to the completed article full of charts and discussion. Most of that time was spent cleaning the data and customizing my charts to look exactly how I wanted. Some projects I’ve completed in just a couple of days, while others have taken me weeks just to find and gather the right data! With data science, the possibilities are nearly endless, and the only restriction on what can be done is your own creativity.

