What I Learned Scraping Data for my Capstone

After 12 intense weeks of learning, I graduate from BrainStation’s Data Science bootcamp this week.

My capstone project is “Predicting Coffee Ratings from Expert Reviews”. You can check out the full project on my GitHub.

Picking a Topic & Scraping the Data

In selecting a topic, I had a few parameters in mind. I wanted to:

Work on data that was relatively unexplored
Select a topic that was easy to understand so that I could focus on the technical skills instead of understanding the topic

In addition, a prior classmate scraped some data for a project we worked on together. I was intrigued by the idea and wanted to learn how to do this myself.

Separately, I came across some projects looking at coffee ratings. Many of the datasets behind them seemed extensively explored, and I wasn’t happy with their quality. But I do enjoy coffee, and the topic fit my “easy to understand” criterion. So, I decided to scrape my own data from CoffeeReview.com.

What Did I Learn?

Beautiful Soup is an awesome library
Scraping messy data was pretty straightforward, but it was much more challenging to scrape the specific data I wanted
Scraping can be time-consuming
Scraping in smaller pieces reduced the time wasted troubleshooting errors
A little help can go a long way

My first attempt at scraping data was challenging and exciting. It was surprisingly easy to scrape data, but it came out in a very messy format. I wanted to be more precise to make cleaning easier later. I spent a couple of days banging my head against the wall, and then asked a web developer for some help understanding the HTML. Once I had a baseline for what I was looking for, I was on a roll. I could home in on exactly what I needed, and it was so satisfying.
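To illustrate the difference between grabbing everything and targeting specific elements, here is a minimal Beautiful Soup sketch. The HTML snippet, tag names, and class names are invented for the example and are not CoffeeReview.com's actual markup, so treat it as an illustration of the idea rather than the scraper used in the project.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tags and class names below are made up for
# illustration and do not come from the real site.
SAMPLE_HTML = """
<html><body>
  <nav>Home | Reviews | About</nav>
  <div class="review">
    <h1 class="review-title">Example Roast</h1>
    <span class="rating">93</span>
    <table class="details">
      <tr><td>Roaster Location:</td><td>Example City</td></tr>
      <tr><td>Roast Level:</td><td>Medium-Light</td></tr>
    </table>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Broad approach: easy to write, but navigation text and labels are all
# mixed together and have to be cleaned up later.
messy_text = soup.get_text(" ", strip=True)

# Targeted approach: once the structure of the HTML is understood,
# CSS selectors can pull out exactly the fields of interest.
title = soup.select_one("h1.review-title").get_text(strip=True)
rating = int(soup.select_one("span.rating").get_text(strip=True))

details = {}
for row in soup.select("table.details tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    details[cells[0].rstrip(":")] = cells[1]

review = {"title": title, "rating": rating, **details}
print(review)
```

The targeted version takes more up-front work reading the page source, but each field lands in its own key, which makes the later cleaning step much lighter.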

Another challenge was that scraping was time-consuming. At first I tried to scrape a lot at once, only to have errors appear hours later and have to start over. It turned out that some of the older webpages had a different format than more recent ones, so I changed tactics. I created two ‘scraper’ functions. One requested more details and the other fewer to accommodate the content changes. I then ran the scraper in smaller chunks. I also added exception handling so that the scraper would continue if it hit an IndexError (meaning some of the data I requested was missing).
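The pattern described above (two scrapers, small chunks, and skipping pages that raise an IndexError) might look roughly like the sketch below. Everything here is hypothetical: the function names, the field order, and the FAKE_PAGES dictionary, which stands in for downloading and parsing a real page so the example runs without network access.

```python
# Hypothetical stand-in for fetching and parsing a page: each "page" is
# already a list of extracted fields. A real scraper would download the
# page and extract these values from the HTML instead.
FAKE_PAGES = {
    "new-1": ["Example Roast A", "93", "Medium-Light", "Example City"],
    "new-2": ["Example Roast B", "91", "Medium", "Sample Town"],
    "new-3": ["Example Roast C", "90", "Light"],  # location missing
    "old-1": ["Example Roast D", "88"],
    "old-2": ["Example Roast E"],                 # rating missing
}


def fetch_fields(url):
    return FAKE_PAGES[url]


def scrape_detailed(fields):
    """Scraper for the newer page layout, which carries more fields."""
    return {
        "name": fields[0],
        "rating": int(fields[1]),
        "roast": fields[2],
        "location": fields[3],
    }


def scrape_basic(fields):
    """Scraper for the older page layout, which carries fewer fields."""
    return {"name": fields[0], "rating": int(fields[1])}


def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def scrape_chunk(urls, scraper):
    """Scrape one chunk, recording pages with missing data instead of crashing."""
    rows, skipped = [], []
    for url in urls:
        try:
            rows.append(scraper(fetch_fields(url)))
        except IndexError:
            # An expected field was not on the page; note it and move on.
            skipped.append(url)
    return rows, skipped


newer_urls = ["new-1", "new-2", "new-3"]
older_urls = ["old-1", "old-2"]

all_rows, all_skipped = [], []
for urls, scraper in [(newer_urls, scrape_detailed), (older_urls, scrape_basic)]:
    for batch in chunks(urls, 2):
        rows, skipped = scrape_chunk(batch, scraper)
        all_rows.extend(rows)
        all_skipped.extend(skipped)

print(len(all_rows), "scraped; skipped:", all_skipped)
```

Because each chunk is small and failures are recorded rather than raised, a problem page costs one skipped entry instead of hours of lost work, and the skipped list shows which pages to inspect afterwards.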

In the end, I was able to scrape ~6,500 reviews, which became my dataset for the project. I’m glad I took on the challenge of scraping my own data, and it was fun to work on a unique dataset.

Written on November 9, 2022