Web Scraping 101
Today, I would like to learn what web scraping is. The term has been floating around in my head for a while. Perhaps I have unconsciously expected that this technique could make my life easier, especially when I am looking for information such as food recipes.
So, let’s go learn about it.
What is Web Scraping?
So, what is web scraping? What does this term even mean?
In short, web scraping is a technique for extracting data from websites. In a little more detail, we use a web scraping tool (usually called a bot or web scraper) to access a website and copy data from it.
How do we get the data then?
We need to access the website first. Then, we decide what data we want to scrape. For example, we can extract food recipes, product images and prices, sales data from online stores, and more.
Let’s imagine that we decided to scrape a Pad Thai recipe (only because it is my soul food). We search for “Pad Thai recipe” on Google and click one of the results. In our example, we open the recipe on “Tastes Better From Scratch”.
Easy Homemade Pad Thai - Tastes Better from Scratch
We really like the recipe, so we want to scrape it. However, we just realized that we don’t have any tools for scraping. Here are two pretty useful ones that I found:
axios: a Promise-based HTTP client for the browser and Node.js.
cheerio: jQuery for Node.js. Cheerio makes it easy to select, edit, and view DOM elements.
What is a Promise-based HTTP client? And what is jQuery for Node.js? We feel a little lost here. Don’t worry! Let’s not dig into those too deeply. We only want to know what is happening. Simple is our motto for today.
First things first. Think about what we do when we want to get something. We REQUEST it. The same concept applies here. If we want to get information from a website, we send a REQUEST to the website. Then what happens? A RESPONSE follows. We request something, and a response comes back. axios handles this conversation for us.
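The request and response conversation above can be sketched in a few lines. This is only a minimal sketch: `fetchHtml` is a made-up helper name, the URL in the usage comment is a placeholder, and it assumes axios has been installed with `npm install axios`.

```javascript
// Hypothetical helper: sends a GET request and returns the page's HTML.
// axios is only loaded when no other HTTP client is passed in.
async function fetchHtml(url, httpClient = require("axios")) {
  // REQUEST: ask the server for the page at `url`.
  const response = await httpClient.get(url);
  // RESPONSE: axios stores the response body (here, the HTML text) in `data`.
  return response.data;
}

// Example usage (placeholder URL, swap in the address of the recipe page):
// fetchHtml("https://example.com/pad-thai").then((html) => console.log(html.length));
```

Because axios is Promise-based, the result is not available immediately, which is why the sketch uses `async` and `await`.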
With axios, we just got the entire content of the page. It arrives as a Hypertext Markup Language (HTML) document. This document includes not only the recipe, but also information we don’t need, such as comments, other recommended recipes, and more. Reading through thousands of lines to find our target information does not make any sense. And here comes the scraper tool, cheerio.
Cheerio parses the whole document and lets us traverse it with selector methods. It selects the specific elements that we query for, which lets us pick out only the data we want, in our case the Pad Thai recipe. Our target is the recipe text: the ingredients and the instructions.
In HTML, this information is organized into elements, and each element carries attributes such as a class name. We can use those attributes to build a selector that points at the ingredients list.
On this page, the ingredients list sits in a container matched by the selector “div.wprm-recipe-ingredients-container”, and a single ingredient inside it is matched by “div:nth-child(1)”. Now we know the selectors, so let’s try to use cheerio.
If we want all ingredients, we can say:
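A minimal sketch of such a query might look like the following. The markup in `sampleHtml` is a trimmed-down, hypothetical stand-in for the real page (the `li` structure is an assumption), `listText` is a made-up helper name, and cheerio is assumed to be installed with `npm install cheerio`.

```javascript
// Hypothetical, simplified markup; the real page contains many more elements.
const sampleHtml = `
<div class="wprm-recipe-ingredients-container">
  <ul class="wprm-recipe-ingredients">
    <li>rice noodles</li>
    <li>eggs</li>
    <li>bean sprouts</li>
  </ul>
</div>`;

// Hypothetical helper: returns the text of every element matching `selector`,
// joined with commas. cheerio is only loaded when no library is passed in.
function listText(html, selector, cheerio = require("cheerio")) {
  const $ = cheerio.load(html); // parse the HTML string into a queryable document
  return $(selector)
    .map((index, element) => $(element).text().trim()) // text of each match
    .get() // turn the cheerio collection into a plain array
    .join(", ");
}

// Example usage:
// console.log(listText(sampleHtml, ".wprm-recipe-ingredients li"));
```

On the sample markup, this call gives "rice noodles, eggs, bean sprouts". On the real page, the selector may need adjusting to match the actual structure.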
Then, the result will be all the ingredients as text, separated by commas. Of course, we can do the same thing for the instructions. The selector will be “.wprm-recipe-instructions” instead of “.wprm-recipe-ingredients”. And we are done! Now you can go try it on your own! One thing to keep in mind is that cheerio only works on static pages: it parses the HTML it is given and does not run JavaScript, so content that a page loads with scripts will not show up.
One last thing!
Not all websites allow us to scrape their pages. Usually, we can find out what is allowed by checking the site’s robots.txt file, which lives at the root of the site (at /robots.txt).
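To get a feel for what that file says, here is a rough sketch that checks a path against the Disallow rules listed for all bots (“User-agent: *”). It is a deliberate simplification, not a full parser: `isDisallowed` is a made-up helper name, and wildcards and Allow rules are ignored.

```javascript
// Hypothetical helper: returns true if a Disallow rule in the "User-agent: *"
// group is a prefix of `path`. Wildcards and Allow rules are NOT handled.
function isDisallowed(robotsTxt, path) {
  let inWildcardGroup = false;
  let lastFieldWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const separator = line.indexOf(":");
    if (separator === -1) continue; // skip blank or malformed lines
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      inWildcardGroup = (lastFieldWasAgent && inWildcardGroup) || value === "*";
      lastFieldWasAgent = true;
    } else {
      lastFieldWasAgent = false;
      if (field === "disallow" && inWildcardGroup && value !== "") {
        if (path.startsWith(value)) return true;
      }
    }
  }
  return false;
}

// Example usage with a made-up robots.txt:
// isDisallowed("User-agent: *\nDisallow: /private/", "/private/notes");
```

The text of robots.txt itself can be fetched like any other page, for example with axios. Even when a path is not disallowed, it is still worth reading the site’s terms before scraping.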