In a recent web scraping project, I created a group of scrapers that parse a target site's content based on specific HTML tags. However, the presence of these tags varies from site to site. I developed a basic algorithm to process the sites:
- Make a GET request to the target using the `requests` library
- Pass the returned HTML response to the `BeautifulSoup` library for parsing
- Look for a script tag of type `application/ld+json`, as it may contain schema information
- If none is found, look for `og:` meta tags, as they may contain site information
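As a plain-function sketch, the steps above might look like the following (the helper names and return shape are illustrative, not taken from the original code):

```python
import requests
from bs4 import BeautifulSoup


def extract(html):
    # Hand the HTML to BeautifulSoup for parsing.
    soup = BeautifulSoup(html, "html.parser")
    # Prefer structured schema data if present.
    ld_tag = soup.find("script", type="application/ld+json")
    if ld_tag and ld_tag.string:
        return {"source": "ld+json", "raw": ld_tag.string}
    # Otherwise fall back to Open Graph (og:) meta tags.
    og = {tag["property"]: tag.get("content", "")
          for tag in soup.find_all("meta", property=True)
          if tag["property"].startswith("og:")}
    return {"source": "og:meta", "raw": og}


def scrape(url):
    # Make a GET request to the target and parse the response body.
    response = requests.get(url, timeout=10)
    return extract(response.text)
```

Splitting `extract` out of `scrape` keeps the parsing logic testable without a network call.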
I wrapped the above algorithm in a top-level runner class and created separate scraper classes, which I pass as arguments to the runner at runtime.
However, I noticed I had to duplicate some common functionality across the scraper classes, such as creating the HTTP request and loading the HTML body into BeautifulSoup.
I started to investigate the Template Method design pattern as an option for refactoring. The Template Method is a behavioural pattern that defines the skeleton of an algorithm in a base class and lets subclasses override specific steps without changing the algorithm's overall structure.
Given the use case above, we could wrap the entire algorithm into a base class which implements the core algorithm steps:
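A sketch of such a base class might look like this (the method bodies are my reconstruction of the approach, not the original listing):

```python
import json

import requests
from bs4 import BeautifulSoup


class Scraper:
    """Base class: parse_site is the template method holding the algorithm."""

    def parse_site(self, url):
        # Fixed sequence of steps: fetch, parse, then try each extraction.
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        data = self.parse_ld_tags(soup)
        if not data:
            data = self.parse_meta_tags(soup)
        # Hook step: blank by default, overridden by subclasses.
        data.update(self.parse_custom(soup))
        return data

    def parse_ld_tags(self, soup):
        # Schema information from <script type="application/ld+json">.
        tag = soup.find("script", type="application/ld+json")
        if tag and tag.string:
            try:
                parsed = json.loads(tag.string)
                if isinstance(parsed, dict):
                    return parsed
            except json.JSONDecodeError:
                pass
        return {}

    def parse_meta_tags(self, soup):
        # Fallback: collect Open Graph (og:) meta tags.
        return {tag["property"]: tag.get("content", "")
                for tag in soup.find_all("meta", property=True)
                if tag["property"].startswith("og:")}

    def parse_custom(self, soup):
        # Intentionally blank; subclasses override to add custom logic.
        return {}
```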
The `Scraper` class implements the `parse_site` method, which represents the algorithm's steps. It retrieves the target site's HTML and creates a BeautifulSoup object. It then calls `parse_ld_tags`, which attempts to retrieve any `application/ld+json` tags if present; if none are found, it falls back to the `parse_meta_tags` method. Note that the `parse_custom` method can be overridden in subclasses to provide custom parsing logic.
Since `parse_custom` is blank, we can override it in custom subclasses. Suppose we need a custom scraper that only looks for the title and meta-description tags; we can do so by overriding the `parse_custom` method:
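For instance, a subclass along these lines (sketched against a minimal stand-in for the base class so the snippet runs on its own; the class name is my own):

```python
from bs4 import BeautifulSoup


class Scraper:
    # Minimal stand-in for the base class; only the blank hook matters here.
    def parse_custom(self, soup):
        return {}


class TitleDescriptionScraper(Scraper):
    """Overrides the blank hook to pull only the title and meta description."""

    def parse_custom(self, soup):
        data = {}
        if soup.title and soup.title.string:
            data["title"] = soup.title.string.strip()
        desc = soup.find("meta", attrs={"name": "description"})
        if desc:
            data["description"] = desc.get("content", "")
        return data
```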
Below is a full listing of the example code above:
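A self-contained reconstruction along these lines, including a hypothetical runner class like the one mentioned earlier, might look like:

```python
import json

import requests
from bs4 import BeautifulSoup


class Scraper:
    """Base class: parse_site is the template method."""

    def parse_site(self, url):
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        data = self.parse_ld_tags(soup) or self.parse_meta_tags(soup)
        data.update(self.parse_custom(soup))
        return data

    def parse_ld_tags(self, soup):
        # Schema information from <script type="application/ld+json">.
        tag = soup.find("script", type="application/ld+json")
        if tag and tag.string:
            try:
                parsed = json.loads(tag.string)
                if isinstance(parsed, dict):
                    return parsed
            except json.JSONDecodeError:
                pass
        return {}

    def parse_meta_tags(self, soup):
        # Fallback: collect Open Graph (og:) meta tags.
        return {tag["property"]: tag.get("content", "")
                for tag in soup.find_all("meta", property=True)
                if tag["property"].startswith("og:")}

    def parse_custom(self, soup):
        # Blank hook for subclasses.
        return {}


class TitleDescriptionScraper(Scraper):
    """Custom scraper: only extracts the title and meta description."""

    def parse_custom(self, soup):
        data = {}
        if soup.title and soup.title.string:
            data["title"] = soup.title.string.strip()
        desc = soup.find("meta", attrs={"name": "description"})
        if desc:
            data["description"] = desc.get("content", "")
        return data


class Runner:
    """Hypothetical runner: scraper instances are passed in at runtime."""

    def __init__(self, jobs):
        self.jobs = jobs  # iterable of (scraper, url) pairs

    def run(self):
        return [scraper.parse_site(url) for scraper, url in self.jobs]
```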
While it works, there are some disadvantages to this approach:
- The algorithm's logic is tied to the base class, so any change to it may require corresponding changes in the subclasses.
- Implementing custom scraping requires creating a subclass and overriding a specific method.
On the other hand, the code base is cleaner after pulling the duplicated code into a base class, and the pattern lets you override part of the algorithm's logic while keeping the rest intact.
Future posts will investigate the use of other patterns such as the Factory Method in a similar refactoring scenario.