Crawl for URL patterns
You can use FetchFox to crawl a website for a URL pattern.
A URL pattern starts with "http" and includes at least one wildcard "*" character.
Here is an example:
This URL pattern would match all of these URLs:
You can include multiple wildcards to match specific patterns, like this:
This URL pattern will match all of these URLs:
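Conceptually, matching a wildcard pattern amounts to translating each "*" into a regular expression. Here is a minimal sketch in Python; the pattern and URLs are hypothetical, and this is an illustration of the idea rather than FetchFox's actual matching logic:

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    # Escape regex metacharacters, then turn each "*" wildcard
    # into ".*" so it matches any run of characters.
    escaped = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + escaped + "$")

# Hypothetical pattern and URLs for illustration.
matcher = pattern_to_regex("https://example.com/products/*")
print(matcher.match("https://example.com/products/blue-widget") is not None)  # True
print(matcher.match("https://example.com/about") is not None)                 # False
```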
Why use URL patterns?
It is useful to crawl for URL patterns if you want to find many pages that hold similar data. Websites use URLs to organize their content, and your scraper can take advantage of that organization.
For example, if you are scraping e-commerce data, you may notice that all the products have URLs like this:
A URL pattern is an easy way to find all the products: just put in "*" for the part that changes.
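In code, this idea is just a filter over links. A quick sketch using Python's standard `fnmatch` module, with made-up product URLs:

```python
from fnmatch import fnmatch

# Hypothetical links found on a listing page.
links = [
    "https://example.com/products/red-widget",
    "https://example.com/cart",
    "https://example.com/products/blue-widget",
]

# Keep only the URLs that match the product pattern.
product_urls = [url for url in links if fnmatch(url, "https://example.com/products/*")]
print(product_urls)
```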
Scraping Pokemon Moves with URL patterns
Let's work through an example: we're going to scrape all the Pokemon moves using a URL pattern.
You can find Pokemon moves at URLs like this:
...and so on...
You'll notice they all have this format:
This format becomes our URL pattern. Let's get started.
Click the arrow to continue, and wait for FetchFox to initialize your workflow.
For this scrape, remove any steps that FetchFox created so we have a blank workflow.
Then, add a "Crawl" step by clicking the plus icon.
Then, select the option to crawl based on a URL pattern:
Enter the URL pattern from before, which is:
And then click "Save".
Click "Run", and you should see results like the screenshot below.
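What the Crawl step just did can be sketched in plain Python: start from a page, follow links breadth-first, and keep every URL that matches the pattern. This is a simplified illustration, not FetchFox's implementation; the `fetch` function is injected so the sketch stays network-free:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, pattern, fetch, max_pages=50):
    """Breadth-first crawl from start_url, returning URLs that match pattern.

    `fetch` is any callable that takes a URL and returns its HTML,
    so this sketch can be tested without touching the network.
    """
    regex = re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")
    seen, queue, matches = {start_url}, [start_url], []
    while queue and len(seen) <= max_pages:
        url = queue.pop(0)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute in seen:
                continue
            seen.add(absolute)
            queue.append(absolute)
            if regex.match(absolute):
                matches.append(absolute)
    return matches
```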
We can combine this with extraction to get data out of each page. To do this, add an extract step using the plus icon.
For this example, add the following fields:
name: Move name
type: Move type
power: Move power
Make sure to tell the AI to scrape a single item per page, and then click "Save".
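Conceptually, the extract step turns each crawled page into one record with those three fields. Here is a rough sketch using regexes over hypothetical markup; real pages will differ, and FetchFox's AI extraction does not require you to write selectors at all:

```python
import re

# Hypothetical HTML for one move page; real markup will differ.
html = """
<h1>Pound</h1>
<td class="type">Normal</td>
<td class="power">40</td>
"""

def extract_move(page_html):
    """Pull one record per page, mirroring the three fields above."""
    name = re.search(r"<h1>(.*?)</h1>", page_html).group(1)
    move_type = re.search(r'class="type">(.*?)<', page_html).group(1)
    power = re.search(r'class="power">(.*?)<', page_html).group(1)
    return {"name": name, "type": move_type, "power": power}

print(extract_move(html))  # {'name': 'Pound', 'type': 'Normal', 'power': '40'}
```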
Run the scraper again. Your results should look something like this:
Combining a URL pattern crawl and a data extraction step is an easy and powerful way to scrape data from many websites.