Crawl for URL patterns

You can use FetchFox to crawl a website for a URL pattern.

A URL pattern starts with "http" and includes at least one wildcard "*" character.

Here is an example:

https://www.example.com/category/*

This URL pattern would match all of these URLs:

https://www.example.com/category/page-1
https://www.example.com/category/page-2
https://www.example.com/category/sub-cat-a
https://www.example.com/category/sub-cat-a/some-article
https://www.example.com/category/helpdesk?question_id=111
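
To build intuition, here is a small Python sketch of how this kind of wildcard matching can work. It is only an illustration of the idea, not FetchFox's actual matching code: each "*" is treated as "match anything", including slashes and query strings.

import re

def matches(pattern, url):
    # Turn the wildcard pattern into a regular expression: escape the
    # literal parts, and let each "*" match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None

pattern = "https://www.example.com/category/*"
print(matches(pattern, "https://www.example.com/category/page-1"))                  # True
print(matches(pattern, "https://www.example.com/category/sub-cat-a/some-article"))  # True
print(matches(pattern, "https://www.example.com/about-us"))                         # False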

You can include multiple wildcards to match specific patterns, like this:

https://www.example.com/category/*/items/*

This URL pattern will match all of these URLs:

https://www.example.com/category/toys/items/11
https://www.example.com/category/toys/items/22
https://www.example.com/category/toys/items/33
https://www.example.com/category/books/items/111
https://www.example.com/category/books/items/222
https://www.example.com/category/books/items/333
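
Python's built-in glob-style matching treats "*" the same way, so you can sanity-check a multi-wildcard pattern like this (again, just a sketch, not how FetchFox runs its crawl):

from fnmatch import fnmatchcase

pattern = "https://www.example.com/category/*/items/*"

print(fnmatchcase("https://www.example.com/category/toys/items/11", pattern))   # True
print(fnmatchcase("https://www.example.com/category/books/items/222", pattern)) # True
print(fnmatchcase("https://www.example.com/category/toys", pattern))            # False, no /items/ part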

Why use URL patterns?

It is useful to crawl for URL patterns when you want to find many pages that contain similar data. Websites use URLs to organize their content, and you can use that organization in your scraper.

For example, if you are scraping e-commerce data, you may notice that all the products have URLs like this:

https://www.some-store.com/shop/products/111-basic-soap
https://www.some-store.com/shop/products/222-fancy-soap
...and so on...

A URL pattern is an easy way to find all the products. Just put in "*" for the part that changes:

https://www.some-store.com/shop/products/*
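
Here is a short Python sketch of the filtering idea behind a pattern crawl: start from a pile of links found on the site and keep only the ones that match the product pattern. The store links below are hypothetical, taken from the example above.

from fnmatch import fnmatchcase

# Hypothetical links discovered while crawling the example store.
found_links = [
    "https://www.some-store.com/shop/products/111-basic-soap",
    "https://www.some-store.com/shop/products/222-fancy-soap",
    "https://www.some-store.com/shop/cart",
    "https://www.some-store.com/about-us",
]

pattern = "https://www.some-store.com/shop/products/*"
product_urls = [url for url in found_links if fnmatchcase(url, pattern)]
print(product_urls)  # only the two product pages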

Scraping Pokemon Moves with URL patterns

Let's do an example scrape using URL patterns. We're going to scrape all the Pokemon moves on pokemondb.net.

Each Pokemon move has its own page on pokemondb.net, and you'll notice those pages all follow this format:

https://pokemondb.net/move/*

This format becomes our URL pattern. Let's get started.

As a first step, make a new scrape at https://fetchfox.ai/new, and put in the top-level URL of the site, like this:

Put in any URL from the target site to get started

Click the arrow to continue, and wait for FetchFox to initialize your workflow.

For this scrape, remove any steps that FetchFox created so we have a blank workflow.

Create a blank workflow by removing any steps FetchFox added

Then, add a "Crawl" step by clicking the plus icon.

Add a "Crawl" step

Then, select the option to crawl based on a URL pattern:

Use the option that crawls for URL patterns

Enter the URL pattern from before:

https://pokemondb.net/move/*

And then click "Save".

Enter your URL pattern for the crawl.

For URL pattern crawls, make sure to set a limit on the number of results. These crawls often find a lot of pages, and without a limit you will quickly burn through your credits.

Put in a limit for the crawl
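
To see why the limit matters, here is a toy Python sketch. It stands in for the crawl with a plain list of made-up move URLs; the point is just that capping the result count stops the run early instead of letting it chew through every page it can find (and your credits along with them).

from fnmatch import fnmatchcase

def filter_with_limit(links, pattern, limit):
    # Keep links that match the pattern, but stop as soon as we hit the limit.
    results = []
    for url in links:
        if fnmatchcase(url, pattern):
            results.append(url)
            if len(results) >= limit:
                break
    return results

# Made-up URLs standing in for the pages a real crawl would discover.
links = [f"https://pokemondb.net/move/example-move-{i}" for i in range(1000)]
print(len(filter_with_limit(links, "https://pokemondb.net/move/*", limit=50)))  # 50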

Click "Run", and you should see results like the screenshot below.

Results for the URL pattern crawl.

We can combine this with extraction to get data out of each page. To do this, add an extract step using the plus icon.

You can extract data from the pages you found.

For this example, add the following fields:

name: Move name

type: Move type

power: Move power

Make sure to tell the AI to scrape a single item per page, and click "Save".
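
If it helps to picture the output, the extract step is essentially a template of field names and plain-language descriptions, and with "single item per page" each crawled move page becomes one row. Here is a rough sketch of that shape in Python; the values shown are placeholders, not real scraped data.

# The extraction template: field name -> description of what to pull.
fields = {
    "name": "Move name",
    "type": "Move type",
    "power": "Move power",
}

# One row per crawled page; values here are placeholders.
example_rows = [
    {"name": "<move name>", "type": "<move type>", "power": "<move power>"},
    # ...one dict per move page found by the crawl...
]
print(list(fields.keys()))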

Run the scraper again. Your results should look something like this:

Extraction can follow a crawl step

Combining a URL pattern crawl and a data extraction step is an easy and powerful way to scrape data from many websites.
