Limiting crawls
You can limit the number of visits and the depth of your crawls
The crawl endpoint can be limited in two ways: by the number of pages visited, and by the depth (or distance) from the starting URLs.
Some reasons to limit crawls:
- **Limiting a crawl will control costs.** Each visit uses bandwidth, which incurs a cost that depends on the proxy you use. There is also a small per-visit charge from FetchFox. A cap on the number of visits therefore also caps your cost.
- **Limiting crawls will reduce runtime.** Although FetchFox visits pages concurrently during a crawl, there is a cap on that concurrency. A crawl with thousands of visits will take longer to execute than one with hundreds or fewer.
Depending on the number of URLs you want to find, a small crawl may suffice.
Limit the number of visits
The `max_visits` parameter puts a limit on the number of pages visited in the crawl. The crawl will never visit more than that many URLs.
Below is an example of a call to the crawl endpoint with a visit limit.
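The sketch below uses Python with the `requests` library. The endpoint URL, API key, and starting URL are placeholder assumptions; only the `start_urls` and `max_visits` parameters come from this page, so check the API reference for the exact request format.

```python
import requests

# Placeholder values -- substitute the real crawl endpoint URL and your
# own API key from the FetchFox API reference.
API_URL = "https://api.example.com/crawl"
API_KEY = "YOUR_API_KEY"

payload = {
    "start_urls": ["https://example.com/"],  # hypothetical starting URL
    "max_visits": 100,                       # never visit more than 100 pages
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```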
This crawl will never visit more than the specified max pages.
Limit the depth
The `max_depth` parameter limits the depth of the crawl. This parameter only applies if you specify starting URLs using the `start_urls` parameter.
Depth is the distance from the starting URLs. Each starting URL has a depth of 0. URLs linked directly from the starting URL pages have a depth of 1, and pages linked from there have a depth of 2, and so on. If there are multiple links to a page, the lowest depth value is used.
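As a rough illustration of that numbering, the sketch below assigns depths to a small, hypothetical link graph using a breadth-first traversal. This is not FetchFox code; it only mirrors the rule described above, including taking the lowest depth when a page is linked from multiple places.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["c"],  # "c" is reachable from both "a" and "b"
    "c": [],
}

# Breadth-first traversal assigns each page the lowest depth at which
# it can be reached, mirroring how crawl depth is defined above.
depth = {"start": 0}
queue = deque(["start"])
while queue:
    page = queue.popleft()
    for linked in links[page]:
        if linked not in depth:  # keep the first (lowest) depth found
            depth[linked] = depth[page] + 1
            queue.append(linked)

print(depth)  # {'start': 0, 'a': 1, 'b': 1, 'c': 2}
```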
The `max_depth` parameter is the maximum depth from the start URLs that is allowed in a crawl. If you set a max depth of 0, only the starting URLs will be visited. If you set a max depth of 1, only pages linked from the starting URLs will be visited, and so on.
Below is an example of a crawl that limits the maximum depth.
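As with the earlier example, this is a Python sketch using `requests`; the endpoint URL, API key, and starting URL are placeholders, and only `start_urls` and `max_depth` come from this page.

```python
import requests

# Placeholder values, as in the max_visits example above.
API_URL = "https://api.example.com/crawl"
API_KEY = "YOUR_API_KEY"

payload = {
    "start_urls": ["https://example.com/"],  # depth 0
    "max_depth": 1,  # visit only the start page and pages it links to directly
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```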
This crawl will not travel beyond the specified depth from the starting URLs.