When you crawl for URLs using a pattern, FetchFox needs a set of URLs to start at. Those URLs can be set in one of two ways:
  • Automatic: You can let FetchFox determine the starting URLs for you.
  • Explicit: You can tell FetchFox which URLs to start crawling from.
Let’s look at both of these options.

Automatically determine the starting URLs

If you do not pass startUrls, FetchFox generates a small seed set from your pattern:
  • The origin (for example https://example.com)
  • Path prefixes derived from your pattern
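
As a minimal sketch of the automatic case, the request below simply leaves out startUrls, so FetchFox is left to choose the seed URLs from the pattern. The pattern https://example.com/blog/* is a hypothetical example. Sending maxVisits without startUrls is assumed to work the same way as in an explicit request. The request is only sent when FETCHFOX_API_KEY is set.

```shell
# Request body with no startUrls: FetchFox derives the seed URLs from the pattern.
# The pattern is a hypothetical example, not a real crawl target.
payload='{
    "pattern": "https://example.com/blog/*",
    "maxVisits": 50
}'

# Only send the request when an API key is available in the environment.
if [ -n "${FETCHFOX_API_KEY:-}" ]; then
  curl -X POST https://api.fetchfox.ai/api/crawl \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $FETCHFOX_API_KEY" \
    -d "$payload"
fi
```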

Explicitly setting the starting URLs

Use startUrls to explicitly define the seed URLs for a crawl. This is helpful when you want to crawl specific parts of a large site. It is especially useful together with maxDepth, which limits how far the crawl follows links from the starting URLs. For example, suppose you are scraping commits in specific repos on GitHub. You can pass the commits page of each target repo in startUrls, then set maxDepth: 0:
curl -X POST https://api.fetchfox.ai/api/crawl \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FETCHFOX_API_KEY" \
-d '{
    "pattern":"https://github.com/*/commit/*",
    "startUrls": [
      "https://github.com/bitcoin/bitcoin/commits/master/",
      "https://github.com/torvalds/linux/commits/master/"
    ],
    "maxDepth": 0,
    "maxVisits": 50
}'
The call above finds commit URLs for the target repos without spending visits on irrelevant parts of the site.
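
If you have more than a couple of target repos, writing startUrls by hand gets tedious. The sketch below builds the same request body from a space-separated list of repos in plain shell. The repos variable and the assumption that each repo has a master branch are illustrative. Adjust both for your own targets.

```shell
# Hypothetical list of target repos in owner/name form.
repos="bitcoin/bitcoin torvalds/linux"

# Build the comma-separated JSON strings for startUrls, one per repo.
start_urls=""
for repo in $repos; do
  start_urls="${start_urls:+$start_urls, }\"https://github.com/$repo/commits/master/\""
done

# Assemble the request body with the same fields as the explicit example.
payload="{\"pattern\": \"https://github.com/*/commit/*\", \"startUrls\": [$start_urls], \"maxDepth\": 0, \"maxVisits\": 50}"

printf '%s\n' "$payload"
```

The resulting string can be passed to curl with -d "$payload" in place of the inline JSON.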