You can tell FetchFox which URLs to start from. If the URL pattern is https://example.com/category/*, then the URL https://example.com will be included in the starting URL set. Similarly, the pattern https://example.com/a/b/* will yield the guesses https://example.com/a/b and https://example.com/a.

The start_urls parameter lets you explicitly set the starting URLs for a crawl. If you set the starting URLs this way, FetchFox uses only those URLs as starting points and does not add any others to the set you provide.
Setting the starting URLs is helpful for crawling specific parts of large sites. It is especially useful in combination with the max_depth parameter, which limits the maximum depth of a crawl. You can use these two parameters to find links only from a specific set of pages.
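To make the interaction between the two parameters concrete, here is a toy, in-memory sketch of a depth-limited crawl. It is not FetchFox's implementation: the find_links function, the SITE link graph, and the reading of depth 0 as "visit only the starting pages" are all assumptions made for illustration.

```python
from collections import deque
from fnmatch import fnmatchcase


def find_links(link_graph, start_urls, max_depth, pattern):
    """Toy crawl over an in-memory link graph (illustration only).

    Pages are visited breadth-first, starting from start_urls at depth 0.
    Links that match `pattern` are collected; links are only followed
    while the current page is shallower than max_depth.
    """
    found = []
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        for link in link_graph.get(url, []):
            if fnmatchcase(link, pattern) and link not in found:
                found.append(link)
            # Follow the link only if doing so stays within the depth limit.
            if depth < max_depth and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found


# Hypothetical site structure: page URL -> links that appear on that page.
SITE = {
    "https://example.com/a": ["https://example.com/item/1", "https://example.com/b"],
    "https://example.com/b": ["https://example.com/item/2"],
}

shallow = find_links(SITE, ["https://example.com/a"], 0, "https://example.com/item/*")
deeper = find_links(SITE, ["https://example.com/a"], 1, "https://example.com/item/*")
print(shallow)
print(deeper)
```

Under these assumptions, a depth of 0 collects matching links only from the starting pages themselves, while a depth of 1 also follows links one hop further.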
For example, suppose you are scraping commits on specific repos on GitHub. You can pass in the target repos in start_urls, and limit the depth to 0, as shown in the example below.
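As a rough sketch of how the parameters for this GitHub scenario might be assembled, the snippet below builds a plain dictionary and prints it as JSON. Only start_urls and max_depth come from the text above; the pattern key, the build_crawl_params helper, the commit URL pattern, and the repository names are hypothetical placeholders, and the snippet does not call FetchFox.

```python
import json


def build_crawl_params(pattern, start_urls, max_depth):
    """Assemble crawl parameters as a plain dict (hypothetical helper)."""
    return {
        "pattern": pattern,              # assumed key name for the URL pattern
        "start_urls": list(start_urls),  # crawl starts from exactly these URLs
        "max_depth": max_depth,          # 0: look for links on the start pages only
    }


params = build_crawl_params(
    pattern="https://github.com/*/*/commit/*",  # assumed shape of commit URLs
    start_urls=[
        # Placeholder repositories, not real targets.
        "https://github.com/example-org/repo-one",
        "https://github.com/example-org/repo-two",
    ],
    max_depth=0,
)

print(json.dumps(params, indent=2))
```

Keeping the parameters in a plain dictionary makes it easy to adapt them to whichever client or request format you actually use.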