Extract data from URLs

FetchFox’s extract endpoint takes a list of URLs and an item template as input, and returns structured data that follows the template. The item template can be a dictionary of key-value pairs or a string. You can extract one item per URL, or many items per URL.

A simple extraction

The extract endpoint has two required parameters:

urls specifies which URLs to target for data extraction.
template specifies the output format for each item.

The template can be a dictionary or a string.

If template is a dictionary, the keys in the dictionary define the keys of each output item. The values in your template dictionary should describe what data you want in that field. The entire template dictionary will be passed to the AI, which will use it to do data extraction, so you can include helpful information like how to get the data, the format for that field, and so on.

If the template is a string, FetchFox will use that string to automatically determine the keys in the output items.
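As a rough sketch of the string form, the request below passes a plain-language description instead of a dictionary. The endpoint and headers are copied from the curl example later on this page. The wording of the template string, the `FETCHFOX_API_KEY` environment variable, and the `extract_with_string_template` function name are assumptions made for illustration.

```shell
# Hypothetical request body: the template is a plain string, not a dictionary.
# The wording of the string is illustrative and not taken from the FetchFox docs.
STRING_PAYLOAD='{
  "urls": ["https://pokemondb.net/pokedex/bulbasaur"],
  "template": "The name, type, and base stats of the pokemon"
}'

# Sends the body to the extract endpoint.
# Assumes FETCHFOX_API_KEY is set in the environment.
extract_with_string_template() {
  curl -X POST "https://api.fetchfox.ai/api/extract" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${FETCHFOX_API_KEY}" \
    -d "$STRING_PAYLOAD"
}
```

Calling `extract_with_string_template` sends the request. With a string template, the keys of the returned items are chosen by FetchFox rather than by you.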

Whether your template is a dictionary or a string, FetchFox will establish a JSON schema for your output items. This schema is returned in the artifacts section of the response.

Below is an example of calling the extract endpoint with a template.
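A minimal sketch of such a call, modeled on the multi-item request later on this page, might look like the following. Only the endpoint, the headers, and the `urls` and `template` parameters come from this page. The template keys (`name`, `types`, `height`), their descriptions, the `FETCHFOX_API_KEY` environment variable, and the `extract_simple` function name are hypothetical.

```shell
# Hypothetical request body: dictionary template, one item per URL (the default).
# Keys become the output field names; values describe what to extract.
SIMPLE_PAYLOAD='{
  "urls": [
    "https://pokemondb.net/pokedex/bulbasaur",
    "https://pokemondb.net/pokedex/ivysaur"
  ],
  "template": {
    "name": "Name of the pokemon",
    "types": "Comma-separated list of the pokemon types",
    "height": "Height in meters, as a number"
  }
}'

# Sends the body to the extract endpoint.
# Assumes FETCHFOX_API_KEY is set in the environment.
extract_simple() {
  curl -X POST "https://api.fetchfox.ai/api/extract" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${FETCHFOX_API_KEY}" \
    -d "$SIMPLE_PAYLOAD"
}
```

Running `extract_simple` sends the request. Because `per_page` is not set, FetchFox would return one item for each of the two URLs.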

The response to this call will look something like this:

Extracting multiple items per URL

By default, FetchFox extracts one item for each URL you pass in. Sometimes, a page contains multiple items. You can tell FetchFox to extract all of them by setting the per_page parameter to many.

Below is an example of extracting multiple items from each page.

curl -X POST "https://api.fetchfox.ai/api/extract" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
   "urls": [
     "https://pokemondb.net/pokedex/bulbasaur",
     "https://pokemondb.net/pokedex/ivysaur",
     "https://pokemondb.net/pokedex/venusaur"
   ],
   "template": {
     "move_name": "Name of the pokemon move",
     "move_type": "Name of the move type",
     "move_power": "The power of the move"
   },
   "per_page": "many"
}'

The response to this call will look something like this:

{
  "job_id": "fjszygdh38",
  "results": {
    "items": [
      {
        "move_name": "Growl",
        "move_type": "Normal",
        "move_power": "100",
        "_url": "https://pokemondb.net/pokedex/ivysaur",
        "_htmlUrl": "https://ffcloud.s3.amazonaws.com/visit/html/xz6rjf8h2v.html"
      },
      {
        "move_name": "Growth",
        "move_type": "Normal",
        "move_power": "—",
        "_url": "https://pokemondb.net/pokedex/ivysaur",
        "_htmlUrl": "https://ffcloud.s3.amazonaws.com/visit/html/xz6rjf8h2v.html"
      },
      ...more items...
    ]
  },
  "metrics": { ...cost and usage metrics... },
  "artifacts": [
    {
      "type": "divide",
      "divide": {
        "reasoning": "Moves are presented in tables with the class 'data-table', where each move (row) is represented by a <tr> inside <tbody>. I focused on extracting each <tr> for coverage.",
        ...more chain of thought...
        "selector": ".data-table tbody tr"
      }
    },
    {
      "type": "schema",
      "schema": { ...JSON schema definition... }
    }
  ]
}

Note the divide artifact. When you extract multiple items per page, FetchFox uses a CSS selector to divide the page into pieces. The selector it used is included in the divide artifact, along with some of the AI's chain of thought.
