FireCrawl | By MENTOR AI

Interact with Firecrawl Web Crawler API

Overview

This node integrates with the Firecrawl Web Crawler API to scrape content from web pages. The "Scrape" operation allows users to fetch and extract content from a single URL in various formats such as Markdown or screenshots. It supports advanced options like waiting for JavaScript to load, emulating mobile devices, customizing HTTP headers, and filtering HTML tags.

Common scenarios where this node is beneficial include:

  • Extracting main article content from news sites or blogs.
  • Capturing webpage snapshots for archival or review.
  • Scraping product details or reviews from e-commerce pages.
  • Automating data collection workflows that require structured web content extraction.

Practical example: A user wants to scrape the main content of a blog post as Markdown while also capturing a full-page screenshot. They can specify the URL, select both "Markdown" and "Full Screenshot" formats, and enable "Only Main Content" to exclude navigation and footer elements.
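The practical example above might translate into a request body along the following lines. This is a sketch rather than the node's actual source: the format strings (`markdown`, `screenshot`, `screenshot@fullPage`) and the `onlyMainContent` field are assumed from the Firecrawl v1 scrape API, and `FORMAT_LABELS` and `build_example_body` are hypothetical names used only for illustration.

```python
import json

# Assumed mapping from the node's format labels to Firecrawl v1 format strings.
FORMAT_LABELS = {
    "Markdown": "markdown",
    "Single Screenshot": "screenshot",
    "Full Screenshot": "screenshot@fullPage",
}


def build_example_body(url, format_labels, only_main_content=False):
    """Build a scrape request body from a URL and the selected format labels."""
    body = {
        "url": url,
        "formats": [FORMAT_LABELS[label] for label in format_labels],
    }
    if only_main_content:
        body["onlyMainContent"] = True
    return body


# The blog-post scenario: Markdown plus a full-page screenshot, main content only.
body = build_example_body(
    "https://example.com/blog/post",  # placeholder URL
    ["Markdown", "Full Screenshot"],
    only_main_content=True,
)
print(json.dumps(body, indent=2))
```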

Properties

| Name | Meaning |
| --- | --- |
| URL | The web page URL to scrape. |
| Formats | Output formats to include. Options: Markdown, Single Screenshot, Full Screenshot. |
| Additional Options | Collection of optional settings: |
| - Only Main Content | Return only the main content, excluding header, navigation, and footer sections (boolean). |
| - Include Tags | Comma-separated list of HTML tags to explicitly include in the output (string). |
| - Exclude Tags | Comma-separated list of HTML tags to exclude from the output (string). |
| - Custom Headers (JSON) | JSON string defining custom HTTP request headers (e.g., cookies, user-agent). Must be valid JSON. |
| - Wait Time (ms) | Milliseconds to wait before scraping to allow JavaScript to load (number). |
| - Mobile Emulation | Emulate a mobile device during scraping (boolean). |
| - Skip TLS Verification | Skip SSL/TLS certificate verification (boolean). |
| - Timeout (ms) | Request timeout in milliseconds (number). Defaults to 300000 ms (5 minutes). |
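
To make the option types concrete, here is a minimal sketch of how the additional options could be turned into request fields. The input keys are the display names from the properties above; the output field names (`includeTags`, `excludeTags`, `headers`, `waitFor`, `mobile`, `skipTlsVerification`, `timeout`) are assumed from the Firecrawl v1 scrape API, and the helper names are hypothetical. The node's real mapping may differ.

```python
import json


def split_tags(value):
    """Turn a comma-separated string such as "article, main" into a clean list."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def build_scrape_options(options):
    """Map display-name options to assumed Firecrawl request fields."""
    body = {}
    if "Only Main Content" in options:
        body["onlyMainContent"] = bool(options["Only Main Content"])
    if options.get("Include Tags"):
        body["includeTags"] = split_tags(options["Include Tags"])
    if options.get("Exclude Tags"):
        body["excludeTags"] = split_tags(options["Exclude Tags"])
    if options.get("Custom Headers (JSON)"):
        # The property is a JSON string, so it has to be parsed first.
        body["headers"] = json.loads(options["Custom Headers (JSON)"])
    if "Wait Time (ms)" in options:
        body["waitFor"] = int(options["Wait Time (ms)"])
    if "Mobile Emulation" in options:
        body["mobile"] = bool(options["Mobile Emulation"])
    if "Skip TLS Verification" in options:
        body["skipTlsVerification"] = bool(options["Skip TLS Verification"])
    # Documented default: 300000 ms (5 minutes).
    body["timeout"] = int(options.get("Timeout (ms)", 300000))
    return body


opts = build_scrape_options({
    "Include Tags": "article, main",
    "Custom Headers (JSON)": '{"User-Agent": "demo-agent"}',
    "Wait Time (ms)": 2000,
})
print(opts)
```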

Output

The node outputs a JSON object containing the scraped data returned by the Firecrawl API. The structure depends on the requested formats and options but generally includes:

  • For Markdown format: extracted main content in Markdown syntax.
  • For screenshots: image data or URLs referencing the captured screenshots.
  • Additional fields may include raw HTML or other requested formats if selected.

If the node encounters an error during scraping, it throws an error with a descriptive message.

Screenshots are not emitted as n8n binary data. The node passes along what the API returns, so screenshots appear in the JSON output as references (such as URLs) or as encoded data.
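
Downstream code might inspect the output roughly as follows. The response shape used here (a `data` object with `markdown` and `screenshot` fields) is an assumption about the Firecrawl API, the sample payload is invented for illustration, and `summarize_scrape_result` is a hypothetical helper.

```python
def summarize_scrape_result(result):
    """Pull out the Markdown text and classify the screenshot field."""
    # Tolerate both a wrapped payload ({"data": {...}}) and an unwrapped one.
    data = result.get("data", result)
    screenshot = data.get("screenshot")
    if screenshot is None:
        kind = "none"
    elif screenshot.startswith(("http://", "https://")):
        kind = "url"
    else:
        kind = "encoded"  # e.g. a base64 string or data URI
    return {
        "markdown": data.get("markdown"),
        "screenshot": screenshot,
        "screenshot_kind": kind,
    }


# Invented sample payload, not real API output.
sample = {
    "success": True,
    "data": {
        "markdown": "# Title\n\nBody text.",
        "screenshot": "https://example.com/shot.png",
    },
}
summary = summarize_scrape_result(sample)
print(summary["screenshot_kind"])
```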

Dependencies

  • Requires an API key credential for the Firecrawl Web Crawler API.
  • The node makes HTTPS POST requests to https://api.firecrawl.dev/v1/scrape.
  • Network access to the Firecrawl API endpoint must be available.
  • Proper configuration of the API key credential within n8n is necessary.
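
For checking connectivity and credentials outside n8n, the equivalent request can be prepared by hand. The sketch below only builds the request and does not send it. The Bearer authorization scheme is assumed from the Firecrawl API, `YOUR_API_KEY` is a placeholder, and `make_scrape_request` is a hypothetical helper.

```python
import json
import urllib.request

SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"


def make_scrape_request(api_key, body):
    """Prepare (but do not send) an authenticated POST to the scrape endpoint."""
    return urllib.request.Request(
        SCRAPE_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed Bearer scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = make_scrape_request(
    "YOUR_API_KEY",  # placeholder, replace with a real key
    {"url": "https://example.com", "formats": ["markdown"]},
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=300)
```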

Troubleshooting

  • Invalid JSON in Custom Headers: If the "Custom Headers (JSON)" property contains invalid JSON, the node will throw an error stating "Custom Headers must be valid JSON". Ensure the JSON string is correctly formatted.
  • API Errors: If the Firecrawl API returns an error, the node throws an error whose message starts with "Firecrawl Scrape error:" followed by the API's error message. Check the URL validity, API key permissions, and network connectivity.
  • Timeouts: Slow-loading pages may exceed the request timeout. Increase the "Timeout (ms)" property to give them more time.
  • Empty or Unexpected Output: Verify that the URL is accessible and that the selected formats and options match the expected content type.
  • SSL/TLS Issues: If scraping fails because of SSL certificate problems, enabling "Skip TLS Verification" may help, but use it cautiously because it disables certificate checks.
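
The two error messages quoted above can be reproduced with a small sketch, which is also handy for validating a headers string before pasting it into the node. This mirrors the documented messages only and is not the node's source; `parse_custom_headers` and `wrap_api_error` are hypothetical names.

```python
import json


def parse_custom_headers(raw):
    """Parse the Custom Headers (JSON) string, failing with the documented message."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("Custom Headers must be valid JSON") from exc


def wrap_api_error(message):
    """Prefix an API error message the way the node's errors are described."""
    return RuntimeError(f"Firecrawl Scrape error: {message}")


# The second string is invalid JSON: keys and values are not quoted.
for raw in ('{"Cookie": "session=abc"}', "{Cookie: session=abc}"):
    try:
        print("Accepted:", parse_custom_headers(raw))
    except ValueError as err:
        print("Rejected:", err)
```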
