Overview
This node integrates with the Firecrawl Web Crawler API to scrape content from web pages. The "Scrape" operation allows users to fetch and extract content from a single URL in various formats such as Markdown or screenshots. It supports advanced options like waiting for JavaScript to load, emulating mobile devices, customizing HTTP headers, and filtering HTML tags.
Common scenarios where this node is beneficial include:
- Extracting main article content from news sites or blogs.
- Capturing webpage snapshots for archival or review.
- Scraping product details or reviews from e-commerce pages.
- Automating data collection workflows that require structured web content extraction.
Practical example: A user wants to scrape the main content of a blog post as Markdown while also capturing a full-page screenshot. They can specify the URL, select both "Markdown" and "Full Screenshot" formats, and enable "Only Main Content" to exclude navigation and footer elements.
Properties
Name | Meaning |
---|---|
URL | The web page URL to scrape. |
Formats | Output formats to include. Options: Markdown, Single Screenshot, Full Screenshot. |
Additional Options | Collection of optional settings: |
- Only Main Content | Return only the main content, excluding header, navigation, and footer sections (boolean). |
- Include Tags | Comma-separated list of HTML tags to explicitly include in the output (string). |
- Exclude Tags | Comma-separated list of HTML tags to exclude from the output (string). |
- Custom Headers (JSON) | JSON string defining custom HTTP request headers (e.g., cookies, user-agent). Must be valid JSON. |
- Wait Time (ms) | Milliseconds to wait before scraping to allow JavaScript to load (number). |
- Mobile Emulation | Emulate a mobile device during scraping (boolean). |
- Skip TLS Verification | Skip SSL/TLS certificate verification (boolean). |
- Timeout (ms) | Request timeout in milliseconds (number). Defaults to 300000 ms (5 minutes). |
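
The properties above can be pictured as a single request body. The following is a minimal sketch, not the node's actual source: the `ScrapeOptions` and `ScrapeBody` interfaces, the helper names, the body field names (`waitFor`, `mobile`, `skipTlsVerification`, and so on), and the format identifiers are all assumptions that should be checked against the Firecrawl API documentation.

```typescript
// Hypothetical shape of the node's "Additional Options" (names are illustrative).
interface ScrapeOptions {
  onlyMainContent?: boolean;
  includeTags?: string; // comma-separated, as typed into the node
  excludeTags?: string; // comma-separated, as typed into the node
  customHeaders?: Record<string, string>; // already parsed from the JSON string
  waitTimeMs?: number;
  mobile?: boolean;
  skipTlsVerification?: boolean;
  timeoutMs?: number;
}

// Assumed request body; field names are not confirmed by this page.
interface ScrapeBody {
  url: string;
  formats: string[];
  timeout: number;
  onlyMainContent?: boolean;
  includeTags?: string[];
  excludeTags?: string[];
  headers?: Record<string, string>;
  waitFor?: number;
  mobile?: boolean;
  skipTlsVerification?: boolean;
}

const DEFAULT_TIMEOUT_MS = 300000; // 5 minutes, the documented default

// Turn "article, h1 ," into ["article", "h1"]; return undefined when nothing is left.
function splitTags(csv?: string): string[] | undefined {
  if (!csv) return undefined;
  const tags = csv
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
  return tags.length > 0 ? tags : undefined;
}

// Copy only the options the user actually set, so unset ones are left out of the body.
function buildScrapeBody(
  url: string,
  formats: string[],
  options: ScrapeOptions = {},
): ScrapeBody {
  const body: ScrapeBody = {
    url,
    formats,
    timeout: options.timeoutMs ?? DEFAULT_TIMEOUT_MS,
  };
  const includeTags = splitTags(options.includeTags);
  const excludeTags = splitTags(options.excludeTags);
  if (includeTags) body.includeTags = includeTags;
  if (excludeTags) body.excludeTags = excludeTags;
  if (options.onlyMainContent !== undefined) body.onlyMainContent = options.onlyMainContent;
  if (options.customHeaders) body.headers = options.customHeaders;
  if (options.waitTimeMs !== undefined) body.waitFor = options.waitTimeMs;
  if (options.mobile !== undefined) body.mobile = options.mobile;
  if (options.skipTlsVerification !== undefined) {
    body.skipTlsVerification = options.skipTlsVerification;
  }
  return body;
}

// Example: the blog-post scenario from the Overview.
const exampleBody = buildScrapeBody(
  "https://example.com/blog/post",
  ["markdown", "screenshot@fullPage"], // assumed format identifiers
  { onlyMainContent: true },
);
```

Leaving unset options out of the body, rather than sending explicit `false` or empty values, lets the API apply its own defaults.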
Output
The node outputs a JSON object containing the scraped data returned by the Firecrawl API. The structure depends on the requested formats and options but generally includes:
- For Markdown format: extracted main content in Markdown syntax.
- For screenshots: image data or URLs referencing the captured screenshots.
- Additional fields may include raw HTML or other requested formats if selected.
If the node encounters an error during scraping, it throws an error with a descriptive message.
Screenshots are not emitted as n8n binary data. The node passes through whatever the API response contains for them (references or encoded data) as part of the JSON output.
Dependencies
- Requires an API key credential for the Firecrawl Web Crawler API.
- The node makes HTTPS POST requests to https://api.firecrawl.dev/v1/scrape.
- Network access to the Firecrawl API endpoint must be available.
- Proper configuration of the API key credential within n8n is necessary.
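
To make the dependencies above concrete, here is a rough sketch of how a request to the endpoint might be assembled. It only builds a request description and sends nothing. The `buildScrapeRequest` helper, the `HttpRequestSpec` type, the placeholder key, and the missing-key error message are hypothetical, and sending the API key as a Bearer token is an assumption to verify against the Firecrawl documentation.

```typescript
const SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v1/scrape";

// Plain description of an HTTP request (hypothetical type, for illustration only).
interface HttpRequestSpec {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Assumption: the API key travels in an "Authorization: Bearer <key>" header.
function buildScrapeRequest(apiKey: string, payload: object): HttpRequestSpec {
  if (apiKey.trim() === "") {
    // Hypothetical message; the real node relies on the n8n credential being configured.
    throw new Error("Firecrawl API key is missing");
  }
  return {
    url: SCRAPE_ENDPOINT,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}

// Example with a placeholder key and a minimal payload.
const exampleRequest = buildScrapeRequest("fc-EXAMPLE-KEY", {
  url: "https://example.com",
  formats: ["markdown"],
});
```

Outside n8n, a description like `exampleRequest` could be passed to the standard `fetch` function as `fetch(exampleRequest.url, { method: exampleRequest.method, headers: exampleRequest.headers, body: exampleRequest.body })`.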
Troubleshooting
- Invalid JSON in Custom Headers: If the "Custom Headers (JSON)" property contains invalid JSON, the node will throw an error stating "Custom Headers must be valid JSON". Ensure the JSON string is correctly formatted.
- API Errors: If the Firecrawl API returns an error, the node throws an error whose message starts with "Firecrawl Scrape error:" followed by the API's error message. Check the URL validity, API key permissions, and network connectivity.
- Timeouts: Slow-loading pages can exceed the request timeout. Increase the "Timeout (ms)" property (default 300000 ms) if this happens.
- Empty or Unexpected Output: Verify that the URL is accessible and that the selected formats and options match the expected content type.
- SSL/TLS Issues: If scraping fails due to SSL certificate problems, enabling "Skip TLS Verification" may help, but use it cautiously because it disables certificate validation.
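
The two error messages quoted in this list can be illustrated with a small sketch. This is not the node's source: `parseCustomHeaders` and `toScrapeError` are hypothetical helper names, and rejecting JSON that is valid but not an object (such as an array) is an extra assumption about how strict the check is.

```typescript
// Parse the "Custom Headers (JSON)" string into a header map.
function parseCustomHeaders(raw: string): Record<string, string> {
  if (raw.trim() === "") return {}; // nothing entered, so no custom headers

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error("Custom Headers must be valid JSON");
  }

  // Assumption: headers must be a JSON object, not an array or a primitive.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Custom Headers must be valid JSON");
  }

  const source = parsed as Record<string, unknown>;
  const headers: Record<string, string> = {};
  for (const name of Object.keys(source)) {
    headers[name] = String(source[name]); // header values are sent as strings
  }
  return headers;
}

// Wrap an API error message with the documented prefix.
function toScrapeError(apiMessage: string): Error {
  return new Error(`Firecrawl Scrape error: ${apiMessage}`);
}

// Example: a valid header string.
const exampleHeaders = parseCustomHeaders('{"User-Agent": "my-bot/1.0", "Cookie": "session=abc"}');
```

Validating the headers before any request is made means a typo in the JSON fails fast instead of surfacing later as a confusing API error.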
Links and References
- Firecrawl API Documentation: https://firecrawl.dev/docs
- n8n Documentation on Creating Nodes: https://docs.n8n.io/integrations/creating-nodes/
- JSON Validator Tools (for custom headers): https://jsonlint.com/