Overview
This node enables crawling and scraping of web pages to extract specific data such as links, text content, or raw HTML. It is useful for scenarios where you need to gather information from websites automatically, like collecting all hyperlinks on a page, extracting readable text for analysis, or retrieving the full HTML markup for further processing.
Practical examples include:
- Extracting all outbound links from a blog post to analyze link structure.
- Scraping article text from news sites for sentiment analysis.
- Downloading raw HTML of product pages for custom parsing.
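Since the node is built on crawlee (see Dependencies), a link-extraction run roughly corresponds to the sketch below. This is a minimal illustration, not the node's actual source: the `extractLinks` helper and the depth tracking via `request.userData` are assumptions, while the 100-request cap matches the limit mentioned under Troubleshooting.

```typescript
import { CheerioCrawler } from 'crawlee';

// Minimal sketch: collect unique links up to a given depth with crawlee.
// The helper name and userData-based depth tracking are illustrative assumptions.
async function extractLinks(startUrl: string, maxDepth: number): Promise<string[]> {
  const links = new Set<string>();

  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100, // matches the per-crawl limit noted in Troubleshooting
    async requestHandler({ request, $, enqueueLinks }) {
      // Resolve each href against the page URL and record it.
      $('a[href]').each((_, el) => {
        const href = $(el).attr('href');
        if (href) {
          try {
            links.add(new URL(href, request.loadedUrl ?? request.url).href);
          } catch {
            // Skip hrefs that are not valid URLs (e.g. "javascript:void(0)").
          }
        }
      });

      // Follow discovered links until the configured depth is reached.
      const depth = (request.userData.depth as number) ?? 0;
      if (depth < maxDepth) {
        await enqueueLinks({ userData: { depth: depth + 1 } });
      }
    },
  });

  await crawler.run([{ url: startUrl, userData: { depth: 0 } }]);
  return [...links];
}
```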
Properties
| Name | Meaning |
| --- | --- |
| URL | The web address of the page to crawl or scrape. |
| Max Depth | Maximum depth of crawling (only applicable when extracting links). Defines how deep the crawler should follow links starting from the initial URL. |
Output
The node outputs JSON objects with one of the following structures, depending on the selected operation:
Extract Links
{ "status": "success", "message": "Crawling finished", "data": { "url": "string", // The original URL crawled "links": ["string"] // Array of unique extracted URLs found on the page(s) } }
Extract Text
{ "status": "success", "message": "Text extraction finished", "data": { "url": "string", // The URL scraped "text": "string" // All visible text content extracted from the page body } }
Extract HTML
{ "status": "success", "message": "HTML extraction finished", "data": { "url": "string", // The URL scraped "html": "string" // Raw HTML content of the page } }
No binary data output is produced by this node.
Dependencies
- This node depends on the `crawlee` library for crawling and scraping functionality.
- It requires internet access to fetch web pages.
- No special API keys or credentials are needed unless accessing protected resources.
- The node uses internal request throttling and timeout settings to manage crawling.
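The throttling and timeout knobs referred to above are standard crawlee crawler options. The values below are illustrative, not necessarily what the node ships with:

```typescript
import { CheerioCrawler } from 'crawlee';

// Example throttling/timeout configuration; the specific values are illustrative.
const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100,      // hard cap on requests per crawl
  maxConcurrency: 10,            // limit parallel requests to the target site
  maxRequestRetries: 2,          // retry transient failures a few times
  requestHandlerTimeoutSecs: 30, // abort a page whose handler runs too long
  navigationTimeoutSecs: 30,     // abort slow HTTP requests
  async requestHandler({ $ }) {
    // ... extraction logic for the selected operation ...
  },
});
```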
Troubleshooting
Common issues:
- Invalid or malformed URLs can cause errors during crawling.
- Target websites blocking automated requests may result in empty or failed responses.
- Network timeouts can occur if the site is slow or unreachable.
- Each crawl is capped at 100 requests, so results from large sites may be incomplete.
Error messages:
- Errors related to URL parsing indicate invalid input URLs.
- Timeout errors usually mean the site is slow or unreachable; increase the timeout or check network connectivity.
- If the node fails but "Continue On Fail" is enabled, it returns error details per item instead of stopping execution (see the sketch after this list).
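The "Continue On Fail" behaviour follows the usual n8n node pattern, sketched below. The `crawlOrScrape` helper and the `url` parameter name are hypothetical stand-ins for the node's real implementation; only `continueOnFail()` and the per-item error output reflect standard n8n APIs.

```typescript
import type { IDataObject, IExecuteFunctions, INodeExecutionData } from 'n8n-workflow';

// Sketch of per-item error handling with "Continue On Fail" enabled.
export async function execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]> {
  const items = this.getInputData();
  const returnData: INodeExecutionData[] = [];

  for (let i = 0; i < items.length; i++) {
    try {
      const url = this.getNodeParameter('url', i) as string; // parameter name assumed
      const result = await crawlOrScrape(url);               // hypothetical helper
      returnData.push({ json: result, pairedItem: { item: i } });
    } catch (error) {
      if (this.continueOnFail()) {
        // Record the failure for this item and keep processing the rest.
        returnData.push({
          json: { status: 'error', message: (error as Error).message },
          pairedItem: { item: i },
        });
        continue;
      }
      throw error;
    }
  }

  return [returnData];
}

// Hypothetical placeholder so the sketch type-checks on its own.
async function crawlOrScrape(url: string): Promise<IDataObject> {
  return { status: 'success', data: { url } };
}
```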
Resolutions:
- Verify and correct the URL format.
- Use appropriate headers or proxies if the target site blocks bots (a sketch follows this list).
- Adjust the max depth or reduce the number of requests to avoid overloading the target server.
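If a site rejects the default client, crawlee lets you route requests through a proxy and adjust outgoing headers. The sketch below shows the general approach; the proxy URL and User-Agent string are placeholders, and the node itself may not expose these options directly.

```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Sketch: custom headers and a proxy for sites that block automated requests.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: ['http://my-proxy.example.com:8000'], // placeholder proxy URL
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  preNavigationHooks: [
    async (_crawlingContext, gotOptions) => {
      // Present a browser-like User-Agent; adjust to the target site's policy.
      gotOptions.headers = {
        ...gotOptions.headers,
        'user-agent': 'Mozilla/5.0 (compatible; ExampleWorkflowBot/1.0)',
      };
    },
  ],
  async requestHandler({ $ }) {
    // ... extraction logic ...
  },
});
```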
Links and References
- Crawlee GitHub Repository – For detailed documentation on the crawling library used.
- n8n Documentation – General guidance on creating and using custom nodes.