
Crawl and Scrape

Crawl websites and extract data

Overview

This node crawls and scrapes web pages using one of several extraction operations. It is useful when you need to gather data from websites programmatically, such as collecting all links on a page, extracting visible text content, or retrieving the raw HTML markup.

Practical examples include:

  • Extracting all hyperlinks from a webpage to analyze site structure or gather URLs for further processing.
  • Extracting textual content from articles or blog posts for sentiment analysis or summarization.
  • Retrieving full HTML content for offline archiving or detailed parsing.
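
At a glance, the three operations correspond to standard static-HTML extraction patterns. The sketch below illustrates them with cheerio (the parser used by crawlee's CheerioCrawler); it is an illustration of the idea rather than the node's actual implementation, and https://example.com is a placeholder URL.

    import * as cheerio from 'cheerio';

    // Fetch a page and parse its static HTML (no JavaScript execution).
    const url = 'https://example.com';
    const html = await (await fetch(url)).text();
    const $ = cheerio.load(html);

    // "Extract Links": unique hrefs from anchor tags, resolved against the page URL.
    const links = [...new Set(
      $('a[href]')
        .map((_, el) => new URL($(el).attr('href')!, url).toString())
        .get(),
    )];

    // "Extract Text": the visible text content of the page body.
    const text = $('body').text().trim();

    // "Extract HTML": the full serialized markup.
    const rawHtml = $.html();

    console.log({ url, linkCount: links.length, textLength: text.length, htmlLength: rawHtml.length });

Pages that render their content with client-side JavaScript will yield little or nothing with this approach, which matches the limitation noted under Troubleshooting.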

Properties

  • URL: The web address to crawl or scrape.
  • Operation: The type of extraction to perform:
    - Extract Links: Get all links from the page.
    - Extract Text: Get all visible text content.
    - Extract HTML: Get the raw HTML source code.
  • Max Depth: Maximum depth level for crawling linked pages (only for the "Extract Links" operation).
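
As a rough mental model only, the properties can be pictured as the TypeScript shape below; the field names are illustrative assumptions, not the node's internal parameter keys.

    // Illustrative only: the real internal parameter names are not documented here.
    type CrawlOperation = 'extractLinks' | 'extractText' | 'extractHtml';

    interface CrawlAndScrapeSettings {
      url: string;               // The web address to crawl or scrape
      operation: CrawlOperation; // Which extraction to perform
      maxDepth?: number;         // Only meaningful for 'extractLinks'
    }

    const settings: CrawlAndScrapeSettings = {
      url: 'https://example.com',
      operation: 'extractLinks',
      maxDepth: 2,
    };

    console.log(settings);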

Output

The node outputs JSON objects with one of the following structures, depending on the selected operation:

  • Extract Links

    {
      "status": "success",
      "message": "Crawling finished",
      "data": {
        "url": "<input URL>",
        "links": ["<list of unique extracted URLs>"]
      }
    }
    

    Contains the original URL and an array of unique links found during crawling.

  • Extract Text

    {
      "status": "success",
      "message": "Text extraction finished",
      "data": {
        "url": "<input URL>",
        "text": "<extracted visible text content>"
      }
    }
    

    Contains the original URL and the extracted plain text content from the page body.

  • Extract HTML

    {
      "status": "success",
      "message": "HTML extraction finished",
      "data": {
        "url": "<input URL>",
        "html": "<raw HTML content>"
      }
    }
    

    Contains the original URL and the full HTML source of the page.

No binary data output is produced by this node.
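
Downstream nodes read these fields from each item's json. Below is a minimal sketch of an n8n Code node ("Run Once for All Items" mode) placed directly after this node and assuming the "Extract Links" operation; it fans the links array out into one item per URL.

    // n8n Code node, "Run Once for All Items" mode.
    const results = [];

    for (const item of $input.all()) {
      const data = item.json.data ?? {};
      for (const link of data.links ?? []) {
        // One output item per discovered link, keeping the source URL for context.
        results.push({ json: { source: data.url, link } });
      }
    }

    return results;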

Dependencies

  • Uses the crawlee library for crawling and scraping (a minimal usage sketch follows this list).
  • Requires network access to the target URLs.
  • No API keys or external service credentials are required.
  • Runs within the n8n environment using standard HTTP request capabilities.
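
For orientation, the snippet below is a minimal crawlee sketch in the spirit of the "Extract Links" operation. It is not the node's source code, only an illustration of how the underlying library is typically used.

    import { CheerioCrawler } from 'crawlee';

    const links = new Set<string>();

    const crawler = new CheerioCrawler({
      maxRequestsPerCrawl: 20, // safety cap for the example
      async requestHandler({ request, $, enqueueLinks }) {
        // Collect absolute URLs from anchor tags on the current page.
        $('a[href]').each((_, el) => {
          const href = $(el).attr('href');
          if (href) links.add(new URL(href, request.loadedUrl ?? request.url).toString());
        });
        // Follow same-domain links, subject to the crawler's limits.
        await enqueueLinks({ strategy: 'same-domain' });
      },
    });

    await crawler.run(['https://example.com']);
    console.log([...links]);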

Troubleshooting

  • Common issues:

    • Invalid or unreachable URLs cause errors during crawling or extraction.
    • Pages that require authentication or JavaScript rendering may not be scraped correctly, because the node relies on static HTML parsing.
    • Network timeouts or slow responses can exceed the request handler timeout (30 seconds by default).
  • Error messages:

    • Errors related to URL parsing or network failures will be logged and can cause the node to fail unless "Continue On Fail" is enabled.
    • If extraction fails, the error message includes details about the failure and the item index in the workflow.
  • Resolutions:

    • Verify URLs are correct and accessible.
    • Use proxy or authentication mechanisms if required by the target site.
    • Increase timeout settings if necessary (not currently exposed as node properties; see the sketch after this list).
    • Enable "Continue On Fail" to allow partial results when some items fail.
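
The node itself does not expose timeout settings, but if you maintain a fork of it or call crawlee directly, the relevant crawlee options are requestHandlerTimeoutSecs and maxRequestRetries. A minimal sketch:

    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
      requestHandlerTimeoutSecs: 120, // allow slower pages to finish
      maxRequestRetries: 5,           // retry transient network failures
      async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
      },
    });

    await crawler.run(['https://example.com']);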

Links and References

  • crawlee documentation: https://crawlee.dev/
  • n8n documentation: https://docs.n8n.io/
