Crawl and Scrape

Crawl websites and extract data

Overview

This node enables crawling and scraping of web pages to extract specific types of data. It supports three main operations: extracting all links from a page, extracting the visible text content, and extracting the raw HTML content. This is useful for scenarios such as gathering URLs for further crawling, collecting textual information for analysis, or retrieving full HTML markup for custom parsing.

Practical examples include:

  • Extracting all outbound links from a news article to analyze referenced sources.
  • Scraping product descriptions or reviews as plain text for sentiment analysis.
  • Downloading the full HTML of a webpage to feed into another system for detailed processing.
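Under the hood the node relies on the crawlee library (see Dependencies). As a rough illustration of what the "Extract Links" operation does, the TypeScript sketch below collects the unique hyperlinks from a single page with crawlee's CheerioCrawler. The function name and crawler options are illustrative assumptions, not the node's actual source.

    import { CheerioCrawler } from 'crawlee';

    // Illustrative sketch: collect the unique links from a single page,
    // roughly what the "Extract Links" operation returns in data.links.
    async function extractLinks(startUrl: string): Promise<string[]> {
      const links = new Set<string>();

      const crawler = new CheerioCrawler({
        maxRequestsPerCrawl: 1, // only the start page in this sketch
        async requestHandler({ request, $ }) {
          $('a[href]').each((_, el) => {
            const href = $(el).attr('href');
            if (!href) return;
            try {
              // Resolve relative links against the loaded page URL.
              links.add(new URL(href, request.loadedUrl ?? request.url).toString());
            } catch {
              // Ignore hrefs that cannot be parsed as URLs.
            }
          });
        },
      });

      await crawler.run([startUrl]);
      return [...links];
    }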

Properties

  • URL: The web address of the page to crawl or scrape.
  • Operation: The type of extraction to perform on the page. Options: "Extract Links", "Extract Text", "Extract HTML".
  • Max Depth: The maximum depth level for crawling linked pages (only used by the "Extract Links" operation).
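How Max Depth is enforced internally is not documented here. A common crawlee pattern, shown below as an assumption rather than the node's actual implementation, is to carry a depth counter in each request's userData and stop enqueueing new links once the limit is reached.

    import { CheerioCrawler } from 'crawlee';

    // Illustrative sketch: follow links only down to maxDepth levels.
    function buildDepthLimitedCrawler(maxDepth: number) {
      return new CheerioCrawler({
        async requestHandler({ request, enqueueLinks }) {
          const depth = (request.userData.depth as number) ?? 0;

          // ...extract and store the links of the current page here...

          if (depth < maxDepth) {
            // Queue same-domain links one level deeper than the current page.
            await enqueueLinks({
              strategy: 'same-domain',
              userData: { depth: depth + 1 },
            });
          }
        },
      });
    }

Under this reading, buildDepthLimitedCrawler(1).run(['https://example.com']) would visit the start page plus the pages it links to, but nothing deeper; the exact semantics of the node's Max Depth value may differ.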

Output

The output is an array of JSON objects, each representing the result of the extraction for one input item. The structure depends on the selected operation:

  • Extract Links

    {
      "status": "success",
      "message": "Crawling finished",
      "data": {
        "url": "input URL",
        "links": ["list of unique extracted URLs"]
      }
    }
    

    Contains the original URL and an array of all unique hyperlinks found on that page.

  • Extract Text

    {
      "status": "success",
      "message": "Text extraction finished",
      "data": {
        "url": "input URL",
        "text": "all visible text content from the page"
      }
    }
    

    Contains the original URL and the concatenated visible text content extracted from the page body; a sketch of the kind of cleanup this involves follows the Output section.

  • Extract HTML

    {
      "status": "success",
      "message": "HTML extraction finished",
      "data": {
        "url": "input URL",
        "html": "raw HTML source of the page"
      }
    }
    

    Contains the original URL and the full raw HTML markup of the page.

No binary data output is produced by this node.
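The difference between "Extract Text" and "Extract HTML" is essentially whether the markup is stripped. As a rough illustration of the cleanup involved, the sketch below uses cheerio (the HTML parser behind crawlee's CheerioCrawler) to drop non-rendered elements and collapse whitespace; "Extract HTML", by contrast, would return the response body unchanged. The helper name is hypothetical and the node's exact cleanup rules may differ.

    import * as cheerio from 'cheerio';

    // Illustrative sketch: reduce an HTML document to its visible body text.
    function extractVisibleText(html: string): string {
      const $ = cheerio.load(html);

      // Script, style and noscript contents are never rendered as text.
      $('script, style, noscript').remove();

      // Collapse the whitespace runs left behind by the removed markup.
      return $('body').text().replace(/\s+/g, ' ').trim();
    }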

Dependencies

  • Uses the crawlee library for crawling and scraping functionality.
  • Requires network access to the target URLs.
  • No special API keys or credentials are needed unless accessing protected resources.
  • Runs within the n8n environment with standard HTTP request capabilities.

Troubleshooting

  • Common issues:

    • Invalid or unreachable URLs will cause errors during crawling.
    • Pages that rely heavily on JavaScript may return incomplete content, since the crawler parses static HTML and does not execute scripts.
    • Network timeouts or slow responses can lead to request handler timeouts.
  • Error messages:

    • Errors related to URL parsing indicate malformed URLs; ensure the URL is valid and complete.
    • Timeout errors suggest increasing timeout settings or checking network connectivity; see the sketch after this list for the relevant crawlee options.
    • If the node fails but "Continue On Fail" is enabled, it will output error details per item instead of stopping execution.
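If you are adapting the underlying crawler yourself, crawlee exposes timeout and retry options that correspond to the timeout symptoms above; whether the n8n node surfaces equivalent settings depends on its implementation. The values below are illustrative:

    import { CheerioCrawler } from 'crawlee';

    // Illustrative sketch: give slow pages more time before the crawler gives up.
    const crawler = new CheerioCrawler({
      navigationTimeoutSecs: 60,      // time allowed for the HTTP request itself
      requestHandlerTimeoutSecs: 120, // time allowed to process each page
      maxRequestRetries: 2,           // retry transient network failures
      async requestHandler() {
        // ...extraction logic as in the sketches above...
      },
    });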

Links and References
