Overview
This node enables crawling and scraping of web pages to extract specific data such as links, text content, or raw HTML. It is useful for scenarios where you need to gather information from websites automatically, like collecting all hyperlinks on a page, extracting readable text for analysis, or retrieving the full HTML markup for further processing.
Practical examples include:
- Extracting all outbound links from a blog post to analyze link structure.
- Scraping article text from news sites for sentiment analysis.
- Downloading raw HTML of product pages for custom parsing.
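Since the node is built on crawlee (see Dependencies), a link-extraction run roughly corresponds to the sketch below. This is a minimal illustration, not the node's actual source: the `extractLinks` helper and the depth tracking via `request.userData` are assumptions, while the 100-request cap matches the limit mentioned under Troubleshooting.

```typescript
import { CheerioCrawler } from 'crawlee';

// Minimal sketch: collect unique links up to a given depth with crawlee.
// The helper name and userData-based depth tracking are illustrative assumptions.
async function extractLinks(startUrl: string, maxDepth: number): Promise<string[]> {
  const links = new Set<string>();

  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100, // matches the per-crawl limit noted in Troubleshooting
    async requestHandler({ request, $, enqueueLinks }) {
      // Resolve each href against the page URL and record it.
      $('a[href]').each((_, el) => {
        const href = $(el).attr('href');
        if (href) {
          try {
            links.add(new URL(href, request.loadedUrl ?? request.url).href);
          } catch {
            // Skip hrefs that are not valid URLs (e.g. "javascript:void(0)").
          }
        }
      });

      // Follow discovered links until the configured depth is reached.
      const depth = (request.userData.depth as number) ?? 0;
      if (depth < maxDepth) {
        await enqueueLinks({ userData: { depth: depth + 1 } });
      }
    },
  });

  await crawler.run([{ url: startUrl, userData: { depth: 0 } }]);
  return [...links];
}
```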
Properties
| Name | Meaning |
| --- | --- |
| URL | The web address of the page to crawl or scrape. |
| Max Depth | Maximum depth of crawling (only applicable when extracting links). Defines how deep the crawler should follow links starting from the initial URL. |
Output
The node outputs JSON objects with one of the following structures, depending on the selected operation:
Extract Links
{ "status": "success", "message": "Crawling finished", "data": { "url": "string", // The original URL crawled "links": ["string"] // Array of unique extracted URLs found on the page(s) } }
Extract Text
{ "status": "success", "message": "Text extraction finished", "data": { "url": "string", // The URL scraped "text": "string" // All visible text content extracted from the page body } }
Extract HTML
{ "status": "success", "message": "HTML extraction finished", "data": { "url": "string", // The URL scraped "html": "string" // Raw HTML content of the page } }
No binary data output is produced by this node.
Dependencies
- This node depends on the `crawlee` library for crawling and scraping functionality.
- It requires internet access to fetch web pages.
- No special API keys or credentials are needed unless accessing protected resources.
- The node uses internal request throttling and timeout settings to manage crawling.
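The throttling and timeout knobs referred to above are standard crawlee crawler options. The values below are illustrative, not necessarily what the node ships with:

```typescript
import { CheerioCrawler } from 'crawlee';

// Example throttling/timeout configuration; the specific values are illustrative.
const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100,      // hard cap on requests per crawl
  maxConcurrency: 10,            // limit parallel requests to the target site
  maxRequestRetries: 2,          // retry transient failures a few times
  requestHandlerTimeoutSecs: 30, // abort a page whose handler runs too long
  navigationTimeoutSecs: 30,     // abort slow HTTP requests
  async requestHandler({ $ }) {
    // ... extraction logic for the selected operation ...
  },
});
```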
Troubleshooting
Common issues:
- Invalid or malformed URLs can cause errors during crawling.
- Target websites blocking automated requests may result in empty or failed responses.
- Network timeouts can occur if the site is slow or unreachable.
- Each crawl is capped at 100 requests, so results from large sites may be incomplete.
Error messages:
- Errors related to URL parsing indicate invalid input URLs.
- Timeout errors usually mean the site is slow or unreachable; increase the timeout or check network connectivity.
- If the node fails but "Continue On Fail" is enabled, it returns error details per item instead of stopping execution (see the sketch after this list).
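The "Continue On Fail" behaviour follows the usual n8n node pattern, sketched below. The `crawlOrScrape` helper and the `url` parameter name are hypothetical stand-ins for the node's real implementation; only `continueOnFail()` and the per-item error output reflect standard n8n APIs.

```typescript
import type { IDataObject, IExecuteFunctions, INodeExecutionData } from 'n8n-workflow';

// Sketch of per-item error handling with "Continue On Fail" enabled.
export async function execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]> {
  const items = this.getInputData();
  const returnData: INodeExecutionData[] = [];

  for (let i = 0; i < items.length; i++) {
    try {
      const url = this.getNodeParameter('url', i) as string; // parameter name assumed
      const result = await crawlOrScrape(url);               // hypothetical helper
      returnData.push({ json: result, pairedItem: { item: i } });
    } catch (error) {
      if (this.continueOnFail()) {
        // Record the failure for this item and keep processing the rest.
        returnData.push({
          json: { status: 'error', message: (error as Error).message },
          pairedItem: { item: i },
        });
        continue;
      }
      throw error;
    }
  }

  return [returnData];
}

// Hypothetical placeholder so the sketch type-checks on its own.
async function crawlOrScrape(url: string): Promise<IDataObject> {
  return { status: 'success', data: { url } };
}
```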
Resolutions:
- Verify and correct the URL format.
- Use appropriate headers or proxies if the target site blocks bots (a sketch follows this list).
- Adjust the max depth or reduce the number of requests to avoid overloading the target server.
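If a site rejects the default client, crawlee lets you route requests through a proxy and adjust outgoing headers. The sketch below shows the general approach; the proxy URL and User-Agent string are placeholders, and the node itself may not expose these options directly.

```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Sketch: custom headers and a proxy for sites that block automated requests.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: ['http://my-proxy.example.com:8000'], // placeholder proxy URL
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  preNavigationHooks: [
    async (_crawlingContext, gotOptions) => {
      // Present a browser-like User-Agent; adjust to the target site's policy.
      gotOptions.headers = {
        ...gotOptions.headers,
        'user-agent': 'Mozilla/5.0 (compatible; ExampleWorkflowBot/1.0)',
      };
    },
  ],
  async requestHandler({ $ }) {
    // ... extraction logic ...
  },
});
```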
Links and References
- Crawlee GitHub Repository – For detailed documentation on the crawling library used.
- n8n Documentation – General guidance on creating and using custom nodes.