Overview
This node enables crawling and scraping of web pages by fetching a specified URL and extracting data from it. It supports three main operations: extracting all links on the page, extracting the visible text content, or extracting the raw HTML content. This is useful for gathering URLs for further crawling, retrieving textual information for analysis, or capturing the full HTML markup for processing or archiving. (A small sketch of what each operation returns follows the examples below.)
Practical examples include:
- Extracting all hyperlinks from a news article to find related stories.
- Scraping the visible text from a product page for sentiment analysis.
- Downloading the complete HTML of a webpage for offline storage or detailed parsing.
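Conceptually, the three operations correspond to the following extraction steps on a fetched page. The snippet below is a minimal illustration using Cheerio (which the node relies on internally via Crawlee), not the node's actual code:

```typescript
import * as cheerio from 'cheerio';

// Raw HTML of a fetched page (normally downloaded by the crawler).
const html = '<html><body><a href="/about">About</a><p>Hello world</p></body></html>';
const $ = cheerio.load(html);

// Extract Links: all href values (here still relative; the node returns absolute URLs).
const links = $('a[href]').map((_, el) => $(el).attr('href')).get();

// Extract Text: the visible text content of the body.
const text = $('body').text();

// Extract HTML: the full markup as a string.
const fullHtml = $.html();

console.log({ links, text, fullHtml });
```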
Properties
| Name | Meaning |
|---|---|
| URL | The URL to crawl or scrape. Must be a valid web address. |
| Operation | The type of extraction to perform. Options: **Extract Links** (get all links on the page), **Extract Text** (get all visible text content), **Extract HTML** (get the raw HTML source code of the page). |
| Max Depth | Maximum depth of crawling when following links. Only applies to the "Extract Links" operation. |
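For orientation, this is roughly how such properties are read inside an n8n node's execute() method. The parameter keys ('url', 'operation', 'maxDepth') and the operation values used here are illustrative assumptions, not names confirmed from this node's source:

```typescript
import type { IExecuteFunctions } from 'n8n-workflow';

// Hypothetical helper: reads the three properties for one input item.
export function readCrawlParameters(ctx: IExecuteFunctions, itemIndex: number) {
  const url = ctx.getNodeParameter('url', itemIndex) as string;
  const operation = ctx.getNodeParameter('operation', itemIndex) as
    | 'extractLinks'
    | 'extractText'
    | 'extractHtml';

  // Max Depth only applies to the "Extract Links" operation.
  const maxDepth =
    operation === 'extractLinks'
      ? (ctx.getNodeParameter('maxDepth', itemIndex, 1) as number)
      : undefined;

  return { url, operation, maxDepth };
}
```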
Output
The node outputs a JSON object whose structure depends on the selected operation:
Extract Links
{ "status": "success", "message": "Crawling finished", "data": { "url": "<input URL>", "links": ["<list of unique absolute URLs extracted>"] } }
Contains the original URL and an array of all unique links found on that page.
Extract Text
{ "status": "success", "message": "Text extraction finished", "data": { "url": "<input URL>", "text": "<all visible text content from the page>" } }
Contains the original URL and the concatenated text content extracted from the body of the page.
Extract HTML
{ "status": "success", "message": "HTML extraction finished", "data": { "url": "<input URL>", "html": "<raw HTML source code of the page>" } }
Contains the original URL and the full HTML markup as a string.
No binary data output is produced by this node.
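The data object can then be consumed with ordinary n8n expressions or a Code node. Here is a minimal sketch for the "Extract Links" case, written as the body of a Code node (field names follow the structures shown above; drop the type annotation to use it as plain JavaScript):

```typescript
// n8n Code node ("Run Once for All Items") consuming the "Extract Links" output.
const allLinks: string[] = [];

for (const item of $input.all()) {
  if (item.json.status === 'success') {
    allLinks.push(...item.json.data.links);
  }
}

// Emit one item per discovered link so later nodes (e.g. another crawl) can fan out.
return allLinks.map((link) => ({ json: { url: link } }));
```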
Dependencies
- Uses the crawlee library internally for crawling and scraping functionality.
- Requires internet access to fetch the target URLs.
- No explicit API keys or credentials are needed unless accessing protected resources.
- No additional n8n environment variables or configurations are required.
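For context, a bare-bones link extraction with Crawlee looks roughly like the following. This is a simplified sketch of the kind of call the node makes internally, not its actual implementation:

```typescript
import { CheerioCrawler } from 'crawlee';

const links = new Set<string>();

const crawler = new CheerioCrawler({
  // Parse each fetched page with Cheerio and collect absolute link URLs.
  async requestHandler({ request, $ }) {
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href');
      if (href) {
        // Resolve relative links against the page URL so only absolute URLs are kept.
        links.add(new URL(href, request.loadedUrl ?? request.url).toString());
      }
    });
  },
});

await crawler.run(['https://example.com']);
console.log([...links]);
```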
Troubleshooting
Common issues:
- Invalid or unreachable URLs will cause errors during crawling.
- Pages that require authentication or JavaScript rendering may not return the expected results, since the node relies on static HTML parsing.
- Network timeouts or slow responses can lead to request handler timeouts (default 30 seconds).
Error messages:
- Errors during crawling or extraction are logged; they stop the workflow unless "Continue On Fail" is enabled, in which case they are handled gracefully and execution continues.
- Typical errors include request failures or URLs that cannot be parsed.
Resolutions:
- Verify the correctness and accessibility of the input URL.
- Increase timeout settings if necessary (not exposed in the current node properties).
- Use proxy or authentication mechanisms externally if needed before feeding URLs to this node; both points are illustrated in the Crawlee sketch below.
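If you fork or extend the node, Crawlee itself exposes options for both of these points. The sketch below uses Crawlee's documented requestHandlerTimeoutSecs and proxyConfiguration options; the proxy URL and timeout value are placeholders:

```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new CheerioCrawler({
  // Raise the per-request handler timeout beyond the node's default.
  requestHandlerTimeoutSecs: 120,
  // Route requests through a proxy, e.g. for geo-blocked or rate-limited sites.
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: ['http://my-proxy.example.com:8000'],
  }),
  async requestHandler({ request, body }) {
    console.log(`Fetched ${request.url} (${body.length} bytes)`);
  },
});

await crawler.run(['https://example.com']);
```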
Links and References
- Crawlee GitHub Repository – underlying crawling library used.
- Cheerio Documentation – for understanding HTML parsing used internally.
- n8n Documentation – general guidance on creating and using custom nodes.