Overview
This node, named "Spider," is designed for web scraping and data processing tasks. It supports two main resources: URL collection rules ("网址采集规则") and data processing rules. When using the URL resource, it collects URLs based on user-provided start URLs and optional filtering criteria. For the data resource, it processes input data according to specified mapping rules, including text transformations and image downloading.
Common scenarios where this node is beneficial include:
- Automatically gathering multiple URLs from a list of starting points for further processing or crawling.
- Processing scraped HTML or JSON data by applying transformation rules, such as extracting metadata or downloading images referenced in the data.
- Preparing datasets by cleaning or enriching data fields before passing them to subsequent workflow nodes.
Practical example:
- You provide a list of seed URLs (one per line) to the node under the URL resource. The node fetches and filters these URLs according to existence or non-existence patterns, returning a unique set of URLs for further crawling (sketched after this list).
- Alternatively, you input scraped data and define mapping rules that transform text fields or download images embedded in the data, automating content preparation.
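To make the URL-resource flow concrete, here is a minimal TypeScript sketch of splitting seed URLs, applying the existence/non-existence filters, and deduplicating. The function name `collectUrls` and the substring semantics of `exist`/`noExist` are illustrative assumptions, not the node's actual implementation:

```typescript
// Hypothetical sketch of the URL-resource flow: split seed input,
// apply exist/noExist substring filters, deduplicate.
function collectUrls(rawInput: string, exist?: string, noExist?: string): string[] {
  const seeds = rawInput
    .split('\n')                    // one seed URL per line
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // drop empty lines

  const filtered = seeds.filter((url) => {
    if (exist && !url.includes(exist)) return false;    // must contain pattern
    if (noExist && url.includes(noExist)) return false; // must not contain pattern
    return true;
  });

  return [...new Set(filtered)]; // unique set of URLs
}

// Example: keep only article pages, drop login links, dedupe repeats.
const urls = collectUrls(
  'https://example.com/a/1\nhttps://example.com/login\nhttps://example.com/a/1',
  '/a/',
  '/login'
);
// urls === ['https://example.com/a/1']
```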
Properties
Name | Meaning |
---|---|
采集网址 (url) | The starting URL(s) for scraping. Multiple URLs can be entered, one per line. |
Note: The node also supports other properties related to URL filtering (`start`, `end`, `exist`, `noExist`) and data mapping (`map`), but only the "采集网址" property was provided explicitly.
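Taken together, the property names above suggest a parameter shape along the following lines. This is an inferred sketch only; the actual n8n parameter schema (types, defaults, display names) may differ:

```typescript
// Inferred parameter shape for the URL resource; the names come from the
// note above, but types and optionality are assumptions.
interface SpiderUrlParams {
  url: string;      // 采集网址: seed URLs, one per line (required)
  start?: string;   // URL filtering boundary (semantics assumed)
  end?: string;     // URL filtering boundary (semantics assumed)
  exist?: string;   // keep URLs containing this pattern
  noExist?: string; // drop URLs containing this pattern
}

// The data resource takes mapping rules instead; shape likewise assumed.
interface SpiderDataParams {
  map: Array<{ field: string; transform: string }>;
}
```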
Output
The node outputs an array of JSON objects representing the processed results:
- For the URL resource, the output JSON contains unique URLs collected after applying the filtering rules. Each item includes a `currentUrl` field among others.
- For the data resource, the output JSON contains transformed data fields according to the mapping rules. If images are downloaded, their paths are updated accordingly in the output data.
If images are involved, the node handles downloading them concurrently and updates references in the data to point to the downloaded locations.
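For illustration, the URL-resource output might look like the following. Only `currentUrl` is confirmed by the description above; the type name and any other fields are placeholders:

```typescript
// Illustrative output shape only; fields other than currentUrl are assumptions.
type SpiderUrlOutput = Array<{
  currentUrl: string;      // a unique collected URL
  [key: string]: unknown;  // additional fields included by the node
}>;

const example: SpiderUrlOutput = [
  { currentUrl: 'https://example.com/a/1' },
  { currentUrl: 'https://example.com/a/2' },
];
```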
Dependencies
- Requires network access to fetch URLs and download images.
- Uses concurrency control for efficient parallel processing.
- Depends on external libraries for URL parsing, path handling, HTML parsing (Cheerio), and promise mapping (see the sketch after this list).
- Requires configuration of API keys or authentication tokens if the target URLs or APIs need authorization (not shown explicitly in the code).
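The combination of concurrency control, HTML parsing, and promise mapping could look like this minimal sketch. It assumes `p-map` as the promise-mapping library and Node 18+ for the global `fetch`; `extractLinks` is an illustrative name, not the node's API:

```typescript
import * as cheerio from 'cheerio';
import pMap from 'p-map';

// Minimal sketch: fetch seed pages with a concurrency limit, parse each
// page with Cheerio, and collect absolute link URLs.
async function extractLinks(seedUrls: string[]): Promise<string[]> {
  const pages = await pMap(
    seedUrls,
    async (url) => {
      const res = await fetch(url); // Node 18+ global fetch
      return { url, html: await res.text() };
    },
    { concurrency: 5 } // matches the limit described under Troubleshooting
  );

  const links = new Set<string>();
  for (const { url, html } of pages) {
    const $ = cheerio.load(html);
    $('a[href]').each((_, el) => {
      // Resolve relative hrefs against the page URL (Node's URL module).
      links.add(new URL($(el).attr('href')!, url).toString());
    });
  }
  return [...links];
}
```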
Troubleshooting
- Empty or invalid URLs: Ensure the input URLs are correctly formatted and accessible. Invalid URLs or empty lines are filtered out.
- Concurrency limits: The node uses a concurrency limit of 5 for fetching URLs and downloading images. If the target server restricts connections, consider adjusting concurrency or adding delays.
- Image download failures: If images fail to download, verify the source URLs and destination paths. Network issues or permission errors may cause failures (a download sketch follows this list).
- Data mapping errors: Incorrect mapping rules or unsupported transformation types may cause processing errors. Review the mapping configuration carefully.
- Missing required parameters: The "采集网址" property is required for the URL resource; missing it will prevent execution.
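For the image-download path, here is a hedged sketch of concurrent downloading with reference rewriting. It assumes `p-map`, Node 18+ `fetch`, and an `image` field on each item; `downloadImages` and `destDir` are hypothetical names:

```typescript
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { Readable } from 'node:stream';
import * as path from 'node:path';
import pMap from 'p-map';

// Sketch: download each item's image with bounded concurrency, stream it to
// disk, and rewrite the item's reference to the downloaded location.
async function downloadImages(
  items: Array<{ image: string }>,
  destDir: string
): Promise<Array<{ image: string }>> {
  return pMap(
    items,
    async (item) => {
      const dest = path.join(destDir, path.basename(new URL(item.image).pathname));
      const res = await fetch(item.image);
      if (!res.ok || !res.body) throw new Error(`Download failed: ${item.image}`);
      // Stream the response body to disk instead of buffering it in memory.
      await pipeline(Readable.fromWeb(res.body as any), createWriteStream(dest));
      // Point the data at the downloaded file.
      return { ...item, image: dest };
    },
    { concurrency: 5 }
  );
}
```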
Links and References
- [Cheerio - jQuery for Node.js](https://cheerio.js.org/)
- [Node.js URL module](https://nodejs.org/api/url.html)
- [n8n Documentation](https://docs.n8n.io/) (for general node usage and credential setup)