Overview
This node, named "Spider," is designed for web scraping and data processing tasks. It supports two main resources: URL collection rules ("网址采集规则") and data processing rules. When using the URL resource, it collects URLs based on user-provided start URLs and optional filtering criteria. For the data resource, it processes input data according to specified mapping rules, including text transformations and image downloading.
Common scenarios where this node is beneficial include:
- Automatically gathering multiple URLs from a list of starting points for further processing or crawling.
- Processing scraped HTML or JSON data by applying transformation rules, such as extracting metadata or downloading images referenced in the data.
- Preparing datasets by cleaning or enriching data fields before passing them to subsequent workflow nodes.
Practical example:
- You provide a list of seed URLs (one per line) to the node under the URL resource. The node fetches and filters these URLs according to existence or non-existence patterns, returning a unique set of URLs for further crawling (sketched after this list).
- Alternatively, you input scraped data and define mapping rules that transform text fields or download images embedded in the data, automating content preparation.
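To make the URL-resource flow concrete, here is a minimal TypeScript sketch of splitting seed URLs, applying the existence/non-existence filters, and deduplicating. The function name `collectUrls` and the substring semantics of `exist`/`noExist` are illustrative assumptions, not the node's actual implementation:

```typescript
// Hypothetical sketch of the URL-resource flow: split seed input,
// apply exist/noExist substring filters, deduplicate.
function collectUrls(rawInput: string, exist?: string, noExist?: string): string[] {
  const seeds = rawInput
    .split('\n')                    // one seed URL per line
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // drop empty lines

  const filtered = seeds.filter((url) => {
    if (exist && !url.includes(exist)) return false;    // must contain pattern
    if (noExist && url.includes(noExist)) return false; // must not contain pattern
    return true;
  });

  return [...new Set(filtered)]; // unique set of URLs
}

// Example: keep only article pages, drop login links, dedupe repeats.
const urls = collectUrls(
  'https://example.com/a/1\nhttps://example.com/login\nhttps://example.com/a/1',
  '/a/',
  '/login'
);
// urls === ['https://example.com/a/1']
```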
Properties
Name | Meaning |
---|---|
采集网址 (url) | The starting URL(s) for scraping. Multiple URLs can be entered, one per line. |
Note: The node also supports other properties related to URL filtering (`start`, `end`, `exist`, `noExist`) and data mapping (`map`), but only the "采集网址" property was provided explicitly.
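Taken together, the property names above suggest a parameter shape along the following lines. This is an inferred sketch only; the actual n8n parameter schema (types, defaults, display names) may differ:

```typescript
// Inferred parameter shape for the URL resource; the names come from the
// note above, but types and optionality are assumptions.
interface SpiderUrlParams {
  url: string;      // 采集网址: seed URLs, one per line (required)
  start?: string;   // URL filtering boundary (semantics assumed)
  end?: string;     // URL filtering boundary (semantics assumed)
  exist?: string;   // keep URLs containing this pattern
  noExist?: string; // drop URLs containing this pattern
}

// The data resource takes mapping rules instead; shape likewise assumed.
interface SpiderDataParams {
  map: Array<{ field: string; transform: string }>;
}
```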
Output
The node outputs an array of JSON objects representing the processed results:
- For the URL resource, the output JSON contains unique URLs collected after applying the filtering rules. Each item includes a `currentUrl` field among others.
- For the data resource, the output JSON contains transformed data fields according to the mapping rules. If images are downloaded, their paths are updated accordingly in the output data.
If images are involved, the node handles downloading them concurrently and updates references in the data to point to the downloaded locations.
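For illustration, the URL-resource output might look like the following. Only `currentUrl` is confirmed by the description above; the type name and any other fields are placeholders:

```typescript
// Illustrative output shape only; fields other than currentUrl are assumptions.
type SpiderUrlOutput = Array<{
  currentUrl: string;      // a unique collected URL
  [key: string]: unknown;  // additional fields included by the node
}>;

const example: SpiderUrlOutput = [
  { currentUrl: 'https://example.com/a/1' },
  { currentUrl: 'https://example.com/a/2' },
];
```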
Dependencies
- Requires network access to fetch URLs and download images.
- Uses concurrency control for efficient parallel processing.
- Depends on external libraries for URL parsing, path handling, HTML parsing (Cheerio), and promise mapping (see the sketch after this list).
- Requires configuration of API keys or authentication tokens if the target URLs or APIs need authorization (not shown explicitly in the code).
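The combination of concurrency control, HTML parsing, and promise mapping could look like this minimal sketch. It assumes `p-map` as the promise-mapping library and Node 18+ for the global `fetch`; `extractLinks` is an illustrative name, not the node's API:

```typescript
import * as cheerio from 'cheerio';
import pMap from 'p-map';

// Minimal sketch: fetch seed pages with a concurrency limit, parse each
// page with Cheerio, and collect absolute link URLs.
async function extractLinks(seedUrls: string[]): Promise<string[]> {
  const pages = await pMap(
    seedUrls,
    async (url) => {
      const res = await fetch(url); // Node 18+ global fetch
      return { url, html: await res.text() };
    },
    { concurrency: 5 } // matches the limit described under Troubleshooting
  );

  const links = new Set<string>();
  for (const { url, html } of pages) {
    const $ = cheerio.load(html);
    $('a[href]').each((_, el) => {
      // Resolve relative hrefs against the page URL (Node's URL module).
      links.add(new URL($(el).attr('href')!, url).toString());
    });
  }
  return [...links];
}
```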
Troubleshooting
- Empty or invalid URLs: Ensure the input URLs are correctly formatted and accessible. Invalid URLs or empty lines are filtered out.
- Concurrency limits: The node uses a concurrency limit of 5 for fetching URLs and downloading images. If the target server restricts connections, consider adjusting concurrency or adding delays.
- Image download failures: If images fail to download, verify the source URLs and destination paths. Network issues or permission errors may cause failures (a download sketch follows this list).
- Data mapping errors: Incorrect mapping rules or unsupported transformation types may cause processing errors. Review the mapping configuration carefully.
- Missing required parameters: The "采集网址" property is required for the URL resource; missing it will prevent execution.
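For the image-download path, here is a hedged sketch of concurrent downloading with reference rewriting. It assumes `p-map`, Node 18+ `fetch`, and an `image` field on each item; `downloadImages` and `destDir` are hypothetical names:

```typescript
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { Readable } from 'node:stream';
import * as path from 'node:path';
import pMap from 'p-map';

// Sketch: download each item's image with bounded concurrency, stream it to
// disk, and rewrite the item's reference to the downloaded location.
async function downloadImages(
  items: Array<{ image: string }>,
  destDir: string
): Promise<Array<{ image: string }>> {
  return pMap(
    items,
    async (item) => {
      const dest = path.join(destDir, path.basename(new URL(item.image).pathname));
      const res = await fetch(item.image);
      if (!res.ok || !res.body) throw new Error(`Download failed: ${item.image}`);
      // Stream the response body to disk instead of buffering it in memory.
      await pipeline(Readable.fromWeb(res.body as any), createWriteStream(dest));
      // Point the data at the downloaded file.
      return { ...item, image: dest };
    },
    { concurrency: 5 }
  );
}
```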
Links and References
- [Cheerio - jQuery for Node.js](https://cheerio.js.org/)
- [Node.js URL module](https://nodejs.org/api/url.html)
- [n8n Documentation](https://docs.n8n.io/) (for general node usage and credential setup)