Overview
This node, named "Spider," provides data processing capabilities focused on transforming input JSON data according to user-defined rules. It is particularly useful for cleaning, extracting, replacing, or augmenting text fields within structured data. Common scenarios include preparing scraped or imported data for further analysis by removing unwanted characters, extracting specific substrings via regex or delimiters, adding prefixes/suffixes, setting default or fixed values, stripping HTML tags, extracting URLs or emails, and downloading images referenced in the data.
For example, if you have a dataset with messy text fields containing HTML tags and embedded image URLs, this node can clean the text, extract the URLs, and optionally download the images to local storage or a specified path prefix.
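As an illustrative sketch (not the node's actual implementation), the cleaning and URL-extraction steps described above can be approximated with plain regular expressions; the function names here are hypothetical:

```javascript
// Hypothetical helpers approximating the stripHtmlTags and extractUrl rules.
function stripHtmlTags(text) {
  // Remove anything that looks like an HTML tag.
  return text.replace(/<[^>]*>/g, '');
}

function extractUrls(text) {
  // Collect http(s) URLs; stops at whitespace, quotes, and angle brackets.
  return text.match(/https?:\/\/[^\s"'<>]+/g) || [];
}

const raw = '<p>See <a href="https://example.com/a.png">image</a></p>';
const clean = stripHtmlTags(raw); // "See image"
const urls = extractUrls(raw);    // ["https://example.com/a.png"]
```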
Properties
Name | Meaning |
---|---|
字段处理 (map) | A collection of field processing rules. Each entry specifies a 字段名称 (the name of the field to process) and 数据处理规则 (one or more processing rules applied sequentially to that field). |

Supported rules:
- 正则提取 (extract): Extract text using a regular expression.
- 正则替换 (replaceText): Replace text matching a pattern.
- 文本提取 (extractText): Extract the substring between a start string and an end string.
- 文本替换 (replace): Replace occurrences of a string.
- 清除换行符 (removeLineBreaks): Remove line breaks.
- 清除超链接 (removeLinks): Remove hyperlinks.
- 添加前缀 (addPrefix): Add a prefix string.
- 添加后缀 (addSuffix): Add a suffix string.
- 设置默认值 (setDefault): Set a default value if the field is empty.
- 设置固定值 (setFixedText): Set a fixed value.
- 清除Html标签 (stripHtmlTags): Remove HTML tags.
- 提取Url链接 (extractUrl): Extract URL links.
- 提取电子邮件 (extractEmail): Extract email addresses.
- 图片下载 (imageDown): Download images referenced in the field, with options to specify a download URL prefix and a rename prefix.
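Because the rules in 数据处理规则 are applied sequentially, each rule receives the output of the previous one. A minimal sketch of that pipeline, using an assumed rule-object shape (`{ type, value }`) that may differ from the node's real parameter format:

```javascript
// Assumed rule shape; the node's actual parameter format may differ.
const rules = [
  { type: 'stripHtmlTags' },
  { type: 'addPrefix', value: 'https://cdn.example.com/' },
];

// Apply rules in order, each receiving the previous rule's output.
function applyRules(value, rules) {
  return rules.reduce((acc, rule) => {
    switch (rule.type) {
      case 'stripHtmlTags': return acc.replace(/<[^>]*>/g, '');
      case 'addPrefix':     return rule.value + acc;
      default:              return acc;
    }
  }, value);
}

applyRules('<b>img.png</b>', rules); // "https://cdn.example.com/img.png"
```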
Output
The node outputs an array of JSON objects corresponding to the processed input items. Each item's JSON will have its specified fields transformed according to the configured rules.
If the "图片下载" (imageDown) rule is used, the node also downloads images found in the field's content. The downloaded images are saved to specified destinations, and the field's content is updated to reflect new image source paths.
No binary output is directly produced by the node; image downloads happen as side effects handled internally.
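The update of image source paths after 图片下载 can be sketched as follows; the renaming scheme here (keep the basename, prepend a local prefix) is an assumption about the node's behavior, not confirmed from the source:

```javascript
// Assumed rewrite: swap each <img src> URL for a local path built from its basename.
function rewriteImageSrcs(html, localPrefix) {
  return html.replace(/src="[^"]*\/([^/"]+)"/g, (match, name) => `src="${localPrefix}${name}"`);
}

rewriteImageSrcs('<img src="https://a.com/x.png">', '/local/'); // '<img src="/local/x.png">'
```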
Dependencies
- Requires internet access if using the image download feature.
- Uses concurrency-limited asynchronous operations to fetch URLs and download images.
- Relies on external libraries:
  - lodash for deep cloning and utility functions.
  - cheerio for parsing and manipulating HTML content.
- Uses Node.js core modules url and path for URL and file path handling.
- No API keys or credentials are required, unless the configured URLs themselves require authentication.
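The concurrency-limited asynchronous fetching mentioned above can be sketched with a simple worker pool. `mapWithLimit` is a hypothetical helper for illustration, not the node's actual function:

```javascript
// Hypothetical worker pool: run fn over items with at most `limit` calls in flight.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

In the node, `fn` would be the actual image download; here any async function works, which makes the pattern easy to test in isolation.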
Troubleshooting
- Empty or unchanged fields after processing: Ensure that the input data contains the expected fields and that the processing rules are correctly configured (e.g., correct regex patterns, start/end strings).
- Image download failures: Check that the image URLs are valid and accessible. Network issues or invalid URLs may cause downloads to fail silently.
- Incorrect replacements or extractions: Verify the correctness of regex expressions and string parameters used in rules.
- Performance issues with large datasets: The node processes items sequentially but uses concurrency limits for downloads; very large inputs may require tuning or splitting into smaller batches.
- HTML parsing errors: If the field content is not valid HTML when using image download, the node might not find images correctly.
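For the start/end extraction case above, the likely semantics of 文本提取 (extractText) can be sketched; returning an empty string when either marker is missing would explain the "empty fields after processing" symptom (this is an assumed behavior, not confirmed from the source):

```javascript
// Sketch of 文本提取 (extractText): substring between start and end markers.
// Returns '' when either marker is missing (assumed behavior).
function extractText(value, start, end) {
  const s = value.indexOf(start);
  if (s === -1) return '';
  const from = s + start.length;
  const e = value.indexOf(end, from);
  return e === -1 ? '' : value.slice(from, e);
}

extractText('id=[42];', '[', ']');        // "42"
extractText('no markers here', '[', ']'); // ""
```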
Links and References
- Cheerio Documentation – For understanding HTML parsing and manipulation.
- Lodash Documentation – Utility functions used for data cloning and manipulation.
- Node.js URL Module – Handling URL parsing.
- Node.js Path Module – File path utilities.
This summary is based solely on static analysis of the provided source code and property definitions.