Spider (爬虫)


Overview

This node, named "Spider," is designed for web scraping and data extraction tasks. It supports two main resources: URL collection rules ("网址采集规则") and data processing rules. With the "网址采集规则" resource and its "列表页" (list page) operation, the node crawls the specified starting URLs and collects links, filtering them by substrings that must or must not appear in each link.

Common scenarios where this node is beneficial include:

  • Crawling multiple web pages starting from given URLs to gather lists of target links.
  • Filtering collected URLs by specifying substrings that must or must not be present in the links.
  • Extracting structured data from web pages for further processing or automation workflows.

Practical example:

  • You want to scrape product listing pages from an e-commerce site. You provide starting URLs (one per line), specify HTML regions to start and end scraping, and filter links to only those containing certain keywords (e.g., "product") while excluding others (e.g., "ads"). The node returns a deduplicated list of URLs matching these criteria.
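
For illustration only, the filtering and deduplication described above might look like the following TypeScript sketch. The substring values mirror the hypothetical example ("product", "ads") and are not node defaults, and the sketch assumes a link is kept when it contains at least one 链接包含 substring and none of the 链接不包含 substrings; the node's actual AND/OR semantics may differ.

// Illustrative sketch of 链接包含 / 链接不包含 filtering plus deduplication.
// The substrings below come from the example above; they are not node defaults.
const mustContain = ["product"];   // 链接包含 (exist)
const mustNotContain = ["ads"];    // 链接不包含 (noExist)

function filterLinks(links: string[]): { currentUrl: string }[] {
  const seen = new Set<string>();
  const kept: { currentUrl: string }[] = [];
  for (const link of links) {
    const included = mustContain.length === 0 || mustContain.some((s) => link.includes(s));
    const excluded = mustNotContain.some((s) => link.includes(s));
    if (included && !excluded && !seen.has(link)) {
      seen.add(link);              // deduplicate
      kept.push({ currentUrl: link });
    }
  }
  return kept;
}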

Properties

  • 采集网址 (url): The starting URLs to begin scraping from. Multiple URLs can be provided, separated by new lines.
  • 开始采集区域 (start): The HTML region or marker indicating where to start collecting links on the page.
  • 结束采集区域 (end): The HTML region or marker indicating where to stop collecting links on the page.
  • 链接包含 (exist): Substrings that must be present in the links to be included. Multiple values separated by new lines.
  • 链接不包含 (noExist): Substrings that must NOT be present in the links to be included. Multiple values separated by new lines.

Note: The last three properties are shown only when the operation is "conventional" under the "url" resource.
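
The following TypeScript sketch shows one way the 开始采集区域 (start) and 结束采集区域 (end) markers could bound link collection, assuming both are matched as plain substrings of the fetched HTML; the node itself relies on a DOM parsing library, so the regex below is purely illustrative.

// Illustrative only: restrict link extraction to the HTML between the start
// and end markers, assuming both are matched as plain substrings.
function collectRegionLinks(html: string, start: string, end: string): string[] {
  const startIdx = start ? html.indexOf(start) : 0;                 // 开始采集区域
  const from = startIdx === -1 ? 0 : startIdx + start.length;
  const endIdx = end ? html.indexOf(end, from) : -1;                // 结束采集区域
  const region = html.slice(from, endIdx === -1 ? undefined : endIdx);

  const links: string[] = [];
  const hrefPattern = /href=["']([^"']+)["']/g;                     // not the node's real parser
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(region)) !== null) {
    links.push(match[1]);
  }
  return links;
}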

Output

The node outputs an array of JSON objects representing the collected URLs after filtering and deduplication. Each object contains at least a currentUrl field with the URL string.

Example output structure:

[
  {
    "currentUrl": "https://example.com/page1"
  },
  {
    "currentUrl": "https://example.com/page2"
  }
]

No binary data output is produced by this operation.
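
Downstream steps can treat each item as a plain object with a currentUrl string. A minimal TypeScript sketch of consuming this output follows (the SpiderUrlItem name is invented here for illustration):

// Shape of each item produced by this operation (name is illustrative).
interface SpiderUrlItem {
  currentUrl: string;
}

// Pull the plain URL strings out of the node's output array.
function extractUrls(items: SpiderUrlItem[]): string[] {
  return items.map((item) => item.currentUrl);
}

// extractUrls([{ currentUrl: "https://example.com/page1" }])
//   -> ["https://example.com/page1"]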

Dependencies

  • The node uses concurrency-controlled asynchronous requests to fetch and process URLs (see the sketch after this list).
  • It depends on external libraries such as:
    • A URL parsing library.
    • A DOM parsing library for HTML content processing.
    • A concurrency management utility for handling multiple asynchronous operations.
  • No explicit external API keys or credentials are required for this operation.
  • The node expects valid URLs and network access to perform HTTP requests.
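
The sketch below illustrates concurrency-limited fetching in the spirit of the behavior described above (a limit of 5 concurrent requests); it does not reproduce the node's actual implementation or library choices, and it assumes a runtime with a global fetch (Node.js 18+).

// Fetch a list of URLs with at most `limit` requests in flight at once.
async function fetchAll(urls: string[], limit = 5): Promise<string[]> {
  const results: string[] = new Array(urls.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < urls.length) {
      const index = next++;                       // claim the next URL
      const response = await fetch(urls[index]);  // global fetch (Node.js 18+)
      results[index] = await response.text();
    }
  }

  const workers = Array.from({ length: Math.min(limit, urls.length) }, worker);
  await Promise.all(workers);
  return results;
}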

Troubleshooting

  • Empty results: If no URLs are returned, verify that the starting URLs are correct and accessible, and that the "start" and "end" collection regions correctly match the HTML structure of the target pages.
  • Incorrect filtering: Ensure that the "链接包含" and "链接不包含" fields are properly set with substrings that accurately reflect the desired link patterns.
  • Network errors: The node may fail if the target websites block automated requests or require authentication.
  • Malformed URLs: Input URLs should be well-formed; otherwise, parsing errors might occur (see the validation sketch after this list).
  • Concurrency limits: Too many simultaneous requests might cause throttling or failures; the node uses a concurrency limit of 5 to mitigate this.
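
If malformed input is a concern, the starting URLs can be pre-validated before they reach the node, for example with the standard WHATWG URL constructor; this is a sketch, not part of the node itself.

// Keep only inputs that parse as http(s) URLs; report everything else.
function validateUrls(lines: string[]): { valid: string[]; invalid: string[] } {
  const valid: string[] = [];
  const invalid: string[] = [];
  for (const line of lines.map((l) => l.trim()).filter(Boolean)) {
    try {
      const protocol = new URL(line).protocol;
      (protocol === "http:" || protocol === "https:" ? valid : invalid).push(line);
    } catch {
      invalid.push(line);          // not a parseable URL at all
    }
  }
  return { valid, invalid };
}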
