WaterCrawl

Consume WaterCrawl API

Overview

This node integrates with the WaterCrawl API to perform web crawling and scraping tasks. Specifically, the Create operation under the Crawl resource initiates a new crawl request for a specified URL, with customizable options for page handling and spidering behavior.

This node is useful when you want to programmatically extract data from websites by crawling multiple pages or scraping specific content. For example, it can be used to gather product information across an e-commerce site, monitor news articles on a blog, or collect links and HTML content from a set of web pages.
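Under the hood, the Create operation amounts to a single authenticated POST to the WaterCrawl API. The sketch below shows roughly what that request looks like; the endpoint path, the X-API-Key header name, and the option field names are assumptions based on typical REST conventions, so confirm them against the official WaterCrawl API documentation.

```typescript
// Minimal sketch of the request the Create operation issues.
// Endpoint path, auth header name, and body field names are assumptions;
// verify them against the WaterCrawl API documentation.
const BASE_URL = "https://app.watercrawl.dev";

async function createCrawlRequest(apiKey: string, url: string): Promise<unknown> {
  const response = await fetch(`${BASE_URL}/api/v1/core/crawl-requests/`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": apiKey, // assumed auth header name
    },
    body: JSON.stringify({
      url, // the starting URL
      options: {
        page_options: { include_html: true, include_links: true },
        spider_options: { max_depth: 1, page_limit: 5 },
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`WaterCrawl API error: ${response.status} ${response.statusText}`);
  }
  return response.json();
}
```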

Properties

URL: The starting URL at which the crawl begins.

Page Options: Settings that control how each page is processed during the crawl:
- Wait Time (Ms): Delay after page load before the page is processed.
- Timeout (Ms): Maximum time to wait for the page to load.
- Include HTML: Whether to include the full HTML of each page in the results.
- Include Links: Whether to extract the links found on each page.
- Only Main Content: Extract only the main content, ignoring navigation, sidebars, and similar elements.
- Accept Cookies Selector: CSS selector used to automatically accept cookie banners.
- Locale: Browser locale setting (for example, en-US).
- Extra Headers: Additional HTTP headers, supplied as a JSON object.
- Include Tags: Comma-separated list of HTML tags to include.
- Exclude Tags: Comma-separated list of HTML tags to exclude.
- Take Screenshot: Capture a screenshot of each page.
- Generate PDF: Create a PDF snapshot of each page.

Plugin Options: JSON object with additional plugin-specific options for the crawl.

Spider Options: Settings that control the crawling behavior:
- Max Pages: Maximum number of pages to crawl.
- Max Depth: Maximum link depth to follow from the starting URL.
- Allowed Domains: Comma-separated list of domains the crawler may visit.
- Exclude Paths: Comma-separated list of path patterns to exclude.
- Include Paths: Comma-separated list of path patterns to include.
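As a concrete illustration, a filled-in configuration for the properties above might look like the following. The property names in this sketch are illustrative labels mirroring the UI fields, not the node's internal parameter identifiers.

```typescript
// Hypothetical values for the node's fields; names are illustrative only.
const pageOptions = {
  waitTime: 1000,            // ms to wait after page load
  timeout: 15000,            // ms before giving up on page load
  includeHtml: true,
  includeLinks: true,
  onlyMainContent: true,
  acceptCookiesSelector: "#cookie-accept",
  locale: "en-US",
  extraHeaders: JSON.stringify({ "User-Agent": "my-crawler/1.0" }),
  includeTags: "article,main,section",
  excludeTags: "nav,footer,script",
};

const spiderOptions = {
  maxPages: 50,
  maxDepth: 2,
  allowedDomains: "example.com,docs.example.com",
  excludePaths: "/admin/*,/login/*",
  includePaths: "/blog/*",
};
```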

Output

The output is a JSON array where each item corresponds to the result of the crawl request creation. The structure includes details returned by the WaterCrawl API about the newly created crawl job, such as identifiers, status, and metadata.

The exact fields depend on the API response but typically include:

  • Crawl request ID
  • Status of the crawl
  • Configuration details used
  • Any immediate results, or references that can be used to fetch the results later
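For illustration, a newly created crawl job could be modeled with a shape like the one below. The field names here are hypothetical; the authoritative schema is whatever the WaterCrawl API actually returns.

```typescript
// Hypothetical result shape for a newly created crawl request;
// actual field names and types come from the WaterCrawl API response.
interface CrawlRequestResult {
  uuid: string;                     // identifier of the crawl request
  url: string;                      // the starting URL that was submitted
  status: string;                   // e.g. "new", "running", "finished"
  options: Record<string, unknown>; // echo of the configuration used
  created_at: string;               // ISO 8601 timestamp
}
```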

No binary data is output by this operation.

Dependencies

  • Requires an API key credential for authenticating with the WaterCrawl API.
  • The base URL for the API defaults to https://app.watercrawl.dev but can be customized via credentials.
  • No other external dependencies are required beyond network access to the WaterCrawl service.
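If you point the node at a self-hosted WaterCrawl instance, only the base URL in the credentials changes; the request is otherwise built the same way. A minimal sketch, assuming an environment variable for the override (the endpoint path repeats the assumption from the Overview sketch):

```typescript
// Resolving the effective API base URL; the environment variable name is
// illustrative, and the default matches the hosted service.
const baseUrl = process.env.WATERCRAWL_BASE_URL ?? "https://app.watercrawl.dev";
const endpoint = new URL("/api/v1/core/crawl-requests/", baseUrl).toString();
console.log(endpoint); // e.g. https://app.watercrawl.dev/api/v1/core/crawl-requests/
```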

Troubleshooting

  • Invalid JSON in Plugin Options or Extra Headers: if either field contains malformed JSON, the node throws an error indicating invalid JSON. Make sure the JSON strings are properly formatted (see the validation sketch after this list).
  • Timeouts or crawl failures: If the crawl does not complete within the specified timeout or encounters network issues, the API may return errors. Adjust the timeout or check network connectivity.
  • Incorrect selectors or tag lists: Providing invalid CSS selectors or malformed tag lists may cause incomplete or failed extraction.
  • API authentication errors: Ensure the API key credential is valid and has necessary permissions.
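As a guard against the first item above, a quick pre-check of the JSON fields catches the usual culprits (single quotes, trailing commas) before the node runs. A minimal sketch:

```typescript
// Validate a JSON string field such as Plugin Options or Extra Headers
// before handing it to the node; throws with a descriptive message.
function assertValidJson(label: string, raw: string): void {
  try {
    JSON.parse(raw);
  } catch (err) {
    throw new Error(`${label} is not valid JSON: ${(err as Error).message}`);
  }
}

assertValidJson("Plugin Options", '{"my_plugin": {"enabled": true}}'); // ok
// assertValidJson("Extra Headers", "{'User-Agent': 'x'}"); // would throw: single quotes
```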
