WaterCrawl

Consume WaterCrawl API

Overview

The WaterCrawl node integrates with the WaterCrawl API to perform web crawling and scraping tasks. Specifically, the "Scrape URL" operation allows users to scrape a single webpage directly by providing its URL. This operation can be configured to either wait for the scraping results synchronously or return immediately with a reference URL. It supports advanced page options such as waiting times, timeouts, content filtering, and output formats like screenshots or PDFs.

This node is beneficial in scenarios where you need to extract structured data or content from web pages without building your own scraper infrastructure. For example, it can be used to gather product details from e-commerce sites, monitor news articles, or capture snapshots of webpages for archival purposes.

Properties

  • URL: The web address of the page to scrape.
  • Wait for Results: Whether the node should wait for the scraping process to complete and return the results (true), or return immediately with a reference to the scrape job (false).
  • Download Results: When waiting for results, whether to download the actual scraped data or just receive a reference URL pointing to the data.
  • Page Options: A collection of settings controlling how the page is processed:
      - Wait Time (Ms): Milliseconds to wait after the page loads before scraping.
      - Timeout (Ms): Maximum time allowed for page loading.
      - Include HTML: Whether to include the full HTML content in the result.
      - Include Links: Whether to extract links found on the page.
      - Only Main Content: Extract only the main content area of the page, excluding headers, footers, and ads.
      - Accept Cookies Selector: CSS selector used to automatically click an "accept cookies" button if one is present.
      - Locale: Browser locale setting (e.g., "en-US").
      - Extra Headers: Additional HTTP headers to send with the request, specified as JSON.
      - Include Tags: Comma-separated list of HTML tags to include in extraction (e.g., "div,p,h1").
      - Exclude Tags: Comma-separated list of HTML tags to exclude (e.g., "script,style").
      - Take Screenshot: Whether to capture a screenshot of the page.
      - Generate PDF: Whether to generate a PDF version of the page.
  • Plugin Options: JSON object specifying additional plugin-specific options for the scraping process.
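The Extra Headers and Plugin Options fields expect JSON strings. A minimal sketch of constructing valid values, assuming the internal field names `extra_headers` and `pluginOptions` mentioned under Troubleshooting (the plugin option keys shown here are placeholders, not documented WaterCrawl options):

```python
import json

# Extra Headers: additional HTTP headers sent with the request, as a JSON object.
extra_headers = json.dumps({
    "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)",
    "Accept-Language": "en-US,en;q=0.9",
})

# Plugin Options: plugin-specific settings; this key is purely illustrative.
plugin_options = json.dumps({"example_option": True})

# Both strings must parse back cleanly, or the node reports a JSON error.
json.loads(extra_headers)
json.loads(plugin_options)
```

Serializing with a JSON library instead of hand-writing the strings avoids the malformed-JSON errors described in Troubleshooting (e.g., single quotes or trailing commas).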

Output

The node outputs JSON data representing the results of the scraping operation. The structure depends on the options selected:

  • If Download Results is enabled, the output includes the scraped content such as extracted HTML, links, main content, screenshots (as binary data references), or PDFs.
  • If not downloading results, the output contains metadata including a reference URL to access the scrape results externally.
  • The output is wrapped in n8n's standard execution metadata format, associating each result with the corresponding input item.

If screenshots or PDFs are requested, these are included as binary data attachments in the output, allowing downstream nodes to handle or save them accordingly.
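As a hedged illustration of the two output modes, a downstream step might branch on whether the item carries scraped content or only a reference URL. The field names below are assumptions for the sketch, not the documented WaterCrawl response schema:

```python
# Hypothetical n8n-style items; real field names may differ.
item_downloaded = {"json": {"content": "<html>...</html>", "links": []}}
item_reference = {"json": {"result_url": "https://app.watercrawl.dev/example"}}

def extract(item):
    """Return scraped content if present, else the reference URL to fetch later."""
    data = item["json"]
    if "content" in data:
        return ("content", data["content"])
    return ("reference", data.get("result_url"))
```

When Wait for Results is false, only the reference branch applies, and a later step (or workflow) would fetch the results from that URL.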

Dependencies

  • Requires an active WaterCrawl API key credential for authentication.
  • Needs network access to the WaterCrawl API endpoint (default: https://app.watercrawl.dev).
  • No other external dependencies are required within n8n.

Troubleshooting

  • Invalid JSON in Plugin Options or Extra Headers: The node expects valid JSON strings for pluginOptions and extra_headers. Errors will occur if these fields contain malformed JSON. To fix, ensure proper JSON formatting.
  • Timeouts or No Results: If the target page takes too long to load or does not respond, consider increasing the Timeout (Ms) or adjusting Wait Time (Ms).
  • Empty or Partial Data: Using tag filters (Include Tags, Exclude Tags) incorrectly may omit desired content. Review these settings if expected data is missing.
  • Authentication Errors: Ensure the API key credential is correctly configured and has necessary permissions.
  • Binary Data Handling: When requesting screenshots or PDFs, downstream nodes must support binary data processing to utilize these outputs.
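To rule out the malformed-JSON error above before pasting a value into Plugin Options or Extra Headers, the string can be checked with any JSON parser; for example:

```python
import json

def is_valid_json_object(text):
    """True only if text parses as a JSON object, the shape both fields expect."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

print(is_valid_json_object('{"X-Custom-Header": "value"}'))  # True
print(is_valid_json_object("{'single': 'quotes'}"))          # False: single quotes are not JSON
```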
