WaterCrawl

Consume WaterCrawl API

Overview

The node integrates with the WaterCrawl API to manage web crawling tasks. Specifically, the Get Crawl Results operation retrieves the results of a previously initiated crawl request by its ID. It supports pagination and can optionally download the full content of each result.

This node is useful when you want to programmatically access data extracted from websites after a crawl has been performed. For example, you might use it to:

  • Fetch scraped data from a crawl job for further processing or analysis.
  • Paginate through large sets of crawl results.
  • Download detailed content or metadata associated with each crawled page.

Practical example: After starting a crawl on a website to collect product information, you can use this operation to retrieve the collected pages in batches, optionally downloading the full HTML or other extracted data for each page.
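
For orientation, here is a minimal sketch of the HTTP request this operation performs. The endpoint path and the X-API-Key header name are assumptions based on the WaterCrawl API's documented defaults; inside n8n the node builds the real request from its credential and properties.

```typescript
// A hedged sketch of the underlying request (Node 18+, global fetch).
// The endpoint path and X-API-Key header name are assumptions; consult
// the WaterCrawl API reference for the authoritative values.
const BASE_URL = 'https://app.watercrawl.dev'; // documented default base URL

async function getCrawlResults(
  apiKey: string,
  crawlId: string,
  page = 1,   // "Page" property, defaults to 1
  limit = 50, // "Limit" property, defaults to 50
): Promise<unknown> {
  const url = new URL(`/api/v1/core/crawl-requests/${crawlId}/results/`, BASE_URL);
  url.searchParams.set('page', String(page));
  url.searchParams.set('limit', String(limit));

  const response = await fetch(url, { headers: { 'X-API-Key': apiKey } });
  if (!response.ok) {
    throw new Error(`WaterCrawl API returned HTTP ${response.status}`);
  }
  return response.json();
}
```

For instance, getCrawlResults(apiKey, crawlId, 2, 100) would fetch the second page of 100 results.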

Properties

  • Crawl ID: The unique identifier of the crawl request whose results you want to retrieve.
  • Limit: Maximum number of results to return per page (minimum 1). Defaults to 50.
  • Page: The page number of results to retrieve. Defaults to 1.
  • Download: Whether to download the full content of each result or only return the metadata and reference URLs.
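
Conceptually, these properties map onto a parameter shape like the one below. The field names are illustrative, not the node's internal ones.

```typescript
// Hypothetical parameter shape mirroring the properties above; the node's
// actual internal field names may differ.
interface GetCrawlResultsParams {
  crawlId: string;    // Crawl ID: identifier of the crawl request
  limit?: number;     // Limit: results per page, minimum 1 (default 50)
  page?: number;      // Page: 1-based page number (default 1)
  download?: boolean; // Download: include full content, not just metadata/reference URLs
}
```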

Output

The output JSON array contains objects representing individual crawl results. Each object includes metadata about the crawled page. If the Download property is enabled, the node downloads and includes the full content or detailed result data for each item.

Additionally, the node outputs a second JSON array providing pagination details with the following fields:

  • current_page: The current page number retrieved.
  • next_page: The next page number if available, otherwise null.
  • previous_page: The previous page number if available, otherwise null.
  • total_pages: Total number of pages, computed from the total result count and the page limit.
  • total_results: Total number of results available.

This makes it easy to handle paginated crawl results within workflows; the sketch below shows one way to walk all pages.
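
In the loop below, fetchPage stands in for one invocation of Get Crawl Results, and the { results, pagination } response shape is an assumption for illustration, since the node emits the two arrays separately.

```typescript
// Hedged sketch of draining all pages via the pagination fields above.
interface Pagination {
  current_page: number;
  next_page: number | null;
  previous_page: number | null;
  total_pages: number;
  total_results: number;
}

async function collectAllResults(
  fetchPage: (page: number) => Promise<{ results: unknown[]; pagination: Pagination }>,
): Promise<unknown[]> {
  const all: unknown[] = [];
  let page: number | null = 1;
  while (page !== null) {
    const { results, pagination } = await fetchPage(page);
    all.push(...results);
    page = pagination.next_page; // becomes null on the last page
  }
  return all;
}
```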

Dependencies

  • Requires an API key credential for authenticating with the WaterCrawl API.
  • Uses the WaterCrawl API base URL, defaulting to https://app.watercrawl.dev if not configured otherwise (see the sketch after this list).
  • No additional external dependencies beyond the WaterCrawl API service.
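
A small sketch of how these two dependencies might be resolved in your own code; the environment variable names are hypothetical, and only the default base URL is documented behavior.

```typescript
// Hypothetical environment variable names; the fallback URL below is the
// documented default.
const baseUrl = process.env.WATERCRAWL_BASE_URL ?? 'https://app.watercrawl.dev';
const apiKey = process.env.WATERCRAWL_API_KEY;
if (!apiKey) {
  throw new Error('A WaterCrawl API key credential is required');
}
```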

Troubleshooting

  • Invalid JSON in pluginOptions or extra_headers: The node expects valid JSON strings for these properties; error messages will indicate where parsing failed. Verify the JSON syntax.
  • Missing or invalid Crawl ID: The operation requires a valid crawl request ID; an empty or incorrect ID will cause an error.
  • Pagination parameters out of range: Using page numbers or limits that exceed available data may return empty results or no next page.
  • API authentication errors: Ensure the API key credential is correctly configured and has necessary permissions.
  • Download failures: If downloading result content fails, check network connectivity and API response status.

If the node is set to continue on failure, errors for individual items will be returned as error messages in the output instead of stopping execution.
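
The per-item pattern below illustrates that behavior. It is a generic sketch with hypothetical names, not the node's actual source.

```typescript
// Generic sketch of continue-on-failure handling: a failing item is emitted
// as an { error } object instead of aborting the whole run.
type OutputItem = { json: Record<string, unknown> };

async function runItems(
  crawlIds: string[],
  fetchResults: (crawlId: string) => Promise<Record<string, unknown>>,
  continueOnFail: boolean,
): Promise<OutputItem[]> {
  const out: OutputItem[] = [];
  for (const crawlId of crawlIds) {
    try {
      out.push({ json: await fetchResults(crawlId) });
    } catch (err) {
      if (!continueOnFail) throw err; // default: stop the execution
      out.push({ json: { error: (err as Error).message } });
    }
  }
  return out;
}
```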
