CloudBrowser

Interact with websites using a cloud-based browser instance

Overview

The CloudBrowser node lets you interact with websites through a cloud-based browser instance. For the Content resource and its Get HTML From Website operation, it navigates to a specified URL using a remote browser service and returns the full HTML content of the webpage.

Common scenarios where this node is beneficial include:

  • Extracting raw HTML content from dynamic websites that require JavaScript execution.
  • Scraping data from pages that load content asynchronously.
  • Automating web navigation tasks where direct HTTP requests are insufficient due to client-side rendering.

Practical example:

  • You want to scrape product details from an e-commerce site that loads content dynamically via JavaScript. Using this node, you can navigate to the product page URL and retrieve the fully rendered HTML for further parsing.
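
The navigate-then-read flow described above can be sketched in TypeScript. This is a minimal sketch, not the node's actual implementation: `getHtmlFromWebsite`, `PageLike`, and `HtmlResult` are hypothetical names, and the default wait condition and timeout are illustrative. `PageLike` mirrors only the four Puppeteer `Page` methods the sketch calls, so it type-checks without Puppeteer installed.

```typescript
// Wait conditions as Puppeteer names them.
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Minimal slice of Puppeteer's Page API used by this sketch.
interface PageLike {
  goto(url: string, options?: { waitUntil?: WaitUntil; timeout?: number }): Promise<unknown>;
  title(): Promise<string>;
  url(): string;
  content(): Promise<string>;
}

// Shape of the result: page title, final URL, and HTML as a string.
interface HtmlResult {
  title: string;
  url: string;
  content: string;
}

// Hypothetical helper: navigate to the target, then read title, URL, and HTML.
// The defaults are illustrative assumptions, not the node's documented defaults.
async function getHtmlFromWebsite(
  page: PageLike,
  targetUrl: string,
  waitUntil: WaitUntil = "networkidle2",
  timeoutMs = 30000,
): Promise<HtmlResult> {
  await page.goto(targetUrl, { waitUntil, timeout: timeoutMs });
  return {
    title: await page.title(),
    url: page.url(), // final URL, which may differ from targetUrl after redirects
    content: await page.content(), // serialized DOM at the time of the call
  };
}
```

Because `page.content()` serializes the DOM as it stands when called, the chosen wait condition decides how much dynamically loaded content ends up in the result.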

Properties

  • URL to Navigate: The URL of the website to open and retrieve HTML content from. This is required.
  • Navigation Options: Options controlling how navigation behaves:
      - Wait Until: When to consider navigation finished (options: Load, Domcontentloaded, Networkidle0, Networkidle2).
      - Timeout (Ms): Maximum wait time in milliseconds.
  • Browser Configuration: Settings for the browser instance:
      - Browser Type: Choose between Chrome, Chromium, or ChromeHeadlessShell.
      - Headless Mode: Whether to run browser without UI.
      - Stealth Mode: Enable stealth to avoid detection.
      - Keep Open (Seconds): How long to keep browser open before auto-closing (0 means never).
      - Label: Name for the browser instance.
      - Save Session: Save session for reuse.
      - Recover Session: Recover previously saved session.
  • Custom Arguments: Additional command-line arguments to pass to the browser on startup.
  • Ignored Default Arguments: List of default browser arguments to ignore when launching the browser.
  • Proxy Configuration: Proxy server settings:
      - Host
      - Port
      - Username
      - Password
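
To make these properties more concrete, the following TypeScript sketch shows one way the Wait Until labels and the proxy fields could translate into Puppeteer-style values. The mapping is an assumption: this page does not describe how the node converts its options, and `toPuppeteerWaitUntil`, `buildLaunchArgs`, and `ProxyConfig` are hypothetical names. The lowercase wait values and Chrome's `--proxy-server` flag are real.

```typescript
// Hypothetical shape for the Proxy Configuration fields.
interface ProxyConfig {
  host: string;
  port: number;
  username?: string;
  password?: string;
}

// Assumed mapping from the option labels to Puppeteer's waitUntil values.
const WAIT_UNTIL_BY_LABEL = new Map<string, string>([
  ["Load", "load"],
  ["Domcontentloaded", "domcontentloaded"],
  ["Networkidle0", "networkidle0"],
  ["Networkidle2", "networkidle2"],
]);

// Translate a label into a waitUntil value, rejecting unknown labels.
function toPuppeteerWaitUntil(label: string): string {
  const value = WAIT_UNTIL_BY_LABEL.get(label);
  if (value === undefined) {
    throw new Error(`Unknown Wait Until option: ${label}`);
  }
  return value;
}

// Combine custom arguments with a proxy flag when a proxy is configured.
// Chrome's --proxy-server flag carries no credentials; with Puppeteer the
// username and password are typically supplied through page.authenticate().
function buildLaunchArgs(customArgs: string[], proxy?: ProxyConfig): string[] {
  const args = [...customArgs];
  if (proxy) {
    args.push(`--proxy-server=${proxy.host}:${proxy.port}`);
  }
  return args;
}
```

For example, `buildLaunchArgs(["--lang=en-US"], { host: "proxy.example.com", port: 8080 })` yields `["--lang=en-US", "--proxy-server=proxy.example.com:8080"]`.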

Output

The output JSON object contains the following fields:

  • title: The title of the loaded webpage.
  • url: The final URL after navigation (may differ if redirected).
  • content: The full HTML content of the webpage as a string.

This output provides the complete HTML source of the page after any client-side scripts have executed, enabling downstream processing or scraping.

No binary data is output for this operation.
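
As an illustration of downstream processing, the sketch below takes an object with the three output fields and pulls absolute link URLs out of `content`. It is a rough example under stated assumptions: `CloudBrowserOutput` and `extractLinks` are hypothetical names, and the regular expression only handles quoted `href` attributes, so a real workflow would likely use a proper HTML parser.

```typescript
// Shape of the node's output item, following the fields listed above.
interface CloudBrowserOutput {
  title: string;
  url: string;
  content: string;
}

// Hypothetical helper: collect href values from anchor tags and resolve
// them against the final page URL so relative links become absolute.
function extractLinks(output: CloudBrowserOutput): string[] {
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
  const links: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(output.content)) !== null) {
    try {
      links.push(new URL(match[1], output.url).href);
    } catch {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return links;
}
```

Resolving against `url` rather than the originally requested address matters when the page was redirected, because relative links are relative to the final location.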

Dependencies

  • Requires access to a cloud-based browser service API, authenticated with an API token credential that must be configured for the node.
  • Uses the Puppeteer library internally to connect to the remote browser instance.
  • Optional proxy configuration can be used to route browser traffic.

Troubleshooting

  • No WebSocket address received from the browser service: Indicates failure to open a browser instance remotely. Check API token validity and service availability.
  • Navigation timeout: If the page takes too long to load, increase the "Timeout (Ms)" value in Navigation Options.
  • Empty or incomplete HTML content: May occur if the page requires additional interaction or longer wait times. Try a stricter "Wait Until" option (for example Networkidle0 instead of Load) or a larger "Timeout (Ms)" value.
  • Authentication or permission errors: Ensure the API token credential has proper permissions and is correctly configured.
  • Proxy connection issues: Verify proxy host, port, and credentials if using proxy settings.
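
For navigation timeouts, one option is to retry with progressively larger timeouts instead of picking a single large value. The sketch below is a generic helper and not part of the node: `withEscalatingTimeout` is a hypothetical name, and it assumes the failure surfaces as an error whose `name` is `TimeoutError`, as Puppeteer's own timeout error does. Inside a workflow the error may be wrapped differently.

```typescript
// Hypothetical helper: run an attempt with each timeout in turn, moving to
// the next value only when the attempt fails with a timeout error.
async function withEscalatingTimeout<T>(
  attempt: (timeoutMs: number) => Promise<T>,
  timeoutsMs: number[],
): Promise<T> {
  let lastError: unknown = new Error("no timeouts supplied");
  for (const timeoutMs of timeoutsMs) {
    try {
      return await attempt(timeoutMs);
    } catch (error) {
      // Assumption: timeouts are reported as errors named "TimeoutError".
      const isTimeout = error instanceof Error && error.name === "TimeoutError";
      if (!isTimeout) {
        throw error; // other failures (auth, proxy) will not be fixed by waiting
      }
      lastError = error;
    }
  }
  throw lastError;
}
```

Errors other than timeouts are rethrown immediately, since authentication or proxy problems need a configuration fix rather than a longer wait.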
