Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

The node "Crawl4AI: Basic Crawler" is designed to crawl websites by fetching and processing the content of a single URL. It is useful for scenarios where you want to extract web page data, such as scraping articles, product information, or any publicly accessible web content. The node supports advanced crawling options like JavaScript execution, respecting robots.txt rules, caching strategies, and filtering content by CSS selectors.

Practical examples include:

  • Extracting article text from news websites.
  • Collecting product details from e-commerce pages.
  • Crawling dynamic web pages that require JavaScript rendering.
  • Filtering out unwanted page sections (e.g., navigation bars, footers).
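
As a rough orientation, the snippet below sketches the node's main settings as a single configuration object. The property names are illustrative assumptions, not the node's actual parameter keys; in practice you set the equivalent fields through the node's UI (see "Properties" below):

```js
// Illustrative sketch only: property names are assumptions, not the node's real parameter keys.
// The values mirror the options described in the Properties section below.
const crawlConfig = {
  url: "https://example.com/blog/some-article",
  browserOptions: {
    enableJavaScript: true,        // render dynamic content before extraction
    headlessMode: true,            // run the browser without a visible UI
    userAgent: "Mozilla/5.0 (compatible; MyCrawler/1.0)",
    viewportWidth: 1280,
    viewportHeight: 800,
  },
  crawlerOptions: {
    cacheMode: "bypass",           // force a fresh fetch instead of reading the cache
    checkRobotsTxt: true,          // respect the site's robots.txt rules
    cssSelector: "article.main",   // keep only content inside this element
    excludeExternalLinks: true,
    excludedTags: ["nav", "footer", "aside"],
    pageTimeout: 30000,            // milliseconds to wait for the page to load
    wordCountThreshold: 10,        // drop blocks with fewer than 10 words
  },
  options: {
    includeMediaData: false,
    verboseResponse: false,
  },
};
```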

Properties

  • URL: The URL to crawl.
  • Browser Options: Settings related to the browser environment used for crawling:
    - Enable JavaScript: Whether to execute JavaScript on the page.
    - Headless Mode: Run the browser without a UI.
    - User Agent: Custom user agent string.
    - Viewport Height & Width: Dimensions of the browser viewport.
  • Crawler Options: Controls how the crawler behaves:
    - Cache Mode: Use the cache (enabled), bypass the cache (force a fresh fetch), or read only from the cache.
    - Check Robots.txt: Respect the website's crawling rules.
    - CSS Selector: Focus extraction on specific page elements.
    - Exclude External Links: Skip links outside the domain.
    - Excluded Tags: HTML tags to ignore.
    - JavaScript Code: Custom JS to run after the page loads (see the sketch after this list).
    - JavaScript Only Mode: Execute the JS without crawling.
    - Max Retries: Number of retry attempts.
    - Page Timeout: Maximum wait time for the page to load.
    - Request Timeout: Maximum wait time for network requests.
    - Session ID: Maintain browser state across crawls.
    - Word Count Threshold: Minimum number of words a content block must contain to be included.
  • Options: Additional output controls:
    - Include Media Data: Include images/videos in the output.
    - Verbose Response: Include detailed data such as raw HTML and status codes.
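
When the "JavaScript Code" option is used, the crawler executes the supplied snippet in the page context after the page has loaded. A minimal sketch is shown below; the ".load-more" selector is a hypothetical example and must be adapted to the target site:

```js
// Runs inside the crawled page after load (plain browser JavaScript).
// ".load-more" is a hypothetical selector; adjust it to the target site's markup.
window.scrollTo(0, document.body.scrollHeight);        // trigger lazily loaded content
const loadMore = document.querySelector(".load-more");
if (loadMore instanceof HTMLElement) {
  loadMore.click();                                     // reveal additional items
}
```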

Output

The node outputs JSON data representing the crawled content of the specified URL. This includes the extracted text, optionally filtered by CSS selector and with the specified tags or external links excluded. If enabled, media data such as images and videos is included. When verbose mode is active, additional metadata such as the raw HTML content, HTTP status code, and other diagnostic information is provided.

If the node produces binary data (e.g., media files), it is included with the output item and represents the downloaded media associated with the crawled page.
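
To illustrate how this output might be consumed, the sketch below shows a downstream n8n Code node that keeps only a few fields. The field names (markdown, statusCode, media) are assumptions about the crawler's output shape; inspect a real execution and adjust them to what the node actually returns:

```js
// n8n Code node ("Run Once for All Items") placed after the crawler node.
// json.markdown, json.statusCode and json.media are assumed field names; verify
// them against a real execution before relying on this mapping.
return $input.all().map((item) => {
  const crawl = item.json;
  const text = crawl.markdown || crawl.text || "";
  return {
    json: {
      url: crawl.url,
      text,                                          // extracted page content
      statusCode: crawl.statusCode,                  // present with Verbose Response
      imageCount: Array.isArray(crawl.media && crawl.media.images)
        ? crawl.media.images.length
        : 0,                                         // present with Include Media Data
    },
  };
});
```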

Dependencies

  • Requires an API key credential for the Crawl4AI service to perform crawling operations.
  • The node internally uses a browser automation environment capable of executing JavaScript and handling headless browsing.
  • No additional environment variables are explicitly required beyond the API authentication.

Troubleshooting

  • Common Issues:

    • Crawling blocked by robots.txt if "Check Robots.txt" is enabled.
    • Timeouts if the page takes too long to load; adjust "Page Timeout" and "Request Timeout".
    • JavaScript execution failures if custom JS code is incorrect or incompatible.
    • Cache inconsistencies if cache mode is misconfigured.
  • Error Messages:

    • Authentication errors indicate missing or invalid API credentials.
    • Network errors suggest connectivity issues or invalid URLs.
    • Timeout errors can be resolved by increasing timeout settings.
    • Parsing errors may occur if CSS selectors do not match any elements (a downstream check for this case is sketched below).
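
One way to surface these conditions early is a small downstream check. The sketch below again assumes hypothetical output field names (markdown, statusCode) and simply annotates each item so failed or empty crawls are easy to spot:

```js
// n8n Code node ("Run Once for All Items") placed after the crawler node.
// Field names are assumptions about the crawler's output; adjust to your data.
return $input.all().map((item) => {
  const crawl = item.json;
  const text = (crawl.markdown || crawl.text || "").trim();
  const httpOk = crawl.statusCode === undefined || crawl.statusCode < 400;
  return {
    json: {
      ...crawl,
      crawlOk: httpOk && text.length > 0,
      crawlHint: text.length === 0
        ? "No content extracted - check the CSS Selector and Word Count Threshold"
        : "",
    },
  };
});
```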
