Overview
The node "Crawl4AI: Basic Crawler" enables crawling multiple URLs using the Crawl4AI service. It is designed to fetch and extract web content programmatically, supporting advanced options like JavaScript execution, headless browsing, caching strategies, and selective content extraction via CSS selectors.
This node is beneficial in scenarios such as:
- Aggregating content from multiple websites for data analysis or monitoring.
- Extracting specific parts of web pages (e.g., articles, product details) by specifying CSS selectors.
- Performing automated web scraping with control over browser behavior and request handling.
- Handling large-scale crawling tasks with concurrency and retry mechanisms.
Practical examples include:
- Collecting news articles from various sources by crawling their URLs and extracting article content.
- Monitoring competitor websites for changes by periodically crawling and comparing page contents.
- Gathering product information from e-commerce sites while respecting robots.txt rules and excluding unwanted HTML tags.
Properties
Name | Meaning |
---|---|
URLs | Comma-separated list of URLs to crawl. The node will process each URL provided here. |
Browser Options | Settings controlling the browser environment during crawling:<br>- Enable JavaScript: whether to execute JavaScript on pages.<br>- Headless Mode: run the browser without a UI.<br>- User Agent: custom user agent string.<br>- Viewport Height & Width: dimensions of the browser viewport. |
Crawler Options | Controls crawling behavior:<br>- Cache Mode: use cache (enabled/read-write), bypass cache, or read-only cache.<br>- Check Robots.txt: respect robots.txt rules.<br>- CSS Selector: focus extraction on specific page elements.<br>- Exclude External Links: omit external links from results.<br>- Excluded Tags: HTML tags to exclude from extraction.<br>- Max Retries: number of retries on failure.<br>- Page Timeout (ms): maximum wait time for page load.<br>- Request Timeout (ms): maximum wait time for network requests.<br>- Stream Results: stream output as it becomes available.<br>- Word Count Threshold: minimum word count for content inclusion. |
Options | Additional options:<br>- Include Media Data: include images/videos in the output.<br>- Verbose Response: include detailed data such as full HTML and status codes.<br>- Max Concurrent Crawls: limit on concurrent crawling tasks. |
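As a rough illustration, the sketch below groups these properties into a single configuration object. The field names are assumptions derived from the table above and may not match the node's internal parameter keys exactly.

```typescript
// Illustrative configuration; property names mirror the table above and are
// assumptions, not the node's guaranteed internal parameter keys.
const crawlerNodeParameters = {
  urls: "https://example.com/news, https://example.com/blog", // comma-separated list
  browserOptions: {
    enableJavaScript: true,  // execute JavaScript before extraction
    headlessMode: true,      // run the browser without a UI
    userAgent: "Mozilla/5.0 (compatible; MyCrawler/1.0)",
    viewportWidth: 1280,
    viewportHeight: 800,
  },
  crawlerOptions: {
    cacheMode: "enabled",          // enabled (read-write), bypass, or read-only
    checkRobotsTxt: true,          // skip URLs disallowed by robots.txt
    cssSelector: "article.main",   // restrict extraction to matching elements
    excludeExternalLinks: true,
    excludedTags: ["nav", "footer", "script"],
    maxRetries: 2,
    pageTimeout: 30000,            // ms to wait for the page to load
    requestTimeout: 15000,         // ms to wait for network requests
    streamResults: false,
    wordCountThreshold: 50,        // drop content blocks shorter than this
  },
  options: {
    includeMediaData: false,
    verboseResponse: false,
    maxConcurrentCrawls: 3,
  },
};
```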
Output
The node outputs JSON data representing the crawled content for each URL processed. The structure typically includes:
- Extracted textual content based on the CSS selector or full page if none specified.
- Metadata such as HTTP status codes, final URLs after redirects, and possibly timing information.
- If enabled, media data like images and videos embedded in the page.
- When verbose response is enabled, detailed raw HTML and other diagnostic information are included.
- If streaming is enabled, partial results may be emitted progressively as they become available.
The node does not explicitly output binary data; when media inclusion is enabled, images and videos appear to be referenced within the JSON output rather than returned as binary attachments.
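For orientation, here is a hedged sketch of the shape a single result item might take. The field names are assumptions based on the description above, not the node's exact output schema.

```typescript
// Assumed shape of one crawled-URL result; field names are illustrative only.
interface CrawlResult {
  url: string;          // final URL after redirects
  statusCode?: number;  // present when Verbose Response is enabled
  content: string;      // extracted text (CSS-selector scoped, or full page)
  html?: string;        // raw HTML, verbose responses only
  media?: {             // populated when Include Media Data is on
    images: { src: string; alt?: string }[];
    videos: { src: string }[];
  };
  timing?: { fetchMs?: number; renderMs?: number }; // possible timing info
}
```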
Dependencies
- Requires an API key credential for the Crawl4AI service to authenticate requests.
- Relies on the Crawl4AI backend for crawling and rendering web pages.
- No additional environment variables or external services are indicated beyond the API credential.
Troubleshooting
Common Issues:
- Invalid or missing API key credential will cause authentication failures.
- Network timeouts can occur if the Page Timeout or Request Timeout values are too low for slow-loading pages.
- Crawling may be blocked when Check Robots.txt is enabled and the target site disallows crawling.
- An incorrect CSS selector may result in empty or incomplete content extraction.
- Setting Max Concurrent Crawls too high may lead to throttling or rate limiting by the Crawl4AI service.
Error Messages:
- Authentication errors: Verify the API key credential is correctly configured.
- Timeout errors: Increase timeout settings or reduce concurrency.
- Content extraction errors: Check CSS selector syntax and ensure target elements exist.
- Cache-related errors: Adjust cache mode if stale or missing data issues occur.
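When timeout or throttling errors persist, a common first step is to raise the timeout values and lower concurrency. The sketch below reuses the illustrative field names from the configuration example above; the values are starting points, not recommendations from the Crawl4AI documentation.

```typescript
// Illustrative overrides for slow or rate-limited sites: longer timeouts,
// one extra retry, and serialized requests. Tune against your own targets.
const slowSiteOverrides = {
  crawlerOptions: {
    pageTimeout: 60000,     // allow up to 60 s for heavy pages
    requestTimeout: 30000,  // allow up to 30 s per network request
    maxRetries: 3,
  },
  options: {
    maxConcurrentCrawls: 1, // serialize crawls to avoid throttling
  },
};
```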
Links and References
- Crawl4AI Official Website — For service documentation and API details.
- CSS Selectors Reference — To craft effective selectors for content extraction.
- Robots.txt Specification — Understanding how robots.txt affects crawling permissions.