Overview
The "Crawl4AI: Basic Crawler" node processes raw HTML that you supply and extracts structured data from it, such as text content, links, and media elements. It is most useful when you already have the HTML as input and want to narrow the extraction with filters such as a CSS selector or a list of excluded tags.
Practical examples include:
- Extracting the main article content from a downloaded HTML page.
- Filtering out navigation bars, footers, or sidebars from the content.
- Resolving relative URLs based on a base URL.
- Optionally including media data like images or videos in the output.
- Applying word count thresholds to filter out insignificant content.
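
To make the link-related examples above concrete, here is a minimal local sketch of resolving relative URLs against a base URL and flagging external links. It uses only Python's standard library and does not call the node or the Crawl4AI service. The helper names `resolve_links` and `drop_external` are hypothetical, and treating a link as external whenever its host differs from the base URL's host is an assumption made for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(hrefs, base_url):
    """Resolve each href against base_url and flag whether it is external."""
    base_host = urlparse(base_url).netloc.lower()
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        host = urlparse(absolute).netloc.lower()
        # Assumption: "external" means the host differs from the base URL's host.
        resolved.append({"url": absolute, "external": host != base_host})
    return resolved


def drop_external(links):
    """Keep only links that point at the same host as the base URL."""
    return [link for link in links if not link["external"]]


links = resolve_links(
    ["/about", "img/logo.png", "https://other.org/page"],
    "https://example.com/blog/",
)
internal_only = drop_external(links)
```

Note that under this rule, schemes without a host (for example `mailto:`) are also treated as external.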
Properties
Name | Meaning |
---|---|
HTML Content | The raw HTML content to process. Example: `<html><body><h1>Example</h1><p>Content</p></body></html>` |
Base URL | The base URL used to resolve relative links within the HTML content. Default: `https://example.com` |
Crawler Options | Collection of options to customize crawling behavior: |
- CSS Selector | CSS selector string to focus extraction on specific parts of the HTML (e.g., `article.content`). Empty means full page. |
- Exclude External Links | Boolean flag to exclude external links from the extracted results. |
- Excluded Tags | Comma-separated list of HTML tags to exclude from processing (e.g., `nav,footer,aside`). |
- Word Count Threshold | Minimum number of words required for content to be included in the output. |
Options | Additional options controlling output details: |
- Include Media Data | Whether to include media data such as images and videos in the output. |
- Verbose Response | Whether to include detailed response data like full HTML and status codes in the output. |
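
The following sketch illustrates how the Excluded Tags and Word Count Threshold options could interact, using Python's standard `html.parser` module. It is a local approximation, not the Crawl4AI implementation: the names `parse_excluded_tags`, `BlockExtractor`, and `extract_blocks` are hypothetical, and applying the threshold per text block is an assumption.

```python
from html.parser import HTMLParser


def parse_excluded_tags(value):
    """Turn a string like 'nav, footer,aside' into {'nav', 'footer', 'aside'}."""
    return {tag.strip().lower() for tag in value.split(",") if tag.strip()}


class BlockExtractor(HTMLParser):
    """Collect text per block element, skipping text inside excluded tags.

    Assumption: excluded tags are non-void elements with a closing tag.
    """

    BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "div"}

    def __init__(self, excluded):
        super().__init__()
        self.excluded = excluded
        self.skip_depth = 0  # > 0 while inside an excluded subtree
        self.blocks = []
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS and not self.skip_depth:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.excluded:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK_TAGS and not self.skip_depth:
            self._flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.append(data)

    def _flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []


def extract_blocks(html, excluded_tags="", word_count_threshold=0):
    """Return text blocks that survive tag exclusion and the word threshold."""
    parser = BlockExtractor(parse_excluded_tags(excluded_tags))
    parser.feed(html)
    parser.close()
    parser._flush()
    return [b for b in parser.blocks if len(b.split()) >= word_count_threshold]
```

With excluded tags set to `nav, footer` and a threshold of 3, a short heading and the navigation text are dropped while longer paragraphs are kept, which is the behavior the two options describe.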
Output
The node outputs JSON data representing the processed content extracted from the provided raw HTML. This typically includes:
- Cleaned and filtered textual content based on the CSS selector and excluded tags.
- Resolved links with respect to the base URL, optionally excluding external links.
- If enabled, media data such as image and video metadata.
- When verbose mode is enabled, additional details like the original HTML snippet and HTTP status codes may be included.
If media data is included, binary data references might be present corresponding to images or videos extracted from the HTML.
Dependencies
- Requires an API key credential for the Crawl4AI service to perform crawling and processing.
- The node depends on the Crawl4AI API endpoint to process the HTML content.
- No other external dependencies are indicated in the bundled source code.
Troubleshooting
Common issues:
- Invalid or missing API credentials will prevent the node from functioning.
- Providing malformed or empty HTML content will result in no meaningful output.
- Incorrect CSS selectors may lead to empty or incomplete content extraction.
- Setting too high a word count threshold might filter out all content unintentionally.
Error messages:
- Authentication errors indicate problems with the API key; verify and re-enter credentials.
- Parsing errors suggest invalid HTML input; ensure the HTML content is well-formed.
- Network or API errors may occur if the Crawl4AI service is unreachable; check internet connectivity and service status.
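
Some of the common issues above can be caught before the node runs. The sketch below is a rough pre-flight check written in plain Python; `preflight_check` is a hypothetical helper, and the tag-stripping regular expression is only a crude word-count estimate, not an HTML parser.

```python
import re


def preflight_check(html, word_count_threshold=0):
    """Return a list of warnings; an empty list means no obvious problem."""
    warnings = []
    if not html or not html.strip():
        warnings.append("HTML content is empty.")
        return warnings
    if "<" not in html or ">" not in html:
        warnings.append("Input does not look like HTML (no tags found).")
    # Rough estimate: strip anything that looks like a tag, then count words.
    total_words = len(re.sub(r"<[^>]+>", " ", html).split())
    if word_count_threshold > total_words:
        warnings.append(
            f"Word count threshold ({word_count_threshold}) exceeds the "
            f"estimated total word count ({total_words})."
        )
    return warnings
```

A non-empty result does not guarantee the node will fail; it only points at inputs worth double-checking, such as a threshold larger than the whole document.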
Links and References
- Crawl4AI Official Website — For API documentation and service details.
- CSS Selectors Reference — To craft effective CSS selectors for content extraction.
- HTML5 Specification — For understanding HTML structure relevant to tag exclusion.