Overview
The "Crawl4AI: Content Extractor" node extracts structured content from web pages using CSS selectors. It is particularly useful for web scraping tasks where you want to gather repeated elements such as product listings, article summaries, or other structured data from a webpage.
Typical use cases include:
- Extracting product details (title, price, image URL) from e-commerce sites.
- Scraping news headlines and summaries from media websites.
- Collecting event information from event listing pages.
- Any scenario where you need to programmatically extract multiple fields of data from repeated HTML elements on a page.
This node supports running a headless browser with JavaScript enabled, allowing it to handle dynamic content that requires client-side rendering or interaction (e.g., clicking buttons or scrolling).
Properties
Name | Meaning |
---|---|
URL | The web address of the page to extract content from. Example: https://example.com |
Base Selector | A CSS selector that identifies the repeating element on the page (e.g., each product item or article card). Extraction will be performed relative to these elements. Example: div.product-item |
Fields | Defines one or more fields to extract from each base element. Each field includes: Field Name (identifier for the extracted data, e.g., "title"); CSS Selector (relative to the base element); Field Type (what to extract: Text, HTML, or Attribute); Attribute Name (if extracting an attribute, which one, e.g., href). |
Browser Options | Settings controlling the headless browser: enable JavaScript execution (true/false); run in headless mode (true/false); custom JavaScript code to run before extraction (e.g., clicking buttons); timeout in milliseconds to wait for page load; viewport width and height in pixels. |
Options | Additional options: Cache Mode, controlling how caching is used (Enabled (Read/Write), Bypass (Force Fresh), Only (Read Only)); Include Original Text, whether to include the full original webpage text in the output; Clean Text, whether to clean and normalize extracted text (remove extra spaces and newlines). |
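Taken together, the properties above can be pictured as a single configuration object. The sketch below is illustrative only: the key names (`url`, `baseSelector`, `fields`, and so on) are assumptions for the example, not the node's exact internal schema.

```python
# Hypothetical configuration mirroring the node's properties.
# Key names are illustrative, not the node's exact schema.
config = {
    "url": "https://example.com/products",
    "baseSelector": "div.product-item",  # the repeating element
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "blurb", "selector": "div.description", "type": "html"},
        {"name": "link", "selector": "a.details", "type": "attribute", "attribute": "href"},
    ],
    "browserOptions": {
        "enableJavaScript": True,
        "headless": True,
        "timeoutMs": 30000,
        "viewport": {"width": 1280, "height": 800},
    },
    "options": {
        "cacheMode": "Enabled (Read/Write)",
        "includeOriginalText": False,
        "cleanText": True,
    },
}

# A field of type "attribute" must also name the attribute to read.
for field in config["fields"]:
    if field["type"] == "attribute":
        assert "attribute" in field
```

Note how each entry in `fields` pairs a name with a selector that is resolved relative to `baseSelector`, which is what makes the extraction repeat once per matched element.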
Output
The node outputs JSON data representing an array of objects, each corresponding to one matched base element on the page. Each object contains key-value pairs where keys are the field names defined in the input properties, and values are the extracted content according to the specified field type:
- For Text fields: plain text content extracted from the selected element.
- For HTML fields: inner HTML content of the selected element.
- For Attribute fields: the value of the specified attribute (e.g., href, src).
If the option to include the original full webpage text is enabled, the output will also contain a field with the entire page's raw text content.
The node outputs only structured JSON data; it does not produce binary output.
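For a product-listing page, the output might look like the following (all values are invented for illustration). The `clean_text` helper shows one plausible reading of the Clean Text option: collapsing runs of spaces and newlines into single spaces.

```python
# Hypothetical output: one object per matched base element (values invented).
results = [
    {"title": "Red Widget", "blurb": "<p>A <b>red</b> widget.</p>", "link": "/products/red-widget"},
    {"title": "Blue Widget", "blurb": "<p>A <b>blue</b> widget.</p>", "link": "/products/blue-widget"},
]

def clean_text(raw: str) -> str:
    """Collapse extra spaces and newlines -- one plausible reading of 'Clean Text'."""
    return " ".join(raw.split())

assert clean_text("  Red \n Widget ") == "Red Widget"
```

Here `title` came from a Text field, `blurb` from an HTML field (inner HTML preserved), and `link` from an Attribute field reading `href`.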
Dependencies
- Requires an API key credential for the Crawl4AI service to perform the extraction.
- Uses a headless browser environment internally to load and interact with web pages.
- No additional external dependencies are exposed to the user.
- The node expects proper network access to the target URLs.
Troubleshooting
- Timeouts: If the page takes too long to load or JavaScript execution hangs, increase the "Timeout (Ms)" property.
- Empty results: Check that the "Base Selector" correctly matches the repeating elements on the page. Also verify that the "Fields" CSS selectors are correct relative to the base selector.
- JavaScript not executing: Ensure "Enable JavaScript" is set to true if the page relies on client-side rendering. Use the "JavaScript Code" property to trigger interactions like clicking "Load More" buttons.
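As an illustration of the "JavaScript Code" property, a snippet along these lines could click every "Load More" button before extraction runs. The selector `button.load-more` is a hypothetical example; the `document` guard simply lets the sketch load harmlessly outside a browser.

```javascript
// Sketch of code for the "JavaScript Code" property.
// "button.load-more" is a hypothetical selector -- adjust to the target page.
function clickAll(selector) {
  // Guarded so the sketch is a no-op outside a browser environment.
  if (typeof document === "undefined") return 0;
  const buttons = document.querySelectorAll(selector);
  buttons.forEach((btn) => btn.click());
  return buttons.length;
}

clickAll("button.load-more");
```

If the page loads content on scroll rather than on click, a `window.scrollTo(0, document.body.scrollHeight)` call is the usual alternative; combine either approach with a generous "Timeout (Ms)" so the newly loaded elements are present before extraction.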
- Cache issues: If stale data is returned, try setting "Cache Mode" to "Bypass (Force Fresh)" to force fresh page loads.
- Attribute extraction errors: Make sure the attribute name is valid and exists on the selected element.
- Credential errors: Confirm that a valid API key credential is configured for the node.
Links and References
- CSS Selectors Reference
- Headless Browser Automation Concepts
- Web Scraping Best Practices
- Crawl4AI official documentation (not publicly linked here; refer to your service provider)