Crawl4AI: Content Extractor

Extract structured content from web pages using Crawl4AI


Overview

The Crawl4AI: Content Extractor node uses a large language model (LLM) to extract structured content from web pages according to user-defined instructions and a schema. It is useful when you want specific data points from a website without writing scraping code, for example product details, article summaries, or other custom information from HTML pages.

Typical use cases include:

E-commerce: Extracting product names, prices, and descriptions from online stores.
News aggregation: Summarizing articles or extracting headlines and authors.
Market research: Collecting structured data from competitor websites.
Data enrichment: Pulling additional info from public web pages into workflows.

The node loads the target URL in a browser environment (with JavaScript enabled if configured), can run custom JavaScript to interact with the page (for example, clicking buttons or scrolling), and then sends the extracted text, together with the user instructions and schema fields, to an LLM for structured extraction.
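The final step of that flow, combining page text, instructions, and schema fields into a single LLM request, can be pictured with a short sketch. This is not the node's actual prompt format, which the documentation does not show: the function names, the prompt wording, the field dictionary keys, and the `max_chars` limit are all assumptions made for illustration.

```python
def describe_field(field):
    """Render one schema field as a single prompt line (hypothetical format)."""
    required = ", required" if field.get("required") else ""
    return f"- {field['name']} ({field['type']}{required}): {field.get('description', '')}"


def build_extraction_prompt(page_text, instructions, fields, max_chars=8000):
    """Combine page text, user instructions, and schema fields into one prompt.

    The layout and the max_chars budget are assumptions, not the node's real behavior.
    """
    # Clip very long pages so the prompt stays within a rough size budget.
    clipped = page_text[:max_chars]
    lines = ["Instructions: " + instructions, "Return a JSON object with these fields:"]
    lines.extend(describe_field(f) for f in fields)
    lines.extend(["Page content:", clipped])
    return "\n".join(lines)


fields = [
    {"name": "title", "type": "String", "description": "Product name", "required": True},
    {"name": "price", "type": "Number", "description": "Price in USD"},
]
prompt = build_extraction_prompt("Example Product costs $19.99", "Extract the product", fields)
print(prompt)
```

Keeping each field's description next to its name and type is the reason field descriptions matter so much for extraction quality: they are the only context the model has about what a field means.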

Properties

Name	Meaning
URL	The web page URL to extract content from.
Extraction Instructions	Text instructions guiding the LLM on what specific information to extract from the page.
Schema Fields	Defines the fields to extract. Each field has a Field Name (identifier for the extracted data), a Field Type (String, Number, Boolean, or Array), a Description (helps the LLM understand the field's meaning), and a Required flag (whether the field must be present).
Browser Options	Controls how the embedded browser loads the page: Enable JavaScript (run JS on the page); Headless Mode (run the browser without a UI); JavaScript Code (custom JS to execute before extraction); Timeout (maximum wait time for loading); Viewport Height/Width (browser window size).
LLM Options	Settings for the language model: LLM Provider (choice of supported LLMs); Max Tokens (maximum response length); Override LLM Provider (whether to override the credentials); Provider API Key (API key to use when overriding); Temperature (randomness level of the output).
Options	Additional extraction options: Cache Mode (controls caching behavior: enabled, bypass, or read-only); Include Original Text (whether to include the full webpage text in the output); CSS Selector (focus extraction on a specific part of the page).
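To make the Schema Fields property more concrete, the sketch below turns a list of field definitions into a JSON Schema style object. It is only one plausible representation: the `TYPE_MAP` mapping, the `build_json_schema` helper, and the dictionary keys used for each field are hypothetical, and the node may represent its schema differently.

```python
# Assumed mapping from the node's Field Type labels to JSON Schema type names.
TYPE_MAP = {
    "String": "string",
    "Number": "number",
    "Boolean": "boolean",
    "Array": "array",
}


def build_json_schema(fields):
    """Build a JSON Schema style dict from field definitions (hypothetical helper)."""
    properties = {}
    required = []
    for field in fields:
        properties[field["name"]] = {
            "type": TYPE_MAP[field["type"]],
            "description": field.get("description", ""),
        }
        # Only fields flagged as required end up in the schema's "required" list.
        if field.get("required"):
            required.append(field["name"])
    return {"type": "object", "properties": properties, "required": required}


example_fields = [
    {"name": "title", "type": "String", "description": "Product name", "required": True},
    {"name": "price", "type": "Number", "description": "Listed price", "required": True},
    {"name": "tags", "type": "Array", "description": "Labels shown on the page"},
]
schema = build_json_schema(example_fields)
```

An unknown Field Type raises a `KeyError` here, which is a deliberate choice for the sketch: failing early on a misspelled type is easier to debug than a silently malformed schema.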

Output

The node outputs JSON objects containing the extracted structured data according to the defined schema fields. Each output item corresponds to one extraction result with keys matching the field names specified in the schema.

If enabled, the output can also include the original full text content of the webpage under a dedicated property.

The node does not appear to produce binary data output.

Example output structure (simplified):

{
  "title": "Example Product",
  "price": 19.99,
  "description": "A great product for your needs.",
  "inStock": true,
  "tags": ["sale", "popular"]
}
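Because an LLM produces the output, a downstream step may want to confirm that each item really matches the schema before using it. The following is a minimal sketch of such a check, assuming field definitions are available as dictionaries with `name`, `type`, and `required` keys; the `check_item` helper and that dictionary shape are illustrative, not part of the node.

```python
# One type check per Field Type. bool is excluded from Number because
# Python treats True/False as integers.
TYPE_CHECKS = {
    "String": lambda v: isinstance(v, str),
    "Number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "Boolean": lambda v: isinstance(v, bool),
    "Array": lambda v: isinstance(v, list),
}


def check_item(item, fields):
    """Return a list of problems in one output item; an empty list means it matches."""
    problems = []
    for field in fields:
        name = field["name"]
        if name not in item:
            if field.get("required"):
                problems.append(f"missing required field: {name}")
            continue
        if not TYPE_CHECKS[field["type"]](item[name]):
            problems.append(f"{name}: expected {field['type']}")
    return problems


fields = [
    {"name": "title", "type": "String", "required": True},
    {"name": "price", "type": "Number", "required": True},
    {"name": "inStock", "type": "Boolean"},
    {"name": "tags", "type": "Array"},
]
item = {
    "title": "Example Product",
    "price": 19.99,
    "description": "A great product for your needs.",
    "inStock": True,
    "tags": ["sale", "popular"],
}
problems = check_item(item, fields)
```

Keys that are not in the schema, such as `description` in this item, are ignored rather than reported, since the node may add extra properties like the original page text.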

Dependencies

Requires an active API key credential for the Crawl4AI service to perform content extraction.
Supports multiple LLM providers; if overriding the default provider, an API key for that provider is needed.
Uses a headless browser environment internally to load and interact with web pages.
No other external dependencies are explicitly required.

Troubleshooting

Page Load Failures: If the URL is incorrect or the page takes too long to load, extraction will fail. Adjust the timeout or verify the URL.
JavaScript Disabled Issues: Some pages require JavaScript to render content. Ensure "Enable JavaScript" is turned on if needed.
Incorrect Extraction: If the LLM does not extract expected fields, refine the extraction instructions or improve schema field descriptions.
Cache Problems: If stale data is returned, check cache mode settings; use "Bypass" to force fresh extraction.
API Key Errors: Missing or invalid API keys for Crawl4AI or LLM providers will cause authentication errors.
Custom JS Errors: Invalid JavaScript code in browser options may cause page interaction failures.
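For the Cache Problems item above, it can help to picture how the three cache modes might differ. The sketch below models them with a plain dictionary. The semantics are assumptions (enabled reads and writes the cache, read-only reads but never writes, bypass does neither), and `fetch_with_cache` and its parameters are hypothetical names, not the node's API.

```python
def fetch_with_cache(url, mode, cache, fetch):
    """Fetch a page under an assumed cache mode: 'enabled', 'read-only', or 'bypass'."""
    can_read = mode in ("enabled", "read-only")
    can_write = mode == "enabled"
    # Serve a cached copy only when the mode allows reading from the cache.
    if can_read and url in cache:
        return cache[url]
    result = fetch(url)
    # Store the fresh result only when the mode allows writing to the cache.
    if can_write:
        cache[url] = result
    return result


calls = []


def fake_fetch(url):
    """Stand-in for a real page load; records each call so fetches can be counted."""
    calls.append(url)
    return f"fresh-{len(calls)}"


cache = {"https://example.com": "stale"}
cached = fetch_with_cache("https://example.com", "enabled", cache, fake_fetch)
fresh = fetch_with_cache("https://example.com", "bypass", cache, fake_fetch)
```

Under these assumptions, bypass returns fresh content but leaves the stale entry in place, so switching back to enabled afterwards would serve the old copy again.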