Crawl4AI: Content Extractor

Extract structured content from web pages using Crawl4AI


Overview

The "Crawl4AI: Content Extractor" node extracts structured JSON content from web pages. It can fetch JSON directly from a URL or pull it out of an HTML page, including JSON inside `<script>` tags and JSON-LD structured data. Use it when a website exposes its data as JSON or embeds JSON in its HTML.

Practical examples include:

- Extracting product information from e-commerce sites that provide JSON data in script tags.
- Gathering metadata or structured data from news articles using JSON-LD.
- Fetching API-like JSON responses directly from URLs for further processing.

Properties

Name	Meaning
URL	The web address to fetch. Depending on "Source Type", this is either a URL that returns JSON directly or an HTML page that contains the JSON.
JSON Path	The path within the JSON structure to extract specific data. If left empty, the entire JSON response is returned.
Source Type	Specifies where to find the JSON data on the page. Options are: - Direct JSON URL: The URL returns JSON directly. - JSON in Script Tag: JSON is embedded inside an HTML `<script>` tag. - JSON-LD: JSON-LD structured data.
Script Selector	CSS selector to identify the `<script>` tag containing the JSON data when "Source Type" is set to "JSON in Script Tag".
Browser Options	Collection of options controlling browser behavior during extraction: - Headless Mode: Run browser without UI. - Enable JavaScript: Allow JS execution on the page. - Timeout (MS): Max wait time for page load. - JavaScript Code: Custom JS code to run before extraction (e.g., scrolling).
Options	Additional options: - Cache Mode: Controls caching behavior with options Enabled (read/write), Bypass (force fresh fetch), Only (read only). - Include Full Content: Whether to include the full JSON content along with extracted data. - Headers: HTTP headers to send with the request, in JSON format.
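
To make the "JSON in Script Tag" and "JSON-LD" source types concrete, here is a minimal Python sketch of how JSON might be pulled out of a page. It is not the node's implementation. It uses only the standard library, and because the standard library has no CSS selector engine, it matches a script by its `id` attribute instead of a full Script Selector. The names `ScriptCollector` and `extract_json`, the `source_type` values, and the sample HTML are all assumptions made for illustration. The "Direct JSON URL" case is not shown because it reduces to parsing the response body as JSON.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect (attributes, body text) for every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # list of (attrs dict, body text)
        self._attrs = None   # attributes of the <script> currently open, if any
        self._chunks = []    # text fragments seen inside that <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            self.scripts.append((self._attrs, "".join(self._chunks)))
            self._attrs = None


def extract_json(html, source_type, script_id=None):
    """Return parsed JSON from an HTML page.

    source_type is a hypothetical stand-in for the node's "Source Type":
      "json_ld"    -> list of all application/ld+json blocks
      "script_tag" -> the script whose id equals script_id, or None
    """
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()

    if source_type == "json_ld":
        return [
            json.loads(body)
            for attrs, body in collector.scripts
            if attrs.get("type") == "application/ld+json"
        ]
    if source_type == "script_tag":
        for attrs, body in collector.scripts:
            if attrs.get("id") == script_id:
                return json.loads(body)
        return None
    raise ValueError(f"unsupported source type: {source_type!r}")


# Made-up page used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "NewsArticle", "headline": "Hello"}</script>
<script id="product-data" type="application/json">{"product": {"name": "Lamp", "price": 19.5}}</script>
</head><body></body></html>
"""

articles = extract_json(SAMPLE_HTML, "json_ld")
product = extract_json(SAMPLE_HTML, "script_tag", script_id="product-data")
```

If the `id` does not match any script, `extract_json` returns `None`, which mirrors the "wrong selector, no data" situation described under Troubleshooting.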

Output

The node outputs the JSON data extracted from the specified source. The output `json` field contains the data selected by the JSON Path, or the entire JSON document if no path is specified.

If the "Include Full Content" option is enabled, the output will also contain the complete JSON content fetched from the source alongside the extracted subset.
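
The interaction between JSON Path and "Include Full Content" can be sketched as follows. This is an illustration under stated assumptions, not the node's code: the path syntax here is a simplified dotted form (for example `product.offers.0.price`) rather than full JSONPath, and the output keys `data` and `fullContent` are hypothetical names chosen for the example.

```python
def resolve_path(document, path):
    """Follow a simplified dotted path; return None if it cannot be resolved."""
    if not path:
        return document  # empty path: the whole document is the result
    current = document
    try:
        for key in path.split("."):
            if isinstance(current, list):
                current = current[int(key)]  # numeric segment indexes a list
            else:
                current = current[key]       # other segments are object keys
    except (KeyError, IndexError, ValueError, TypeError):
        return None
    return current


def build_output(document, json_path="", include_full_content=False):
    """Assemble an output item; the key names are assumptions for illustration."""
    output = {"data": resolve_path(document, json_path)}
    if include_full_content:
        output["fullContent"] = document
    return output


# Made-up document used only to exercise the sketch.
doc = {"product": {"name": "Lamp", "offers": [{"price": 19.5}]}}

subset_only = build_output(doc, "product.offers.0.price")
with_full = build_output(doc, "product.name", include_full_content=True)
```

In this sketch an unresolvable path yields `None` instead of raising, which corresponds to the "empty result" symptom listed under Troubleshooting.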

Neither the source code nor the properties indicate any binary data output.

Dependencies

- Requires an API key credential for the Crawl4AI service to perform the extraction.
- Optionally uses a headless browser to load pages and execute their JavaScript.
- Supports custom HTTP headers on requests.
- Provides a configurable cache (see "Cache Mode") to avoid refetching on repeated requests.
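
The three Cache Mode options can be read as a small decision procedure. The sketch below is one plausible interpretation, not the node's or Crawl4AI's actual behavior: the function `fetch_with_cache`, the mode strings, and the dictionary-based cache are hypothetical, and the choice that "bypass" neither reads nor writes the cache is an assumption.

```python
def fetch_with_cache(url, cache_mode, cache, fetch):
    """Return content for url according to cache_mode.

    cache_mode (hypothetical values):
      "enabled" -> read from the cache, fetch and store on a miss
      "bypass"  -> always fetch fresh (assumed here to leave the cache untouched)
      "only"    -> read from the cache only; return None on a miss
    cache is any dict-like store; fetch is a callable taking the URL.
    """
    if cache_mode not in ("enabled", "bypass", "only"):
        raise ValueError(f"unknown cache mode: {cache_mode!r}")

    if cache_mode == "bypass":
        return fetch(url)

    if url in cache:
        return cache[url]

    if cache_mode == "only":
        return None  # read-only mode never triggers a network fetch

    result = fetch(url)
    cache[url] = result
    return result


# Stand-in fetcher that records calls instead of touching the network.
calls = []


def fake_fetch(url):
    calls.append(url)
    return {"url": url}


cache = {}
miss_in_only_mode = fetch_with_cache("https://example.com/a", "only", cache, fake_fetch)
first = fetch_with_cache("https://example.com/a", "enabled", cache, fake_fetch)
second = fetch_with_cache("https://example.com/a", "enabled", cache, fake_fetch)
```

With "enabled", the second request is served from the cache, so `fake_fetch` runs only once for the two calls.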

Troubleshooting

Common issues:
- An incorrect or missing URL causes the fetch to fail.
- An invalid JSON Path may return empty or incorrect results.
- If the JSON is embedded in a script tag and the Script Selector does not match it, no data is extracted.
- Disabling JavaScript in Browser Options may prevent dynamically loaded content from appearing.
- A timeout that is too short may cause incomplete page loads and extraction failures.

Error messages:
- Network errors or timeouts indicate connectivity or server issues; increase the timeout or check that the URL is reachable.
- Parsing errors suggest invalid JSON or an incorrect JSON Path; verify the JSON structure and the path.
- Authentication errors imply missing or invalid API credentials; make sure the required API key is configured.
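
Several of the issues above can be caught before a request is sent. The following is a rough pre-flight check, assuming hypothetical parameter names (`source_type`, `script_selector`, `timeout_ms`) and a hypothetical `"script_tag"` value for the "JSON in Script Tag" source type. It only covers the checks that can be made locally and does not validate the JSON Path or the credentials.

```python
from urllib.parse import urlparse


def check_parameters(url, source_type, script_selector, timeout_ms):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL must be an absolute http(s) address")

    # A selector is only needed when the JSON lives inside a <script> tag.
    if source_type == "script_tag" and not script_selector.strip():
        problems.append("Script Selector is required for 'JSON in Script Tag'")

    if timeout_ms <= 0:
        problems.append("Timeout (MS) must be a positive number")

    return problems


ok = check_parameters("https://example.com/data.json", "direct", "", 30000)
bad = check_parameters("example.com", "script_tag", "", 0)
```

Here `ok` is an empty list, while `bad` reports one problem for each of the three checks.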

Links and References

JSONPath Syntax Reference
JSON-LD Introduction
Headless Browser Automation Concepts (for understanding browser options)
Crawl4AI official documentation (not linked here due to lack of URL in source)