Package Information
Released: 5/12/2025
Downloads: 149 weekly / 757 monthly
Latest Version: 0.1.7
Author: golfamigo
Documentation
Crawl4AI for n8n
This project provides n8n integration for Crawl4AI, a powerful web crawling and data extraction tool. It consists of two main nodes:
- Crawl4AI: Basic Crawler - For general web crawling and content extraction
- Crawl4AI: Content Extractor - For extracting structured data using CSS selectors, LLM, or JSON
Features
Basic Crawler Node
- Crawl Single URL - Extract content from a single web page
- Crawl Multiple URLs - Process multiple web pages in one operation
- Process Raw HTML - Extract content from raw HTML without crawling
Content Extractor Node
- CSS Selector Extractor - Extract structured data using CSS selectors
- LLM Extractor - Use AI to extract structured data from webpages
- JSON Extractor - Extract and process JSON data from web pages
Installation
- Clone this repository into your n8n custom nodes directory
- Run `npm install` to install dependencies
- Restart your n8n instance
Usage
Setting up credentials
Before using the nodes, you need to set up Crawl4AI API credentials:
- Go to Settings > Credentials > New
- Select Crawl4AI API
- Configure connection settings:
  - Connection Mode: Direct or Docker
  - Authentication: Configure as needed
  - LLM Settings: Enable and configure if needed for AI extraction
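Conceptually, the credential resolves to a base URL plus optional authentication and LLM settings. A minimal TypeScript sketch of that shape (field names here are illustrative assumptions, not the node's actual credential schema):

```typescript
// Illustrative shape of a Crawl4AI credential as described above.
// Field names are assumptions for this sketch, not the node's real schema.
type ConnectionMode = "direct" | "docker";

interface Crawl4aiCredentials {
  connectionMode: ConnectionMode;
  baseUrl: string;          // e.g. the Docker container's API endpoint
  apiKey?: string;          // only if authentication is enabled
  llm?: {
    enabled: boolean;
    provider?: string;      // needed for LLM-based extraction
    apiKey?: string;
  };
}

// Build the HTTP headers a client would send for these credentials.
function authHeaders(creds: Crawl4aiCredentials): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (creds.apiKey) headers["Authorization"] = `Bearer ${creds.apiKey}`;
  return headers;
}

const creds: Crawl4aiCredentials = {
  connectionMode: "docker",
  baseUrl: "http://localhost:11235", // port commonly used in Crawl4AI's Docker examples
  apiKey: "secret-token",
};
```

In Docker mode the base URL points at the running Crawl4AI container; in Direct mode the node would talk to a local Crawl4AI installation instead.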
Basic Crawler Usage
The Basic Crawler node allows you to crawl web pages and extract their content:
- Add the "Crawl4AI: Basic Crawler" node to your workflow
- Select an operation (Crawl Single URL, Crawl Multiple URLs, or Process Raw HTML)
- Configure the required parameters
- Run the workflow to extract content
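Downstream nodes receive the crawler's results as regular n8n items. The sketch below shows a Code-node-style consumer that filters out failed crawls and summarizes the rest; the output field names (`url`, `success`, `markdown`) are assumptions for illustration, not a guaranteed contract:

```typescript
// Sketch of a downstream n8n Code node consuming Basic Crawler output.
// The exact output fields depend on the node; `url`, `success`, and
// `markdown` here are illustrative assumptions.
interface CrawlItem {
  json: { url: string; success: boolean; markdown?: string };
}

// Keep only successful crawls and summarize each page by word count.
function summarize(items: CrawlItem[]): { url: string; words: number }[] {
  return items
    .filter((item) => item.json.success && item.json.markdown)
    .map((item) => ({
      url: item.json.url,
      words: item.json.markdown!.trim().split(/\s+/).length,
    }));
}

const items: CrawlItem[] = [
  { json: { url: "https://example.com", success: true, markdown: "Hello crawled world" } },
  { json: { url: "https://example.com/missing", success: false } },
];
```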
Content Extractor Usage
The Content Extractor node allows you to extract structured data from web pages:
- Add the "Crawl4AI: Content Extractor" node to your workflow
- Select an extraction method (CSS Selector, LLM, or JSON)
- Configure the extraction parameters
- Run the workflow to extract structured data
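CSS-based extraction in Crawl4AI is typically driven by a schema: a base selector that identifies each record, plus named fields with their own selectors. A sketch of such a schema, modeled on Crawl4AI's JsonCssExtractionStrategy pattern (the n8n node's exact parameter names may differ):

```typescript
// Sketch of a CSS-selector extraction schema, modeled on Crawl4AI's
// JsonCssExtractionStrategy pattern; the node's parameter names may differ.
type FieldType = "text" | "attribute" | "html";

interface SchemaField {
  name: string;
  selector: string;
  type: FieldType;
  attribute?: string; // required when type === "attribute"
}

interface ExtractionSchema {
  name: string;
  baseSelector: string; // one match per extracted record
  fields: SchemaField[];
}

const productSchema: ExtractionSchema = {
  name: "products",
  baseSelector: "div.product-card",
  fields: [
    { name: "title", selector: "h2", type: "text" },
    { name: "price", selector: ".price", type: "text" },
    { name: "link", selector: "a", type: "attribute", attribute: "href" },
  ],
};

// Basic sanity check: attribute fields must name the attribute to read.
function validateSchema(schema: ExtractionSchema): string[] {
  const errors: string[] = [];
  for (const f of schema.fields) {
    if (f.type === "attribute" && !f.attribute) {
      errors.push(`field "${f.name}" needs an attribute name`);
    }
  }
  return errors;
}
```

Each element matching `baseSelector` becomes one output record, with the field selectors evaluated relative to it.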
Configuration Options
Browser Options
- Headless Mode: Run browser in headless mode
- Enable JavaScript: Enable JavaScript execution
- Viewport Size: Set browser viewport dimensions
- Timeout: Maximum time to wait for page load
- User Agent: Override browser user agent
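The browser options above can be pictured as a small settings object merged over defaults. The values below are examples for the sketch, not the node's actual defaults:

```typescript
// Illustrative defaults for the browser options listed above; the
// concrete values are examples, not the node's actual defaults.
interface BrowserOptions {
  headless: boolean;
  javaScriptEnabled: boolean;
  viewport: { width: number; height: number };
  timeoutMs: number;
  userAgent?: string; // unset means the browser's own user agent is used
}

const defaultBrowserOptions: BrowserOptions = {
  headless: true,
  javaScriptEnabled: true,
  viewport: { width: 1280, height: 720 },
  timeoutMs: 30_000,
};

// Merge user overrides onto the defaults (shallow; viewport replaced whole).
function withOverrides(overrides: Partial<BrowserOptions>): BrowserOptions {
  return { ...defaultBrowserOptions, ...overrides };
}
```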
Crawler Options
- Cache Mode: Control caching behavior
- JavaScript Code: Execute custom JS on the page
- CSS Selector: Focus crawling on specific elements
- Excluded Tags: Skip specific HTML tags
- Check Robots.txt: Respect robots.txt rules
- Word Count Threshold: Filter content by word count
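To make the word count threshold concrete: it drops content blocks shorter than a minimum word count, which tends to remove navigation links and footer fragments while keeping article text. This is an illustration of the concept, not the node's internal code:

```typescript
// How a word-count threshold filters extracted content blocks: blocks
// shorter than the threshold (nav links, footers, etc.) are dropped.
// Illustration of the concept only, not the node's internal code.
function filterByWordCount(blocks: string[], threshold: number): string[] {
  return blocks.filter(
    (block) => block.trim().split(/\s+/).filter(Boolean).length >= threshold,
  );
}

const blocks = [
  "Home",
  "About us",
  "Crawl4AI extracts the main article content from a page",
];
```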
LLM Extraction Options
- Extraction Instructions: Instructions for the AI
- Schema Fields: Define structured data schema
- LLM Provider: Choose AI model provider
- Temperature: Control randomness of AI responses
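Put together, the LLM extraction options above amount to an instruction, a target schema, a provider choice, and a temperature. A sketch of how they might combine into a request (field names and the provider string are illustrative assumptions):

```typescript
// Sketch of how the LLM extraction options above might combine into a
// request payload. Field names and provider string are illustrative.
interface LlmSchemaField {
  name: string;
  description: string;
  type: "string" | "number" | "boolean";
}

interface LlmExtractionRequest {
  instruction: string;
  schema: LlmSchemaField[];
  provider: string;    // e.g. "openai/gpt-4o-mini" (assumed example)
  temperature: number; // low values keep extraction more deterministic
}

function buildLlmRequest(
  instruction: string,
  schema: LlmSchemaField[],
  provider = "openai/gpt-4o-mini",
  temperature = 0,
): LlmExtractionRequest {
  if (temperature < 0 || temperature > 2) {
    throw new Error("temperature must be between 0 and 2");
  }
  return { instruction, schema, provider, temperature };
}
```

For extraction tasks a temperature at or near 0 is usually preferable, since the goal is faithful structured output rather than creative variation.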
Project Structure
nodes/
├── Crawl4aiBasicCrawler/
│ ├── Crawl4aiBasicCrawler.node.ts # Main node file
│ ├── crawl4ai.svg # Icon
│ ├── actions/
│ │ ├── operations.ts # Operations definition
│ │ ├── router.ts # Router handler
│ │ ├── crawlSingleUrl.operation.ts # Single URL crawl operation
│ │ ├── crawlMultipleUrls.operation.ts # Multiple URL crawl operation
│ │ └── processRawHtml.operation.ts # Raw HTML processing operation
│ └── helpers/
│ ├── interfaces.ts # Interface definitions
│ ├── utils.ts # Common utilities
│ ├── apiClient.ts # API client
│ └── formatters.ts # Formatting tools
│
└── Crawl4aiContentExtractor/
├── Crawl4aiContentExtractor.node.ts # Main node file
├── crawl4ai.svg # Icon
├── actions/
│ ├── operations.ts # Operations definition
│ ├── router.ts # Router handler
│ ├── cssExtractor.operation.ts # CSS selector extraction operation
│ ├── llmExtractor.operation.ts # LLM extraction operation
│ └── jsonExtractor.operation.ts # JSON extraction operation
└── helpers/
├── interfaces.ts # Interface definitions
├── utils.ts # Common utilities
├── apiClient.ts # API client
└── formatters.ts # Formatting tools
credentials/
└── Crawl4aiApi.credentials.ts # Credentials definition
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.