Actions (20)
- AI Scrape Actions
- Analyze Sentiment Actions
- Convert to SQL Actions
- Generate Embedding Actions
- HTML to Any Actions
- Image Generation Actions
- Make Prediction Actions
- NSFW Detection Actions
- Object Detection Actions
- Process Image Actions
- Profanity Detection Actions
- Search Web Actions
- Spam Detection Actions
- Speech to Text Actions
- Spell Check Actions
- Summary Actions
- Text to Speech Actions
- Translate Actions
- Translate Image Actions
- Web Suggestion Actions
Overview
The AI Scrape operation of the JigsawStack node extracts structured information from web pages or raw HTML content using AI-powered scraping. Instead of manually parsing HTML or writing complex selectors, you specify "element prompts" (such as "titles", "prices", or "points") that describe the data to extract. The operation can scrape either directly from a URL or from HTML content you provide.
This node is beneficial in scenarios such as:
- Extracting product details, prices, or reviews from e-commerce sites.
- Gathering headlines, summaries, or key points from news articles.
- Collecting structured data from paginated listings by specifying page numbers.
- Scraping content behind authentication or with custom headers and cookies.
- Using advanced configurations for controlling network requests, blocking unwanted resources, or customizing page load behavior.
Practical example: You want to scrape all product titles and prices from an online store's category page. You provide the URL, specify element prompts ["titles", "prices"], optionally set a root CSS selector to narrow scope, and get back structured JSON with those elements extracted.
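The practical example above can be sketched as a plain parameter object. This is a minimal illustration, not the node's internal format: the helper name `build_scrape_params`, the placeholder URL, and the snake_case key names (`url`, `element_prompts`, `root_element_selector`) are assumptions derived from the property names on this page.

```python
import json


def build_scrape_params(url, element_prompts, root_element_selector=None):
    """Assemble the inputs described in the practical example.

    Key names are hypothetical; they mirror the property names on this page.
    """
    params = {
        "url": url,
        "element_prompts": list(element_prompts),
    }
    # The root selector is optional, so include it only when provided.
    if root_element_selector:
        params["root_element_selector"] = root_element_selector
    return params


params = build_scrape_params(
    "https://example.com/category/shoes",  # placeholder URL
    ["titles", "prices"],
    root_element_selector="main",
)
print(json.dumps(params, indent=2))
```

In the node itself these values are entered in the URL, Element Prompts, and Root Element Selector fields rather than built in code.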
Properties
Name | Meaning |
---|---|
Scrape Source | Choose the source of the scrape: either "URL" to scrape from a live webpage or "HTML" to scrape from provided raw HTML content. |
URL | The URL of the page to scrape. Required if Scrape Source is "url". |
HTML | Raw HTML content to scrape. Required if Scrape Source is "html". |
Element Prompts | Array of strings describing the elements to extract from the page, e.g., "titles", "points", "prices". These guide the AI on what data to look for. |
Root Element Selector | CSS selector string to limit scraping scope to a specific element and its children, e.g., "main" . |
Page Position | Number indicating the current page number for pagination purposes (minimum value 1). |
Advance Config | JSON object for advanced scraper options including console logging, network settings, cookies, HTTP headers, request blocking patterns, and page navigation options. |
Http Headers | JSON object specifying custom HTTP headers to send with the request as key-value pairs. |
Reject Request Pattern | Array of string patterns to intercept and block certain resource requests during scraping, e.g., ["jpg", "png"] to block images. |
Goto Options | JSON object defining custom page-load behavior settings, such as waitUntil conditions. |
Wait For | JSON object defining wait conditions before scraping, supporting modes like timeout (milliseconds), waiting for a selector, or executing a function. |
Cookies | JSON object specifying cookies to set for the page request. |
Size Preset | Predefined screen size presets like "HD", "FHD", "4K UHD" to simulate different viewport sizes. |
Is Mobile | Boolean flag to emulate a mobile device viewport. |
Scale | Device scale factor for the viewport (minimum 1). |
Width | Viewport width in pixels. |
Height | Viewport height in pixels. |
Force Rotate Proxy | Boolean flag to force proxy rotation for each request, which may incur additional costs. |
BYO Proxy | JSON object for bring-your-own-proxy configuration including server URL and optional authentication credentials. |
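Several of the properties above take JSON values. The sketch below shows what such values might look like and serializes them to JSON text. The inner key names (`mode`, `value`, `waitUntil`, `server`, `auth`) and all sample values are assumptions for illustration only; check the JigsawStack API documentation for the exact shapes.

```python
import json

# Illustrative values for the JSON-typed properties in the table above.
# Inner key names and values are assumptions, not confirmed field names.
advanced_options = {
    "Http Headers": {"Accept-Language": "en-US"},
    "Reject Request Pattern": ["jpg", "png"],  # block image requests
    "Goto Options": {"waitUntil": "load"},
    "Wait For": {"mode": "timeout", "value": 3000},  # milliseconds
    "Cookies": {"session_id": "<your-session-cookie>"},
    "BYO Proxy": {
        "server": "http://proxy.example.com:8080",
        "auth": {"username": "<user>", "password": "<password>"},
    },
}

# Serialize each value to JSON text, the form a JSON field would accept.
serialized = {name: json.dumps(value) for name, value in advanced_options.items()}

for name, text in serialized.items():
    print(f"{name}: {text}")
```

Serializing with `json.dumps` also catches values that are not valid JSON before they reach the node.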
Output
The node outputs JSON data containing the scraped results based on the specified element prompts. The structure typically includes keys corresponding to each prompt with arrays or values representing the extracted content.
No binary output is indicated for this operation; the node primarily outputs structured JSON.
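To make the output description concrete, here is a minimal sketch of post-processing a result keyed by prompt. The `sample_output` data and the `to_rows` helper are hypothetical; the real output shape may differ, so inspect an actual execution before relying on this.

```python
# Hypothetical output for the prompts ["titles", "prices"].
sample_output = {
    "titles": ["Trail Runner", "City Sneaker"],
    "prices": ["$89.00", "$59.50"],
}


def to_rows(output, prompts):
    """Zip per-prompt arrays into one record per item.

    Scalar values are wrapped in a list. zip() stops at the shortest
    array, so unmatched trailing items are dropped.
    """
    columns = []
    for prompt in prompts:
        value = output.get(prompt, [])
        columns.append(value if isinstance(value, list) else [value])
    return [dict(zip(prompts, row)) for row in zip(*columns)]


rows = to_rows(sample_output, ["titles", "prices"])
print(rows)
```

Pairing by position assumes the arrays are aligned item by item, which is not guaranteed; narrowing the scope with a root element selector may help keep them aligned.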
Dependencies
- Requires an active API key credential for the JigsawStack API service.
- Internet access to reach target URLs unless scraping raw HTML.
- Optional proxy configuration if using BYO Proxy or forcing proxy rotation.
- Properly configured HTTP headers, cookies, and advanced options may be necessary for scraping some protected or dynamic websites.
Troubleshooting
- Missing or invalid API key: The node requires a valid API key credential; ensure it is correctly set up.
- Invalid URL or unreachable site: Verify the URL is correct and accessible from your environment.
- Conflicting inputs: Either URL or HTML must be provided, not both. Ensure only one source is set.
- Incorrect element prompts: If no data is returned, check that the prompts accurately describe the desired elements.
- Pagination issues: Make sure the page position is set correctly starting at 1.
- Blocked resources: Overly aggressive reject request patterns might block essential scripts or styles causing incomplete scraping.
- Proxy errors: If using proxies, verify proxy server details and credentials are correct.
- Timeouts or slow loading: Adjust wait conditions or goto options to allow sufficient time for page load.
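Some of the checks above can be run before executing the node. The following is a minimal sketch of such a pre-flight check; the function name `check_scrape_inputs` and its parameters are hypothetical, and the URL scheme test is only a simple heuristic, not a reachability check.

```python
def check_scrape_inputs(url=None, html=None, page_position=1, scale=1):
    """Return a list of problems found in the inputs (empty if none)."""
    problems = []
    # Exactly one source must be set: URL or HTML, not both and not neither.
    if bool(url) == bool(html):
        problems.append("Provide exactly one of URL or HTML.")
    # Heuristic only: a well-formed web URL normally starts with a scheme.
    if url and not url.startswith(("http://", "https://")):
        problems.append("URL should start with http:// or https://.")
    # Minimum values taken from the Properties table.
    if page_position < 1:
        problems.append("Page Position must be 1 or greater.")
    if scale < 1:
        problems.append("Scale must be 1 or greater.")
    return problems


print(check_scrape_inputs(url="https://example.com", html="<p>hi</p>"))
```

An empty list means none of these particular problems were detected; API key, proxy, and timeout issues still need to be checked in the node itself.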
Links and References
- JigsawStack API Documentation (for detailed API capabilities)
- CSS Selectors Reference (to craft root element selectors)
- n8n Documentation (general usage and credential setup)