# @resettech/n8n-nodes-semantic-text-splitter
This is an n8n community node that provides a Semantic Double-Pass Merging text splitter for advanced document chunking in AI workflows.
## Description
The Semantic Double-Pass Merging Text Splitter is an advanced text splitting algorithm that uses embeddings to create semantically coherent chunks. It implements a two-pass approach:
- First Pass: Splits text based on semantic similarity between sentences
- Second Pass: Merges adjacent chunks that are semantically similar
This approach results in more coherent and contextually meaningful chunks compared to traditional character or token-based splitting methods.
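Both passes compare sentence embeddings with a vector distance. As an illustrative sketch (not the node's actual source), cosine distance in TypeScript:

```typescript
// Cosine distance between two embedding vectors: 0 = same direction,
// 1 = orthogonal, 2 = opposite. Lower distance = more semantically similar.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A pair of sentences whose embeddings point in nearly the same direction yields a distance near 0, so they stay in the same chunk; a spike in distance marks a candidate boundary.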
## Features
- Semantic-based splitting: Uses embeddings to determine chunk boundaries
- Double-pass merging: Optimizes chunk coherence through a second merging pass
- Flexible threshold options: Multiple methods for determining split points
- Size constraints: Enforce minimum and maximum chunk sizes
- Buffer context: Include surrounding sentences for better embedding context
- Customizable sentence splitting: Configure how text is split into sentences
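The buffer-context feature can be pictured as a sliding window over the sentence list. A minimal sketch, assuming a hypothetical helper name `buildContexts` (not taken from the node's code):

```typescript
// Combine each sentence with `bufferSize` neighbours on each side, so the
// embedding for a sentence sees its surrounding context, not the lone sentence.
function buildContexts(sentences: string[], bufferSize: number): string[] {
  return sentences.map((_, i) => {
    const start = Math.max(0, i - bufferSize);
    const end = Math.min(sentences.length, i + bufferSize + 1);
    return sentences.slice(start, end).join(" ");
  });
}
```

With the default buffer size of 1, each sentence is embedded together with its immediate predecessor and successor, which smooths out distance spikes caused by very short sentences.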
## Installation
### Community Nodes (Recommended)
- Go to Settings > Community Nodes
- Select Install
- Enter `@resettech/n8n-nodes-semantic-text-splitter` in the search field
- Select Install
### Manual Installation
- Navigate to your n8n installation's custom nodes folder:

  ```bash
  cd ~/.n8n/nodes
  ```

- Clone this repository and change into the node's directory:

  ```bash
  git clone https://github.com/ResetNetwork/n8n-nodes.git
  cd n8n-nodes/n8n-nodes-semantic-text-splitter
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Build the node:

  ```bash
  npm run build
  ```

- Restart n8n
## Usage
The Semantic Double-Pass Text Splitter node requires:
- An Embeddings input (e.g., OpenAI Embeddings, Google Gemini Embeddings)
- Either a Document input or Text input
### Node Parameters

#### Options
- Buffer Size: Number of sentences to include as context when creating embeddings (default: 1)
- Breakpoint Threshold Type: Method for determining chunk boundaries
  - Percentile: Use the 95th percentile of distances
  - Standard Deviation: Use mean + standard deviation
  - Interquartile: Use the interquartile range (IQR) method
  - Gradient: Find the maximum gradient change
- Breakpoint Threshold Amount: Manual threshold override (0-1)
- Number of Chunks: Target number of chunks (overrides the threshold)
- Second Pass Threshold: Similarity threshold for merging in the second pass (0-1, default: 0.8)
- Min Chunk Size: Minimum characters per chunk (default: 100)
- Max Chunk Size: Maximum characters per chunk (default: 2000)
- Sentence Split Regex: Regular expression used to split text into sentences
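To make the threshold options concrete, here is a hedged sketch of how such thresholds can be derived from the distance series. The formulas are illustrative assumptions matching the option names, and `breakpointThreshold` is a hypothetical helper, not the node's actual implementation:

```typescript
// Pick a breakpoint threshold from the cosine distances between consecutive
// sentence embeddings; distances above the threshold become chunk boundaries.
// The exact formulas here are illustrative, not the node's source.
function breakpointThreshold(
  distances: number[],
  method: "percentile" | "standardDeviation" | "interquartile",
): number {
  const sorted = [...distances].sort((x, y) => x - y);
  // Linear-interpolated quantile of the sorted distances.
  const quantile = (q: number): number => {
    const pos = (sorted.length - 1) * q;
    const lo = Math.floor(pos);
    const hi = Math.ceil(pos);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
  };
  const mean = distances.reduce((s, d) => s + d, 0) / distances.length;
  if (method === "percentile") return quantile(0.95);
  if (method === "standardDeviation") {
    const variance =
      distances.reduce((s, d) => s + (d - mean) ** 2, 0) / distances.length;
    return mean + Math.sqrt(variance);
  }
  // Interquartile: mean plus 1.5 times the IQR.
  return mean + 1.5 * (quantile(0.75) - quantile(0.25));
}
```

A single large distance spike (e.g., a topic change) pushes past all three thresholds, while uniformly similar sentences produce no breakpoints at all.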
### Example Workflow
- Add a Document Loader node (e.g., PDF, Text File)
- Add an Embeddings node (e.g., OpenAI Embeddings)
- Add the Semantic Double-Pass Text Splitter node
- Connect:
  - Document Loader → Text Splitter (Document input)
  - Embeddings → Text Splitter (Embeddings input)
- Connect the output to a Vector Store or further processing
## Algorithm Details
The algorithm works as follows:
- Sentence Splitting: Text is split into sentences using a configurable regex
- Context Building: Each sentence is combined with surrounding sentences based on buffer size
- Embedding Generation: Embeddings are created for each contextualized sentence
- Distance Calculation: Cosine distances between consecutive embeddings are calculated
- Breakpoint Detection: Chunk boundaries are determined based on distance threshold
- Initial Chunking: Text is split at detected breakpoints
- Second Pass Merging: Adjacent chunks with high similarity are merged
- Size Constraints: Chunks are adjusted to meet size requirements
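The second-pass merge in the steps above can be sketched as a single scan that fuses neighbouring chunks whose similarity clears the threshold. This is a simplified illustration, assuming a hypothetical `secondPassMerge` helper with the similarity function injected; the node's real merge logic may differ:

```typescript
// Merge adjacent chunks whose pairwise similarity meets the second-pass
// threshold, without exceeding the maximum chunk size. `similarity` would
// normally be cosine similarity of the chunks' embeddings.
function secondPassMerge(
  chunks: string[],
  similarity: (a: string, b: string) => number,
  threshold: number,
  maxChunkSize: number,
): string[] {
  const merged: string[] = [];
  for (const chunk of chunks) {
    const last = merged[merged.length - 1];
    if (
      last !== undefined &&
      similarity(last, chunk) >= threshold &&
      last.length + chunk.length + 1 <= maxChunkSize
    ) {
      // Fold the chunk into its semantically similar predecessor.
      merged[merged.length - 1] = `${last} ${chunk}`;
    } else {
      merged.push(chunk);
    }
  }
  return merged;
}
```

Because the first pass only looks at distances between consecutive sentences, it can over-split around a single noisy sentence; the second pass repairs this by rejoining neighbours that are still similar as whole chunks.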
## Development

```bash
# Install dependencies
npm install

# Build
npm run build

# Run in development mode
npm run dev

# Lint
npm run lint

# Format code
npm run format
```
## License
## Author
Reset Network - https://github.com/ResetNetwork
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request to the Reset Network n8n-nodes repository.
## Support
For issues and feature requests, please use the GitHub issues page.