Semantic Text Splitter

Split text into semantically coherent chunks using embedding-based breakpoints

Overview

The Semantic Text Splitter node splits input text into semantically meaningful chunks based on embedding similarity and sentence structure. It slides a window over the sentences of the text and applies a percentile threshold to the embedding distances between adjacent windows to determine breakpoints for splitting. This is particularly useful where large texts need to be broken down into smaller, contextually coherent pieces for further processing, such as embedding generation, semantic search indexing, or other natural language understanding tasks.
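
To make the mechanism concrete, here is a minimal TypeScript sketch of percentile-threshold semantic splitting. It illustrates the general technique only, not the node's internal code: the `embed` function stands in for whichever embedding provider is connected, and the helper names and exact windowing strategy are assumptions.

```typescript
// A minimal sketch of percentile-threshold semantic splitting, for
// illustration only; the node's real implementation may differ in detail.
// `embed` stands in for whichever embedding provider is connected.

type Embedder = (texts: string[]) => Promise<number[][]>;

interface SemanticSplitterOptions {
  breakpointPercentile: number; // e.g. 0.95 (Percentile Threshold For Breakpoints)
  windowSize: number;           // e.g. 3 (sentences per sliding window)
  minChunkSize: number;         // e.g. 100 (minimum characters per chunk)
  delimiters: string;           // e.g. ".!?" (sentence-ending characters)
}

function splitSentences(text: string, delimiters: string): string[] {
  // Escape delimiter characters for use inside a regex character class.
  const escaped = delimiters.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp(`[^${escaped}]+(?:[${escaped}]+|$)\\s*`, 'g');
  return (text.match(re) ?? [text]).map((s) => s.trim()).filter(Boolean);
}

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

function percentile(values: number[], p: number): number {
  // Nearest-rank percentile over the sorted distances.
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.ceil(p * sorted.length) - 1));
  return sorted[idx];
}

async function semanticSplit(
  text: string,
  embed: Embedder,
  opts: SemanticSplitterOptions,
): Promise<string[]> {
  const sentences = splitSentences(text, opts.delimiters);
  if (sentences.length <= 1) return sentences;

  // Embed each sentence together with its preceding neighbours so the
  // comparison reflects local context rather than a lone sentence.
  const windows = sentences.map((_, i) =>
    sentences.slice(Math.max(0, i - opts.windowSize + 1), i + 1).join(' '),
  );
  const vectors = await embed(windows);

  // Distances between consecutive windows; large jumps suggest topic shifts.
  const distances = vectors.slice(1).map((v, i) => cosineDistance(vectors[i], v));
  const threshold = percentile(distances, opts.breakpointPercentile);

  const chunks: string[] = [];
  let current: string[] = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    const soFar = current.join(' ');
    // Split only where the distance crosses the percentile threshold and the
    // accumulated chunk already satisfies the minimum size.
    if (distances[i - 1] >= threshold && soFar.length >= opts.minChunkSize) {
      chunks.push(soFar);
      current = [];
    }
    current.push(sentences[i]);
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}
```

With a breakpoint percentile of 0.95, only roughly the top 5% of inter-window distances qualify as breakpoints, which is why raising the threshold yields fewer, larger chunks and lowering it yields more, smaller ones.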

Practical examples include:

  • Preparing documents for vector database ingestion by splitting them into manageable chunks.
  • Enhancing semantic search accuracy by segmenting text at logical boundaries.
  • Preprocessing long articles or transcripts for AI models that have input size limits.

Properties

  • Notice: "This node must be connected to a document loader. Insert one." An informational message indicating that the node requires an upstream document loader connection.
  • Percentile Threshold For Breakpoints: Number (default 0.95). The percentile cutoff applied to embedding similarity scores to identify breakpoints during splitting.
  • Number Of Sentences To Consider In Sliding Window: Number (default 3). How many sentences are evaluated together when deciding where to split the text.
  • Minimum Chunk Size: Number (default 100). The minimum number of characters each chunk must contain, preventing overly small segments.
  • Sentence Delimiters: String (default ".!?"). The characters used to detect sentence boundaries during splitting.
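
For reference, the defaults above correspond to the options object used in the Overview sketch (the keys belong to that illustrative code, not to the node's actual parameter schema):

```typescript
// Documented defaults, expressed against the illustrative sketch's options.
const defaults: SemanticSplitterOptions = {
  breakpointPercentile: 0.95, // Percentile Threshold For Breakpoints
  windowSize: 3,              // Number Of Sentences To Consider In Sliding Window
  minChunkSize: 100,          // Minimum Chunk Size (characters)
  delimiters: '.!?',          // Sentence Delimiters
};
```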

Output

The node outputs JSON data representing the text split into chunks. Each chunk corresponds to a semantically coherent segment of the original text, determined by the configured parameters. The output is intended for downstream nodes that consume segmented text, such as embedding generators or semantic search indexers.

This node outputs JSON text chunks only; it does not produce or consume binary data.
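
As a rough illustration of the output shape only, assuming the LangChain convention of Document-like objects (the node's exact field names are not confirmed here):

```typescript
// Hypothetical output shape following the LangChain Document convention;
// the field names are an assumption, not taken from the node's source.
const output = [
  { pageContent: 'First semantically coherent chunk of the source text.', metadata: {} },
  { pageContent: 'Second chunk, starting where embedding similarity dropped.', metadata: {} },
];
```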

Dependencies

  • Requires an upstream document loader node providing the input text to be split.
  • Utilizes embedding data from a connected embedding node to calculate semantic breakpoints.
  • No external API keys or environment variables are explicitly required by this node itself.
  • Relies on internal utilities and classes bundled within the n8n LangChain integration.

Troubleshooting

  • Common issues:

    • Node not receiving input text because it is not connected to a document loader node.
    • Incorrect or missing embedding input causing failure in breakpoint calculation.
    • Setting a very high percentile threshold may result in fewer splits, producing large chunks.
    • Very low minimum chunk size might create too many small chunks, reducing semantic coherence.
  • Error messages:

    • Errors related to missing input connections typically indicate the need to connect a document loader or embedding node.
    • Parameter validation errors if required numeric fields are left empty or set to invalid values.
  • Resolutions:

    • Ensure the node is connected properly to both a document loader and an embedding provider.
    • Adjust the percentile threshold and window size parameters to balance chunk size and semantic relevance.
    • Verify that the sentence delimiters match the language and style of the text so sentence boundaries are detected accurately (see the quick check below).
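
To sanity-check a delimiter setting before running a workflow, the `splitSentences` helper from the Overview sketch can be exercised on a sample of the target text:

```typescript
// Quick check that the configured delimiters segment a sample as expected.
const sample = 'Dr. Smith arrived. Was he late? No! He was early.';
console.log(splitSentences(sample, '.!?'));
// -> [ 'Dr.', 'Smith arrived.', 'Was he late?', 'No!', 'He was early.' ]
// Note the false boundary after the abbreviation "Dr.": delimiter-based
// splitting is naive, and a larger Minimum Chunk Size can absorb such fragments.
```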
