pdf-ocr

Package Information

Released: 7/16/2025
Latest Version: 1.3.0
Author: dnachavez

Documentation

n8n-nodes-pdf-ocr

This is an n8n community node that lets you extract text from PDF files using OCR (Optical Character Recognition) in your n8n workflows.

PDF OCR uses Tesseract.js for text recognition and PDF.js for PDF processing, providing a completely free solution without requiring external APIs, accounts, or system dependencies.

n8n is a fair-code licensed workflow automation platform.

Installation
Operations
Compatibility
Usage
Troubleshooting
Publishing
Resources

Installation

Follow the installation guide in the n8n community nodes documentation.

Note: This node uses pure JavaScript libraries and doesn't require any external system dependencies.

Operations

PDF OCR

  • Extract text from PDF files - Converts PDF pages to images and uses OCR to extract text
  • Multi-page support - Processes all pages in a PDF document
  • Multi-language support - Supports over 100 languages including English, Spanish, French, German, Italian, Portuguese, Russian, Chinese Simplified, Japanese, and Korean
  • Multiple output formats - Choose between combined text, per-page text, or detailed output with metadata

Parameters

  • Input Binary Property - Name of the binary property containing the PDF file (default: "data")
  • Language - Language for OCR text recognition (default: English)
  • Output Format - How to format the extracted text:
    • Combined Text: All text combined into a single string
    • Per Page: Array of text for each page
    • Detailed: Combined text + per-page text + metadata
  • DPI Scale - Scale factor for rendering PDF pages before OCR (default: 2; higher values give better recognition quality but slower processing and higher memory use)
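
The parameters above can be summarized as a typed options object. This is a minimal sketch, not the node's actual source: the names `PdfOcrOptions`, `DEFAULT_OPTIONS`, and `withDefaults`, the field names, and the string values for the output format are all hypothetical. The defaults for the binary property, language, and DPI scale come from the list above; the default output format is an assumption, since it is not documented here.

```typescript
// Hypothetical names throughout; this mirrors the documented parameters,
// not the node's real implementation.
type OutputFormat = "combined" | "perPage" | "detailed";

interface PdfOcrOptions {
  binaryPropertyName: string; // Input Binary Property
  language: string;           // OCR language code, e.g. "eng" as shown in the output
  outputFormat: OutputFormat; // Output Format
  dpiScale: number;           // DPI Scale
}

const DEFAULT_OPTIONS: PdfOcrOptions = {
  binaryPropertyName: "data", // documented default
  language: "eng",            // documented default (English)
  outputFormat: "combined",   // assumption: no default is documented
  dpiScale: 2,                // documented default
};

// Merge user overrides onto the defaults and reject a non-positive scale.
function withDefaults(overrides: Partial<PdfOcrOptions>): PdfOcrOptions {
  const options = { ...DEFAULT_OPTIONS, ...overrides };
  if (!Number.isFinite(options.dpiScale) || options.dpiScale <= 0) {
    throw new Error("dpiScale must be a positive number");
  }
  return options;
}
```

For example, `withDefaults({ language: "deu", dpiScale: 3 })` keeps `"data"` as the binary property while switching to German at a higher render scale.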

Compatibility

  • Minimum n8n version: 0.187.0
  • Node.js version: >=20.15
  • System requirements: None (pure JavaScript)

Usage

Basic Usage

  1. Add the PDF OCR node to your workflow
  2. Connect it to a node that provides PDF binary data (e.g., HTTP Request, Read Binary File)
  3. Configure the input binary property name
  4. Select the desired language for OCR
  5. Choose the output format
  6. Run the workflow

Output Examples

Combined Text Output

{
  "text": "This is the combined text from all pages...",
  "totalPages": 3,
  "language": "eng"
}

Per Page Output

{
  "pages": [
    {
      "pageNumber": 1,
      "text": "Text from page 1..."
    },
    {
      "pageNumber": 2,
      "text": "Text from page 2..."
    }
  ],
  "totalPages": 2,
  "language": "eng"
}

Detailed Output

{
  "text": "Combined text from all pages...",
  "pages": [
    {
      "pageNumber": 1,
      "text": "Text from page 1..."
    }
  ],
  "metadata": {
    "totalPages": 1,
    "language": "eng",
    "dpiScale": 2
  }
}
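
Because the three formats place the text and page count in different fields, a downstream step may want to read them uniformly. The sketch below types the three shapes exactly as shown in the examples above; the type names and the helpers `getFullText` and `getTotalPages` are hypothetical, and the blank-line separator used when joining pages is an assumption (the node's own joining rule is not documented here).

```typescript
interface PageText {
  pageNumber: number;
  text: string;
}

interface CombinedOutput {
  text: string;
  totalPages: number;
  language: string;
}

interface PerPageOutput {
  pages: PageText[];
  totalPages: number;
  language: string;
}

interface DetailedOutput {
  text: string;
  pages: PageText[];
  metadata: { totalPages: number; language: string; dpiScale: number };
}

type PdfOcrOutput = CombinedOutput | PerPageOutput | DetailedOutput;

// Return the whole document text regardless of output format.
function getFullText(output: PdfOcrOutput): string {
  if ("text" in output) {
    return output.text; // Combined and Detailed already carry the full text
  }
  // Per Page: join pages in page order (separator is an assumption)
  return [...output.pages]
    .sort((a, b) => a.pageNumber - b.pageNumber)
    .map((page) => page.text)
    .join("\n\n");
}

// Detailed output nests the page count under "metadata"; the others do not.
function getTotalPages(output: PdfOcrOutput): number {
  return "metadata" in output ? output.metadata.totalPages : output.totalPages;
}
```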

Workflow Examples

Extract Text from a PDF at a URL

  1. Manual Trigger - Start workflow manually
  2. HTTP Request - Download PDF from URL
  3. PDF OCR - Extract text from PDF
  4. Set - Process extracted text

Process PDF Files from Google Drive

  1. Google Drive Trigger - Monitor for new PDF files
  2. Google Drive - Download PDF file
  3. PDF OCR - Extract text with multi-language support
  4. Gmail - Email extracted text

Features

  • Free and Open Source - No API keys, subscriptions, or system dependencies required
  • Multi-language OCR - Supports 100+ languages
  • Multi-page Processing - Handles PDFs with multiple pages
  • Flexible Output - Choose from different output formats
  • Error Handling - Robust error handling with continue-on-fail support
  • Memory Efficient - Processes pages individually and cleans up temporary files

Troubleshooting

Common Issues

"totalPages: 0" with No Text Extracted

This usually indicates that PDF processing failed. Common causes:

  • Corrupted PDF file
  • PDF is password protected or has restrictions
  • PDF contains only images without a text layer (this alone should not cause the failure, since OCR works on the rendered page images)
  • Memory issues with very large PDFs
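
To keep an empty result from passing silently to later steps, a workflow can check the output before using it. The following is a rough sketch of such a check, assuming only the output shapes shown under Output Examples; the names `OcrResultLike` and `isEmptyOcrResult` are hypothetical and not part of the node.

```typescript
// Loose shape covering all three output formats (hypothetical type name).
interface OcrResultLike {
  text?: string;
  pages?: { pageNumber: number; text: string }[];
  totalPages?: number;
  metadata?: { totalPages: number };
}

// True when no pages were processed or every text field is blank.
function isEmptyOcrResult(result: OcrResultLike): boolean {
  // Detailed output keeps the page count under "metadata".
  const totalPages = result.metadata?.totalPages ?? result.totalPages ?? 0;
  if (totalPages === 0) {
    return true;
  }
  const texts = [result.text ?? "", ...(result.pages ?? []).map((p) => p.text)];
  return texts.every((t) => t.trim().length === 0);
}
```

A result flagged by this check is a cue to inspect the input file (corruption, password protection, size) rather than the OCR settings.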

Performance Issues

  • Reduce DPI scale for faster processing
  • Monitor memory usage for large PDFs
  • Consider processing PDFs in smaller batches

Memory Issues

  • For large PDFs, try reducing the DPI scale
  • Process PDFs one at a time instead of in batches
  • Restart n8n if memory usage becomes too high
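
A back-of-the-envelope estimate shows why lowering the DPI scale helps. The sketch below assumes the scale is applied as a linear render factor where scale 1 maps one PDF point (1/72 inch) to one pixel, and that each rendered page is held as an uncompressed RGBA bitmap (4 bytes per pixel). Both are assumptions about the rendering path, and `estimateRender` is a hypothetical helper, so treat the numbers as a lower bound on per-page memory rather than a measurement.

```typescript
interface RenderEstimate {
  widthPx: number;
  heightPx: number;
  rgbaBytes: number;
}

// Estimate the bitmap size of one rendered page.
// widthPt/heightPt are the page size in PDF points (1/72 inch).
function estimateRender(widthPt: number, heightPt: number, dpiScale: number): RenderEstimate {
  const widthPx = Math.ceil(widthPt * dpiScale);
  const heightPx = Math.ceil(heightPt * dpiScale);
  return { widthPx, heightPx, rgbaBytes: widthPx * heightPx * 4 };
}

// US Letter is 612 x 792 points.
const atScale2 = estimateRender(612, 792, 2); // 1224 x 1584 px, 7,755,264 bytes
const atScale1 = estimateRender(612, 792, 1); // 612 x 792 px, 1,938,816 bytes
```

Since both dimensions grow with the scale, the bitmap size grows with its square: going from scale 2 to scale 1 cuts the estimate to a quarter.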

Publishing

Want to publish this node to npm? See our comprehensive Publishing Guide for step-by-step instructions on how to:

  • Prepare your package for publication
  • Publish to npm registry
  • Install in n8n instances
  • Maintain and update your package
  • Follow security best practices

Resources

Discussion