Image to Text (Captioning)

Generates a textual description (caption) for an image using Transformers.js.

Overview

This node generates textual descriptions (captions) for images using pre-trained image captioning models run locally via Transformers.js. Images can be supplied either as a URL or as binary data from a previous node. It is useful when you want to automatically produce human-readable summaries of images, for example for accessibility, content tagging, or enriching metadata.

Practical examples include:

  • Automatically generating captions for images uploaded to a CMS.
  • Creating alt text for images to improve web accessibility.
  • Summarizing visual content in social media posts or marketing materials.
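Under the hood, the node drives a Transformers.js image-to-text pipeline. The following minimal standalone sketch shows the general pattern; the checkpoint id and option values are assumptions about how the node's options map to public models, not its exact internals:

import { pipeline } from '@huggingface/transformers';

// Load an image captioning pipeline. The node's "ViT-GPT2 Image Captioning"
// option presumably maps to a public checkpoint such as this one.
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');

// Caption an image by URL; max_new_tokens bounds the caption length.
const result = await captioner('https://example.com/photo.jpg', { max_new_tokens: 30 });
console.log(result[0].generated_text); // e.g. "A group of people standing around a table."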

Properties

  • Image Input: URL of the image to process, or the name of the binary property from a previous node that contains the image data (see the sketch after this list).
  • Model: The image captioning model to use. Options:
    - ViT-GPT2 Image Captioning (default)
    - BLIP Image Captioning (Base)
    - BLIP Image Captioning (Large)
    Note: some models may be gated and require authentication, which is not yet supported.
  • Output Caption Field: The name of the field in the output JSON where the generated caption text is stored.
  • Max New Tokens: Optional. The maximum number of new tokens (roughly words) to generate; controls the length of the caption.
  • Include Full Output: Whether to include the model's full raw output under a field named [Output Caption Field]_full. The raw output may contain multiple generated texts or additional details.
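When Image Input names a binary property instead of a URL, the stored bytes have to be converted into something the pipeline can decode. In the node itself the buffer would come from n8n's binary-data helpers (such as getBinaryDataBuffer); the self-contained sketch below shows the conversion step using Transformers.js's RawImage, with illustrative names:

import { pipeline, RawImage } from '@huggingface/transformers';

// Sketch: caption an image held as a Node.js Buffer, e.g. bytes read from a
// binary property set by a previous node.
async function captionBuffer(buffer) {
  const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');
  const image = await RawImage.fromBlob(new Blob([buffer]));
  const result = await captioner(image);
  return result[0].generated_text;
}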

Output

The node outputs an array of items, each containing a JSON object with the original input data plus:

  • A field named as specified by Output Caption Field containing the generated caption text.
  • Optionally, if Include Full Output is enabled, an additional field named [Output Caption Field]_full containing the full raw output object from the model, which may include multiple generated captions or other metadata.

No binary data is output by this node.

Example output JSON snippet:

{
  "caption": "A group of people standing around a table with food.",
  "caption_full": [
    {
      "generated_text": "A group of people standing around a table with food."
    }
  ]
}
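Downstream nodes can then reference the caption with a standard n8n expression, for example:

{{ $json.caption }}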

Dependencies

  • Uses the @huggingface/transformers package (Transformers.js) to load and run image captioning models locally.
  • Requires internet access on first run to download the selected model; afterwards the cached copy is used.
  • No external API keys or credentials are currently supported or required.
  • Runs inference on CPU with float32 precision.
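In Transformers.js, that CPU/float32 behaviour corresponds to pipeline options like the following (a sketch; whether the node hard-codes or exposes these options is an assumption):

import { pipeline } from '@huggingface/transformers';

// Pin CPU execution and full-precision (non-quantized) weights explicitly.
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning', {
  device: 'cpu',
  dtype: 'fp32',
});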

Troubleshooting

  • Model loading failures: If the node fails to load the selected model, the cause is usually a network problem or a model that is private/gated on Hugging Face; the error message will indicate which. Check your internet connection or choose a publicly available model.
  • Empty or invalid image input: The node requires a valid image URL or binary property name; empty or incorrect input causes an error.
  • Unexpected output format: If the model returns output that does not match the expected structure, the node throws an error. This can happen if the model changes or is incompatible.
  • Performance: Running large models on the CPU can be slow. Consider a smaller model or more powerful hardware if performance is critical.
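One way to make the first failure mode easier to diagnose is to separate model loading from inference, so gated-model and network errors surface with a clear message. A sketch (the error wrapping is illustrative, not the node's actual code):

import { pipeline } from '@huggingface/transformers';

const modelId = 'Xenova/vit-gpt2-image-captioning';

let captioner;
try {
  captioner = await pipeline('image-to-text', modelId);
} catch (err) {
  // Private/gated checkpoints and network problems typically fail here,
  // before any image is processed.
  const reason = err instanceof Error ? err.message : String(err);
  throw new Error(`Could not load model "${modelId}": ${reason}`);
}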

Links and References
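
  • Transformers.js documentation: https://huggingface.co/docs/transformers.js
  • Image-to-text models on the Hugging Face Hub: https://huggingface.co/models?pipeline_tag=image-to-text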
