ElevenLabs

Overview

The "Create Transcript" operation of the Speech resource in this node converts audio or video files into text transcripts. It is designed to transcribe spoken content from binary audio/video input, supporting features like speaker diarization, language detection, timestamp granularity, and tagging of audio events such as laughter or footsteps.

This node is useful wherever automated transcription is needed, such as:

  • Transcribing interviews, meetings, or podcasts for documentation.
  • Generating subtitles or captions for videos.
  • Analyzing audio content for keywords or sentiment.
  • Improving accessibility by providing text alternatives to audio.

For example, a user can upload an audio recording of a meeting and receive a detailed transcript with speaker labels and timestamps, facilitating easy review and note-taking.
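
To make the behavior concrete, here is a minimal TypeScript sketch of the equivalent direct call to the ElevenLabs speech-to-text endpoint. The endpoint path and the "scribe_v1" model ID come from this document; the multipart field names (file, model_id) and the xi-api-key header follow common ElevenLabs API usage but should be verified against the official API reference.

```typescript
import { readFile } from "node:fs/promises";

// Minimal sketch (Node 18+): transcribe a local audio file by calling the
// ElevenLabs speech-to-text endpoint directly. Field names other than
// model_id are assumptions; verify them against the API reference.
async function createTranscript(filePath: string, apiKey: string) {
  const audio = await readFile(filePath);

  const form = new FormData();
  form.append("file", new Blob([audio]), "recording.mp3"); // binary input
  form.append("model_id", "scribe_v1"); // the only model currently available

  const response = await fetch("https://api.elevenlabs.io/v1/speech-to-text", {
    method: "POST",
    headers: { "xi-api-key": apiKey }, // API key credential
    body: form,
  });
  if (!response.ok) {
    throw new Error(`Transcription failed: ${response.status} ${await response.text()}`);
  }
  return response.json(); // transcript text, timestamps, speaker labels, ...
}
```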

Properties

  • Binary Input Field: The name of the binary property that contains the audio or video file to transcribe. Required; it identifies which input data to process.
  • Transcript Model ID: The identifier of the transcription model to use. Currently only "scribe_v1" is available.
  • Language Code: The ISO-639-1 or ISO-639-3 language code of the audio. If left empty, the language is auto-detected.
  • Tag Audio Events: Whether to tag audio events such as (laughter) or (footsteps) in the transcription. Default: true.
  • Number of Speakers: The maximum number of speakers expected in the audio; setting this helps improve prediction accuracy. Accepts values from 1 to 32.
  • Timestamps Granularity: The level of detail for timestamps in the transcription: none, word, or character. Default: word.
  • Speaker Diarization: Whether to annotate which speaker is talking in the uploaded file. Default: false.
  • Enable Logging: Whether to log the transcription request. When false, the request runs in zero-retention mode and history features are unavailable. Default: true.
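
The sketch below illustrates how these properties could map onto multipart form fields of the underlying request. The snake_case field names (language_code, tag_audio_events, num_speakers, timestamps_granularity, diarize) are assumptions inferred from the property descriptions above, not confirmed API names.

```typescript
// Hypothetical property-to-form-field mapping; field names are assumptions.
interface TranscriptOptions {
  languageCode?: string;           // ISO-639-1/-3; omit for auto-detection
  tagAudioEvents?: boolean;        // default true
  numSpeakers?: number;            // 1 to 32
  timestampsGranularity?: "none" | "word" | "character"; // default "word"
  diarize?: boolean;               // default false
}

function appendTranscriptOptions(form: FormData, opts: TranscriptOptions): void {
  if (opts.languageCode) form.append("language_code", opts.languageCode);
  if (opts.tagAudioEvents !== undefined)
    form.append("tag_audio_events", String(opts.tagAudioEvents));
  if (opts.numSpeakers !== undefined) {
    if (opts.numSpeakers < 1 || opts.numSpeakers > 32)
      throw new Error("Number of Speakers must be between 1 and 32");
    form.append("num_speakers", String(opts.numSpeakers));
  }
  if (opts.timestampsGranularity)
    form.append("timestamps_granularity", opts.timestampsGranularity);
  if (opts.diarize !== undefined) form.append("diarize", String(opts.diarize));
}
```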

Output

The node outputs a JSON object containing the transcription results. The exact structure depends on the API response but typically includes:

  • The full transcribed text.
  • Optional speaker labels if diarization is enabled.
  • Timestamp information according to the selected granularity.
  • Tags for audio events if enabled.
  • Metadata about the transcription process.

Binary output is not documented for this operation; the transcription results are delivered in the JSON output described above.
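
For orientation, the response might be shaped roughly like the interface below. This is illustrative only; the exact field names and nesting are defined by the ElevenLabs API response.

```typescript
// Illustrative response shape only; the actual schema is defined by the API.
interface TranscriptResponse {
  language_code: string;   // detected or supplied language
  text: string;            // the full transcribed text
  words?: Array<{
    text: string;          // word or character, per timestamps granularity
    start: number;         // start time in seconds
    end: number;           // end time in seconds
    speaker_id?: string;   // present when speaker diarization is enabled
  }>;
}
```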

Dependencies

  • Requires an active API key credential for the ElevenLabs API.
  • The node sends requests to the ElevenLabs speech-to-text endpoint (/speech-to-text).
  • Proper configuration of the binary input field is necessary to provide the audio/video file data.
  • Network access to the ElevenLabs API service.
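
A quick way to verify the credential, endpoint reachability, and network access together is a small authenticated request before running transcriptions. The /v1/user route used here is an assumption of a lightweight authenticated endpoint; any authenticated GET serves the same purpose.

```typescript
// Check API key validity and network reachability in one call.
// The /v1/user path is an assumption; substitute any authenticated route.
async function checkElevenLabsAccess(apiKey: string): Promise<boolean> {
  const res = await fetch("https://api.elevenlabs.io/v1/user", {
    headers: { "xi-api-key": apiKey },
  });
  return res.ok; // false on a bad key (401/403) or missing permissions
}
```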

Troubleshooting

  • Missing or incorrect binary input: Ensure the binary input field name matches the actual binary property containing the audio/video file.
  • Unsupported audio format: Verify that the input file format is supported by the transcription service.
  • Invalid language code: Use valid ISO-639-1 or ISO-639-3 codes or leave blank for auto-detection.
  • API authentication errors: Confirm that the API key credential is correctly configured and has necessary permissions.
  • Exceeding speaker count limits: The Number of Speakers value must be between 1 and 32; values outside this range may cause errors.
  • Logging disabled: With logging off (zero-retention mode), history features are unavailable; enable logging if you need to retain transcription history.

Common error messages might include authentication failures, invalid input data errors, or quota exceeded notices. Resolving these usually involves checking credentials, input correctness, and API usage limits.
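
As a sketch of how those cases can be told apart in practice, the helper below maps common HTTP status codes onto the troubleshooting items above. The specific code-to-cause mapping is an assumption based on typical REST conventions, not documented ElevenLabs behavior.

```typescript
// Hedged mapping from HTTP status codes to the troubleshooting cases above.
async function explainFailure(res: Response): Promise<string> {
  if (res.status === 401 || res.status === 403)
    return "Authentication error: check the API key credential and its permissions.";
  if (res.status === 400 || res.status === 422)
    return "Invalid input: check the binary field name, audio format, language code, and speaker count (1 to 32).";
  if (res.status === 429)
    return "Quota exceeded: review API usage limits or retry later.";
  return `Unexpected error ${res.status}: ${await res.text()}`;
}
```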
