Fast, high quality speech transcription with many available backends, word-level timestamps, speaker diarization, and translation capabilities.
INPUTS
file
Audio file
word_level_timestamps
Whether to return word-level timestamps. Defaults to True.
speaker_diarization
Whether to perform speaker diarization. Defaults to False.
speed_boost
Only applicable to the `stable-ts`, `whisper-timestamped`, and `whisperx` backends. When set to True, `stable-ts` uses whisper-v3-turbo, while `whisper-timestamped` and `whisperx` use the base model. Defaults to False.
backend
The model backend to use. Choose from "stable-ts", "whisper-timestamped", "whisperx", "whisper-zero", and "groq-whisper". See README for more information.
source_language
Language of the audio. Defaults to auto-detect if not specified. See README for supported language codes.
target_language
Language code to translate the transcript into (no translation if left blank). See README for supported language codes.
min_speakers
Minimum number of speakers to detect for diarization. Defaults to auto-detect when set to -1.
max_speakers
Maximum number of speakers to detect for diarization. Defaults to auto-detect when set to -1.
min_silence_length
Minimum length of silence in seconds to use for splitting audio for parallel processing. Defaults to 0.4.
min_segment_length
Minimum length of audio segment in seconds to use for splitting audio for parallel processing. If set to -1, we pick a value based on your settings.
chunks
Manually specifies the start and end times of each chunk when splitting audio for parallel processing. If set to "", silence detection is used to split the audio. Otherwise, provide one chunk per line as comma-separated start and end seconds, e.g. '0,10' and '10,20' on separate lines (see the example after this parameter list).
denoise_audio
Whether to apply denoising to the audio to get rid of background noise before transcription. Defaults to False.
use_vad
Whether to use Silero VAD for splitting audio into segments. Defaults to False. More accurate than ffmpeg silence detection.
use_pyannote_segmentation
Whether to use Pyannote segmentation for splitting audio into segments. Defaults to False.
vad_threshold
The threshold for VAD. Defaults to 0.2.
pyannote_segmentation_threshold
The threshold for Pyannote segmentation when merging segments, i.e. the gap in seconds between words below which segments are merged. Defaults to 0.8.
initial_prompt
A prompt to correct misspellings and style. Defaults to "".
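
As a reference for how these inputs fit together, here is a minimal sketch of a call from Python. It assumes the Sieve Python SDK (sievedata) is installed and authenticated, that sieve.function.get accepts this function's slug, and that inputs can be passed as keyword arguments matching the names above; the audio path is a placeholder.

```python
import sieve

# Placeholder local file; sieve.File is assumed to also accept a url= argument.
audio = sieve.File(path="meeting.mp3")

transcriber = sieve.function.get("sieve/speech_transcriber")

result = transcriber.run(
    file=audio,
    word_level_timestamps=True,
    speaker_diarization=True,
    backend="stable-ts",
    # Manual chunking: one "start,end" pair (in seconds) per line.
    # Leave as "" to fall back to silence detection.
    chunks="0,10\n10,20",
    min_speakers=-1,  # -1 = auto-detect
    max_speakers=-1,  # -1 = auto-detect
)

# The exact shape of the transcript output depends on the chosen settings.
print(result)
```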
README

Speech Transcription

This app can precisely transcribe audio data, with additional options for auto-translation.

IMPORTANT (August 30, 2024): This function will change significantly over the next three weeks. If you have integrated it into your production code, please pin the version by passing it as part of the function name to either the API or the SDK. An example function slug with a version specified is shown below.

sieve/speech_transcriber:86b4f1f
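
For instance, a hedged sketch of pinning the version from the Python SDK, assuming sieve.function.get accepts a versioned slug in the org/name:version form shown above:

```python
import sieve

# Pinning the version hash keeps production behavior stable
# even if the latest version of the function changes.
transcriber = sieve.function.get("sieve/speech_transcriber:86b4f1f")

audio = sieve.File(path="call.wav")  # placeholder path
print(transcriber.run(file=audio))
```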

For pricing, see the Pricing section below.

For detailed notes, see the Notes section below.

Key Features

  • Word-level Timestamps: Provides precise timestamps for each word in the transcript (not available in groq-whisper).
  • Speaker Diarization: Identifies and labels different speakers in the audio.
  • Speed Boost Option: Accelerates transcription with a slight trade-off in accuracy.
  • Model Backend Options: Choose from various backend models to balance cost, quality, and speed.
  • Auto Translation: Dynamically translates transcriptions into multiple languages.

Pricing

Note (August 30, 2024): The pricing will change in the coming weeks. You can check the price of your job via the usage table or via API.

Pricing for this function is compute based.

As an estimate, transcription with word-level timestamps enabled on the stable-ts backend costs us roughly $0.15 per hour of audio, extrapolated from this example.

Notes

Picking the right settings

  • backend Options:
    • whisperx: A fast transcription option.
    • stable-ts: Offers more accuracy, especially in timestamps.
    • whisper-timestamped: Similar to stable-ts, focuses on accurate timestamps.
    • whisper-zero: The highest quality option available, but also the slowest.
    • groq-whisper: The fastest and cheapest option available, optimized using Groq (costs ~$0.111 / hour of audio).
  • Enabling speaker_diarization returns speaker IDs for each word in the transcript. This is useful if you want to know who said what.
  • Enabling speed_boost uses smaller, faster models for the backends that support it. This is useful if you want results faster and don't mind sacrificing some accuracy (see the sketch below).
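
To illustrate these trade-offs, here is a sketch that submits the same audio to two backends, assuming the SDK's push method returns a future-like job with a result() method (as in the earlier examples, names outside the parameter list above are assumptions):

```python
import sieve

audio = sieve.File(path="interview.mp3")  # placeholder path
transcriber = sieve.function.get("sieve/speech_transcriber")

# Fast, cheap draft pass: groq-whisper does not support word-level timestamps.
draft_job = transcriber.push(file=audio, backend="groq-whisper",
                             word_level_timestamps=False)

# Higher-quality pass: slowest backend, with diarization to attribute speakers.
final_job = transcriber.push(file=audio, backend="whisper-zero",
                             speaker_diarization=True)

print(draft_job.result())  # blocks until each asynchronous job finishes
print(final_job.result())
```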

Languages

We support 99 languages in total. If you already know the language of the original audio, enter its language code in the source_language parameter; if you leave it blank, we will detect the language automatically. The full list of supported language codes is below (some codes appear more than once under alternate names).

  • en (English)
  • zh (Chinese)
  • de (German)
  • es (Spanish)
  • ru (Russian)
  • ko (Korean)
  • fr (French)
  • ja (Japanese)
  • pt (Portuguese)
  • tr (Turkish)
  • pl (Polish)
  • ca (Catalan)
  • nl (Dutch)
  • ar (Arabic)
  • sv (Swedish)
  • it (Italian)
  • id (Indonesian)
  • hi (Hindi)
  • fi (Finnish)
  • vi (Vietnamese)
  • he (Hebrew)
  • uk (Ukrainian)
  • el (Greek)
  • ms (Malay)
  • cs (Czech)
  • ro (Romanian)
  • da (Danish)
  • hu (Hungarian)
  • ta (Tamil)
  • no (Norwegian)
  • th (Thai)
  • ur (Urdu)
  • hr (Croatian)
  • bg (Bulgarian)
  • lt (Lithuanian)
  • la (Latin)
  • mi (Maori)
  • ml (Malayalam)
  • cy (Welsh)
  • sk (Slovak)
  • te (Telugu)
  • fa (Persian)
  • lv (Latvian)
  • bn (Bengali)
  • sr (Serbian)
  • az (Azerbaijani)
  • sl (Slovenian)
  • kn (Kannada)
  • et (Estonian)
  • mk (Macedonian)
  • br (Breton)
  • eu (Basque)
  • is (Icelandic)
  • hy (Armenian)
  • ne (Nepali)
  • mn (Mongolian)
  • bs (Bosnian)
  • kk (Kazakh)
  • sq (Albanian)
  • sw (Swahili)
  • gl (Galician)
  • mr (Marathi)
  • pa (Punjabi)
  • si (Sinhala)
  • km (Khmer)
  • sn (Shona)
  • yo (Yoruba)
  • so (Somali)
  • af (Afrikaans)
  • oc (Occitan)
  • ka (Georgian)
  • be (Belarusian)
  • tg (Tajik)
  • sd (Sindhi)
  • gu (Gujarati)
  • am (Amharic)
  • yi (Yiddish)
  • lo (Lao)
  • uz (Uzbek)
  • fo (Faroese)
  • ps (Pashto)
  • tk (Turkmen)
  • nn (Nynorsk)
  • mt (Maltese)
  • sa (Sanskrit)
  • lb (Luxembourgish)
  • my (Myanmar)
  • bo (Tibetan)
  • tl (Tagalog)
  • mg (Malagasy)
  • as (Assamese)
  • tt (Tatar)
  • haw (Hawaiian)
  • ln (Lingala)
  • ha (Hausa)
  • ba (Bashkir)
  • jw (Javanese)
  • su (Sundanese)
  • yue (Cantonese)
  • my (Burmese)
  • ca (Valencian)
  • nl (Flemish)
  • ht (Haitian)
  • lb (Letzeburgesch)
  • ps (Pushto)
  • pa (Panjabi)
  • ro (Moldavian)
  • si (Sinhalese)
  • es (Castilian)
  • zh (Mandarin)
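
For example, a minimal sketch of transcription plus translation using the same assumed SDK as above, with codes taken from the list: source_language for known-Spanish audio and target_language for an English transcript.

```python
import sieve

audio = sieve.File(path="podcast_es.mp3")  # placeholder path
transcriber = sieve.function.get("sieve/speech_transcriber")

result = transcriber.run(
    file=audio,
    source_language="es",  # skip auto-detection when the language is known
    target_language="en",  # leave blank ("") to disable translation
)
print(result)
```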