
Installing and Using whisperX

Download and install whisperX

```sh
git clone https://github.com/m-bain/whisperX.git
cd whisperX
pip install -r requirements.txt
pip install -e .
```

Then make sure ffmpeg is installed.
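
whisperX decodes audio through ffmpeg, so it must be available on your PATH. A minimal sketch for common platforms (package names assumed to match your package manager):

```sh
# Debian/Ubuntu
sudo apt-get install ffmpeg

# macOS (Homebrew)
brew install ffmpeg

# verify it is installed and on PATH
ffmpeg -version
```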

Usage

```sh
whisperx --model large-v2 --language zh --output_format srt --chunk_size 5 --initial_prompt "以下是中文普通话语句。" "BV1hZ421g7xC.wav"
```

Where:

  • --model: the model to use
  • --language: explicitly specify the language of the audio
    • zh means Chinese
  • --output_format: the output file format
    • commonly used formats: srt, json, txt
  • --chunk_size: the segment length, in seconds, used when splitting the audio
    • a good value for Chinese speech: 5
  • --initial_prompt: a prompt for the first window, used to constrain the style of the model's output
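
Building on the command above, here is a hypothetical sketch combining a few more options documented in `whisperx --help` (listed below); the output directory is an assumption:

```sh
# Hypothetical: write every supported output format into ./subs;
# int8 compute and a smaller batch size reduce GPU memory use.
whisperx --model large-v2 --language zh \
  --output_format all --output_dir ./subs \
  --compute_type int8 --batch_size 4 \
  --chunk_size 5 --initial_prompt "以下是中文普通话语句。" \
  "BV1hZ421g7xC.wav"
```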


Command-Line Options

```sh
whisperx --help
```

```txt
usage: whisperx [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE]
                [--device_index DEVICE_INDEX] [--batch_size BATCH_SIZE]
                [--compute_type {float16,float32,int8}]
                [--output_dir OUTPUT_DIR]
                [--output_format {all,srt,vtt,txt,tsv,json,aud}]
                [--verbose VERBOSE] [--task {transcribe,translate}]
                [--language {...,en,ja,zh,Chinese,English,Japanese,...}]
                [--align_model ALIGN_MODEL]
                [--interpolate_method {nearest,linear,ignore}] [--no_align]
                [--return_char_alignments] [--vad_onset VAD_ONSET]
                [--vad_offset VAD_OFFSET] [--chunk_size CHUNK_SIZE]
                [--diarize] [--min_speakers MIN_SPEAKERS]
                [--max_speakers MAX_SPEAKERS] [--temperature TEMPERATURE]
                [--best_of BEST_OF] [--beam_size BEAM_SIZE]
                [--patience PATIENCE] [--length_penalty LENGTH_PENALTY]
                [--suppress_tokens SUPPRESS_TOKENS] [--suppress_numerals]
                [--initial_prompt INITIAL_PROMPT]
                [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT]
                [--fp16 FP16]
                [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
                [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
                [--logprob_threshold LOGPROB_THRESHOLD]
                [--no_speech_threshold NO_SPEECH_THRESHOLD]
                [--max_line_width MAX_LINE_WIDTH]
                [--max_line_count MAX_LINE_COUNT]
                [--highlight_words HIGHLIGHT_WORDS]
                [--segment_resolution {sentence,chunk}] [--threads THREADS]
                [--hf_token HF_TOKEN] [--print_progress PRINT_PROGRESS]
                audio [audio ...]

positional arguments:
  audio                 audio file(s) to transcribe

options:
  -h, --help            show this help message and exit
  --model MODEL         name of the Whisper model to use (default: small)
  --model_dir MODEL_DIR
                        the path to save model files; uses ~/.cache/whisper by
                        default (default: None)
  --device DEVICE       device to use for PyTorch inference (default: cuda)
  --device_index DEVICE_INDEX
                        device index to use for FasterWhisper inference
                        (default: 0)
  --batch_size BATCH_SIZE
                        the preferred batch size for inference (default: 8)
  --compute_type {float16,float32,int8}
                        compute type for computation (default: float16)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        directory to save the outputs (default: .)
  --output_format {all,srt,vtt,txt,tsv,json,aud}, -f {all,srt,vtt,txt,tsv,json,aud}
                        format of the output file; if not specified, all
                        available formats will be produced (default: all)
  --verbose VERBOSE     whether to print out the progress and debug messages
                        (default: True)
  --task {transcribe,translate}
                        whether to perform X->X speech recognition
                        ('transcribe') or X->English translation ('translate')
                        (default: transcribe)
  --language {...,en,ja,zh,Chinese,English,Japanese,...}
                        language spoken in the audio, specify None to perform
                        language detection (default: None)
  --align_model ALIGN_MODEL
                        Name of phoneme-level ASR model to do alignment
                        (default: None)
  --interpolate_method {nearest,linear,ignore}
                        For word .srt, the method used to assign timestamps to
                        non-aligned words, or to merge them into neighbouring
                        words. (default: nearest)
  --no_align            Do not perform phoneme alignment (default: False)
  --return_char_alignments
                        Return character-level alignments in the output json
                        file (default: False)
  --vad_onset VAD_ONSET
                        Onset threshold for VAD (see pyannote.audio), reduce
                        this if speech is not being detected (default: 0.5)
  --vad_offset VAD_OFFSET
                        Offset threshold for VAD (see pyannote.audio), reduce
                        this if speech is not being detected. (default: 0.363)
  --chunk_size CHUNK_SIZE
                        Chunk size for merging VAD segments. Default is 30,
                        reduce this if the chunk is too long. (default: 30)
  --diarize             Apply diarization to assign speaker labels to each
                        segment/word (default: False)
  --min_speakers MIN_SPEAKERS
                        Minimum number of speakers in the audio file (default:
                        None)
  --max_speakers MAX_SPEAKERS
                        Maximum number of speakers in the audio file (default:
                        None)
  --temperature TEMPERATURE
                        temperature to use for sampling (default: 0)
  --best_of BEST_OF     number of candidates when sampling with non-zero
                        temperature (default: 5)
  --beam_size BEAM_SIZE
                        number of beams in beam search, only applicable when
                        temperature is zero (default: 5)
  --patience PATIENCE   optional patience value to use in beam decoding, as in
                        https://arxiv.org/abs/2204.05424, the default (1.0) is
                        equivalent to conventional beam search (default: 1.0)
  --length_penalty LENGTH_PENALTY
                        optional token length penalty coefficient (alpha) as
                        in https://arxiv.org/abs/1609.08144, uses simple
                        length normalization by default (default: 1.0)
  --suppress_tokens SUPPRESS_TOKENS
                        comma-separated list of token ids to suppress during
                        sampling; '-1' will suppress most special characters
                        except common punctuations (default: -1)
  --suppress_numerals   whether to suppress numeric symbols and currency
                        symbols during sampling, since wav2vec2 cannot align
                        them correctly (default: False)
  --initial_prompt INITIAL_PROMPT
                        optional text to provide as a prompt for the first
                        window. (default: None)
  --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
                        if True, provide the previous output of the model as a
                        prompt for the next window; disabling may make the
                        text inconsistent across windows, but the model
                        becomes less prone to getting stuck in a failure loop
                        (default: False)
  --fp16 FP16           whether to perform inference in fp16; True by default
                        (default: True)
  --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
                        temperature to increase when falling back when the
                        decoding fails to meet either of the thresholds below
                        (default: 0.2)
  --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
                        if the gzip compression ratio is higher than this
                        value, treat the decoding as failed (default: 2.4)
  --logprob_threshold LOGPROB_THRESHOLD
                        if the average log probability is lower than this
                        value, treat the decoding as failed (default: -1.0)
  --no_speech_threshold NO_SPEECH_THRESHOLD
                        if the probability of the <|nospeech|> token is higher
                        than this value AND the decoding has failed due to
                        `logprob_threshold`, consider the segment as silence
                        (default: 0.6)
  --max_line_width MAX_LINE_WIDTH
                        (not possible with --no_align) the maximum number of
                        characters in a line before breaking the line
                        (default: None)
  --max_line_count MAX_LINE_COUNT
                        (not possible with --no_align) the maximum number of
                        lines in a segment (default: None)
  --highlight_words HIGHLIGHT_WORDS
                        (not possible with --no_align) underline each word as
                        it is spoken in srt and vtt (default: False)
  --segment_resolution {sentence,chunk}
                        (not possible with --no_align) whether to segment the
                        output at sentence or chunk resolution (default:
                        sentence)
  --threads THREADS     number of threads used by torch for CPU inference;
                        supersedes MKL_NUM_THREADS/OMP_NUM_THREADS (default:
                        0)
  --hf_token HF_TOKEN   Hugging Face Access Token to access PyAnnote gated
                        models (default: None)
  --print_progress PRINT_PROGRESS
                        if True, progress will be printed in transcribe() and
                        align() methods. (default: False)
```
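
Speaker diarization is driven by the --diarize, --min_speakers, --max_speakers, and --hf_token options above; the pyannote models are gated, so a Hugging Face access token is required. A hypothetical sketch (the HF_TOKEN variable and file name are assumptions):

```sh
# Label each segment/word with a speaker; requires a Hugging Face
# token with access to the gated pyannote models.
whisperx --model large-v2 --language zh \
  --diarize --min_speakers 1 --max_speakers 3 \
  --hf_token "$HF_TOKEN" \
  --output_format srt "interview.wav"
```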