Installing and using whisperX
Download and install whisperX
git clone https://github.com/m-bain/whisperX.git
cd whisperX
pip install -r requirements.txt
pip install -e .
Then make sure ffmpeg is installed.
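One way to check that ffmpeg is reachable before running whisperX is sketched below. This is a minimal sketch using only the Python standard library; the helper names `ffmpeg_available` and `ffmpeg_version_line` are hypothetical and are not part of whisperX.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def ffmpeg_version_line() -> Optional[str]:
    """Return the first line printed by `ffmpeg -version`, or None if ffmpeg is missing."""
    if not ffmpeg_available():
        return None
    result = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print("ffmpeg found" if ffmpeg_available() else "ffmpeg not found on PATH")
```

If the check reports that ffmpeg is missing, install it with your system's package manager and run the check again.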
Example: transcribe a Mandarin Chinese recording into SRT subtitles:
whisperx --model large-v2 --language zh --output_format srt --chunk_size 5 --initial_prompt "以下是中文普通话语句。" "BV1hZ421g7xC.wav"
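To run the same command over several files, the argument list can be assembled in Python. The following is a sketch only: `build_whisperx_command` and `commands_for_directory` are hypothetical helpers, and the defaults simply mirror the flags of the command above. It prints the command rather than executing it, so it runs even where whisperX is not installed.

```python
import shlex
from pathlib import Path


def build_whisperx_command(
    audio_path,
    model="large-v2",
    language="zh",
    output_format="srt",
    chunk_size=5,
    initial_prompt="以下是中文普通话语句。",
):
    """Build the argument list for one whisperx invocation (nothing is executed)."""
    return [
        "whisperx",
        "--model", model,
        "--language", language,
        "--output_format", output_format,
        "--chunk_size", str(chunk_size),
        "--initial_prompt", initial_prompt,
        str(audio_path),
    ]


def commands_for_directory(directory, pattern="*.wav"):
    """Build one command per matching audio file in a directory, in sorted order."""
    return [build_whisperx_command(p) for p in sorted(Path(directory).glob(pattern))]


cmd = build_whisperx_command("BV1hZ421g7xC.wav")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the Chinese prompt and with file names that contain spaces.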
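Options used in the command above:
- --model: the model to use. For the available models, see: https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages
- --output_format: the format of the output file. According to the help text below, the choices are all, srt, vtt, txt, tsv, json, and aud.
- --chunk_size: the segment length in seconds. The command above uses 5, a reasonably good value for Chinese speech.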
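The help text describes the chunk size as the "chunk size for merging VAD segments". The sketch below illustrates that idea with a simple greedy merge. It is a simplified illustration under my own assumptions, not whisperX's actual implementation, and `merge_segments` and the sample timings are hypothetical.

```python
def merge_segments(segments, chunk_size):
    """Greedily merge consecutive (start, end) speech segments.

    A segment joins the current chunk while the chunk would still span at most
    `chunk_size` seconds; otherwise a new chunk is started. A single segment
    longer than `chunk_size` is kept whole.
    """
    merged = []
    cur_start = cur_end = None
    for start, end in segments:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= chunk_size:
            cur_end = end
        else:
            merged.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        merged.append((cur_start, cur_end))
    return merged


# Hypothetical speech regions (in seconds) as a VAD might report them.
speech = [(0.0, 2.0), (2.5, 4.5), (5.0, 9.0), (9.5, 12.0), (12.5, 14.0)]
print(merge_segments(speech, chunk_size=5))   # [(0.0, 4.5), (5.0, 9.0), (9.5, 14.0)]
print(merge_segments(speech, chunk_size=30))  # [(0.0, 14.0)]
```

With the default of 30 the whole sample collapses into one long chunk, while 5 yields several short ones, which is usually what you want for subtitle lines.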
What is the most efficient version of OpenAI Whisper? : r/MachineLearning
m-bain/whisperX: WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Simplified Chinese rather than traditional? · openai/whisper · Discussion #277
More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (now in custom v3 branch, see comment for details) · Issue #159 · m-bain/whisperX
whisperx --help
usage: whisperx [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE]
[--device_index DEVICE_INDEX] [--batch_size BATCH_SIZE]
[--compute_type {float16,float32,int8}]
[--output_dir OUTPUT_DIR]
[--output_format {all,srt,vtt,txt,tsv,json,aud}]
[--verbose VERBOSE] [--task {transcribe,translate}]
[--language {...,en,ja,zh,Chinese,English,Japanese,...}]
[--align_model ALIGN_MODEL]
[--interpolate_method {nearest,linear,ignore}] [--no_align]
[--return_char_alignments] [--vad_onset VAD_ONSET]
[--vad_offset VAD_OFFSET] [--chunk_size CHUNK_SIZE]
[--diarize] [--min_speakers MIN_SPEAKERS]
[--max_speakers MAX_SPEAKERS] [--temperature TEMPERATURE]
[--best_of BEST_OF] [--beam_size BEAM_SIZE]
[--patience PATIENCE] [--length_penalty LENGTH_PENALTY]
[--suppress_tokens SUPPRESS_TOKENS] [--suppress_numerals]
[--initial_prompt INITIAL_PROMPT]
[--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT]
[--fp16 FP16]
[--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
[--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
[--logprob_threshold LOGPROB_THRESHOLD]
[--no_speech_threshold NO_SPEECH_THRESHOLD]
[--max_line_width MAX_LINE_WIDTH]
[--max_line_count MAX_LINE_COUNT]
[--highlight_words HIGHLIGHT_WORDS]
[--segment_resolution {sentence,chunk}] [--threads THREADS]
[--hf_token HF_TOKEN] [--print_progress PRINT_PROGRESS]
audio [audio ...]
positional arguments:
audio audio file(s) to transcribe
-h, --help show this help message and exit
--model MODEL name of the Whisper model to use (default: small)
--model_dir MODEL_DIR
the path to save model files; uses ~/.cache/whisper by
default (default: None)
--device DEVICE device to use for PyTorch inference (default: cuda)
--device_index DEVICE_INDEX
device index to use for FasterWhisper inference
(default: 0)
--batch_size BATCH_SIZE
the preferred batch size for inference (default: 8)
--compute_type {float16,float32,int8}
compute type for computation (default: float16)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
directory to save the outputs (default: .)
--output_format {all,srt,vtt,txt,tsv,json,aud}, -f {all,srt,vtt,txt,tsv,json,aud}
format of the output file; if not specified, all
available formats will be produced (default: all)
--verbose VERBOSE whether to print out the progress and debug messages
(default: True)
--task {transcribe,translate}
whether to perform X->X speech recognition
('transcribe') or X->English translation ('translate')
(default: transcribe)
--language {...,en,ja,zh,Chinese,English,Japanese,...}
language spoken in the audio, specify None to perform
language detection (default: None)
--align_model ALIGN_MODEL
Name of phoneme-level ASR model to do alignment
(default: None)
--interpolate_method {nearest,linear,ignore}
For word .srt, method to assign timestamps to non-
aligned words, or merge them into neighbouring.
(default: nearest)
--no_align Do not perform phoneme alignment (default: False)
--return_char_alignments
Return character-level alignments in the output json
file (default: False)
--vad_onset VAD_ONSET
Onset threshold for VAD (see pyannote.audio), reduce
this if speech is not being detected (default: 0.5)
--vad_offset VAD_OFFSET
Offset threshold for VAD (see pyannote.audio), reduce
this if speech is not being detected. (default: 0.363)
--chunk_size CHUNK_SIZE
Chunk size for merging VAD segments. Default is 30,
reduce this if the chunk is too long. (default: 30)
--diarize Apply diarization to assign speaker labels to each
segment/word (default: False)
--min_speakers MIN_SPEAKERS
Minimum number of speakers to in audio file (default:
--max_speakers MAX_SPEAKERS
Maximum number of speakers to in audio file (default:
--temperature TEMPERATURE
temperature to use for sampling (default: 0)
--best_of BEST_OF number of candidates when sampling with non-zero
temperature (default: 5)
--beam_size BEAM_SIZE
number of beams in beam search, only applicable when
temperature is zero (default: 5)
--patience PATIENCE optional patience value to use in beam decoding, as in
https://arxiv.org/abs/2204.05424, the default (1.0) is
equivalent to conventional beam search (default: 1.0)
--length_penalty LENGTH_PENALTY
optional token length penalty coefficient (alpha) as
in https://arxiv.org/abs/1609.08144, uses simple
length normalization by default (default: 1.0)
--suppress_tokens SUPPRESS_TOKENS
comma-separated list of token ids to suppress during
sampling; '-1' will suppress most special characters
except common punctuations (default: -1)
--suppress_numerals whether to suppress numeric symbols and currency
symbols during sampling, since wav2vec2 cannot align
them correctly (default: False)
--initial_prompt INITIAL_PROMPT
optional text to provide as a prompt for the first
window. (default: None)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
if True, provide the previous output of the model as a
prompt for the next window; disabling may make the
text inconsistent across windows, but the model
becomes less prone to getting stuck in a failure loop
(default: False)
--fp16 FP16 whether to perform inference in fp16; True by default
(default: True)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
temperature to increase when falling back when the
decoding fails to meet either of the thresholds below
(default: 0.2)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
if the gzip compression ratio is higher than this
value, treat the decoding as failed (default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD
if the average log probability is lower than this
value, treat the decoding as failed (default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD
if the probability of the <|nospeech|> token is higher
than this value AND the decoding has failed due to
`logprob_threshold`, consider the segment as silence
(default: 0.6)
--max_line_width MAX_LINE_WIDTH
(not possible with --no_align) the maximum number of
characters in a line before breaking the line
(default: None)
--max_line_count MAX_LINE_COUNT
(not possible with --no_align) the maximum number of
lines in a segment (default: None)
--highlight_words HIGHLIGHT_WORDS
(not possible with --no_align) underline each word as
it is spoken in srt and vtt (default: False)
--segment_resolution {sentence,chunk}
(not possible with --no_align) the maximum number of
characters in a line before breaking the line
(default: sentence)
--threads THREADS number of threads used by torch for CPU inference;
supercedes MKL_NUM_THREADS/OMP_NUM_THREADS (default:
--hf_token HF_TOKEN Hugging Face Access Token to access PyAnnote gated
models (default: None)
--print_progress PRINT_PROGRESS
if True, progress will be printed in transcribe() and
align() methods. (default: False)
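The interaction between --temperature_increment_on_fallback, --compression_ratio_threshold, and --logprob_threshold described in the help text can be sketched as a retry loop. This is a rough illustration of that description, not code taken from whisperX. The `decode` callback and `fake_decode` are hypothetical stand-ins for the model, the cap of 1.0 on the temperature is an assumption, and the ratio is computed with zlib, which uses the same DEFLATE compression as gzip.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw to compressed size; highly repetitive text gives a high ratio."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def temperature_schedule(start=0.0, increment=0.2, maximum=1.0):
    """Temperatures to try in order: start, start + increment, ... up to maximum."""
    temps = []
    t = start
    while t <= maximum + 1e-6:
        temps.append(round(t, 6))
        t += increment
    return temps


def decode_with_fallback(decode, temperatures,
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Retry at increasing temperatures until the output passes both checks.

    `decode(temperature)` must return (text, average_log_probability).
    Returns (text, average_log_probability, temperature_used).
    """
    result = None
    for t in temperatures:
        text, avg_logprob = decode(t)
        result = (text, avg_logprob, t)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_unlikely = avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # decoding accepted
    return result


def fake_decode(temperature):
    """Hypothetical decoder: stuck in a repetition loop at temperature 0."""
    if temperature == 0.0:
        return "哈" * 200, -0.3
    return "今天天气很好,我们去公园散步。", -0.4


text, avg_logprob, used = decode_with_fallback(fake_decode, temperature_schedule())
print(used, text)
```

In this toy run the first attempt is rejected because the repeated character compresses extremely well, and the second attempt at temperature 0.2 is accepted.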