Project link
Yesterday I installed huggingsound via pip following the official instructions, but the package requires older versions of pytorch, transformers, and huggingface-hub. So I restored my original pytorch/huggingface environment, git-cloned the huggingsound repository instead, and ran the following test code inside the project.
from huggingsound import SpeechRecognitionModel, KenshoLMDecoder

# model=SpeechRecognitionModel('jonatasgrosman/wav2vec2-large-xlsr-53-english')
# audio_paths=['/home/tellw/test/test-1.wav','/home/tellw/datasets/LibriSpeech/dev-clean/8842/304647/8842-304647-0013.flac']
# transcriptions=model.transcribe(audio_paths)
# print(transcriptions)

# if False:
#     lm_path='models/wav2vec2-large-xlsr-53-english/lm/lm.binary'
#     unigrams_path='models/wav2vec2-large-xlsr-53-english/lm/unigrams.txt'
#     decoder=KenshoLMDecoder(model.token_set,lm_path=lm_path,unigrams_path=unigrams_path)
#     transcriptions=model.transcribe(audio_paths,decoder=decoder)
#     print(transcriptions)

# references=[
#     {'path':'/home/tellw/test/test-1.wav','transcription':'yes'},
#     {'path':'/home/tellw/datasets/LibriSpeech/dev-clean/8842/304647/8842-304647-0013.flac','transcription':'THOU LIKE ARCTURUS STEADFAST IN THE SKIES WITH TARDY SENSE GUIDEST THY KINGDOM FAIR BEARING ALONE THE LOAD OF LIBERTY'}
# ]
# evaluation=model.evaluate(references)
# print(evaluation)

model=SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
audio_paths=['/mnt/sda/dataset/speech/aidatatang_200zh/corpus/train/G4429/T0055G4429S0237.wav']
transcriptions=model.transcribe(audio_paths)
print(transcriptions)
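In my runs, transcribe() returns one dict per audio file, with the decoded text under a 'transcription' key alongside timestamp lists. A minimal sketch of pulling out just the text, assuming that output shape (the sample dict below is illustrative, not real model output):

```python
# Assumed shape of huggingsound transcribe() output: one dict per audio file,
# decoded text under the 'transcription' key, plus per-character timestamps.
results = [
    {'transcription': 'yes', 'start_timestamps': [0], 'end_timestamps': [120]},
]

# Keep only the decoded text for each file
texts = [r['transcription'] for r in results]
print(texts)  # ['yes']
```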
Since the audio comes from a common public dataset, the model's accuracy is excellent. One issue: the evaluation interface reports a word error rate of 0 for the predictions. I'm not sure whether it is rounding or something else, but several words differ from the reference transcript, so I'd expect a WER of at least 0.1.
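To sanity-check the reported number, WER can be computed directly as the word-level Levenshtein distance divided by the reference length. A minimal self-contained sketch (not the huggingsound implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat"->"sit") plus one deletion ("the"): 2 edits / 6 words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # ≈ 0.333
```

If a transcript with visibly wrong words still scores 0 under this function, the discrepancy is in the normalization (casing, punctuation) rather than rounding.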
Following the instructions for jonatasgrosman/whisper-large-zh-cv11, I downloaded the ~6 GB model for inference (my NVIDIA card has only 6 GB of VRAM), but it crashes the machine, so I skipped that experiment.
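A back-of-the-envelope estimate suggests why this fails: assuming whisper-large's roughly 1.55 billion parameters (an approximate figure), the fp32 weights alone roughly fill a 6 GB card before any activations, while fp16 loading would leave headroom:

```python
# Rough VRAM estimate for whisper-large (assumed ~1.55e9 parameters)
params = 1.55e9
fp32_gb = params * 4 / 1e9  # 4 bytes per fp32 weight
fp16_gb = params * 2 / 1e9  # 2 bytes per fp16 weight

print(f"fp32 weights: {fp32_gb:.1f} GB")  # ~6.2 GB, already over a 6 GB card
print(f"fp16 weights: {fp16_gb:.1f} GB")  # ~3.1 GB, leaves room for activations
```

So loading the model in half precision (e.g. passing torch_dtype=torch.float16 when loading via transformers) might make this experiment feasible on the same card; I haven't verified that here.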
For using a model to recognize speech you record yourself, see "Building a local speech recognition program with whisper".
References:
jonatasgrosman/wav2vec2-large-xlsr-53-english jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn
Created 2023.2.19/22.27, modified 2023.2.19/22.27