Project repository

Yesterday I pip-installed huggingsound following the official instructions, but the package pins older versions of pytorch, transformers, and huggingface-hub. So I restored my original pytorch and huggingface environment, git-cloned the huggingsound repository instead, and ran the test code below inside the project directory.

from huggingsound import SpeechRecognitionModel, KenshoLMDecoder

# English model example (commented out):
# model = SpeechRecognitionModel('jonatasgrosman/wav2vec2-large-xlsr-53-english')
# audio_paths = ['/home/tellw/test/test-1.wav',
#                '/home/tellw/datasets/LibriSpeech/dev-clean/8842/304647/8842-304647-0013.flac']
# transcriptions = model.transcribe(audio_paths)
# print(transcriptions)

# Optional: decode with a KenLM language model:
# lm_path = 'models/wav2vec2-large-xlsr-53-english/lm/lm.binary'
# unigrams_path = 'models/wav2vec2-large-xlsr-53-english/lm/unigrams.txt'
# decoder = KenshoLMDecoder(model.token_set, lm_path=lm_path, unigrams_path=unigrams_path)
# transcriptions = model.transcribe(audio_paths, decoder=decoder)
# print(transcriptions)

# Evaluate error rates against reference transcriptions:
# references = [
#     {'path': '/home/tellw/test/test-1.wav', 'transcription': 'yes'},
#     {'path': '/home/tellw/datasets/LibriSpeech/dev-clean/8842/304647/8842-304647-0013.flac',
#      'transcription': 'THOU LIKE ARCTURUS STEADFAST IN THE SKIES WITH TARDY SENSE GUIDEST THY KINGDOM FAIR BEARING ALONE THE LOAD OF LIBERTY'},
# ]
# evaluation = model.evaluate(references)
# print(evaluation)

model = SpeechRecognitionModel('jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn')
audio_paths = ['/mnt/sda/dataset/speech/aidatatang_200zh/corpus/train/G4429/T0055G4429S0237.wav']

transcriptions = model.transcribe(audio_paths)
print(transcriptions)
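For reference, `transcribe()` returns a list with one dict per input file. A minimal sketch of unpacking it; the key names follow the huggingsound README, but the sample values below are made up for illustration:

```python
# Fabricated example of what transcribe() returns, for illustration only:
transcriptions = [
    {
        'transcription': '今天天气很好',
        'start_timestamps': [20, 100, 180, 260, 340, 420],    # ms per character
        'end_timestamps': [60, 140, 220, 300, 380, 460],
        'probabilities': [0.99, 0.98, 0.97, 0.99, 0.96, 0.98],
    }
]

# Pull out just the recognized text for each file:
for result in transcriptions:
    print(result['transcription'])
```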

Since the audio comes from a common public dataset, the model's accuracy is excellent. One puzzle: the evaluation interface reports a word error rate of 0. I am not sure whether that is rounding or something else, but several words clearly differ from the reference label, so the WER should be at least 0.1.
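To sanity-check a reported 0 WER, the metric can be recomputed by hand. Below is a minimal word-error-rate sketch using a plain word-level Levenshtein distance; the `wer` helper and the sample strings are mine, not part of huggingsound:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of four -> WER 0.25, not 0:
print(wer('thou like arcturus steadfast', 'thou like arcturus steadfastly'))  # 0.25
```

If huggingsound's `evaluate` prints 0 while this hand computation does not, the discrepancy is worth reporting upstream.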

Following the jonatasgrosman/whisper-large-zh-cv11 model card, I downloaded the roughly 6 GB model for inference (my NVIDIA card has 6 GB of VRAM), but the machine crashed, so I dropped that experiment.
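A back-of-the-envelope check makes the crash unsurprising. Assuming whisper-large has roughly 1.5 billion parameters (my approximation, not a figure from the model card), the weights alone nearly fill a 6 GB card in fp32, before any activations or decoding state are allocated:

```python
def model_weight_gb(n_params: float, bytes_per_param: int) -> float:
    """Rough size of the model weights alone, in GiB (activations add more on top)."""
    return n_params * bytes_per_param / 1024**3

n_params = 1.55e9  # assumed approximate parameter count for whisper-large
print(f"fp32: {model_weight_gb(n_params, 4):.1f} GiB")  # ~5.8 GiB
print(f"fp16: {model_weight_gb(n_params, 2):.1f} GiB")  # ~2.9 GiB
```

Loading the model in half precision would roughly halve the weight footprint, which might make a 6 GB card workable.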

For using a model to recognize your own recorded speech, see "Building a local speech recognition program with whisper".

Reference links:

jonatasgrosman/wav2vec2-large-xlsr-53-english

jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn

Created 2023.2.19/22.27, last modified 2023.2.19/22.27