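Overview

HuBERT first runs k-means clustering over the features to assign each frame a pseudo-target label. It then follows BERT and does masked prediction, with the predictions compared against those pseudo-targets.

Because the only goal here is to get the program running, this article uses the same data for the training, validation, and test sets: train.tsv and valid.tsv have identical content, and so do train.ltr and valid.ltr.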

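Procedure

Feature extraction and clustering

Run the following steps in the fairseq/examples/hubert/simple_kmeans directory.

File list

The tsv file stores the audio file paths. The first line is the root directory of the audio files, and each following line holds one audio file's relative path and its number of samples, separated by a tab.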
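To make the layout concrete, here is a minimal sketch of reading such a manifest back. The function name parse_manifest and the example paths are made up for illustration; only the layout (root directory on the first line, then tab-separated path and sample count) comes from the description above.

```python
def parse_manifest(text):
    """Split tsv manifest text into (root, [(relative_path, n_samples), ...])."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    root = lines[0]
    entries = []
    for ln in lines[1:]:
        rel_path, n_samples = ln.split("\t")
        entries.append((rel_path, int(n_samples)))
    return root, entries


# Hypothetical manifest content, for illustration only.
example = (
    "/data/audio\n"
    "84/121123/84-121123-0000.flac\t32000\n"
    "84/121123/84-121123-0001.flac\t48000\n"
)
root, entries = parse_manifest(example)
print(root, len(entries), sum(n for _, n in entries))
```

Summing the sample counts like this is a quick sanity check that the manifest was generated without empty or malformed lines.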

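# genTsv.py: print a tsv manifest (root directory, then "relative path<TAB>number of samples").
import os
import subprocess

# Root directory of the audio files; it becomes the first line of the tsv.
d='/home/tellw/datasets/LibriSpeech/dev-clean'

def exec_shell(cmd,ignore_err=False):
	# Run a shell command; return (stdout, stderr), or (-1000, message) on failure.
	process=subprocess.Popen(cmd,shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
	output,err=process.communicate()
	retcode=process.poll()
	if retcode==0 or ignore_err:
		return output,err
	else:
		return -1000,f'execute "{cmd}" failed'

print(d)
# Files that soxi cannot read are logged to 0212.txt; start from a clean log.
if os.path.exists('0212.txt'):
	os.remove('0212.txt')
for root,dirs,files in os.walk(d):
	for file in files:
		if file.endswith('.flac'):
			# Quote the path so that spaces in file names do not break the command.
			samNum,err=exec_shell(f'soxi -s "{root}/{file}"')
			if samNum!=-1000:
				print(f'{os.path.relpath(root,d)}/{file}\t{samNum.decode().strip()}')
			else:
				# One failed path per line.
				with open('0212.txt','a') as f:
					f.write(f'{root}/{file}\n')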

python genTsv.py > tsv/train.tsv

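The number of samples in an audio file can be obtained with soxi -s wav_file. For WAV files, wave.open('wav_file','rb').getnframes() also works (Python's wave module reads WAV only, not FLAC).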
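A minimal sketch of the wave-based alternative, using only the standard library. It writes one second of silence to a temporary WAV file so that it can run without any dataset; the helper name count_frames and the file name silence.wav are made up for this example.

```python
import os
import tempfile
import wave


def count_frames(path):
    """Return the number of sample frames in a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnframes()


# Write one second of 16 kHz, 16-bit mono silence to a temporary WAV file.
tmp_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(tmp_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

n_frames = count_frames(tmp_path)
print(n_frames)
```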

Extracting MFCC features

python dump_mfcc_feature.py tsv train 1 0 feat
This splits the audio file list in tsv/train.tsv into 1 shard, takes shard 0, computes its MFCC features, and saves them to the feat directory.
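The two numeric arguments (number of shards, then shard index) can be pictured as slicing the file list into contiguous chunks. The sketch below is my own illustration of that idea, not code copied from fairseq, and the helper name shard_range is hypothetical; the real script may round the boundaries differently.

```python
def shard_range(total, nshard, rank):
    """Return the [start, end) slice of a list of `total` items owned by shard `rank`."""
    if not 0 <= rank < nshard:
        raise ValueError(f"rank must be in [0, {nshard}), got {rank}")
    start = round(total / nshard * rank)
    end = round(total / nshard * (rank + 1))
    return start, end


files = [f"utt{i}.flac" for i in range(10)]  # hypothetical file names

# nshard=1, rank=0 (the setting used in this article): the single shard covers everything.
start, end = shard_range(len(files), 1, 0)
print(files[start:end] == files)

# With 2 shards, each shard gets one contiguous half.
print(shard_range(len(files), 2, 0), shard_range(len(files), 2, 1))
```

With more than one shard, each shard can be processed by a separate job, which is why the later steps also take a shard count and a shard index.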

K-means clustering

python learn_kmeans.py feat train 1 models/kmeans.clu 100 --percent 0.1
This samples 10% of the train-subset features in the feat directory, fits 100 clusters, and saves the model to models/kmeans.clu.
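For readers who want to see what "sample a fraction of the features, then fit k clusters" means, here is a toy pure-Python k-means (Lloyd's algorithm). It is only a sketch for intuition: every name in it is hypothetical, the data is made up, and it is not how learn_kmeans.py is implemented.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda k: squared_distance(point, centroids[k]))


def fit_kmeans(features, n_clusters, percent=1.0, n_iter=20, seed=0):
    """Fit centroids on a random `percent` fraction of the feature vectors."""
    rng = random.Random(seed)
    n_keep = max(n_clusters, int(len(features) * percent))
    sample = rng.sample(features, n_keep)
    centroids = rng.sample(sample, n_clusters)  # initialise from data points
    for _ in range(n_iter):
        groups = [[] for _ in range(n_clusters)]
        for p in sample:
            groups[nearest(p, centroids)].append(p)
        for k, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster went empty
                centroids[k] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Two well-separated toy blobs standing in for real MFCC frames.
blob_a = [[i * 0.1, i * 0.1] for i in range(10)]
blob_b = [[10 + i * 0.1, 10 + i * 0.1] for i in range(10)]
features = blob_a + blob_b

centroids = fit_kmeans(features, n_clusters=2, percent=0.5)
print(centroids)
```

Subsampling matters on real data because the number of frames is far larger than the number of utterances, so fitting on every frame can be slow and memory-hungry.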

Applying K-means

python dump_km_label.py feat train models/kmeans.clu 1 0 lab
This assigns every feature frame in shard 0 of the train subset to its cluster and saves the resulting labels to the lab directory.
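The assignment step itself is just a nearest-centroid lookup per frame. The sketch below shows it on made-up centroids and frames; writing one line of space-separated cluster IDs per utterance is my assumption about the label file layout, so compare it against a real file in lab before relying on it.

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid closest to one feature frame (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((x - c) ** 2 for x, c in zip(frame, centroids[k])),
    )


def label_line(frames, centroids):
    """One utterance becomes one line of space-separated cluster IDs (assumed layout)."""
    return " ".join(str(nearest_centroid(f, centroids)) for f in frames)


# Hypothetical 2-D centroids and frames; real MFCC frames have more dimensions.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
utterance = [[0.2, -0.1], [0.1, 0.3], [9.5, 10.2], [0.4, 9.0]]

line = label_line(utterance, centroids)
print(line)
```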

When there are multiple shards, their results need to be merged into a single label file.
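A sketch of that merge step in Python. It assumes each shard was written as <split>_<rank>_<nshard>.km and that the merged file should be <split>.km; those names are my understanding of the fairseq README rather than something verified here, so check the actual file names in your lab directory. The demo uses a temporary directory with made-up labels.

```python
import os
import tempfile


def merge_shards(lab_dir, split, nshard):
    """Concatenate per-shard label files, in rank order, into <split>.km."""
    merged_path = os.path.join(lab_dir, f"{split}.km")
    with open(merged_path, "w") as out:
        for rank in range(nshard):
            # Assumed shard file name: <split>_<rank>_<nshard>.km
            shard_path = os.path.join(lab_dir, f"{split}_{rank}_{nshard}.km")
            with open(shard_path) as shard:
                for line in shard:
                    out.write(line if line.endswith("\n") else line + "\n")
    return merged_path


# Demo with two made-up shard files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for rank, text in enumerate(["0 0 1\n", "2 2 3\n"]):
    with open(os.path.join(demo_dir, f"train_{rank}_2.km"), "w") as f:
        f.write(text)

merged = merge_shards(demo_dir, "train", 2)
with open(merged) as f:
    merged_text = f.read()
print(merged_text)
```

The rank order matters: the merged label file has to list utterances in the same order as the tsv manifest.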

Creating a dummy dictionary

n_clusters=100
lab_dir=lab

for x in $(seq 0 $((n_clusters - 1))); do
  echo "$x 1"
done >> $lab_dir/dict.km.txt

Each line of the dictionary pairs a cluster ID with a count of 1; the count is only a placeholder.
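The same dictionary content can be produced in Python, which makes the format easy to inspect. This is a sketch equivalent to the shell loop; the function name dummy_dict_lines is made up, and the snippet only builds the text without writing any file.

```python
def dummy_dict_lines(n_clusters):
    """One '<cluster id> 1' line per cluster, matching the shell loop's output."""
    return [f"{x} 1" for x in range(n_clusters)]


lines = dummy_dict_lines(100)
text = "\n".join(lines) + "\n"  # this is what would go into lab/dict.km.txt
print(lines[0], "|", lines[-1], "|", len(lines))
```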

Pre-training the HuBERT model

python fairseq_cli/hydra_train.py --config-dir /home/tellw/fairseq/examples/hubert/config/pretrain --config-name hubert_base_librispeech task.data=/home/tellw/fairseq/examples/hubert/simple_kmeans/tsv task.label_dir=/home/tellw/fairseq/examples/hubert/simple_kmeans/lab task.labels='["km"]' model.label_rate=100

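Run this from the fairseq directory. The k-means labels were computed on MFCC features of the speech data, which are extracted at 100 frames per second, so the model's label_rate is 100.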
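The relation between label_rate, the sample rate, and the length of each label line can be checked with simple arithmetic. The sketch below assumes 16 kHz audio (the LibriSpeech sample rate) and ignores windowing effects at the edges of an utterance, so the real count may differ by a frame or two; the function name is hypothetical.

```python
def expected_label_count(n_samples, sample_rate=16000, label_rate=100):
    """Approximate number of labels for an utterance with `n_samples` samples."""
    return n_samples * label_rate // sample_rate


# A 2-second utterance at 16 kHz should have roughly 200 labels at label_rate=100.
print(expected_label_count(32000))
```

If the label lines in the .km file are far from this estimate for the sample counts listed in the tsv, the label_rate setting and the features used for clustering probably do not match.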

Fine-tuning the HuBERT model

First generate the ltr file:

python examples/wav2vec/libri_labels.py examples/hubert/simple_kmeans/tsv --output-dir examples/hubert/simple_kmeans/ltr --output-name train

This generates train.ltr and train.wrd in examples/hubert/simple_kmeans/ltr.
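As far as I understand the output format, each line of the .wrd file is the plain word transcript, and each line of the .ltr file is the same transcript spelled out letter by letter, with | marking word boundaries. The sketch below reproduces that layout on a made-up transcript; the function name to_ltr is hypothetical, and the format should be checked against a generated train.ltr.

```python
def to_ltr(transcript):
    """Turn 'HELLO WORLD' into 'H E L L O | W O R L D |' (assumed ltr layout)."""
    return " ".join(transcript.replace(" ", "|")) + " |"


wrd_line = "HELLO WORLD"  # made-up transcript
ltr_line = to_ltr(wrd_line)
print(ltr_line)
```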

python fairseq_cli/hydra_train.py --config-dir /home/tellw/fairseq/examples/hubert/config/finetune --config-name base_10h task.data=/home/tellw/fairseq/examples/hubert/simple_kmeans/tsv task.label_dir=/home/tellw/fairseq/examples/hubert/simple_kmeans/ltr model.w2v_path=/home/tellw/fairseq/None/checkpoints/checkpoint_best.pt
The .pt file is the model checkpoint produced by pre-training.

Viterbi decoding with the HuBERT model

python examples/speech_recognition/new/infer.py --config-dir /home/tellw/fairseq/examples/hubert/config/decode --config-name infer_viterbi task.data=/home/tellw/fairseq/examples/hubert/simple_kmeans/tsv task.normalize=true decoding.exp_dir=/home/tellw/fairseq/examples/hubert/decode_result/viterbi common_eval.path=/home/tellw/fairseq/None/hubert_ff_02132251/checkpoint_last.pt dataset.gen_subset=test

The .pt file is the model checkpoint produced by fine-tuning.
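For a CTC model without a language model, the best path can be read off frame by frame: take the highest-scoring symbol in each frame, collapse consecutive repeats, and drop the blank symbol. The sketch below shows that idea on a made-up four-symbol vocabulary; the symbol table, blank index, and scores are assumptions for illustration, and fairseq's decoder may differ in its details.

```python
def greedy_ctc_decode(frame_scores, symbols, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(symbols[idx])
        prev = idx
    # '|' is assumed to mark word boundaries, as in the ltr files.
    return "".join(out).replace("|", " ").strip()


symbols = ["<blank>", "|", "H", "I"]  # hypothetical vocabulary
frame_scores = [
    [0.1, 0.1, 0.7, 0.1],    # H
    [0.1, 0.1, 0.6, 0.2],    # H (repeat, collapsed)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.1, 0.7],    # I
    [0.1, 0.1, 0.1, 0.7],    # I (repeat, collapsed)
    [0.1, 0.7, 0.1, 0.1],    # | (word boundary)
]
decoded = greedy_ctc_decode(frame_scores, symbols)
print(decoded)
```

The blank between two identical letters is what allows doubled letters to survive the collapse step.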

Decoding the HuBERT model with other language models

I could not find enough reference material for this, so I did not do it.

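Configuration files for pre-training, fine-tuning, and decoding

These configuration files are under examples/hubert/config/xxx/ (where xxx is decode, finetune, or pretrain). In the $config-name.yaml file you can set, among other things, the number of training epochs and the path where the model is saved.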
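Besides editing the yaml file, the same settings can be passed as key=value overrides on the command line, the way task.data and task.label_dir are passed in the commands of this article. The sketch below only assembles such a command as a list of strings. The keys optimization.max_update and checkpoint.save_dir are the ones I believe fairseq uses for the update budget and the checkpoint directory, but treat them as assumptions and confirm them in the yaml file; the values and the helper name are made up.

```python
import shlex


def build_train_command(config_dir, config_name, overrides):
    """Assemble a hydra_train.py command line with key=value overrides."""
    cmd = [
        "python", "fairseq_cli/hydra_train.py",
        "--config-dir", config_dir,
        "--config-name", config_name,
    ]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


cmd = build_train_command(
    "examples/hubert/config/pretrain",
    "hubert_base_librispeech",
    {
        "optimization.max_update": 1000,           # assumed key, made-up value
        "checkpoint.save_dir": "checkpoints/run1",  # assumed key, made-up path
    },
)
print(" ".join(shlex.quote(part) for part in cmd))
```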

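Other notes

Some problems came up along the way, such as configuration bugs, and I have already resolved them. If they come up again, I will deal with them then.

References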

HuBERT: How to Apply BERT to Speech, Visually Explained

Created 2023.2.18/0.24, modified 2023.2.18/0.24

Running fairseq's HuBERT model
