Interspeech 2022: September 20 notes

I used to claim that LM rescoring is important. Now I think an external LM matters only for noisy speech where the content is not clear; for clean speech, greedy decoding is perfectly fine. The LM weight can also be changed online. To demonstrate the advantage of an LM you need genuinely noisy data (even LibriSpeech dev-other is not noisy enough). A paper in this direction from Google:

On Adaptive Weight Interpolation of the Hybrid Autoregressive Transducer
https://www.isca-speech.org/archive/pdfs/interspeech_2022/variani22_interspeech.pdf
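
To make the "weight can change online" point concrete, here is a minimal sketch of shallow fusion with a per-utterance LM weight. This is generic illustration code, not the HAT interpolation from the paper above; `shallow_fusion_score`, `adaptive_lm_weight`, and the confidence-based schedule are all hypothetical.

```python
def shallow_fusion_score(asr_logprob: float, lm_logprob: float,
                         lm_weight: float) -> float:
    """Score one hypothesis by interpolating ASR and LM log-probabilities."""
    return asr_logprob + lm_weight * lm_logprob


def adaptive_lm_weight(asr_confidence: float, max_weight: float = 0.5) -> float:
    """Hypothetical schedule: lean on the LM only when the acoustic
    evidence is weak (noisy audio); for clean audio the score stays
    close to pure greedy/ASR decoding."""
    return max_weight * (1.0 - asr_confidence)


# Clean utterance: high ASR confidence -> tiny LM influence.
print(adaptive_lm_weight(0.95))  # 0.025
# Noisy utterance: low ASR confidence -> LM weight near its maximum.
print(adaptive_lm_weight(0.2))   # 0.4
```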



Systems are going to have more features besides ASR itself: echo cancellation, emotion recognition, punctuation prediction. There were many papers about this, for example:

End-to-end Speech-to-Punctuated-Text Recognition
https://www.isca-speech.org/archive/pdfs/interspeech_2022/nozaki22_interspeech.pdf
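
One way such end-to-end systems handle punctuation is to treat punctuation marks as ordinary output tokens, so the decoder predicts them jointly with the words. Below is a minimal sketch of building such training targets; the tokenizer is a hypothetical illustration, not the paper's actual pipeline.

```python
def to_training_targets(transcript: str) -> list[str]:
    """Split a punctuated transcript into output tokens, treating
    punctuation marks as ordinary vocabulary items so an end-to-end
    model predicts them jointly with the words."""
    tokens = []
    for word in transcript.split():
        if word and word[-1] in ",.?!":
            tokens.append(word[:-1])  # the word itself
            tokens.append(word[-1])   # trailing punctuation as its own token
        else:
            tokens.append(word)
    return tokens


print(to_training_targets("hello, how are you?"))
# ['hello', ',', 'how', 'are', 'you', '?']
```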

A Conformer-based Waveform-domain Neural Acoustic Echo Canceller (Google)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/panchapagesan22_interspeech.pdf



The whole conference is about RNN-T and medical applications. Interesting RNN-T papers:

Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition (Cambridge)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/sun22_interspeech.pdf

Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization (IBM)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/fasoli22_interspeech.pdf
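
For intuition, here is what plain symmetric 4-bit fake quantization of a weight tensor looks like. This is a generic sketch of the idea only; IBM's end-to-end scheme (which, per the title, also covers inference and LM fusion) is more involved.

```python
import numpy as np

def fake_quantize_4bit(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor 4-bit fake quantization: round weights to
    integer levels in [-7, 7], then map them back to float."""
    scale = np.abs(w).max() / 7.0
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -7, 7)
    return (q * scale).astype(w.dtype)


w = np.random.randn(4, 4).astype(np.float32)
print(np.abs(w - fake_quantize_4bit(w)).max())  # worst-case rounding error
```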

On the Prediction Network Architecture in RNN-T for ASR (Nuance)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/albesano22_interspeech.pdf
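
One question in this space is how much the prediction network really needs to be an LSTM. A minimal stateless alternative, where the "state" is just an embedding of the last few non-blank labels, looks roughly like the sketch below; the class is illustrative, not the paper's exact model.

```python
import torch
import torch.nn as nn

class StatelessPredictor(nn.Module):
    """Minimal stateless RNN-T prediction network: instead of a
    recurrent state, embed the last `context` output labels and
    project them to the joiner dimension."""

    def __init__(self, vocab_size: int, embed_dim: int, context: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(context * embed_dim, embed_dim)

    def forward(self, last_tokens: torch.Tensor) -> torch.Tensor:
        # last_tokens: (batch, context) indices of the most recent labels
        e = self.embed(last_tokens)     # (batch, context, embed_dim)
        return self.proj(e.flatten(1))  # (batch, embed_dim)


pred = StatelessPredictor(vocab_size=1000, embed_dim=256)
out = pred(torch.randint(0, 1000, (8, 2)))
print(out.shape)  # torch.Size([8, 256])
```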

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition
https://www.isca-speech.org/archive/pdfs/interspeech_2022/shinohara22_interspeech.pdf
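
The general recipe in this line of work is to regularize the standard transducer loss with a term that penalizes late emissions. A hedged sketch of the objective family (the exact latency measure and weighting differ per paper):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{RNNT}}
\;+\; \lambda\,\mathbb{E}_{\pi \sim P(\pi \mid \mathbf{x}, \mathbf{y})}\!\left[\tau(\pi)\right]
```

where \pi ranges over alignment paths and \tau(\pi) measures how late the path emits its labels.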



You can train very good Conformers on a single GPU:
Efficient Training of Neural Transducer for Speech Recognition (RWTH)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/zhou22c_interspeech.pdf
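
Two standard ingredients that make single-GPU training of large transducers practical are mixed precision and gradient accumulation. A generic sketch under those assumptions, not the exact recipe from the RWTH paper; `model` is assumed to return the training loss directly.

```python
import torch

def train_epoch(model, loader, optimizer, accum_steps: int = 8):
    """Mixed-precision forward/backward passes plus gradient accumulation
    to emulate a large effective batch size on one device."""
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        with torch.cuda.amp.autocast():
            loss = model(batch) / accum_steps  # assumed: model returns the loss
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)   # unscale and apply the accumulated update
            scaler.update()
            optimizer.zero_grad()
```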



The technology is approaching its limits, so ensembles are coming back: this paper combines five deep sparse Conformer variants with 12, 16, 17, 50, and finally 100 encoder layers.
Deep Sparse Conformer for Speech Recognition (NVIDIA)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/wu22h_interspeech.pdf
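
As a minimal sketch, an ensemble of such variants can be combined at the output level by averaging per-frame log-probabilities and renormalizing. This is one common recipe, not necessarily the combination scheme used in the paper.

```python
import numpy as np

def ensemble_logprobs(per_model_logprobs: list[np.ndarray]) -> np.ndarray:
    """Average per-frame output log-probabilities over ensemble members,
    then renormalize so each frame is a proper distribution again."""
    avg = np.mean(np.stack(per_model_logprobs), axis=0)
    return avg - np.logaddexp.reduce(avg, axis=-1, keepdims=True)


# Two hypothetical models, 3 frames, 5 output symbols each.
a = np.log(np.random.dirichlet(np.ones(5), size=3))
b = np.log(np.random.dirichlet(np.ones(5), size=3))
print(np.exp(ensemble_logprobs([a, b])).sum(axis=-1))  # ~[1. 1. 1.]
```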

This post was created on 2022-09-21 10:42 and last modified on 2022-09-21 10:42.