I used to claim that LM rescoring is important. Now I think the LM matters only for noisy speech, where the content is unclear; for clean speech, greedy decoding is perfectly fine. The LM weight can even be adapted online. To demonstrate the advantage of an LM you need genuinely noisy data (LibriSpeech dev-other is not noisy enough). A paper in this direction from Google:
On Adaptive Weight Interpolation of the Hybrid Autoregressive Transducer
https://www.isca-speech.org/archive/pdfs/interspeech_2022/variani22_interspeech.pdf
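The idea can be sketched as plain shallow fusion with an interpolation weight driven by acoustic confidence. This is a toy illustration of the concept, not the HAT formulation from the paper; the confidence heuristic and the `max_weight` value are my own assumptions.

```python
def shallow_fusion_score(am_log_probs, lm_log_probs, lm_weight):
    """Per-token combined score: log P_am(y) + lambda * log P_lm(y)."""
    return [a + lm_weight * l for a, l in zip(am_log_probs, lm_log_probs)]

def adaptive_lm_weight(am_confidence, max_weight=0.5):
    """Toy heuristic: lean on the LM only when the acoustic model is unsure.
    am_confidence = mean top-1 AM probability over a recent window (0..1)."""
    return max_weight * (1.0 - am_confidence)

# Clean speech: confident AM -> weight near 0, decoding stays effectively greedy.
# Noisy speech: low confidence -> the LM contributes more.
```

With a confident acoustic model the weight collapses to zero and the decoder behaves greedily, which is exactly the clean-speech case above.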
Systems are going to have more features besides plain ASR: echo cancellation, emotion recognition, punctuation prediction. Many papers cover these:
End-to-end Speech-to-Punctuated-Text Recognition
https://www.isca-speech.org/archive/pdfs/interspeech_2022/nozaki22_interspeech.pdf
A Conformer-based Waveform-domain Neural Acoustic Echo Canceller (Google)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/panchapagesan22_interspeech.pdf
Much of the conference is about RNN-T and medical applications. Interesting RNN-T papers:
Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition (Cambridge)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/sun22_interspeech.pdf
Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization (IBM)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/fasoli22_interspeech.pdf
On the Prediction Network Architecture in RNN-T for ASR (Nuance)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/albesano22_interspeech.pdf
Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition
https://www.isca-speech.org/archive/pdfs/interspeech_2022/shinohara22_interspeech.pdf
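For intuition on the IBM 4-bit result above: quantization-aware training typically inserts a quantize-dequantize ("fake quant") op on weights during the forward pass. A minimal symmetric per-tensor version, which is my own simplification rather than IBM's exact scheme:

```python
import numpy as np

def fake_quantize_4bit(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor 4-bit quantize-dequantize.
    int4 symmetric grid: integers in [-7, 7] times a shared scale."""
    levels = 7                                    # 2**(4-1) - 1
    scale = np.max(np.abs(w)) / levels
    if scale == 0.0:
        return w.copy()
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale                              # back to float domain
```

Each tensor is mapped onto at most 15 distinct values, and the round-trip error is bounded by half the scale; real systems add per-channel scales and activation quantization on top of this.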
You can train very good Conformers on a single GPU:
Efficient Training of Neural Transducer for Speech Recognition (RWTH)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/zhou22c_interspeech.pdf
As the technology approaches its limit, ensembles come back: this paper ensembles five deep sparse Conformer variants with 12, 16, 17, 50, and finally 100 encoder layers.
Deep Sparse Conformer for Speech Recognition (NVIDIA)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/wu22h_interspeech.pdf
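As a sketch of what ensembling variants of different depths can look like at inference time: average the per-frame output probabilities of the member models. This is the generic recipe; the paper's own combination scheme may differ.

```python
import numpy as np

def ensemble_log_probs(per_model_log_probs, weights=None):
    """Weighted probability-space average of per-frame log-probs.
    Each element of per_model_log_probs has shape (T, vocab)."""
    stacked = np.stack(per_model_log_probs)       # (n_models, T, vocab)
    n = stacked.shape[0]
    if weights is None:
        weights = np.full(n, 1.0 / n)             # uniform by default
    probs = np.exp(stacked)                       # back to probability space
    mixed = np.tensordot(weights, probs, axes=1)  # weighted mean over models
    return np.log(mixed)
```

Averaging in probability space rather than log space keeps the result a valid distribution, so it drops straight into an existing greedy or beam decoder.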