A Segmental CRF Approach to Large Vocabulary Continuous Speech Recognition
1 Introduction

State-of-the-art speech recognition accuracy has improved significantly over the past few years, following the application of deep neural networks [1, 2]. Recently, it has been shown that, with both a neural network acoustic model and a neural network language model, an automatic speech recognizer can approach human-level accuracy on the Switchboard conversational speech recognition benchmark using around 2,000 hours of transcribed data [3]. While progress is mainly driven by well-engineered neural network architectures and large amounts of training data, the hidden Markov model (HMM), which has been the backbone of speech recognition for decades, still plays a central role. Though tremendously successful for the problem of speech recognition, the HMM-based pipeline factorizes the whole system into several components, and building these components sepa