Kaldi ASR: Research and Academic Users – Summary of the meeting
- Prepared by Desh Raj, Yiming Wang, Jonathan Chang, Paola Garcia
- September 17, 2020.
Introduction (Sanjeev Khudanpur)
The townhall was planned to discuss the future of Kaldi. Sanjeev Khudanpur gave an introduction to the evolution of the toolkit. Kaldi still remains one of the leading tools for researchers and small companies. Part of its success relies on the community contribution to maintain Kaldi up to date. Today, Kaldi is preparing major revisions such as pytorch-ification as the core neural components, automatic differentiation through the FST, and streamlining for data ingestion. The main questions to continue the project are what, how, and who. We will need the help of the community.
K2 (Dan Povey)
Dan Povey explained his vision of the next Kaldi. The overall plan is to have two main projects: K2 for the sequence modeling, Lhotse for the data preparation, but also to make sure it works with other platforms. The recipes’ structure will be decided later.
The current Kaldi is based on C++ and bash scripts. The neural network framework is not powerful enough for the new PyTorch and TensorFlow, and it is becoming complex. The Kaldi contributors started a prototype for building a python wrapper, but it was difficult to maintain, so it was abandoned. The idea, now, is to start from scratch.
Design Considerations
Kaldi new design will have separate packages for data preparation, training, etc, plus small and more maintainable projects. It will also work well with frameworks like PyTorch or Tensorflow. It will avoid the mismatch between training and inference codebases.
Goals
- Make FSA compatible with Pytorch and Tensorflow
- FSAs are differentiable, have fast GPU implementation, are compatible with PyTorch; “ragged tensors”.
- FSA support standard operations: pruned composition, determinization, n-best rescoring
- View CTC and LF-MMI as FSA
- Can use OpenFST but will eventually implement it natively
K2 implementation
K2 will be mostly C++/Cuda
The workhorse in K2 will be ``RaggedTensor’’: arbitrarily nesting vector<vector>
It will have dependencies on cub, DLPack, pybind11
Why not Pytorch or Tensorflow
The new Kaldi will not be PyTorch or TensorFlow because it is not possible to efficiently implement some algorithms like composition and determinization. TensorFlow RaggedTensor codebase is too complex and not integrable with PyTorch. Autograd can be used since it allows the algorithms to be pure FSAs, and it is implemented at Python level.
Code
The Kaldi contributors need to rethink algorithms for GPU implementation (because of loops). There will not be a lot of GPU specific code; most C++ lambdas are CPU-based, so that reusable code is accessible for more developers. Operations such as pruned composition, computing expectations are now on GPU; the determinization will come later.
Timeline
We plan to release the code Mid-to-late October. The initial release will be CTC and LF-MMI. By the end of the year, non-trivial uses such as multi-pass jointly optimized systems will be implemented. Later on, the contributors will focus on recipes (still undecided structure), modeling and Lhotse.
Lhotse (Piotr Zelasko)
Kaldi was designed with experts in mind, not easy-to-use. Lhotse proposes to make things accessible to a broader community.
Key aspects
- Python-centric: numpy, pandas, etc.
- Easy-to-manipulate data structures
- Data prep for standard speech corpora; e.g. SWBD, AMI, mini-Librispeech
- For now PyTorch but later TensorFlow
- Main idea: cuts (arbitrary segments), supervision (similar to segments - use of metadata).
Panel Discussion: Kaldi in education and research (Moderated by Jan “Yenda” Trmal
- ASR is a difficult subject - Yenda
- Kaldi is hard for students/beginners (experts know how to use it) - John Hansen, Eric Fosler, Hermann Ney, Peter Bell, Thomas Hain
- Kaldi is trying to be too many things: be cutting edge as well as being accessible as well as commercial - Eric Fosler
- Kaldi is a research tool, there are other tools to teach the basics - Hermann Ney
- Multi-task training (or anything non-standard with neural networks) is complicated with Kaldi so people move to PyTorch - Peter Bell
- All recipes (SWBD, Librispeech, etc.) are data-oriented rather than algorithm oriented - Thomas Hain
- One of the successes of Kaldi is the recipes (clone and run recipes) - Lukas Burget
- Glad that the community is still continuing in this area and not just in E2E - Peter Bell
Research goals
- Analysis and debugging capability needed - Chin-Hui Lee
- Plug-in Interface with other tools - Chin-Hui Lee
- Visual Interfaces for every module- Chin Hui Lee
- For DNNs Reinhold Haeb-Umbach
- Go Beyond ASR, machine translation for example - Ahmed Ali
- Able to handle more languages and low resource languages - Sakriani Sakti
- Kaldi needs scaffolding, different people taking on different parts -Eric Fosler, SK, Yenda
- Train on infinite large datasets - Reinhold Haeb-Umbach
- Separate data preparation and modeling - Reinhold Haeb-Umbach
- Perform augmentation on the fly - Reinhold Haeb-Umbach
- Make it easy to back-propagate from the acoustic model to the front-end - Reinhold Haeb-Umbach
- Able to adapt models to speech changes (patient monitoring) - Emily Mover
- Handle under-represented speech from disease patterns - Emily Mover
- Consider in the design that not everybody has access to multiple GPU’s - Sakriani Sakti
- Accommodate self-supervised pre-training - Sakriani Sakti
- Able to do adaptation etc. (not straightforward in LF-MMI in Kaldi) - Peter Bell
- Make things more algorithm oriented, rather than data-oriented - Thomas Hain, Peter Bell, Eric Fosler
- Have more modular systems instead of huge complex systems - Emily Mover
- Downstream integration with things like chat bots - Eric Fosler
- Design based on algorithms in the literature (pointers to papers)- Lukas Burget
- Able to be interfaced with other projects -Lukas Burget
- Make things more explicit in recipes instead of code -Lukas Burget
- Sustainable science!! We can’t compete with some of the research groups with more resources. What we can do is still contribute to science by making a change in a meaningful way - Thomas Hain
Education goals
- Create problems for teaching signal processing or ML -Mark Hasegawa
- Easy install “pip install k2” -Mark Hasegawa
- Easy to teach students how to implement a hybrid ASR -Mark Hasegawa
- build ASR from scratch; download and easy install, e.g. in Google colab -Eric Fosler
- Create smaller examples integrable into classroom teaching- John Hansen, SK
- Provide generic examples for building recipes -John Hansen
- visualization tools particularly for teaching purposes - Reinhold Haeb-Umbach
- New people want deep neural nets to solve everything. Kaldi needs easy onboarding examples; requires some soul-searching (can’t be everything to everyone) - Alex Weibel
- Instantly available recipes, easy to use, e.g. digit recognition - John Hansen, Emily Mover
- Able to Build small footprint tasks, for low resource languages, keyword search and people from other fields - John Hansen, Peter Bell
- Build examples of manageable/small code:
- Fearless Steps (100h of speech): independent tasks within the corpus can be identified -John Hansen
- Common Voice (Mozilla): include a layer that interfaces with Common Voice (or other corpora) to easily get data from different languages - Ahmed Ali
- Think about how to get users from outside the small niche of speech researchers so that it can work as a black box speech recognizer ( ASR for dialog, psychologists,etc) - Alex Weibel, John Hansen
- Students need instant gratification and make things fun - Alex Weibel
- Example: Click on the webpage, you don’t have to prepare anything - Alex Weibel.
- Need to involve the students in the process. Actively ask the students to create pull requests - Julia
- Invite students as contributors - Yenda
- Students should be able to use it easily so that they can feel confident and learn - Emily Mover
Commitments/plans
- Planning to use pip to easy install - Dan Povey
- Hopefully, in a month we will have the basics running - Dan Povey
- Kaldi 2 will try to be cutting edge (research), accessible (education), commercial (companies) - Dan Povey
- Support multichannel- Dan Povey
- Data augmentation on the fly should be possible in Lhotse - Dan Povey
- Recipes are TBD - Dan Povey
- Current Kaldi will be maintained - Dan Povey
- Plan to have a similar amount of coverage and to have all of the important aspects of the recipes. - Dan Povey
- Lhotse might be organized by subject instead of by database - Dan Povey
- The hope is that we won’t even need context (or transition probabilities, trees) because some of these neural network things can deal with it - Dan Povey
- Reproducible results with recipes will happen - SK
- We hope that Kaldi 2 will have better performance for specific hardware, but it might take a while to implement - Dan Povey
- The main avenue to contribute to K2 is pull requests. We are open to contributions! - Dan Povey
- K2 will never support the “exact” same features as the previous version. We estimate to have the same performance in February or March 2021 - Dan Povey
- Generally aiming for top performance, flexibility and producibility - Dan Povey
- We are creating jupyter notebooks in Lhotse to demonstrate usage examples -Piotr Zelasko
Some other ideas
- Teachers can have in-class contests building a system using Kaldi that will contribute to the repository. -SK
- Probably Kaldi should decide on industry, research, teaching -Alex Weibel