At present, Research is focused on two subjects: language modeling and audio-visual information fusion.
Regarding the language model, most of the work is based on the assumption, generally accepted but usually neglected in practice, that an appropriate integration of syntactic and higher-level linguistic knowledge can effectively reduce the perplexity of the recognition task in many applications and with common development resources. The KRLM project represents the most direct effort motivated by this assumption. There, the Factored Language Model (FLM) is the basic statistical formalism used to support the linguistic knowledge: part of it is encoded manually, by handcrafting the main FLM structure, and the final structure is then established with the automatic structure-learning methods that have been investigated.
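To give a rough feel for the factored-language-model idea, the toy sketch below represents each token as a (word, POS) factor pair and estimates P(w | w_prev, pos_prev) with a fixed backoff path that drops the word parent before the POS parent. The class name, the stupid-backoff-style weight and the particular backoff path are assumptions for illustration only, not the structures actually handcrafted or learned in KRLM.

```python
from collections import defaultdict

class ToyFLM:
    # Toy factored LM. Factors per token: (word, POS).
    # Assumed backoff path: P(w | w_prev, pos_prev) -> P(w | pos_prev) -> P(w).
    def __init__(self, alpha=0.4):
        self.alpha = alpha            # stupid-backoff-style weight (assumption)
        self.tri = defaultdict(int)   # counts of ((w_prev, pos_prev), w)
        self.tri_ctx = defaultdict(int)
        self.bi = defaultdict(int)    # counts of (pos_prev, w)
        self.bi_ctx = defaultdict(int)
        self.uni = defaultdict(int)   # unigram word counts
        self.total = 0

    def train(self, sentences):
        # sentences: lists of (word, pos) pairs
        for sent in sentences:
            for (pw, pp), (w, _) in zip(sent, sent[1:]):
                self.tri[((pw, pp), w)] += 1
                self.tri_ctx[(pw, pp)] += 1
                self.bi[(pp, w)] += 1
                self.bi_ctx[pp] += 1
            for (w, _) in sent:
                self.uni[w] += 1
                self.total += 1

    def prob(self, w, w_prev, pos_prev):
        # Use the fullest context that has been seen; otherwise back off.
        if self.tri[((w_prev, pos_prev), w)] > 0:
            return self.tri[((w_prev, pos_prev), w)] / self.tri_ctx[(w_prev, pos_prev)]
        if self.bi[(pos_prev, w)] > 0:        # drop the word parent
            return self.alpha * self.bi[(pos_prev, w)] / self.bi_ctx[pos_prev]
        return self.alpha ** 2 * self.uni[w] / self.total   # unigram floor
```

The point of the factored formalism is precisely that the choice of parent factors and of the backoff path is part of the model design, which is what the handcrafted and automatically learned structures in KRLM address.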
Regarding audio-visual information fusion, most of the work was based on hidden Markov models (HMMs) that model the speech units jointly over the combined streams or, in later-integration approaches, on product HMMs or on independent models with multi-pass decoding strategies (as in the ABCP-1 project). More recently, the Dynamic Bayesian Network (DBN) has become the basic statistical approach. The ABCP-2 project, in particular, investigates (quite efficiently, using the GMTK tool) the pros and cons of different DBN structures, potentially benefiting also from language-model integration.
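The stream-combination idea behind such multi-stream models can be sketched as follows: each state scores the audio and the visual observations with its own stream model, and the two log-likelihoods are combined with stream exponents. The function names, the diagonal-Gaussian stream models and the example weights are illustrative assumptions, not parameters of the ABCP systems.

```python
import math

def gauss_loglik(x, mean, var):
    # Log-likelihood of one observation vector under a diagonal Gaussian.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def multistream_score(audio_obs, video_obs, state, lam_a=0.7, lam_v=0.3):
    # Multi-stream state score: stream log-likelihoods weighted by
    # exponents with lam_a + lam_v = 1.  The weights here are purely
    # illustrative; in practice they are tuned, e.g. to the acoustic SNR.
    la = gauss_loglik(audio_obs, state["a_mean"], state["a_var"])
    lv = gauss_loglik(video_obs, state["v_mean"], state["v_var"])
    return lam_a * la + lam_v * lv
```

The same weighted combination appears whether the streams are fused at the state level (joint or product HMMs) or only at a later stage; what changes is where in the model the combination is applied, which is exactly the kind of design question the DBN structures studied in ABCP-2 make explicit.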
Concerning Development, valuable experience has been gained building (parts of) speech recognition prototypes for quite diverse applications, from isolated-word recognition over just a few words (IWR) to medium-to-large-vocabulary continuous speech recognition (CSR). Standard approaches have been implemented both for the acoustic front-end, based on cepstral analysis, and for the visual front-end, where ROI detection and tracking operations are followed by common feature analysis methods. As for the audio-visual models, HMMs, and in particular semi-continuous HMMs, have been at the basis of most implementations, though DBNs are lately becoming more widely used given the new possibilities they bring. Both non-discriminative and discriminative criteria have been used to train some of the implemented models. Most often, language models have been built from N-grams and single-pronunciation lexicons. Standard decoding approaches (such as the Viterbi algorithm) and multi-pass strategies have been used. Tools such as HTK, GMTK and the CMU-LMtk, and the OpenCV library among others, have been at the basis of most implementations, though software has also been created to build some modules (the Downloads section provides scripts for the tools used, as well as in-house C/C++, Perl and Matlab/Octave code).
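As a minimal sketch of the standard Viterbi decoding mentioned above, the function below finds the best state path through a discrete-emission HMM in log space. This is a generic textbook formulation, not the decoder of any of the prototypes, and the tiny model in the test below is hypothetical.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    # Viterbi decoding in log space: for each time step, keep the best
    # scoring path ending in each state, then read off the overall best.
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            # Best predecessor for state s at this time step.
            score, path = max(
                (V[-2][p][0] + math.log(trans_p[p][s]), V[-2][p][1])
                for p in states)
            V[-1][s] = (score + math.log(emit_p[s][o]), path + [s])
    return max(V[-1].values())[1]   # path with the highest final score
```

Real recognizers work over continuous (e.g. cepstral) observations, prune the search with beam thresholds and weave the N-gram language model into the transition scores, but the dynamic-programming core is this same recursion.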