Currently, three R&D (or R/D) projects are active: ABCP-1 (D), ABCP-2 (R&D), and KRLM (R). Another one, AVSP, should result in a survey-like document. These projects are presented below.
The ABCP-1 project has one main goal: building an audio-visual speech recognizer dedicated to a speaker-dependent continuous speech application, in Portuguese, with a vocabulary of 20K words. Well-known approaches are followed, though less common solutions (possibly with some innovative aspects) are used when combining the acoustic and visual scores according to the multi-pass decoding strategy.
The available technical reports (below) present the specifications and main characteristics of the recognizer and also include relatively detailed descriptions of some of its modules (for the time being, the visual front-end and the language model); some preliminary results are also included.
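As a purely illustrative sketch of what combining acoustic and visual scores in a later decoding pass can look like, the Python fragment below rescores an N-best list with a log-linear stream weighting. This is a generic, widely used baseline and not the combination scheme used in the project; the names `Hypothesis`, `combined_score` and `rescore`, and the parameters `audio_weight` and `lm_scale`, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hypothesis:
    """One first-pass hypothesis with per-stream log scores (hypothetical layout)."""
    words: Tuple[str, ...]
    acoustic_logp: float  # log-likelihood from the acoustic models
    visual_logp: float    # log-likelihood from the visual models
    lm_logp: float        # log-probability from the language model


def combined_score(h: Hypothesis, audio_weight: float = 0.7,
                   lm_scale: float = 1.0) -> float:
    """Log-linear combination: the audio and visual weights sum to one."""
    return (audio_weight * h.acoustic_logp
            + (1.0 - audio_weight) * h.visual_logp
            + lm_scale * h.lm_logp)


def rescore(nbest: List[Hypothesis], audio_weight: float = 0.7,
            lm_scale: float = 1.0) -> List[Hypothesis]:
    """Return the N-best list sorted from best to worst combined score."""
    return sorted(nbest,
                  key=lambda h: combined_score(h, audio_weight, lm_scale),
                  reverse=True)
```

In such a scheme the audio weight would typically be tuned on held-out data, for instance lowered when the acoustic channel is noisy so that the visual stream contributes more.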
Some of the created or adapted data sets, textual and audiovisual materials (referring to the Portuguese language) and built software are available in Downloads; one Demo related to this project is also available.
One of the two central objectives of the ABCP-2 project is to build a speech recognizer dedicated to an application with essentially the same specifications as the one considered in ABCP-1. However, except for the audio-visual acquisition and analysis modules, which remain unaltered, this recognizer is very different. Instead of the HMM-based acoustic and visual models supporting, along with the N-grams (of different order at each pass), a two-pass decoding strategy, the ABCP-2 recognizer is based on a Dynamic Bayesian Network (DBN) framework, jointly modeling the application's acoustic, visual and linguistic knowledge. This DBN-based approach can be expected to cope better with the modeling difficulties caused by the "erratic" asynchrony and the differences of temporal granularity between the acoustic and the visual streams. Moreover, the whole knowledge concerning the application is used at once, in a single-pass decoding operation, which in principle may lead to better recognition performance.
The other main objective of this project is more research oriented. Briefly, it consists of using the opportunity created by the task of building this recognizer to develop, at the same time, general insight into and practical experience with the DBN-based approaches (and more generally the Graphical Models) to multi-stream speech recognition.
Some of the data sets used, the software created (e.g. GMTK scripts or data management software) and other products of this project are available in Downloads. A short memo and the log-book used are available below:
The main goal of the KRLM research project is to create a new method to establish the structure of a language model (LM) for use in some speech recognition applications. The work in progress considers an LM based on the Factorial Language Model statistical framework and on class-dependent N-grams supporting multiple linguistic classes per word. The essential difference between the approach followed and formerly published methods resides in the prediction model for the linguistic classes, whose structure must result from combining linguistic expertise with a data-driven algorithm similar to one that has been developed for the so-called Buried Markov Models (grounded on Information Theory concepts). Based on linguistic expertise, it is assumed that some of the manually selected linguistic factors are conditionally independent of the other ones given their own histories. This assumption, which reflects the role that the factors play in many real applications, leads to a proper factorization and to the partial definition of the main structure of the statistical model. Then, the data-driven techniques that have been investigated can be used to select just the statistical dependencies (associated with the linguistic features) needed to build a sufficiently accurate model. Results already obtained show that this approach is particularly effective in some circumstances when annotation deficiencies affect the text corpus. Experiments considering both syntactic and (pseudo-)semantic knowledge are planned to assess the overall effectiveness of this method in some realistic applications.
The data sets used and the software created (mostly Perl scripts) are available in Downloads. A draft version of a technical report is available below:
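To give a concrete flavour of data-driven dependency selection grounded on Information Theory, the sketch below ranks candidate conditioning factors by their empirical mutual information with a target linguistic class. It is a deliberately simplified stand-in, not the Buried Markov Model algorithm nor the method developed in this project; the function names and the factor names used in the usage example are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information I(X;Y), in bits, from (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_candidate_parents(samples, target, candidates):
    """Rank candidate factors by mutual information with the target factor.

    samples: list of dicts mapping factor name -> observed value.
    Returns (factor, score) pairs, most informative first.
    """
    scores = {
        c: mutual_information([(s[c], s[target]) for s in samples])
        for c in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, given annotated samples with the hypothetical factors `prev_pos`, `prev_gender` and `pos`, calling `rank_candidate_parents(samples, "pos", ["prev_pos", "prev_gender"])` would list first the factor that carries more information about `pos`; only the top-ranked dependencies would then be kept in the model structure. A realistic selection procedure would also condition on the dependencies already chosen and guard against sparse counts.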
The main goal of this project is to produce a document presenting an overview of the Audio Visual Speech Perception (AVSP) knowledge domain from an Audio Visual Speech Recognition (AVSR) perspective. Several reasons that justify the effort required to achieve that goal are sketched next.
Given the success of AVSR systems in many applications, interest in this R&D area has been increasing and many publications have been produced. On the other hand, glancing over many of those publications, one concludes that (apparently) the AVSR community has not been paying enough attention to related issues in AVSP. Indeed, papers on AVSR usually do not make any explicit reference to concepts or results from the AVSP field, or else just briefly mention the McGurk effect. Yet gaining some insight into AVSP can be important from the perspective of the AVSR endeavour.
One reason is that the knowledge acquired on AVSP likely allows a better understanding of the achievements, failures and limitations of some AVSR approaches designed for useful applications. Indeed, some common explanations of relevant AVSR phenomena and mechanisms are overly simplistic.
Moreover, many reported experiments in AVSP show quite clearly that AVSR is at a very early stage. So another, quite obvious, reason is that a substantial part of the large amount of empirical results gathered over several decades of research in AVSP can become very useful for developing AVSR systems, leading to the direct improvement of current techniques or else inspiring new ones.
A third reason for AVSR's interest in AVSP is that this knowledge migration can also be advantageous in the case of some of the more theoretical work in the AVSP area. Even if some of that work is largely speculative and lacks an acceptable consensus, some of those discussions may prove very inspiring.
A very preliminary (draft) version of this document is available below: