I want to develop an app for gesture recognition using Kinect and hidden Markov models. I watched a tutorial here: HMM lecture
But I don't know how to start. What should the state set be, and how should the data be normalized so that an HMM can be trained on it? I know (more or less) how it should be done for signals and for simple "left-to-right" cases, but 3D space confuses me a little.
Could anyone describe the steps for doing this? In particular, I need to know how to build the model and what the steps of the HMM algorithm should be.
A hidden Markov model (HMM) is a probabilistic graphical model that is commonly used in statistical pattern recognition and classification.
The hidden Markov model is a probabilistic model used to describe the probabilistic characteristics of a random process. The basic idea is that an observed event does not correspond directly to the underlying state at each step; instead, it is related to that state through a set of probability distributions.
Hidden Markov Models (HMMs) are a class of probabilistic graphical models that allow us to predict a sequence of unknown (hidden) variables from a set of observed variables. A simple example of an HMM is predicting the weather (the hidden variable) based on the type of clothes that someone wears (the observation).
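To make the weather example concrete, here is a minimal sketch of decoding the hidden weather from observed clothing with the Viterbi algorithm. Every number, state name, and clothing label below is made up for illustration; nothing here comes from a real dataset.

```python
import math

# Hypothetical toy parameters; all probabilities are invented for illustration.
states = ["sunny", "rainy"]
start = {"sunny": 0.6, "rainy": 0.4}
trans = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
emit = {
    "sunny": {"t-shirt": 0.6, "coat": 0.3, "umbrella": 0.1},
    "rainy": {"t-shirt": 0.1, "coat": 0.4, "umbrella": 0.5},
}

def viterbi(observations):
    """Return the most probable hidden state sequence for the observations."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(start[s]) + math.log(emit[s][observations[0]])
            for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Best predecessor of s at this time step
            prev = max(states, key=lambda p: best[p] + math.log(trans[p][s]))
            new_best[s] = (best[prev] + math.log(trans[prev][s])
                           + math.log(emit[s][obs]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

weather = viterbi(["t-shirt", "coat", "umbrella"])
print(weather)  # ['sunny', 'rainy', 'rainy']
```

For gestures the roles are the same: the hidden states would be the phases of a gesture and the observations would be whatever is extracted from each Kinect frame.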
Hidden Markov models are also used in image processing. They are applied in many areas where 1D (sequential) data are processed; applying HMMs to 2D data raises some additional problems.
One approach to applying HMMs to gesture recognition is to use an architecture similar to the one commonly used for speech recognition.
The HMM would not be over space but over time, and each video frame (or set of extracted features from the frame) would be an emission from an HMM state.
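As a rough illustration of turning one frame into a feature vector, the sketch below normalizes 3D skeleton joints so that the result does not depend on where the person stands or how large they are. This is only one possible normalization, and the joint names (`spine`, `neck`, `hand_right`) are hypothetical placeholders, not names from any Kinect SDK.

```python
import math

def normalize_frame(joints, origin="spine", scale_pair=("spine", "neck")):
    """Turn one skeleton frame into a translation- and scale-invariant vector.

    `joints` maps a joint name to an (x, y, z) tuple. The joint names are
    hypothetical; substitute whatever your skeleton tracker provides.
    """
    ox, oy, oz = joints[origin]
    # Use the distance between two reference joints as the body-size unit
    scale = math.dist(joints[scale_pair[0]], joints[scale_pair[1]])
    if scale == 0:
        raise ValueError("reference joints coincide; cannot scale")
    features = []
    for name in sorted(joints):  # fixed ordering so vectors are comparable
        if name == origin:
            continue  # the origin joint is always (0, 0, 0) after shifting
        x, y, z = joints[name]
        features.extend([(x - ox) / scale, (y - oy) / scale, (z - oz) / scale])
    return features

frame = {
    "spine": (1.0, 1.0, 2.0),
    "neck": (1.0, 1.5, 2.0),
    "hand_right": (1.5, 1.0, 2.0),
}
features = normalize_frame(frame)
```

The sequence of such vectors, one per frame, is what the HMM would see as its emissions over time.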
Unfortunately, HMM-based speech recognition is a rather large area. Many books and theses have been written describing different architectures. I recommend starting with Jelinek's "Statistical Methods for Speech Recognition" (http://books.google.ca/books?id=1C9dzcJTWowC&pg=PR5#v=onepage&q&f=false) and then following the references from there. Another resource is the CMU Sphinx webpage (http://cmusphinx.sourceforge.net).
Another thing to keep in mind is that HMM-based systems are probably less accurate than discriminative approaches like conditional random fields or max-margin recognizers (e.g. SVM-struct).
For an HMM-based recognizer the overall training process is usually something like the following:
1) Perform some sort of signal processing on the raw data.
2) Apply vector quantization (VQ) to the processed data. Other dimensionality reduction techniques can also be used.
3) Manually construct HMMs whose state transitions capture the sequence of different poses within a gesture.
Emission distributions of these HMM states will be centered on the VQ vectors from step 2.
In speech recognition these HMMs are built from phoneme dictionaries that give the sequence of phonemes for each word.
4) Construct a single HMM that contains transitions between the individual gesture HMMs (or, in the case of speech recognition, the phoneme HMMs). Then train the composite HMM with videos of gestures.
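Steps 2 and 3 above can be sketched roughly as follows, assuming a VQ codebook already exists. The function names and the `stay` and `peak` values are my own illustrative choices, and the resulting parameters are only starting points that training (e.g. Baum-Welch) would then refine.

```python
import math

def quantize(frame, codebook):
    """Return the index of the codebook vector nearest to `frame`."""
    return min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))

def left_to_right_hmm(pose_sequence, n_symbols, stay=0.6, peak=0.8):
    """Build initial parameters for one gesture from its expected pose order.

    State i corresponds to pose_sequence[i]. Each state either stays put or
    advances to the next state. Its emission distribution puts `peak` mass on
    its own codebook index and spreads the rest evenly over the other symbols.
    `stay` and `peak` are arbitrary starting values; n_symbols must be >= 2.
    """
    n = len(pose_sequence)
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i + 1 < n:
            trans[i][i] = stay
            trans[i][i + 1] = 1.0 - stay
        else:
            trans[i][i] = 1.0  # the final state absorbs
    rest = (1.0 - peak) / (n_symbols - 1)
    emit = [[peak if k == pose else rest for k in range(n_symbols)]
            for pose in pose_sequence]
    start = [1.0] + [0.0] * (n - 1)  # always begin in the first pose
    return start, trans, emit

# Hypothetical 2D codebook with three pose prototypes
codebook = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
symbol = quantize((0.9, 0.1), codebook)

# A gesture that passes through poses 0 -> 1 -> 2
start, trans, emit = left_to_right_hmm([0, 1, 2], n_symbols=len(codebook))
```

The zeros in `trans` are what make the model "left-to-right": a state can never go back to an earlier pose.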
For the recognition process, apply the signal processing step, find the nearest VQ entry for each frame, then find a high scoring path through the HMM (either the Viterbi path, or one of a set of paths from an A* search) given the quantized vectors. This path gives the predicted gestures in the video.
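For isolated gestures (one gesture per clip), a simpler alternative to searching a composite HMM is to score the quantized sequence against each gesture's HMM with the forward algorithm and pick the best. Note that this is a different technique from the Viterbi or A* path search described above, and the two toy models below are invented purely for illustration.

```python
import math

def log_likelihood(symbols, start, trans, emit):
    """Scaled forward algorithm: log P(symbols | model) for a discrete HMM."""
    n = len(start)
    alpha = [start[i] * emit[i][symbols[0]] for i in range(n)]
    total = 0.0
    for t in range(len(symbols)):
        if t > 0:
            # Propagate one step through the transitions, then emit symbol t
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n))
                     * emit[j][symbols[t]] for j in range(n)]
        norm = sum(alpha)
        if norm == 0.0:
            return float("-inf")  # the sequence is impossible under this model
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
        total += math.log(norm)
    return total

def classify(symbols, models):
    """Pick the gesture name whose HMM gives the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(symbols, *models[name]))

# Hypothetical two-state models: each is (start, trans, emit)
shared_trans = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "swipe_right": ([1.0, 0.0], shared_trans, [[0.9, 0.1], [0.1, 0.9]]),
    "swipe_left": ([1.0, 0.0], shared_trans, [[0.1, 0.9], [0.9, 0.1]]),
}
prediction = classify([0, 0, 1, 1], models)
```

For continuous video containing several gestures back to back, the composite-HMM path search is the more appropriate tool, since per-model scoring assumes the clip boundaries are already known.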
I implemented the 2D version of this for the Coursera PGM class, which has Kinect gestures as the final unit.
https://www.coursera.org/course/pgm
Basically, the idea is that an HMM is not very good at deciding what the poses are. In our unit, I used a variation of K-means to segment the poses into probabilistic categories. The HMM was then used to decide which sequences of poses were viable as gestures. Any clustering algorithm run on a set of poses is a good candidate, even if you don't know in advance what kinds of poses they are.
From there you can create a model which trains on the aggregate probabilities of each possible pose for each point of Kinect data.
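The clustering step could look roughly like the sketch below: plain k-means to find pose prototypes, followed by a softmax over distances to get a probability for each pose category. This is my own simplified stand-in, not the variation used in the class, and the `temperature` parameter and sample points are assumptions for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def soft_assign(point, centroids, temperature=1.0):
    """Softmax over negative squared distances: one probability per cluster."""
    scores = [-math.dist(point, c) ** 2 / temperature for c in centroids]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 2D pose features forming two obvious groups
poses = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.2), (4.9, 5.2)]
centroids = kmeans(poses, k=2)
probs = soft_assign((0.3, 0.3), centroids)
```

The per-frame probability vectors from `soft_assign` can then play the role of the observations that the HMM is trained on.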
I know this is a bit of a sparse overview. That class gives an excellent overview of the state of the art, but the problem in general is a bit too difficult to condense into an easy answer. (I'd recommend taking it in April if you're interested in this field.)