TL;DR
The work on the Ellen DeGeneres gestures dataset was done during the GSoC period from 6th June to 16th August. Initially, I struggled a bit with understanding Singularity containers and with how to run OpenPose (especially inside a Singularity container). After that, I worked on building a good model for the gestures. The points below summarize what I did, what worked for me and what didn't.
- First of all, I had to go through the Singularity documentation, OpenPose and the HPC commands to be used. I wrote those down in my blog; check “Useful information while beginning to work” on the index page.
- I had a notion that building a model from scratch (or even using transfer learning) to identify gestures directly from raw frames might not work that well because of the amount of irrelevant information (noise) in the background. So my plan was clear: use OpenPose keypoints. Check their extensive documentation (their GitHub repo has a number of pages besides the README). Then model the keypoints as 3D time-series data.
- Running OpenPose is easy; just use the command below. I used the subprocess module in Python to run it (a sketch is shown after the command). The keypoints are written to the JSON directory (one file per frame).
$ <path-to-openpose-bin> --video <input-video-path> --display 0 --write_video <output-video-path> --write_json <output-json-dir>
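For reference, a minimal sketch of how one might drive this from Python with subprocess and then read the per-frame JSON files. The paths are placeholders; the "pose_keypoints_2d" key is what recent OpenPose versions write (older versions use "pose_keypoints").

```python
import json
import subprocess
from pathlib import Path

# Placeholder paths -- substitute your own OpenPose binary, video and output locations.
OPENPOSE_BIN = "/path/to/openpose.bin"
VIDEO_IN = "/path/to/ellen_clip.mp4"
VIDEO_OUT = "/path/to/ellen_clip_rendered.avi"
JSON_DIR = "/path/to/json_out"

def run_openpose():
    """Invoke OpenPose on one video; it writes one keypoint JSON file per frame."""
    cmd = [
        OPENPOSE_BIN,
        "--video", VIDEO_IN,
        "--display", "0",
        "--write_video", VIDEO_OUT,
        "--write_json", JSON_DIR,
    ]
    subprocess.run(cmd, check=True)

def load_frame_keypoints(json_dir):
    """Yield, per frame, the list of flat pose-keypoint arrays (one per detected person)."""
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path) as f:
            data = json.load(f)
        yield [p["pose_keypoints_2d"] for p in data["people"]]
```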
- The rest of the workflow is as follows:
- Consecutive frames may not belong to the same shot/angle, and OpenPose does not track (label) persons across frames. So I had to figure out a way to separate such frames in order to reduce the amount of unnecessary and irrelevant data being fed to the model.
- First, separate on the basis of the number of persons. In most cases in Ellen videos, audience shots contain multiple people, which is irrelevant and also not annotated. One downside is that the sudden appearance of a person in the background may split frames of the same shot, but since no data is lost I went ahead with the idea. I also filter out frames that contain more persons than a given threshold (say 6); a small sketch is shown below.
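A minimal sketch of that filtering step, assuming frames is a list where each element is the list of per-person keypoints for one frame (the function and variable names are mine, not from the actual repository):

```python
MAX_PERSONS = 6  # threshold mentioned in the post

def split_on_person_count(frames, max_persons=MAX_PERSONS):
    """Split a frame sequence into segments whenever the person count changes,
    dropping frames that contain more than max_persons people (or none at all)."""
    segments, current = [], []
    prev_count = None
    for people in frames:
        count = len(people)
        if count == 0 or count > max_persons:
            # crowd shot or empty frame: close the current segment and skip the frame
            if current:
                segments.append(current)
                current = []
            prev_count = None
            continue
        if prev_count is not None and count != prev_count:
            segments.append(current)
            current = []
        current.append(people)
        prev_count = count
    if current:
        segments.append(current)
    return segments
```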
- Then, separate on the basis of frame continuity. Say in one frame there is a person whose neck is at (200, 340) in a 640x360 video, and in the next frame there is also one person, whose neck is at (250, 100). Obviously they aren’t the same person. So I took a maximum allowable distance (say 0.025 times (640+360)/2, i.e. about 12 pixels) beyond which they are treated as different persons. For 2 or more persons, if at least one person satisfies the above condition, the two frames are taken as a continuation of the same shot. This too has downsides, but it largely works.
- For training, the same person has to sit at the same index of the array across frames: if one frame is [person_1_keypoints, person_2_keypoints], then the next frame must keep that order and not become [person_2_keypoints, person_1_keypoints]. For this, I applied the same distance rule again to recover the order (see the sketch below).
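A sketch of both steps: matching persons between consecutive frames by neck distance, and reordering the second frame to follow the first. The helper names are assumptions on my part; the neck is keypoint 1 in both the COCO and BODY_25 OpenPose models, so its (x, y) sits at positions 3 and 4 of the flat (x, y, confidence) array.

```python
import math

FRAME_W, FRAME_H = 640, 360
MAX_DIST = 0.025 * (FRAME_W + FRAME_H) / 2  # 12.5 px for a 640x360 video

def neck_xy(person):
    """Neck position from a flat OpenPose keypoint list of (x, y, confidence) triples."""
    return person[3], person[4]

def neck_dist(p1, p2):
    (x1, y1), (x2, y2) = neck_xy(p1), neck_xy(p2)
    return math.hypot(x1 - x2, y1 - y2)

def same_shot(prev_people, curr_people, max_dist=MAX_DIST):
    """Frames count as continuous if at least one pair of persons is close enough."""
    return any(neck_dist(p, q) <= max_dist for p in prev_people for q in curr_people)

def reorder(prev_people, curr_people, max_dist=MAX_DIST):
    """Reorder curr_people so each index holds the person closest to the person at
    the same index in prev_people (greedy nearest-neck matching)."""
    ordered, used = [], set()
    for p in prev_people:
        best_j, best_d = None, max_dist
        for j, q in enumerate(curr_people):
            d = neck_dist(p, q)
            if j not in used and d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            ordered.append(curr_people[best_j])
            used.add(best_j)
    # append any unmatched persons at the end
    ordered.extend(q for j, q in enumerate(curr_people) if j not in used)
    return ordered
```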
- Dense layers require a fixed input dimension. So if, for example, I have 2 persons in one frame and 5 in another, I have to pad the first frame with dummy keypoints (filled with -1, treated like NaN data); see the padding sketch below.
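A minimal padding helper, assuming each person has already been reduced to the 14 (x, y) values listed further down (names here are hypothetical):

```python
import numpy as np

MAX_PERSONS = 6
N_KEYPOINTS = 14  # 7 joints x (x, y)

def pad_frame(people, max_persons=MAX_PERSONS, n_keypoints=N_KEYPOINTS):
    """Pad (or truncate) one frame to a fixed (max_persons, n_keypoints) array,
    filling missing persons with -1 as dummy values."""
    frame = np.full((max_persons, n_keypoints), -1.0, dtype=np.float32)
    for i, person in enumerate(people[:max_persons]):
        frame[i, :] = person[:n_keypoints]
    return frame
```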
- After doing all this, I have to get the CSV files exported from the ELAN annotations for the gesture timestamps. I take the begin and end times and convert them to frame indices using the fps, which tells me which frames fall inside a gesture (see the sketch below).
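A sketch of that conversion, assuming the CSV holds begin/end times in seconds (the "begin"/"end" column names are placeholders, not necessarily the actual ELAN export headers):

```python
import csv

def gesture_frames(csv_path, fps, n_frames):
    """Return a 0/1 label per frame: 1 if the frame falls inside any annotated gesture."""
    labels = [0] * n_frames
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            # "begin" / "end" are assumed column names holding times in seconds
            start = int(float(row["begin"]) * fps)
            end = int(float(row["end"]) * fps)
            for frame in range(start, min(end + 1, n_frames)):
                labels[frame] = 1
    return labels
```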
- The training data is formed by putting a number of consecutive frames together to get time-series data; each frame is a list of keypoints for a given number of persons, converted to a numpy array. The label is whether the last frame of the window is a gesture or not. The input shape is (batch_size, no_channels, window_size, max_persons, no_of_keypoints); by default this is (16, 1, 5, 6, 14). A windowing sketch follows the keypoint list. The keypoints taken are
"neck_x", "neck_y", "right shoulder_x", "right shoulder_y", "right elbow_x", "right elbow_y", "right wrist_x","right wrist_y", "left shoulder_x", "left shoulder_y", "left elbow_x", "left elbow_y", "left wrist_x", "left wrist_y"
. - This is also present in the GitHub repository and also in the Singularity Container.
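A sketch of the windowing step under those default settings (the function name is mine; the real pipeline lives in the repository):

```python
import numpy as np

WINDOW_SIZE = 5

def make_windows(frames, labels, window_size=WINDOW_SIZE):
    """Stack consecutive padded frames into samples of shape
    (no_channels, window_size, max_persons, no_of_keypoints) = (1, 5, 6, 14);
    each sample is labelled by whether its last frame is a gesture."""
    X, y = [], []
    for end in range(window_size - 1, len(frames)):
        window = np.stack(frames[end - window_size + 1 : end + 1])  # (5, 6, 14)
        X.append(window[np.newaxis, ...])  # add the channel axis -> (1, 5, 6, 14)
        y.append(labels[end])
    return np.asarray(X, dtype=np.float32), np.asarray(y, dtype=np.float32)
```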
- The training model consists of Conv3D and ConvLSTM2D layers, implemented in TensorFlow Keras.
- First of all, I experimented with hyperparameters, especially the layer combination. Since each sample is so small (only 1x5x6x14), choosing the layers carefully and adding normalization helped. After some iteration, the best-performing model had a single ConvLSTM2D layer at the start, followed by 6 Conv3D layers arranged like the first 6 layers of the VGG-16 network, with an activation function and Batch Normalization applied after each layer. Stacking many LSTM-based layers is slow; Conv3D layers give good results in less time. The final layers are Dense layers. A rough sketch of such a model is shown below.
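A rough sketch of such an architecture in tf.keras. The filter counts, kernel sizes and the channels_last layout are my assumptions for illustration, not the exact configuration from the repository.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(window_size=5, max_persons=6, n_keypoints=14):
    """ConvLSTM2D front end followed by a VGG-like stack of Conv3D layers,
    each followed by BatchNormalization and ReLU, ending in Dense layers."""
    # channels_last layout: (time, rows, cols, channels) = (5, 6, 14, 1)
    inputs = layers.Input(shape=(window_size, max_persons, n_keypoints, 1))

    x = layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    # 6 Conv3D layers with filter counts loosely following the start of VGG-16
    for filters in (64, 64, 128, 128, 256, 256):
        x = layers.Conv3D(filters, (3, 3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)

    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # gesture / no gesture

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```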
- I also tried a simple multi-input computer vision model where the raw frames are fed directly to the model, passed through an image feature extractor and then LSTM layers; however, that didn’t work out.
- Another possibility was an encoder-decoder style structure, but I couldn’t implement it in time. Something along the lines of an attention mechanism could also be tried, for example using the OpenPose keypoints together with the image frames so that the keypoint values direct attention to specific pixels. I couldn’t implement these in time either.