We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as “Sequence-to-sequence” learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while remaining trainable end-to-end.
The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also allow us to implicitly perform transfer learning between different interrelated tasks, which in turn lets us exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties significantly improves the performance of the overarching recognition system by better constraining the learning problem.
The proposed techniques are demonstrated in the challenging domain of sign language recognition. We demonstrate state-of-the-art performance on hand-shape recognition, outperforming previous techniques by more than 30%. Furthermore, we obtain sign recognition rates comparable to previous research, without the need for an alignment step to segment out the signs for recognition.
In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatio-temporal deep neural networks using weak border-level annotations.
The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples and using this as the prior for the next sampling stage.
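The sample-score-resample loop described above can be sketched as follows. This is a minimal illustration under assumed interfaces (`candidates` as (start, end) frame regions, `score_fn` standing in for the trained 3D-CNN's confidence), not the paper's implementation:

```python
import random

def refine_alignments(candidates, score_fn, n_samples=200, n_iters=5):
    """Iteratively re-weight candidate temporal regions by classifier score.

    candidates: list of (start, end) training-volume proposals (assumed interface)
    score_fn:   stand-in for the trained classifier's confidence on a region
    """
    weights = [1.0 / len(candidates)] * len(candidates)
    for _ in range(n_iters):
        # Draw training volumes from the current (prior) distribution.
        drawn = random.choices(candidates, weights=weights, k=n_samples)
        # Score the drawn examples; the normalized scores form the posterior,
        # which is used as the prior for the next sampling stage.
        scores = [score_fn(region) for region in drawn]
        total = sum(scores) or 1.0
        posterior = {}
        for region, s in zip(drawn, scores):
            posterior[region] = posterior.get(region, 0.0) + s / total
        weights = [posterior.get(c, 1e-6) for c in candidates]
    # Most likely alignment after refinement.
    return max(zip(weights, candidates))[1]
```

In the full method the scoring and the classifier training alternate, so the distribution over likely regions sharpens as the discriminative 3D-CNN improves.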
We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index Scores on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner-independent: it can easily be combined with other approaches.
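The Mean Jaccard Index used in the ChaLearn challenges scores the frame-level overlap between predicted and ground-truth gesture segments, averaged over gestures and sequences; a minimal per-segment version might look like:

```python
def jaccard_index(pred_frames, gt_frames):
    """Frame-level Jaccard index between a predicted and a ground-truth
    gesture segment, each given as an iterable of frame indices."""
    pred, gt = set(pred_frames), set(gt_frames)
    union = pred | gt
    # Intersection over union; empty union scores zero by convention.
    return len(pred & gt) / len(union) if union else 0.0
```

A prediction covering frames 0-9 against a ground truth covering frames 5-14 shares 5 of 15 frames, giving a score of 1/3.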
Automatic detection of fall events is vital to providing fast medical assistance to the casualty, particularly when the injury causes loss of consciousness. Optimizing the energy consumption of mobile applications, especially those which run 24/7 in the background, is essential for longer use of smartphones. In order to improve energy efficiency without compromising fall detection performance, we propose a novel 3-tier architecture that combines simple thresholding methods with machine learning algorithms. The proposed method is implemented in a mobile application, called uSurvive, for Android smartphones. It runs as a background service, monitors the activities of a person in daily life, and automatically sends a notification to the appropriate authorities and/or user-defined contacts when it detects a fall. The performance of the proposed method was evaluated in terms of fall detection performance and energy consumption. Real-life performance tests conducted on two different models of smartphone demonstrate that our 3-tier architecture with feature reduction can save up to 62% of energy compared to machine-learning-only solutions. In addition to this energy saving, the hybrid method achieves 93% accuracy, which is superior to thresholding methods and better than machine-learning-only solutions.
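The tiered idea, a cheap threshold gate so that the expensive classifier runs only on impact-like events, can be sketched roughly as below; the threshold, features, and `classifier` interface are illustrative assumptions, not uSurvive's tuned values:

```python
def tiered_fall_check(accel_window, classifier, impact_g=2.5):
    """Tiered fall check: cheap thresholding first, ML only when triggered.

    accel_window: recent acceleration magnitudes in g (assumed representation)
    classifier:   callable on a small feature vector (stand-in for the ML tier)
    """
    peak = max(accel_window)
    if peak < impact_g:
        # Tier 1: no impact-like spike, so exit without running the classifier.
        return False
    # Tier 2: a reduced feature set computed only for candidate events.
    feats = [peak,
             sum(accel_window) / len(accel_window),
             max(accel_window) - min(accel_window)]
    # Tier 3: machine-learning confirmation.
    return classifier(feats)
```

Because everyday motion rarely passes tier 1, the battery-hungry stages stay idle most of the time, which is the source of the reported energy savings.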
In this paper, we propose using 3D Convolutional Neural Networks for large scale user-independent continuous gesture recognition. We have trained an end-to-end deep network for continuous gesture recognition (jointly learning both the feature representation and the classifier). The network performs three-dimensional (i.e. space-time) convolutions to extract features related to both the appearance and motion from volumes of color frames. Space-time invariance of the extracted features is encoded via pooling layers. The earlier stages of the network are partially initialized using the work of Tran et al. before being adapted to the task of gesture recognition. An earlier version of the proposed method, which was trained for 11,250 iterations, was submitted to ChaLearn 2016 Continuous Gesture Recognition Challenge and ranked 2nd with the Mean Jaccard Index Score of 0.269235. When the proposed method was further trained for 28,750 iterations, it achieved state-of-the-art performance on the same dataset, yielding a 0.314779 Mean Jaccard Index Score.
In this study, a real-time, computer vision based sign language recognition system aimed at aiding hearing impaired users in a hospital setting has been developed. By directing them through a tree of questions, the system allows the user to state their purpose of visit by answering four to six questions. The deaf user can use sign language to communicate with the system, which provides a written transcript of the exchange. A database collected from six users was used for the experiments. User-independent tests without using the tree-based interaction scheme yield 96.67% accuracy among 1257 sign samples belonging to 33 sign classes. The experiments evaluated the effectiveness of the system in terms of feature selection and spatio-temporal modelling. The combination of hand position and movement features modelled by Temporal Templates and classified by Random Decision Forests yielded the best results. The tree-based interaction scheme further increased the recognition performance to more than 97.88%.
Sign language recognition has been the focus of research in recent years because it has enabled the use of sign languages, which are the main medium of communication for the hearing impaired, for human-computer interaction. In this work, we propose a method to recognize signs using Improved Dense Trajectory (IDT) features, which were previously used in large-scale action recognition. Fisher Vectors (FV) are used to represent sign samples in the proposed method. Seven different combinations of features were compared on a test set of 200 signs, using a Support Vector Machine (SVM) classifier. The best combination yielded 80.43% recognition performance when Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH) components were used together.
In this thesis, we propose a human-computer interaction platform for the hearing impaired, to be used in hospitals and banks. In order to develop such a system, we collected BosphorusSign, a Turkish Sign Language corpus in the health and finance domains, by consulting sign language linguists, native users and domain specialists. Using a subset of the collected corpus, we designed a prototype system, which we call HospiSign, aimed at helping the Deaf in their hospital visits. The HospiSign platform guides its users through a tree-based activity diagram by asking specific questions and requiring the users to answer from the given options. In order to recognize the signs given as answers to the interaction platform, we proposed using hand position, hand shape, hand movement and upper body pose features to represent signs. To model the temporal aspect of the signs we used Dynamic Time Warping and Temporal Templates. The classification of the signs is performed using k-Nearest Neighbors and Random Decision Forest classifiers. We conducted experiments on a subset of BosphorusSign and evaluated the effectiveness of the system in terms of features, temporal modeling techniques and classification methods. In our experiments, the combination of hand position and hand movement features yielded the highest recognition performance, while both of the temporal modeling and classification methods gave competitive results. Moreover, we investigated the effects of using a tree-based activity diagram and found the approach to not only increase the recognition performance, but also ease the adaptation of the users to the system. Furthermore, we investigated domain adaptation and facial landmark localization techniques and examined their applicability to the gesture and sign language recognition tasks.
There are as many sign languages as there are deaf communities in the world. Linguists have been collecting corpora of different sign languages and annotating them extensively in order to study and understand their properties. On the other hand, the field of computer vision has approached the sign language recognition problem as a grand challenge, and research efforts have intensified in the last 20 years. However, corpora collected for studying linguistic properties are often not suitable for sign language recognition, as the statistical methods used in the field require large amounts of data. Recently, with the availability of inexpensive depth cameras, groups from the computer vision community have started collecting corpora with a large number of repetitions for sign language recognition research. In this paper, we present the BosphorusSign Turkish Sign Language corpus, which consists of 855 sign and phrase samples from the health, finance and everyday life domains. The corpus was collected using the state-of-the-art Microsoft Kinect v2 depth sensor, and will be the first of its kind in this sign language research field. Furthermore, the corpus will include annotations rendered by linguists, so that it appeals to both the linguistic and the sign language recognition research communities.
Sign language is the natural medium of communication for the Deaf community. In this study, we have developed an interactive communication interface for hospitals, HospiSign, using computer vision based sign language recognition methods. The objective of this paper is to review sign language based human-computer interaction applications and to introduce HospiSign in this context. HospiSign is designed to meet deaf people at the information desk of a hospital and to assist them in their visit. The interface guides the deaf visitors to answer certain questions and express the intention of their visit, in sign language, without the need for a translator. The system consists of a computer, a touch display to visualize the interface, and a Microsoft Kinect v2 sensor to capture the users’ sign responses. HospiSign recognizes isolated signs in a structured activity diagram using Dynamic Time Warping based classifiers. In order to evaluate the developed interface, we performed usability tests and found that the system was able to assist its users in real time with high accuracy.
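Dynamic Time Warping based classification of this kind can be illustrated with a nearest-template sketch over 1-D feature sequences; the actual HospiSign features are multi-dimensional, so this is only a schematic:

```python
import math

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest of the three admissible warps.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def classify(sample, templates):
    """1-nearest-neighbour: label of the template with the smallest DTW distance.

    templates: list of (label, feature_sequence) pairs (assumed interface).
    """
    return min(templates, key=lambda t: dtw_distance(sample, t[1]))[0]
```

Because DTW tolerates non-linear differences in signing speed, a single template per sign can already separate a small, structured vocabulary such as the answers expected at each node of the activity diagram.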
Supervised Descent Method (SDM) has proven successful in many computer vision applications such as face alignment, tracking and camera calibration. Recent studies which used SDM achieved state-of-the-art performance on facial landmark localization in depth images. In this study, we propose to use ridge regression instead of least squares regression for learning the SDM, and to change feature sizes in each iteration, effectively turning the landmark search into a coarse-to-fine process. We apply the proposed method to facial landmark localization on the Bosphorus 3D Face Database, using frontal depth images with no occlusion. Experimental results confirm that both ridge regression and adaptive feature sizes improve the localization accuracy considerably.
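The ridge-regression variant of an SDM training stage can be sketched as below; the regularization weight `lam` and the feature/landmark shapes are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def learn_descent_map(features, deltas, lam=0.1):
    """One SDM training stage: ridge-regress landmark updates from features.

    Ridge (vs. plain least squares) shrinks the descent map and stabilises it
    when features are high-dimensional relative to the training set.
    features: (n_samples, n_features); deltas: (n_samples, n_landmark_params)
    """
    X, Y = np.asarray(features), np.asarray(deltas)
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T Y.
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)  # descent map R_k

def apply_stage(x_feat, R):
    """Predicted landmark update for the features at the current estimate."""
    return np.asarray(x_feat) @ R
```

At test time the stages are applied in sequence, each extracting features around the current landmark estimate; shrinking the feature support from stage to stage gives the coarse-to-fine behaviour described above.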
This paper proposes using the state-of-the-art 2D facial landmark localization method, Supervised Descent Method (SDM), for facial landmark localization in 3D depth images. The proposed method was evaluated on frontal faces with no occlusion from the Bosphorus 3D Face Database. In the experiments, in which 2D features were used to train SDM, the proposed approach achieved state-of-the-art performance for several landmarks over the currently available 3D facial landmark localization methods.
This paper presents a framework for spotting and recognizing continuous human gestures. Skeleton based features are extracted from normalized human body coordinates to represent gestures. These features are then used to construct spatio-temporal template based Random Decision Forest models. Finally, predictions from different models are fused at decision-level to improve overall recognition performance.
Our method has shown competitive results on the ChaLearn 2014 Looking at People: Gesture Recognition dataset. Trained on a dataset with a 20-gesture vocabulary and 7754 gesture samples, our method achieved a Jaccard Index of 0.74663 on the test set, reaching 7th place among contenders. Among methods that exclusively used skeleton based features, our method obtained the highest recognition performance.
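Decision-level fusion of the per-model predictions can be as simple as a weighted average of class-score vectors; a minimal sketch (the weighting scheme is an assumption, not the paper's fusion rule):

```python
def fuse_decisions(model_scores, weights=None):
    """Decision-level fusion: weighted average of per-class score vectors
    from different models; returns the index of the winning class.

    model_scores: list of per-model score vectors, one score per class.
    """
    n_models = len(model_scores)
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(model_scores[0])
    # Sum each class score across models, scaled by the model weights.
    fused = [sum(w * scores[c] for w, scores in zip(weights, model_scores))
             for c in range(n_classes)]
    return fused.index(max(fused))
```

Fusing at the decision level rather than the feature level lets each Random Decision Forest model keep its own spatio-temporal template representation while still contributing to a single prediction.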
Gesture recognition is becoming popular as an efficient input method for human-computer interaction. However, challenges associated with data collection, data annotation, maintaining standardization, and the high variance of data obtained from different users in different environments make developing such systems a difficult task. The purpose of this study is to integrate domain adaptation methods into the gesture recognition problem. To achieve this, domain adaptation is performed from handwritten digit trajectory data to hand trajectories obtained from depth cameras. The performance of the applied Feature Augmentation method is evaluated through analysis of recognition performance versus the percentage of target-class samples in training, and through analysis of the transferability of different gestures.
The last decade witnessed the rapid development of Wireless Sensor Networks. More recently, the availability of inexpensive hardware such as CMOS cameras and microphones that are able to ubiquitously capture multimedia content from the environment has fostered the development of Wireless Multimedia Sensor Networks (WMSNs). There is a wide range of applications that use Wireless Multimedia Sensor Networks, including indoor surveillance systems. Nearly all surveillance systems start with a motion detection algorithm. After detection of motion in an image, either the motion areas are sent to another algorithm for further processing, or an alarm is sent to the base station informing it that there is motion in the environment. In this paper, we propose a new motion detection algorithm, which is specifically designed for scenarios with no constant movement in the background. Our tests using Goyette’s datasets show that our proposed algorithm achieved 97% accuracy with an average execution time of 48 ms for QVGA images on the ARM9 architecture, and thus outperformed the two currently available methods.
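A background-subtraction detector with a running-average background, one common way to tolerate backgrounds that are not perfectly static, can be sketched as follows; the parameters are illustrative, not the values tuned in the paper:

```python
def detect_motion(frame, background, alpha=0.05, threshold=25, min_ratio=0.01):
    """Background-subtraction motion detector for grayscale frames.

    frame, background: equal-sized 2-D grids of pixel intensities (nested lists).
    Returns (motion_flag, updated_background). The running average lets
    slowly varying backgrounds be absorbed instead of triggering alarms.
    """
    changed = 0
    total = len(frame) * len(frame[0])
    new_bg = []
    for row_f, row_b in zip(frame, background):
        new_row = []
        for f, b in zip(row_f, row_b):
            # Count pixels that differ noticeably from the background model.
            if abs(f - b) > threshold:
                changed += 1
            # Blend the new frame into the background model.
            new_row.append((1 - alpha) * b + alpha * f)
        new_bg.append(new_row)
    return changed / total > min_ratio, new_bg
```

Keeping the per-pixel work to a compare and a blend is what makes this kind of detector feasible within tens of milliseconds on low-power ARM cores.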
The last decade witnessed the rapid development of Wireless Sensor Networks (WSNs). More recently, the availability of inexpensive hardware such as CMOS cameras and microphones that are able to ubiquitously capture multimedia content from the environment has fostered the development of Wireless Multimedia Sensor Networks (WMSNs). Nodes in such networks require a significant amount of processing power to interpret the collected sensor data. Most of the currently available wireless multimedia sensor nodes are equipped with ARM7 core microcontrollers. On the other hand, ARM9 and ARM11 cores are viable alternatives, which deliver deterministic high performance and flexibility for demanding and cost-sensitive embedded applications. Thus, we evaluated the performance of the ARM9 and the ARM11 cores in terms of processing power and energy consumption. Our test results showed that the ARM11 core performed 3 to 4 times faster than the ARM9 core.