
Skeletor: Skeletal Transformers for Robust Body-Pose Estimation

Workshop Paper
Tao Jiang, Necati Cihan Camgoz, Richard Bowden
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021

Predicting 3D human pose from a single monoscopic video can be highly challenging due to factors such as low resolution, motion blur and occlusion, in addition to the fundamental ambiguity in estimating 3D from 2D. Approaches that directly regress the 3D pose from independent images can be particularly susceptible to these factors and result in jitter, noise and/or inconsistencies in skeletal estimation, much of which can be overcome if the temporal evolution of the scene and skeleton is taken into account. However, rather than tracking body parts and trying to temporally smooth them, we propose a novel transformer-based network that can learn a distribution over both pose and motion in an unsupervised fashion. We call our approach Skeletor. Skeletor overcomes inaccuracies in detection and corrects partial or entire skeleton corruption, using strong priors learned from 25 million frames to correct skeleton sequences smoothly and consistently. Skeletor can achieve this as it implicitly learns the spatio-temporal context of human motion via a transformer-based neural network. Extensive experiments show that Skeletor achieves improved performance on 3D human pose estimation and further provides benefits for downstream tasks such as sign language translation.
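
Although the full model is a transformer, the unsupervised objective can be sketched as masked-frame reconstruction over pose sequences. The function names, shapes, and masking ratio below are illustrative assumptions, not Skeletor's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(seq, mask_ratio=0.3, rng=rng):
    """Hide random whole frames of a (T, J, 3) pose sequence,
    BERT-style; a model is trained to reconstruct them."""
    hidden = rng.random(seq.shape[0]) < mask_ratio
    if not hidden.any():           # guarantee at least one masked frame
        hidden[0] = True
    corrupted = seq.copy()
    corrupted[hidden] = 0.0        # corrupt / drop the masked frames
    return corrupted, hidden

def reconstruction_loss(pred, target, hidden):
    """Mean squared error, computed only on the masked frames."""
    return float(np.mean((pred[hidden] - target[hidden]) ** 2))

# Toy sequence: 10 frames, 25 joints, 3D coordinates.
seq = rng.normal(size=(10, 25, 3))
corrupted, hidden = mask_frames(seq)
loss = reconstruction_loss(corrupted, seq, hidden)   # cost of doing nothing
```

A network that has learned a good prior over human motion drives this loss towards zero, which is what lets it repair partially or entirely corrupted skeletons.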

Shadow-Mapping for Unsupervised Neural Causal Discovery

Workshop Paper
Matthew Vowels, Necati Cihan Camgoz, Richard Bowden
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021

An important goal across most scientific fields is the discovery of causal structures underlying a set of observations. Unfortunately, causal discovery methods which are based on correlation or mutual information can often fail to identify causal links in systems which exhibit dynamic relationships. Such dynamic systems (including the famous coupled logistic map) exhibit 'mirage' correlations which appear and disappear depending on the observation window. This means not only that correlation is not causation but, perhaps counter-intuitively, that causation may occur without correlation. In this paper we describe Neural Shadow-Mapping, a neural network based method which embeds high-dimensional video data into a low-dimensional shadow representation, for subsequent estimation of causal links. We demonstrate its performance at discovering causal links from video-representations of dynamic systems.
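
The 'mirage' effect is easy to reproduce. In the coupled logistic map below (a standard example; the coupling strengths and initial conditions are illustrative choices), the two series are causally coupled with fixed strength, yet their windowed correlation drifts from window to window:

```python
import numpy as np

def coupled_logistic(n, bxy=0.02, byx=0.1, x0=0.4, y0=0.2):
    """Simulate a coupled logistic map:
    x -> x(3.8 - 3.8x - bxy*y),  y -> y(3.5 - 3.5y - byx*x)."""
    x = np.empty(n); y = np.empty(n)
    x[0], y[0] = x0, y0
    for t in range(n - 1):
        x[t + 1] = x[t] * (3.8 - 3.8 * x[t] - bxy * y[t])
        y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - byx * x[t])
    return x, y

def windowed_correlations(x, y, window=100):
    """Pearson correlation of x and y over consecutive windows."""
    return [float(np.corrcoef(x[i:i + window], y[i:i + window])[0, 1])
            for i in range(0, len(x) - window + 1, window)]

x, y = coupled_logistic(1000)
corrs = windowed_correlations(x, y)
# The correlation changes from window to window even though the causal
# coupling never changes -- correlation is an unreliable signal here.
```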

Evaluating the Immediate Applicability of Pose Estimation for Sign Language Recognition

Workshop Paper
Amit Moryossef, Ioannis Tsochantaridis, Joe Dinn, Necati Cihan Camgoz, Richard Bowden, Tao Jiang, Annette Rios, Mathias Muller, Sarah Ebling
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021

Signed languages are visual languages produced by the movement of the hands, face, and body. In this paper, we evaluate representations based on skeleton poses, as these are explainable, person-independent, privacy-preserving, low-dimensional representations. Skeletal representations generalize over an individual’s appearance and background, allowing us to focus on the recognition of motion. But how much information is lost by the skeletal representation? We perform two independent studies using two state-of-the-art pose estimation systems. We analyze the applicability of the pose estimation systems to sign language recognition by evaluating the failure cases of the recognition models. Importantly, this allows us to characterize the current limitations of skeletal pose estimation approaches in sign language recognition.

SLRTP 2020: The Sign Language Recognition, Translation & Production Workshop

Workshop Paper
Necati Cihan Camgoz, Gül Varol, Samuel Albanie, Neil Fox, Richard Bowden, Andrew Zisserman, Kearsy Cormier
16th European Conference on Computer Vision (ECCV), Sign Language Recognition, Translation & Production (SLRTP) Workshop, 2020

The objective of the “Sign Language Recognition, Translation & Production” (SLRTP 2020) Workshop was to bring together researchers who focus on the various aspects of sign language understanding using tools from computer vision and linguistics. The workshop sought to promote a greater linguistic and historical understanding of sign languages within the computer vision community, to foster new collaborations and to identify the most pressing challenges for the field going forwards. The workshop was held in conjunction with the European Conference on Computer Vision (ECCV), 2020.

Multi-channel Transformers for Multi-articulatory Sign Language Translation

Workshop Paper
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden
16th European Conference on Computer Vision (ECCV), ACVR Workshop, 2020

Sign languages use multiple asynchronous information channels (articulators), not just the hands but also the face and body, which computational approaches often ignore. In this paper we tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture. The proposed architecture allows both the inter- and intra-contextual relationships between different sign articulators to be modelled within the transformer network itself, while also maintaining channel-specific information. We evaluate our approach on the RWTH-PHOENIX-Weather-2014T dataset and report competitive translation performance. Importantly, we overcome the reliance on gloss annotations which underpin other state-of-the-art approaches, thereby removing the future need for expensive curated datasets.
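
The core idea can be sketched with plain scaled dot-product attention: each articulator channel attends to itself (intra-channel) and to every other channel (inter-channel) while keeping its own output stream. This is a minimal numpy illustration of that pattern, not the published architecture:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: (T_q, d), (T_k, d) -> (T_q, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def multi_channel_attention(channels):
    """Each channel attends to itself and to the other channels,
    and the outputs remain channel-specific."""
    out = []
    for i, q in enumerate(channels):
        ctx = attention(q, q, q)                  # intra-channel context
        for j, kv in enumerate(channels):
            if j != i:
                ctx = ctx + attention(q, kv, kv)  # inter-channel context
        out.append(ctx / len(channels))
    return out

rng = np.random.default_rng(1)
hands, face = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
fused = multi_channel_attention([hands, face])    # still two streams
```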

BosphorusSign22k Sign Language Recognition Dataset

Workshop Paper
Ogulcan Ozdemir, Ahmet Alp Kindiroglu, Necati Cihan Camgoz, Lale Akarun
12th Language Resources and Evaluation Conference (LREC), 9th Workshop on the Representation and Processing of Sign Languages, 2020

Sign Language Recognition is a challenging research domain. It has recently seen several advancements with the increased availability of data. In this paper, we introduce BosphorusSign22k, a publicly available large-scale sign language dataset aimed at the computer vision, video recognition and deep learning research communities. The primary objective of this dataset is to serve as a new benchmark in Turkish Sign Language Recognition thanks to its vast lexicon, the high number of repetitions by native signers, high recording quality, and the unique syntactic properties of the signs it encompasses. We also provide state-of-the-art human pose estimates to encourage other tasks such as Sign Language Production. We survey other publicly available datasets and expand on how BosphorusSign22k can contribute to future research that is being made possible through the widespread availability of similar Sign Language resources. We have conducted extensive experiments and present baseline results to underpin future research on our dataset.

ExTOL: Automatic recognition of British Sign Language using the BSL Corpus

Workshop Paper
Kearsy Cormier, Neil Fox, Bencie Woll, Andrew Zisserman, Necati Cihan Camgoz, Richard Bowden
Sign Language Translation and Avatar Technology (SLTAT), 2019

Here we describe the project “ExTOL: End to End Translation of British Sign Language”, which aims to build the world’s first British Sign Language to English translation system and the first practically functional machine translation system for any sign language.

Poster Presentation

Particle Filter based Probabilistic Forced Alignment for Continuous Gesture Recognition

Workshop Paper
Necati Cihan Camgoz, Simon Hadfield, Richard Bowden
IEEE International Conference on Computer Vision Workshops (ICCVW), 2017

In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatio-temporal deep neural networks using weak border level annotations.

The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples and using this as the prior for the next sampling stage.

We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index Score on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner independent and can easily be combined with other approaches.
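
The alignment loop can be sketched in one dimension: candidate training volumes are drawn from a prior over boundary positions, scored by the classifier, and the normalised scores define the posterior from which the next round is sampled. The Gaussian `score` stand-in (replacing the 3D-CNN) and all parameter values below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(center, true_center=50.0, width=10.0):
    """Stand-in for the classifier's score of a training volume
    centred at `center`; it peaks at the true gesture boundary."""
    return np.exp(-0.5 * ((center - true_center) / width) ** 2)

def refine(particles, iters=5, rng=rng):
    """Score the sampled volumes, treat the normalised scores as a
    posterior, and resample the next round of training locations."""
    for _ in range(iters):
        w = np.array([score(p) for p in particles])
        w = w / w.sum()
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = particles[idx] + rng.normal(0.0, 2.0, len(particles))
    return particles

prior = rng.uniform(0, 100, size=200)   # weak border-level annotation
posterior = refine(prior)
# The particle cloud tightens around the true boundary near t = 50.
```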

Using Convolutional 3D Neural Networks for User-Independent Continuous Gesture Recognition

Workshop Paper
Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Richard Bowden
IEEE International Conference on Pattern Recognition (ICPR), ChaLearn Workshop, 2016

In this paper, we propose using 3D Convolutional Neural Networks for large scale user-independent continuous gesture recognition. We have trained an end-to-end deep network for continuous gesture recognition (jointly learning both the feature representation and the classifier). The network performs three-dimensional (i.e. space-time) convolutions to extract features related to both the appearance and motion from volumes of color frames. Space-time invariance of the extracted features is encoded via pooling layers. The earlier stages of the network are partially initialized using the work of Tran et al. before being adapted to the task of gesture recognition. An earlier version of the proposed method, which was trained for 11,250 iterations, was submitted to ChaLearn 2016 Continuous Gesture Recognition Challenge and ranked 2nd with the Mean Jaccard Index Score of 0.269235. When the proposed method was further trained for 28,750 iterations, it achieved state-of-the-art performance on the same dataset, yielding a 0.314779 Mean Jaccard Index Score.
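
The space-time shape arithmetic follows the usual convolution output formula applied to all three axes. The layer sizes below mirror a C3D-style opening stage (per Tran et al.) and are illustrative:

```python
def conv3d_out(shape, kernel, stride=(1, 1, 1), pad=(0, 0, 0)):
    """Output size of a 3D (space-time) convolution or pooling layer
    along each of the (time, height, width) axes."""
    return tuple((s + 2 * p - k) // st + 1
                 for s, k, st, p in zip(shape, kernel, stride, pad))

# A 3x3x3 convolution with padding 1 preserves the volume; a 1x2x2
# pooling then halves only the spatial axes, keeping temporal detail.
vol = (16, 112, 112)                                   # frames, H, W
after_conv = conv3d_out(vol, (3, 3, 3), pad=(1, 1, 1))
after_pool = conv3d_out(after_conv, (1, 2, 2), stride=(1, 2, 2))
```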

Sign Language Recognition for Assisting the Deaf in Hospitals

Workshop Paper
Necati Cihan Camgoz, Ahmet Alp Kindiroglu, Lale Akarun
Human Behavior Understanding, 2016

In this study, a real-time, computer vision based sign language recognition system aimed at aiding hearing impaired users in a hospital setting has been developed. By directing them through a tree of questions, the system allows the user to state their purpose of visit by answering between four to six questions. The deaf user can use sign language to communicate with the system, which provides a written transcript of the exchange. A database collected from six users was used for the experiments. User independent tests without using the tree-based interaction scheme yield a 96.67% accuracy among 1257 sign samples belonging to 33 sign classes. The experiments evaluated the effectiveness of the system in terms of feature selection and spatio-temporal modelling. The combination of hand position and movement features modelled by Temporal Templates and classified by Random Decision Forests yielded the best results. The tree-based interaction scheme further increased the recognition performance to more than 97.88%.

Facial Landmark Localization in Depth Images using Supervised Ridge Descent

Workshop Paper
Necati Cihan Camgöz, Vitomir Struc, Berk Gokberk, Lale Akarun, Ahmet Alp Kındıroğlu
IEEE International Conference on Computer Vision (ICCV), ChaLearn Looking at People Workshop, 11-18 December 2015

Supervised Descent Method (SDM) has proven successful in many computer vision applications such as face alignment, tracking and camera calibration. Recent studies using SDM achieved state-of-the-art performance on facial landmark localization in depth images. In this study, we propose to use ridge regression instead of least squares regression for learning the SDM, and to change feature sizes in each iteration, effectively turning the landmark search into a coarse-to-fine process. We apply the proposed method to facial landmark localization on the Bosphorus 3D Face Database, using frontal depth images with no occlusion. Experimental results confirm that both ridge regression and adaptive feature sizes improve the localization accuracy considerably.
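
The switch from least squares to ridge regression amounts to adding λI to the normal equations when fitting each descent map. A minimal sketch on synthetic data (all shapes and values are illustrative, not the paper's setup):

```python
import numpy as np

def sdm_step(features, deltas, lam=0.1):
    """Learn one SDM update: a linear map R from image features to
    landmark corrections, fit with ridge regression.  Setting lam=0
    recovers plain least squares."""
    F = np.asarray(features)             # (n_samples, n_features)
    D = np.asarray(deltas)               # (n_samples, n_landmark_dims)
    A = F.T @ F + lam * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ D)   # R: (n_features, n_landmark_dims)

rng = np.random.default_rng(0)
F = rng.normal(size=(100, 20))
R_true = rng.normal(size=(20, 4))
D = F @ R_true + 0.01 * rng.normal(size=(100, 4))
R_ridge = sdm_step(F, D, lam=1.0)
R_ls = sdm_step(F, D, lam=0.0)
# Ridge shrinks the solution, trading a little bias for stability
# when the feature matrix is noisy or poorly conditioned.
```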

Gesture Recognition Using Template Based Random Forest Classifiers

Workshop Paper
Necati Cihan Camgöz, Ahmet Alp Kındıroğlu, Lale Akarun
13th European Conference on Computer Vision (ECCV), ChaLearn Looking at People Workshop, 6-12 September 2014

This paper presents a framework for spotting and recognizing continuous human gestures. Skeleton-based features are extracted from normalized human body coordinates to represent gestures. These features are then used to construct spatio-temporal template based Random Decision Forest models. Finally, predictions from different models are fused at decision level to improve overall recognition performance.

Our method has shown competitive results on the ChaLearn 2014 Looking at People: Gesture Recognition dataset. Trained on a dataset with a 20-gesture vocabulary and 7,754 gesture samples, our method achieved a Jaccard Index of 0.74663 on the test set, reaching 7th place among contenders. Among methods that exclusively used skeleton-based features, ours obtained the highest recognition performance.
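
The temporal-template idea can be illustrated with the classic motion-history image: pixels (or joints) that moved most recently carry the highest value, and older motion decays linearly. This is a generic Bobick-Davis-style sketch, not the paper's exact skeleton-based template:

```python
import numpy as np

def motion_history(frames, tau=10.0):
    """Build a motion-history image from a (T, H, W) stack of binary
    motion masks: moving pixels are set to tau, older motion decays
    by one per frame until it reaches zero."""
    mhi = np.zeros(frames.shape[1:], dtype=float)
    for mask in frames:
        mhi = np.where(mask > 0, tau, np.maximum(mhi - 1.0, 0.0))
    return mhi

# Toy example: a pixel "moves" only in the first of five frames,
# so its history value decays from tau to tau - 4.
frames = np.zeros((5, 4, 4))
frames[0, 1, 1] = 1
mhi = motion_history(frames, tau=10.0)
```

The resulting template encodes when and where motion occurred in a single image, which is what makes it a convenient input for a Random Decision Forest classifier.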