VDSM: Unsupervised Video Disentanglement with State-Space Modeling and Deep Mixtures of Experts

Conference Paper
Matthew J. Vowels, Necati Cihan Camgoz, Richard Bowden
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Disentangled representations support a range of downstream tasks including causal reasoning, generative modeling, and fair machine learning. Unfortunately, disentanglement has been shown to be impossible without the incorporation of supervision or inductive bias. Given that supervision is often expensive or infeasible to acquire, we choose to incorporate structural inductive bias and present an unsupervised, deep State-Space Model for Video Disentanglement (VDSM). The model disentangles latent time-invariant and time-varying factors via the incorporation of hierarchical structure with a dynamic prior and a Mixture of Experts decoder. VDSM learns separate disentangled representations for the identity of the object or person in the video, and for the action being performed. We evaluate VDSM across a range of qualitative and quantitative tasks including identity and dynamics transfer, sequence generation, Fréchet Inception Distance, and factor classification. VDSM achieves state-of-the-art performance and exceeds adversarial methods, even when those methods use additional supervision.
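
A minimal PyTorch sketch of the two mechanisms the abstract names: a sequence-level identity code that gates a Mixture of Experts decoder, and a per-frame dynamic latent drawn from a learned transition prior. All module names, dimensions, and the overall wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a static identity code z_id gates a
# mixture-of-experts decoder, while per-frame dynamics z_t follow a learned
# transition prior. Dimensions and module names are illustrative.
import torch
import torch.nn as nn

class MoEDecoder(nn.Module):
    def __init__(self, z_dyn=16, z_id=8, n_experts=4, out_dim=64 * 64):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(z_dyn, 256), nn.ReLU(), nn.Linear(256, out_dim))
             for _ in range(n_experts)])
        self.gate = nn.Linear(z_id, n_experts)  # identity chooses the expert mix

    def forward(self, z_t, z_id):
        w = torch.softmax(self.gate(z_id), dim=-1)             # (B, n_experts)
        outs = torch.stack([e(z_t) for e in self.experts], 1)  # (B, n_experts, D)
        return (w.unsqueeze(-1) * outs).sum(1)                 # weighted frame decode

class DynamicPrior(nn.Module):
    """p(z_t | z_{t-1}) as a Gaussian whose parameters come from a small network."""
    def __init__(self, z_dyn=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dyn, 64), nn.Tanh(), nn.Linear(64, 2 * z_dyn))

    def forward(self, z_prev):
        mu, logvar = self.net(z_prev).chunk(2, dim=-1)
        return mu, logvar

# Rolling the dynamic prior forward and decoding each step:
B, T, z_dyn, z_id_dim = 2, 10, 16, 8
prior, dec = DynamicPrior(z_dyn), MoEDecoder(z_dyn, z_id_dim)
z_id = torch.randn(B, z_id_dim)          # sequence-level (time-invariant) code
z_t = torch.zeros(B, z_dyn)
frames = []
for _ in range(T):
    mu, logvar = prior(z_t)
    z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised sample
    frames.append(dec(z_t, z_id))
video = torch.stack(frames, dim=1)        # (B, T, out_dim)
```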

SeeHear: Signer Diarisation and a New Dataset

Conference Paper
Samuel Albanie, Gul Varol, Liliane Momeni, Triantafyllos Afouras, Andrew Brown, Chuhan Zhang, Ernesto Coto, Necati Cihan Camgoz, Ben Saunders, Abhishek Dutta, Neil Fox, Richard Bowden, Bencie Woll, Andrew Zisserman
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021

In this work, we propose a framework to collect a large-scale, diverse sign language dataset that can be used to train automatic sign language recognition models.

The first contribution of this work is SDTrack, a generic method for signer tracking and diarisation in the wild. Our second contribution is SeeHear, a dataset of 90 hours of British Sign Language (BSL) content featuring a wide range of signers, and including interviews, monologues and debates. Using SDTrack, the SeeHear dataset is annotated with 35K active signing tracks, with corresponding signer identities and subtitles, and 40K automatically localised sign labels. As a third contribution, we provide benchmarks for signer diarisation and sign recognition on SeeHear.

Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation

Conference Paper
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020 (Oral)

Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss level tokenization in order to work. We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solving two co-dependent sequence-to-sequence learning problems and leading to significant performance gains.

We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
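
A hedged sketch of how a single CTC loss can bind recognition to an encoder-decoder translation model, as the abstract describes: the encoder output feeds a gloss classification head trained with CTC, while the decoder is trained with the usual cross-entropy over spoken-language words. Vocabulary sizes, layer choices and the loss weighting are placeholders, not the released implementation.

```python
# Sketch of the joint objective (illustrative, not the released code): a CTC loss
# over encoder outputs supervises gloss recognition, while cross-entropy over
# decoder outputs supervises spoken-language translation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignLanguageTransformer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, n_gloss=1100, n_words=2900):
        super().__init__()                                    # vocab sizes are placeholders
        self.embed = nn.Linear(feat_dim, d_model)
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.gloss_head = nn.Linear(d_model, n_gloss + 1)     # +1 for the CTC blank
        self.word_embed = nn.Embedding(n_words, d_model)
        self.word_head = nn.Linear(d_model, n_words)

    def forward(self, frames, tgt_words):
        memory = self.transformer.encoder(self.embed(frames))
        gloss_logits = self.gloss_head(memory)                # recognition branch
        tgt = self.word_embed(tgt_words)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer.decoder(tgt, memory, tgt_mask=mask)
        return gloss_logits, self.word_head(out)              # translation branch

def joint_loss(gloss_logits, word_logits, gloss_tgt, word_tgt,
               frame_lens, gloss_lens, lam_rec=1.0, lam_trans=1.0):
    log_probs = gloss_logits.log_softmax(-1).transpose(0, 1)  # (T, B, C) for CTC
    rec = F.ctc_loss(log_probs, gloss_tgt, frame_lens, gloss_lens,
                     blank=gloss_logits.size(-1) - 1)
    trans = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                            word_tgt.reshape(-1))
    return lam_rec * rec + lam_trans * trans
```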

Progressive Transformers for End-to-End Sign Language Production

Conference Paper
Ben Saunders, Necati Cihan Camgoz, Richard Bowden
16th European Conference of Computer Vision (ECCV), 2020

The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video at a level comparable to a human translator. If this were achievable, it would revolutionise Deaf-hearing communication. Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences.

In this paper, we propose Progressive Transformers, a novel architecture that can translate from discrete spoken language sentences to continuous 3D skeleton pose outputs representing sign language. We present two model configurations, an end-to-end network that produces sign directly from text and a stacked network that utilises a gloss intermediary.

Our transformer network architecture introduces a counter that enables continuous sequence generation at training and inference. We also provide several data augmentation processes to overcome the problem of drift and improve the performance of SLP models. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future research.
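
A toy sketch of counter-based continuous decoding: each regressed pose frame carries an extra counter value in [0, 1], and generation stops once the predicted counter saturates. The GRU stands in for the paper's transformer decoder; dimensions, the stopping threshold and the feedback loop are illustrative assumptions.

```python
# Illustrative sketch of counter-based continuous decoding (not the authors'
# implementation): each regressed pose frame carries an extra "counter" value in
# [0, 1]; generation stops once the predicted counter reaches ~1.
import torch
import torch.nn as nn

POSE_DIM = 150            # e.g. flattened 3D skeleton coordinates (placeholder)

decoder = nn.GRU(POSE_DIM + 1, 256, batch_first=True)   # stand-in for the transformer decoder
head = nn.Linear(256, POSE_DIM + 1)                     # next pose + counter

@torch.no_grad()
def generate(max_len=200):
    frame = torch.zeros(1, 1, POSE_DIM + 1)             # start token: zero pose, counter 0
    hidden, poses = None, []
    for _ in range(max_len):
        out, hidden = decoder(frame, hidden)
        pred = head(out)                                 # (1, 1, POSE_DIM + 1)
        pose, counter = pred[..., :POSE_DIM], torch.sigmoid(pred[..., POSE_DIM:])
        poses.append(pose.squeeze())
        if counter.item() >= 0.99:                       # counter ~1 marks end of sequence
            break
        frame = torch.cat([pose, counter], dim=-1)       # feed prediction back in
    return torch.stack(poses)                            # (T, POSE_DIM)

skeleton_sequence = generate()
```

During training, the counter target is simply a linear ramp from 0 to 1 over the ground-truth sequence, so its regression loss sits alongside the pose loss.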

NestedVAE: Isolating Common Factors via Weak Supervision

Conference Paper
Matthew J. Vowels, Necati Cihan Camgoz, Richard Bowden
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Fair and unbiased machine learning is an important and active field of research, as decision processes are increasingly driven by models that learn from data. Unfortunately, any biases present in the data may be learned by the model, thereby inappropriately transferring that bias into the decision making process. We identify the connection between the task of bias reduction and that of isolating factors common between domains whilst encouraging domain specific invariance. To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory for scenarios whereby data may be naturally paired across domains and no additional supervision is required. The result is the Nested Variational AutoEncoder (NestedVAE). Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image, from the latent representation of its paired image. In so doing, the nested VAE isolates the common latent factors/causes and becomes invariant to unwanted factors that are not shared between paired images. We also propose a new metric to provide a balanced method of evaluating consistency and classifier performance across domains which we refer to as the Adjusted Parity metric. An evaluation of NestedVAE on both domain and attribute invariance, change detection, and learning common factors for the prediction of biological sex demonstrates that NestedVAE significantly outperforms alternative methods.
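
A minimal sketch of the nesting idea under stated assumptions: a shared-weight outer VAE encodes each image of a pair, and an inner (nested) VAE tries to predict the latent of one image from the latent of its partner, so that the inner code retains only the common factors. Whether gradients from the inner VAE flow back into the outer encoders is a design choice; here they are stopped with detach for simplicity, and all dimensions are placeholders.

```python
# Minimal sketch of NestedVAE's structure (dimensions and design details are
# illustrative): an outer VAE with shared weights encodes both images of a pair,
# and an inner VAE reconstructs one latent from the other.
import torch
import torch.nn as nn
import torch.nn.functional as F

def reparam(mu, logvar):
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

class VAE(nn.Module):
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, -1)
        z = reparam(mu, logvar)
        return self.dec(z), mu, logvar, z

outer = VAE(in_dim=784, z_dim=32)      # shared weights: the same module sees both images
inner = VAE(in_dim=32, z_dim=8)        # nested VAE operates on the outer latent

def nested_vae_loss(x_a, x_b, beta=1.0):
    rec_a, mu_a, lv_a, z_a = outer(x_a)
    rec_b, mu_b, lv_b, z_b = outer(x_b)
    z_b_hat, mu_i, lv_i, _ = inner(z_a.detach())   # predict partner's latent from z_a
    kl = lambda mu, lv: -0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(-1).mean()
    outer_loss = (F.mse_loss(rec_a, x_a) + F.mse_loss(rec_b, x_b)
                  + beta * (kl(mu_a, lv_a) + kl(mu_b, lv_b)))
    inner_loss = F.mse_loss(z_b_hat, z_b.detach()) + beta * kl(mu_i, lv_i)
    return outer_loss + inner_loss
```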

Gated Variational AutoEncoders: Incorporating Weak Supervision to Encourage Disentanglement

Conference Paper
Matthew J. Vowels, Necati Cihan Camgoz, Richard Bowden
IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2020

Variational AutoEncoders (VAEs) provide a means to generate representational latent embeddings. Previous research has highlighted the benefits of achieving representations that are disentangled, particularly for downstream tasks. However, there is some debate about how to encourage disentanglement with VAEs, and evidence indicates that existing implementations of VAEs do not achieve disentanglement consistently. How well a VAE's latent space has been disentangled is often evaluated against our subjective expectations of which attributes should be disentangled for a given problem. Therefore, by definition, we already have domain knowledge of what should be achieved, and yet we use unsupervised approaches to achieve it. We propose a weakly-supervised approach that incorporates any available domain knowledge into the training process to form a Gated-VAE. The process involves partitioning the representational embedding and gating backpropagation. All partitions are utilised on the forward pass but gradients are backpropagated through different partitions according to selected image/target pairings. The approach can be used to modify existing VAE models such as beta-VAE, InfoVAE and DIP-VAE-II. Experiments demonstrate that, using gated backpropagation, latent factors are represented in their intended partition. The approach is applied to images of faces for the purpose of disentangling head-pose from facial expression. Quantitative metrics show that using Gated-VAE improves average disentanglement, completeness and informativeness, as compared with un-gated implementations. Qualitative assessment of latent traversals demonstrates its disentanglement of head-pose from expression, even when only weak/noisy supervision is available.
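
A small illustration of gated backpropagation as described above: the latent is split into partitions, all partitions are used on the forward pass, and gradients are blocked for every partition except the one assigned to the current image/target pairing. Partition sizes and the pose/expression assignment are assumptions for the example.

```python
# Sketch of gated backpropagation (an illustration of the idea, not the paper's
# code): the latent is split into partitions, all of which are used on the
# forward pass, but gradients only flow through the partition assigned to the
# current image/target pairing; the rest are detached.
import torch

def gate_partitions(z, active, sizes=(8, 8)):
    """Split z into partitions of the given sizes and detach all but `active`."""
    parts = list(torch.split(z, list(sizes), dim=-1))
    gated = [p if i == active else p.detach() for i, p in enumerate(parts)]
    return torch.cat(gated, dim=-1)

# Usage inside a VAE training step: for a (head-pose, expression) setup, batches
# whose image/target pairs differ only in head-pose backpropagate through
# partition 0, expression batches through partition 1.
z = torch.randn(16, 16, requires_grad=True)
z_pose_batch = gate_partitions(z, active=0)   # gradients reach dims 0..7 only
z_expr_batch = gate_partitions(z, active=1)   # gradients reach dims 8..15 only
```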

Adversarial Training for Multi-Channel Sign Language Production

Conference Paper
Ben Saunders, Necati Cihan Camgoz, Richard Bowden
British Machine Vision Conference (BMVC), 2020

Sign Languages are rich multi-channel languages, requiring articulation of both manual (hands) and non-manual (face and body) features in a precise, intricate manner. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody this full sign morphology to be truly understandable by the Deaf community. Previous work has mainly focused on manual feature production, with an under-articulated output caused by regression to the mean.

In this paper, we propose an Adversarial Multi-Channel approach to SLP. We frame sign production as a minimax game between a transformer-based Generator and a conditional Discriminator. Our adversarial discriminator evaluates the realism of sign production conditioned on the source text, pushing the generator towards a realistic and articulate output. Additionally, we fully encapsulate sign articulators with the inclusion of non-manual features, producing facial features and mouthing patterns.

We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, and report state-of-the-art SLP back-translation performance for manual production. We set new benchmarks for the production of multi-channel sign to underpin future research into realistic SLP.
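
An illustrative version of the adversarial objective: a conditional discriminator scores produced pose sequences given an encoding of the source text, and its score is added to the generator's pose regression loss. Channel dimensions, the pooled text representation and the loss weighting are assumptions, not the paper's exact formulation.

```python
# Illustrative adversarial objective (module names and dimensions are placeholders):
# a conditional discriminator scores produced pose sequences given the source-text
# encoding, and its score is added to the generator's pose regression loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

POSE_DIM, TEXT_DIM = 150 + 50, 512     # manual + non-manual channels (illustrative)

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(POSE_DIM + TEXT_DIM, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))

    def forward(self, poses, text_feat):
        # score each frame conditioned on a pooled text representation
        text = text_feat.unsqueeze(1).expand(-1, poses.size(1), -1)
        return self.net(torch.cat([poses, text], dim=-1)).mean(dim=1)

disc = CondDiscriminator()

def generator_loss(pred_poses, gt_poses, text_feat, lam_adv=0.1):
    reg = F.mse_loss(pred_poses, gt_poses)                       # regression-to-pose term
    adv = F.binary_cross_entropy_with_logits(
        disc(pred_poses, text_feat), torch.ones(pred_poses.size(0), 1))
    return reg + lam_adv * adv

def discriminator_loss(pred_poses, gt_poses, text_feat):
    real = F.binary_cross_entropy_with_logits(
        disc(gt_poses, text_feat), torch.ones(gt_poses.size(0), 1))
    fake = F.binary_cross_entropy_with_logits(
        disc(pred_poses.detach(), text_feat), torch.zeros(pred_poses.size(0), 1))
    return real + fake
```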

A Phonology-based Approach for Isolated Sign Production Assessment in Sign Language

Conference Paper
Sandrine Tornay, Necati Cihan Camgoz, Richard Bowden, Mathew Magimai-Doss
22nd ACM International Conference on Multimodal Interaction (Late-Breaking Results), 2020

Interactive learning platforms are among the top choices for acquiring new languages. Such applications or platforms are more easily available for spoken languages, but rarely for sign languages. Assessment of the production of signs is a challenging problem because of the multichannel aspect (e.g., hand shape, hand movement, mouthing, facial expression) inherent in sign languages. In this paper, we propose an automatic sign language production assessment approach which allows assessment of two linguistic aspects: (i) the produced lexeme and (ii) the produced forms. On a linguistically annotated Swiss German Sign Language dataset, the SMILE DSGS corpus, we demonstrate that the proposed approach can effectively assess the two linguistic aspects in an integrated manner.

HMM-based Approaches to Model Multichannel Information in Sign Language Inspired from Articulatory Features-based Speech Processing

Conference Paper
Sandrine Tornay, Marzieh Razavi, Necati Cihan Camgoz, Richard Bowden, Mathew Magimai-Doss
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019

Sign language conveys information through multiple channels, such as hand shape, hand movement, and mouthing. Modeling this multichannel information is a highly challenging problem. In this paper, we elucidate the link between spoken language and sign language in terms of production phenomenon and perception phenomenon. Through this link we show that hidden Markov model-based approaches developed to model “articulatory” features for spoken language processing can be exploited to model the multichannel information inherent in sign language for sign language processing.
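
A toy NumPy sketch of the multichannel idea (not the specific HMM variant used in the paper): each channel, such as hand shape or mouthing, is scored by its own HMM via the forward algorithm, and a candidate sign's score combines the per-channel log-likelihoods under a conditional-independence assumption.

```python
# Toy sketch of multichannel HMM scoring: one HMM per channel, combined by
# summing log-likelihoods. Transition matrices and observations are placeholders.
import numpy as np

def forward_loglik(log_obs, log_A, log_pi):
    """Standard HMM forward algorithm in the log domain.
    log_obs: (T, S) per-frame observation log-probs for S states."""
    alpha = log_pi + log_obs[0]
    for t in range(1, len(log_obs)):
        alpha = log_obs[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

def score_sign(channel_obs, channel_hmms):
    """Sum per-channel HMM log-likelihoods for one candidate sign."""
    total = 0.0
    for obs, (log_A, log_pi) in zip(channel_obs, channel_hmms):
        total += forward_loglik(obs, log_A, log_pi)
    return total

# Toy usage: two channels, each modelled by a 3-state left-to-right HMM.
S, T = 3, 40
log_A = np.log(np.array([[0.7, 0.3, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]) + 1e-12)
log_pi = np.log(np.array([1.0, 0.0, 0.0]) + 1e-12)
channel_obs = [np.log(np.random.dirichlet(np.ones(S), size=T)) for _ in range(2)]
print(score_sign(channel_obs, [(log_A, log_pi)] * 2))
```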

SMILE Swiss German Sign Language Dataset

Conference Paper
Sarah Ebling, Necati Cihan Camgoz, Penny Boyes Braem, Katja Tissi, Sandra Sidler-Miserez, Stephanie Stoll, Simon Hadfield, Tobias Haug, Richard Bowden, Sandrine Tornay, Marzieh Razavi, Mathew Magimai-Doss
11th Edition of the Language Resources and Evaluation Conference (LREC), 2018

Sign language recognition (SLR) involves identifying the form and meaning of isolated signs or sequences of signs. To our knowledge, the combination of SLR and sign language assessment is novel. The goal of an ongoing three-year project in Switzerland is to pioneer an assessment system for lexical signs of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) that relies on SLR. The assessment system aims to give adult L2 learners of DSGS feedback on the correctness of the manual parameters (handshape, hand position, location, and movement) of isolated signs they produce. In its initial version, the system will include automatic feedback for a subset of a DSGS vocabulary production test consisting of 100 lexical items. To provide the SLR component of the assessment system with sufficient training samples, a large-scale dataset containing videotaped repeated productions of the 100 items of the vocabulary test with associated transcriptions and annotations was created, consisting of data from 11 adult L1 signers and 19 adult L2 learners of DSGS. This paper introduces the dataset, which will be made available to the research community.

Sign Language Production using Neural Machine Translation and Generative Adversarial Networks

Conference Paper
Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield, Richard Bowden
British Machine Vision Conference (BMVC), 2018 (Oral)

We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.
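
A schematic of the three-stage decomposition described above. Every module is a small stand-in (a toy sequence model for text-to-gloss, a learned gloss-to-pose lookup, and a toy pose-conditioned frame generator) rather than the NMT and GAN components used in the paper; vocabulary sizes, joint counts and resolutions are placeholders.

```python
# Schematic pipeline: spoken text -> gloss sequence -> skeletal poses -> video.
# All three modules are toy stand-ins with placeholder sizes.
import torch
import torch.nn as nn

class Text2Gloss(nn.Module):                 # spoken words -> gloss sequence
    def __init__(self, text_vocab=3000, gloss_vocab=1100, d=256):
        super().__init__()
        self.emb = nn.Embedding(text_vocab, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, gloss_vocab)

    def forward(self, text_ids):
        h, _ = self.rnn(self.emb(text_ids))
        return self.out(h).argmax(-1)        # greedy gloss ids, one per input step (toy)

class Gloss2Pose(nn.Module):                 # gloss ids -> skeletal keypoints per frame
    def __init__(self, gloss_vocab=1100, joints=50):
        super().__init__()
        self.lookup = nn.Embedding(gloss_vocab, joints * 2)

    def forward(self, gloss_ids):
        return self.lookup(gloss_ids).view(*gloss_ids.shape, -1, 2)

class Pose2Video(nn.Module):                 # pose-conditioned frame generator
    def __init__(self, joints=50, hw=64):
        super().__init__()
        self.gen = nn.Sequential(nn.Linear(joints * 2, 512), nn.ReLU(),
                                 nn.Linear(512, 3 * hw * hw), nn.Tanh())
        self.hw = hw

    def forward(self, poses):
        B, T = poses.shape[:2]
        return self.gen(poses.flatten(2)).view(B, T, 3, self.hw, self.hw)

text = torch.randint(0, 3000, (1, 8))        # a tokenised spoken-language sentence
video = Pose2Video()(Gloss2Pose()(Text2Gloss()(text)))   # (1, T, 3, 64, 64)
```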

Neural Sign Language Translation

Conference Paper
Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, Richard Bowden
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar.

We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language.

To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.
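
A compact sketch of the end-to-end, frame-level tokenization setting: a spatial CNN embeds each video frame and an attention-based encoder-decoder maps the embedded sequence to spoken-language words. The backbone, attention module and all sizes are illustrative stand-ins, not the paper's architecture.

```python
# Sketch of frame-level Neural SLT (illustrative stand-ins throughout): per-frame
# spatial embedding followed by an attention-based encoder-decoder over words.
import torch
import torch.nn as nn

class NeuralSLT(nn.Module):
    def __init__(self, vocab=2900, d_model=512):
        super().__init__()
        self.cnn = nn.Sequential(                       # stand-in for a full CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.word_embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(2 * d_model, vocab)

    def forward(self, frames, tgt_words):
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(B, T, -1)   # per-frame spatial embedding
        memory, _ = self.encoder(feats)
        dec, _ = self.decoder(self.word_embed(tgt_words))
        ctx, _ = self.attn(dec, memory, memory)                 # attend over sign frames
        return self.out(torch.cat([dec, ctx], dim=-1))          # word logits per step

model = NeuralSLT()
logits = model(torch.randn(2, 16, 3, 64, 64), torch.randint(0, 2900, (2, 7)))
```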

SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition

Conference Paper
Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Richard Bowden
IEEE International Conference on Computer Vision (ICCV), 2017 (Oral: Spotlight)

We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as “Sequence-to-sequence” learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while remaining trainable end-to-end.

The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also allow us to implicitly perform transfer learning between different interrelated tasks, which also allows us to exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties serves to significantly improve the performance of the overarching recognition system, by better constraining the learning problem.

The proposed techniques are demonstrated in the challenging domain of sign language recognition. We demonstrate state-of-the-art performance on hand-shape recognition (outperforming previous techniques by more than 30%). Furthermore, we are able to obtain comparable sign recognition rates to previous research, without the need for an alignment step to segment out the signs for recognition.
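
A rough sketch of the SubUNet decomposition: a hand-shape subnetwork produces an intermediate representation supervised with its own CTC loss, and a sign-level subnetwork consumes that representation under a second CTC loss, with both trained jointly end-to-end. Layer choices, feature dimensions and class counts are assumptions for the example.

```python
# Sketch of two chained SubUNets (layer choices and class counts are placeholders):
# a hand-shape expert feeds a sign-level expert, each with its own CTC supervision.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubUNet(nn.Module):
    """One expert: BLSTM over its input features with a CTC-supervised output layer."""
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes + 1)      # +1 for CTC blank

    def forward(self, x):
        h, _ = self.blstm(x)
        return h, self.head(h)                                # features + per-frame logits

hand_net = SubUNet(in_dim=1024, hidden=256, n_classes=60)     # hand-shape classes (placeholder)
sign_net = SubUNet(in_dim=512, hidden=256, n_classes=1200)    # sign vocabulary (placeholder)

def subunets_loss(frame_feats, hand_tgt, sign_tgt, in_lens, hand_lens, sign_lens):
    h_feats, hand_logits = hand_net(frame_feats)              # intermediate representation
    _, sign_logits = sign_net(h_feats)
    ctc = lambda logits, tgt, tl: F.ctc_loss(
        logits.log_softmax(-1).transpose(0, 1), tgt, in_lens, tl,
        blank=logits.size(-1) - 1)
    return ctc(hand_logits, hand_tgt, hand_lens) + ctc(sign_logits, sign_tgt, sign_lens)
```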

BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains

Conference Paper
Necati Cihan Camgöz, Ahmet Alp Kındıroğlu, Serpil Karabüklü, Meltem Kelepir, A. Sumru Özsoy, Lale Akarun
10th Edition of the Language Resources and Evaluation Conference (LREC), 23-28 May 2016

There are as many sign languages as there are deaf communities in the world. Linguists have been collecting corpora of different sign languages and annotating them extensively in order to study and understand their properties. On the other hand, the field of computer vision has approached the sign language recognition problem as a grand challenge and research efforts have intensified in the last 20 years. However, corpora collected for studying linguistic properties are often not suitable for sign language recognition as the statistical methods used in the field require large amounts of data. Recently, with the availability of inexpensive depth cameras, groups from the computer vision community have started collecting corpora with a large number of repetitions for sign language recognition research. In this paper, we present the BosphorusSign Turkish Sign Language corpus, which consists of 855 sign and phrase samples from the health, finance and everyday life domains. The corpus is collected using the state-of-the-art Microsoft Kinect v2 depth sensor, and will be the first of its kind in this field of sign language research. Furthermore, annotations rendered by linguists will be provided so that the corpus appeals to both the linguistic and sign language recognition research communities.

HospiSign: An Interactive Sign Language Platform for Hearing Impaired

Conference Paper
Muhammed Miraç Süzgün, Hilal Özdemir, Necati Cihan Camgöz, Ahmet Alp Kındıroğlu, Doğaç Başaran, Cengiz Togay, Lale Akarun
International Conference on Computer Graphics, Animation and Gaming Technologies (Eurasia Graphics), 2015

Sign language is the natural medium of communication for the Deaf community. In this study, we have developed an interactive communication interface for hospitals, HospiSign, using computer vision based sign language recognition methods. The objective of this paper is to review sign language based human-computer interaction applications and to introduce HospiSign in this context. HospiSign is designed to meet deaf people at the information desk of a hospital and to assist them in their visit. The interface guides the deaf visitors to answer certain questions and express the intention of their visit, in sign language, without the need of a translator. The system consists of a computer, a touch display to visualize the interface, and a Microsoft Kinect v2 sensor to capture the users' sign responses. HospiSign recognizes isolated signs in a structured activity diagram using Dynamic Time Warping based classifiers. In order to evaluate the developed interface, we performed usability tests and saw that the system was able to assist its users in real time with high accuracy.
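
A small sketch of DTW-based isolated sign classification in the spirit of the system described above: each incoming skeleton-feature sequence is matched against stored sign templates by dynamic time warping and labelled with the nearest one. The feature layout and templates are placeholders.

```python
# Toy DTW-based isolated sign classifier: nearest template by DTW distance.
# Feature dimensions and templates are placeholders, not the deployed system.
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two (T, D) feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify(query, templates):
    """templates: list of (label, sequence); returns the label of the DTW-nearest one."""
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]

# Toy usage with random skeleton features (e.g. Kinect joint coordinates per frame).
rng = np.random.default_rng(0)
templates = [("yes", rng.normal(size=(30, 45))), ("no", rng.normal(size=(25, 45)))]
print(classify(rng.normal(size=(28, 45)), templates))
```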