Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Journal Paper
Benjamin Saunders, Necati Cihan Camgoz, Richard Bowden
International Journal of Computer Vision (IJCV), 2021

Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate. Sign Language Production (SLP), the automatic translation from spoken languages to sign languages, must embody both the continuous articulation and the full morphology of sign to be truly understandable by the Deaf community. Previous deep-learning-based SLP works have produced only concatenations of isolated signs, focusing primarily on the manual features and leading to robotic, non-expressive productions.

In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable-length continuous sequence generation by tracking production progress over time and predicting the end of the sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a Mixture Density Network (MDN) formulation, to produce realistic and expressive sign pose sequences.

We propose a back-translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model to understand how the Deaf community receives our sign pose productions.
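
To illustrate the counter-decoding idea described above, here is a minimal PyTorch-style sketch of a decoder output head that emits a pose frame together with a normalised progress counter; the module name, hidden size, and pose dimensionality are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class CounterDecoderHead(nn.Module):
    # Hypothetical head: maps a decoder hidden state to a continuous 3D pose
    # frame plus a progress counter in [0, 1] (0 = start, 1 = end of sequence).
    def __init__(self, hidden_dim=512, pose_dim=150):  # e.g. 50 joints x 3 coords (assumption)
        super().__init__()
        self.pose_out = nn.Linear(hidden_dim, pose_dim)
        self.counter_out = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        pose = self.pose_out(h)                       # continuous joint positions
        counter = torch.sigmoid(self.counter_out(h))  # production progress so far
        return pose, counter

# At inference, generation stops once the predicted counter approaches 1.0,
# replacing the discrete end-of-sequence token used in text decoding.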

Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks

Journal Paper
Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield, Richard Bowden
International Journal of Computer Vision (IJCV), 2020

We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks (GANs), and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that depend on heavily annotated data, our approach requires minimal gloss- and skeletal-level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo-realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting BLEU-4 scores of 16.34/15.26 on the dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings, qualitatively and quantitatively, using broadcast quality assessment metrics.
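
As a rough illustration of the sub-process decomposition described above, the skeleton below wires the three stages together; the function names and signatures are hypothetical placeholders, not the authors' API.

from typing import Callable, List, Sequence

def text_to_sign_video(
    sentence: str,
    translate: Callable[[str], List[str]],          # NMT network: sentence -> gloss sequence
    solve_motion: Callable[[List[str]], Sequence],  # Motion Graph: glosses -> pose sequence
    render_frame: Callable[[object], object],       # conditional generative model: pose -> frame
) -> list:
    # Hypothetical end-to-end pipeline: text -> gloss -> pose -> video frames.
    glosses = translate(sentence)
    poses = solve_motion(glosses)
    return [render_frame(pose) for pose in poses]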

Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos

Journal Paper
Oscar Koller, Necati Cihan Camgoz, Hermann Ney, Richard Bowden
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2019

In this work, we present a new approach to weakly supervised learning in the video domain. Our method is relevant to sequence learning problems that can be split into sub-problems occurring in parallel. Here, we experiment with sign language data. The approach exploits sequence constraints within each independent stream and combines them by explicitly imposing synchronisation points to make use of the parallelism that all sub-problems share. We do this with multi-stream HMMs, adding intermediate synchronisation constraints among the streams. Following the hybrid approach, we embed powerful CNN-LSTM models in each HMM stream. This allows the discovery of attributes which, on their own, lack sufficient discriminative power to be identified. We apply the approach to sign language recognition, exploiting the sequential parallelism to learn sign language, mouth shape, and hand shape classifiers. We evaluate the classifiers on three publicly available benchmark data sets featuring challenging real-life sign language with over 1000 classes, full-sentence lip-reading, and articulated hand shape recognition on a fine-grained taxonomy of over 60 hand shapes. We clearly outperform the state of the art on all data sets and observe significantly faster convergence with the parallel alignment approach.
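
One detail implicit in the hybrid CNN-LSTM-HMM setup is how network posteriors become HMM emission scores; the sketch below shows the standard scaled-likelihood conversion used by hybrid systems, with toy numbers, and omits the paper's multi-stream synchronisation for brevity.

import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors):
    # Hybrid NN-HMM trick: p(frame | state) is proportional to
    # p(state | frame) / p(state), so divide network posteriors by priors.
    # Shapes: log_posteriors (T, S) over S HMM states; state_priors (S,).
    return log_posteriors - np.log(state_priors)

# Toy example: 3 frames, 3 states; priors would come from alignment counts.
log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.2, 0.6, 0.2],
                            [0.1, 0.3, 0.6]]))
priors = np.array([0.5, 0.3, 0.2])
emission_scores = scaled_log_likelihoods(log_post, priors)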

Approximation of Ensemble Boundary using Spectral Coefficients

Journal Paper
Terry Windeatt, Cemre Zor, Necati Cihan Camgoz
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2018

A spectral analysis of a Boolean function is proposed for approximating the decision boundary of an ensemble of classifiers, and an intuitive explanation of computing Walsh coefficients for the functional approximation is provided. It is shown that the difference between the first- and third-order coefficient approximations is a good indicator of optimal base classifier complexity. When combining neural networks, experimental results on a variety of artificial and real two-class problems demonstrate under what circumstances ensemble performance can be improved. For tuned base classifiers, the first-order coefficients provide performance similar to the majority vote. However, for weak/fast base classifiers, higher-order coefficient approximations may give better performance. It is also shown that higher-order coefficient approximation is superior to the AdaBoost logarithmic weighting rule when boosting weak decision-tree base classifiers.
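
For readers new to the spectral view, the brute-force sketch below enumerates the Walsh coefficients of a small Boolean function (here a 3-classifier majority vote in the ±1 convention); it is an illustrative enumeration, not the paper's estimation procedure, which must approximate the spectrum from training data.

import itertools

def walsh_coefficients(f, n):
    # Brute-force Walsh spectrum of f: {0,1}^n -> {-1,+1}.
    # Coefficient for subset S: (1 / 2^n) * sum over all x of
    # f(x) * (-1)^(sum of x_i for i in S).
    inputs = list(itertools.product([0, 1], repeat=n))
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(n), r) for r in range(n + 1))
    return {S: sum(f(x) * (-1) ** sum(x[i] for i in S) for x in inputs) / 2 ** n
            for S in subsets}

# Majority vote of 3 base classifiers, mapped to {-1, +1}:
majority = lambda x: 1 if sum(x) >= 2 else -1
spectrum = walsh_coefficients(majority, 3)
# The singleton (first-order) coefficients already reproduce the majority-vote
# decision; higher-order terms refine more complex ensemble boundaries.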

An Energy-Efficient Multi-Tier Architecture for Fall Detection on Smartphones

Journal Paper
M. Amac Guvensan, A. Oguz Kansiz, N. Cihan Camgoz, H. Irem Turkmen, A. Gokhan Yavuz, M. Elif Karsligil
Sensors, Volume 17, Issue 7, 1487, 2017

Automatic detection of fall events is vital to providing fast medical assistance to the casualty, particularly when the injury causes loss of consciousness. Optimising the energy consumption of mobile applications, especially those that run 24/7 in the background, is essential for longer smartphone use. In order to improve energy efficiency without compromising fall detection performance, we propose a novel 3-tier architecture that combines simple thresholding methods with machine learning algorithms. The proposed method is implemented in a mobile application, called uSurvive, for Android smartphones. It runs as a background service, monitors a person's daily activities, and automatically sends a notification to the appropriate authorities and/or user-defined contacts when it detects a fall. The performance of the proposed method was evaluated in terms of fall detection performance and energy consumption. Real-life performance tests conducted on two different smartphone models demonstrate that our 3-tier architecture with feature reduction can save up to 62% of energy compared to machine-learning-only solutions. In addition to this energy saving, the hybrid method achieves 93% accuracy, which is superior to thresholding methods and better than machine-learning-only solutions.
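
To make the tiered gating concrete, the sketch below shows how cheap checks can run continuously while the costly classifier only fires on candidate events; the thresholds, window handling, and feature set are made-up illustrations (a scikit-learn-style classifier is assumed), not the uSurvive implementation.

def detect_fall(accel_magnitudes, classifier, impact_g=2.5, still_range_g=0.3):
    # Hypothetical 3-tier check over one window of accelerometer magnitudes (in g).
    peak = max(accel_magnitudes)
    if peak < impact_g:                     # Tier 1: cheap threshold, runs constantly
        return False
    tail = accel_magnitudes[-10:]           # samples just after the candidate impact
    if max(tail) - min(tail) > still_range_g:
        return False                        # Tier 2: subject still moving, not a fall
    mean = sum(accel_magnitudes) / len(accel_magnitudes)
    features = [peak, mean, max(tail)]      # toy feature set (the real one is reduced/tuned)
    return classifier.predict([features])[0] == 1  # Tier 3: machine learning verdict

# Only windows that survive the two cheap tiers pay for feature extraction and
# model inference, which is where such a hybrid design saves energy.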