TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
Ishan Rajendrakumar Dave
Mamshad Nayeem Rizve
Chen Chen
Mubarak Shah
[Paper]
[Code]
Motivation for Temporally-Distinctive and Temporally-Invariant Representations. In order to leverage unlabeled videos effectively, we consider two kinds of self-supervised video representation learning techniques with complementary goals: (1) Temporally Invariant Representations (bottom right) encourage learning the commonalities across clips, hence they mainly capture features related to frequent repetitions and appearance. (2) Temporally Distinctive Representations (bottom left) encourage learning the dissimilarities between clips of the same video, hence they capture features of the sub-actions within the video. The plot shows the class-wise UCF101 performance difference between finetuned models that were self-supervised pretrained with the temporally-distinctive and the temporally-invariant objective; the 25 classes at each extreme are shown after sorting all class-wise differences.
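To make the two objectives concrete, here is a minimal, hypothetical sketch (in PyTorch) of what invariant and distinctive contrastive pretraining can look like. The clip encoder f, the temperature, and the clip sampling are illustrative assumptions, not the exact losses used in the paper:

```python
# Rough sketch of the two complementary pretraining objectives (not the exact
# losses from the paper; the clip encoder `f`, temperature `tau`, and batch
# layout are illustrative assumptions).
import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, tau=0.1):
    """Standard NT-Xent: row i of z_a is positive with row i of z_b."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def invariant_loss(f, clip_t1, clip_t2):
    # Temporally-invariant: two clips sampled at *different* times of the same
    # video are treated as positives, so the encoder keeps what they share
    # (appearance, frequently repeated motion).
    return nt_xent(f(clip_t1), f(clip_t2))

def distinctive_loss(f, clip_t1, clip_t1_aug, clip_t2, tau=0.1):
    # Temporally-distinctive: only the *same* timestamp (under augmentation) is
    # a positive; a clip from another timestamp of the same video acts as a
    # negative, so the encoder keeps what changes over time (sub-actions).
    z1, z1_aug, z2 = (F.normalize(f(c), dim=1) for c in (clip_t1, clip_t1_aug, clip_t2))
    pos = (z1 * z1_aug).sum(1, keepdim=True)          # (B, 1) same-time positive
    neg = (z1 * z2).sum(1, keepdim=True)              # (B, 1) same-video, other-time negative
    logits = torch.cat([pos, neg], dim=1) / tau
    targets = torch.zeros(z1.size(0), dtype=torch.long, device=z1.device)
    return F.cross_entropy(logits, targets)
```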

Abstract

Semi-supervised learning can be more beneficial for the video domain than for images because of videos' higher annotation cost and dimensionality. Moreover, any video understanding task requires reasoning over both the spatial and temporal dimensions. To learn both static and motion-related features for semi-supervised action recognition, existing methods rely on hard input inductive biases, such as two modalities (RGB and optical flow) or two streams with different playback rates. Instead of utilizing unlabeled videos through diverse input streams, we rely on self-supervised video representations; in particular, we utilize temporally-invariant and temporally-distinctive representations. We observe that these representations complement each other depending on the nature of the action. Based on this observation, we propose a student-teacher semi-supervised learning framework, TimeBalance, in which we distill knowledge from a temporally-invariant teacher and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers based on a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance.


Talk


[Slides]

Framework

Our framework. We use a teacher-student framework with two teachers, fI (temporally invariant) and fD (temporally distinctive). The input clip x(i) is given to both teachers and the student to obtain their predictions. We utilize a reweighting strategy to combine the predictions of the two teachers. Regardless of whether the video v(i) is labeled or unlabeled, we distill the combined knowledge of the teachers into the student. For labeled samples, we also apply the standard cross-entropy loss.
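A minimal sketch of the resulting student update is given below, assuming softened teacher predictions combined by per-video weights w_inv and w_dis (with w_inv + w_dis = 1, e.g., produced by the temporal similarity reweighting described in the next section) and a KL-based distillation term; the function and variable names are ours, not the authors':

```python
# Minimal sketch of the student update (PyTorch-style, not the authors' exact
# training loop; `student`, `teacher_inv`, `teacher_dis`, `w_inv`, `w_dis`,
# and the KL-based distillation term are assumptions).
import torch
import torch.nn.functional as F

def student_loss(student, teacher_inv, teacher_dis, clip, w_inv, w_dis,
                 label=None, tau=1.0):
    with torch.no_grad():
        # Soft targets from the two teachers, combined per video by the
        # temporal-similarity-based weights (w_inv + w_dis = 1).
        p_inv = F.softmax(teacher_inv(clip) / tau, dim=1)
        p_dis = F.softmax(teacher_dis(clip) / tau, dim=1)
        p_teacher = w_inv * p_inv + w_dis * p_dis

    logits = student(clip)
    # Distillation term: applied to every video, labeled or unlabeled.
    loss = F.kl_div(F.log_softmax(logits / tau, dim=1), p_teacher,
                    reduction="batchmean")
    # Supervised term: standard cross-entropy, only for labeled videos.
    if label is not None:
        loss = loss + F.cross_entropy(logits, label)
    return loss
```

In this sketch the distillation term covers every sample in the batch, while the cross-entropy term is added only where a label exists, mirroring the description above.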

Temporal Similarity based Teacher Reweighting


First, a set of clips from video v_i is passed through the distinctive and invariant teachers to obtain clip-level representations. Second, temporal similarity matrices C_i^D and C_i^I are constructed from the cosine similarities between every pair of timestamps. Finally, a temporal similarity score s_i is computed from the average matrix C_i. During semi-supervised training, s_i is used to combine the predictions of the two teachers for video v_i.
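A small sketch of this reweighting step, assuming L2-normalized per-clip features from each teacher; how the score s_i is mapped to the two teacher weights is an illustrative choice here, not necessarily the paper's exact formula:

```python
# Sketch of the temporal-similarity-based teacher reweighting described above
# (assumed interface: per-clip features from each teacher for one video).
import torch
import torch.nn.functional as F

def teacher_weights(feats_dis, feats_inv):
    """feats_*: (T, d) features of T clips of one video from each teacher."""
    z_d = F.normalize(feats_dis, dim=1)
    z_i = F.normalize(feats_inv, dim=1)
    C_d = z_d @ z_d.t()                  # (T, T) temporal similarity matrix C_i^D
    C_i = z_i @ z_i.t()                  # (T, T) temporal similarity matrix C_i^I
    C = 0.5 * (C_d + C_i)                # average matrix C_i
    # Temporal similarity score s_i: mean off-diagonal similarity across time.
    T = C.size(0)
    off_diag = ~torch.eye(T, dtype=torch.bool, device=C.device)
    s = C[off_diag].mean()
    # Illustrative mapping: clips that look alike over time (high s) lean on
    # the invariant teacher; clips that change over time lean on the
    # distinctive teacher. Clamp keeps weights in [0, 1] since cosine
    # similarities can be negative.
    w_inv = s.clamp(0.0, 1.0)
    w_dis = 1.0 - w_inv
    return w_inv, w_dis
```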


Results

Comparison with state-of-the-art methods: Methods using pre-trained ImageNet weights are shown in Grey. V- Video (RGB), F- Optical Flow, G- Temporal Gradients. Best results are shown in Red, and Second-best in Blue.
Method  Backbone  Params (M)  Input  #Frames  UCF101 (1%, 5%, 10%, 20%, 50%)  HMDB51 (40%, 50%, 60%)  Kinetics400 (1%, 10%)
PL ICML'13  3D-ResNet18 13.5 V 16 - 17.6 24.7 37.0 47.5 27.3 32.4 33.5 - -
MT NeurIPS'17  3D-ResNet18 13.5 V 16 - 17.5 25.6 36.3 45.8 27.2 30.4 32.2 - -
S4L ICCV'19  3D-ResNet18 13.5 V 16 - 22.7 29.1 37.7 47.9 29.8 31.0 35.6 - -
SD ICCV'19  3D-ResNet18 13.5 V 16 - 31.2 40.7 45.4 53.9 32.6 35.1 36.3 - -
MT+SD  WACV'21  3D-ResNet18 13.5 V 16 - 30.3 40.5 45.5 53.0 32.3 33.6 35.7 - -
3DRotNet Arxiv'19  3D-ResNet18 13.5 V 16 15.0 31.5 40.4 47.1 - - - - - -
VideoSemi WACV'21  3D-ResNet18 13.5 V 16 - 32.4 42.0 48.7 54.3 32.7 36.2 37.0 - -
TCL CVPR'21  TSM-ResNet18 - V 8 - - - - - - - - 11.6 -
TG-FixMatch CVPR'21  3D-ResNet18 13.5 V 8 - 44.8 62.4 76.1 79.3 46.5 48.4 49.7 9.8 43.8
MvPL ICCV'21  3D-ResNet18 13.5 VFG 8 - 41.2 55.5 64.7 65.6 30.5 33.9 35.8 5.0 36.9
TCLR CVIU'22  3D-ResNet18 13.5 V 16 26.9 - 66.1 73.4 76.7 - - - - -
CMPL CVPR'22  3D-ResNet18 13.5 V 8 23.8 - 67.6 - - - - - 16.5 53.7
TACL TSVT'22  3D-ResNet18 13.5 V 16 - 35.6 50.9 56.1 65.8 34.6 37.2 39.5 - -
TACL TSVT'22  3D-ResNet18 13.5 V 16 - 43.7 55.6 59.2 67.2 38.7 40.2 41.7 - -
MemDPC ECCV'20  3D-ResNet18 13.5 V 16 - - 44.2 50.9 62.3 - - - - -
MotionFit ICCV'21  3D-ResNet18 13.5 VF 16 - - - 57.7 59.0 - - - - -
Ours (TimeBalance) 3D-ResNet18 13.5 V 8 29.1 47.9 69.8 79.1 83.3 49.8 51.4 53.1 17.1 54.9
ActorCM Arxiv'21  R(2+1)D-34 33.3 V 8 - 27.0 40.2 51.7 59.9 32.9 38.2 38.9 - -
ActorCM Arxiv'21  R(2+1)D-34 33.3 V 8 - 45.1 53.0 57.4 64.7 35.7 39.5 40.8 - -
FixMatch NeurIPS'20  SlowFast-R50 60 V 8 16.1 - 55.1 - - - - - 10.1 49.4
MvPL ICCV'21 3D-ResNet50 31.8 VFG 8 22.8 - 80.5 - - - - - 17.0 58.2
CMPL CVPR'22  3D-ResNet50 31.8 V 8 25.1 - 79.1 - - - - - 17.6 58.4
Ours (TimeBalance) 3D-ResNet50 31.8 V 8 30.1 53.5 81.1 83.3 85.0 52.6 53.9 54.5 19.6 61.2


Paper and Supplementary Material

Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah.
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
(hosted on arXiv)


[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.