Ishan Dave

I am a full-time Applied Scientist at Adobe Firefly, working on image and video diffusion models. I finished my Ph.D. in 2024 at the Center for Research in Computer Vision (CRCV), University of Central Florida (UCF), advised by Prof. Mubarak Shah. My Ph.D. dissertation was on video understanding and self-supervised learning.

Email: ishandave95(a)gmail.com

Google Scholar  /  Github  /  LinkedIn

CV (public version; for more details, feel free to ask!)
Last updated: Dec 17, 2025


Updates

2025
  Dec'25: First-author work "CreativeVR" released: a state-of-the-art video restoration model
  Oct'25: "Generative Upscaler", a project I led, shipped as the default upscaler in Photoshop 💥
  Sep'25: A paper "Fine-grained Video Retrieval" accepted at NeurIPS 2025
  Jun'25: A paper "GT-Loc" accepted at ICCV 2025: Oral presentation! (Top 0.6% of papers)
  Apr'25: A paper "ALBAR" accepted at ICLR 2025
  Jan'25: Started full-time at Adobe Firefly, Seattle, WA
2024
  Oct'24: Successfully defended my Ph.D. dissertation!
  Aug'24: SPAct patent granted: my first patent as the primary inventor 💥
  Jul'24: Two first-author papers accepted at ECCV 2024, one as an oral presentation! (Top 3% of papers) 💥💥
  Jun'24: Selected as an Outstanding Reviewer for CVPR 2024! (top 2% of ~10,000 reviewers) 🥇
  May'24: Started internship at Apple, Cupertino, CA
2023
  Dec'23: First-author paper "No More Shortcuts" accepted at AAAI 2024 💥
  Jul'23: First-author paper "EventTransAct" accepted at IROS 2023 💥
  Jul'23: A paper "TeD-SPAD" accepted at ICCV 2023
  May'23: Started summer internship at Adobe, San Jose, CA
  Mar'23: First-author paper "TimeBalance" accepted at CVPR 2023 💥
  Jan'23: A paper "TransVisDrone" accepted at ICRA 2023
2022
  May'22: Started summer internship at Adobe, USA (remote from Florida)
  Mar'22: First-author paper "TCLR" accepted at CVIU 2022 💥
  Mar'22: First-author paper "SPAct" accepted at CVPR 2022 💥
2021 & Earlier
  Jan'21: Our Gabriella paper received the Best Scientific Paper Award at ICPR 2020
Work Experience
Applied Scientist
Adobe Inc., Seattle, Washington, USA. Jan 2025 - Present

  • Working on image and video diffusion models for Adobe Firefly, spanning open research and in-house production models.
  • Initiated and led the Generative Image Upscaler project end-to-end: proposed the idea, built the research prototype, trained the in-house production model, and drove performance beyond competitive systems. Generative Upscaler shipped in the general release of Photoshop as the default upsampling solution and is a core component of Adobe's flagship Firefly Custom Models ecosystem.
  • Creative Video Restoration: Developed a state-of-the-art video restoration model for AI-generated and real videos with severe structural and temporal artifacts.

PhD AI/ML Intern
Apple Inc., Cupertino, California, USA. May 2024 - Aug 2024

  • Enhanced stable diffusion models for image editing by leveraging vision-language multimodal foundation models.
  • Trained diffusion models on a large-scale, high-resolution dataset of 10M samples.
  • Reproduced state-of-the-art image editing methods and surpassed them with a novel approach.
Research Scientist/Engineer Intern
Adobe Inc., San Jose, California, USA. May 2023 - Nov 2023
Hosts: Simon Jenni, Fabian Caba

  • Enhanced the fine-grained capabilities of existing video retrieval methods.
  • Worked on large-scale video galleries with millions of samples.
  • Filed a patent and had a paper accepted at ECCV 2024.
Research Scientist Intern
Adobe Inc., Remote, USA. May 2022 - Nov 2022
Host: Simon Jenni

  • Developed a novel self-supervised video representation framework by reformulating temporal self-supervision as frame-level recognition tasks and introducing an effective augmentation strategy to mitigate shortcuts.
  • Achieved state-of-the-art performance on 10 video understanding benchmarks across linear classification (Kinetics400, HVU, SSv2, Charades), video retrieval (UCF101, HMDB51), and temporal correspondence (CASIA-B).
  • Published a paper at AAAI 2024.
Publications

I have a broad interest in computer vision and machine learning. My primary research focuses on video understanding, particularly self- and semi-supervised learning and action recognition. My recent research also includes enhancing the fine-grained video understanding of large foundation models and improving multi-modal generative AI for image editing applications. I have also worked on various robotics-related vision tasks, such as event-camera-based action recognition and drone-to-drone detection in videos.
Below is a selected list of my works (in reverse chronological order); representative papers are highlighted.

CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos
Ishan Rajendrakumar Dave*, Tejas Panambur*, Chongjian Ge, Ersin Yumer, Xue Bai
ArXiv Preprint, 2025
*= equal contribution

Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity.

We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures.

To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (~13 FPS @ 720p on a single 80 GB A100).
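
To make the idea of a temporally coherent degradation concrete, here is a minimal sketch (a toy warp-based corruption, not CreativeVR's actual transformations) in which one random warp field is sampled per clip and applied identically to every frame:

```python
# Illustrative sketch of a "temporally coherent degradation": sample one random
# warp field per clip and apply it to every frame, so the structural corruption
# is consistent over time. The warp is a placeholder assumption.
import torch
import torch.nn.functional as F

def coherent_warp(clip, strength=0.15):
    """clip: (T, C, H, W). One smooth random warp corrupts all frames alike."""
    t, c, h, w = clip.shape
    # Low-resolution random flow, upsampled so the warp is spatially smooth.
    flow = F.interpolate(torch.randn(1, 2, 4, 4) * strength, size=(h, w),
                         mode="bilinear", align_corners=False)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    return F.grid_sample(clip, grid.expand(t, h, w, 2), align_corners=False)

clip = torch.rand(16, 3, 128, 128)   # hypothetical 16-frame training clip
degraded = coherent_warp(clip)       # training pair: (degraded, clip)
```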

From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah
Conference on Neural Information Processing Systems (NeurIPS), 2025

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving.
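
As a rough illustration of the CoVR setup, the sketch below fuses a query-video embedding with a modification-text embedding and ranks a gallery by similarity; the fusion head and embedding dimensions are assumptions for illustration, not TF-CoVR's model:

```python
# Illustrative Composed Video Retrieval (CoVR) scoring: combine "what the
# query clip shows" with "how it should change", then rank gallery videos.
import torch
import torch.nn.functional as F

def compose_query(video_emb, text_emb, fusion):
    """Fuse query-video and modification-text embeddings into one query vector."""
    return F.normalize(fusion(torch.cat([video_emb, text_emb], dim=-1)), dim=-1)

d = 256
fusion = torch.nn.Linear(2 * d, d)                        # hypothetical fusion head
query_video = F.normalize(torch.randn(1, d), dim=-1)
mod_text    = F.normalize(torch.randn(1, d), dim=-1)      # e.g., "same vault, with an extra twist"
gallery     = F.normalize(torch.randn(1000, d), dim=-1)   # precomputed gallery-video embeddings

scores = compose_query(query_video, mod_text, fusion) @ gallery.t()
top5 = scores.topk(5).indices   # indices of best-matching target videos
```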

GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
David Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, Mubarak Shah
International Conference on Computer Vision (ICCV), 2025
Oral presentation! (Top 0.6% of papers)

Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. To address the interdependence between time and location, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space.
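
A minimal sketch of the joint embedding idea, assuming simple MLP encoders and a symmetric InfoNCE objective; all module names and dimensions here are illustrative assumptions, not the paper's actual architecture:

```python
# Toy joint embedding space for images, time, and location, in the spirit of
# GT-Loc: separate encoders, one shared space, contrastive alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Stand-in encoder: maps a modality-specific input to a shared embedding."""
    def __init__(self, in_dim, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm embeddings

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs attract, all other pairs repel."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical inputs: pooled image features, (hour, month) and (lat, lon) encodings.
img_enc, time_enc, geo_enc = ModalityEncoder(2048), ModalityEncoder(2), ModalityEncoder(2)
imgs, times, geos = torch.randn(32, 2048), torch.rand(32, 2), torch.rand(32, 2)
loss = info_nce(img_enc(imgs), time_enc(times)) + info_nce(img_enc(imgs), geo_enc(geos))
# At inference, prediction becomes retrieval: embed the query image and find
# the nearest time/location embeddings in the shared space.
```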

ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition
Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah
International Conference on Learning Representations (ICLR), 2025
Poster Presentation

We propose ALBAR, an adversarial learning approach to mitigate biases in action recognition. Our method addresses the challenge of spurious correlations between visual features and action labels that can lead to biased model predictions.

Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
Ishan Rajendrakumar Dave, Fabian Caba, Mubarak Shah, Simon Jenni.
The 18th European Conference on Computer Vision (ECCV), 2024
Oral presentation! (Top 3% of accepted papers)

Temporal video alignment synchronizes key events like object interactions or action phase transitions in two videos, benefiting video editing, processing, and understanding tasks. Existing methods assume a given video pair, limiting applicability. We redefine this as a search problem, introducing Alignable Video Retrieval (AVR), which identifies and synchronizes well-alignable videos from a large collection. Key contributions include DRAQ, a video alignability indicator, and a generalizable frame-level video feature design.

FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Mubarak Shah.
The 18th European Conference on Computer Vision (ECCV), 2024

We introduce Alignability-Verification-based Metric learning for semi-supervised fine-grained action recognition. Using dynamic time warping (DTW) for action-phase-aware comparison, our learnable alignability score refines pseudo-labels of the video encoder. Our framework, FinePseudo, outperforms prior methods on fine-grained action recognition datasets. Additionally, it demonstrates robustness in handling novel unlabeled classes in open-world setups.
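
For readers unfamiliar with DTW, here is a minimal alignment-cost sketch over per-frame features; the features and distance are placeholders, and FinePseudo learns its alignability score rather than using this raw cost:

```python
# Minimal dynamic time warping (DTW): alignment cost between two
# frame-feature sequences of possibly different lengths.
import numpy as np

def dtw_cost(seq_a, seq_b):
    """seq_a: (T_a, D), seq_b: (T_b, D). Returns length-normalized DTW cost."""
    ta, tb = len(seq_a), len(seq_b)
    dist = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=-1)  # pairwise frame distances
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[ta, tb] / (ta + tb)

a = np.random.randn(16, 128)   # hypothetical per-frame embeddings of two clips
b = np.random.randn(20, 128)
print(dtw_cost(a, b))          # lower cost => the clips are more alignable
```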

CodaMal: Contrastive Domain Adaptation for Malaria Detection in Low-Cost Microscopes
Ishan Rajendrakumar Dave, Tristan de Blegiers, Chen Chen, Mubarak Shah.
31st IEEE International Conference on Image Processing (ICIP), 2024
Oral presentation!

We propose a domain-adaptive contrastive objective to bridge the gap between high-cost and low-cost microscopes. On the publicly available large-scale M5 dataset, our method improves on state-of-the-art methods by 16% in mean average precision (mAP), provides a 21× speed-up during inference, and requires only half as many learnable parameters as prior methods.

No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah.
AAAI Conference on Artificial Intelligence, Main Technical Track (AAAI), 2024

We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations.
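
As a generic example of temporal self-supervision (not the paper's exact frame-level formulation), one can classify a clip's playback speed without any labels:

```python
# Toy playback-speed pretext task: the "label" is the subsampling rate we
# applied ourselves, so no human annotation is needed. The linear head is a
# stand-in for a video encoder + classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

def subsample(clip, rate, n_frames=8):
    """Simulate playback speed by temporal subsampling; clip is (C, T, H, W)."""
    return clip[:, ::rate][:, :n_frames]

speeds = [1, 2, 4]                         # candidate playback rates (the classes)
clip = torch.randn(3, 32, 64, 64)          # hypothetical raw video tensor
label = torch.randint(len(speeds), (1,))   # speed class to apply
x = subsample(clip, speeds[int(label)]).flatten()

head = nn.Linear(x.numel(), len(speeds))   # stand-in for encoder + classifier
loss = F.cross_entropy(head(x).unsqueeze(0), label)
```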

TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for Video Anomaly Detection
Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah.
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

We propose TeD-SPAD, a privacy-aware video anomaly detection framework that destroys visual private information in a self-supervised manner. In particular, we propose the use of a temporally-distinct triplet loss to promote temporally discriminative features, which complements current weakly-supervised VAD methods.
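
A hedged sketch of a temporally-distinct triplet loss in this spirit: the anchor stays close to an augmented view of the same moment and far from a different moment of the same video; the sampling scheme and margin are illustrative assumptions:

```python
# Toy temporally-distinct triplet loss: push features of different times of
# the same video apart, keeping same-time views together.
import torch
import torch.nn.functional as F

def temporal_triplet(anchor, positive, negative, margin=0.5):
    """Standard triplet loss on clip features, with cosine distance."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)   # same time, different view
    d_neg = 1 - F.cosine_similarity(anchor, negative)   # same video, different time
    return F.relu(d_pos - d_neg + margin).mean()

feat_t      = torch.randn(8, 256)   # hypothetical features of clips at time t
feat_t_aug  = torch.randn(8, 256)   # augmented views of the same clips
feat_t_next = torch.randn(8, 256)   # clips from a different time in each video
loss = temporal_triplet(feat_t, feat_t_aug, feat_t_next)
```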

EventTransAct: A Video Transformer-based Framework for Event-camera Based Action Recognition
Tristan de Blegiers*, Ishan Rajendrakumar Dave*, Adeel Yousaf, Mubarak Shah.
*= equal contribution
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

We propose a video-transformer-based framework for event-camera-based action recognition that leverages an event-contrastive loss and augmentations to adapt the network to event data. Our method achieves state-of-the-art results on the N-EPIC-Kitchens dataset and competitive results on the standard DVS Gesture recognition dataset, while requiring less computation time than prior approaches.

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

We propose a student-teacher semi-supervised learning framework, where we distill knowledge from a temporally-invariant and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers via a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art results on Kinetics400, UCF101, and HMDB51.
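
A minimal sketch of the dual-teacher combination; the weighting function below is an illustrative stand-in for the paper's reweighting scheme:

```python
# Toy dual-teacher pseudo-labelling: mix an invariant and a distinctive
# teacher per clip, weighted by a temporal-similarity score in [0, 1].
import torch
import torch.nn.functional as F

def combine_teachers(logits_inv, logits_dis, temporal_sim):
    """High temporal_sim => clip looks temporally invariant, so trust the
    temporally-invariant teacher more (and vice versa)."""
    w = temporal_sim.unsqueeze(-1)
    return w * F.softmax(logits_inv, -1) + (1 - w) * F.softmax(logits_dis, -1)

logits_inv = torch.randn(4, 101)   # temporally-invariant teacher, 101 classes
logits_dis = torch.randn(4, 101)   # temporally-distinctive teacher
temporal_sim = torch.rand(4)       # e.g., similarity of a clip to its time-shifted self
pseudo = combine_teachers(logits_inv, logits_dis, temporal_sim)  # soft pseudo-labels for the student
```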

TransVisDrone: Spatio-temporal Transformer for Vision-based Drone-to-drone Detection in Aerial Videos
Tushar Sangam, Ishan Rajendrakumar Dave, Waqas Sultani, Mubarak Shah.
IEEE International Conference on Robotics and Automation (ICRA), 2023

We propose a simple yet effective framework, TransVisDrone, that provides an end-to-end solution with higher computational efficiency. We utilize the CSPDarkNet-53 network to learn object-related spatial features and the VideoSwin model to improve drone detection in challenging scenarios by learning spatio-temporal dependencies of drone motion.

SPAct: Self-supervised Privacy Preservation for Action Recognition
Ishan Rajendrakumar Dave, Chen Chen, Mubarak Shah.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

For the first time, we present a novel training framework that removes privacy information from input video in a self-supervised manner without requiring privacy labels. We train our framework using a minimax optimization strategy to minimize the action recognition cost function and maximize the privacy cost function through a contrastive self-supervised loss.
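
A simplified sketch of the minimax setup, with toy networks standing in for SPAct's actual components: the anonymizer is trained to keep action features informative while maximizing a self-supervised contrastive privacy cost, so no privacy labels are needed:

```python
# Toy minimax objective: minimize the action cost, maximize a contrastive
# privacy cost computed on the anonymized video (no privacy labels).
import torch
import torch.nn as nn
import torch.nn.functional as F

anonymizer = nn.Conv3d(3, 3, 1)              # learned video-to-video transform (toy)
action_head = nn.Linear(3 * 8 * 16 * 16, 10)

def contrastive_privacy(feats, temperature=0.1):
    """NT-Xent-style loss between two views; the anonymizer *maximizes* it."""
    a, b = F.normalize(feats[0::2], dim=-1), F.normalize(feats[1::2], dim=-1)
    logits = a @ b.t() / temperature
    return F.cross_entropy(logits, torch.arange(a.size(0)))

video = torch.randn(4, 3, 8, 16, 16)         # (B, C, T, H, W), two views per sample interleaved
anon = anonymizer(video)
action_loss = F.cross_entropy(action_head(anon.flatten(1)), torch.randint(10, (4,)))
privacy_loss = contrastive_privacy(anon.flatten(1))
total = action_loss - privacy_loss           # minimax: one term minimized, one maximized
total.backward()
```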

TCLR: Temporal Contrastive Learning for Video Representation
Ishan Dave, Rohit Gupta, Mamshad Nayeem Rizve, Mubarak Shah.
Computer Vision and Image Understanding (CVIU), 2022
(250+ citations; among the top 10 most-downloaded papers in CVIU)

We propose a new temporal contrastive learning framework for self-supervised video representation learning, consisting of two novel losses that aim to increase the temporal diversity of learned features. The framework achieves state-of-the-art results on various downstream video understanding tasks, including significant improvement in fine-grained action classification for visually similar classes.
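
The core departure from standard instance-level contrastive learning can be sketched as follows; the shapes and number of clips are illustrative:

```python
# Toy "local-local" temporal contrastive loss in the spirit of TCLR:
# non-overlapping clips from the SAME video act as negatives for each other,
# pushing the encoder to keep features temporally diverse.
import torch
import torch.nn.functional as F

def local_local_loss(view1, view2, temperature=0.1):
    """view1, view2: (N, D) features of N non-overlapping clips of ONE video
    under two augmentations. Clip i in view1 matches clip i in view2; the
    other clips of the same video are the negatives."""
    a, b = F.normalize(view1, dim=-1), F.normalize(view2, dim=-1)
    logits = a @ b.t() / temperature
    return F.cross_entropy(logits, torch.arange(a.size(0)))

v1 = torch.randn(4, 128)   # 4 clips of one video, augmentation A
v2 = torch.randn(4, 128)   # same 4 clips, augmentation B
loss = local_local_loss(v1, v2)
```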

Gabriellav2: Towards Better Generalization in Surveillance Videos for Action Detection
Ishan Dave, Zacchaeus Scheffer, Akash Kumar, Sarah Shiraz, Yogesh Singh Rawat, Mubarak Shah.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022

We propose a real-time, online action detection system that generalizes robustly to unseen facility surveillance videos. We tackle the challenging nature of the action classification problem in several aspects, such as handling class-imbalanced training with the PLM method and learning multi-label action correlations with the LSEP loss. To improve the computational efficiency of the system, we utilize knowledge distillation.

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos
Mamshad Nayeem Rizve, Ugur Demir, Praveen Tirupattur, Aayush Jung Rana, Kevin Duarte, Ishan R Dave, Yogesh S Rawat, Mubarak Shah.
25th International Conference on Pattern Recognition (ICPR), 2021 (Best Paper Award)

Gabriella consists of three stages: tubelet extraction, activity classification, and online tubelet merging. Gabriella utilizes a localization network for tubelet extraction, with a novel Patch-Dice loss to handle variations in actor size, and a Tubelet-Merge Action-Split (TMAS) algorithm to detect activities efficiently and robustly.

Patents
Identifying and Aligning Video Clips from Large-Scale Video Datasets
Simon Jenni, Ishan Rajendrakumar Dave, Fabian Caba
US Patent US20250342699A1 (Status: Filed), 2025
Self-Supervised Privacy Preservation Action Recognition System
Ishan Rajendrakumar Dave, Chen Chen, Mubarak Shah
US Patent US12142053B2 (Status: Granted), 2024
Recognition

2nd place, 2022 - ActivityNet ActEV Challenge (CVPR)
1st place, 2021 - PMiss@0.02tfa, ActivityNet ActEV SDL (CVPR)
1st place, 2020 - PMiss and nAUDC, ActivityNet ActEV SDL (CVPR)
2nd place, 2020 - TRECVID ActEV: Activities in Extended Video
ORCGS Doctoral Fellowship, 2019-2020

Professional Reviewing Experience
Reviewer, CVPR 2025, 2024, 2023, 2022
Reviewer, ECCV 2024
Program Committee, AAAI 2025
Reviewer, ICCV 2023
Technical Committee, BMVC Workshop 2024
Reviewer, WACV 2025, 2024
Reviewer, IEEE Robotics and Automation Letters
Reviewer, ICRA 2024
Reviewer, IROS 2024
Reviewer, IEEE Transactions on Image Processing
Reviewer, IEEE Transactions on Pattern Analysis and Machine Intelligence
Reviewer, IEEE Transactions on Multimedia
Reviewer, IEEE Transactions on Circuits and Systems for Video Technology
Reviewer, IEEE Transactions on Neural Networks and Learning Systems
Reviewer, Computer Vision and Image Understanding
Reviewer, Pattern Recognition
Reviewer, Expert Systems with Applications
Reviewer, Image and Vision Computing
Reviewer, Journal of Real-Time Image Processing
Reviewer, Multimedia Tools and Applications


Mentor, NSF REU Program
Kevin Chung, REU 2022
Ethan Thomas, REU 2021
Kali Carter, REU 2020


Feel free to steal this website's source code. Do not scrape the HTML from this page itself, as it includes analytics tags that you do not want on your own website; use the GitHub code instead. Also, consider using Leonid Keselman's Jekyll fork of this page.