Date of Award

8-23-2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical Engineering and Computer Science

Advisor(s)

Senem Velipasalar

Keywords

Action Recognition;Autonomous Driving System;CLIP;Contrastive Learning;Video Understanding;Vision Transformer

Abstract

Video understanding is a challenging task that requires models to effectively interpret and generalize from complex visual data. Traditional approaches often rely on deep neural networks to encode and decode video content, but these methods can struggle to generalize, particularly in the face of data imbalance or the need to recognize rare events. In this thesis, we explore a semantics-centered approach to video understanding, leveraging prompt learning to enhance model robustness and adaptability across various video domains. Our research demonstrates that integrating human-like logic and language prompts into video understanding models significantly improves their performance. We show that these prompt-based models not only enhance general video understanding but also excel in specific applications, such as egocentric action recognition and autonomous driving. By incorporating semantic information, these models can better recognize and interpret actions from a first-person perspective, understand complex driving scenarios, and generalize across diverse video datasets. We also explore multi-modal data fusion strategies for egocentric video understanding built on the CapsNet architecture, and subsequently develop an improved variant, PT-CapsNet, suited to deeper architectures. Through extensive experimentation, we highlight the advantages of our approach, including its ability to mitigate data imbalance and improve the recognition of uncommon objects and actions. However, we also acknowledge certain limitations, such as the need for evaluation across multiple random seeds and more comprehensive dataset splits. Future work will focus on addressing these limitations, refining the models' sensitivity to structural variations, and expanding their applicability to new domains and real-time scenarios. Overall, this thesis advances the field of video understanding by introducing a novel framework that bridges the gap between human semantics and machine learning, paving the way for more nuanced and reliable AI-driven video analysis.
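To make the prompt-based idea concrete, below is a minimal sketch of zero-shot action recognition over video frames using CLIP-style language prompts. This illustrates the general technique the abstract refers to, not the thesis's specific models; the action labels, prompt template, and frame-pooling strategy are illustrative assumptions.

```python
# Minimal sketch: zero-shot action recognition with CLIP language prompts.
# Assumptions: action labels, prompt template, and mean-pooling over frames
# are hypothetical choices, not the thesis's exact method.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hand-written prompts inject human semantics into the label space.
actions = ["opening a door", "pouring water", "riding a bicycle"]
prompts = [f"a first-person video of {a}" for a in actions]

def classify(frames):
    """frames: list of PIL.Image objects sampled from one video clip."""
    with torch.no_grad():
        img_inputs = processor(images=frames, return_tensors="pt")
        txt_inputs = processor(text=prompts, return_tensors="pt", padding=True)
        # Encode each frame, then mean-pool over time for a clip embedding.
        img_emb = model.get_image_features(**img_inputs).mean(dim=0, keepdim=True)
        txt_emb = model.get_text_features(**txt_inputs)
        # L2-normalize so the dot product is cosine similarity.
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        sims = (img_emb @ txt_emb.T).squeeze(0)  # one score per prompt
    return actions[int(sims.argmax())]
```

Because classification reduces to similarity against text embeddings, new or rare action categories can be added by writing new prompts rather than retraining the visual encoder, which is one way prompt-based models help with data imbalance.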

Access

Open Access

Available for download on Sunday, September 27, 2026
