HUMAN ACTIVITY RECOGNITION FROM EGOCENTRIC VIDEOS AND ROBUSTNESS ANALYSIS OF DEEP NEURAL NETWORKS
Date of Award
Doctor of Philosophy (PhD)
Electrical Engineering and Computer Science
In recent years, there has been significant amount of research work on human activity classification relying either on Inertial Measurement Unit (IMU) data or data from static cameras providing a third-person view. There has been relatively less work using wearable cameras, providing egocentric view, which is a first-person view providing the view of the environment as seen by the wearer. Using only IMU data limits the variety and complexity of the activities that can be detected. Deep machine learning has achieved great success in image and video processing in recent years. Neural network based models provide improved accuracy in multiple fields in computer vision. However, there has been relatively less work focusing on designing specific models to improve the performance of egocentric image/video tasks. As deep neural networks keep improving the accuracy in computer vision tasks, the robustness and resilience of the networks should be improved as well to make it possible to be applied in safety-crucial areas such as autonomous driving.
Motivated by these considerations, in the first part of the thesis, the problem of human activity detection and classification from egocentric cameras is addressed. First, anew method is presented to count the number of footsteps and compute the total traveled distance by using the data from the IMU sensors and camera of a smart phone. By incorporating data from multiple sensor modalities, and calculating the length of each step, instead of using preset stride lengths and assuming equal-length steps, the proposed method provides much higher accuracy compared to commercially available step counting apps. After the application of footstep counting, more complicated human activities, such as steps of preparing a recipe and sitting on a sofa, are taken into consideration. Multiple classification methods, non-deep learning and deep-learning-based, are presented, which employ both ego-centric camera and IMU data. Then, a Genetic Algorithm-based approach is employed to set the parameters of an activity classification network autonomously and performance is compared with empirically-set parameters.
Then, a new framework is introduced to reduce the computational cost of human temporal activity recognition from egocentric videos while maintaining the accuracy at a comparable level. The actor-critic model of reinforcement learning is applied to optical flow data to locate a bounding box around region of interest, which is then used for clipping a sub-image from a video frame. A shallow and deeper 3D convolutional neural network is designed to process the original image and the clipped image region, respectively.Next, a systematic method is introduced that autonomously and simultaneously optimizes multiple parameters of any deep neural network by using a bi-generative adversarial network (Bi-GAN) guiding a genetic algorithm(GA). The proposed Bi-GAN allows the autonomous exploitation and choice of the number of neurons for the fully-connected layers, and number of filters for the convolutional layers, from a large range of values. The Bi-GAN involves two generators, and two different models compete and improve each other progressively with a GAN-based strategy to optimize the networks during a GA evolution.In this analysis, three different neural network layers and datasets are taken into consideration:
First, 3D convolutional layers for ModelNet40 dataset. We applied the proposed approach on a 3D convolutional network by using the ModelNet40 dataset. ModelNet is a dataset of 3D point clouds. The goal is to perform shape classification over 40shape classes.
LSTM layers for UCI HAR dataset. UCI HAR dataset is composed of InertialMeasurement Unit (IMU) data captured during activities of standing, sitting, laying, walking, walking upstairs and walking downstairs. These activities were performed by 30 subjects, and the 3-axial linear acceleration and 3-axial angular velocity were collected at a constant rate of 50Hz.
2D convolutional layers for Chars74k Dataset. Chars74k dataset contains 64 classes(0-9, A-Z, a-z), 7705 characters obtained from natural images, 3410 hand-drawn characters using a tablet PC and 62992 synthesised characters from computer fonts giving a total of over 74K images.
In the final part of the thesis, network robustness and resilience for neural network models is investigated from adversarial examples (AEs) and automatic driving conditions. The transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, explicit content detection, optical character recognition(OCR), and object detection are investigated. It represents the cybercriminal’s situation where an ensemble of different detection mechanisms need to be evaded all at once.Novel dispersion Reduction(DR) attack is designed, which is a practical attack that overcomes existing attacks’ limitation of requiring task-specific loss functions by targeting on the “dispersion” of internal feature map. In the autonomous driving scenario, the adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving is studied. A novel attack technique, tracker hijacking, that can effectively fool Multi-Object Tracking (MOT) using AEs on object detection is presented. Using this technique, successful AEs on as few as one single frame can move an existing object in to or out of the headway of an autonomous vehicle to cause potential safety hazards.
Lu, Yantao, "HUMAN ACTIVITY RECOGNITION FROM EGOCENTRIC VIDEOS AND ROBUSTNESS ANALYSIS OF DEEP NEURAL NETWORKS" (2020). Dissertations - ALL. 1161.