Date of Award

12-20-2024

Date Published

January 2023

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical Engineering and Computer Science

Advisor(s)

Asif Salekin

Second Advisor

Jonathan Preston

Subject Categories

Computer Sciences | Physical Sciences and Mathematics

Abstract

Deep neural networks are extensively applied to real-world tasks, and in many cases the data feeding these tasks is human-generated content, which makes privacy and data protection critical. The diverse modalities of this user data provide fertile ground for third parties to monetize it. This work considers two such modalities: user speech when interacting with smart speaker voice assistants (VAs), and images shared with online service providers. In both cases, the user would like to support some forms of machine learning (ML) inference while disallowing others. A user interacting with a smart speaker wants the supporting automatic speech recognition (ASR) network to understand the command; for images, facial recognition or object detection might be authorized use cases. Other inferences, such as speech emotion recognition (SER) for audio or facial expression recognition for images, might be considered invasive by the user sharing the data. This dissertation presents work aimed at protecting user privacy in these two modalities.

DARE-GP protects speech from SER when the user is interacting with a smart speaker VA. It creates additive noise that masks users' emotional information while preserving the transcription-relevant portions of their speech: a constrained genetic programming approach learns the spectral frequency traits that encode target users' emotional content and then generates a universal adversarial audio perturbation that provides this privacy protection. Unlike existing works, DARE-GP provides (a) real-time protection of previously unheard utterances, (b) against previously unseen black-box SER classifiers, (c) while protecting speech transcription, and (d) in a realistic acoustic environment. Further, this evasion is robust against defenses employed by a knowledgeable adversary.
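The core deployment idea, a single pre-learned ("universal") perturbation overlaid on any live utterance, can be sketched as follows. This is a minimal illustration of additive universal perturbation, not DARE-GP's actual implementation; the perturbation itself would be learned offline (via the constrained genetic programming search), and the function name and `scale` parameter here are illustrative assumptions.

```python
import numpy as np

def apply_universal_perturbation(speech, perturbation, scale=0.05):
    """Overlay a fixed (universal) additive perturbation onto an utterance.

    `speech` and `perturbation` are 1-D float arrays in [-1, 1]. The
    perturbation is tiled or truncated to the utterance length, scaled so
    it stays quiet relative to the speech, and the sum is clipped back to
    the valid audio range. Because the perturbation is fixed in advance,
    this can run in real time on previously unheard utterances.
    """
    reps = int(np.ceil(len(speech) / len(perturbation)))
    noise = np.tile(perturbation, reps)[: len(speech)]
    return np.clip(speech + scale * noise, -1.0, 1.0)
```

Because no per-utterance optimization is needed at playback time, a small device can emit the perturbation concurrently with the user's speech.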
The evaluations in this work culminate in acoustic evaluations against two off-the-shelf commercial smart speakers, using a small-form-factor device (a Raspberry Pi) integrated with a wake-word system to evaluate the efficacy of a real-world, real-time deployment.

E-MUSEUM protects user images from unauthorized inference by generating a noisy variant of a user image that allows normal inference by an authorized classifier while simultaneously degrading the inference of other classifiers. E-MUSEUM does this in a black-box setting, with no access to the unauthorized classifiers. We validate E-MUSEUM's efficacy on the ImageNet, CelebA-HQ, and AffectNet datasets for image, identity, and emotion classification tasks, respectively. Results show that the generated images successfully maintain the accuracy of an authorized model while degrading the average accuracy of unauthorized black-box models to 11.97%, 6.63%, and 55.51% on ImageNet, CelebA-HQ, and AffectNet, respectively. We further demonstrate cross-task evasion, where the authorized and unauthorized classifiers perform different tasks.
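The selective-protection objective described above, preserve the authorized classifier's output while degrading unauthorized ones, can be expressed as a black-box fitness score over candidate perturbations. This is a hedged sketch of the objective only, not E-MUSEUM's method or API; the function names and the simple "keep minus leak" scoring are assumptions for illustration, and it uses only classifier outputs (no gradients), consistent with the black-box setting.

```python
import numpy as np

def protection_fitness(image, delta, authorized_fn, unauthorized_fns, true_label):
    """Score a candidate perturbation `delta` for a user image (higher is better).

    `authorized_fn` and each entry of `unauthorized_fns` map an image to a
    probability vector. The score rewards the authorized classifier's
    confidence in the correct label and penalizes the average top
    confidence of the unauthorized (black-box) classifiers.
    """
    x = np.clip(image + delta, 0.0, 1.0)          # keep pixels in valid range
    keep = authorized_fn(x)[true_label]           # preserve authorized use
    leak = np.mean([f(x).max() for f in unauthorized_fns])  # degrade others
    return keep - leak
```

A black-box optimizer (e.g., an evolutionary search) could maximize this score over `delta` to produce the protected image variant.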

Access

Open Access
