Description
This work considers the task of vision-and-language inference (VLI): predicting whether an input
sentence is true for given images or videos. It starts with an investigation of model robustness to
a set of 13 linguistic transformations, categorized as Semantics-Preserving or Semantics-Inverting
based on whether they change the meaning of the sentence. Existing VLI models are observed to
degenerate to close-to-random performance when tested on these linguistic transformations, which
include simple phenomena such as synonyms, antonyms, negation, swapping of subject and object,
paraphrasing, and the substitution of pronouns, comparatives, and numbers.
This observation is utilized to design STAT (Semantics-Transformed Adversarial Training), a
model-agnostic and task-agnostic min-max optimization algorithm, with an inner maximization
that utilizes semantic perturbations of input sentences to find adversarial samples and an outer
minimization that updates model parameters. Extensive experiments on three benchmark datasets
(NLVR2, VIOLIN, VQA "Yes-No") not only demonstrate large gains in robustness to adversarial
input sentences but also show model-agnostic performance improvements. This work also presents
the suite of linguistic transformations as a robustness benchmark that may benefit future research
in vision and language robustness.
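The min-max structure described above can be illustrated with a minimal sketch. It is not the thesis implementation: the toy linear model, the feature vectors, the learning rate, and the names predict, loss, and stat_step are all hypothetical, and it assumes that Semantics-Preserving variants keep the original label while Semantics-Inverting variants flip it.

```python
import math

def predict(w, x):
    """Probability that the sentence is true, from a toy linear model (assumption)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Binary cross-entropy for one (features, label) pair."""
    p = min(max(predict(w, x), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def stat_step(w, candidates, lr=0.1):
    """One hypothetical min-max step over a sentence and its transformed variants."""
    # Inner maximization: pick the variant on which the current model does worst.
    worst_x, worst_y = max(candidates, key=lambda c: loss(w, c[0], c[1]))
    # Outer minimization: one gradient-descent step on that worst-case variant.
    p = predict(w, worst_x)
    new_w = [wi - lr * (p - worst_y) * xi for wi, xi in zip(w, worst_x)]
    return new_w, (worst_x, worst_y)

# Made-up joint image-sentence features, for illustration only.
w = [1.0, -1.0]
candidates = [
    ([1.0, 0.0], 1),  # original sentence, label "true"
    ([0.9, 0.1], 1),  # semantics-preserving variant: label kept (assumption)
    ([0.8, 0.3], 0),  # semantics-inverting variant: label flipped (assumption)
]
new_w, worst = stat_step(w, candidates)
```

In this toy run the semantics-inverting variant is selected as the adversarial sample, and the update lowers the loss on it. A real system would replace the linear model with a VLI network and the hand-written features with its learned representations.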
Details
Title
- Robust Vision and Language Inference via Semantics Transformed Adversarial Training
Contributors
- Chaudhary, Abhishek (Author)
- Yang, Yezhou Dr. (Thesis advisor)
- Li, Baoxin Dr. (Committee member)
- Baral, Chitta Dr. (Committee member)
- Arizona State University (Publisher)
Date Created
2021
Note
- Partial requirement for: M.S., Arizona State University, 2021
- Field of study: Computer Science