Robust Vision and Language Inference via Semantics Transformed Adversarial Training

Chaudhary, Abhishek

This work considers the task of vision-and-language inference (VLI): predicting whether an input sentence is true for given images or videos. It begins with an investigation of model robustness to a set of 13 linguistic transformations, categorized as Semantics-Preserving or Semantics-Inverting based on whether they change the meaning of the sentence. Existing VLI models are observed to degrade to close-to-random performance when tested on these transformations, which include simple phenomena such as synonyms, antonyms, negation, swapping of subject and object, paraphrasing, and the substitution of pronouns, comparatives, and numbers. This observation is used to design STAT (Semantics-Transformed Adversarial Training), a model-agnostic and task-agnostic min-max optimization algorithm with an inner maximization that uses semantic perturbations of input sentences to find adversarial samples and an outer minimization that updates model parameters. Extensive experiments on three benchmark datasets (NLVR2, VIOLIN, VQA "Yes-No") not only demonstrate large gains in robustness to adversarial input sentences but also show model-agnostic performance improvements. This work also presents the suite of linguistic transformations as a robustness benchmark that may benefit future research in vision and language robustness.
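As a rough illustration of the inner maximization described above, the following minimal Python sketch picks, from a sentence and its transformed variants, the candidate with the highest loss. It is not the STAT implementation: the function names (`negate`, `add_filler`, `inner_max`, `toy_loss`), the two toy transformations, and the stand-in loss are all hypothetical, and the abstract does not specify how adversarial samples are actually selected.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types: a transformation maps (sentence, label) to a perturbed pair.
# Semantics-preserving transformations keep the label; semantics-inverting ones flip it.
Example = Tuple[str, int]
Transform = Callable[[str, int], Example]


def negate(sentence: str, label: int) -> Example:
    """Toy semantics-inverting transform: negate the first ' is ' and flip the label."""
    return sentence.replace(" is ", " is not ", 1), 1 - label


def add_filler(sentence: str, label: int) -> Example:
    """Toy semantics-preserving transform: prepend a filler phrase, keep the label."""
    return "it is true that " + sentence, label


def inner_max(
    sentence: str,
    label: int,
    transforms: Sequence[Transform],
    loss_fn: Callable[[str, int], float],
) -> Example:
    """Return the candidate (original or transformed) with the highest loss."""
    candidates = [(sentence, label)] + [t(sentence, label) for t in transforms]
    return max(candidates, key=lambda c: loss_fn(c[0], c[1]))


def toy_loss(sentence: str, label: int) -> float:
    """Stand-in for a model loss (assumption): the 'model' always predicts 1 (true)."""
    return 0.0 if label == 1 else 1.0


worst = inner_max("the dog is on the sofa", 1, [negate, add_filler], toy_loss)
print(worst)
```

In a full training loop, the outer minimization would then take a gradient step on the loss of the selected candidate; that step depends on the model and is omitted here.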
