2026-05-22T06:48:23Z https://keep.lib.asu.edu/oai/request

oai:keep.lib.asu.edu:node-201198 2025-05-05T15:53:02Z oai_pmh:all oai_pmh:repo_items

Identifier: 201198
Handle: https://hdl.handle.net/2286/R.2.N.201198
Rights: http://rightsstatements.org/vocab/InC/1.0/ (All Rights Reserved)
Date: 2025
Extent: 82 pages
Type: Masters Thesis; Academic theses
Language: en
Names: Zinjad, Saurabh Bhausaheb; Liu, Huan; Davulcu, Hasan; Gupta, Vivek
Institution: Arizona State University
Partial requirement for: M.S., Arizona State University, 2025
Field of study: Computer Science

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) exhibit concerning vulnerabilities to minor input perturbations, posing risks in safety-critical applications. This thesis systematically analyzes how semantic-preserving prompt perturbations (at the character, word, and sentence levels) affect LLM safety behavior. It investigates whether even slight rephrasings can cause an otherwise aligned model to flip from a safe (refusal or benign) response to an unsafe one, or vice versa. The study also identifies key perturbation characteristics that drive such flips and examines how flip likelihood varies across different categories of harmful content.

To answer these questions, the study evaluates multiple advanced LLMs (LLaMA2, LLaMA3, Mistral) on the CatQA dataset of harmful queries. Each query is perturbed by applying controlled character-level typos, word-level substitutions, and sentence-level paraphrases while preserving semantic meaning. Model outputs are assessed using the automated safety evaluation tool LlamaGuard. Empirical results show that sentence-level paraphrasing consistently improves safety compliance, whereas fine-grained character-level noise often degrades it due to tokenization weaknesses. Word-level changes yield the most inconsistent behavior, with random insertions particularly likely to elicit unsafe outputs. Among the models tested, LLaMA3 was the most vulnerable, exhibiting the highest rate of unsafe responses under perturbation.

Overall, these findings underscore the importance of perturbation-aware and context-aware safety evaluation and offer practical insights for improving LLM alignment in real-world deployments.

Subjects: Computer Science; Artificial Intelligence; Information science
Keywords: Input Sensitivity; Prompt Perturbations; Robustness Evaluation; Safety Alignment; Safety Flip Rates
Title: Can Typos Cause Harm? The Impact of Imperfect Input on LLM Safety
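
The perturbation-and-flip evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names (`swap_adjacent_chars`, `insert_random_word`, `flip_rate`), the filler vocabulary, and the "safe"/"unsafe" string labels are hypothetical, and the model-generation and LlamaGuard classification steps are omitted. The sketch assumes safety labels have already been obtained for each original and perturbed prompt.

```python
import random


def swap_adjacent_chars(text, rng):
    """Character-level typo (illustrative): swap one random pair of adjacent letters."""
    positions = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha()
    ]
    if not positions:
        return text  # nothing to swap, return the input unchanged
    i = rng.choice(positions)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def insert_random_word(text, rng, vocabulary):
    """Word-level perturbation (illustrative): insert one word at a random position."""
    words = text.split()
    position = rng.randint(0, len(words))  # randint is inclusive on both ends
    words.insert(position, rng.choice(vocabulary))
    return " ".join(words)


def flip_rate(original_labels, perturbed_labels):
    """Fraction of prompts whose safety label changed after perturbation.

    Labels are assumed to be comparable values such as "safe" / "unsafe",
    produced by an external safety classifier (not called here).
    """
    if len(original_labels) != len(perturbed_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        return 0.0
    flips = sum(a != b for a, b in zip(original_labels, perturbed_labels))
    return flips / len(original_labels)
```

For example, `flip_rate(["safe", "safe", "unsafe", "safe"], ["safe", "unsafe", "unsafe", "unsafe"])` returns `0.5`, since two of the four labels changed. Passing a seeded `random.Random` instance as `rng` keeps the perturbations reproducible across runs.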