2026-05-20T05:00:53Z https://keep.lib.asu.edu/oai/request

oai:keep.lib.asu.edu:node-200941 2025-04-29T18:01:09Z oai_pmh:all oai_pmh:repo_items

200941
https://hdl.handle.net/2286/R.2.N.200941
http://rightsstatements.org/vocab/InC/1.0/
All Rights Reserved
2025
143 pages
Doctoral Dissertation
Academic theses
en
Cheng, Sheng; Yang, Yezhou; Ren, Yi; Lee, Kookjin; Xiao, Chaowei; Gokhale, Tejas
Arizona State University
Partial requirement for: Ph.D., Arizona State University, 2025
Field of study: Computer Science

Computer vision research has expanded rapidly from deterministic tasks such as recognition and classification to generative tasks such as image and multimodal content creation. Despite these advances, generalization remains a core challenge, because training data often fail to capture the full complexity of real-world scenarios. This dissertation addresses the problem along two main lines: extracting universal representations from limited data, and expanding the training distribution through data augmentation.

The first line of work draws inspiration from Gestalt principles, which underscore the importance of intermediate structures in human perception, such as shape, color, and physical state. Concretely, this dissertation explores (1) an explicit representation of sketches that decomposes each sketch into stroke-based elements to capture essential structural details; (2) an explicit representation of physical states in video-based environments, demonstrating how objects in the video obey common laws of motion; and (3) an implicit representation of physical states that employs Markov chain Monte Carlo sampling and energy-based models.

The second line focuses on expanding the training distribution for better generalization. This dissertation introduces (1) an adversarial Bayesian augmentation framework for image classification, which improves model generalization to style-transfer or subpopulation shifts; and (2) a study of synthetic caption generation that uses multimodal language models to improve text-to-image alignment.
Together, these efforts strengthen the generalization ability of vision systems across a range of applications.

Computer Science
Enhancing Generalization in Computer Vision: Universal Representations and Data Augmentation