<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-05-17T01:53:18Z</responseDate><request verb="GetRecord" metadataPrefix="oai_dc">https://keep.lib.asu.edu/oai/request</request><GetRecord><record><header><identifier>oai:keep.lib.asu.edu:node-201230</identifier><datestamp>2025-05-05T15:53:02Z</datestamp><setSpec>oai_pmh:all</setSpec><setSpec>oai_pmh:repo_items</setSpec></header><metadata><oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><dc:identifier>201230</dc:identifier>
          <dc:identifier>https://hdl.handle.net/2286/R.2.N.201230</dc:identifier>
                  <dc:rights>http://rightsstatements.org/vocab/InC/1.0/</dc:rights>
          <dc:rights>All Rights Reserved</dc:rights>
                  <dc:date>2025</dc:date>
                  <dc:format>66 pages</dc:format>
                  <dc:type>Masters Thesis</dc:type>
          <dc:type>Academic theses</dc:type>
                  <dc:language>en</dc:language>
                  <dc:contributor>Clark, Ethan</dc:contributor>
          <dc:contributor>Zhang, Yu</dc:contributor>
          <dc:contributor>Srivastava, Siddharth</dc:contributor>
          <dc:contributor>Gopalan, Nakul</dc:contributor>
          <dc:contributor>Arizona State University</dc:contributor>
                  <dc:description>Partial requirement for: M.S., Arizona State University, 2025</dc:description>
          <dc:description>Field of study: Computer Science</dc:description>
          <dc:description>Reinforcement learning (RL) is a powerful framework for optimizing sequential decision-making. However, this sequential nature implicitly assumes that action order matters, making traditional RL methods inefficient for problems where action ordering does not affect outcomes. This research identifies and formalizes the Commutative Markov Decision Process (MDP), a distinct class of reinforcement learning problems in which final states and rewards depend only on which actions are taken, not on their sequence. In such problems, traditional RL methods treat each possible ordering of the same actions as an entirely different trajectory, despite all orderings leading to identical outcomes.

To solve Commutative MDPs efficiently, this research introduces Commutative RL, a novel approach that exploits the order-invariance property through algorithmic modifications to Q-learning. Three distinct implementations are developed and evaluated: Super Action, Combined Reward, and Hash Map. Each implementation enforces equivalence between permuted action sequences while preserving the theoretical guarantees of traditional RL. By recognizing that different orderings of the same actions yield equivalent outcomes, these algorithms converge to optimal policies faster by eliminating redundant exploration.

The effectiveness of these implementations is evaluated across two domains of increasing complexity. The component selection task provides a controlled environment with mathematically verifiable optimal solutions where Commutative RL demonstrates significant advantages in convergence speed. The second domain introduces Environment Reconfiguration (ER)—a novel meta-optimization problem where &quot;instrumental&quot; actions modify the environment&#039;s structure and dynamics. This city planning application demonstrates how ER problems can be reduced to Commutative MDPs, enabling efficient optimization of urban infrastructure through strategic bridge placement despite stochastic outcomes. In domains exhibiting the necessary order-invariant properties, the experimental results show that Commutative RL implementations converge to optimal policies substantially faster than traditional Q-learning approaches.

This research establishes a new direction in reinforcement learning by identifying and addressing a specific class of problems with unique structural properties. By formalizing Commutative MDPs and developing initial algorithmic approaches that exploit order-invariance, this work lays theoretical foundations for a promising research area. The experimental results demonstrate the potential for significant sample efficiency improvements in domains where these properties naturally occur, such as resource allocation, inventory management, and infrastructure planning.

</dc:description>
                  <dc:subject>Artificial Intelligence</dc:subject>
          <dc:subject>Computer Science</dc:subject>
          <dc:subject>Deep learning</dc:subject>
          <dc:subject>Markov Decision Processes</dc:subject>
          <dc:subject>Reinforcement Learning</dc:subject>
                  <dc:title>Reinforcement Learning for Commutative Markov Decision Processes</dc:title></oai_dc:dc></metadata></record></GetRecord></OAI-PMH>
