Matching Items (8)
Description
The use of exams for classification purposes has become prevalent across many fields, including professional assessment for employment screening and standards-based testing in educational settings. Classification exams assign individuals to performance groups by comparing their observed test scores to a pre-selected criterion (e.g., masters vs. nonmasters in dichotomous classification scenarios). The successful use of exams for classification purposes assumes at least minimal levels of accuracy of these classifications. Classification accuracy is an index that reflects the rate of correct classification of individuals into the same category that contains their true ability score. Traditional approaches estimate classification accuracy via methods that assume true scores follow a four-parameter beta-binomial distribution. Recent research suggests that Item Response Theory may be a preferable alternative framework for estimating examinees' true scores and may return more accurate classifications based on these scores. Researchers hypothesized that test length, the location of the cut score, the distribution of items, and the distribution of examinee ability would impact the recovery of accurate estimates of classification accuracy. The current simulation study manipulated these factors to assess their potential influence on classification accuracy. Observed classification as masters vs. nonmasters, true classification accuracy, estimated classification accuracy, BIAS, and RMSE were analyzed. In addition, Analysis of Variance tests were conducted to determine whether an interrelationship existed between levels of the four manipulated factors. Results showed small values of estimated classification accuracy and increased BIAS in accuracy estimates with few items, mismatched distributions of item difficulty and examinee ability, and extreme cut scores. A significant four-way interaction between manipulated variables was observed. In addition to interpretations of these findings and explanations of potential causes for the recovered values, recommendations that inform practice and avenues of future research are provided.
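The core index the abstract describes can be sketched in a few lines. This is a minimal hypothetical illustration, not the study's actual simulation code: the examinee count, test length, cut score, and ability distribution below are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: 1,000 examinees, a 20-item test, cut score of 12
# (none of these values come from the study itself).
n_examinees, n_items, cut = 1_000, 20, 12

# True proportion-correct ability, clipped to a valid probability.
true_ability = rng.normal(0.6, 0.15, n_examinees).clip(0.0, 1.0)
true_score = true_ability * n_items            # true (expected) test score
observed_score = rng.binomial(n_items, true_ability)  # observed number-correct

# Classify each examinee against the cut score twice: once using the
# true score, once using the observed score.
true_master = true_score >= cut
observed_master = observed_score >= cut

# Classification accuracy: the rate at which observed classifications
# place examinees in the same category as their true score would.
accuracy = np.mean(true_master == observed_master)
print(f"classification accuracy: {accuracy:.3f}")
```

Repeating this over many replications and comparing estimated accuracy to the true rate would yield the bias and RMSE summaries the abstract analyzes.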
Contributors: Kunze, Katie (Author) / Gorin, Joanna (Thesis advisor) / Levy, Roy (Thesis advisor) / Green, Samuel (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
The existing minima for sample size and test length recommendations for DIMTEST (750 examinees and 25 items) are tied to features of the procedure that are no longer in use. The current version of DIMTEST uses a bootstrapping procedure to remove bias from the test statistic and is packaged with a conditional covariance-based procedure called ATFIND for partitioning test items. Key factors such as sample size, test length, test structure, the correlation between dimensions, and strength of dependence were manipulated in a Monte Carlo study to assess the effectiveness of the current version of DIMTEST with fewer examinees and items. In addition, the DETECT program was also used to partition test items; a second feature of this study compared the structure of test partitions obtained with ATFIND and DETECT in a number of ways. With some exceptions, the performance of DIMTEST was quite conservative in unidimensional conditions. The performance of DIMTEST in multidimensional conditions depended on each of the manipulated factors, and suggested that the minima for sample size and test length can be lowered for some conditions. In terms of partitioning test items in unidimensional conditions, DETECT tended to produce longer assessment subtests than ATFIND, in turn yielding different test partitions. In multidimensional conditions, test partitions became more similar and more accurate with increased sample size, factorially simple data, greater strength of dependence, and a decreased correlation between dimensions. Recommendations for sample size and test length minima are provided along with suggestions for future research.
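Comparing the structure of two test partitions, as the study does for ATFIND and DETECT output, can be illustrated with a simple pair-counting agreement index. The item-to-subtest labels below are invented for the example; they are not output from either program.

```python
from itertools import combinations

# Hypothetical partitions of ten items into subtests
# (e.g., ATFIND-style vs. DETECT-style groupings; labels are made up).
partition_a = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
partition_b = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree:
    grouped together in both, or separated in both."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)

print(f"Rand index: {rand_index(partition_a, partition_b):.3f}")
```

An index of 1.0 means identical groupings; values near 0 mean the two programs partitioned the items very differently.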
Contributors: Fay, Derek (Author) / Levy, Roy (Thesis advisor) / Green, Samuel (Committee member) / Gorin, Joanna (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
This study investigated the possibility of item parameter drift (IPD) in a calculus placement examination administered to approximately 3,000 students at a large university in the United States. A single form of the exam was administered continuously for a period of two years, possibly allowing later examinees to have prior knowledge of specific items on the exam. An analysis of IPD was conducted to explore evidence of possible item exposure. Two assumptions concerning item exposure were made: 1) item recall and item exposure are positively correlated, and 2) item exposure results in the items becoming easier over time. Special consideration was given to two contextual item characteristics: 1) item location within the test, specifically items at the beginning and end of the exam, and 2) the use of an associated diagram. The hypotheses stated that these item characteristics would make the items easier to recall and, therefore, more likely to be exposed, resulting in item drift. BILOG-MG 3 was used to calibrate the items and assess for IPD. No evidence was found to support the hypotheses that items located at the beginning of the test or with an associated diagram drifted as a result of item exposure. Three items among the last ten on the exam drifted significantly and became easier, consistent with item exposure. However, in this study, the possible effects of item exposure could not be separated from the effects of other potential factors such as speededness, curriculum changes, better test preparation on the part of subsequent examinees, or guessing.
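The logic of screening an item for drift toward easiness can be sketched with a simple two-proportion comparison across administration windows. This is only an illustration of the idea: the study itself used IRT calibration in BILOG-MG 3, and the counts below are invented, not data from the exam.

```python
from math import sqrt

# Hypothetical counts: how many examinees answered one item correctly
# in the first year vs. the second year of continuous administration.
early_correct, early_n = 410, 800
late_correct, late_n = 520, 800

p1 = early_correct / early_n     # early proportion-correct
p2 = late_correct / late_n       # late proportion-correct

# Pooled two-proportion z statistic; a large positive z means the item
# became easier over time, consistent with the exposure hypothesis.
p_pool = (early_correct + late_correct) / (early_n + late_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / early_n + 1 / late_n))
z = (p2 - p1) / se

print(f"early p = {p1:.3f}, late p = {p2:.3f}, z = {z:.2f}")
```

As the abstract cautions, a significant drift statistic alone cannot distinguish exposure from speededness, curriculum changes, or other confounds.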
Contributors: Krause, Janet (Author) / Levy, Roy (Thesis advisor) / Thompson, Marilyn (Thesis advisor) / Gorin, Joanna (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
This study investigated the impact of learning about cultural intelligence (CQ) from senior U.S. Army Special Forces leaders (Group Commanders and Group Command Sergeants Major) on aspiring Special Forces Captains (students) at the Captains Career Course. Three research questions addressed the influence of senior leader interventions on students’ CQ scores, motivation to work with partner forces, and intentions to improve CQ. The study involved quantitative and qualitative data for each of three comparison groups: control, face-to-face (in-person interaction with senior leaders), and podcast (audio-only recordings). The quantitative data measured the CQ capabilities of motivation, cognition, metacognition, and behavior. Descriptive statistics revealed that from the pre-test to the post-test, the control and podcast groups experienced increased self-assessment scores on all four constructs but decreased observer assessment scores. By contrast, the face-to-face group experienced both a decrease in observer assessment scores and a marginal decrease in self-assessment scores (on motivation and metacognition). Exploring motivation to work with partner forces, analysis of the group interview transcripts revealed that the control group attributed their motivation primarily to their prior experiences, while participants in the face-to-face group reported mixed feelings regarding prior experiences but highlighted the impact of senior Special Forces leaders’ stories on their motivation. The podcast group credited their course experience and the senior leaders’ narratives for their increased motivation. Examining the influence of senior leader stories on intent to improve CQ, the control group provided generic responses focused on improving cognition, while the face-to-face group offered more specific, action-oriented answers emphasizing business systems, sociolinguistics, and cultural values. The podcast group produced varying responses, with some sharing basic intent and others detailing specific strategies such as language fluency and cultural immersion. Participants across all three groups expressed a strong intention to seek out mentorship and stories from experienced individuals. In conclusion, this study highlights the myriad influences on aspiring Special Forces Captains’ CQ and the multifaceted impact of senior Special Forces leaders’ stories. The narratives contributed to increased motivation, a deeper understanding of the Special Forces mission, and specific strategies for improving CQ, providing valuable insights for military education and training programs.
Contributors: Kohistany, Mahboba Lyla (Author) / Dorn, Sherman (Thesis advisor) / Livermore, David (Committee member) / Amrein-Beardsley, Audrey (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
In this mixed-methods study, I sought to design and develop a test delivery method to reduce linguistic bias in English-based mathematics tests. Guided by translanguaging, a recent linguistic theory recognizing the complexity of multilingualism, I designed a computer-based test delivery method allowing test-takers to toggle between English and their self-identified dominant language. This three-part study asks and answers research questions from all phases of the novel test delivery design. In the first phase, I conducted cognitive interviews with 11 Mandarin Chinese-dominant and 11 Spanish-dominant undergraduate students while they took a well-regarded calculus conceptual exam, the Precalculus Concept Assessment (PCA). In the second phase, I designed and developed the linguistically adaptive test (LAT) version of the PCA using the Concerto test delivery platform. In the third phase, I conducted a within-subjects random-assignment study of the efficacy of the LAT. I also conducted in-depth interviews with a subset of the test-takers. Nine items on the PCA revealed linguistic issues during the cognitive interviews, demonstrating the need to address linguistic bias in the test items. Additionally, the newly developed LAT demonstrated evidence of reliability and validity. However, the large-scale efficacy study showed that the LAT did not appear to make a significant difference in scores for dominant speakers of Spanish or dominant speakers of Mandarin Chinese. This finding held true for overall test scores as well as at the item level, indicating that the LAT test delivery system does not appear to reduce linguistic bias in testing. Additionally, in-depth interviews revealed that many students felt that the linguistically adaptive test was either the same or essentially the same as the non-LAT version of the test. Some participants felt that the toggle button was not necessary if they could understand the mathematics item well enough. As one participant noted, “It's math, it's math. It doesn't matter if it's in English or in Spanish.” This dissertation concludes with a discussion of the implications for test developers and suggestions for future directions of study.
Contributors: Close, Kevin (Author) / Zheng, Yi (Thesis advisor) / Amrein-Beardsley, Audrey (Thesis advisor) / Anderson, Kate (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
This study applied the Social Cognitive Theory (SCT) to explore the sources of self-efficacy and professional development activities that are most predictive of PreK-6 music teachers’ efficacious beliefs. This study also compared teacher efficacy levels across different groups. The target population for this study was PreK-6 music teachers in the state of Arizona. The survey was disseminated through the National Association for Music Education (NAfME), the Arizona chapters of the American Orff-Schulwerk Association (AOSA), the Organization of American Kodály Educators (OAKE), and snowball sampling via a Facebook message. Of the 660 teachers invited to participate, 92 (13.94%) voluntarily completed the survey. Results from simultaneous multiple regression analyses indicated that teacher efficacy for instructional strategies was best predicted by their mastery experience, followed by vicarious experience, while mastery experience was the strongest predictor of teacher efficacy for student engagement. Additionally, the acquisition of method certification and watching teaching resources via YouTube were significant predictors of teacher efficacy for instructional strategies, while observation hours per year was the only predictor of teacher efficacy for student engagement. Results from factorial between-subjects ANOVAs indicated that teaching experience had a significant main effect on teacher efficacy for instructional strategies and student engagement. However, neither main teaching areas nor the combined effects of main teaching areas and teaching experience had a significant effect on teacher efficacy for instructional strategies and student engagement. Results from independent-samples t-test analyses showed that school types had a significant effect on teacher efficacy for student engagement, while no differences were found between school types regarding teacher efficacy for instructional strategies. 
The analysis of open-ended comments identified themes related to the factors that strengthen or weaken participant teacher efficacy, the impact of the COVID-19 pandemic on teacher efficacy, the types of professional development activities that participants engaged in during the year, and the most effective professional development activities for enhancing teacher efficacy. Findings of this study have theoretical and practical implications for school principals, school administrators, policy makers, music teacher educators, and music teachers seeking to promote and support music teachers’ self-efficacy.
Contributors: Cha, Dong-Ju (Author) / Amrein-Beardsley, Audrey (Thesis advisor) / Stauffer, Sandra (Thesis advisor) / Fiorentino, Matthew (Committee member) / Schmidt, Margaret (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
Over the past 20 years in the United States (U.S.), teachers have seen a marked shift in how teacher evaluation policies govern the evaluation of their performance. Spurred by federal mandates, teachers have been increasingly held accountable for their students’ academic achievement, most notably through the use of value-added models (VAMs), a statistically complex tool that aims to isolate and then quantify the effect of teachers on their students’ achievement. This increased focus on accountability ultimately resulted in numerous lawsuits across the U.S. where teachers protested what they felt were unfair evaluations informed by invalid, unreliable, and biased measures, most notably VAMs.

While New Mexico’s teacher evaluation system was labeled as a “gold standard” due to its purported ability to objectively and accurately differentiate between effective and ineffective teachers, in 2015, teachers filed suit contesting the fairness and accuracy of their evaluations. Amrein-Beardsley and Geiger’s (revise and resubmit) initial analyses of the state’s teacher evaluation data revealed that the four individual measures comprising teachers’ overall evaluation scores showed evidence of bias; specifically, teachers who taught in schools with different student body compositions (e.g., special education students, poorer students, gifted students) had significantly different scores than their peers. The purpose of this study was to expand upon these prior analyses by investigating whether those conclusions still held true when controlling for a variety of confounding factors at the school, class, and teacher levels, as such covariates were not included in prior analyses.

Results from multiple linear regression analyses indicated that, overall, the measures used to inform New Mexico teachers’ overall evaluation scores still showed evidence of bias by school-level student demographic factors, with VAMs potentially being the most susceptible and classroom observations being the least. This study is especially unique given the juxtaposition of such a highly touted evaluation system also being one where teachers contested its constitutionality. Study findings are important for all education stakeholders to consider, especially as teacher evaluation systems and related policies continue to be transformed.
Contributors: Geiger, Tray (Author) / Amrein-Beardsley, Audrey (Thesis advisor) / Anderson, Kate (Committee member) / McGuire, Keon (Committee member) / Holloway, Jessica (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
This study examines validity evidence of a state policy-directed teacher evaluation system implemented in Arizona during school year 2012-2013. The purpose was to evaluate the warrant for making high stakes, consequential judgments of teacher competence based on value-added (VAM) estimates of instructional impact and observations of professional practice (PP). The research also explores educator influence (voice) in evaluation design and the role information brokers have in local decision making. Findings are situated in an evidentiary and policy context at both the LEA and state policy levels.

The study employs a single-phase, concurrent, mixed-methods research design triangulating multiple sources of qualitative and quantitative evidence onto a single (unified) validation construct: Teacher Instructional Quality. It focuses on assessing the characteristics of metrics used to construct quantitative ratings of instructional competence and the alignment of stakeholder perspectives to facets implicit in the evaluation framework. Validity examinations include assembly of criterion, content, reliability, consequential, and construct articulation evidence. Perceptual perspectives were obtained from teachers, principals, district leadership, and state policy decision makers. Data for this study came from a large suburban public school district in metropolitan Phoenix, Arizona.

Study findings suggest that the evaluation framework is insufficient for supporting high stakes, consequential inferences of teacher instructional quality. This is based, in part, on the following: (1) Weak associations between VAM and PP metrics; (2) Unstable VAM measures across time and between tested content areas; (3) Less than adequate scale reliabilities; (4) Lack of coherence between theorized and empirical PP factor structures; (5) Omission/underrepresentation of important instructional attributes/effects; (6) Stakeholder concerns over rater consistency, bias, and the inability of test scores to adequately represent instructional competence; (7) Negative sentiments regarding the system's ability to improve instructional competence and/or student learning; (8) Concerns regarding unintended consequences including increased stress, lower morale, harm to professional identity, and restricted learning opportunities; and (9) The general lack of empowerment and educator exclusion from the decision making process. Study findings also highlight the value of information brokers in policy decision making and the importance of having access to unbiased empirical information during the design and implementation phases of important change initiatives.
Contributors: Sloat, Edward F. (Author) / Wetzel, Keith (Thesis advisor) / Amrein-Beardsley, Audrey (Thesis advisor) / Ewbank, Ann (Committee member) / Shough, Lori (Committee member) / Arizona State University (Publisher)
Created: 2015