A comparison of DIMTEST and generalized dimensionality discrepancy approaches to assessing dimensionality in item response theory

Description

Dimensionality assessment is an important component of evaluating item response data. Existing approaches to evaluating common assumptions of unidimensionality, such as DIMTEST (Nandakumar & Stout, 1993; Stout, 1987; Stout, Froelich, & Gao, 2001), have been shown to work well under large-scale assessment conditions (e.g., large sample sizes and item pools; see e.g., Froelich & Habing, 2007). It remains to be seen how such procedures perform in the context of small-scale assessments characterized by relatively small sample sizes and/or short tests. The fact that some procedures come with minimum allowable values for characteristics of the data, such as the number of items, may even render them unusable for some small-scale assessments. Other measures designed to assess dimensionality do not come with such limitations and, as such, may perform better under conditions that do not lend themselves to evaluation via statistics that rely on asymptotic theory. The current work aimed to evaluate the performance of one such metric, the standardized generalized dimensionality discrepancy measure (SGDDM; Levy & Svetina, 2011; Levy, Xu, Yel, & Svetina, 2012), under both large- and small-scale testing conditions. A Monte Carlo study was conducted to compare the performance of DIMTEST and the SGDDM statistic in terms of evaluating assumptions of unidimensionality in item response data under a variety of conditions, with an emphasis on the examination of these procedures in small-scale assessments. Similar to previous research, increases in either test length or sample size resulted in increased power. The DIMTEST procedure appeared to be a conservative test of the null hypothesis of unidimensionality. The SGDDM statistic exhibited rejection rates near the nominal rate of .05 under unidimensional conditions, though the reliability of these results may have been less than optimal due to high sampling variability resulting from a relatively limited number of replications. 
Power values were at or near 1.0 for many of the multidimensional conditions. It was only when the sample size was reduced to N = 100 that the two approaches diverged in performance. Results suggested that both procedures may be appropriate for sample sizes as low as N = 250 and tests as short as J = 12 (SGDDM) or J = 19 (DIMTEST). When used as a diagnostic tool, SGDDM may be appropriate with as few as N = 100 cases combined with J = 12 items. The study was somewhat limited in that it did not include any complex factorial designs, nor was the strength of item discrimination parameters or the correlation between factors manipulated. It is recommended that further research be conducted with the inclusion of these factors, as well as an increase in the number of replications when using the SGDDM procedure.
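The remark above about sampling variability under a limited number of replications can be illustrated with the binomial standard error of an empirical rejection rate. This is a minimal sketch, not the procedure used in the thesis; the function name and the replication counts are assumptions chosen only for illustration.

```python
import math


def rejection_rate_se(rate: float, replications: int) -> float:
    """Monte Carlo standard error of an empirical rejection rate.

    Treats each replication as an independent Bernoulli trial, so the
    standard error is sqrt(p * (1 - p) / R).
    """
    if replications <= 0:
        raise ValueError("replications must be positive")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    return math.sqrt(rate * (1.0 - rate) / replications)


if __name__ == "__main__":
    # Hypothetical replication counts; the thesis abstract does not report them.
    for reps in (50, 100, 1000):
        se = rejection_rate_se(0.05, reps)
        print(f"R = {reps:4d}: SE of a .05 rejection rate = {se:.4f}")
```

Under these assumptions, fewer replications give a wider band around the nominal .05 rate, which is one way to read the caution about the SGDDM results.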

Contributors: Reichenberg, Ray E (Author) / Levy, Roy (Thesis advisor) / Thompson, Marilyn S. (Thesis advisor) / Green, Samuel B. (Committee member) / Arizona State University (Publisher)

Created: 2013

Posterior predictive model checking in Bayesian networks

Description

This simulation study compared the utility of various discrepancy measures within a posterior predictive model checking (PPMC) framework for detecting different types of data-model misfit in multidimensional Bayesian network (BN) models. The investigated conditions were motivated by an applied research program utilizing an operational complex performance assessment within a digital-simulation educational context grounded in theories of cognition and learning. BN models were manipulated along two factors: latent variable dependency structure and number of latent classes. Distributions of posterior predicted p-values (PPP-values) served as the primary outcome measure and were summarized in graphical presentations, by median values across replications, and by proportions of replications in which the PPP-values were extreme. An effect size measure for PPMC was introduced as a supplemental numerical summary to the PPP-value. Consistent with previous PPMC research, all investigated fit functions tended to perform conservatively, but Standardized Generalized Dimensionality Discrepancy Measure (SGDDM), Yen's Q3, and Hierarchy Consistency Index (HCI) only mildly so. Adequate power to detect at least some types of misfit was demonstrated by SGDDM, Q3, HCI, Item Consistency Index (ICI), and to a lesser extent Deviance, while proportion correct (PC), a chi-square-type item-fit measure, Ranked Probability Score (RPS), and Good's Logarithmic Scale (GLS) were powerless across all investigated factors. Bivariate SGDDM and Q3 were found to provide powerful and detailed feedback for all investigated types of misfit.
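As a rough illustration of the outcome measure described above, a PPP-value can be computed as the proportion of posterior draws in which the discrepancy for the posterior predicted data is at least as large as the discrepancy for the observed data. The sketch below assumes the discrepancy values have already been computed for each draw; the function names and the .05/.95 cutoffs for "extreme" are assumptions, not values taken from the study.

```python
from typing import Sequence


def ppp_value(realized: Sequence[float], predicted: Sequence[float]) -> float:
    """Proportion of posterior draws with D(y_rep, theta) >= D(y_obs, theta).

    realized[i] and predicted[i] are the discrepancy values for the observed
    and the posterior predicted data at the i-th posterior draw.
    """
    if len(realized) != len(predicted) or len(realized) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    exceed = sum(1 for d_obs, d_rep in zip(realized, predicted) if d_rep >= d_obs)
    return exceed / len(realized)


def is_extreme(ppp: float, lower: float = 0.05, upper: float = 0.95) -> bool:
    """Flag a PPP-value as extreme; the cutoffs are illustrative assumptions."""
    return ppp < lower or ppp > upper
```

PPP-values near .5 suggest the model reproduces the feature captured by the discrepancy measure, while values near 0 or 1 point to misfit.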

Contributors: Crawford, Aaron (Author) / Levy, Roy (Thesis advisor) / Green, Samuel (Committee member) / Thompson, Marilyn (Committee member) / Arizona State University (Publisher)

Created: 2014

Analytic Selection of a Valid Subtest for DIF Analysis when DIF has Multiple Potential Causes among Multiple Groups

Description

The study examined how ATFIND, Mantel-Haenszel, SIBTEST, and Crossing SIBTEST function when items in the dataset are modelled to differentially advantage a lower ability focal group over a higher ability reference group. The primary purpose of the study was to examine ATFIND's usefulness as a valid subtest selection tool, but it also explored the influence of DIF items, item difficulty, and presence of multiple examinee populations with different ability distributions on both its selection of the assessment test (AT) and partitioning test (PT) lists and on all three differential item functioning (DIF) analysis procedures. The results of SIBTEST were also combined with those of Crossing SIBTEST, as might be done in practice.

ATFIND was found to be a less-than-effective matching subtest selection tool with DIF items that are modelled unidimensionally. If an item was modelled with uniform DIF or if it had a referent difficulty parameter in the Medium range, it was found to be selected slightly more often for the AT List than the PT List. These trends were seen to increase as sample size increased. All three DIF analyses, and the combined SIBTEST and Crossing SIBTEST, generally were found to perform less well as DIF contaminated the matching subtest, as well as when DIF was modelled less severely or when the focal group ability was skewed. While the combined SIBTEST and Crossing SIBTEST was found to have the highest power among the DIF analyses, it also was found to have Type I error rates that were sometimes extremely high.

Contributors: Scott, Lietta Marie (Author) / Levy, Roy (Thesis advisor) / Green, Samuel B (Thesis advisor) / Gorin, Joanna S (Committee member) / Williams, Leila E (Committee member) / Arizona State University (Publisher)

Created: 2014

Examining the validity of a state policy-directed framework for evaluating teacher instructional quality: informing policy, impacting practice

Description

ABSTRACT

This study examines validity evidence of a state policy-directed teacher evaluation system implemented in Arizona during school year 2012-2013. The purpose was to evaluate the warrant for making high stakes, consequential judgments of teacher competence based on value-added (VAM) estimates of instructional impact and observations of professional practice (PP). The research also explores educator influence (voice) in evaluation design and the role information brokers have in local decision making. Findings are situated in an evidentiary and policy context at both the LEA and state policy levels.

The study employs a single-phase, concurrent, mixed-methods research design triangulating multiple sources of qualitative and quantitative evidence onto a single (unified) validation construct: Teacher Instructional Quality. It focuses on assessing the characteristics of metrics used to construct quantitative ratings of instructional competence and the alignment of stakeholder perspectives to facets implicit in the evaluation framework. Validity examinations include assembly of criterion, content, reliability, consequential and construct articulation evidences. Perceptual perspectives were obtained from teachers, principals, district leadership, and state policy decision makers. Data for this study came from a large suburban public school district in metropolitan Phoenix, Arizona.

Study findings suggest that the evaluation framework is insufficient for supporting high stakes, consequential inferences of teacher instructional quality. This is based, in part, on the following: (1) Weak associations between VAM and PP metrics; (2) Unstable VAM measures across time and between tested content areas; (3) Less than adequate scale reliabilities; (4) Lack of coherence between theorized and empirical PP factor structures; (5) Omission/underrepresentation of important instructional attributes/effects; (6) Stakeholder concerns over rater consistency, bias, and the inability of test scores to adequately represent instructional competence; (7) Negative sentiments regarding the system's ability to improve instructional competence and/or student learning; (8) Concerns regarding unintended consequences including increased stress, lower morale, harm to professional identity, and restricted learning opportunities; and (9) The general lack of empowerment and educator exclusion from the decision making process. Study findings also highlight the value of information brokers in policy decision making and the importance of having access to unbiased empirical information during the design and implementation phases of important change initiatives.

Contributors: Sloat, Edward F. (Author) / Wetzel, Keith (Thesis advisor) / Amrein-Beardsley, Audrey (Thesis advisor) / Ewbank, Ann (Committee member) / Shough, Lori (Committee member) / Arizona State University (Publisher)

Created: 2015

Visual elements and their effects on the learning outcomes of E-learning

Description

This research contributes to an emergent body of knowledge regarding the relationship between visual elements and E-learning outcomes. Visual images and texts are the main visual elements within the study.

A literature review of E-learning situations was conducted, along with a discussion of the role of visual elements in E-learning. Data collection was also conducted by way of a test, which randomly placed participants into three groups and assigned them to three different E-learning courses. The texts for the three courses used the same font, but the first course had text only, the second course had text and "bad" images, and the third one had text and "good" images. Every time participants finished a short course, they were asked to take a short quiz based on what they had learned. In addition, every participant needed to complete a survey based on his or her E-learning experience. Research data was finally collected through the test scores and surveys.

Key findings of this research are: (1) the combination of text and "good" image materials in E-learning can greatly enhance the learning outcomes; (2) the "good" images in learning materials can add to the value of the text content as well as improve the satisfaction level of learners in E-learning; (3) "bad" images do not enhance E-learning outcomes; and (4) E-learners take longer to complete learning materials containing images, no matter how "good" or "bad" the images are.

Contributors: Wang, Yanfei (Author) / Giard, Jacques (Thesis advisor) / Fehler, Michelle (Committee member) / Faria, Rowan De (Committee member) / Arizona State University (Publisher)

Created: 2015

Integration of traditional assessment and response to intervention in psychoeducational evaluations of culturally and linguistically diverse students

Description

The popularity of response-to-intervention (RTI) frameworks of service delivery has increased in recent years. Scholars have speculated that RTI may be particularly relevant to the special education assessment process for culturally and linguistically diverse (CLD) students, due to its suspected utility in ruling out linguistic proficiency as the primary factor in learning difficulties. The present study explored how RTI and traditional assessment methods were integrated into the psychoeducational evaluation process for students suspected of having specific learning disabilities (SLD). The content of psychoeducational evaluation reports completed on students who were found eligible for special education services under the SLD category from 2009-2013 was analyzed. Two main research questions were addressed: how RTI influenced the psychoeducational evaluation process, and how this process differed for CLD and non-CLD students. Findings indicated variability in the incorporation of RTI in evaluation reports, with an increase across time in the tendency to reference the prereferral intervention process. However, actual RTI data was present in a minority of reports, with the inclusion of such data more common for reading than other academic areas, as well as more likely for elementary students than secondary students. Contrary to expectations, RTI did not play a larger role in evaluation reports for CLD students than reports for non-CLD students. Evaluations of CLD students also did not demonstrate greater variability in the use of traditional assessments, and were more likely to rely on nonverbal cognitive measures than evaluations of non-CLD students. Methods by which practitioners addressed linguistic proficiency were variable, with parent input, educational history, and individually-administered proficiency test data commonly used. Assessment practices identified in this study are interpreted in the context of best practice recommendations.

Contributors: Planck, Jennifer A (Author) / Caterino, Linda C (Thesis advisor) / Stamm, Jill (Committee member) / Cohen, Sylvia A. (Committee member) / Arizona State University (Publisher)

Created: 2014

The motivations and challenges of acquiring U.S. citizenship for South Sudanese refugees in the greater Phoenix area when language is a potential barrier

Description

South Sudanese refugees are among the most vulnerable immigrants to the U.S. Many have spent years in refugee camps, experienced trauma, lost members of their families, and have had minimal or no schooling or literacy prior to their arrival in the U.S. Although most South Sudanese aspire to become U.S. citizens, finally giving them a sense of belonging and participation in a land they can call their own, they constitute a group that faces great challenges in terms of their educational adaptation and English-language learning skills that would lead them to success on the U.S. citizenship examination. This dissertation reports findings from a qualitative research project involving case studies of South Sudanese students in a citizenship preparation program at a South Sudanese refugee community center in Phoenix, Arizona. It focuses on the links between the motivations of students seeking citizenship and the barriers they face in gaining it. Though the South Sudanese refugee students aspiring to become U.S. citizens face many of the same challenges as other immigrant groups, there are some factors that in combination make the participants in this study different from other groups. These include: long periods spent in refugee camps, advanced ages, war trauma, absence of intact families, no schooling or severe disruption from schooling, no first language literacy, and hybridized forms of second languages (e.g., Juba Arabic). This study reports on the motivations students have for seeking citizenship and the challenges they face in attaining it from the perspective of teachers working with those students, community leaders of the South Sudanese community, and particularly the students enrolled in the citizenship program.

Contributors: Johnson, Erik (Author) / Adams, Karen (Thesis advisor) / Renaud, Claire (Committee member) / James, Mark (Committee member) / Arizona State University (Publisher)

Created: 2014

Improving Arizona English language learners' mathematics achievement using curriculum-based measures

Description

ABSTRACT

This study was an investigation of the effectiveness of curriculum-based measures (CBMs) on the math achievement of first and second grade English Language Learners (ELL). The No Child Left Behind Act (NCLB) of 2001 led to a new educational reform, which identifies and provides services to students in need of academic support based on English language proficiency. Students are from certain demographics: minorities, low-income families, students with disabilities, and students with limited English proficiency. NCLB was intended to lead to improvement in the quality of the United States educational system.

Four classes from the community of Kayenta, Arizona in the Navajo Nation were randomly assigned to control and experimental groups, one each per grade. All four classes used the state-approved, core math curriculum, but one class in each grade was provided with weekly CBMs for an entire school year that included sample questions developed from the Arizona Department of Education performance standards. The CBMs contained at least one question from each of the five math strands: number and operations, algebra, geometry, measurement, and data and probability.

The NorthWest Evaluation Assessment (NWEA) served as the pretest and posttest for all four groups. The SAT 10 (RIT scores) math test, administered near the time of the pretest, served as the covariate in the analysis. Two analysis of covariance tests revealed no statistically significant treatment effects, subject gender effects, or interactions for either Grade 1 or Grade 2. Achievement levels were relatively constant across both genders and the two grade levels.

Despite increasing emphasis on assessment and accountability, the achievement gaps between these subpopulations and the general population of students continue to widen. It appears that other variables are responsible for the different achievement levels found among students. Researchers have found that students of teachers with math certification, degrees related to math, and advanced course work in math show improved math performance over students of teachers who lack those qualifications. The design of the current study did not permit analyses of teacher or school effects.

Contributors: Benally, Jacqueline (Author) / Humphreys, Jere (Thesis advisor) / Spencer, Dee (Committee member) / Nicholas, Appleton (Committee member) / Arizona State University (Publisher)

Created: 2014

Assessing Postsecondary Students' Orientation toward Lifelong Learning

Description

Institutions of higher education often tout that they are developing students to become lifelong learners. Evaluative efforts in this area have been presumably hindered by the lack of a uniform conceptualization of lifelong learning. Lifelong learning has been defined from institutional, economic, socio-cultural, and pedagogical perspectives, among others. This study presents the existing operational definitions and theories of lifelong learning in the context of higher education and synthesizes them to propose a unified model of college students' orientation toward lifelong learning. The model theorizes that orientation toward lifelong learning is a latent construct which manifests as students' likelihood to engage in four types of learning activities: formal work-related activities, informal work-related activities, formal personal interest activities, and informal personal interest activities. The Postsecondary Orientation toward Lifelong Learning scale (POLL) was developed and the validity of the resulting score interpretations was examined. The instrument was used to compare potential differences in orientation toward lifelong learning between freshmen and seniors. Exploratory factor analyses of the responses of 138 undergraduate college students in the pilot study data provided tentative support for the factor structure within each type of learning activity. Guttman's λ2 estimates of the learning activity subscales ranged from .78 to .85. Follow-up confirmatory factor analysis using structural equation modeling did not corroborate support for the hypothesized four-factor model using the main student sample data of 405 undergraduate students. Several alternative reflective factor structures were explored. A two-factor model representing factors for Instructing/Presenting and Reading learning activities produced marginal model-data fit and warrants further investigation.
The summed POLL total scores had a relatively strong positive correlation with global interest in learning (.58), moderate positive correlations with civic engagement and participation (.38) and life satisfaction (.29), and a small positive correlation with social desirability (.15). The results of the main study do not provide support for the malleability of postsecondary students' orientation toward lifelong learning, as measured by the summed POLL scores. The difference between freshmen and seniors' average total POLL scores was not statistically significant and was negligible in size.
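For readers unfamiliar with the reliability estimate mentioned above, Guttman's λ2 can be computed from the item covariance matrix as λ2 = [S + sqrt(k/(k-1) · C2)] / V, where V is the total score variance, S is the sum of the off-diagonal covariances, C2 is the sum of the squared off-diagonal covariances, and k is the number of items. The following is a minimal standard-library sketch of that formula; the function names and data layout are assumptions, and it is not the software used in the thesis.

```python
import math
from typing import Sequence


def _covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Unbiased sample covariance of two equal-length score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def guttman_lambda2(items: Sequence[Sequence[float]]) -> float:
    """Guttman's lambda-2 for a scale.

    items holds one sequence of scores per item; every sequence has one
    entry per respondent, in the same respondent order.
    """
    k = len(items)
    if k < 2 or len(items[0]) < 2:
        raise ValueError("need at least two items and two respondents")
    cov = [[_covariance(items[i], items[j]) for j in range(k)] for i in range(k)]
    total_var = sum(sum(row) for row in cov)      # variance of the sum score
    item_var = sum(cov[i][i] for i in range(k))   # sum of item variances
    if total_var <= 0:
        raise ValueError("total score variance must be positive")
    # Sum of squared off-diagonal covariances (both triangles).
    c2 = sum(cov[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
    return (total_var - item_var + math.sqrt(k / (k - 1) * c2)) / total_var
```

λ2 is never smaller than coefficient alpha on the same data, which is one common reason for reporting it.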

Contributors: Arcuria, Phil (Author) / Thompson, Marilyn (Thesis advisor) / Green, Samuel (Committee member) / Levy, Roy (Committee member) / Arizona State University (Publisher)

Created: 2011

The factor structure of the English language development assessment: a confirmatory factor analysis

Description

This study investigated the internal factor structure of the English Language Development Assessment (ELDA) using confirmatory factor analysis. ELDA is an English language proficiency test developed by a consortium of multiple states and is used to identify and reclassify English language learners in kindergarten to grade 12. Scores on item parcels based on the standards tested from the four domains of reading, writing, listening, and speaking were used for the analyses. Five different factor models were tested: a single factor model, a correlated two-factor model, a correlated four-factor model, a second-order factor model, and a bifactor model. The results indicate that the four-factor model, second-order model, and bifactor model fit the data well. The four-factor model hypothesized constructs for reading, writing, listening, and speaking. The second-order model hypothesized a second-order English language proficiency factor as well as the four lower-order factors of reading, writing, listening, and speaking. The bifactor model hypothesized a general English language proficiency factor as well as the four domain-specific factors of reading, writing, listening, and speaking. The chi-square difference tests indicated that the bifactor model best explains the factor structure of the ELDA. The results from this study are consistent with the findings in the literature about the multifactorial nature of language but differ from the conclusion about the factor structures reported in previous studies. The overall proficiency levels on the ELDA give more weight to the reading and writing sections of the test than the speaking and listening sections. This study has implications for the rules used for determining proficiency levels and recommends the use of conjunctive scoring where all constructs are weighted equally, contrary to current practice.

Contributors: Kuriakose, Anju Susan (Author) / Macswan, Jeff (Thesis advisor) / Haladyna, Thomas (Thesis advisor) / Thompson, Marilyn (Committee member) / Arizona State University (Publisher)

Created: 2011