Search Content

A comparison of DIMTEST and generalized dimensionality discrepancy approaches to assessing dimensionality in item response theory

Description

Dimensionality assessment is an important component of evaluating item response data. Existing approaches to evaluating common assumptions of unidimensionality, such as DIMTEST (Nandakumar & Stout, 1993; Stout, 1987; Stout, Froelich, & Gao, 2001), have been shown to work well under large-scale assessment conditions (e.g., large sample sizes and item pools;…

Dimensionality assessment is an important component of evaluating item response data. Existing approaches to evaluating common assumptions of unidimensionality, such as DIMTEST (Nandakumar & Stout, 1993; Stout, 1987; Stout, Froelich, & Gao, 2001), have been shown to work well under large-scale assessment conditions (e.g., large sample sizes and item pools; see e.g., Froelich & Habing, 2007). It remains to be seen how such procedures perform in the context of small-scale assessments characterized by relatively small sample sizes and/or short tests. The fact that some procedures come with minimum allowable values for characteristics of the data, such as the number of items, may even render them unusable for some small-scale assessments. Other measures designed to assess dimensionality do not come with such limitations and, as such, may perform better under conditions that do not lend themselves to evaluation via statistics that rely on asymptotic theory. The current work aimed to evaluate the performance of one such metric, the standardized generalized dimensionality discrepancy measure (SGDDM; Levy & Svetina, 2011; Levy, Xu, Yel, & Svetina, 2012), under both large- and small-scale testing conditions. A Monte Carlo study was conducted to compare the performance of DIMTEST and the SGDDM statistic in terms of evaluating assumptions of unidimensionality in item response data under a variety of conditions, with an emphasis on the examination of these procedures in small-scale assessments. Similar to previous research, increases in either test length or sample size resulted in increased power. The DIMTEST procedure appeared to be a conservative test of the null hypothesis of unidimensionality. The SGDDM statistic exhibited rejection rates near the nominal rate of .05 under unidimensional conditions, though the reliability of these results may have been less than optimal due to high sampling variability resulting from a relatively limited number of replications. Power values were at or near 1.0 for many of the multidimensional conditions. It was only when the sample size was reduced to N = 100 that the two approaches diverged in performance. Results suggested that both procedures may be appropriate for sample sizes as low as N = 250 and tests as short as J = 12 (SGDDM) or J = 19 (DIMTEST). When used as a diagnostic tool, SGDDM may be appropriate with as few as N = 100 cases combined with J = 12 items. The study was somewhat limited in that it did not include any complex factorial designs, nor were the strength of item discrimination parameters or correlation between factors manipulated. It is recommended that further research be conducted with the inclusion of these factors, as well as an increase in the number of replications when using the SGDDM procedure.

ContributorsReichenberg, Ray E (Author) / Levy, Roy (Thesis advisor) / Thompson, Marilyn S. (Thesis advisor) / Green, Samuel B. (Committee member) / Arizona State University (Publisher)

Created2013

Do more comprehensive psychoeducational evaluations promote TBI educational diagnosis?

Description

Students with traumatic brain injury (TBI) sometimes experience impairments that can adversely affect educational performance. Consequently, school psychologists may be needed to help determine if a TBI diagnosis is warranted (i.e., in compliance with the Individuals with Disabilities Education Improvement Act, IDEIA) and to suggest accommodations to assist those students.…

Students with traumatic brain injury (TBI) sometimes experience impairments that can adversely affect educational performance. Consequently, school psychologists may be needed to help determine if a TBI diagnosis is warranted (i.e., in compliance with the Individuals with Disabilities Education Improvement Act, IDEIA) and to suggest accommodations to assist those students. This analogue study investigated whether school psychologists provided with more comprehensive psychoeducational evaluations of a student with TBI succeeded in detecting TBI, in making TBI-related accommodations, and were more confident in their decisions. To test these hypotheses, 76 school psychologists were randomly assigned to one of three groups that received increasingly comprehensive levels of psychoeducational evaluation embedded in a cumulative folder of a hypothetical student whose history included a recent head injury and TBI-compatible school problems. As expected, school psychologists who received a more comprehensive psychoeducational evaluation were more likely to make a TBI educational diagnosis, but the effect size was not strong, and the predictive value came from the variance between the first and third groups. Likewise, school psychologists receiving more comprehensive evaluation data produced more accommodations related to student needs and felt more confidence in those accommodations, but significant differences were not found at all levels of evaluation. Contrary to expectations, however, providing more comprehensive information failed to engender more confidence in decisions about TBI educational diagnoses. Concluding that a TBI is present may itself facilitate accommodations; school psychologists who judged that the student warranted a TBI educational diagnosis produce more TBI-related accommodations. Impact of findings suggest the importance of training school psychologists in the interpretation of neuropsychology test results to aid in educational diagnosis and to increase confidence in their use.

ContributorsHildreth, Lisa Jane (Author) / Hildreth, Lisa J (Thesis advisor) / Wodrich, David (Committee member) / Levy, Roy (Committee member) / Lavoie, Michael (Committee member) / Arizona State University (Publisher)

Created2012

Posterior predictive model checking in Bayesian networks

Description

This simulation study compared the utility of various discrepancy measures within a posterior predictive model checking (PPMC) framework for detecting different types of data-model misfit in multidimensional Bayesian network (BN) models. The investigated conditions were motivated by an applied research program utilizing an operational complex performance assessment within a digital-simulation…

This simulation study compared the utility of various discrepancy measures within a posterior predictive model checking (PPMC) framework for detecting different types of data-model misfit in multidimensional Bayesian network (BN) models. The investigated conditions were motivated by an applied research program utilizing an operational complex performance assessment within a digital-simulation educational context grounded in theories of cognition and learning. BN models were manipulated along two factors: latent variable dependency structure and number of latent classes. Distributions of posterior predicted p-values (PPP-values) served as the primary outcome measure and were summarized in graphical presentations, by median values across replications, and by proportions of replications in which the PPP-values were extreme. An effect size measure for PPMC was introduced as a supplemental numerical summary to the PPP-value. Consistent with previous PPMC research, all investigated fit functions tended to perform conservatively, but Standardized Generalized Dimensionality Discrepancy Measure (SGDDM), Yen's Q3, and Hierarchy Consistency Index (HCI) only mildly so. Adequate power to detect at least some types of misfit was demonstrated by SGDDM, Q3, HCI, Item Consistency Index (ICI), and to a lesser extent Deviance, while proportion correct (PC), a chi-square-type item-fit measure, Ranked Probability Score (RPS), and Good's Logarithmic Scale (GLS) were powerless across all investigated factors. Bivariate SGDDM and Q3 were found to provide powerful and detailed feedback for all investigated types of misfit.

ContributorsCrawford, Aaron (Author) / Levy, Roy (Thesis advisor) / Green, Samuel (Committee member) / Thompson, Marilyn (Committee member) / Arizona State University (Publisher)

Created2014

Multilevel multiple imputation: an examination of competing methods

Description

Missing data are common in psychology research and can lead to bias and reduced power if not properly handled. Multiple imputation is a state-of-the-art missing data method recommended by methodologists. Multiple imputation methods can generally be divided into two broad categories: joint model (JM) imputation and fully conditional specification (FCS)…

Missing data are common in psychology research and can lead to bias and reduced power if not properly handled. Multiple imputation is a state-of-the-art missing data method recommended by methodologists. Multiple imputation methods can generally be divided into two broad categories: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution (e.g., multivariate normal). FCS, on the other hand, imputes variables one at a time, drawing missing values from a series of univariate distributions. In the single-level context, these two approaches have been shown to be equivalent with multivariate normal data. However, less is known about the similarities and differences of these two approaches with multilevel data, and the methodological literature provides no insight into the situations under which the approaches would produce identical results. This document examined five multilevel multiple imputation approaches (three JM methods and two FCS methods) that have been proposed in the literature. An analytic section shows that only two of the methods (one JM method and one FCS method) used imputation models equivalent to a two-level joint population model that contained random intercepts and different associations across levels. The other three methods employed imputation models that differed from the population model primarily in their ability to preserve distinct level-1 and level-2 covariances. I verified the analytic work with computer simulations, and the simulation results also showed that imputation models that failed to preserve level-specific covariances produced biased estimates. The studies also highlighted conditions that exacerbated the amount of bias produced (e.g., bias was greater for conditions with small cluster sizes). The analytic work and simulations lead to a number of practical recommendations for researchers.

ContributorsMistler, Stephen (Author) / Enders, Craig K. (Thesis advisor) / Aiken, Leona (Committee member) / Levy, Roy (Committee member) / West, Stephen G. (Committee member) / Arizona State University (Publisher)

Created2015

Applying academic analytics: developing a process for utilizing Bayesian networks to predict stopping out among community college students

Description

Many methodological approaches have been utilized to predict student retention and persistence over the years, yet few have utilized a Bayesian framework. It is believed this is due in part to the absence of an established process for guiding educational researchers reared in a frequentist perspective into the realms of…

Many methodological approaches have been utilized to predict student retention and persistence over the years, yet few have utilized a Bayesian framework. It is believed this is due in part to the absence of an established process for guiding educational researchers reared in a frequentist perspective into the realms of Bayesian analysis and educational data mining. The current study aimed to address this by providing a model-building process for developing a Bayesian network (BN) that leveraged educational data mining, Bayesian analysis, and traditional iterative model-building techniques in order to predict whether community college students will stop out at the completion of each of their first six terms. The study utilized exploratory and confirmatory techniques to reduce an initial pool of more than 50 potential predictor variables to a parsimonious final BN with only four predictor variables. The average in-sample classification accuracy rate for the model was 80% (Cohen's κ = 53%). The model was shown to be generalizable across samples with an average out-of-sample classification accuracy rate of 78% (Cohen's κ = 49%). The classification rates for the BN were also found to be superior to the classification rates produced by an analog frequentist discrete-time survival analysis model.

ContributorsArcuria, Philip (Author) / Levy, Roy (Thesis advisor) / Green, Samuel B (Committee member) / Thompson, Marilyn S (Committee member) / Arizona State University (Publisher)

Created2015

Assessing dimensionality in complex data structures: a performance comparison of DETECT and NOHARM procedures

Description

The purpose of this study was to investigate the effect of complex structure on dimensionality assessment in compensatory and noncompensatory multidimensional item response models (MIRT) of assessment data using dimensionality assessment procedures based on conditional covariances (i.e., DETECT) and a factor analytical approach (i.e., NOHARM). The DETECT-based methods typically outperformed…

The purpose of this study was to investigate the effect of complex structure on dimensionality assessment in compensatory and noncompensatory multidimensional item response models (MIRT) of assessment data using dimensionality assessment procedures based on conditional covariances (i.e., DETECT) and a factor analytical approach (i.e., NOHARM). The DETECT-based methods typically outperformed the NOHARM-based methods in both two- (2D) and three-dimensional (3D) compensatory MIRT conditions. The DETECT-based methods yielded high proportion correct, especially when correlations were .60 or smaller, data exhibited 30% or less complexity, and larger sample size. As the complexity increased and the sample size decreased, the performance typically diminished. As the complexity increased, it also became more difficult to label the resulting sets of items from DETECT in terms of the dimensions. DETECT was consistent in classification of simple items, but less consistent in classification of complex items. Out of the three NOHARM-based methods, χ2G/D and ALR generally outperformed RMSR. χ2G/D was more accurate when N = 500 and complexity levels were 30% or lower. As the number of items increased, ALR performance improved at correlation of .60 and 30% or less complexity. When the data followed a noncompensatory MIRT model, the NOHARM-based methods, specifically χ2G/D and ALR, were the most accurate of all five methods. The marginal proportions for labeling sets of items as dimension-like were typically low, suggesting that the methods generally failed to label two (three) sets of items as dimension-like in 2D (3D) noncompensatory situations. The DETECT-based methods were more consistent in classifying simple items across complexity levels, sample sizes, and correlations. However, as complexity and correlation levels increased the classification rates for all methods decreased. In most conditions, the DETECT-based methods classified complex items equally or more consistent than the NOHARM-based methods. In particular, as complexity, the number of items, and the true dimensionality increased, the DETECT-based methods were notably more consistent than any NOHARM-based method. Despite DETECT's consistency, when data follow a noncompensatory MIRT model, the NOHARM-based method should be preferred over the DETECT-based methods to assess dimensionality due to poor performance of DETECT in identifying the true dimensionality.

ContributorsSvetina, Dubravka (Author) / Levy, Roy (Thesis advisor) / Gorin, Joanna S. (Committee member) / Millsap, Roger (Committee member) / Arizona State University (Publisher)

Created2011

Assessment of item parameter drift of known items in a university placement exam

Description

ABSTRACT This study investigated the possibility of item parameter drift (IPD) in a calculus placement examination administered to approximately 3,000 students at a large university in the United States. A single form of the exam was administered continuously for a period of two years, possibly allowing later examinees to have…

ABSTRACT This study investigated the possibility of item parameter drift (IPD) in a calculus placement examination administered to approximately 3,000 students at a large university in the United States. A single form of the exam was administered continuously for a period of two years, possibly allowing later examinees to have prior knowledge of specific items on the exam. An analysis of IPD was conducted to explore evidence of possible item exposure. Two assumptions concerning items exposure were made: 1) item recall and item exposure are positively correlated, and 2) item exposure results in the items becoming easier over time. Special consideration was given to two contextual item characteristics: 1) item location within the test, specifically items at the beginning and end of the exam, and 2) the use of an associated diagram. The hypotheses stated that these item characteristics would make the items easier to recall and, therefore, more likely to be exposed, resulting in item drift. BILOG-MG 3 was used to calibrate the items and assess for IPD. No evidence was found to support the hypotheses that the items located at the beginning of the test or with an associated diagram drifted as a result of item exposure. Three items among the last ten on the exam drifted significantly and became easier, consistent with item exposure. However, in this study, the possible effects of item exposure could not be separated from the effects of other potential factors such as speededness, curriculum changes, better test preparation on the part of subsequent examinees, or guessing.

ContributorsKrause, Janet (Author) / Levy, Roy (Thesis advisor) / Thompson, Marilyn (Thesis advisor) / Gorin, Joanna (Committee member) / Arizona State University (Publisher)

Created2012

The impact of information quantity and quality on parameter estimation for a selection of dynamic Bayesian network models with latent variables

Description

Dynamic Bayesian networks (DBNs; Reye, 2004) are a promising tool for modeling student proficiency under rich measurement scenarios (Reichenberg, in press). These scenarios often present assessment conditions far more complex than what is seen with more traditional assessments and require assessment arguments and psychometric models capable of integrating those complexities.…

Dynamic Bayesian networks (DBNs; Reye, 2004) are a promising tool for modeling student proficiency under rich measurement scenarios (Reichenberg, in press). These scenarios often present assessment conditions far more complex than what is seen with more traditional assessments and require assessment arguments and psychometric models capable of integrating those complexities. Unfortunately, DBNs remain understudied and their psychometric properties relatively unknown. If the apparent strengths of DBNs are to be leveraged, then the body of literature surrounding their properties and use needs to be expanded upon. To this end, the current work aimed at exploring the properties of DBNs under a variety of realistic psychometric conditions. A two-phase Monte Carlo simulation study was conducted in order to evaluate parameter recovery for DBNs using maximum likelihood estimation with the Netica software package. Phase 1 included a limited number of conditions and was exploratory in nature while Phase 2 included a larger and more targeted complement of conditions. Manipulated factors included sample size, measurement quality, test length, the number of measurement occasions. Results suggested that measurement quality has the most prominent impact on estimation quality with more distinct performance categories yielding better estimation. While increasing sample size tended to improve estimation, there were a limited number of conditions under which greater samples size led to more estimation bias. An exploration of this phenomenon is included. From a practical perspective, parameter recovery appeared to be sufficient with samples as low as N = 400 as long as measurement quality was not poor and at least three items were present at each measurement occasion. Tests consisting of only a single item required exceptional measurement quality in order to adequately recover model parameters. The study was somewhat limited due to potentially software-specific issues as well as a non-comprehensive collection of experimental conditions. Further research should replicate and, potentially expand the current work using other software packages including exploring alternate estimation methods (e.g., Markov chain Monte Carlo).

ContributorsReichenberg, Raymond E (Author) / Levy, Roy (Thesis advisor) / Eggum-Wilkens, Natalie (Thesis advisor) / Iida, Masumi (Committee member) / DeLay, Dawn (Committee member) / Arizona State University (Publisher)

Created2018

Assessing measurement invariance and latent mean differences with bifactor multidimensional data in structural equation modeling

Description

Investigation of measurement invariance (MI) commonly assumes correct specification of dimensionality across multiple groups. Although research shows that violation of the dimensionality assumption can cause bias in model parameter estimation for single-group analyses, little research on this issue has been conducted for multiple-group analyses. This study explored the effects of…

Investigation of measurement invariance (MI) commonly assumes correct specification of dimensionality across multiple groups. Although research shows that violation of the dimensionality assumption can cause bias in model parameter estimation for single-group analyses, little research on this issue has been conducted for multiple-group analyses. This study explored the effects of mismatch in dimensionality between data and analysis models with multiple-group analyses at the population and sample levels. Datasets were generated using a bifactor model with different factor structures and were analyzed with bifactor and single-factor models to assess misspecification effects on assessments of MI and latent mean differences. As baseline models, the bifactor models fit data well and had minimal bias in latent mean estimation. However, the low convergence rates of fitting bifactor models to data with complex structures and small sample sizes caused concern. On the other hand, effects of fitting the misspecified single-factor models on the assessments of MI and latent means differed by the bifactor structures underlying data. For data following one general factor and one group factor affecting a small set of indicators, the effects of ignoring the group factor in analysis models on the tests of MI and latent mean differences were mild. In contrast, for data following one general factor and several group factors, oversimplifications of analysis models can lead to inaccurate conclusions regarding MI assessment and latent mean estimation.

ContributorsXu, Yuning (Author) / Green, Samuel (Thesis advisor) / Levy, Roy (Committee member) / Thompson, Marilyn (Committee member) / Arizona State University (Publisher)

Created2018

Robustness of the General Factor Mean Difference Estimation in Bifactor Ordinal Data

Description

A simulation study was conducted to explore the robustness of general factor mean difference estimation in bifactor ordered-categorical data. In the No Differential Item Functioning (DIF) conditions, the data generation conditions varied were sample size, the number of categories per item, effect size of the general factor mean difference, and…

A simulation study was conducted to explore the robustness of general factor mean difference estimation in bifactor ordered-categorical data. In the No Differential Item Functioning (DIF) conditions, the data generation conditions varied were sample size, the number of categories per item, effect size of the general factor mean difference, and the size of specific factor loadings; in data analysis, misspecification conditions were introduced in which the generated bifactor data were fit using a unidimensional model, and/or ordered-categorical data were treated as continuous data. In the DIF conditions, the data generation conditions varied were sample size, the number of categories per item, effect size of latent mean difference for the general factor, the type of item parameters that had DIF, and the magnitude of DIF; the data analysis conditions varied in whether or not setting equality constraints on the noninvariant item parameters.

Results showed that falsely fitting bifactor data using unidimensional models or failing to account for DIF in item parameters resulted in estimation bias in the general factor mean difference, while treating ordinal data as continuous had little influence on the estimation bias as long as there was no severe model misspecification. The extent of estimation bias produced by misspecification of bifactor datasets with unidimensional models was mainly determined by the degree of unidimensionality (i.e., size of specific factor loadings) and the general factor mean difference size. When the DIF was present, the estimation accuracy of the general factor mean difference was completely robust to ignoring noninvariance in specific factor loadings while it was very sensitive to failing to account for DIF in threshold parameters. With respect to ignoring the DIF in general factor loadings, the estimation bias of the general factor mean difference was substantial when the DIF was -0.15, and it can be negligible for smaller sizes of DIF. Despite the impact of model misspecification on estimation accuracy, the power to detect the general factor mean difference was mainly influenced by the sample size and effect size. Serious Type I error rate inflation only occurred when the DIF was present in threshold parameters.

ContributorsLiu, Yixing (Author) / Thompson, Marilyn (Thesis advisor) / Levy, Roy (Committee member) / O’Rourke, Holly (Committee member) / Arizona State University (Publisher)

Created2019

Filtering by