Matching Items (15)
Description
The purpose of this study was to investigate the effect of complex structure on dimensionality assessment in compensatory and noncompensatory multidimensional item response models (MIRT) of assessment data using dimensionality assessment procedures based on conditional covariances (i.e., DETECT) and a factor analytical approach (i.e., NOHARM). The DETECT-based methods typically outperformed the NOHARM-based methods in both two- (2D) and three-dimensional (3D) compensatory MIRT conditions. The DETECT-based methods yielded a high proportion correct, especially when correlations were .60 or smaller, data exhibited 30% or less complexity, and sample sizes were larger. As the complexity increased and the sample size decreased, the performance typically diminished. As the complexity increased, it also became more difficult to label the resulting sets of items from DETECT in terms of the dimensions. DETECT was consistent in classification of simple items, but less consistent in classification of complex items. Out of the three NOHARM-based methods, χ2G/D and ALR generally outperformed RMSR. χ2G/D was more accurate when N = 500 and complexity levels were 30% or lower. As the number of items increased, ALR performance improved at a correlation of .60 and 30% or less complexity. When the data followed a noncompensatory MIRT model, the NOHARM-based methods, specifically χ2G/D and ALR, were the most accurate of all five methods. The marginal proportions for labeling sets of items as dimension-like were typically low, suggesting that the methods generally failed to label two (three) sets of items as dimension-like in 2D (3D) noncompensatory situations. The DETECT-based methods were more consistent in classifying simple items across complexity levels, sample sizes, and correlations. However, as complexity and correlation levels increased, the classification rates for all methods decreased.
In most conditions, the DETECT-based methods classified complex items as consistently as or more consistently than the NOHARM-based methods. In particular, as complexity, the number of items, and the true dimensionality increased, the DETECT-based methods were notably more consistent than any NOHARM-based method. Despite DETECT's consistency, when data follow a noncompensatory MIRT model, the NOHARM-based methods should be preferred over the DETECT-based methods to assess dimensionality because of DETECT's poor performance in identifying the true dimensionality.
Contributors: Svetina, Dubravka (Author) / Levy, Roy (Thesis advisor) / Gorin, Joanna S. (Committee member) / Millsap, Roger (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
Children with epilepsy represent a unique group of students who may require accommodations in school to be optimally successful. Therefore, it is important for teachers to understand the possible academic consequences epilepsy can have on a child. An important step in providing this information about epilepsy to teachers is understanding where they would prefer to acquire this information. The current study examined differences between teachers of differing ages, school levels, and special education teaching status in their preferences for gaining information from parents and the internet. Contrary to expectations, older teachers (those 56 years of age and older) were no less likely than younger teachers to prefer information from the internet. As predicted, elementary school teachers were more likely than high school teachers to prefer information from parents. Interestingly, however, middle school teachers were also more likely than high school teachers to prefer information from parents. Lastly, contrary to hypothesized results, special education teachers were no more likely to prefer information from parents than their non-special education colleagues. Limitations of this study, implications for practice, and directions for future research are discussed.
Contributors: Gay, Catherine (Author) / Wodrich, David (Thesis advisor) / Levy, Roy (Committee member) / Hart, Juliet (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
The use of exams for classification purposes has become prevalent across many fields including professional assessment for employment screening and standards based testing in educational settings. Classification exams assign individuals to performance groups based on the comparison of their observed test scores to a pre-selected criterion (e.g. masters vs. nonmasters in dichotomous classification scenarios). The successful use of exams for classification purposes assumes at least minimal levels of accuracy of these classifications. Classification accuracy is an index that reflects the rate of correct classification of individuals into the same category which contains their true ability score. Traditional methods estimate classification accuracy via methods which assume that true scores follow a four-parameter beta-binomial distribution. Recent research suggests that Item Response Theory may be a preferable alternative framework for estimating examinees' true scores and may return more accurate classifications based on these scores. Researchers hypothesized that test length, the location of the cut score, the distribution of items, and the distribution of examinee ability would impact the recovery of accurate estimates of classification accuracy. The current simulation study manipulated these factors to assess their potential influence on classification accuracy. Observed classification as masters vs. nonmasters, true classification accuracy, estimated classification accuracy, BIAS, and RMSE were analyzed. In addition, Analysis of Variance tests were conducted to determine whether an interrelationship existed between levels of the four manipulated factors. Results showed small values of estimated classification accuracy and increased BIAS in accuracy estimates with few items, mismatched distributions of item difficulty and examinee ability, and extreme cut scores. A significant four-way interaction between manipulated variables was observed. 
In addition to interpretations of these findings and explanation of potential causes for the recovered values, recommendations that inform practice and avenues of future research are provided.
Contributors: Kunze, Katie (Author) / Gorin, Joanna (Thesis advisor) / Levy, Roy (Thesis advisor) / Green, Samuel (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Students with traumatic brain injury (TBI) sometimes experience impairments that can adversely affect educational performance. Consequently, school psychologists may be needed to help determine if a TBI diagnosis is warranted (i.e., in compliance with the Individuals with Disabilities Education Improvement Act, IDEIA) and to suggest accommodations to assist those students. This analogue study investigated whether school psychologists provided with more comprehensive psychoeducational evaluations of a student with TBI were more successful in detecting TBI and in making TBI-related accommodations, and were more confident in their decisions. To test these hypotheses, 76 school psychologists were randomly assigned to one of three groups that received increasingly comprehensive levels of psychoeducational evaluation embedded in a cumulative folder of a hypothetical student whose history included a recent head injury and TBI-compatible school problems. As expected, school psychologists who received a more comprehensive psychoeducational evaluation were more likely to make a TBI educational diagnosis, but the effect size was not strong, and the predictive value came from the variance between the first and third groups. Likewise, school psychologists receiving more comprehensive evaluation data produced more accommodations related to student needs and felt more confident in those accommodations, but significant differences were not found at all levels of evaluation. Contrary to expectations, however, providing more comprehensive information failed to engender more confidence in decisions about TBI educational diagnoses. Concluding that a TBI is present may itself facilitate accommodations; school psychologists who judged that the student warranted a TBI educational diagnosis produced more TBI-related accommodations.
The findings suggest the importance of training school psychologists in the interpretation of neuropsychology test results to aid in educational diagnosis and to increase confidence in their use.
Contributors: Hildreth, Lisa Jane (Author) / Hildreth, Lisa J (Thesis advisor) / Wodrich, David (Committee member) / Levy, Roy (Committee member) / Lavoie, Michael (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
Dimensionality assessment is an important component of evaluating item response data. Existing approaches to evaluating common assumptions of unidimensionality, such as DIMTEST (Nandakumar & Stout, 1993; Stout, 1987; Stout, Froelich, & Gao, 2001), have been shown to work well under large-scale assessment conditions (e.g., large sample sizes and item pools; see e.g., Froelich & Habing, 2007). It remains to be seen how such procedures perform in the context of small-scale assessments characterized by relatively small sample sizes and/or short tests. The fact that some procedures come with minimum allowable values for characteristics of the data, such as the number of items, may even render them unusable for some small-scale assessments. Other measures designed to assess dimensionality do not come with such limitations and, as such, may perform better under conditions that do not lend themselves to evaluation via statistics that rely on asymptotic theory. The current work aimed to evaluate the performance of one such metric, the standardized generalized dimensionality discrepancy measure (SGDDM; Levy & Svetina, 2011; Levy, Xu, Yel, & Svetina, 2012), under both large- and small-scale testing conditions. A Monte Carlo study was conducted to compare the performance of DIMTEST and the SGDDM statistic in terms of evaluating assumptions of unidimensionality in item response data under a variety of conditions, with an emphasis on the examination of these procedures in small-scale assessments. Similar to previous research, increases in either test length or sample size resulted in increased power. The DIMTEST procedure appeared to be a conservative test of the null hypothesis of unidimensionality. The SGDDM statistic exhibited rejection rates near the nominal rate of .05 under unidimensional conditions, though the reliability of these results may have been less than optimal due to high sampling variability resulting from a relatively limited number of replications. 
Power values were at or near 1.0 for many of the multidimensional conditions. It was only when the sample size was reduced to N = 100 that the two approaches diverged in performance. Results suggested that both procedures may be appropriate for sample sizes as low as N = 250 and tests as short as J = 12 (SGDDM) or J = 19 (DIMTEST). When used as a diagnostic tool, SGDDM may be appropriate with as few as N = 100 cases combined with J = 12 items. The study was somewhat limited in that it did not include any complex factorial designs, and neither the strength of the item discrimination parameters nor the correlation between factors was manipulated. It is recommended that further research include these factors and use a larger number of replications when applying the SGDDM procedure.
Contributors: Reichenberg, Ray E (Author) / Levy, Roy (Thesis advisor) / Thompson, Marilyn S. (Thesis advisor) / Green, Samuel B. (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
This study investigated the possibility of item parameter drift (IPD) in a calculus placement examination administered to approximately 3,000 students at a large university in the United States. A single form of the exam was administered continuously for a period of two years, possibly allowing later examinees to have prior knowledge of specific items on the exam. An analysis of IPD was conducted to explore evidence of possible item exposure. Two assumptions concerning item exposure were made: 1) item recall and item exposure are positively correlated, and 2) item exposure results in the items becoming easier over time. Special consideration was given to two contextual item characteristics: 1) item location within the test, specifically items at the beginning and end of the exam, and 2) the use of an associated diagram. The hypotheses stated that these item characteristics would make the items easier to recall and, therefore, more likely to be exposed, resulting in item drift. BILOG-MG 3 was used to calibrate the items and assess for IPD. No evidence was found to support the hypotheses that the items located at the beginning of the test or with an associated diagram drifted as a result of item exposure. Three items among the last ten on the exam drifted significantly and became easier, consistent with item exposure. However, in this study, the possible effects of item exposure could not be separated from the effects of other potential factors such as speededness, curriculum changes, better test preparation on the part of subsequent examinees, or guessing.
Contributors: Krause, Janet (Author) / Levy, Roy (Thesis advisor) / Thompson, Marilyn (Thesis advisor) / Gorin, Joanna (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
The Culture-Language Interpretive Matrix (C-LIM) is a new tool hypothesized to help practitioners accurately determine whether students who are administered an IQ test are culturally and linguistically different from the normative comparison group (i.e., different) or culturally and linguistically similar to the normative comparison group and possibly have Specific Learning Disabilities (SLD) or other neurocognitive disabilities (i.e., disordered). Diagnostic utility statistics were used to test the ability of the Wechsler Intelligence Scales for Children-Fourth Edition (WISC-IV) C-LIM to accurately identify students from a referred sample of English language learners (ELLs) (n = 86) for whom Spanish was the primary language spoken at home and a sample of students from the WISC-IV normative sample (n = 2,033) as either culturally and linguistically different from the WISC-IV normative sample or culturally and linguistically similar to the WISC-IV normative sample. WISC-IV scores from three paired comparison groups were analyzed using the Receiver Operating Characteristic (ROC) curve: (a) ELLs with SLD and the WISC-IV normative sample, (b) ELLs without SLD and the WISC-IV normative sample, and (c) ELLs with SLD and ELLs without SLD. The ROC analyses yielded Area Under the Curve (AUC) values that ranged between 0.51 and 0.53 for the comparison between ELLs with SLD and the WISC-IV normative sample, AUC values that ranged between 0.48 and 0.53 for the comparison between ELLs without SLD and the WISC-IV normative sample, and AUC values that ranged between 0.49 and 0.55 for the comparison between ELLs with SLD and ELLs without SLD. These values indicate that the C-LIM has low diagnostic accuracy in terms of differentiating between a sample of ELLs and the WISC-IV normative sample. Currently available evidence does not support use of the C-LIM in applied practice at this time.
Contributors: Styck, Kara M (Author) / Watkins, Marley W. (Thesis advisor) / Levy, Roy (Thesis advisor) / Balles, John (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
Dynamic Bayesian networks (DBNs; Reye, 2004) are a promising tool for modeling student proficiency under rich measurement scenarios (Reichenberg, in press). These scenarios often present assessment conditions far more complex than what is seen with more traditional assessments and require assessment arguments and psychometric models capable of integrating those complexities. Unfortunately, DBNs remain understudied and their psychometric properties relatively unknown. If the apparent strengths of DBNs are to be leveraged, then the body of literature surrounding their properties and use needs to be expanded. To this end, the current work aimed to explore the properties of DBNs under a variety of realistic psychometric conditions. A two-phase Monte Carlo simulation study was conducted in order to evaluate parameter recovery for DBNs using maximum likelihood estimation with the Netica software package. Phase 1 included a limited number of conditions and was exploratory in nature, while Phase 2 included a larger and more targeted complement of conditions. Manipulated factors included sample size, measurement quality, test length, and the number of measurement occasions. Results suggested that measurement quality has the most prominent impact on estimation quality, with more distinct performance categories yielding better estimation. While increasing sample size tended to improve estimation, there were a limited number of conditions under which greater sample size led to more estimation bias. An exploration of this phenomenon is included. From a practical perspective, parameter recovery appeared to be sufficient with samples as low as N = 400 as long as measurement quality was not poor and at least three items were present at each measurement occasion. Tests consisting of only a single item required exceptional measurement quality in order to adequately recover model parameters.
The study was somewhat limited due to potentially software-specific issues as well as a non-comprehensive collection of experimental conditions. Further research should replicate and potentially expand the current work using other software packages, including exploring alternate estimation methods (e.g., Markov chain Monte Carlo).
Contributors: Reichenberg, Raymond E (Author) / Levy, Roy (Thesis advisor) / Eggum-Wilkens, Natalie (Thesis advisor) / Iida, Masumi (Committee member) / DeLay, Dawn (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Although models for describing longitudinal data have become increasingly sophisticated, the criticism of even foundational growth curve models remains challenging. The challenge arises from the need to disentangle data-model misfit at multiple and interrelated levels of analysis. Using posterior predictive model checking (PPMC)—a popular Bayesian framework for model criticism—the performance of several discrepancy functions was investigated in a Monte Carlo simulation study. The discrepancy functions of interest included two types of conditional concordance correlation (CCC) functions, two types of R2 functions, two types of standardized generalized dimensionality discrepancy (SGDDM) functions, the likelihood ratio (LR), and the likelihood ratio difference test (LRT). Key outcomes included effect sizes of the design factors on the realized values of discrepancy functions, distributions of posterior predictive p-values (PPP-values), and the proportion of extreme PPP-values.

In terms of the realized values, the behavior of the CCC and R2 functions was generally consistent with prior research. However, as diagnostics, these functions were extremely conservative even when some aspect of the data was unaccounted for. In contrast, the conditional SGDDM (SGDDMC), LR, and LRT were generally sensitive to the underspecifications investigated in this work on all outcomes considered. Although the proportions of extreme PPP-values for these functions tended to increase in null situations for non-normal data, this behavior may have reflected the true misfit that resulted from the specification of normal prior distributions. Importantly, the LR and, to a greater extent, the SGDDMC exhibited some potential for untangling the sources of data-model misfit. Owing to connections of growth curve models to the more fundamental frameworks of multilevel modeling, structural equation models with a mean structure, and Bayesian hierarchical models, the results of the current work may have broader implications that warrant further research.
Contributors: Fay, Derek (Author) / Levy, Roy (Thesis advisor) / Thompson, Marilyn (Committee member) / Enders, Craig (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
The study examined how ATFIND, Mantel-Haenszel, SIBTEST, and Crossing SIBTEST function when items in the dataset are modelled to differentially advantage a lower ability focal group over a higher ability reference group. The primary purpose of the study was to examine ATFIND's usefulness as a valid subtest selection tool, but it also explored the influence of DIF items, item difficulty, and presence of multiple examinee populations with different ability distributions on both its selection of the assessment test (AT) and partitioning test (PT) lists and on all three differential item functioning (DIF) analysis procedures. The results of SIBTEST were also combined with those of Crossing SIBTEST, as might be done in practice.

ATFIND was found to be a less-than-effective matching subtest selection tool with DIF items that are modelled unidimensionally. If an item was modelled with uniform DIF or if it had a referent difficulty parameter in the Medium range, it was found to be selected slightly more often for the AT List than the PT List. These trends were seen to increase as sample size increased. All three DIF analyses, and the combined SIBTEST and Crossing SIBTEST, generally were found to perform less well as DIF contaminated the matching subtest, as well as when DIF was modelled less severely or when the focal group ability was skewed. While the combined SIBTEST and Crossing SIBTEST was found to have the highest power among the DIF analyses, it also was found to have Type I error rates that were sometimes extremely high.
Contributors: Scott, Lietta Marie (Author) / Levy, Roy (Thesis advisor) / Green, Samuel B (Thesis advisor) / Gorin, Joanna S (Committee member) / Williams, Leila E (Committee member) / Arizona State University (Publisher)
Created: 2014