Researchers who create tests or construct multi-item surveys to measure constructs might use Mokken scaling approaches as a scaling tool (Sijtsma et al., 2021). Mokken scaling can also be used as a secondary analysis technique to assess the applicability and performance of more widely used parametric item response theory (IRT) methods, such as the Rasch family of models (Rasch, 1960), which rely on stronger statistical assumptions. It can also be used to test whether established items remain consistent with these assumptions when they are applied to new respondent samples. Mokken models incorporate the Guttman concept into a probabilistic framework, allowing researchers to describe data while accounting for measurement error (Van Schuur, 2020).
The main advantage of NIRT over more widely used item response models, such as the Rasch model, is that it relaxes some of the rigid assumptions (the logistic ogive, or sigmoid, shape) that parametric IRT models impose on the nonlinear behavior of response probabilities (Sijtsma & Molenaar, 2019). Sijtsma and Molenaar (2019) applied nonparametric item response theory to a reading comprehension test and found it an efficient way to explore the behavior of items in scales when ordering persons and items, addressing the underpinning assumptions of IRT through two nonparametric models, namely the MHM and the DMM. Their findings are in line with the results of the present study, in which NIRT was used for TOEFL iBT listening items.
In the typical parametric approach, the item characteristic curve is expected to follow a smooth, symmetric S-shaped function from the family of logistic or probit cumulative distribution functions, with one-parameter, two-parameter, or more complex (three- or even four-parameter) models. The model becomes increasingly restrictive as the number of parameters calculated for each item decreases; as the number of parameters required to characterize each item's shape and location grows, so does the number of features in the data that the final scale can accommodate. The present study applied NIRT to assess the psychometric properties and to determine the scalability of a TOEFL listening comprehension test. In particular, this study used NIRT to investigate whether the measurement quality of the TOEFL listening comprehension test is satisfactory for ordering learners and making decisions according to the measurement results.
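To make the parametric baseline concrete, the smooth S-shaped item characteristic curve can be sketched as a two-parameter logistic function. The parameter values below are illustrative only, not estimates from the TOEFL data:

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of a correct response
    given ability theta, discrimination a, and difficulty b.
    The curve is a smooth, symmetric S-shape centered at theta = b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta = b the probability is exactly 0.5, and the curve is
# monotonically increasing in theta (illustrative values).
print(icc_2pl(0.0, 1.2, 0.0))                              # 0.5
print(icc_2pl(2.0, 1.2, 0.0) > icc_2pl(-2.0, 1.2, 0.0))    # True
```

Fixing a = 1 for all items yields the more restrictive Rasch model; NIRT drops the parametric form entirely and requires only that the curve be nondecreasing in theta.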
The work of Meijer et al. (2020) contains a full comparison of the Mokken and Rasch techniques. Under a parametric IRT approach, the decision to keep or delete an item from the final scale is based in part on item fit, assessed informally or formally, graphically, or by testing with, for example, a multiple-degree-of-freedom likelihood ratio (chi-square) test (LRT). Some features of item misfit can be traced back to a divergence from the item regression model's anticipated functional form, which is commonly logistic or normal ogive (probit). Similarly, the NIRT analysis of the TOEFL listening comprehension test showed that the MHM fitted all items of the test very well, as measured by the scalability coefficient. This indicates that the test's items can order students well on the latent trait in terms of their listening comprehension: students with a higher level of listening comprehension ability score higher on the test. Accordingly, it allows the researchers to interpret the participants' total scores as an indicator of student ordering on the latent variable.
It should be noted, however, that using this nonparametric approach for scale building means that the researcher only has ordinal information about the location of items and persons on the latent measurement continuum, not interval-level ability estimates or item locations (Molenaar & Sijtsma, 2020). The monotone homogeneity model (MHM) and the double monotonicity model (DMM) are two NIRT models for dichotomous items that have been detailed in the literature: both were introduced in a work by Mokken (1971) over 40 years ago for dichotomous (binary) items. Molenaar (Sijtsma & Molenaar, 2019) offered an extension of both models for polytomous (ordinal) items a decade later.
If the data adequately fit the MHM for dichotomous item responses, a highly important property these results ensure is that respondents can be ordered with regard to their latent value/"ability" on the basis of the simple sum score of their correct responses (denoted X+). Apart from ties, another aspect of the MHM is that for each selection of items, the expected order of the persons on the latent measurement continuum is the same. Person ordering with items with monotonically nondecreasing IRFs can therefore be thought of as "item-free," at least in theory, a property that might be useful for applied researchers who face challenges that can only be overcome by exposing different individuals to different items, such as avoiding repeated exposure in repeated-measures studies or panel designs (Junker & Sijtsma, 2001). If the dichotomous items fit the more restrictive DMM, the items are ordered in the same way (in terms of the likelihood of the indicated response) at all points along the latent measurement continuum. Invariant item ordering (IIO) is the name for this property (Wind, 2017).
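The monotonicity underlying this sum-score ordering is usually inspected through manifest monotonicity: the proportion of correct answers on an item should not decrease across groups with higher rest scores (the sum score on the remaining items). A minimal sketch of that check, using an invented toy item-score matrix rather than the TOEFL data:

```python
import numpy as np

def rest_score_proportions(X, item):
    """Proportion correct on one dichotomous item within each
    rest-score group (sum score on all remaining items). Under the
    MHM these proportions should be nondecreasing as the rest score
    increases (manifest monotonicity)."""
    rest = X.sum(axis=1) - X[:, item]
    groups = sorted(set(rest))
    return [(g, X[rest == g, item].mean()) for g in groups]

# Invented toy data: rows are respondents, columns are items
X = np.array([[0, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 1]])
props = [p for _, p in rest_score_proportions(X, 0)]
print(props == sorted(props))   # True: proportions are nondecreasing
```

In practice, the R package "mokken" performs this check (with grouping of sparse rest scores and significance tests) via its monotonicity diagnostics; the sketch above only conveys the core idea.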
IIO allows researchers to order items based on their difficulty (facility) or commonality/prevalence, a feature that helps researchers communicate the hierarchical ordering of scale items to users. This quality is particularly valued and has been widely used, for example, in IQ or developmental assessments, as well as in rehabilitation outcome evaluation, where recovery is defined as the reacquisition of skills or functions of varying degrees of difficulty in a predictable order. On a unidimensional continuum, this emphasizes the formation of cumulative hierarchical scales. Scales that fit the DMM have a number of other advantages; for example, if a respondent is known to have answered 6 (out of 10) dichotomous items correctly, those 6 items were most likely the easiest items in the set (most likely, rather than certainly, because the DMM is a probabilistic model).
This would apply to common or rare symptoms in health applications, with rare symptoms usually indicating the presence of common problems (if a cumulative hierarchical scale holds). Furthermore, if the DMM fits the item response data, the IIO property is predicted to hold in any subgroup of the population and is thus deemed "person-free." If a researcher finds that the DMM does not fit the data, this may indicate that measurement invariance needs to be considered, because such a failure can result from the presence of differential item functioning (Junker & Sijtsma, 2001), in which the shape of the IRF differs across groups. Despite their importance, broader questions of measurement invariance and DIF in NIRT are outside the focus of this paper. Fitting the MHM does not (theoretically) imply that respondents can be sorted based on their sum score X+ for items scored in more than two categories (i.e., polytomous items) (Wind, 2017).
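A crude way to probe this "person-free" claim is to check whether the facility ordering of the items is the same in different subgroups; established procedures (e.g., the HT coefficient and the check.iio routine in the R package "mokken") are far more refined, but the core idea can be sketched as follows, with subgroup data invented for illustration:

```python
import numpy as np

def item_order(X):
    """Items ranked from hardest to easiest by facility (proportion
    correct); under IIO this ordering is the same in every subgroup."""
    return tuple(np.argsort(X.mean(axis=0)))

# Invented data: the same item ordering holds in both subgroups
group_a = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
group_b = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0]])
print(item_order(group_a) == item_order(group_b))  # True
```

A discrepancy between the two orderings would be the kind of symptom that, as noted above, points toward differential item functioning.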
However, according to one simulation study (Jong & Molenaar, 1987), X+ can be used in practice as a proxy for ordering people according to their latent values without causing many major problems. Importantly, the IIO property is not implied by fitting the DMM to polytomous item response data. If this property is desired, different methods for evaluating it have been presented (Oppenheim, 2020). These approaches have yet to be incorporated into the commercial software MSPWin, but they are useful because they have lately become freely available in the package "mokken" (Van der Ark, 2021) for the free statistical computing environment R.
Nonparametric item response theory is a beneficial instrument for inspecting the performance of items in scales in an effort to order persons and items, using two nonparametric models, the MHM and the DMM, to address IRT's underpinning assumptions.
The psychometric properties, as well as the scalability, of a TOEFL iBT listening comprehension exam developed and delivered in the Iranian context were measured using NIRT. NIRT was utilized in this study to determine whether the measurement quality of the TOEFL iBT listening comprehension test is good enough to rank learners and make judgments based on the results.
Many NIRT researchers (Sijtsma & Van der Ark, 2017; Van der Ark, 2021) claim that NIRT can provide scholars with a complete examination of item scalability structure. The NIRT analysis of the TOEFL iBT listening comprehension exam revealed that the MHM fit all test items very well, as indicated by the scalability coefficient, which specifies that the test's items can order the students appropriately on the latent trait in terms of their TOEFL iBT listening comprehension: students with a higher level of listening comprehension ability score higher on the test.
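The scalability coefficient referred to here is, in practice, Loevinger's H as computed by software such as the R package "mokken." As a rough sketch of the underlying idea, H for dichotomous items compares observed Guttman errors with those expected under marginal independence; the toy matrix below is invented for illustration, not the TOEFL responses:

```python
import numpy as np

def loevinger_H(X):
    """Overall scalability coefficient H for a dichotomous item-score
    matrix X (rows = respondents, columns = items): H = 1 - F/E, where
    F counts observed Guttman errors and E the errors expected if the
    items were marginally independent."""
    n, k = X.shape
    p = X.mean(axis=0)                 # item facilities
    F = E = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            easy, hard = (i, j) if p[i] >= p[j] else (j, i)
            # Guttman error: correct on the harder item, wrong on the easier
            F += np.sum((X[:, hard] == 1) & (X[:, easy] == 0))
            E += n * p[hard] * (1 - p[easy])
    return 1 - F / E

# A perfect Guttman pattern produces no errors, so H = 1
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1],
              [0, 0, 0]])
print(loevinger_H(X))  # 1.0
```

Common rules of thumb treat H >= 0.3 as the minimum for a usable scale, with H >= 0.5 indicating a strong scale (Mokken, 1971).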
As a result, the researchers can use the total scores of the participants to determine student ordering on the latent variable. Nevertheless, the HT value of 0.24 indicated that the test items are not accurately ordered. In addition, based on the IIO assumption, two items (items 3 and 11) should be deleted to attain IIO. Given IIO, we can draw the conclusion that the ordering of items according to their means is invariant across subgroups (Ligtvoet et al., 2010).
To summarize, this research applies nonparametric item response theory to educational assessments and assesses the fit of the NIRT models. Both the MHM and the DMM fit the test, demonstrating its validity and usefulness.