Analysis of the data from the pre-training phase yielded the FACETS variable map representing all the facets. In the FACETS variable map, presented in Fig. 1, the facets are placed on a common logit scale that facilitates interpretation and comparison across and within the facets in one report. The figure plots test takers’ ability, raters’ severity, task difficulty, scale criterion difficulty, test version difficulty, and group expertise. According to McNamara (1996), the logit scale is a measurement scale that expresses the probabilities of test takers’ responses under various conditions of measurement. The map also gives the means and standard deviations of the distributions of estimates for test takers, raters, and tasks at the bottom.

Fig. 1 FACETS variable map (pre-training)

The first column (logit scale) in the map depicts the logit scale, which acts as a fixed reference frame for all the facets. It is a true interval scale, with equal distances between intervals (Prieto & Nieto, 2019). Here, the scale ranges from 4.0 to –4.0 logits.
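For reference, the many-facet Rasch model that FACETS implements expresses these logits as parameters in a log-odds equation. The simplified three-facet form below is a sketch; the study’s full model would add terms for scale category, test version, and group expertise in the same subtractive way.

```latex
% Many-facet Rasch model (rating scale form), simplified to three facets:
% the log-odds that test taker n receives category k rather than k-1
% from rater j on task i.
\[
  \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
\]
% B_n: ability of test taker n      C_j: severity of rater j
% D_i: difficulty of task i         F_k: Rasch-Andrich threshold of category k
```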

The second column (Test Taker) displays estimates of test takers’ proficiency, each star representing a single test taker. Higher scoring (more competent) test takers are at the top of the column, whereas lower scoring (less competent) ones are at the bottom. Here, test takers’ proficiency estimates range from 3.81 to –3.69 logits, a spread of 7.50 logits in test takers’ ability. It is worth noting that no test taker was identified as misfitting; thus, none was excluded from data analysis during the pre-training phase of this research.

The third column (rater) displays raters in terms of their severity or leniency estimates in scoring test takers’ oral proficiency. Since more than one rater scored each test taker’s performance, raters’ severity or leniency patterns can be estimated, yielding a severity index for each rater. In this column, each star represents one rater. More severe raters appear at the top of the column, more lenient ones at the bottom. In the pre-training phase, rater OLD8 (severity measure 1.72) was the most severe rater and rater NEW6 (severity measure –1.97) the most lenient. Moreover, in this phase, OLD raters were, on average, more severe than NEW raters. Raters’ severity estimates range from 1.72 to –1.97 logits (logit range = 3.69), a distribution much narrower than that of the test takers’ proficiency measures (logit range = 7.50), whose highest and lowest values were 3.81 and –3.69 logits, respectively. This indicates that the effect of individual differences among raters on test takers was relatively small. Raters, as shown in the figure, appear to be spread evenly above and below 0.00 logits.

The fourth column (task) displays the oral tasks used in this study in terms of their difficulty estimates. Tasks appearing at the top of the column were harder for the test takers to perform than those at the bottom. Here, the Exposition task (logit value = 0.82) was harder for the test takers than the other tasks, while the Description task (logit value = –0.37) was the least difficult, yielding a spread of 1.19 logits. This column shows the lowest variation, with all elements clustered around the mean.

The fifth column (scale category) displays how severely the rating scale categories were scored. The most severely rated category appears at the top and the least severely rated at the bottom. Here, Cohesion was the most severely scored category (logit value = 0.79), whereas Grammar was the least severely scored (logit value = –0.46).

Columns 6 to 11 (rating scale categories) display the six-point rating scale categories employed by the raters to evaluate the test takers’ oral performances. The horizontal lines across the columns are the category threshold measures, which specify the points at which the probability of achieving the next rating (score) begins. The figure shows that every score level was used, although frequencies were lower at the extreme points. Here, test takers with proficiency measures between –1.0 and +1.0 logits were likely to receive ratings of 3 to 4 in Cohesion. Similarly, test takers at a proficiency of 2.0 logits had a relatively high probability of receiving a 5 from a rater at a severity level of 2.0 in Intelligibility.
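To make the threshold interpretation concrete, the category probabilities implied by the rating scale model above can be computed directly. The sketch below is illustrative only: the threshold values and function name are hypothetical, not the study’s estimates.

```python
import math

def category_probabilities(ability, severity, thresholds):
    """Rasch rating scale category probabilities for one test taker-rater pairing.

    ability    -- test taker measure B_n (logits)
    severity   -- rater measure C_j (logits)
    thresholds -- Rasch-Andrich thresholds F_1..F_K (logits), one per step
                  between adjacent categories; other facet measures (task,
                  test version, ...) would be subtracted from theta the same way.
    """
    theta = ability - severity  # net measure facing the scale
    # The log-numerator of each category is the cumulative sum of (theta - F_k);
    # the lowest category's log-numerator is fixed at 0.
    log_num = [0.0]
    for f in thresholds:
        log_num.append(log_num[-1] + (theta - f))
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]

# Hypothetical thresholds for a six-point scale (five steps between scores 1-6):
thresholds = [-2.1, -0.9, 0.0, 0.9, 2.1]
# A test taker at 2.0 logits rated by a rater at severity 2.0 faces a net
# measure of 0.0; with these symmetric hypothetical thresholds, the middle
# scores are most probable. The study's actual probabilities depend on the
# thresholds estimated for each scale category.
print([round(p, 3) for p in category_probabilities(2.0, 2.0, thresholds)])
```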

RQ1: How much of test takers’ total score variance can be accounted for by each facet?

The FACETS program makes it possible to determine how much of the score variance is attributable to each of the facets employed. Accordingly, a further analysis was conducted to measure the extent to which the total score variance is associated with each of the facets identified in this study. Table 5 shows the percentage of total score variance associated with each facet prior to the training program. The table shows that the greatest percentage of the total variance (44.82%) is related to differences in test takers’ ability; the remaining variance (55.18%) is related to the other facets, including rater severity, group expertise, test version, task difficulty, and scale categories.

Table 5 Effect of each facet on total score variance (pre-training)

The rather high percentage of total score variance not attributable to test takers’ ability at the pre-training phase calls for caution about the effect of unsystematic rating and the presence of undesirable facets influencing the final score. In particular, the rater facet accounts for a considerable share of the total score variance (26.13%), indicating likely inconsistency and disagreement among raters’ judgments: some raters were relatively more severe or more lenient toward the test takers than others. This finding implies that test takers would be scored differently depending on the rater. The rather small effect of the other facets, including test version, task difficulty, and scale categories, shows that their bilateral and multilateral interactions contributed only slightly to score variability; in combination, these facets largely neutralized one another’s effect on test variability.
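FACETS reports this decomposition directly. As a rough analogue only, the share of raw score variance associated with each facet can be approximated with an ANOVA-style decomposition; the sketch below assumes a long-format ratings file and hypothetical column names. Facets that are properties of other facets (e.g., group expertise as an attribute of raters) may be confounded in such a raw decomposition, which is one reason the Rasch-based estimate is preferable.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long format: one row per rating event.
# Assumed columns: score, taker, rater, task, category, version, group.
df = pd.read_csv("ratings_pre_training.csv")

model = smf.ols(
    "score ~ C(taker) + C(rater) + C(task) + C(category) + C(version) + C(group)",
    data=df,
).fit()
aov = anova_lm(model, typ=2)
# Express each facet's sum of squares as a percentage of the total
# (the residual row absorbs unexplained variance).
aov["pct_of_total"] = 100 * aov["sum_sq"] / aov["sum_sq"].sum()
print(aov[["sum_sq", "pct_of_total"]].round(2))
```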

Analysis of the data from the immediate post-training phase likewise yielded a FACETS variable map representing all the facets and summarizing the main information about each. The FACETS variable map, displayed in Fig. 2, plots test takers’ ability, raters’ severity, task difficulty, scale criterion difficulty, test version difficulty, and group expertise.

Fig. 2 FACETS variable map (immediate post-training)

The second column (test taker) displays estimates of test takers’ proficiency. Here, test takers’ proficiency estimates range from 3.62 to –3.16 logits, a spread of 6.78 logits. The reduction of this spread from 7.50 logits (before training) to 6.78 logits (after training) shows that the test takers were rated more similarly with regard to severity/leniency; that is, they were more tightly clustered around the mean in raters’ scoring of their oral proficiency.

The third column (rater) displays raters’ severity or leniency estimates in rating test takers’ oral proficiency. Here, raters’ severity estimates range from 1.26 to –1.05 logits (logit range = 2.31), again a much narrower distribution (roughly one-third) than that of the test takers’ proficiency measures (logit range = 6.78), whose highest and lowest values were 3.62 and –3.16 logits, respectively. This indicates that the effect of individual differences among raters on test takers was relatively small. As in the pre-training phase, raters appear to be spread evenly above and below 0.00 logits. Moreover, the marked reduction of the raters’ severity distribution from 3.69 logits in the pre-training phase to 2.31 logits in the immediate post-training phase demonstrates the efficiency of the training program in bringing raters closer to one another in severity/leniency; in other words, they rated more similarly after the training program.

The fourth column (task) displays the oral tasks used in this study in terms of their difficulty estimates. Here, the Exposition task (logit value = 0.61) was harder for the test takers than the other tasks, while the Description task (logit value = –0.14) was the least difficult, yielding a spread of 0.75 logits. The reduction of this logit range compared to the pre-training phase indicates that the tasks were rated with less variation in severity/leniency. This column shows the lowest variation, with all elements clustered around the mean.

The fifth column (scale category) displays how severely the rating scale categories were scored. Here, Cohesion was the most severely scored category (logit value = 0.58), whereas Grammar was the least severely scored (logit value = –0.17).

Similar to the pre-training phase, the total score variance attributable to each facet was calculated to measure the effect of each facet on total score variance immediately following the training program. Table 6 displays the percentage of total score variance associated with each facet at the immediate post-training phase. The table shows that the greatest percentage of the total variance (67.12%) is related to differences in test takers’ ability; the remaining variance (32.88%) is related to the other facets, including rater severity, group expertise, test version, task difficulty, and scale categories.

Table 6 Effect of each facet on total score variance (immediate post-training)

The considerable increase in the percentage of total score variance attributed to test takers’ ability, together with the reduction in the variance attributed to other facets, indicates a significant increase in the systematicity and consistency of scoring following the training program. In other words, the training program was quite effective in reducing the undesirable facets and unsystematic scoring that influence total score variance. The scoring procedure moved toward greater consistency, with the majority of score variance associated with differences in test takers’ performance ability.

Analysis of the data from the delayed post-training phase yielded a FACETS variable map representing all the facets. The FACETS variable map, displayed in Fig. 3, plots test takers’ ability, raters’ severity, task difficulty, scale criterion difficulty, test version difficulty, and group expertise.

Fig. 3 FACETS variable map (delayed post-training)

The second column (test taker) displays estimates of test takers’ proficiency. Here, test takers’ proficiency estimates range from 3.70 to –3.53 logits, a spread of 7.23 logits.

The third column (rater) displays raters’ severity or leniency estimates in rating test takers’ oral proficiency. Here, raters’ severity estimates range from 1.28 to –1.26 logits (logit range = 2.54), again a much narrower distribution (roughly one-third) than that of the test takers’ proficiency measures (logit range = 7.23), whose highest and lowest values were 3.70 and –3.53 logits, respectively. This indicates that the effect of individual differences among raters on test takers was relatively small. As in the previous two phases of the study, raters appear to be spread evenly above and below 0.00 logits. Comparing the severity distributions, raters were still closer to one another in the delayed post-training phase (2.54 logits) than in the pre-training phase (3.69 logits), which shows the rather long-lasting effectiveness of the training program. However, the increase in the severity range compared to the immediate post-training phase (2.31 logits) reflects the raters’ tendency to drift gradually back toward their original way of rating, implying a need for ongoing training programs at regular intervals.

The fourth column (task) displays the oral tasks used in this study in terms of their difficulty estimates. Here, the Exposition task (logit value = 0.66) was harder for the test takers than the other tasks, while the Description task (logit value = –0.24) was the least difficult. This column shows the lowest variation, with all elements clustered around the mean.

The fifth column (scale category) displays how severely the rating scale categories were scored. The most severely scored category appears at the top and the least severely scored at the bottom. Here, Cohesion was the most severely scored category (logit value = 0.62), whereas Vocabulary was the least severely scored (logit value = –0.24).

Figures 4, 5, 6, 7, 8, and 9 graphically plot the raters’ bias interactions with the test takers, expressed as z-scores, for NEW and OLD raters at the three phases of the study. The graphs display all rater biases, whether significant or not. In each plot, the curved line displays the raters’ severity logit; the plotted symbols show the z-scores, with ✖ symbols marking significant bias and the remaining symbols indicating non-significant bias.
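As background, in a FACETS bias/interaction analysis each plotted value is the estimated bias for a rater-test taker pairing standardized by its standard error; |z| ≥ 2 is the conventional flagging criterion, though the study’s exact cut-off is not restated here.

```latex
% Standardized bias for one rater-test taker pairing:
\[
  z = \frac{\hat{b}}{\mathrm{SE}(\hat{b})}
\]
% \hat{b}: estimated bias (in logits) for the interaction term.
% |z| >= 2 is the conventional criterion for flagging significant bias.
```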

Fig. 4 Old raters’ bias interaction (pre-training)

Fig. 5 New raters’ bias interaction (pre-training)

Fig. 6 Old raters’ bias interaction (immediate post-training)

Fig. 7 New raters’ bias interaction (immediate post-training)

Fig. 8 Old raters’ bias interaction (delayed post-training)

Fig. 9 New raters’ bias interaction (delayed post-training)

Pre-training: there were 3 significant biases among the NEW raters, all identified as significantly lenient. For the OLD raters, the data showed 4 significant biases, of which 3 were significantly severe and 1 significantly lenient.

Immediate post-training: there were 3 significant biases among the OLD raters, all identified as significantly severe. No NEW rater showed a significant bias in the immediate post-training phase of the study.

Delayed post-training: there was 1 significant bias among the NEW raters, identified as significantly lenient; however, the leniency measure fell only slightly outside the acceptable range and could arguably be disregarded. For the OLD raters, the data showed 4 significant biases, of which 3 were significantly severe and 1 significantly lenient. One rater was on the borderline of the severity measure.

Additionally, to represent the raters’ consistency measures graphically throughout the three phases of the study, the raters’ infit mean square values were employed. As indicated before, infit mean square values between 0.6 and 1.4 are considered acceptable (Wright & Linacre, 1994). Figure 10 plots the change in raters’ rating consistency, expressed as infit mean square values, across the three phases of the study.
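For reference, the infit mean square is the information-weighted average of the squared rating residuals; a standard formulation for a given rater is:

```latex
% Information-weighted (infit) mean square for a rater, summing over the
% ratings x_{ni} that rater gave (test taker n, task i):
\[
  \text{Infit MnSq}
  = \frac{\sum_{n,i} \left( x_{ni} - E_{ni} \right)^2}{\sum_{n,i} W_{ni}}
\]
% E_{ni}: expected rating under the model; W_{ni} = Var(x_{ni}): model variance.
% Values near 1 are ideal; values above 1.4 signal noisy, inconsistent ratings
% (underfit), while values below 0.6 signal muted variation (overfit).
```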

Fig. 10 Raters’ rating consistency measures in the three phases of the study

The raters achieved greater consistency in the immediate post-training phase. In the delayed post-training phase, although the raters were still more consistent than in the pre-training phase, their consistency had declined considerably relative to the immediate post-training phase. For most raters, the training program and feedback proved beneficial, bringing them within the acceptable range of consistency after training. Only rater OLD8 (infit MnSq = 0.5) still fell outside the acceptable fit range after training. In the delayed post-training phase, although consistency remained higher than in the pre-training phase, a few more raters appear to have lost consistency compared to the immediate post-training phase: raters OLD3 and OLD8, with infit mean square values of 1.5 and 0.4 respectively, fell outside the acceptable range. Notably, the raters who did not improve, or even lost consistency, after training were among those who were not positive about the rater training program and the feedback to be provided.

As in the previous two phases of the study, the total score variance associated with each facet was calculated to measure the effect of each facet on total score variance during the delayed post-training phase. Table 7 displays the percentage of total score variance associated with each facet at the delayed post-training phase. The table shows that, once again, the greatest percentage of the total variance (61.85%) is attributed to differences in test takers’ ability; the remaining variance (38.15%) is related to the other facets, including rater severity, group expertise, test version, task difficulty, and scale categories.

Table 7 Effect of each facet on total score variance (delayed post-training)

In the delayed post-training phase, a significant gain in scoring consistency and a reduction in the influence of other intervening facets on total score variance are still observed relative to the pre-training phase. A considerable proportion of the total score variance is related to differences in test takers’ oral performance ability, which shows relative systematicity and consistency in scoring compared to the pre-training phase. This outcome provides evidence of the training program’s continued efficiency in the long term. However, compared with the immediate post-training phase, a reduction in the total score variance associated with test takers’ ability and an increase in the variance related to other intervening facets are observed. Although this outcome still indicates scoring consistency grounded in test takers’ oral ability, it points to a gradual loss of consistency and an increase in error and unsystematicity after training.

RQ2: To what extent was the feedback provided following the training program successful with respect to severity, bias, and consistency measures?

The following tables (Tables 8, 9, and 10) present the results of training and feedback provision on severity, bias, and consistency measures across the three phases, for both successful and unsuccessful adjustments.

Table 8 Effectiveness of training program and feedback provision on raters’ severity measures

Table 8 shows differences in the successful application of the training program and the effectiveness of the feedback in reducing raters’ severity, based on severity logit values across the three phases of the study. A pairwise comparison using chi-square analysis revealed a significant difference in successful severity reduction between the pre-training and the immediate post-training phase (χ²(1) = 32.59, p < 0.05) and between the pre-training and the delayed post-training phase (χ²(1) = 9.761, p < 0.05). However, no statistically significant difference was observed between the immediate post-training and the delayed post-training phase (χ²(1) = 1.408, p > 0.05).
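A minimal sketch of such a pairwise comparison, using hypothetical counts of raters with successful versus unsuccessful severity adjustments in two phases (the actual counts are those underlying Table 8, which are not reproduced here):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = phase, columns = (successful, unsuccessful).
observed = [
    [4, 12],   # pre-training: few raters within the acceptable severity range
    [14, 2],   # immediate post-training: most raters adjusted successfully
]
# correction=False disables Yates' continuity correction; whether the study
# applied it is not stated.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```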

Table 9 presents the same comparison with respect to bias, based on the z-score values obtained from FACETS. The result is fairly similar to that of the severity analysis. A pairwise comparison using chi-square analysis revealed a significant difference in successful bias reduction between the pre-training and the immediate post-training phase (χ²(1) = 16.42, p < 0.05) and between the pre-training and the delayed post-training phase (χ²(1) = 4.97, p < 0.05). However, no statistically significant difference was observed between the immediate post-training and the delayed post-training phase (χ²(1) = 0.154, p > 0.05).

Table 9 Effectiveness of training program and feedback provision on raters’ bias measures

Table 10 displays the results of the consistency comparison across the three phases, based on the infit mean square values. The pattern mirrors that of the two preceding tables. Chi-square analysis revealed a significant difference in successful consistency achievement between the pre-training and the immediate post-training phase (χ²(1) = 23.14, p < 0.05) and between the pre-training and the delayed post-training phase (χ²(1) = 7.63, p < 0.05). However, no statistically significant difference was obtained between the immediate post-training and the delayed post-training phase (χ²(1) = 0.822, p > 0.05).

Table 10 Effectiveness of training program and feedback provision on raters’ consistency measures

As indicated before, fit statistics are used to identify which raters tended to overfit (too little variation, i.e., excessive consistency) or underfit/misfit (too much variation) the model, and, at the same time, which raters rated consistently with the rating model. Table 11 displays the frequencies and percentages of rater fit values falling within the overfit, acceptable, and underfit (misfit) categories.

Table 11 Percentages of rater mean square fit statistics
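A small helper illustrating the classification behind Table 11, using the 0.6–1.4 acceptable range cited earlier; the function name and example values from the text are for illustration.

```python
def classify_fit(infit_mnsq, lo=0.6, hi=1.4):
    """Classify a rater's infit mean square against the acceptable range."""
    if infit_mnsq < lo:
        return "overfit (too little variation)"
    if infit_mnsq > hi:
        return "underfit / misfit (too much variation)"
    return "acceptable"

# Values mentioned in the text: OLD8 post-training (0.5), OLD3 delayed (1.5).
print(classify_fit(0.5))   # overfit (too little variation)
print(classify_fit(1.5))   # underfit / misfit (too much variation)
```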

