Adequate Sample Size
What constitutes an adequate sample size for calibrating the graded response model (GRM) and the three-parameter logistic (3PL) model?
These are two of the most commonly used models in standardized assessment instruments, and the question of what sample size is adequate for calibrating them often arises. Answering it requires a look at the sample size recommendations available in the literature. The purpose of this learning tool is to summarize a representative selection of that literature.
Three-Parameter Logistic Model
The 3PL model was first introduced in the literature by Birnbaum (1968), and one of the earliest (if not the earliest) published calibration studies appeared in Lord (1968). In this article, Lord calibrated the 3PL model on SAT data by an iterative technique that closely resembled joint maximum likelihood estimation (JMLE) but did not include maximum likelihood estimation of the c parameter, which was instead estimated in an ad hoc non-parametric manner. After much difficulty, he was able to obtain convergence. In describing his difficulties, Lord commented that the sampling errors of the estimated discrimination parameters "seem to be excessive unless n > 50, perhaps, and N > 1000," where n is the number of items and N is the number of examinees.
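For readers less familiar with the model, the following is a minimal sketch of the item characteristic curve in its standard 3PL form, with discrimination a, difficulty b, and lower asymptote c. The function name icc_3pl, the optional scaling constant D, and the item values are illustrative assumptions, not taken from any of the studies discussed here.

```python
import math

def icc_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote (pseudo-guessing).
    D is an optional scaling constant (1.7 approximates the normal ogive);
    the default of 1.0 is an assumption made for this sketch.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: a = 1.2, b = 0.0, c = 0.2.
p_at_b = icc_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)   # ability equal to difficulty
p_low = icc_3pl(theta=-6.0, a=1.2, b=0.0, c=0.2)   # very low ability
print(p_at_b, p_low)
```

When ability equals the item difficulty, the probability is halfway between c and 1; at very low ability it approaches c. This is why c is hard to estimate without many low-ability examinees.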
Another frequently referenced study with respect to 3PL estimation sample size is a Monte Carlo simulation study by Hulin, Lissak, and Drasgow (1982). They referenced Lord (1968) as recommending a sample size of at least 1000 examinees and investigated sample sizes of 200, 500, 1000, and 2000 with repeated simulation trials to get a better idea of the effect of sample size (and of test length). Like Lord (1968), they estimated parameters using a form of JMLE, as operationalized in the LOGIST computer program. In regard to recovery of the true item characteristic curves (ICCs), they reported average root mean square error (RMSE) values of about 0.03, 0.04, 0.05, and 0.06 for sample sizes of 2000, 1000, 500, and 200, respectively, for a test length of 60 items. Hulin et al. did not make a specific recommendation with respect to sample size, but others have referenced them as recommending 1000 examinees and 60 items (e.g., see Baker, 1992, p. 106).
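To make the ICC recovery criterion concrete, here is a minimal sketch of an RMSE between a true and an estimated curve. The evenly spaced ability grid from -3 to 3, the function names, and the parameter values are assumptions chosen for illustration; they are not the design or the data of Hulin et al. (1982).

```python
import math

def icc_3pl(theta, a, b, c):
    """Standard 3PL item characteristic curve (no scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def icc_rmse(true_item, est_item, lo=-3.0, hi=3.0, points=61):
    """RMSE between two ICCs on an evenly spaced ability grid (assumed)."""
    step = (hi - lo) / (points - 1)
    grid = [lo + i * step for i in range(points)]
    squared = [(icc_3pl(t, *true_item) - icc_3pl(t, *est_item)) ** 2
               for t in grid]
    return math.sqrt(sum(squared) / len(squared))

true_item = (1.0, 0.0, 0.20)   # hypothetical true (a, b, c)
est_item = (1.1, 0.1, 0.18)    # hypothetical estimated (a, b, c)
rmse = icc_rmse(true_item, est_item)
print(round(rmse, 3))
```

An RMSE on this scale is a difference in response probability, so the reported values of 0.03 to 0.06 mean that the estimated curves missed the true curves by a few percentage points on average.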
Unfortunately, the JMLE method has since been found to be inconsistent; that is, its estimates are not guaranteed to converge to the true parameter values as sample size increases (Little & Rubin, 1983). In spite of this, many researchers have referenced both Lord (1968) and Hulin et al. (1982) as recommending a sample size of at least 1000 examinees for calibrating the 3PL model.
The first statistically justifiable method for 3PL item parameter estimation was the marginal maximum likelihood estimation (MMLE) procedure of Bock and Aitkin (1981). Swaminathan and Gifford (1986) conducted one of the earliest evaluations of MMLE, but the largest sample size they employed was only 400. Mislevy (1986) was one of the first articles to apply MMLE with a more reasonable sample size of 1000. But Mislevy's paper was mostly theoretical in nature, and the estimation was conducted on only a single simulated dataset as a demonstration of the procedure, without any substantial conclusions about the adequacy of the estimation. Still, others have referenced Mislevy's article as support for the use of 1000 examinees and have interpreted his results as showing that the item parameters were accurately recovered (e.g., Harwell & Janosky, 1991).
One of the best early evaluations of the use of MMLE in estimating the 3PL model was by Yen (1987), who investigated MMLE (as implemented in BILOG) using a sample size of 1000. She reported that the RMSEs for the a and b parameter estimators were approximately 0.15 and 0.10, respectively, for a 40-item test of moderate difficulty (similarly good results were also reported for other realistic settings), thus giving substantial support to the use of 1000 examinees as an adequate sample size.
One of the most recent studies on 3PL estimation that included sample size as a factor was by Gao and Chen (2005), who looked at sample sizes of 100, 500, and 2000 with test lengths of 10, 30, and 60. For the case of 2000 examinees and 60 items (the most realistic in comparison with typical standardized tests), the RMSE was about 0.11 for a parameter estimation, 0.12 for b parameter estimation, and 0 (to the nearest hundredth) for c parameter estimation, with correlations between estimated and true values of 0.97, 1.00, and 1.00, respectively. These results strongly suggest that a sample size of 2000 is more than what is needed.
Some recent equating studies have also included an item calibration component that can be viewed as adding to the literature on sample size for 3PL estimation. Hanson and Beguin (2002) included sample sizes of 1000 and 3000. They simulated two 60-item tests with either 10 or 20 items in common. They reported RMSEs for ICC estimation of about 0.006 for 1000 examinees and about 0.004 for 3000 examinees. Kim (2006) looked at sample sizes of 300, 1000, and 3000. Kim reported RMSE values of about 0.13, 0.14, and 0.05 for the a, b, and c parameters, respectively, for a sample size of 1000. Both studies can be interpreted as lending support to the adequacy of a sample size of 1000 in calibrating the 3PL model.
The combined results of the 3PL studies, with more emphasis given to the MMLE results of Yen (1987), Hanson and Beguin (2002), Gao and Chen (2005), and Kim (2006), seem to indicate that 1000 examinees can be depended upon to give adequate parameter estimation results.
Graded Response Model
Let us turn now to investigations of sample size requirements for the GRM, which was first introduced in the literature by Samejima (1969). The development of the MULTILOG computer program by Thissen (1986) made the calibration of the GRM to data using MMLE much more accessible. Not long afterward, Reise and Yu (1990) published an extensive, large-scale simulation study to evaluate GRM parameter recovery using MULTILOG. They simulated sample sizes of 250, 500, 1000, and 2000 for a 25-item test using a variety of ability distributions. Each item was simulated to have 5 response categories, scored 0 to 4. They recommended that at least 500 examinees were needed for "adequate calibration" and that 1000 to 2000 may be needed "if structural parameter recovery is crucial." Specifically, they reported that the average RMSE for a estimation was about 0.08 for 500 and 1000 examinees and around 0.05 for a sample size of 2000. For estimation of mid-level difficulty parameters, they reported average RMSE values of about 0.011 for 500 examinees and 0.08 for 1000 and 2000 examinees. More extreme difficulty level parameters had RMSEs that were about 0.05 higher.
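As a reminder of what is being estimated in these studies, the following is a minimal sketch of GRM category probabilities for a single item with 5 response categories scored 0 to 4, in the standard logistic form of the model. The function name grm_category_probs and the parameter values are illustrative assumptions, not values from Reise and Yu (1990).

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    thresholds: ordered boundary parameters b_1 < ... < b_m, so that the
    item has m + 1 categories scored 0..m.
    """
    # Cumulative probabilities P(X >= k), bracketed by P(X >= 0) = 1
    # and P(X >= m + 1) = 0.
    cum = ([1.0]
           + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
           + [0.0])
    # Each category probability is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item: a = 1.5 with four ordered thresholds (five categories).
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

Each item carries one discrimination parameter and four threshold parameters in this setting. Extreme thresholds are informed mainly by the few examinees who use the end categories, which is consistent with their higher reported RMSEs.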
As in the case of the 3PL model, a recent equating study by Kim and Cohen (2002) has added to the literature on GRM parameter recovery. Specifically, they looked at sample sizes of 300 and 1000. For a sample size of 1000, they reported average RMSEs of about 0.08 for a, about 0.04 for mid-level difficulty parameters, and about 0.05 for more extreme difficulty parameters.
Taken together, the studies of Reise and Yu (1990) and Kim and Cohen (2002) seem to make a compelling case that a sample size of 1000 is more than adequate for GRM estimation.
References
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443–459.
Gao, F., & Chen, L. (2005). Bayesian or non-Bayesian: A comparison study of item parameter estimation in the three-parameter logistic model. Applied Measurement in Education, 18(4), 351–380.
Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26(1), 3–24.
Harwell, M. R., & Janosky, J. E. (1991). An empirical study of the effects of small datasets and varying prior variances on item parameter estimation in BILOG. Applied Psychological Measurement, 15, 279–291.
Hulin, C. L., Lissak, R. I., & Drasgow, F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6(3), 249–260.
Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43, 355–381.
Kim, S., & Cohen, A. S. (2002). A comparison of linking and concurrent calibration under the graded response model. Applied Psychological Measurement, 26(1), 25–41.
Little, R. J. A., & Rubin, D. B. (1983). On jointly estimating parameters and missing data. American Statistician, 37, 218–220.
Lord, F. M. (1968). An analysis of the verbal scholastic aptitude test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 28, 989–1020.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.
Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27(2), 133–144.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34(4, Pt. 2).
Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 581–601.
Thissen, D. (1986). MULTILOG: Multiple categorical item analysis and test scoring, Version 5. Mooresville, IN: Scientific Software.
Yen, W. M. (1987). A comparison of the efficiency and accuracy of BILOG and LOGIST. Psychometrika, 52, 275–291.

