Estimating District Means from Binned Test Scores with binest

Paul T. von Hippel and David J. Hunter

2026-06-02

Overview

The binest package estimates group-level means and standard deviations from counts of observations falling in ordered bins, when individual observations are not available. Three estimators are provided:

bin_means() is the recommended function. It runs in milliseconds per group and produces estimates practically identical to HETOP. The HETOP commands mle_hetop() and fh_hetop() are superseded by bin_means() but remain in the package for comparison; they are harder to use and at least 100 times slower.

For a detailed empirical comparison and the theoretical case for preferring bin means, see the accompanying paper.

This vignette compares all three estimators on a bundled dataset: district-level bin counts and reported mean scale scores from the 2017-18 administration of the State of Texas Assessments of Academic Readiness (STAAR) Grade 6 mathematics test.

The Texas data

library(binest)
#> Loading required package: splines
data(tx_g6_math_2018)
dim(tx_g6_math_2018)
#> [1] 1151    8
head(tx_g6_math_2018, 3)
#>   district_id district_name n_tested unsatisfactory approaches meets masters
#> 1           1    CAYUGA ISD       40              8         13    13       6
#> 2           2   ELKHART ISD       85             12         30    35       8
#> 3           3 FRANKSTON ISD       63              3         26    23      11
#>   reported_mean
#> 1          1657
#> 2          1653
#> 3          1675

The dataset has 1,151 districts, each with bin counts in four proficiency categories and a reported average scale score that serves as ground truth.

The published cut scores defining the bin boundaries are 1,536, 1,653, and 1,772:

ngk  <- with(tx_g6_math_2018,
             cbind(unsatisfactory, approaches, meets, masters))
cuts <- c(1536, 1653, 1772)
truth <- tx_g6_math_2018$reported_mean

Method 1: bin_means() with known cutpoints

This uses the published cut scores directly. Output est_raw$group_mean_mle is on the test-score scale; est_std$group_mean_mle is on a standardized scale (population-weighted state mean 0, total state SD 1).

t0 <- Sys.time()
fit_bm_known <- bin_means(ngk, cutpoints = cuts)
t_bm_known   <- as.numeric(Sys.time() - t0, units = "secs")
cor(fit_bm_known$est_raw$group_mean_mle, truth)
#> [1] NA
plot(truth, fit_bm_known$est_raw$group_mean_mle,
     pch = 16, cex = 0.5, col = rgb(0, 0, 0, 0.3),
     xlab = "True district mean",
     ylab = "Estimated mean (test-score scale)",
     main = "bin_means (known cuts)")
abline(0, 1, col = "red", lty = 2)

Method 2: bin_means() with cutpoints estimated from data

When cutpoints are not known, bin_means() derives them from pooled state bin proportions via qnorm(cumsum(pooled_props)). Output is on the standardized scale.

t0 <- Sys.time()
fit_bm_null <- bin_means(ngk)
t_bm_null   <- as.numeric(Sys.time() - t0, units = "secs")
plot(truth, fit_bm_null$est_std$group_mean_mle,
     pch = 16, cex = 0.5, col = rgb(0, 0, 0, 0.3),
     xlab = "True district mean",
     ylab = "Estimated mean (standardized)",
     main = "bin_means (cuts from data)")

Method 3: HETOP MLE on a 50-district subsample

mle_hetop() fits a joint maximum-likelihood model across all groups. On the full 1,151-district dataset, it does not converge within a reasonable runtime. We illustrate the method on a representative random subsample of 50 districts.

set.seed(1)
sub <- sample(nrow(ngk), 50)
t0 <- Sys.time()
fit_mle <- mle_hetop(ngk[sub, ], iterlim = 200)
#> Warning in mle_hetop(ngk[sub, ], iterlim = 200): optimization algorithm may not
#> have converged properly; see 'nlmdetails' element of object
t_mle <- as.numeric(Sys.time() - t0, units = "secs")
cor(fit_mle$est_star$mug, truth[sub])
#> [1] 0.9716946
plot(truth[sub], fit_mle$est_star$mug,
     pch = 16, cex = 0.5, col = rgb(0, 0, 0, 0.3),
     xlab = "True district mean",
     ylab = "Estimated mean (standardized)",
     main = "HETOP MLE (50-district subsample)")

Method 4: HETOP Bayes on the full dataset

fh_hetop() is the Bayesian variant. Each Gibbs iteration is linear in the number of groups; on 1,151 districts the model fits in about nine minutes. The runtime in this vignette is deliberately short to keep the package build tractable; for production use, longer chains are recommended.

state_props <- colSums(ngk) / sum(ngk)
state_cum   <- cumsum(state_props)

t0 <- Sys.time()
fit_fh <- fh_hetop(
  ngk       = ngk,
  fixedcuts = qnorm(state_cum[1:2]),
  p         = c(10, 10),
  m         = c(100, 100),
  gridL     = c(-5.0, log(0.10)),
  gridU     = c( 5.0, log(5.0)),
  n.iter    = 2000,
  n.burnin  = 1000,
  seed      = 3142
)
t_fh <- as.numeric(Sys.time() - t0, units = "secs")
cor(fit_fh$fh_hetop_extras$est_star_mug$theta_pm, truth)

Because this chunk takes about nine minutes, the package ships with the per-district HETOP-Bayes posterior means pre-computed and stored in inst/extdata/fh_hetop_means_tx_g6_math_2018.rds. The scatterplot below uses those cached values.

fh_means_file <- system.file("extdata",
                             "fh_hetop_means_tx_g6_math_2018.rds",
                             package = "binest")
fh_means <- readRDS(fh_means_file)

plot(truth, fh_means,
     pch = 16, cex = 0.5, col = rgb(0, 0, 0, 0.3),
     xlab = "True district mean",
     ylab = "Estimated mean (standardized)",
     main = "HETOP Bayes (full data, cached)")

Summary

Method Runtime Districts
bin_means() (known cutpoints) 1 s 1134 / 1151
bin_means() (cutpoints from data) 1 s 1134 / 1151
mle_hetop() (50-district subsample) 3 s 50 / 50
fh_hetop (full data, not run above) 450 s 1151 / 1151

Note 1: bin_means() ran on all 1,151 districts in a small fraction of a second — roughly 500 times faster than fh_hetop(), and twice as fast as mle_hetop() ran on a 50-district subsample.

Note 2: bin_means() returned an estimate for 1,134 of the 1,151 districts. It returned NA for the remaining 17 districts, where the mean and SD wer unidentified because 2 of the 4 bins were empty. These districts contained fewer than 20 students each, and the Stanford Education Data Archive does not publish estimates for districts with fewer than 20 students (Fahle et al. 2021). Note that the HETOP functions return estimates for 2-bin districts, but they do so using fallback rules that are not part of the main estimation logic.

References

Fahle, E. M., Chavez, B., Kalogrides, D., Shear, B. R., Reardon, S. F., & Ho, A. D. (2021). Stanford Education Data Archive: Technical Documentation, Version 4.1. Stanford University.

Lockwood, J. R., Castellano, K. E., & Shear, B. R. (2018). Flexible Bayesian models for inferences from coarsened, group-level achievement data. Journal of Educational and Behavioral Statistics, 43(6), 663–692. https://doi.org/10.3102/1076998618795124

Reardon, S. F., Shear, B. R., Castellano, K. E., & Ho, A. D. (2017). Using heteroskedastic ordered probit models to recover moments of continuous test score distributions from coarsened data. Journal of Educational and Behavioral Statistics, 42(1), 3–45. https://doi.org/10.3102/1076998616666279