Research foundation

Thirteen research lineages, one product.

Every major feature in Growing Standard traces to published research spanning assessment design, literacy, mathematics pedagogy, motivation, and cognitive science. Empirical calibration and external validation await pilot data — what follows is the research informing our design, not evidence about product outcomes.

ESSA evidence-tier classification

Tier 4 today. Tier 3 by design, after the pilot.

ESSA Section 8101(21) defines the four evidence tiers districts use to evaluate education products: Strong (Tier 1, RCT), Moderate (Tier 2, quasi-experimental), Promising (Tier 3, correlational with controls), and Demonstrates a Rationale (Tier 4, logic model + ongoing evaluation). Here's where Grow sits today, and where it's heading.

Tier 1 (Strong): Randomized controlled trial. Status: not currently planned.

Tier 2 (Moderate): Quasi-experimental, with a comparison group. Status: 2028+ pathway.

Tier 3 (Promising): Correlational with statistical controls. Status: spring 2028 target.

Tier 4 (Demonstrates a Rationale): Logic model + ongoing evaluation. Status: today.

The path between tiers

OSF analysis-plan drafts: deposited

Formal Registrations: locked 30–60 days before pilot data

First contracted pilot: three windows per school year

First Tier 3 report: spring 2028

Why Tier 4 today. Every feature is mapped to peer-reviewed research (the citation map below), and the four pre-committed validity surfaces — Mantel-Haenszel DIF, conditional SEM by theta band, classification accuracy at Tier-2/Tier-3 cuts, and validity-evidence framework (OSF-deposited at osf.io/akzec, formal Registration locked 30-60 days before the first contracted pilot data flows) — are committed in the pilot evaluation plan. Together those meet the two operational requirements of Tier 4: a well-defined logic model based on high-quality research, plus an ongoing effort to study the effects of the intervention.
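As an illustration of the first validity surface named above, here is a minimal sketch of a Mantel-Haenszel DIF statistic. It assumes students have already been matched into ability strata and that each stratum is summarized as four counts; the function name, the input layout, and the example counts are hypothetical and are not taken from the pilot evaluation plan. The conversion to the ETS delta scale (delta = -2.35 × ln(alpha)) is the standard one.

```python
import math

def mantel_haenszel_dif(strata):
    """Return (common odds ratio, ETS delta) for one item.

    strata: list of (ref_correct, ref_wrong, focal_correct, focal_wrong)
    counts, one tuple per matched ability stratum (hypothetical layout).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    alpha = numerator / denominator
    # Negative delta means the item favors the reference group.
    delta = -2.35 * math.log(alpha)
    return alpha, delta

# Made-up counts: both groups perform identically in every stratum.
no_dif = [(30, 20, 30, 20), (40, 10, 40, 10), (45, 5, 45, 5)]
alpha, delta = mantel_haenszel_dif(no_dif)
```

With identical performance in every stratum the common odds ratio is 1 and delta is 0; an item that is easier for the reference group at matched ability yields an odds ratio above 1 and a negative delta.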

Pathway to Tier 3 — two pre-registration drafts, not one. The Tier 3 case rests on two independent pre-registration analysis plans, each deposited as a draft on OSF. Formal Registrations (the locked, time-stamped, DOI-minted snapshots) are locked 30-60 days before first contracted pilot data flows, after academic methodologist review. Either one clearing its thresholds supports a Tier 3 claim; both clearing strengthens the package substantially.

Pathway to Tier 2 — two complementary study designs. Tier 2 (Moderate) requires a quasi-experimental design. Two distinct study designs are in flight, each addressing a different piece of the Tier 2 question: (a) within-platform intervention efficacy — when the platform's diagnostic loop assigns a student a focused-lesson remediation for a tagged misconception, do those students show greater reduction in same-tag misconception rate on the next class test than within-classroom matched-comparison students with the same baseline misconception who didn't get the assignment? This study runs entirely within the platform (no external-benchmark data-sharing agreement required), produces evidence on a narrower construct (within-platform misconception resolution, not external-benchmark growth), and has the fastest realistic Tier 2 timeline of any study in the program — contingent on early pilot windows accumulating N ≥ 200 paired observations per misconception family × grade-band cell, which in turn depends on a partner district contracting and on methodologist signoff before lock. (b) External-benchmark transfer — among students at participating sites, do students with high practice volume show larger one-year iReady / MAP / SOL Δscore than within-school matched low-practice students at the same grade × baseline-quintile? This study answers the question every superintendent asks ("will my students' state-test scores go up?"), but the construct is broader and the timeline is slower — gated on at least one DSA-bearing district licensing iReady or MAP and on academic methodologist signoff before lock.

Study 1 — Concurrent validity (OSF Project, formal Registration locked 30-60 days before first contracted pilot data flows). Pearson correlations between Grow placement and external benchmarks (Virginia SOL scale scores, NWEA MAP Growth, iReady) plus polychoric correlations against teacher 5-point judgment ratings, controlling for grade and prior achievement. Pre-committed thresholds: r ≥ 0.50 with 95% lower bound ≥ 0.30 against state tests; r ≥ 0.60 against same-construct CATs. Contingent on at least one Virginia public-school data-sharing agreement (DSA) being executed in time for the spring 2028 reporting window. DSA-free signals (in-product ORF probe and teacher-judgment ratings) are collected universally and reported from the first pilot window onward as exploratory evidence even without an external benchmark in place.
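The state-test threshold in Study 1 can be sketched as follows. This is a simplified, zero-order version: it computes a plain Pearson correlation and a Fisher z lower bound, and it does not implement the controls for grade and prior achievement that the study specifies. The function names are hypothetical, and the choice of the Fisher z interval is an assumption rather than a detail confirmed by the analysis plan.

```python
import math

def pearson_r(x, y):
    """Zero-order Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_lower_bound(r, n, z_crit=1.96):
    """Lower end of the 95% interval for r via the Fisher z transform."""
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * standard_error)

def clears_state_test_threshold(r, n):
    """Study 1 rule against state tests: r >= 0.50 and lower bound >= 0.30."""
    return r >= 0.50 and fisher_lower_bound(r, n) >= 0.30

# Same observed correlation, different sample sizes.
large_sample = clears_state_test_threshold(0.55, 200)  # True
small_sample = clears_state_test_threshold(0.55, 30)   # False
```

The two calls show why the lower-bound clause matters: an observed r of 0.55 clears the bar with 200 students but not with 30, because the interval around a small-sample estimate is too wide.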

Study 2 — Practice dosage and learning growth (OSF Project; formal Registration locks after methodologist review, before any pilot data flows). Asks the mechanism question districts care about: does in-game engagement actually predict learning growth, and which engagement vectors carry the signal? Six telemetry vectors are tested confirmatorily and separately — practice minutes, library reading minutes, math game minutes, lesson minutes, class-test minutes, and grow-session minutes — against window-over-window Grow placement growth and (DSA permitting) spring external-benchmark outcomes. Pre-committed thresholds: standardized β ≥ 0.10 on within-system growth, β ≥ 0.08 on external benchmarks. A construct-correctness check (grow-session minutes should NOT predict growth) protects against test-taking experience confounds. Library minutes are tested separately because independent reading volume is the strongest documented predictor of reading growth (Allington 2014; Mol & Bus 2011; Stanovich 1986).

Study 3 — Within-platform intervention efficacy (advisor-review draft, OSF deposit pending methodologist input). The within-platform Tier 2 study above, in detail. When a class test surfaces a tagged misconception (per the misconception taxonomy described in the assessment section), teachers can assign a focused-lesson remediation tied to that exact tag. Subsequent class tests on the same construct produce a paired pre/post measurement of same-tag misconception rate. The confirmatory comparison is within-classroom matched pairs (treatment = assigned and attempted the focused lesson; comparison = same baseline tag + classroom + grade + subject + baseline-placement quartile, no assignment). Pre-committed threshold: paired Cohen's d ≥ 0.20 with 95% lower bound > 0 at N ≥ 200 paired observations per misconception family × grade-band cell, and at least three of the top six most-prevalent families clearing after Benjamini-Hochberg correction. Mandatory prerequisite: each family used must independently clear the response-pattern detection validity study below — Tier 2 claims here are conditional on the diagnostic instrument's reliability being established for that family first.
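The multiple-comparison rule in Study 3 can be illustrated with a minimal Benjamini-Hochberg sketch. The p-values are made up, the false-discovery rate of 0.05 is an assumption (the text above does not state it), and the helper names are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up procedure: find the largest rank k with p_(k) <= k * q / m,
    then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            largest_passing_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[index] = True
    return rejected

def enough_families_clear(p_values, minimum=3, q=0.05):
    """Study 3 rule: at least three of the tested families clear."""
    return sum(benjamini_hochberg(p_values, q)) >= minimum

# Made-up p-values for the six most-prevalent misconception families.
family_p_values = [0.001, 0.008, 0.012, 0.041, 0.042, 0.60]
cleared = benjamini_hochberg(family_p_values)
```

In this example the three smallest p-values survive correction and the rest do not, so the "at least three of six" condition is met exactly. The effect-size and sample-size conditions in the threshold would be checked separately.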

Study 4 — Response-pattern detection validity (advisor-review draft, OSF deposit pending methodologist input). The supporting-evidence layer that gates Study 3's family-level claims. Tests whether the platform's misconception-tagging pipeline reliably reproduces flags within session, distinguishes them across constructs, and corresponds to qualitatively distinct underlying reasoning when independently probed via researcher-led clinical interview (N = 40-60). Construct framing is response-pattern detection, not misconception-as-stable-trait — consistent with the knowledge-in-pieces tradition (diSessa 1988; Hammer 2000; Smith, diSessa & Roschelle 1993). Most cohort-flexible of the five studies — the first credible reporting window is gated on methodologist recruitment, OSF Registration lock, and the first pilot cohorts producing sufficient interview-eligible flag events.

Study 5 — External-benchmark transfer (advisor-review draft, OSF deposit pending methodologist input). The external-benchmark Tier 2 study above, in detail. Quasi-experimental matched-pair within-school design comparing high-practice-volume vs. low-practice-volume students on iReady / MAP / SOL one-year Δscore at the same school × grade × classroom × baseline-quintile. Pre-committed threshold: paired-t Cohen's d ≥ 0.20 at p < .05 with n_pairs ≥ 200 per (subject × instrument × grade-band) cell. Realistic timeline: spring 2028 onward, contingent on at least one DSA-bearing district licensing iReady or MAP. We will not claim Tier 2 evidence until the comparison-group design is run, the Registration is locked before data flows, and the results are published.
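The paired effect-size thresholds used in the matched-pair studies can be sketched as below. The input is one difference per matched pair (for example, the high-practice student's Δscore minus the matched low-practice student's Δscore). The standard-error formula is a common large-sample approximation for a paired d, and using a 95% lower bound above zero as a stand-in for the p < .05 clause is an assumption of this sketch; all function names are hypothetical.

```python
import math
import statistics

def paired_cohens_d(differences):
    """Cohen's d for paired data: mean difference over SD of differences."""
    return statistics.mean(differences) / statistics.stdev(differences)

def d_lower_bound(d, n_pairs, z_crit=1.96):
    """Approximate 95% lower bound for a paired d (large-sample formula)."""
    standard_error = math.sqrt(1.0 / n_pairs + d * d / (2.0 * n_pairs))
    return d - z_crit * standard_error

def paired_t_statistic(d, n_pairs):
    """The paired t statistic equals d times the square root of n_pairs."""
    return d * math.sqrt(n_pairs)

def clears_threshold(differences, min_d=0.20, min_pairs=200):
    """d >= 0.20, lower bound above zero, and at least 200 pairs in the cell."""
    n_pairs = len(differences)
    if n_pairs < min_pairs:
        return False
    d = paired_cohens_d(differences)
    return d >= min_d and d_lower_bound(d, n_pairs) > 0

# Made-up cell with 200 matched pairs.
example_cell = [0.0, 1.0] * 100
result = clears_threshold(example_cell)
```

The sample-size gate is checked first so that a cell with too few pairs can never clear, however large its observed effect.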

What we will not claim today. No causal effectiveness claims have been independently validated. We say “research-aligned,” not “validated.” We say “pre-calibrated heuristic difficulty,” not “empirically calibrated.” We do not publish numeric outcome claims (X% improvement, Y SD growth) because we have no outcome data yet. The pilot is structured to produce that data on a documented timeline.

Why the adaptive assessment is adaptive

Faster placements, fairer items, cleaner data — by design.

Per-domain adaptive testing places students faster and more accurately than fixed-form tests, especially at the tails of the distribution (Weiss, 2004; Thompson & Weiss, 2011). Math assessment routes four domains independently through graduated difficulty phases, ends early when measurement confidence is high enough (Babcock & Weiss, 2012), and balances substandard coverage in the process (Kingsbury & Zara, 1989).

Every item is engineered, not written. Three well-designed options perform about as well psychometrically as four or five (Rodriguez, 2005), so our four-option items (one correct answer plus three distractors) put the design effort into distractor quality rather than option count. Distractors are designed to target documented misconceptions rather than merely plausible wrong answers (Haladyna, Downing & Rodriguez, 2002; Gierl et al., 2017), and tagged against a closed taxonomy of 31 misconception families (680+ specific tags) so teacher-facing reports can name the precise error pattern, with a one-click assignable corrective lesson behind every family. Rapid-guess responses are filtered at grade-adjusted thresholds following Wise & Kong (2005). The design intent is to measure what a student knows rather than how fast they can click; the empirical evaluation of that intent is part of the pre-registered pilot work described above.

Why the practice loops work

Space it, scaffold it, let the struggle happen first.

Distributed practice consistently outperforms massed practice on delayed retention tests, with the optimal spacing gap scaling with how long the material needs to be retained (Ebbinghaus, 1885; Cepeda et al., 2006 meta-analysis of 254 studies). Our review queue derives SM-2 ease factors (Woźniak, 1990) — a widely deployed heuristic that approximates the empirical forgetting curve — from each student's score history, with expanding intervals scaled by recent performance.

Students who attempt problems before instruction can outperform those taught first — but only when structured consolidation follows the struggle (Kapur, 2016; Sinha & Kapur, 2021; Loibl, Roll & Rummel, 2017 on boundary conditions). Practice uses a try-first hint system: hints are always available but the bulb is dim until the first wrong answer highlights it; the second wrong answer steps through progressive hints and the final step reveals the full explanation — the consolidation step the meta-analytic evidence requires. Hints follow a multi-strategy structured format (labeled solution paths with aligned equation chains) so the scaffolding doubles as worked-example pedagogy (Sweller, 1988; Renkl, 2014). Feedback language is process-focused (e.g., “Nice strategy — breaking the problem into smaller steps paid off”) rather than trait-focused (“You’re smart”); the most rigorous recent evidence on growth-mindset interventions (Yeager et al., 2019, *Nature*; Sisk et al., 2018 meta-analysis) shows modest effects concentrated in lower-achieving students during transitions, and we treat process-focused feedback as a low-risk, modest-benefit design choice rather than a silver bullet. The collection layer — accessories, companion customization, pet adoption — is built on self-determination theory's autonomy/competence/relatedness triad (Deci & Ryan, 2000; Ryan, Rigby & Przybylski, 2006). Cosmetics are free for everyone regardless of payment status; nothing in the collection confers an academic advantage, and the practice is the substance the student keeps coming back for.

Why the reading model is what it is

Big Five skills. Passage-level testlets. Manipulatives before symbols.

Explicit instruction across the National Reading Panel's Big Five — phonemic awareness, phonics, fluency, vocabulary, comprehension — produces the strongest reading outcomes (NRP, 2000; Ehri, 2005; Scarborough, 2001). Our reading assessment tests 16 skill areas anchored in those five — main idea, detail, inference, vocabulary, theme, character, text structure, author’s purpose, figurative language, sequence (chronological order), evidence, argument, cause-and-effect, rhetoric, point of view, and counterclaim — and reports per-skill proficiency separately so teachers can target interventions rather than chase a single composite. The corpus also includes 100+ long-form library books carrying F&P, Lexile, DRA, and Grade-Equivalent leveling, so independent reading volume — the strongest correlate of vocabulary and comprehension growth (Mol & Bus, 2011; Allington, 2014) — is on the same instructional shelf as the assessment.

Comprehension is measured with passage-based testlets, not isolated items (Wainer, Bradlow & Wang, 2007). Passages are matched to Smarter Balanced word-count ranges by grade: G3 at 300-450 words, G6 at 650-850, with grade-graduated floors through high school (G9-10 at ~700-900, G11 at ~760-920, G12 at ~850-1080). Questions are distributed across Depth of Knowledge levels (recall, inference, analysis; Hess, 2008); the pool-wide mix measures 23/53/24 today, within three percentage points of the Webb-balanced 20/55/25 target at every level. Every question must pass a text-dependence test: if a student can answer it without the passage, the question is rewritten (Fisher & Frey, 2012).

On the math side, K-8 tiers walk the Concrete-Representational-Abstract path (Bruner, 1966; Witzel, Mercer & Miller, 2003): ten frames, base-10 blocks, fraction bars, tape diagrams before symbolic forms, with a 20+ interactive manipulative shelf — including number lines, balance scales, coordinate planes, geoboards, algebra tiles, pattern blocks, protractors, clocks, coins, an abacus, and a graphing calculator — for free exploration alongside guided practice.

The full citation map

All thirteen lineages, every citation.

The narrative above compresses 100+ peer-reviewed studies across thirteen research lineages into three beats. Below is the uncompressed version: every lineage, every citation, and the specific design choice each one shaped.

Adaptive assessment

Figure: Adaptive Placement Flow (Round 1, Round 2, Round 3). Outcome of each five-item round: 4/5 correct, escalate up; 2-3/5 correct, place here; 0-1/5 correct, test down.
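The routing rule in the placement flow above can be written as a small function. This is a minimal sketch: the function name and return labels are hypothetical, and treating a perfect 5/5 round the same as 4/5 is an assumption, since the flow only lists 4/5 for the upward branch.

```python
def placement_decision(correct, total=5):
    """Map one round's score to the next routing step.

    Assumption: 5/5 escalates just like 4/5 (the flow lists only 4/5).
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    if correct >= 4:
        return "escalate_up"
    if correct >= 2:
        return "place_here"
    return "test_down"

# One decision per possible round score, 0 through 5.
decisions = [placement_decision(score) for score in range(6)]
```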

The research finding

Per-domain adaptive testing places students faster and more accurately than fixed-form tests, especially at the tails of the distribution. Content blueprinting ensures balanced coverage across substandards.

How we apply it

Math assessment routes 4 domains independently using a per-domain Rasch-Elo ability estimate (theta) updated after each non-rapid-guess response, in grade-equivalent units. Item selection within each domain follows a 50/30/20 calibration-aware cycle: 50% target the student's current ability estimate, 30% explore the near-boundary (±0.5 GE — the most informative range for future calibration), 20% sample uniformly across tiers (preserves data quality for later IRT calibration). Substandard coverage is binding across the full domain (every substandard with available tiers gets sampled). SE-floor early stopping ends testing when measurement confidence is high enough (typical 9–13 items per domain). Each domain's reported placement is an expected-a-posteriori (EAP) ability estimate computed from the full response pattern (a standard Bayesian ability estimate, Bock & Mislevy 1982), reported with its posterior standard error; confidence-weighted compositing then means domains measured more precisely count more in the overall placement. The running ability estimate above drives item routing; the EAP is what gets reported. Per-skill proficiency uses Haberman (2008) Laplace smoothing to stabilize estimates at the small per-domain item counts a CAT produces. Rapid-guess filtering uses grade-adjusted response-time thresholds (see Test-taking behavior below). Anchor items link test forms across windows for future IRT calibration. Between formal windows, the same Rasch-Elo engine (smaller step size for practice items) updates an interim ability estimate after each practice response, seeded from the most recent CAT placement, so growth between Fall/Winter/Spring is visible without re-administering the formal test. 
Aligned to Embretson & Reise (2000) Rasch IRT, van der Linden & Glas (2010) adaptive item selection, Wainer (2000) CAT design, Babcock & Weiss (2012) stopping rules, Haberman (2008) Laplace-smoothed scoring, Kingsbury & Zara (1989) content balancing, and Brinkhuis & Maris (2009)-style rating updates.
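The split between the running estimate (used for routing) and the reported EAP can be sketched as follows. This is an illustration under stated assumptions, not the production engine: it works on a logit scale rather than grade-equivalent units, the step size k = 0.4 and the normal prior are hypothetical, the EAP uses a simple grid in place of formal quadrature, and inverse-variance weighting is one plausible reading of "confidence-weighted compositing."

```python
import math

def rasch_probability(theta, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def elo_update(theta, difficulty, correct, k=0.4):
    """Running estimate: move theta toward the observed outcome."""
    observed = 1.0 if correct else 0.0
    return theta + k * (observed - rasch_probability(theta, difficulty))

def eap_estimate(responses, prior_mean=0.0, prior_sd=1.0,
                 grid_points=81, span=4.0):
    """Posterior mean and SD of theta from the full response pattern.

    responses: list of (item_difficulty, correct) pairs.
    """
    step = 2.0 * span / (grid_points - 1)
    grid = [prior_mean - span + step * i for i in range(grid_points)]
    weights = []
    for theta in grid:
        weight = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        for difficulty, correct in responses:
            p = rasch_probability(theta, difficulty)
            weight *= p if correct else 1.0 - p
        weights.append(weight)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    variance = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(variance)

def precision_weighted_composite(domain_estimates):
    """Combine (estimate, standard_error) pairs by inverse variance."""
    weights = [1.0 / se ** 2 for _, se in domain_estimates]
    weighted_sum = sum(w * est for w, (est, _) in zip(weights, domain_estimates))
    return weighted_sum / sum(weights)

# A domain measured with SE 0.5 counts four times as much as one with SE 1.0.
overall = precision_weighted_composite([(1.0, 0.5), (3.0, 1.0)])
```

Keeping the two estimators separate means the cheap per-response update can steer item selection during the test, while the reported number uses every response at once and comes with its own posterior standard error.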

Citations

Babcock, B. & Weiss, D. J. (2012). Termination criteria in computerized adaptive tests. Journal of Computerized Adaptive Testing.
Embretson, S. E. & Reise, S. P. (2000). Item Response Theory for Psychologists. Lawrence Erlbaum Associates. (Rasch IRT foundational reference for the Grow placement instrument scaling.)
Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33(2), 204–229.
Kingsbury, G. G. & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education.
Thompson, N. A. & Weiss, D. J. (2011). A framework for the development of computerized adaptive tests. Practical Assessment, Research & Evaluation.
van der Linden, W. J. & Glas, C. A. W. (Eds.). (2010). Elements of Adaptive Testing. Springer. (Adaptive item-selection methodology — boundary-information principle for the 30% near-boundary cycle.)
Wainer, H. (Ed.). (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Lawrence Erlbaum Associates. (Operational CAT design — content balancing, exposure control, item-pool requirements.)
Weiss, D. J. (2004). Computerized adaptive testing for effective and efficient measurement. Measurement and Evaluation in Counseling and Development.
Brinkhuis, M. J. S. & Maris, G. (2009). Dynamic parameter estimation in student monitoring systems. Cito / Univ. Amsterdam.

Science of Reading

The research finding

Explicit instruction across the National Reading Panel's Big Five (phonemic awareness, phonics, fluency, vocabulary, comprehension) produces the strongest reading outcomes. Independent reading volume is the strongest correlate of vocabulary and comprehension growth across the elementary years.

How we apply it

Reading assessment tests 16 skill areas spanning comprehension, vocabulary, and analysis (anchored in the National Reading Panel's Big Five) — main idea, detail, inference, vocabulary, theme, character, text structure, author's purpose, figurative language, sequence (chronological order), evidence, argument, cause-and-effect, rhetoric, point of view (CCSS RL.3.6 / RI.6.6), and counterclaim (CCSS W.7.1.B). Per-skill proficiency is reported separately, so teachers can target interventions rather than relying on a single composite score. Assessment passages are Lexile-banded; the long-form library corpus (100+ books) carries F&P A-Z, Lexile, DRA, and Grade-Equivalent leveling alongside built-in vocabulary scaffolding for nonfiction reading.

Citations

National Reading Panel. (2000). Teaching Children to Read. National Institute of Child Health and Human Development.
Ehri, L. C. (2005). Learning to read words: Theory, findings, and issues. Scientific Studies of Reading.
Scarborough, H. S. (2001). Connecting early language and literacy to later reading disabilities. Handbook of Early Literacy Research.
Mol, S. E. & Bus, A. G. (2011). To read or not to read: A meta-analysis of print exposure from infancy to early adulthood. Psychological Bulletin, 137(2), 267–296.
Allington, R. L. (2014). How reading volume affects both reading fluency and reading achievement. International Electronic Journal of Elementary Education, 7(1), 13–26.

Concrete–Representational–Abstract (CRA)

Figure: CRA Progression. Concrete (ten frames) → Representational (fraction bars, e.g. 3/4) → Abstract (symbolic, e.g. 3/4 + 1/4).

The research finding

Math concepts are learned most durably when students move from physical/visual models to representations to symbolic abstraction.

How we apply it

K-8 tiers use concrete manipulatives (ten frames, base-10 blocks, fraction bars, coins, tape diagrams) before introducing symbolic forms. A dedicated Tools shelf gives students 20+ interactive manipulatives — number lines, balance scales, coordinate planes, geoboards, algebra tiles, pattern blocks, protractors, clocks, an abacus, and a graphing calculator among them — available for free exploration alongside guided practice. Tier labels indicate the CRA stage so teachers can sequence instruction.

Citations

Bruner, J. S. (1966). Toward a Theory of Instruction. Harvard University Press.
Witzel, B. S., Mercer, C. D., & Miller, M. D. (2003). Teaching algebra to students with learning difficulties: An investigation of an explicit instruction model. Learning Disabilities Research & Practice.
Bouck, E. C., Satsangi, R., & Park, J. (2018). The concrete–representational–abstract approach for students with learning disabilities: An evidence-based practice synthesis. Remedial and Special Education, 39(4), 211–228. (CRA meta-analysis, g ≈ 0.68 for K–8 students with math difficulties.)

Spaced repetition

The research finding

Distributed practice consistently outperforms massed practice on delayed retention tests across 254 studies and 14,000+ participants, with the optimal spacing gap scaling with how long the material needs to be retained (Cepeda et al., 2006). The size of the effect depends on the gap-to-retention-interval ratio.

How we apply it

Our review queue derives SM-2 ease factors from each student's score history — SM-2 is a widely deployed heuristic that approximates the empirical forgetting curve. Intervals start at 1 day, expand to 6 days, then scale by ease factor — capped at 60 days. A 15% trajectory bonus extends intervals when recent scores improve.
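The schedule described above can be sketched with the published SM-2 ease-factor update. The update formula and the 1.3 floor are standard SM-2; how a student's score maps to the 0-5 quality grade, how "recent scores improve" is detected, and whether the 15% bonus applies to the fixed 1-day and 6-day steps are not stated above, so this sketch assumes the bonus applies only to ease-scaled intervals. Function names are hypothetical.

```python
def update_ease(ease, quality):
    """Standard SM-2 ease-factor update for a quality grade from 0 to 5."""
    gap = 5 - quality
    ease += 0.1 - gap * (0.08 + gap * 0.02)
    return max(1.3, ease)  # SM-2 never lets the ease factor drop below 1.3

def next_interval(repetition, previous_interval, ease,
                  improving=False, cap_days=60):
    """Days until the next review.

    repetition: 1 for the first successful review, 2 for the second, ...
    improving: True when recent scores are trending up (assumed flag).
    """
    if repetition == 1:
        return 1
    if repetition == 2:
        return 6
    interval = previous_interval * ease
    if improving:
        interval *= 1.15  # the 15% trajectory bonus
    return min(cap_days, round(interval))

# Walk one item through four successful reviews at a constant ease of 2.5.
schedule = []
interval = 0
for repetition in range(1, 5):
    interval = next_interval(repetition, interval, 2.5)
    schedule.append(interval)
```

With a constant ease factor of 2.5 and no bonus, the intervals run 1, 6, 15, 38 days; the next step would exceed the cap and be held at 60.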

Citations

Ebbinghaus, H. (1885). Memory: A Contribution to Experimental Psychology.
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin.
Woźniak, P. A. (1990). Optimization of learning. SuperMemo.

Interleaved practice

The research finding

Practicing two or more confusable skills in an intermixed order produces stronger transfer than practicing them in blocks. The 60-study meta-analysis of interleaved learning (Brunmair & Richter, 2019) reports an average effect of g ≈ 0.42, with the gain concentrated where the practiced skills share enough surface features that students would otherwise apply the wrong approach. Interleaving trains the discrimination — “which strategy applies here?” — that real-world transfer requires.

How we apply it

Math practice sessions interleave deliberately, not by accident. A fraction of each session's questions are drawn from the immediately prior tier of the same skill, so students practice telling apart the new skill from the closest sibling skill they already know. A second mode draws from sibling skills authored as adjacent in the curriculum — area vs. perimeter, mean vs. median, slope vs. y-intercept — when the platform's adjacency map identifies a sibling-skill pair students typically confuse. The mix of within-skill and adjacent-skill insertions trains the “which approach applies here?” decision that blocked practice does not exercise. The size of the interleaving fraction is itself a randomized arm in the platform's pre-registered analysis at OSF, so the dose-response relationship between interleaving and learning growth is measured rather than assumed.
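A minimal sketch of the two interleaving modes follows. The session length and both fractions are hypothetical placeholders (the text above notes that the interleaving fraction is itself an experimental arm), and the function and parameter names are not taken from the platform.

```python
import random

def build_session(current_pool, prior_tier_pool, sibling_pool,
                  n_questions=10, prior_fraction=0.2,
                  sibling_fraction=0.1, seed=None):
    """Mix current-tier questions with prior-tier and sibling-skill ones.

    Fractions are illustrative assumptions. Raises ValueError if the
    current pool is too small to fill the remaining slots.
    """
    rng = random.Random(seed)
    n_prior = min(len(prior_tier_pool), round(n_questions * prior_fraction))
    n_sibling = min(len(sibling_pool), round(n_questions * sibling_fraction))
    n_current = n_questions - n_prior - n_sibling
    session = (rng.sample(current_pool, n_current)
               + rng.sample(prior_tier_pool, n_prior)
               + rng.sample(sibling_pool, n_sibling))
    rng.shuffle(session)  # interleave rather than append in blocks
    return session

# Made-up question identifiers.
current = [f"area-t3-{i}" for i in range(20)]
prior = [f"area-t2-{i}" for i in range(5)]
sibling = [f"perimeter-{i}" for i in range(5)]
session = build_session(current, prior, sibling, seed=1)
```

The final shuffle is the point of the exercise: inserted questions appear at unpredictable positions, so the student has to decide which approach applies on each item.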

Citations

Rohrer, D., & Taylor, K. (2007). The shuffling of mathematics problems improves learning. Instructional Science, 35(6), 481–498.
Birnbaum, M. S., Kornell, N., Bjork, E. L., & Bjork, R. A. (2013). Why interleaving enhances inductive learning: The roles of discrimination and retrieval. Memory & Cognition, 41, 392–402.
Brunmair, M., & Richter, T. (2019). Similarity matters: A meta-analysis of interleaved learning and its moderators. Psychological Bulletin (g ≈ 0.42 across 60 studies).

Growth mindset & try-first learning

Feedback Language

✓ Process-focused: “Nice strategy — breaking the problem into smaller steps paid off.”

✗ Trait-focused: “You’re so smart!”

Try-first hints: First wrong answer lights up the hint bulb; the student chooses when to use it.

The research finding

Process-focused feedback is associated with modest improvements in persistence and academic outcomes, particularly for lower-achieving students during transitions. The current best-evidence picture is more conservative than the early framing — meta-analytic effects are small (r ≈ 0.10) and the largest preregistered RCT (Yeager et al., 2019, Nature) found ~0.10 SD improvements concentrated in specific subgroups. We treat process-focused feedback as a low-risk, modest-benefit design choice rather than a silver bullet.

How we apply it

Feedback language is process-focused (e.g., highlighting the strategy a student used) rather than trait-focused ('You're smart'). Hints are always available but the bulb is dim until the first wrong answer highlights it — students choose when to use hints, building self-regulation, and the multi-step hint format itself doubles as worked-example pedagogy. Spaced review resurfaces past mistakes for retrieval practice.

Citations

Dweck, C. S. (2006). Mindset: The New Psychology of Success. Random House.
Yeager, D. S. et al. (2019). A national experiment reveals where a growth mindset improves achievement. Nature, 573(7774), 364–369.
Sisk, V. F., Burgoyne, A. P., Sun, J., Butler, J. L., & Macnamara, B. N. (2018). To what extent and under which circumstances are growth mind-sets important to academic achievement? Two meta-analyses. Psychological Science, 29(4), 549–571.
Burnette, J. L., O’Boyle, E. H., VanEpps, E. M., Pollack, J. M., & Finkel, E. J. (2013). Mind-sets matter: A meta-analytic review of implicit theories and self-regulation. Psychological Bulletin.

Passage-based reading assessment

The research finding

Reading comprehension is best measured with passage-based testlets that adapt at the passage level, not individual items. Passages must meet research-based length and complexity standards, with text-dependent questions spanning multiple Depth of Knowledge levels.

How we apply it

Assessment passages are matched to Smarter Balanced word-count ranges by grade (G3: 300-450 words, G6: 650-850; high-school floors graduate by grade: G9-10 ~700-900, G11 ~760-920, G12 ~850-1080). Each passage has 12-16 authored questions; the system selects 6-8 per sitting stratified by skill type, giving two equivalent test forms per passage and doubling the effective item pool. Passages are tagged by within-grade difficulty (below/at/above) for adaptive routing: easier passages in early phases, harder in escalation phases. Questions are tagged by DOK level. The pool-wide mix measures approximately 23/53/24 (recall/inference/analysis) across a corpus of nearly 6,000 authored questions, with a developmentally graduated mix per grade band: K-2 emphasizes foundational comprehension (DOK 3 below 5%), G3-5 approximates the Webb-balanced ratio (~21/59/20 measured), G6-8 shifts to analysis (~14/52/34 measured), and G9-12 weights toward DOK 3 analysis (~15/42/42 measured), reflecting college-readiness expectations. Anchor items link test forms across testing windows for longitudinal equating. Every question must pass a text-dependence test: if a student can answer without the passage, the question is rewritten.
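One way to get two skill-balanced forms from a single passage is to sort the questions by skill and deal them out alternately. This is a sketch of that idea only; the actual selection logic is not described above, and the question identifiers and skill labels are made up.

```python
from collections import Counter

def split_into_forms(questions):
    """Deal a passage's questions into two forms, stratified by skill.

    questions: list of (question_id, skill) pairs.
    Sorting by skill and alternating keeps each skill's count within
    one question of equal across the two forms.
    """
    ordered = sorted(questions, key=lambda question: question[1])
    form_a = [q for position, q in enumerate(ordered) if position % 2 == 0]
    form_b = [q for position, q in enumerate(ordered) if position % 2 == 1]
    return form_a, form_b

# A made-up passage with 12 questions, three per skill.
skills = ["detail", "inference", "main idea", "vocabulary"]
passage_questions = [(f"q{n}", skills[n % 4]) for n in range(12)]
form_a, form_b = split_into_forms(passage_questions)
skills_a = Counter(skill for _, skill in form_a)
skills_b = Counter(skill for _, skill in form_b)
```

A 12-question passage yields two disjoint forms of six questions each, which matches the 6-8 per sitting range described above.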

Citations

Smarter Balanced Assessment Consortium. (2024). ELA/Literacy Stimulus Specifications.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet Response Theory and Its Applications. Cambridge University Press.
Hess, K. K. (2008). Depth of Knowledge Framework for Reading. National Center for Assessment.
Fisher, D. & Frey, N. (2012). Text-Dependent Questions. ASCD.

Item design & diagnostic distractors

The research finding

Multiple-choice items with diagnostic distractors — wrong answers that target specific misconceptions — yield richer information than items with merely plausible options. Rodriguez (2005) shows three well-designed options are psychometrically equivalent to four or five; our items use four (one correct + three distractors), the conventional format, with the design discipline focused on distractor quality rather than option count.

How we apply it

Every wrong answer targets a documented error pattern: wrong-paragraph confusion, partial comprehension, background-knowledge substitution, or overgeneralization. Items follow Haladyna, Downing & Rodriguez (2002) validated taxonomy of 31 item-writing guidelines, including stem clarity, option homogeneity, and randomized correct-answer position. Distractor selections write to a per-student misconception ledger that aggregates across items and renders into a teacher-facing report; every misconception family in the taxonomy maps to an assignable corrective lesson, so the diagnostic-to-instruction loop closes inside one click rather than requiring teachers to translate error patterns into curriculum themselves. Practice and assessment maintain disjoint name and scenario pools so students do not recognize assessment items from classwork — protecting measurement validity without halving the generator surface.
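The per-student ledger can be pictured as a nested counter keyed by student and misconception family. This is a minimal sketch: the input layout, the function names, and the family labels (borrowed from the error patterns listed above) are illustrative assumptions rather than the platform's schema.

```python
from collections import Counter, defaultdict

def build_ledger(responses):
    """Aggregate selected distractors into per-student family counts.

    responses: iterable of (student_id, family) pairs, where family is
    None for a correct answer (hypothetical layout).
    """
    ledger = defaultdict(Counter)
    for student_id, family in responses:
        if family is not None:
            ledger[student_id][family] += 1
    return ledger

def top_families(ledger, student_id, n=3):
    """Most frequent misconception families for one student."""
    return ledger[student_id].most_common(n)

# Made-up response log.
log = [
    ("s1", "overgeneralization"),
    ("s1", None),
    ("s1", "overgeneralization"),
    ("s1", "partial comprehension"),
    ("s2", "wrong-paragraph confusion"),
]
ledger = build_ledger(log)
report = top_families(ledger, "s1")
```

Because every family maps to a corrective lesson, the top entry in a student's report is also the natural candidate for the one-click assignment.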

Citations

Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines. Applied Measurement in Education.
Gierl, M. J., Bulut, O., Guo, Q., & Zhang, X. (2017). Developing, analyzing, and using distractors for multiple-choice tests. Review of Educational Research.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items. Educational Measurement: Issues and Practice.

Formative feedback & productive failure

The research finding

Students who attempt problems before receiving instruction can outperform those taught first — but only when structured consolidation follows the struggle. The boundary condition is real: post-attempt scaffolding has to land or the effect can null or reverse (Loibl, Roll & Rummel, 2017). Assessment must remain unscaffolded to preserve measurement validity.

How we apply it

Practice uses a try-first hint system: first wrong answer triggers a retry (no penalty), second wrong unlocks progressive hints, with the full explanation on the final step — the consolidation step the meta-analytic evidence requires. The same misconception data feeds structured lessons that teachers can assign one-click from the dashboard, so the consolidation can extend beyond the single item into targeted instruction. Assessment passages provide no hints, no vocabulary definitions, and no corrective feedback — measuring what students know independently.
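The try-first sequence can be sketched as a function of how many wrong answers a student has given on the current item. The assumption that each further wrong answer advances one hint is illustrative (students may also step through hints themselves), and the function name and state labels are hypothetical.

```python
def practice_state(wrong_answers, hints):
    """Return (state, text) for the current item.

    wrong_answers: number of wrong answers so far on this item.
    hints: ordered list of progressive hints (the last step after
    these is the full explanation).
    """
    if wrong_answers == 0:
        return ("attempt", None)  # try first, no scaffolding yet
    if wrong_answers == 1:
        return ("retry", None)  # no penalty, hints not yet stepped
    hint_index = wrong_answers - 2
    if hint_index < len(hints):
        return ("hint", hints[hint_index])
    return ("explanation", None)  # consolidation step

# Made-up hints for one item.
item_hints = ["Draw the fraction bars.", "Find a common denominator."]
states = [practice_state(n, item_hints)[0] for n in range(5)]
```

For an item with two hints the sequence is attempt, retry, hint, hint, explanation, so the full explanation always arrives once the hints run out.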

Citations

Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153–189.
Kapur, M. (2016). Examining productive failure, productive success, unproductive failure, and unproductive success in learning. Educational Psychologist, 51(2), 289–299.
Sinha, T., & Kapur, M. (2021). When problem solving followed by instruction works. Review of Educational Research, 91(5), 761–798.
Loibl, K., Roll, I., & Rummel, N. (2017). Towards a theory of when and how problem solving followed by instruction supports learning. Educational Psychology Review, 29(4), 693–715.
Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning. Psychological Science, 17(3), 249–255.

Test-taking behavior & rush detection

The research finding

Responses submitted too quickly to reflect real engagement with the item (rapid guesses) form a distinct response-time pattern. Flagging them lets teachers re-administer the affected items and improves score validity. Reading items require longer thresholds than math items because of passage processing time.

How we apply it

Math items use grade-adjusted thresholds (K-2: 4 s, 3-5: 3.5 s, 6-8: 3 s, 9-12: 2.5 s). Reading items use a passage-aware formula: a base threshold plus the passage word count divided by a grade-appropriate reading speed. When at least 30% of a session's responses are flagged, the session is invalidated; when more than 40% of the responses on a single skill are flagged, that skill is flagged in the per-skill report on the teacher dashboard.
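The thresholds and cut-offs above can be sketched as follows. The function names are hypothetical, and the reading base threshold and reading speed are left as caller-supplied parameters because the text does not state their values. Treating a response as rapid when its time is strictly below the threshold is also an assumption.

```python
# (highest grade in band, threshold in seconds); kindergarten is grade 0.
MATH_THRESHOLDS_S = [(2, 4.0), (5, 3.5), (8, 3.0), (12, 2.5)]


def math_threshold(grade: int) -> float:
    """Rapid-response threshold for a math item at the given grade."""
    for max_grade, seconds in MATH_THRESHOLDS_S:
        if grade <= max_grade:
            return seconds
    raise ValueError(f"unsupported grade: {grade}")


def reading_threshold(base_s: float, passage_words: int,
                      words_per_second: float) -> float:
    """Base threshold plus the time needed to read the passage."""
    return base_s + passage_words / words_per_second


def is_rapid(response_time_s: float, threshold_s: float) -> bool:
    """Assumed rule: a response faster than the threshold is flagged."""
    return response_time_s < threshold_s


def rapid_fraction(flags: list[bool]) -> float:
    """Share of flagged responses; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def session_invalid(flags: list[bool]) -> bool:
    """Session is invalidated at 30% or more flagged responses."""
    return rapid_fraction(flags) >= 0.30


def skill_flagged(flags: list[bool]) -> bool:
    """A skill is flagged above 40% flagged responses."""
    return rapid_fraction(flags) > 0.40
```

For example, a 200-word passage read at an assumed 2.5 words per second adds 80 seconds to whatever base threshold is configured.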

Citations

Wise, S. L. & Kong, X. (2005). Response time effort: A new measure of examinee motivation. Applied Measurement in Education, 18(2), 163–183.

Practice motivation — autonomy, competence, relatedness

The research finding

Self-determination theory predicts that practice habits persist when the activity supports a student's autonomy (real choices), competence (visible progress), and relatedness (a companion or community to do it with). Foregrounded extrinsic incentives can crowd out intrinsic interest in the activity itself, the overjustification effect (Lepper, Greene & Nisbett, 1973; Deci & Ryan, 2000).

How we apply it

The collection layer (accessories, companion customization, pet adoption) sits alongside practice rather than acting as the engine of it. Cosmetics are free for everyone — nothing in the collection confers an academic advantage. Reward presentation is deliberately quiet: no scaled celebrations by rarity, no announcement banners, no pay-to-win mechanics. The practice itself is what the student keeps coming back for; the collection is what the work looks like to a child.

Citations

Deci, E. L. & Ryan, R. M. (2000). The 'what' and 'why' of goal pursuits: Human needs and the self-determination of behavior. Psychological Inquiry.
Lepper, M. R., Greene, D., & Nisbett, R. E. (1973). Undermining children’s intrinsic interest with extrinsic reward: A test of the "overjustification" hypothesis. Journal of Personality and Social Psychology, 28(1), 129–137.
Ryan, R. M., Rigby, C. S., & Przybylski, A. (2006). The motivational pull of video games: A self-determination theory approach. Motivation and Emotion.

Diagnostic-to-instruction loop

The research finding

Formative assessment data improves student outcomes only when it travels the full distance from item-level error pattern to teacher-actionable instruction. Reports that stop at scores leave the instructional decision unsupported; reports that name a specific misconception and pair it with a corrective lesson move teacher behavior.

How we apply it

Every distractor a student selects writes a misconception tag (from the 31-family / 680-tag taxonomy) into a per-student ledger. Tags aggregate across items and surface in the teacher dashboard as an error-pattern report — not just 'struggling with fractions' but 'systematically over-applying the cross-multiply template to addition.' Each misconception family maps to an assignable corrective lesson; the teacher closes the loop in a single click rather than translating an error pattern into curriculum themselves. Distractor selections also feed a stealth-assessment Elo update so the next practice item routes against an updated ability estimate, keeping diagnostic and instructional signals coupled within the same session.
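A minimal sketch of the two writes described above: a per-student ledger of misconception tags and an Elo-style ability update. The class and function names, the example tag strings, the logistic scale, and the step size k = 0.4 are assumptions for illustration; the text does not specify the update formula.

```python
import math
from collections import Counter, defaultdict


class MisconceptionLedger:
    """Per-student counts of misconception tags (hypothetical structure)."""

    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def record(self, student_id: str, tag: str) -> None:
        """Log the tag attached to the distractor the student selected."""
        self._counts[student_id][tag] += 1

    def top(self, student_id: str, n: int = 3) -> list[tuple[str, int]]:
        """Most frequent tags for one student, for an error-pattern report."""
        return self._counts[student_id].most_common(n)


def expected_correct(ability: float, difficulty: float) -> float:
    """Logistic probability of a correct answer (assumed scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def elo_update(ability: float, difficulty: float, correct: bool,
               k: float = 0.4) -> float:
    """Move ability toward the observed result by k times the surprise."""
    score = 1.0 if correct else 0.0
    return ability + k * (score - expected_correct(ability, difficulty))
```

Usage: after a wrong answer, call record() with the selected distractor's tag and elo_update() with correct=False; the updated ability then informs the choice of the next practice item.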

Citations

Black, P. & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74.
Hattie, J. & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.
Bennett, R. E. (2011). Formative assessment: A critical review. Assessment in Education: Principles, Policy & Practice, 18(1), 5–25.

Universal-design accommodations

The research finding

Universal Design for Learning (UDL) practice provides the same accommodations to all students at assignment time, not just to students with formal IEP/504 plans: reducing linguistic and sensory load helps the whole class while remaining indispensable for English learners and students with disabilities.

How we apply it

The assignment-time accommodation panel surfaces five toggles per student or per cohort: text-to-speech read-aloud, simplified-language stem variants (where the construct is preserved but linguistic load drops), MLL glossary translation for content-area vocabulary, optional break reminders on a teacher-set interval, and stealth-theta opt-out for students whose practice patterns shouldn't drive ability re-estimation. Grow itself is untimed, matching i-Ready and MAP Growth, so there is no extended-time accommodation: there is no time pressure to extend. Rapid-guess detection (Wise & Kong, 2005) flags effort regardless of correctness; correct-conditioned suppression of rapid-guess flags is rejected on selection-bias grounds (Wise, 2017).

Citations

CAST. (2018). Universal Design for Learning Guidelines version 2.2.
Wise, S. L. & Kong, X. (2005). Response time effort: A new measure of examinee motivation. Applied Measurement in Education, 18(2), 163–183.
Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61. (Synthesis paper covering threshold-method families and the selection-bias argument against correct-conditioned suppression.)
Abedi, J. (2010). Linguistic factors in the assessment of English language learners. In G. J. Cizek (Ed.), Handbook of Educational Policy.

Full evidence document. The complete 100+ citation Evidence-Based Learning document — including the full ESSA Tier 4 classification, pre-registered pilot validity surfaces, and modern meta-analytic effect-size estimates — is available to schools and districts on request. Email partnerships@growingstandard.com.

See the research in the product.

As pilots complete, case studies will document how these design choices land with real teachers and students. The pilot program puts your school or district on the calibration-data side of the roadmap.
