Thirteen research lineages, one product.
Every major feature in Growing Standard traces to published research spanning assessment design, literacy, mathematics pedagogy, motivation, and cognitive science. Empirical calibration and external validation are in progress pending pilot data — what follows is the research informing our design, not evidence about product outcomes.
Tier 4 today. Tier 3 by design, after the pilot.
ESSA Section 8101(21) defines the four evidence tiers districts use to evaluate education products: Strong (Tier 1, RCT), Moderate (Tier 2, quasi-experimental), Promising (Tier 3, correlational with controls), and Demonstrates a Rationale (Tier 4, logic model + ongoing evaluation). Here's where Grow sits today, and where it's heading.
Why Tier 4 today. Every feature is mapped to peer-reviewed research (the citation map below), and the pilot evaluation plan pre-commits four validity surfaces: Mantel-Haenszel DIF, conditional SEM by theta band, classification accuracy at Tier-2/Tier-3 cuts, and a validity-evidence framework (OSF-deposited at osf.io/akzec, formal Registration locked 30-60 days before the first contracted pilot data flows). Together those meet the two operational requirements of Tier 4: a well-defined logic model based on high-quality research, plus an ongoing effort to study the effects of the intervention.
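To make the first validity surface concrete, here is a minimal sketch of the Mantel-Haenszel common odds ratio and its ETS delta-scale transform. The function names and the layout of each stratum table are illustrative assumptions, not the pilot's actual analysis code, and a full DIF analysis would also include a significance test and classification rules.

```python
import math
from typing import Sequence, Tuple

# One 2x2 table per matched-ability stratum (assumed layout):
# (reference_correct, reference_incorrect, focal_correct, focal_incorrect)
Stratum = Tuple[int, int, int, int]


def mh_odds_ratio(strata: Sequence[Stratum]) -> float:
    """Mantel-Haenszel common odds ratio pooled across ability strata."""
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    # Raises ZeroDivisionError if no stratum has both a reference-wrong
    # and a focal-right response; real data should be checked first.
    return numerator / denominator


def mh_d_dif(strata: Sequence[Stratum]) -> float:
    """ETS delta-scale DIF statistic: -2.35 * ln(alpha_MH).

    Negative values mean the item favors the reference group.
    """
    return -2.35 * math.log(mh_odds_ratio(strata))
```

An odds ratio of 1 (delta of 0) means the item behaves the same for both groups once ability is matched.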
Pathway to Tier 3 — two pre-registration drafts, not one. The Tier 3 case rests on two independent pre-registration analysis plans, each deposited as a draft on OSF. Formal Registrations (the locked, time-stamped, DOI-minted snapshots) are locked 30-60 days before first contracted pilot data flows, after academic methodologist review. Either one clearing its thresholds supports a Tier 3 claim; both clearing strengthens the package substantially.
Pathway to Tier 2 — two complementary study designs. Tier 2 (Moderate) requires a quasi-experimental design. Two distinct study designs are in flight, each addressing a different piece of the Tier 2 question.
(a) Within-platform intervention efficacy. When the platform's diagnostic loop assigns students a focused-lesson remediation for a tagged misconception, do those students show a greater reduction in same-tag misconception rate on the next class test than within-classroom matched-comparison students with the same baseline misconception who didn't get the assignment? This study runs entirely within the platform (no external-benchmark data-sharing agreement, or DSA, required), produces evidence on a narrower construct (within-platform misconception resolution, not external-benchmark growth), and has the fastest realistic Tier 2 timeline of any study in the program. That timeline is contingent on early pilot windows accumulating N ≥ 200 paired observations per misconception family × grade-band cell, which in turn depends on a partner district contracting and on methodologist signoff before lock.
(b) External-benchmark transfer. Among students at participating sites, do students with high practice volume show a larger one-year iReady / MAP / SOL Δscore than within-school matched low-practice students at the same grade × baseline-quintile? This study answers the question every superintendent asks ("will my students' state-test scores go up?"), but the construct is broader and the timeline is slower: it is gated on at least one DSA-bearing district licensing iReady or MAP and on academic methodologist signoff before lock.
Study 1 — Concurrent validity (OSF Project, formal Registration locked 30-60 days before first contracted pilot data flows). Pearson correlations between Grow placement and external benchmarks (Virginia SOL scale scores, NWEA MAP Growth, iReady) plus polychoric correlations against teacher 5-point judgment ratings, controlling for grade and prior achievement. Pre-committed thresholds: r ≥ 0.50 with 95% lower bound ≥ 0.30 against state tests; r ≥ 0.60 against same-construct CATs. Contingent on at least one Virginia public-school data-sharing agreement (DSA) being executed in time for the spring 2028 reporting window. DSA-free signals (in-product ORF probe and teacher-judgment ratings) are collected universally and reported from the first pilot window onward as exploratory evidence even without an external benchmark in place.
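The state-test threshold in Study 1 (r ≥ 0.50 with a 95% lower bound ≥ 0.30) can be checked with a Fisher z-transform confidence interval. The sketch below is illustrative only: the function names are hypothetical, it uses a plain zero-order Pearson correlation, and it does not implement the adjustment for grade and prior achievement that the registered analysis calls for.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Zero-order Pearson correlation between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, level: float = 0.95) -> Tuple[float, float]:
    """Confidence interval for r via the Fisher z-transform (requires n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def clears_state_test_threshold(r: float, n: int,
                                r_min: float = 0.50,
                                lower_min: float = 0.30) -> bool:
    """True when both the point estimate and the lower bound clear the cut."""
    lower, _ = fisher_ci(r, n)
    return r >= r_min and lower >= lower_min
```

The same r can pass or fail depending on sample size: an observed r of 0.55 clears the lower-bound requirement at n = 100 but not at n = 20, which is why the threshold is stated as a pair.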
Study 2 — Practice dosage and learning growth (OSF Project, formal Registration target: September 2026). Asks the mechanism question districts care about: does in-game engagement actually predict learning growth, and which engagement vectors carry the signal? Six telemetry vectors are tested confirmatorily and separately — practice minutes, library reading minutes, math game minutes, lesson minutes, class-test minutes, and grow-session minutes — against window-over-window Grow placement growth and (DSA permitting) spring external-benchmark outcomes. Pre-committed thresholds: standardized β ≥ 0.10 on within-system growth, β ≥ 0.08 on external benchmarks. A construct-correctness check (grow-session minutes should NOT predict growth) protects against test-taking experience confounds. Library minutes are tested separately because independent reading volume is the strongest documented predictor of reading growth (Allington 2014; Mol & Bus 2011; Stanovich 1986).
Study 3 — Within-platform intervention efficacy (advisor-review draft, OSF deposit pending methodologist input). The within-platform Tier 2 study above, in detail. When a class test surfaces a tagged misconception (per the misconception taxonomy described in the assessment section), teachers can assign a focused-lesson remediation tied to that exact tag. Subsequent class tests on the same construct produce a paired pre/post measurement of same-tag misconception rate. The confirmatory comparison is within-classroom matched pairs (treatment = assigned and attempted the focused lesson; comparison = same baseline tag + classroom + grade + subject + baseline-placement quartile, no assignment). Pre-committed threshold: paired Cohen's d ≥ 0.20 with 95% lower bound > 0 at N ≥ 200 paired observations per misconception family × grade-band cell, and at least three of the top six most-prevalent families clearing after Benjamini-Hochberg correction. Mandatory prerequisite: each family used must independently clear the response-pattern detection validity study below — Tier 2 claims here are conditional on the diagnostic instrument's reliability being established for that family first.
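As a rough illustration of how the Study 3 threshold could be evaluated, the sketch below computes a paired Cohen's d on matched-pair differences, a normal-approximation lower bound, and the Benjamini-Hochberg step-up rule across families. All names are hypothetical, and the standard-error formula is a common large-sample approximation, not necessarily the one the registered plan will specify.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List, Sequence


def paired_cohens_d(treatment: Sequence[float],
                    comparison: Sequence[float]) -> float:
    """Paired d: mean of the pair differences over their standard deviation."""
    diffs = [t - c for t, c in zip(treatment, comparison)]
    return mean(diffs) / stdev(diffs)


def d_lower_bound(d: float, n_pairs: int, level: float = 0.95) -> float:
    """Approximate lower confidence bound for a paired d (large-sample SE)."""
    se = math.sqrt(1.0 / n_pairs + d * d / (2.0 * n_pairs))
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return d - crit * se


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Step-up FDR control: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected
```

With one p-value per misconception family, the "three of the top six" rule reduces to counting the `True` entries among the six most prevalent families whose d and lower bound also clear their cuts.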
Study 4 — Response-pattern detection validity (advisor-review draft, OSF deposit pending methodologist input). The supporting-evidence layer that gates Study 3's family-level claims. Tests whether the platform's misconception-tagging pipeline reliably reproduces flags within session, distinguishes them across constructs, and corresponds to qualitatively distinct underlying reasoning when independently probed via researcher-led clinical interview (N = 40-60). Construct framing is response-pattern detection, not misconception-as-stable-trait, consistent with the knowledge-in-pieces tradition (diSessa 1988; Hammer 2000; Smith, diSessa & Roschelle 1993). Most cohort-flexible of the five studies: the first credible reporting window is gated on methodologist recruitment, OSF Registration lock, and the first pilot cohorts producing sufficient interview-eligible flag events.
Study 5 — External-benchmark transfer (advisor-review draft, OSF deposit pending methodologist input). The external-benchmark Tier 2 study above, in detail. Quasi-experimental matched-pair within-school design comparing high-practice-volume vs. low-practice-volume students on iReady / MAP / SOL one-year Δscore at the same school × grade × classroom × baseline-quintile. Pre-committed threshold: paired-t Cohen's d ≥ 0.20 at p < .05 with n_pairs ≥ 200 per (subject × instrument × grade-band) cell. Realistic timeline: spring 2028 onward, contingent on at least one DSA-bearing district licensing iReady or MAP. We will not claim Tier 2 evidence until the comparison-group design is run, the Registration is locked before data flows, and the results are published.
What we will not claim today. No causal effectiveness claims have been independently validated. We say “research-aligned,” not “validated.” We say “pre-calibrated heuristic difficulty,” not “empirically calibrated.” We do not publish numeric outcome claims (X% improvement, Y SD growth) because we have no outcome data yet. The pilot is structured to produce that data on a documented timeline.
Faster placements, fairer items, cleaner data — by design.
Per-domain adaptive testing places students faster and more accurately than fixed-form tests, especially at the tails of the distribution (Weiss, 2004; Thompson & Weiss, 2011). Math assessment routes four domains independently through graduated difficulty phases, ends early when measurement confidence is high enough (Babcock & Weiss, 2012), and balances coverage across sub-standards in the process (Kingsbury & Zara, 1989).
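The "ends early when measurement confidence is high enough" rule is a standard-error stopping criterion. A minimal sketch under a Rasch model follows; the SE target and the minimum and maximum item counts are placeholder assumptions, not the product's actual settings.

```python
import math
from typing import Sequence


def rasch_p(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def standard_error(theta: float, administered: Sequence[float]) -> float:
    """SE of the ability estimate: 1 / sqrt(total Fisher information).

    Under Rasch, each item contributes p * (1 - p) at the current theta.
    """
    info = sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in administered)
    return math.inf if info == 0 else 1.0 / math.sqrt(info)


def should_stop(theta: float, administered: Sequence[float],
                se_target: float = 0.30,   # assumed value, for illustration
                min_items: int = 5,        # assumed value
                max_items: int = 30) -> bool:  # assumed value
    """Stop when the SE target is met, subject to item-count guards."""
    n = len(administered)
    if n >= max_items:
        return True
    if n < min_items:
        return False
    return standard_error(theta, administered) <= se_target
```

Items matched to the student's ability carry the most information (0.25 each at p = 0.5), which is why a well-targeted adaptive test reaches a given SE with fewer items than a fixed form.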
Every item is engineered, not written. Four-option items with three carefully designed distractors are psychometrically equivalent to items with four or five distractors (Rodriguez, 2005). Distractors are designed to target documented misconceptions rather than merely plausible wrong answers (Haladyna, Downing & Rodriguez, 2002; Gierl et al., 2017), and tagged against a closed taxonomy of 31 misconception families (680+ specific tags) so teacher-facing reports can name the precise error pattern, with a one-click assignable corrective lesson behind every family. Rapid-guess responses are filtered at grade-adjusted thresholds following Wise & Kong (2005). The design intent is to measure what a student knows rather than how fast they can click; the empirical evaluation of that intent is part of the pre-registered pilot work described above.
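Rapid-guess filtering in the Wise & Kong tradition compares each response time to a threshold and summarizes the share of effortful responses as response time effort (RTE). The sketch below shows the idea; the grade bands and threshold values are invented for illustration and are not the product's calibrated thresholds.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical per-grade-band thresholds in seconds (illustrative only).
RAPID_GUESS_THRESHOLD_S: Dict[str, float] = {
    "K-2": 4.0,
    "3-5": 3.0,
    "6-8": 2.5,
    "9-12": 2.0,
}


def split_responses(response_times_s: Sequence[float],
                    grade_band: str) -> Tuple[List[int], List[int]]:
    """Return (kept, flagged) item indices; flagged means faster than threshold."""
    threshold = RAPID_GUESS_THRESHOLD_S[grade_band]
    kept = [i for i, rt in enumerate(response_times_s) if rt >= threshold]
    flagged = [i for i, rt in enumerate(response_times_s) if rt < threshold]
    return kept, flagged


def response_time_effort(response_times_s: Sequence[float],
                         grade_band: str) -> float:
    """RTE: proportion of responses showing solution behavior."""
    if not response_times_s:
        raise ValueError("RTE is undefined for an empty response list")
    kept, _ = split_responses(response_times_s, grade_band)
    return len(kept) / len(response_times_s)
```

Note that the filter keys on response time alone, never on whether the answer was correct, so a lucky rapid guess is excluded the same way as an unlucky one.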
Space it, scaffold it, let the struggle happen first.
Distributed practice consistently outperforms massed practice on delayed retention tests, with the optimal spacing gap scaling with how long the material needs to be retained (Ebbinghaus, 1885; Cepeda et al., 2006 meta-analysis of 254 studies). Our review queue derives SM-2 ease factors (Woźniak, 1990) — a widely deployed heuristic that approximates the empirical forgetting curve — from each student's score history, with expanding intervals scaled by recent performance.
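For readers unfamiliar with SM-2, here is a minimal sketch of the published update rule as it is commonly implemented. The `CardState` and `review` names are hypothetical, and how a student's score history is mapped onto the 0-5 quality grade is left open; the product's own mapping and interval scaling are not documented here.

```python
from dataclasses import dataclass

MIN_EASE = 1.3  # SM-2 floor on the ease factor


@dataclass(frozen=True)
class CardState:
    ease: float = 2.5        # SM-2 starting ease factor
    repetitions: int = 0     # consecutive successful reviews
    interval_days: int = 0   # days until the next review


def review(state: CardState, quality: int) -> CardState:
    """Apply one SM-2 review. `quality` is a 0-5 recall grade (3+ = success)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        # Expanding intervals: 1 day, 6 days, then previous interval * ease.
        if state.repetitions == 0:
            interval = 1
        elif state.repetitions == 1:
            interval = 6
        else:
            interval = round(state.interval_days * state.ease)
        repetitions = state.repetitions + 1
    else:
        # A failed recall restarts the schedule.
        repetitions, interval = 0, 1
    miss = 5 - quality
    ease = max(MIN_EASE, state.ease + 0.1 - miss * (0.08 + miss * 0.02))
    return CardState(ease=ease, repetitions=repetitions, interval_days=interval)
```

Three perfect reviews in a row schedule the item at 1, 6, and then 16 days, while a single lapse resets the interval to one day and lowers the ease factor so later intervals grow more slowly.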
Students who attempt problems before instruction can outperform those taught first — but only when structured consolidation follows the struggle (Kapur, 2016; Sinha & Kapur, 2021; Loibl, Roll & Rummel, 2017 on boundary conditions). Practice uses a try-first hint system: hints are always available but the bulb is dim until the first wrong answer highlights it; the second wrong answer steps through progressive hints and the final step reveals the full explanation — the consolidation step the meta-analytic evidence requires. Hints follow a multi-strategy structured format (labeled solution paths with aligned equation chains) so the scaffolding doubles as worked-example pedagogy (Sweller, 1988; Renkl, 2014). Feedback language is process-focused (e.g., “Nice strategy — breaking the problem into smaller steps paid off”) rather than trait-focused (“You’re smart”); the most rigorous recent evidence on growth-mindset interventions (Yeager et al., 2019, *Nature*; Sisk et al., 2018 meta-analysis) shows modest effects concentrated in lower-achieving students during transitions, and we treat process-focused feedback as a low-risk, modest-benefit design choice rather than a silver bullet. The collection layer — accessories, companion customization, pet adoption — is built on self-determination theory's autonomy/competence/relatedness triad (Deci & Ryan, 2000; Ryan, Rigby & Przybylski, 2006). Cosmetics are free for everyone regardless of payment status; nothing in the collection confers an academic advantage, and the practice is the substance the student keeps coming back for.
Big Five skills. Passage-level testlets. Manipulatives before symbols.
Explicit instruction across the National Reading Panel's Big Five — phonemic awareness, phonics, fluency, vocabulary, comprehension — produces the strongest reading outcomes (NRP, 2000; Ehri, 2005; Scarborough, 2001). Our reading assessment tests 16 skill areas anchored in those five — main idea, detail, inference, vocabulary, theme, character, text structure, author’s purpose, figurative language, sequence (chronological order), evidence, argument, cause-and-effect, rhetoric, point of view, and counterclaim — and reports per-skill proficiency separately so teachers can target interventions rather than chase a single composite. The corpus also includes 100+ long-form library books carrying F&P, Lexile, DRA, and Grade-Equivalent calibration, so independent reading volume — the strongest correlate of vocabulary and comprehension growth (Mol & Bus, 2011; Allington, 2014) — is on the same instructional shelf as the assessment.
Comprehension is measured with passage-based testlets, not isolated items (Wainer, Bradlow & Wang, 2007). Passages are calibrated to Smarter Balanced word-count ranges by grade: G3 at 300-450 words, G6 at 650-850, with grade-graduated floors through high school (G9-10 at ~700-900, G11 at ~760-920, G12 at ~850-1080). Questions are distributed across Depth of Knowledge levels (recall, inference, analysis; Hess, 2008); the pool-wide split measured today is 23/53/24, within three percentage points of the Webb-balanced 20/55/25 target at every level. Every question must pass the text-dependence test: if a student can answer it without the passage, the question is rewritten (Fisher & Frey, 2012).
On the math side, K-8 tiers walk the Concrete-Representational-Abstract path (Bruner, 1966; Witzel, Mercer & Miller, 2003): ten frames, base-10 blocks, fraction bars, and tape diagrams come before symbolic forms. A shelf of 20+ interactive manipulatives (including number lines, balance scales, coordinate planes, geoboards, algebra tiles, pattern blocks, protractors, clocks, coins, an abacus, and a graphing calculator) is available for free exploration alongside guided practice.
All thirteen lineages, every citation.
The narrative above compresses 100+ peer-reviewed studies across thirteen research lineages into three beats. Below is the uncompressed version: every lineage, every citation, and the specific design choice each one shaped.
Adaptive assessment
- Babcock, B. & Weiss, D. J. (2012). Termination criteria in computerized adaptive tests. Journal of Computerized Adaptive Testing.
- Embretson, S. E. & Reise, S. P. (2000). Item Response Theory for Psychologists. Lawrence Erlbaum Associates. (Rasch IRT foundational reference for the Grow placement instrument scaling.)
- Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33(2), 204–229.
- Kingsbury, G. G. & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education.
- Thompson, N. A. & Weiss, D. J. (2011). A framework for the development of computerized adaptive tests. Practical Assessment, Research & Evaluation.
- van der Linden, W. J. & Glas, C. A. W. (Eds.). (2010). Elements of Adaptive Testing. Springer. (Adaptive item-selection methodology — boundary-information principle for the 30% near-boundary cycle.)
- Wainer, H. (Ed.). (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Lawrence Erlbaum Associates. (Operational CAT design — content balancing, exposure control, item-pool requirements.)
- Weiss, D. J. (2004). Computerized adaptive testing for effective and efficient measurement. Measurement and Evaluation in Counseling and Development.
- Brinkhuis, M. J. S. & Maris, G. (2009). Dynamic parameter estimation in student monitoring systems. Cito / Univ. Amsterdam.
Science of Reading
- National Reading Panel. (2000). Teaching Children to Read. National Institute of Child Health and Human Development.
- Ehri, L. C. (2005). Learning to read words: Theory, findings, and issues. Scientific Studies of Reading.
- Scarborough, H. S. (2001). Connecting early language and literacy to later reading disabilities. Handbook of Early Literacy Research.
- Mol, S. E. & Bus, A. G. (2011). To read or not to read: A meta-analysis of print exposure from infancy to early adulthood. Psychological Bulletin, 137(2), 267–296.
- Allington, R. L. (2014). How reading volume affects both reading fluency and reading achievement. International Electronic Journal of Elementary Education, 7(1), 13–26.
Concrete–Representational–Abstract (CRA)
- Bruner, J. S. (1966). Toward a Theory of Instruction. Harvard University Press.
- Witzel, B. S., Mercer, C. D., & Miller, M. D. (2003). Teaching algebra to students with learning difficulties: An investigation of an explicit instruction model. Learning Disabilities Research & Practice.
- Bouck, E. C., Satsangi, R., & Park, J. (2018). The concrete–representational–abstract approach for students with learning disabilities: An evidence-based practice synthesis. Remedial and Special Education, 39(4), 211–228. (CRA meta-analysis, g ≈ 0.68 for K–8 students with math difficulties.)
Spaced repetition
- Ebbinghaus, H. (1885). Memory: A Contribution to Experimental Psychology.
- Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin.
- Woźniak, P. A. (1990). Optimization of learning. SuperMemo.
Interleaved practice
- Rohrer, D., & Taylor, K. (2007). The shuffling of mathematics problems improves learning. Instructional Science, 35(6), 481–498.
- Birnbaum, M. S., Kornell, N., Bjork, E. L., & Bjork, R. A. (2013). Why interleaving enhances inductive learning: The roles of discrimination and retrieval. Memory & Cognition, 41, 392–402.
- Brunmair, M., & Richter, T. (2019). Similarity matters: A meta-analysis of interleaved learning and its moderators. Psychological Bulletin (g ≈ 0.42 across 60 studies).
Growth mindset & try-first learning
- Dweck, C. S. (2006). Mindset: The New Psychology of Success. Random House.
- Yeager, D. S. et al. (2019). A national experiment reveals where a growth mindset improves achievement. Nature, 573(7774), 364–369.
- Sisk, V. F., Burgoyne, A. P., Sun, J., Butler, J. L., & Macnamara, B. N. (2018). To what extent and under which circumstances are growth mind-sets important to academic achievement? Two meta-analyses. Psychological Science, 29(4), 549–571.
- Burnette, J. L., O’Boyle, E. H., VanEpps, E. M., Pollack, J. M., & Finkel, E. J. (2013). Mind-sets matter: A meta-analytic review of implicit theories and self-regulation. Psychological Bulletin.
Passage-based reading assessment
- Smarter Balanced Assessment Consortium. (2024). ELA/Literacy Stimulus Specifications.
- Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet Response Theory and Its Applications. Cambridge University Press.
- Hess, K. K. (2008). Depth of Knowledge Framework for Reading. National Center for Assessment.
- Fisher, D. & Frey, N. (2012). Text-Dependent Questions. ASCD.
Item design & diagnostic distractors
- Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines. Applied Measurement in Education.
- Gierl, M. J., Bulut, O., Guo, Q., & Zhang, X. (2017). Developing, analyzing, and using distractors for multiple-choice tests. Review of Educational Research.
- Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items. Educational Measurement: Issues and Practice.
Formative feedback & productive failure
- Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153–189.
- Kapur, M. (2016). Examining productive failure, productive success, unproductive failure, and unproductive success in learning. Educational Psychologist, 51(2), 289–299.
- Sinha, T., & Kapur, M. (2021). When problem solving followed by instruction works. Review of Educational Research, 91(5), 761–798.
- Loibl, K., Roll, I., & Rummel, N. (2017). Towards a theory of when and how problem solving followed by instruction supports learning. Educational Psychology Review, 29(4), 693–715.
- Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning. Psychological Science, 17(3), 249–255.
Test-taking behavior & rush detection
- Wise, S. L. & Kong, X. (2005). Response time effort: A new measure of examinee motivation. Applied Measurement in Education, 18(2), 163–183.
Practice motivation — autonomy, competence, relatedness
- Deci, E. L. & Ryan, R. M. (2000). The 'what' and 'why' of goal pursuits: Human needs and the self-determination of behavior. Psychological Inquiry.
- Lepper, M. R., Greene, D., & Nisbett, R. E. (1973). Undermining children’s intrinsic interest with extrinsic reward: A test of the "overjustification" hypothesis. Journal of Personality and Social Psychology, 28(1), 129–137.
- Ryan, R. M., Rigby, C. S., & Przybylski, A. (2006). The motivational pull of video games: A self-determination theory approach. Motivation and Emotion.
Diagnostic-to-instruction loop
- Black, P. & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74.
- Hattie, J. & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.
- Bennett, R. E. (2011). Formative assessment: A critical review. Assessment in Education, 18(1), 5–25.
Universal-design accommodations
- CAST. (2018). Universal Design for Learning Guidelines version 2.2.
- Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61. (Synthesis paper covering threshold-method families and the selection-bias argument against correct-conditioned suppression.)
- Abedi, J. (2010). Linguistic factors in the assessment of English language learners. In G. J. Cizek (Ed.), Handbook of Educational Policy.
See the research in the product.
Case studies document how these design choices land with real teachers and students. The pilot program puts your school or district on the calibration-data side of the roadmap.