(cross-posted at EdWeek EdTech Researcher) Audrey Watters asked me to write up my thoughts on the recent Teach to One study. This is basically an email to her that anyone can read. I really enjoy these kinds of questions, so feel free to send them my way. The study is by Douglas Ready, an associate professor at Teachers College, and funded by New Classrooms, a non-profit that developed the Teach to One (TTO) program. TTO is an adaptive assessment and learning platform described as follows:
“Teach to One students are assessed daily to determine current skill levels, and an algorithm employs these test results to target content delivery for the following day. In addition to creating daily learning plans for each student, this adaptive, self-improving algorithm also generates a unique daily instruction schedule for each teacher.”

The study looked at two years of data, from seven schools in the first year and fifteen schools in the second year. Students in these schools used TTO and took pre- and post-tests using NWEA‘s MAP tests. Y’all should know about these MAP tests. They are adaptive tests being used to predict Common Core test readiness. Lots of schools are using them, and lots of kids take the MAP tests at the beginning and end of the year. So the test scores of kids in classrooms using TTO can be compared to reports of national norms of MAP gains. This study compares a bunch of kids in schools using TTO, taking MAP pre- and post-tests, to a whole bunch of kids across the nation, doing whatever they are doing in math and then taking the same pre- and post-tests. As the author is careful to note, this is a limited study design for testing the efficacy of TTO. The schools that adopt TTO are different from the other schools using MAP tests. Ideally, we’d randomly assign some schools to use TTO and others not to. But it’s perfectly reasonable early in the life cycle of a program to fund a study like this, where we just use whatever assessment data is lying around to look at some correlations. The author reminds readers repeatedly that we can’t draw causal conclusions from the study; we don’t know that TTO is the reason why these schools might be different from others. We don’t know if TTO caused the gains.
In fact, it’s entirely plausible that the schools that chose to use TTO for math were schools that were gung ho about getting better at math, and they could have chosen Khan Academy, or new textbooks, or their own homebrewed curriculum, or some other intervention and gotten similar results, because the real causal factor here is schools being gung ho about getting better at math. But we shouldn’t get all “It’s a randomized trial or it’s worthless!” here. This is a good kind of study to do in order to decide whether we should move on to better (and more expensive) study designs. On the whole, I found the study well conducted and convincing, but I do have a few concerns or quibbles.

The key findings from the study are summarized in the Executive Summary as follows: “during the 2012-2013 school year TTO students gained mathematics skills at a rate that was roughly 15% higher than the national average (effect size=0.12 SD). In the second year of implementation–the 2013-2014 academic year–gains among TTO students were almost 47% above national norms (effect size=0.35 SD), a sizable improvement over the first implementation year.”

My first quibble is with the summary of the first year, saying that students’ learning gains in TTO schools were 15% higher than the national average. That is a true statement. Here is another true statement: “In one in seven TTO schools, learning gains were half the size of the national average.” The latter true statement would not be suitable for the executive summary because it doesn’t provide a reasonable summary of the conclusions of the first year. The report of the average gains, however, also doesn’t provide a reasonable summary. Here’s the distribution of gains in the seven schools. Basically, four schools were statistically indistinguishable from the national norm, one was a little better, one was much better, and one was much worse.
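To see how a single outlier school can drive the kind of average reported in the executive summary, here is a toy sketch in Python. The seven effect sizes below are hypothetical numbers I made up to mirror the year-1 pattern; they are not the study’s actual figures:

```python
# Hypothetical illustration (NOT the study's actual numbers): how one
# outlier school can drive an average effect size. Effect sizes are in
# SD units relative to the national norm; "school_a" is the outlier.
from statistics import mean

effects = {
    "school_a": 0.95,   # much better than norm (the outlier)
    "school_b": 0.15,   # a little better
    "school_c": 0.02,   # these four are indistinguishable from norm
    "school_d": -0.03,
    "school_e": 0.01,
    "school_f": 0.04,
    "school_g": -0.35,  # much worse than norm
}

overall = mean(effects.values())
without_a = mean(v for k, v in effects.items() if k != "school_a")

print(f"mean effect, all seven schools: {overall:+.2f} SD")   # +0.11
print(f"mean effect without school A:   {without_a:+.2f} SD")  # -0.03
```

With these made-up numbers, the overall mean looks like a modest positive effect, yet dropping one school leaves an average that is essentially zero, which is the author’s point about school A.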
So here’s a case where using an average to summarize your findings doesn’t really do justice to the findings. As the author explains in the text, if you remove school A from year 1, there are no differences between the TTO schools and the national norm. A better summary of year one might be, “Most TTO schools had gains statistically indistinguishable from the national average, one school did substantially better, and one school did substantially worse.” I made that sentence up. Here’s another possibility, from the paper’s conclusion: “During year one, student gains were uneven across schools, with only two of seven schools making gains that were significantly above national norms.” That would be a better choice for the executive summary. From a scholarly point of view, the author has absolutely laid everything out there, so I have no quibbles with the scholarship. I do think the executive summary sentence isn’t the best choice for the public dialogue.

In year two, TTO schools perform better than the national average (of schools taking MAP tests). I wish I knew much more about the differences between the year 1 and year 2 study cohorts, and it’s a curious omission from the study. Are the seven schools from year 1 completely different schools from those in year 2? Are any schools from year 1 in year 2 of the program? What kinds of changes were made to TTO between year 1 and year 2? Is there any reason not to just lump all 22 schools into one study? I don’t have any answers to these questions. If the author or New Classrooms has some answers, they’d be welcome and I’d be happy to publish them here. In year 2, the TTO schools had much higher gains than the national average. The author reminds us again that we can’t make causal assertions here, but generally speaking this is suggestive that TTO may be working. In the second year, two schools did statistically significantly worse than the national norm, two did the same, and eleven did better.
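Statements like “statistically indistinguishable from the national norm” boil down to comparisons of roughly this shape. Here’s a minimal sketch with made-up numbers; the study’s actual model is more elaborate, and the real MAP norm gains and standard deviations are published by NWEA:

```python
# Sketch of the kind of comparison the study makes: is one school's mean
# MAP gain distinguishable from a fixed national norm gain? All numeric
# inputs below are hypothetical placeholders.
import math

def z_vs_norm(school_mean_gain, norm_gain, sd_gain, n_students):
    """z-statistic for a school's mean gain vs. a fixed national norm."""
    se = sd_gain / math.sqrt(n_students)  # standard error of the mean
    return (school_mean_gain - norm_gain) / se

z = z_vs_norm(school_mean_gain=9.0,  # hypothetical RIT-point gain
              norm_gain=7.5,         # hypothetical national mean gain
              sd_gain=10.0,          # hypothetical SD of gains
              n_students=120)
print(f"z = {z:.2f}")  # |z| > 1.96 would read as significant at p < .05
```

With these placeholder numbers the school’s extra 1.5 points falls just short of the conventional significance cutoff, which is how a school can post above-norm gains and still be “indistinguishable” from the norm.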
Across the fifteen schools, gains were 47% above the national average. I have no problem with that summary statistic. When you look at the distribution, there are two schools that did a lot better than the others, but the average statistic there seems to me to be a fair summary of the trend. Moreover, it appears that students with lower initial math scores benefited more than students with higher initial math scores. That may suggest a ceiling effect on kids who are good at math, but still, it’s always a good thing to see interventions that best serve the kids who need the most help. That’s a very positive finding of the study.

On to my next quibble in interpretation. Here’s a line from the conclusion: “During the second year of implementation, average gains were roughly 47% above national averages–a marked improvement.” Now wait just a minute! I’ll grant you that is a marked difference. But a marked improvement? That line got me thinking that the year 1 and year 2 schools must be connected–how else can you have an improvement if you don’t track the same schools for two years? But upon a re-read, I don’t think that’s the case. Did the software improve? Well, we have no evidence of that from the paper. To make that claim, you would need some description of feature changes from year 1 to year 2, and then either some tracking log data or data from classroom observations suggesting those platform changes mattered. Otherwise, another perfectly reasonable interpretation is that in year 1, they drew a tiny sample of schools and got some findings, and then in year 2, they drew another tiny sample and got some better findings, but they might in year 3 draw another tiny sample of schools and get worse findings. Without any evidence to explain the difference between year 1 and year 2, I think we ought to be cautious in our interpretations. At the same time, whatever. I’m quibbling over one word.
And again, the author gives me what appears to be all of the appropriate statistical data to raise these kinds of questions and to let me draw my own, different, conclusion. We can’t really ask more from research. (Well, maybe we could ask for a little more background on the difference between the year 1 and year 2 schools.)

Now, in terms of public consumption of this data, lots of cheerleaders for blended learning are disseminating the 47% statistic, like New Venture Fund, Getting Smart, and New Classrooms. They should be somewhat cautious. As a rule of thumb, interventions look better in correlational studies (like this one), and they look worse in randomized control trials. The 47% gain translates into an effect size of .35. I think there is every reason to believe that if TTO is evaluated in a randomized trial, the effect size will prove to be smaller than that. Probably in the .2 range, which happens to be about what you get if you lump year 1 and year 2 together. That’s also about the effect size we saw, for instance, in the second year of the evaluation of Cognitive Tutor: Algebra. That seems to be within the range of what computer-assisted math instruction has to offer at this point: smallish but non-trivial gains on the kinds of mathematics that we test using standardized tests (which, as Dan Meyer shows in a great recent post, miss many of the most important parts of mathematics instruction; more on that in the future). I’d say this study adds familiar findings to a long history of studying computer-assisted instruction.

Tom Vander Ark summarizes the findings like this: “While future versions will likely become more affordable, the innovative program is relatively expensive and initial deployments requires substantial technical assistance. It is, however, the best picture of personalized learning out there.” If that’s the case, then the best picture of personalized learning is pretty much the same as our older pictures of computer-assisted instruction.
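For readers puzzling over how “47% above national norms” and “effect size = 0.35 SD” can describe the same result: the two are linked through the size of the average national gain relative to the spread of gains. A back-of-the-envelope sketch, with made-up MAP-like numbers chosen only so the arithmetic lands near the report’s figures:

```python
# Back-of-the-envelope link between "% above the national norm" and an
# effect size in SD units. The norm gain and SD below are hypothetical
# placeholders, not NWEA's published values.
norm_gain = 7.5   # hypothetical national mean MAP gain (RIT points)
sd_gain = 10.0    # hypothetical SD of student gains (RIT points)

def effect_size(pct_above_norm):
    """Effect size for a group whose mean gain exceeds the national
    mean gain by pct_above_norm (e.g. 0.47 for 47%)."""
    excess = pct_above_norm * norm_gain  # extra points gained
    return excess / sd_gain              # standardize by the SD of gains

print(f"15% above norm -> d = {effect_size(0.15):.2f}")  # ~0.11
print(f"47% above norm -> d = {effect_size(0.47):.2f}")  # ~0.35
```

The same percentage can thus correspond to a bigger or smaller effect size depending on how large the typical annual gain is relative to its spread, which is one more reason the raw “47%” figure shouldn’t travel without its effect size attached.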
Some folks may say, “Now wait a minute, didn’t New Classrooms fund its own study?” Yes, it did, and good for them. It’s hard to get external funding for new edtech products, so companies should absolutely be funding their own studies. They found a professor from Teachers College, a wonderful school, to do the analysis, and there is no reason to believe the author wasn’t an impartial analyst. Maybe he’s a fan of blended learning, and I noted a few places where his language seems to reveal a little bias, but as far as I can tell any bias is in the interpretation, not in the analysis. New Classrooms should be lauded for including evaluation in their budgets, and these results call for further external studies funded by the Institute of Education Sciences or someone else.

I’d love a world where our mathematics instruction focused more holistically on mathematical modeling and less on the kinds of computational tasks that make up much of standardized testing but have been rendered obsolete in the real world by computers. But in the world we live in, the stakes of high school exit exams are very high for kids. If I were the principal of a school with a lot of kids in danger of not meeting proficiency levels on high-stakes math tests, I’d be seriously considering whether these kinds of adaptive systems might lead to improvements. This study contributes to that assessment.

All of this was too late for Audrey’s weekly wrap, but those are my $0.02. If Prof. Ready or the New Classrooms folks think I missed something, I’m happy to publish any responses here. For regular updates, follow me on Twitter at @bjfr and for my publications, C.V., and online portfolio, visit EdTechResearcher.