Are iPads making a significant difference? Findings from Auburn, Maine.

Audrey Watters has an interesting article on early results from an assessment of iPads deployed in kindergartens in Auburn, ME. It’s a perfect place for me to get to one of the core purposes of this blog: to look at educational research results and critique them from the perspective of a fellow researcher. The goal is to help readers be more savvy consumers of educational research. My take is pretty different from Audrey’s (who I think is a brilliant ed tech journalist). I also want to start the post by applauding the team of researchers for tackling this important study, even though I disagree with their interpretation of the data.

The Study

The Auburn district is in the midst of a multi-year literacy intervention. Teachers are getting all kinds of training on early literacy, that is, on helping kids learn to read. Then, this September, 8 of 16 classes are randomly assigned iPads (the intervention group), and the other 8 get them in December (the control group). Kids are tested in September and December (before the control group gets iPads), and the study measures the average difference in score gains between the control and intervention groups.
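For readers who like to see a design spelled out, here is a minimal Python sketch of the two moving parts described above: randomly splitting 16 classes into two groups of 8, and comparing average score gains between the groups. The function names, the seed, and the toy scores are all invented for illustration; none of it comes from the Auburn study, and the researchers' actual analysis (which controlled for incoming fall scores) is more involved than a raw difference in gains.

```python
import random
from statistics import mean


def assign_classes(class_ids, seed=None):
    """Randomly split classes into equal-sized intervention and control groups."""
    rng = random.Random(seed)
    shuffled = list(class_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


def mean_gain(pre, post):
    """Average of (post - pre) across paired scores."""
    return mean(after - before for before, after in zip(pre, post))


def difference_in_gains(ipad_pre, ipad_post, ctrl_pre, ctrl_post):
    """Intervention group's mean gain minus the control group's mean gain."""
    return mean_gain(ipad_pre, ipad_post) - mean_gain(ctrl_pre, ctrl_post)


# 16 hypothetical class IDs; 8 get iPads in September, 8 wait until December.
intervention, control = assign_classes(range(1, 17), seed=0)

# Invented toy scores, NOT data from the study.
toy_difference = difference_in_gains(
    ipad_pre=[10, 12, 14], ipad_post=[16, 17, 21],
    ctrl_pre=[11, 12, 13], ctrl_post=[15, 16, 17],
)
print(intervention, control, toy_difference)
```

Because both groups share everything except the timing of the iPads, the difference in gains is the quantity the randomization is designed to isolate.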

The Findings

First, Audrey starts with this encouraging sentence:

The school year is now almost halfway over, and the early results are in from the district’s implementation. The results are good too: iPads increased the kindergarteners’ literacy scores.

She derives that sentence from this figure and the resulting conclusion (from the press release for the study found here):

[Figure: table of data]

Comparing the OSELA gains from the iPad and comparison settings, gain scores were again consistently greater for the iPad students than were observed in the comparison settings. Most notably, students in the iPad setting exhibited a substantial increase in their performance on the Hearing and Recording Sounds in Words (HRSIW) subtest, which measures a child’s level of phonemic awareness and ability to represent sounds with letters. Subsequent statistical analyses showed that, after controlling for students’ incoming Fall 2011 scores, the impact of being in an iPad classroom had a statistically significant relationship with students post-HRSIW scores (t= 2.36, p.<.05). After controlling for other variables, the nine week impact was equivalent to a 2.1 point increase on the HRSIW, on average, for the iPad cohort.

I read these data quite differently. The results show that in 9 of 10 categories, there is no difference between the two groups. “But Justin,” you say, “The Red numbers are always larger than the Blue numbers.” That is true. But the point of statistical testing is to determine whether differences between two numbers are because of a relationship between two variables or because of random chance. When statistical testing shows that the difference between two numbers is not statistically significant, we should treat those two numbers as not different. In other words, it is reasonably likely that for 9 of the 10 variables measured, any differences that we observe are due to chance rather than due to the fact that the iPads made a difference. The standard reading, from a statistical perspective, is that iPads had no impact on 9 of the 10 variables measured.
(Could you make a special pleading for the iPads because all of the numbers were higher? Maybe. But if it is making a difference, it’s making a tiny, tiny difference. If I’m the superintendent in Auburn, it’s a reason to continue the pilot, not a reason to buy iPads for all the kids.)
Next, the researchers argue that the one variable with a statistically significant difference, Hearing and Recording Sounds in Words, shows a “substantial improvement.” It’s very important to remember that “statistically significant” does NOT necessarily mean substantively significant. We should treat the iPads as having increased HRSIW scores; we should believe that those two numbers are different (caveat to follow). But how much is a 2 point gain over a semester? The researchers give us no information that we can use to interpret that difference. How much do students typically gain over a year? What is the standard deviation of the HRSIW in comparable populations of kindergarteners? What is the range of the scale? Without that information, we cannot independently assess whether a 2 point gain is stellar, or a relatively modest gain from an intervention that cost many thousands of dollars.
Now if you are a researcher, you can figure out some of these things. I Googled HRSIW and found this document. I haven’t really vetted it, but it’s on the site of an area support agency in the UK, so let’s pretend it’s true. It shows that the standard deviation for the HRSIW in 5 year olds is about 10 units on the test. So to get the effect size of the intervention, we can divide the difference in scores, about 2, by the standard deviation, about 10, to get an effect size of 0.2. Generically, we’d consider that effect size to be small, based on Cohen’s guidelines. A better choice would be to compare this intervention to other similar interventions on HRSIW, to see how this compares to things like lowering class size, specific kinds of training for teachers, better breakfasts in the morning, and so forth. Since the researchers don’t provide these, we’ll go with the generic guidelines, and I would interpret the gains as “modest” rather than “substantial.”
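The arithmetic above is simple enough to check in a few lines of Python. This is only a sketch: the 2.1 point gain comes from the press release, the standard deviation of 10 comes from the unvetted document mentioned above, and the function names are made up for this example. The cutoffs of 0.2, 0.5, and 0.8 are Cohen's conventional benchmarks for small, medium, and large effects; the "negligible" label for anything under 0.2 is an extra convenience, not part of Cohen's scheme.

```python
def standardized_effect(raw_gain, standard_deviation):
    """Express a raw score difference in standard deviation units."""
    return raw_gain / standard_deviation


def cohen_label(effect_size):
    """Rough label using Cohen's benchmarks (0.2 small, 0.5 medium, 0.8 large)."""
    size = abs(effect_size)
    if size < 0.2:
        return "negligible"  # label below 0.2 is an assumption of this sketch
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# 2.1 points: reported gain. 10.0: assumed SD from an unvetted source.
effect = standardized_effect(2.1, 10.0)
label = cohen_label(effect)
print(f"effect size = {effect:.2f} ({label})")
```

If the true standard deviation for Auburn's kindergarteners were much smaller or larger than 10, the label would change, which is exactly why the researchers should report it.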
Moreover, we have to start considering a second problem. In t-testing, by convention, if there is less than a 1/20 chance that our results could have come from a population where there is no relationship between predictor and outcome (in this case, between having an iPad and improving a measure), then we call our results “statistically significant.” This means that if we run lots of analyses, we should expect that occasionally, about 1 in 20 times, we will find a statistically significant result when there is no real difference between the two groups (a false positive). In other words, if you run 10 t-tests, and one shows up as significant (say, p=.02), then you have to wonder a bit whether chance has thrown you a statistically significant result from a population with no difference between the control and intervention. (A false positive like this is known as a Type I error, and the way the risk grows as you run more tests is the multiple comparisons problem.)
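To put a rough number on that worry, here is a small Python sketch. It assumes the 10 tests are independent, which is almost certainly not true for 10 literacy subtests given to the same children, so treat the result as a ballpark figure rather than the study's actual error rate. The Bonferroni correction shown is one common, conservative adjustment; nothing in the press release says the researchers used it or any other correction.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff that keeps the overall rate near alpha."""
    return alpha / n_tests


alpha = 0.05   # the conventional 1-in-20 cutoff
n_tests = 10   # number of measures compared in the figure

fwer = familywise_error_rate(alpha, n_tests)
threshold = bonferroni_threshold(alpha, n_tests)

print(f"chance of at least one false positive: {fwer:.3f}")
print(f"Bonferroni per-test cutoff: {threshold:.3f}")
```

Under these assumptions, the chance of seeing at least one "significant" result out of 10 when nothing is going on is about 40%, and a hypothetical p=.02 would not clear the stricter per-test cutoff of .005.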
So my headline isn’t “iPads increased scores substantially.” It’s “iPads modestly increased Kindergarten literacy scores in 1 of 10 measures tested.”

Other Articles

So I wrote this post having read only Audrey’s article, but now I’ve found two other articles from the Twitter feed of Mike Muir. The articles draw two different conclusions, and I think one gets it right. An article in THE Journal has the headline: “Kindergarten iPad Initiative Reveals Modest Literacy Gains” and reports:

The performance gains were admittedly modest, but 129 of the iPad students showed improvement on the Hearing and Recording Sounds in Words (HRSIW) assessment, which measures a child’s level of phonemic awareness and ability to represent sounds with letters.

That sentence is mostly right, but it’s not that 129 students showed improvement, it’s that as a group, the average scores of those 129 students showed improvement. Still, that’s basically the right take.
The Loop, an online journal I have not heard of, has the headline: “iPad improves Kindergartners literacy scores” and reports:

According to the literacy test results classes using the iPads outperformed the non-iPad students in every literacy measure they were tested on.

This isn’t a sound interpretation. The classes using the iPads had scores not significantly different from the non-iPad students. If you do claim that there are differences, then you have to note that most differences are tiny.
I also found the blog of Mike Muir, one of the authors of the study. His interpretation is that the results confirm that “iPads Extend a Teacher’s Impact on Kindergarten Literacy.” I don’t think that I would use the word confirm. I think I’d say that they are suggestive that iPads may have a modest impact. I don’t think one small experiment, even a randomized controlled trial, should rise to the level of “confirmation,” especially when results are so modest. I definitely applaud Mike for doing the work he is doing and for using the most robust design he can, but I don’t think the data support his interpretation.

The Design

Let me also raise a second quibble with the analysis by Audrey and the researchers.
Audrey, and the researchers she interviews, take care to note that it’s not just the device that is responsible for any resultant score gains. The iPads are part of a larger package of reforms. Quoting Audrey:

Was this a result of the device? The apps? The instruction? The professional development? Muir, Bebell, and fellow researcher Sue Dorris (a principal at one of the elementary schools) wouldn’t say.

I will say. It’s true that the whole reform package is responsible for score gains in the 16 classrooms. But if the study was designed correctly, any score gain differences between the control and intervention groups are entirely the result of the iPads. It sounds from the article that in the randomized trial, the only intervention was the timing of the arrival of the device. Therefore, if the randomization was done correctly (and we have no evidence one way or the other from the researchers’ press release), any score difference should be attributed entirely to the iPads. That’s the whole point of a randomized controlled trial: keep everything as identical as possible between the control and intervention conditions, except the intervention itself.
I totally agree with the overall point that it’s important to remember that iPads don’t just appear in classrooms. But the point of random trials is to test specific differences. The specific difference here is the timing of the iPads’ arrival, and we should be able to credit any differences not to the years of reform beforehand (which both groups enjoyed) but to the one thing that makes the control and intervention groups different: the timing of the arrival of the iPad.

My Conclusion

I applaud the Auburn researchers for tackling this study, and I applaud Audrey for trying to tease out the meaning of these results. I think we have very much to learn from the statistical testing tools that methodologists have developed over the last century, and I think applying these tools can be very powerful in testing the efficacy of new technologies. But reading statistical output is tricky business, and everyone who reads the reports of researchers should take care to evaluate how well the researchers’ numbers support their interpretations. In this case, I think the press release put out by the researchers overstates the impacts of iPads on these particular measures in this particular intervention. My take is that iPads didn’t make much difference here, but I’ll look forward to reading the entire research report when it is released.