On the Development of the WQI

We developed the Wiki Quality Instrument (WQI) to answer two kinds of questions about the role of wikis in K-12 education: 1) How do educators design wiki learning environments that promote rich learning experiences? 2) Do only certain learners have access to these high quality wikis? To answer these related questions, we needed a means to evaluate the degree to which wikis support high quality learning. No such measure existed at the beginning of our research process, so designing our own instrument has been a signature feature of our research agenda.

The WQI is a content analysis rubric used by trained research assistants to evaluate U.S., K-12 wiki learning environments. The WQI is intended to be used with a large sample of wikis, where each wiki is measured on multiple occasions. We use these multiple quality measures to produce longitudinal wiki quality profiles. These quality profiles represent wiki quality as a trajectory rather than as a single measure. Since our goal is to conduct our investigations at a scale of hundreds or thousands of wikis, we designed the WQI to require approximately 30 minutes, on average, to conduct one evaluation of one wiki at one time point.

In this document, we describe the process by which we designed the wiki quality instrument. First, we set some context for our study by describing the unit of analysis and the scale and scope of our inquiry. These contexts imposed a variety of constraints on our instrument design. Next, we summarize the research we used to develop the theoretical framework of the WQI. To determine the domains of wiki quality, we examined the literatures on 21st century skills and on evaluation of online learning environments. We also gathered data from wiki-using teachers and students through surveys, interviews, and classroom observations. We explain here how we synthesized those perspectives into a theoretical framework.
Finally, we describe how we used an iterative process to operationalize our theoretical framework into a set of valid and reliable items to create the WQI.
Defining the Context of our Study

Before delving into the details of our instrument design process, we highlight two important features of the context of our research. First, we define the wiki subdomain as our unit of analysis. Second, we explain the advantages and constraints associated with the scale of our inquiry.

What do we mean by “wiki”? The wiki subdomain as the unit of analysis

The unit of analysis in our study is the wiki subdomain. A wiki subdomain is a particular Web address provided by a wiki hosting service. For instance, PBworks—the wiki hosting service that supplies the wiki data for our project—uses the domain “PBworks.com” and allows users to create subdomains, such as ReichWorldHistory2009.PBworks.com. We use these subdomains to draw rigid conceptual boundaries between wiki learning environments. Thus we consider ReichWorldHistory2009.PBworks.com as one “wiki” and ReichWorldHistory2010.PBworks.com as a second “wiki.” Much of the material on those two wikis might be the same, but in all likelihood the students would be different; the project endured even as the students changed. It might be that ReichMedievalProject2009.PBworks.com and ReichRenaissanceProject2009.PBworks.com are different projects completed by the same classrooms of students. Reich2009TeamA.PBworks.com and Reich2009TeamB.PBworks.com might be from the same class and doing the same project. All of these we would define as separate wikis in our analysis, even though in other kinds of studies sensible researchers might choose to treat all of these different subdomains as one “wiki community.” By using the wiki subdomain as our unit of analysis, we could apply clear, automated decision rules to defining wiki communities. Each subdomain was treated as a separate, unique case in our dataset. Choosing the wiki subdomain as our unit of analysis also has certain technical benefits. For instance, PBworks maintains their usage statistics at the subdomain level.
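The decision rule above can be expressed in a few lines of code. This is an illustrative sketch rather than our actual data-processing pipeline, and the hostnames are the hypothetical examples used above:

```python
from urllib.parse import urlparse

def wiki_id(url: str) -> str:
    """Return the subdomain label that identifies one "wiki" in our scheme.

    Sketch of the decision rule only: a PBworks-style hostname such as
    'reichworldhistory2009.pbworks.com' maps to its leftmost label, and
    each distinct label is treated as a separate, unique case.
    """
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    parts = host.lower().split(".")
    if len(parts) >= 3 and parts[-2:] == ["pbworks", "com"]:
        return parts[0]
    raise ValueError(f"not a PBworks subdomain: {url}")

# Two pages on the same subdomain belong to the same wiki;
# a different subdomain is a different wiki.
wiki_id("http://ReichWorldHistory2009.PBworks.com/FrontPage")
# -> 'reichworldhistory2009'
```

Because the rule operates on the hostname alone, it can be applied automatically across hundreds of thousands of addresses without human judgment, which is precisely what makes the subdomain a tractable unit of analysis at scale.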
Moreover, from our preliminary analysis we observed that relatively few wiki subdomains appeared to be nodes in the kinds of networks of wiki communities hypothesized above. Most users create wiki subdomains as discrete entities. To be sure, in other kinds of studies, it might be quite profitable to attempt to link together related subdomains. For instance, in a study closely examining wiki usage in particular school settings, researchers might study a school where wikis are used frequently and wiki projects are connected to one another in meaningful ways. While it might be somewhat technically difficult to track discrete users as they make contributions across multiple subdomains, such an effort might be worthwhile in very closely examining a particular group of wiki users. Such an effort, however, would be very difficult and very expensive to do at scale. For our purposes, defining a wiki as a wiki subdomain was a superior approach. Henceforth, when we refer to a “wiki” in our dataset, we are referring to a publicly-viewable, education-related wiki subdomain hosted by PBworks.com.

What is the scale of inquiry? Managing wiki study at scale

At the heart of our research agenda is the belief that there is something to be learned from quantifying characteristics of the entire universe of U.S., K-12 wikis. To gather a representative sample of this population, we chose to study samples that include hundreds or thousands of wikis. This led to three challenges that put constraints on how we define quality and how we develop our coding protocols. These three challenges are 1) the diversity of the universe of wikis, 2) the scale of our investigation, and 3) the inability to track individual users with acceptable levels of reliability. In our preliminary analysis, we found an extraordinary diversity of activity in the universe of wikis.
In our sample of 1,799 wikis drawn at random from 179,851 publicly-viewable, education-related wikis hosted by PBworks, we identified wikis used in elementary schools and wikis used by seniors in honors classes. We found wikis supporting instruction in virtually every academic subject area: English/language arts, social studies, math, science, computers and technology, foreign languages, arts, and physical education. Wikis are also very flexible platforms, so teachers and students used wikis in manifold ways: as online handouts, online worksheets, platforms for collaborative presentations, discussion forums, topical encyclopedias, and student portfolios. All of this diversity presents serious challenges for designing a measurement instrument that can be used reliably by a team of coders. In order to manage the diversity we faced in student ages and levels, we did not define quality in regards to particular details of student performance. Rather, we more broadly looked at the kinds of learning opportunities that students had in wiki learning environments. For instance, we documented seven different types of student collaboration, such as copyediting. We did not measure the efficacy of the particular discursive moves made by students in collaboration with each other, such as measuring the degree to which a copyeditor made focused, constructive suggestions. We can reliably identify when a third grader is copyediting another student’s work as easily as we can reliably identify when a senior in high school is copyediting. Attempting to measure the specific quality of the copyediting activity would have been too complex for the scale of our inquiry. Similarly, in determining what kinds of learning opportunities to evaluate, we only chose to evaluate opportunities that would be common across the academic subject areas. 
If we were to, in a future study, constrain our inquiry to a single subject domain, such as science, then we would have more opportunities to choose specific indicators that would give a richer indication of the quality of science instruction and learning occurring on the wiki. If we refined yet further—to Earth science, or 7th grade Earth science taught in Iowa—then we could be even more detailed and specific in our criteria. We believe, however, that at this stage in the development of research into Web 2.0 tools and deeper learning in K-12 schools, we need a broad national perspective. Thus, we have eschewed specificity and sacrificed some measure of depth in order to maintain this broad view.

In addition to the challenges presented by the diversity of the universe of wiki activity, we also faced the challenge of the scale of our inquiry. Because we attempted to manually code wikis at considerable scale, we quickly confronted constraints of time and resources. When we were developing our instrument, we made the following assumptions about the costs of applying the instrument. We wanted to code 500 wikis on 4 separate occasions, which would require conducting 2,000 evaluations. We also wanted each wiki to be evaluated twice by independent raters, so we needed to conduct 4,000 evaluations. If each evaluation were to take 30 minutes, and if research assistants were to bill at $12/hour, then the cost of applying the WQI would be approximately $24,000. When we added the costs of training coders, evaluating coders, holding meetings to maintain consistency, and reconciling discrepancies, we assumed that the costs would exceed $30,000. Using these estimates, for every additional minute that it would take a research assistant, on average, to code a wiki, the direct labor cost of our pilot investigation would increase by approximately $800. Of course, not all these assumptions proved to be true.
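The budgeting arithmetic above can be captured in a small cost model. The parameter names are ours, and the figures are the planning assumptions, not actual expenditures:

```python
def wqi_coding_cost(n_wikis=500, occasions=4, raters_per_eval=2,
                    minutes_per_eval=30, hourly_rate=12.0):
    """Direct labor cost of applying the WQI under our planning assumptions.

    Returns (number of evaluations, total labor cost in dollars).
    Training, calibration meetings, and reconciliation overhead are excluded.
    """
    evaluations = n_wikis * occasions * raters_per_eval   # 500 * 4 * 2 = 4,000
    hours = evaluations * minutes_per_eval / 60           # 2,000 coder-hours
    return evaluations, hours * hourly_rate

evals, cost = wqi_coding_cost()   # 4,000 evaluations, $24,000 in direct labor
# Marginal cost of one extra minute per evaluation, before overhead:
_, slower = wqi_coding_cost(minutes_per_eval=31)
# slower - cost is about $800 per additional minute of average coding time
```

This kind of back-of-the-envelope model made the trade-off concrete: every design choice that lengthened the average evaluation had a direct, calculable price.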
Federal Work-Study grants reduced our labor costs; in our first sample we coded 400 instead of 500 wikis; and we used six occasions of measurement instead of four, though since wiki lifetimes are so short, most wikis required three or fewer measurements. The point here, however, is that to examine wikis at a scale of hundreds or thousands, we needed an instrument that could be applied relatively quickly. The scale of our investigation, therefore, set constraints on how deeply coders could evaluate each wiki. Early on, we recognized that efforts to rate behaviors on frequency scales were unlikely to be successful. In order to design the WQI to be used in a 30-minute evaluation, we chose quality indicators that were relatively simple to detect and could be measured and evaluated without extensive textual analysis or calculation. We chose to evaluate the presence or absence of different behavior types rather than the frequency of particular behaviors or the quality of particular behaviors. For instance, in our preliminary analysis, we found that coders could achieve acceptable reliability when coding for the presence or absence of copyediting. Coding for the frequency of copyediting on a particular wiki could not be accomplished reliably in a reasonable length of time. The variation in wiki size was a major contributor to this dilemma. Consider two wikis. One wiki has one page, and that page is copyedited in a few places. Another wiki has over 100 pages, and only two exhibit copyediting. One of those pages is copyedited extensively; the other only barely. Even with these starkly contrasting scenarios, it is not difficult to imagine the dilemmas caused by trying to create a single, unified scale measuring the frequency of copyediting. Moreover, as noted above, we had very little hope of measuring the quality of copyediting across very diverse wikis.
Thus, the WQI measures the presence and absence of behaviors that promote high quality learning rather than the frequency or degree of those behaviors. Our third challenge was the difficulty of tracking individual users. From our observations and content analysis, we knew that users do not always conduct all of their wiki activity using a unique login. For instance, in one third-grade classroom in San Diego, we observed a teacher who created a wiki, signed in each day with his own login, and then allowed students to take turns contributing under his username. A content analysis of the wiki would easily reveal that this wiki is co-constructed by third-graders, but we have no way of knowing exactly which third graders are responsible for which contributions. This uncertainty curtails our ability to measure quality by measuring changes in individual student behavior or performance, which of course is one of the most important indicators of quality in classrooms. In future studies, we hope to partner with schools or districts using online learning environments to evaluate the quality of 21st century skill development in online learning environments by tracking the performance of individual students. In our circumstances, however, we chose to forgo efforts to track individual student development in order to evaluate wiki usage and quality at a national scale. These challenges of wiki diversity, the scale of our inquiry, and the difficulty of tracking individual users shaped our study in several fundamental ways. In order to deal with these constraints we developed an instrument that focused on broadly applicable markers of quality, that evaluated evidence of opportunities for learning rather than evidence of measurable cognitive improvement, and that measured the presence or absence of types of learning opportunities rather than frequency or the level of quality inherent in those types of learning opportunities. 
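A minimal sketch of this presence/absence scheme, using hypothetical item names rather than the actual WQI items:

```python
# Hypothetical item names for illustration; the real WQI items differ.
COLLABORATION_ITEMS = ("commenting", "copyediting", "co_construction")

def code_wiki(behaviors_observed):
    """Binary presence/absence coding for one wiki at one occasion.

    Records 1 if a behavior type appears anywhere on the wiki, 0 if not.
    The frequency and quality of the behavior are deliberately not scored,
    so wikis of very different sizes are coded on a common footing.
    """
    observed = set(behaviors_observed)
    return {item: int(item in observed) for item in COLLABORATION_ITEMS}

# A one-page wiki with a little copyediting and a 100-page wiki with a
# little copyediting receive the same code on this item:
code_wiki({"copyediting"})
# -> {'commenting': 0, 'copyediting': 1, 'co_construction': 0}
```

Repeating this coding at each measurement occasion yields the longitudinal quality profiles described earlier: a sequence of binary vectors per wiki rather than a single score.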
These constraints were integral to our thinking as we developed our theory of wiki quality and the WQI.
Developing a Theory of Wiki Quality

Our process for developing a theory of wiki quality involved three kinds of research. First, we were committed to listening to the voices of teachers and students in our instrument design process, so we used several methods to listen to, record, and analyze their experiences. Through teacher surveys, teacher interviews, student focus groups, and classroom observations, we explored the ways in which teachers and students defined and assessed wiki quality. Second, we wanted to build upon any relevant published scholarship, so we conducted a wide-ranging literature review to examine how previous researchers had evaluated quality in online learning environments. Third, we conducted an additional literature review on the theme of 21st century skills to examine how educators, researchers, and policymakers conceptualized high-quality learning beyond the specific domains of online learning environments. From these three analyses, we developed a theoretical framework for measuring wiki quality.

Before delving into these research methods, it is important to highlight one of the assumptions that we brought with us into the research. Any effort to measure wiki quality at a national scale assumes that certain dimensions of quality are universal across American education. One reasonable position on educational quality might posit that high quality learning environments are those that meet the learning goals established by the students and educators within that community. From that perspective, universal measurement is folly, since the only meaningful markers of quality are those that are locally defined. While we have some sympathy for the position that quality teaching responds to local contexts, we reject the notion that dimensions of quality are entirely defined and contained locally. In order to measure wiki quality at a national scale using our methods and resources, we could not assess wikis in their classroom contexts.
We could study 500 wikis, but we could not study 500 wikis and their 500 associated classrooms across the country. From the beginning, therefore, we resolved to study wikis divorced from their larger learning ecology. We assumed we could identify certain universal features of “good” wikis without knowing specific contextual details from these larger learning ecologies. This is not to say that this approach is “better” than research studying smaller numbers of learning environments, but our approach offers a different perspective, a different trade-off between depth and breadth.

How does the Literature on 21st Century Skills Define High Quality Learning?

Nearly everyone who studies education for a living has a set of broad assumptions about what makes for high quality learning environments. Within our research team, these assumptions were strongly influenced by the work of Frank Levy and Richard Murnane. In their book The New Division of Labor, Levy and Murnane (2004) used labor market research to develop a taxonomy of skills critical for success in 21st century job markets. They argue that computers have taken over a considerable portion of routine manual and cognitive tasks in the workplace, and thus the labor markets have shifted to include more jobs requiring skills that computers cannot perform well: expert thinking and complex communication. Expert thinking is required for solving ill-structured problems, tasks that cannot be completed with rules-based logic, and tasks requiring tacit knowledge. Complex communication is required for tasks that are defined or accomplished through social interactions. Thus, in Levy and Murnane’s formulation, expert thinking and complex communication are 21st century skills in the sense that they are of growing importance in 21st century labor markets.
Their research is a cornerstone of empirical research on what are now known as 21st century skills, and it is important for us to acknowledge that this research shaped our thinking entering the project. That said, we made every effort to remain open to alternative perspectives from teachers, students, or the published record of research as we developed the WQI. One of our first research steps, therefore, was to examine other research defining the skills, knowledge, and competencies that educators should value. Since the publication of Levy and Murnane’s work, many other researchers, thinkers, and policymakers have developed other lists of 21st century skills (Gardner, 2006; Haste, 2008; H. Jenkins, 2006; Partnership for 21st Century Skills, 2007; Trilling, Fadel, & Partnership for 21st Century Skills, 2009; Wagner, 2008). We examined these lists, and we developed matrices for comparison among the different frameworks. Eventually, we came to conclusions similar to Dede’s (2010) analysis of several of the most prominent frameworks of key skills for the 21st century. He found that expert thinking and complex communication are featured in nearly all of the well-regarded 21st century skill lists, along with one other domain: technological literacy. This seemed to be a domain well aligned with our interests in wikis. To develop our own formulation of technology literacy, we borrowed from Jenkins’ (2009) definition of new media literacy. Jenkins argues that emerging networked technologies require that students have the ability to critically consume and produce diverse forms of social media in a collaborative, networked context. In a sense, this definition of new media literacy defines a particular sub-category of complex communication: tasks that are defined and accomplished by communicating with diverse forms of multimodal media.
Thus, from early on in our research, these three domains—expert thinking, complex communication, and technological literacy—formed the core of our working definition of 21st century skills. While we were doing our own reading and thinking about the key elements of high quality learning, we were simultaneously working to bring the voices of teachers and students into our deliberations. In the next section, we discuss how we gathered data from classroom wiki users and what they had to say about wiki quality.

How do Wiki-Using Teachers and Students Define and Assess Wiki Quality?

We used multiple methods to investigate how teachers and students in wiki-using classrooms defined and assessed high-quality work in wiki learning environments. We interviewed 68 teachers from across the country about their use of wikis. Approximately half of these subjects were randomly drawn from our sample of 411 U.S. K-12 wikis, and the rest were purposively recruited as expert wiki users (recruited through personal contacts), teachers in urban schools (recruited primarily through personal contacts), and novice wiki users (recruited from a PBworks summer institute for teachers). We also visited 19 classrooms in six U.S. states. In sampling these teachers, again we used a combination of cold-calling randomly selected wiki creators identified on large lists of wikis and contacting particular teachers through personal contacts. From these diverse sampling efforts, we believe we captured a broad cross-section of wiki users. While visiting classrooms, we also recruited students to participate in focus groups with our researchers, and we conducted over 40 student focus groups through these methods. We also surveyed 192 participants in an online wiki summer professional development program designed for novice wiki users and hosted by PBworks, and we asked them what they anticipated to be the benefits of using wikis with their students.
We compiled field notes, interview transcripts, and analytic memos into an electronic qualitative research package, and we analyzed our data looking for common themes voiced by teachers and students. Through these various channels, teachers expressed a diverse set of beliefs about the benefits and affordances of wikis. When we asked teachers why they chose to invest their time and energy into developing wiki learning environments, they discussed several major categories of benefits. Wikis gave students opportunities to develop communication and collaboration skills, ranging from commenting on each other’s work, to peer editing, to co-creating projects and assignments. Wikis allowed students to develop a fluency with a new technology platform, and to publish multimedia presentations of their arguments and beliefs. Wikis also simplified some of the logistics of classroom communication. They allowed teachers and students to share both logistical information about the course, such as homework assignments and classroom guidelines, and materials related to course content. Teachers also viewed wikis as places for students to develop and publish projects, and to deepen and display their understanding of course skills and knowledge. When discussing what makes a great wiki, advanced wiki-using teachers described sophisticated performances of understanding where students demonstrated mastery of course content, collaboration skills, and technological design competencies. The reasons that teachers described for using wikis overlapped with our own understanding of high-quality learning environments seen through the lenses of complex communication, new media literacy, and expert thinking. Teachers’ discussion of peer editing and collaborative projects fit into the domain of complex communication. Their comments about technology literacy and using the multimedia affordances of wikis cohered with our ideas about new media literacy.
Their descriptions of project-based work, of self-directed work, and of developing and displaying understanding cohered with our domain of expert thinking. From analyzing teachers’ descriptions of why they used wikis, we felt that their purposes aligned quite well with our conception of 21st century skills. The new domain that teachers introduced to us had to do with the logistics of classroom life. They described the importance of having a central place to post course materials, course content, and links to related websites. They also described the importance of having a place where students could post questions, homework, and links to other materials and interact with the class outside of class time. We defined this domain slightly more broadly than logistics, and we began to refer to these practices as elements of participation. We viewed these basic ways of interacting—reading materials, following links, posting simple content—as the precursors of more sophisticated behaviors that promoted deeper learning. Overall, we felt that our theoretical conceptions of high quality learning environments cohered well with the ideals and objectives described by our wiki-using teachers. The domains of expert thinking, complex communication, and new media literacy resonated with their descriptions of the benefits of wikis and of their goals for integrating wikis into their classrooms. From their descriptions, we also resolved to add the conceptual category of participation to our analytic framework. Since the degree of alignment between teacher values and our beliefs about wiki quality was quite high, we probed the data further to see if teachers had assessment mechanisms that might help us develop the items of the WQI. Along with asking teachers to talk about what they valued, we also asked teachers and their students to describe how teachers measured and assessed quality work in wikis.
In our interviews, we often prompted teachers to discuss both their formal assessment mechanisms—like grades and rubrics—as well as other informal mechanisms, like comments they might leave on a wiki or share with students in class. We asked students about what they thought good work looked like on wikis, who they thought was doing good work, and how their teachers graded and evaluated their efforts. We hoped that some of these assessment mechanisms might prove useful in developing the WQI. What we discovered was a striking disjunction between what teachers said they valued and what they actually graded. Our evaluation of the overall trend of teacher assessment in wiki learning environments is that teachers primarily grade students for “following directions.” Many teachers reported that they evaluate students for participating in the wiki community at prescribed intervals, for including the required number of design elements (like pages, paragraphs or images), and for including factually correct content. These routine tasks generally do not cohere with the domains of 21st century skills. Some teachers did report assessing students’ communication, collaboration, and technology fluency, and a few reported assessing understanding or critical thinking. Many teachers reported that they valued the skills in these deeper learning domains, but they struggled to figure out exactly how to assess deeper levels of understanding or expert thinking. One teacher referred to this as a “gray area” in grading, and explained that while he certainly valued these domains, he was not sure how to develop objective assessment criteria. This teacher’s dilemma is not surprising, as developing strategies for assessing 21st century skills and higher-order thinking skills is an unsolved challenge at the heart of numerous research efforts across the world. 
We did not discover any commonly used assessment criteria for expert thinking, complex communication, or new media literacy that were adopted across multiple classrooms in our study. In developing the WQI, we used our qualitative data as one source of data about the various discursive practices that occur on wikis and as a source of inspiration for WQI items. We did not, however, find any common grading or evaluation practices that we could directly modify or adapt in creating our WQI. While searching these data for these kinds of metrics of 21st century skill development, we also turned to the research literature on measuring quality in online learning environments.

How have other scholars approached measuring quality in online learning environments?

In developing our definition of quality and our WQI, we conducted an extensive literature review to investigate how other researchers and scholars had approached the evaluation of Web 2.0 learning environments. We conducted searches for terms such as wiki*, blog*, and “Web 2.0” in databases of published articles and unpublished theses. We also examined all articles in the last ten years of the Journal of the Learning Sciences, the International Journal of Computer-Supported Collaborative Learning, the Journal of Research on Technology in Education, the Journal of Computer Assisted Learning, and the American Journal of Distance Education. Research assistants read and summarized all of the articles in these volumes that dealt with measuring quality or with Web 2.0 tools in educational contexts. Research into Web 2.0 learning environments—wikis, blogs, discussion forums, proprietary environments, and other platforms—has primarily been conducted through small-scale design research experiments and qualitative case studies. Most studies examine one or a few classes of students, often in courses taught by the researchers.
These studies typically investigated a single narrow dimension of student learning, such as cognitive engagement (Oriogun, Ravenscroft, & Cook, 2005), collaboration (Cortez, Nussbaum, Woywood, & Aravena, 2009; Trentin, 2009), or the effect of incongruity between knowledge and information on knowledge building (Moskaliuk, Kimmerle, & Cress, 2009). Often these studies were conducted within a single subject domain, such as algebra (Chiu, 2008), business ethics (Jeong, 2003), or American history (Lawrence, 2009). The studies used a wide variety of methods to assess the quality of learning environments and student development, including pre-post student testing, pre-post student surveys, and content analysis of online materials. Our study took a significant departure from these approaches. We studied samples of hundreds of wikis drawn from populations of hundreds of thousands of “naturally occurring” wikis rather than examining special sites or our own classrooms. As a result, we studied learning environments naturalistically by examining the work that teachers and students were already doing, rather than devising interventions or design experiments where conditions were controlled to test particular hypotheses. We studied wikis that support instruction across all the subject areas rather than just in one particular academic domain. We sought to evaluate wiki quality broadly rather than one specific dimension of quality. Perhaps the study that comes closest to ours in scope is Kozma’s (2003) work analyzing 174 case studies of innovative technology projects identified in 28 countries. Kozma assembled an international team of researchers. They created a list of exemplary education technology projects from each country, constructed case studies for each project from interviews, observations, and document analysis, and then classified the case studies in a number of categories. 
The fundamental similarity between our work and Kozma’s research is that we both examine evidence from technology-based learning environments, classify that evidence, and then conduct quantitative analysis to better understand how teachers develop rich, technology-based learning environments. Sufficient differences between our projects, however, limit the compatibility of Kozma’s research tools with our approach. Kozma’s unit of analysis was the case study, which included evidence both from the technology application and the classroom context. Our unit of analysis was the wiki itself, and we limited our examination to the technology platform. Kozma studied only exceptional cases; we studied the full distribution of wiki learning environments. Both focused and broad studies of Web 2.0 learning environments are much needed. Up to this point, however, most research into Web 2.0 learning environments has been conducted by examining particular learning environments under microscopes, and in this study we attempted to characterize the universe of wiki learning through telescopes. Previous researchers have developed many innovative approaches to studying online learning environments, but most studies had a different grain size than our own investigation. The fine-grained measurement mechanisms that most researchers used to evaluate a particular dimension of quality in a particular domain seemed inapplicable to our efforts to more broadly evaluate quality in a diverse universe of K-12 wikis. As a result, while we could draw some parallels between previous quality measurement approaches and our own efforts, after an extensive review of existing approaches to measuring quality in online learning environments, we chose not to directly adapt existing instruments for evaluating wiki quality into our own WQI.
Iterating towards the Final Wiki Quality Instrument

After our qualitative research and literature review, we had established four domains of wiki quality with backing from the research literature and alignment with the objectives of wiki-using teachers: participation, expert thinking, complex communication, and new media literacy. Our next challenge was operationalizing these domains into items that we could use to create a content analysis rubric. Developing these items was an iterative process that took place over a year. Our first efforts took a grounded theory (Charmaz, 2006) approach, in which we conducted several rounds of open coding of wikis to get a sense of the kinds of behaviors that we could identify on wikis. We then did multiple rounds of focused coding in which we tested a variety of item types. Next, we conducted additional rounds of pilot coding in order to conduct interrater agreement analysis and finalize our items. The final step in revising the WQI came towards the end of our first study, when we had sufficient data from our quality measures to conduct principal components analysis and cluster analysis in order to assess the coherence of the items within our domains. In parallel with our literature reviews and qualitative research, we also analyzed wikis directly. Many of our early rounds of wiki coding focused on identifying basic demographic features of the wikis. For instance, we sought to separate U.S., K-12 wikis from wikis used in other countries and in higher education. Once we established which wikis were used in U.S., K-12 schools, we sought to classify them by subject area, by grade level, and by their hosting institution (school, library, district, etc.). In each of these early rounds of demographic coding, we also did various exercises to analyze the content, teacher activity, and student activity of the wikis.
For instance, in our first round of wiki coding, we asked research assistants to briefly describe the “purpose” of each wiki. We did not provide criteria or guidelines for the exercise, though we asked coders to try to settle on internally consistent language within the set of wikis they analyzed. In other words, if they started using the term “student portfolio” to describe a subset of wikis, we asked them to individually work out a set of decision rules for applying that term consistently. From these qualitative descriptions, we collaboratively developed a taxonomy of wiki purposes. This taxonomy included categories like “trial wiki,” “individual student, single assignment,” “individual student, project,” “individual student, portfolio,” “collaborative student, single assignment” and so forth. In subsequent rounds of wiki coding, we asked raters to attempt to classify the wikis by our agreed upon purpose categories. “Purpose” proved to be too nebulous a concept and we could not generate sufficient interrater agreement in our classifications (itself a useful finding), but the exercise did give us a better sense of the kinds of things that teachers and students did with wikis. While refining our purpose categories, we also asked research assistants to describe “patterns of practice” they encountered on the wikis. These patterns of practice were identifiable discursive moves made by teachers and students to facilitate student learning. Again, we gave coders very few guidelines for what might constitute these patterns of practice. We did ask them to think about our four conceptual quality categories, of participation, expert thinking, complex communication, and new media literacy. Beyond that, however, we asked them to simply write about what they saw happening. At this stage, we examined over 400 wikis with two raters looking at each wiki, so we developed a pretty extensive set of qualitative descriptions of wiki activities. 
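Agreement between pairs of raters in exercises like these is conventionally summarized with a chance-corrected statistic such as Cohen's kappa. As a minimal illustrative sketch (the codes below are hypothetical, not data from our study), kappa for two raters' presence/absence judgments on a single item can be computed as follows:

```python
# Minimal sketch of Cohen's kappa for two raters' codes on one item.
# The codes below are hypothetical, for illustration only.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length code lists."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal code frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Hypothetical presence/absence (1/0) codes for one item on ten wikis.
rater_a = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(round(cohens_kappa(rater_a, rater_b), 2))  # → 0.78
```

Because kappa discounts agreement expected by chance, a very rare behavior can yield high raw percent agreement (both raters almost always code "absent") while kappa remains near zero, which is why rare items can artificially inflate overall agreement figures.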
We also, in these early rounds, began testing preliminary items. For instance, we developed a four-item taxonomy of behaviors displaying complex communication: concatenation, copyediting, co-construction, and commenting. We considered whether we could create some kind of quality scale for these items, but we realized that it would be impossible to quickly and reliably assess “good” copyediting versus “bad” copyediting. We did attempt to make a simple, scalar assessment of the frequency of these activities by using a 0-2 scale, where 0 was “activity not found,” 1 was “activity found infrequently,” and 2 was “activity found regularly.” We did not provide precise definitions for the frequency categories. We found that we were unsuccessful at reliably rating the frequency of these four collaborative activities, but we were successful at reliably identifying the presence or absence of these activities. Moreover, wikis with evidence of multiple collaborative characteristics did appear to be generally more collaborative than wikis with just one characteristic. We also discovered that certain behaviors, such as signing up for a timeslot or a responsibility on a list, did not fit well within our complex communication schema, so in subsequent iterations of the WQI we added items for planning, scheduling, and discussion. Through additional rounds of pilot testing, we attempted several other approaches to item design. For instance, we developed a set of indicators of technology use for our new media literacy category. These items included using formatting, adding links, and embedding images. For a while, we tried to distinguish between “substantive” and “decorative” uses of these elements. For instance, when did formatting really enhance the argument or artistic message of a wiki page, and when was it simply meaningless decoration?
This was another effort at scalar measurement, and once again we could achieve agreement on the presence or absence of formatting, but we could not reliably distinguish decoration from substantive uses in a timely fashion. In another pilot version of the WQI, we tried to identify both the presence of an activity as well as the intention for the activity to take place. In some wikis teachers indicate that certain behaviors are supposed to happen. For instance, a teacher might assign students to comment on each other’s work. We attempted to measure both when an activity actually happened and when a teacher intended for the activity to happen. Measuring teacher intent, however, quickly devolved into an exercise in parsing and mind-reading with low reliability, and we abandoned the effort. While refining the item categories, we also refined our decision rules for each item. We found early on that long decision rules that listed many examples of the presence and absence of a behavior led to disagreement. When decision rules listed many specific examples, some coders only looked for those examples while others looked for the general principle. Based on this experience, for each item we wrote relatively short decision rules that focused on the general principle without many examples. We also experimented with phrasing our decision rules as questions, but we found it more effective to define decision rules as pairs of declarative statements describing the presence and the absence of the behavior. We still use the “question format” in publications as a summary of our instrument, but coders do not use the questions. Thus, through numerous rounds of pilot testing, refinement, and iteration, we settled upon a near-final version of the WQI. In our last round of pilot coding, before we began training a new set of research assistants, we had two senior research assistants code a set of new wikis with the instrument. 
Afterwards, they sat down to discuss their disagreements, and we used these points of disagreement to make additional refinements to our decision rules. We also used some of these difficult wikis in our training set for new research assistants, to give them a sense of some of the challenges of coding wikis consistently. When we started the first round of wiki coding, we had 25 items in four subdomains. There were two differences between that version of the WQI and the one that we reported in our early publications. In the original specification of the WQI, the participation subdomain included six items: Course Materials, Information Gateway, Contribution, Individual Page, Shared Page, and Student Ownership. In the complex communication subdomain, the WQI included the present seven items as well as one item for Beyond Classroom Communication, which evaluated whether students from more than a single classroom interacted on the wiki. We changed these items after coding the wikis for our first study and using principal components analysis to determine whether our theorized subdomains in fact clustered together. We made two changes to the instrument based on our cluster analysis. First, we deleted the Beyond Classroom Communication item. This behavior was so rare that the item artificially inflated our overall interrater agreement (it is easy to agree about something that never happens), and it did not cohere well with the other items in the complex communication category. Second, we separated the Course Materials and Information Gateway items out of the participation subdomain. Theoretically, the reason to include them in the participation subdomain was that they represented basic ways for students to interact with the wiki.
However, since many wikis consisted only of students engaging with the wiki by viewing course materials and following links, principal components analysis showed that wikis with positive scores for these two items tended to score 0 in all other categories. As a result, we created a fifth subdomain, Information Consumption, based on our empirical data, which included our two items for Course Materials and Information Gateway. At this point, we expect the current version of the WQI, with 24 items in five subdomains, to remain stable as we continue our data analysis on additional wiki samples.
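The item-clustering pattern described above can be illustrated in miniature. The sketch below uses simulated item codes (not our study's data): when "consumption-only" wikis score positively on two items and zero elsewhere, the first principal component's loadings separate those two items from the rest, the same kind of pattern that motivated splitting off a separate subdomain.

```python
# Minimal sketch, with simulated item codes (not our study's data), of how
# principal components analysis can reveal that two items cluster apart.
import numpy as np

rng = np.random.default_rng(0)
n_wikis = 200

# Simulate two kinds of wikis: "consumption-only" wikis score 1 on the
# first two items and 0 elsewhere; the remaining wikis score 0 on those
# two items and sporadically on the other three.
consumption_only = rng.random(n_wikis) < 0.5
X = np.zeros((n_wikis, 5))
X[consumption_only, :2] = 1
X[~consumption_only, 2:] = rng.random((int((~consumption_only).sum()), 3)) < 0.7

# PCA via eigendecomposition of the item covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]  # loadings on the first component

# The first two items load together, with the opposite sign from the rest,
# indicating that they form their own cluster.
print(np.sign(pc1[0]) == np.sign(pc1[1]))
print(np.sign(pc1[0]) != np.sign(pc1[2]))
```

The eigendecomposition here is a bare-bones stand-in for a full PCA; in practice a library routine with explained-variance output would be used, but the loading pattern is the point of the illustration.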
Summarizing the Design Process for the Wiki Quality Instrument

Our process of instrument design included six major steps:
1) Defining a theoretical framework for wiki quality based on the literature regarding 21st century skills
2) Conducting qualitative research with wiki-using teachers and students to determine how they defined and assessed wiki quality
3) Conducting a literature review of efforts to measure quality in online learning environments in order to assess whether existing items, scales, or instruments could be integrated or adapted for our purposes
4) Conducting several rounds of open coding of wiki learning environments in order to develop a taxonomy of common patterns of practice on wikis
5) Conducting multiple rounds of pilot testing to test different items, scales, and decision rules
6) Making final revisions to the instrument, after data collection and analysis, based on cluster and principal components analysis
Designing this instrument has been a balancing act. On the one hand, we sought to identify important indicators of potential opportunities for 21st century skill development. On the other hand, in order to investigate wikis at scale, we had to ensure that the indicators we chose could be evaluated reliably and relatively quickly. The WQI was designed to be used in a research program where we make thousands of evaluations by examining hundreds of wikis on multiple occasions. It is also designed for evaluating a very diverse population of wiki learning environments spanning all subjects and grade levels. We believe that this foundational instrument can be refined and improved to be even more useful, valid, and nuanced in evaluating more specific subpopulations of wiki learning environments.
 See Part II for analyses of actual coding times; 30 minutes was our target.
 Some wikis are created and then never viewed at all by the creator, and when a coder visits the URL of one of these wikis they receive an error message. Some wikis are created and then viewed by the creator, and our raters could view these, even though they were unchanged.
 Sample sheets are available by request from the authors. We have not posted them here since we have decided not to repost URLs of wikis from our study.
 We have experimented with developing computational tools for determining a wiki’s creation date. We have found that a small number of districts and schools have institutional wiki creation processes. In these cases, API calls to the PBworks data warehouse for the wiki creation date can return dates for when a group of wiki subdomains are named and reserved, rather than when the wiki is actually first generated. Thus we manually check each wiki creation date.
 The Recent Activity link shows links by month and day but not by year, which can cause confusion when wikis have not been edited for several years. A review of the page histories, described in the following paragraphs, can resolve this potential confusion.