The Harvard Family Research Project separated from the Harvard Graduate School of Education to become the Global Family Research Project as of January 1, 2017. It is no longer affiliated with Harvard University.

Heidi Rosenberg of Harvard Family Research Project and Helen Westmoreland of the Flamboyan Foundation spoke with several evaluators and program administrators about their experience evaluating the process of going to scale.

Harvard Family Research Project spoke with three evaluators (see box below) to discover how evaluation can inform and assess scaling efforts. They shared lessons from their experiences in evaluating programs as they went to scale. Themes that emerged from these conversations follow.

The ways in which programs think about scale are driven in part by their particular characteristics and activities. 
Scaling takes different forms, from replication of an established program model, to a deepening of program efforts to achieve more measurable impact, to scaling components that have demonstrated success. The approach taken depends in large part on the needs and goals of the program being scaled.

Replication. The replication model is traditionally used in the business sector and has been adapted for the nonprofit sector. In the case of Citizen Schools, the organization wished to scale its after school program across the country with the goal of faithfully replicating the original program model. Because the Policy Studies Associates (PSA) evaluation of the original program demonstrated a positive impact on student achievement, leadership at Citizen Schools felt the program as a whole was ready to expand beyond its home base.

Researchers Interviewed

Elizabeth Reisner, Policy Studies Associates (PSA)
Evaluator of Citizen Schools, which sponsors after school education programs for disadvantaged middle school youth. PSA conducted an impact study evaluation of the original program in Boston and an implementation evaluation of the national program.

William Penuel, Center for Technology, SRI International
Evaluator of the GLOBE Program, an international program in earth science and earth science education. SRI’s evaluation focused on the effects of state policies and professional development on the implementation of the science curriculum. 

Robin Galloway, Research Institute for Studies in Education, Iowa State University
Evaluator of the Iowa Parent Information and Resource Center (PIRC), a federally funded program that provides regional and statewide services to promote family engagement in education.

Ed Redalen, Iowa PIRC program director, and Ron Mirr, a consultant to the Iowa PIRC, also contributed.

Deepening of program efforts. Another way of framing scaling efforts relates to the depth of a program’s impact and reach. The Iowa PIRC team took this approach: Scaling involved reframing the program’s efforts from simply engaging in a wide variety of parent involvement activities to focusing on the degree to which the PIRC showed evidence of impact along a set of meaningful outcomes. The PIRC team reported that its early efforts sought to raise parent awareness, but that program administrators became dissatisfied with the lack of measurable impact. As Mirr commented, “Our scaling went from a shallow understanding of parent involvement activities to committing to change in depth over time.”

Identification of core components. Not all elements of a program can, or should, be scaled, since program components are often context-specific and do not translate well to different locations. In these instances, the most promising practices are identified and selectively scaled. SRI’s evaluation of the GLOBE Program identified successful professional development practices among widespread, highly variable program sites; these practices were then scaled across the program, thus increasing the number of sites that adopted these components. Penuel noted, “The GLOBE Program had wide variability in the success of its local partners. As part of our evaluation, we went in to help them understand, through comparative case study analyses of successful partners, the factors of professional development that were associated with higher levels of implementation.”

Different phases of the scaling process require different evaluation approaches.
To evaluate scale effectively, researchers need to choose methods and approaches based on the stage of the scaling process and determine how the data will be most useful to the program at each stage. During preparation, evaluation can help identify what program elements need to be in place in order to scale. Evaluation efforts may shift depending on the scaling approach selected. Once programs have gone to scale, researchers can use evaluation data to assess how well the program has scaled.

Using evaluation data to decide what elements to scale. Base-level evaluation data can help researchers identify the program elements that must be in place for successful program expansion. Reisner used the initial Boston-area impact study to identify the key components of Citizen Schools that affected student success. Program staff analyzed the contexts and resources necessary to create those core components and built that understanding into a consideration of where to implement program services. SRI’s Penuel reported that evaluating widely scaled, field-building program efforts requires identifying the common elements underlying program success: “It is not particularly informative to say simply that the context matters and that the context is messy. If we are thinking about field building in this area, we have to assume that there are some regularities across contexts that matter. For example, it matters that teachers perceive that the professional development and the program that you are trying to implement is coherent with the school and district and state goals for learning.” Once researchers have identified those regularities, or common elements, Penuel pointed out, the next step is to ask, “How do I measure that? And how do I fit a model that goes across many schools, in order to try to test my conjecture about the importance of that factor?”

Tailoring evaluation approaches to the scaling strategy. In Iowa, the PIRC evaluation started by measuring program inputs (i.e., the services and resources provided). After deciding to shift their emphasis toward deepening their services, PIRC administrators and the evaluator examined their qualitative data to identify “champion” teams within their Sustaining Parent Involvement Network (SPIN) and targeted those sites for the initial scaling. This shift from counting program inputs to assessing depth of impact required program administrators to work closely with their evaluator; Galloway held frequent meetings with the PIRC team to help its members plan how their proposed scaling efforts and points of impact could be operationalized and evaluated to show effect.

Using evaluation data to assess the scaled program. Once a program has gone to scale, the evaluation activities shift to examining how the scaling process is working. The implementation study of the expanded national Citizen Schools program allowed the organization to assess implementation consistency across sites and to develop a more cohesive sense of program identity. The positive findings from both the impact study of the initial site and the follow-up study of the scaled program encouraged Citizen Schools to consider launching a rigorous experimental design impact study. As Reisner noted, “Our evaluation describes the similarity and diversity across sites, and the news is pretty good in terms of the consistency of implementation nationally. If the program just varied all over the map, then it would be premature to conduct a very rigorous impact evaluation of the program.” 

Effective data collection systems are crucial to successful scaling and evaluation.
Reisner and the Iowa PIRC team discussed the importance of strong data collection efforts—sites must be willing to provide the necessary data, and evaluators need an appropriate infrastructure to gather and disseminate data. As Reisner observed, “Getting the cooperation of the school system for data collection and research is a nontrivial matter. It is much easier to do it on the front end than to go back and have to fill in after the fact.” 

Structures and norms for collecting data are important for any evaluation, but are especially critical for scaling efforts. Proper data collection and documentation help lay the foundation to assess a program’s progress and reveal which elements of program success should be scaled.

Programs need to ensure that their sites are collecting the same core set of data in a coordinated manner. The Iowa PIRC team’s scaling efforts, for example, involved a shift from disparate data collection methods among SPIN sites to procedures that everyone in the network followed. Mirr portrayed it as a “shift from essentially independent or no data to common data that everybody collects the same way.”

Scaling requires significant time and resources that may entail compromises or even sacrifices in other areas.
SRI’s Penuel noted a tendency to view educational and social service scaling as a process akin to corporate “software start-up” scaling models, where costs are concentrated on development, and dissemination or replication becomes much less costly and resource-intensive over time. “In my view, educational scaling is not at all like software; it is like services. If you really want something to scale, there is no point at which it becomes a whole lot less resource-intensive.”

In Penuel’s view, this resource-intensiveness creates a tension between the demands of policymakers—whose support can impact programs’ funding and scaling opportunities—and the realities of careful, authentic scaling of educational programs: “Policymakers are always looking for ways to scale at low cost and the point at which they are going to be able to completely transfer ownership of the innovation to the field of practice; in only rare cases does that ever happen. And usually it happens in cases where there is an easy fit to the context. And if we are trying to make big changes, an easy fit to the context is not going to have a dramatic effect on practice. The things that have the potential to change are game-changers—they push against the main culture of education. They do things really differently and therefore require ongoing efforts to organize.” 

Noting that scaling requires significant resources, Reisner pointed out that successful evaluations and program scaling efforts can lead to difficult choices. “Citizen Schools is in the process of determining how they are going to move forward in terms of the next phase of impact evaluation.” The sequencing of the program’s evaluations thus far has, as Reisner put it, “positioned Citizen Schools to be able to make some very difficult choices in their investments. On the one hand, they would probably like to use that money to expand to more sites or to increase the number of sites they have in some of their replication communities. On the other hand, it would be very helpful to their long-term growth to conduct that experimental design impact study. It is a tough call, but they have the information they need to make that call.” Thus, if Citizen Schools chooses to emphasize further replication, it could broaden its reach and serve more children, but it would be missing an opportunity to further solidify its program credibility and proof of impact. 

Related Resources
Penuel, W. R., Fishman, B. J., Yamaguchi, R., & Gallagher, L. P. (2007). What makes professional development effective? Strategies that foster curriculum implementation. American Educational Research Journal, 44(4), 921–958.

Vile, J. D., Arcaira, E., & Reisner, E. R. (2009). Progress toward high school graduation: Citizen Schools’ youth outcomes in Boston. Washington, DC: Policy Studies Associates. Available at:

The Iowa PIRC’s experience demonstrates that deepening program service components makes meaningful impact more likely, but can also lead to a retraction of program reach, since the emphasis falls on a particular site (or small set of sites). Depending on program service mandates, this trade-off can have implications for how outside observers assess the efficacy of the program; administrators must make difficult decisions about whether to strive for broad reach, which might not yield much in the way of measurable impact, or to focus on meaningful service depth, which provides less “ground cover” and thus less visibility for the program as a whole.

Heidi Rosenberg, Ph.D.
Senior Research Analyst, Harvard Family Research Project

Helen Westmoreland
Director of Program Quality
Flamboyan Foundation

© 2016 President and Fellows of Harvard College
Published by Harvard Family Research Project