



David Eddy Spicer, Roland Stark, and Martha Stone Wiske from WIDE World describe their process of measuring learning in online professional development.

You get what you assess. This common-sense maxim points to the perils of simple, “dumbing-down” measures that can reduce learning to jumping through hoops. It also highlights the promise of assessment as a vehicle for driving learning forward. The need to put an assessment system in place is particularly acute in the burgeoning field of online teacher professional development. Emerging standards give a prominent role to assessment as critical to educational effectiveness.1 Yet new views of teaching and learning make clear that, if assessment is truly to support instruction, assessment strategies must be integral to students' learning activities.2

How, then, do you create a system that is as useful to learners as it is to instructors, program managers, and those on the outside who want evidence of learning? Online environments for professional learning offer challenges to and opportunities for striking this balance. One big challenge in online environments is getting a geographically dispersed instructional team to agree on what constitutes valid and fair assessment; one big opportunity is the chance to assess student work that is well-documented and largely text-based, with a trail of evidence from the beginning to the end of the learning process.

A team from the WIDE World online professional development program at the Harvard Graduate School of Education (wideworld.pz.harvard.edu) developed a three-step process that effectively negotiated this balancing act with educators enrolled in our online courses. The team, including course developers, instructors, program managers, and evaluators (with some of us wearing more than one hat), proceeded in three broad steps. First, we helped instructors sharpen course goals. Next, we developed performance assessments that included rubrics to guide both learners and the instructional team as they progressed toward these goals. Finally, we devised a testable system for applying the rubrics to assess and score a key course assignment.

This process generated a global appraisal of program results as well as a systematic assessment of individuals' performance in the courses. Summarized in this way, this three-step process might appear to be a straightforward march. In practice, it's more of a tango, with bold moves forward toward greater definition of instructional intent, then back, as each successive step demands new, creative ways of maintaining balance and rhythm.

We chose three courses with which to test this assessment process. All three were designed using a research-based pedagogical framework known as Teaching for Understanding (TfU).3 Two of the three were courses well-known to the WIDE World team, having been taught at the Harvard Graduate School of Education for 4 years. One of the courses provided a general introduction to the Teaching for Understanding framework through planning and revising a curriculum unit. Another used the framework in conjunction with new technologies to help teachers meet the requirements of district and state curriculum standards. The third course was new.

The Teaching for Understanding framework, on which all three courses drew, puts particular emphasis on honing instructional goals, identifying key performances (i.e., learning activities aligned with those goals), and linking the goals and performances through ongoing assessment. While the Teaching for Understanding framework guided the courses, course delivery relied on networked technologies to help K–12 educators apply acquired knowledge in their everyday work. Networked technologies, like e-mail and Web-based discussion forums, offer many ways to help build a community of learners by facilitating dialogue and the exchange of ideas across distance. Participants discuss goals, exchange resources, collaboratively develop ideas, and give constructive feedback. Assessment is an integral component of the networked technologies approach.

Networked technologies are used in all WIDE World classes. Those enrolled in our courses—classroom teachers, curriculum developers, professional development staff, and school administrators from around the world—are placed in discussion groups of 10 to 20 people and given small-group and individualized feedback on a regular basis by online teaching assistants or “coaches.” Participants work on creating lesson plan designs and sharing them via specifically designed online tools; feedback from the coaches is supplemented by exercises to stimulate self- and peer reflection. The integrated approach to assessment used with our online learners in turn provides a model for strategies they can use with their own students.

In the three above-mentioned courses, our three-step assessment tango helped us move from an ad hoc approach to assessment in each course to a systematic approach that was flexible enough to meet the needs of every course. This meant finding our collaborative rhythm in the first step, making bold moves forward in the second step to tighten important instructional links, and finally, “going quantitative” by using numbers to provide a fair measure of outcomes in the third step.

The Three-Step Assessment Tango

First Step: Sharpening Course Goals and Key Performances
In our initial conversations about participant assessment, even our seasoned instructors were apprehensive about offering their courses for collaborative scrutiny. To manage this tension, we chose a Tuning Protocol—a tested approach, designed by the Coalition of Essential Schools, to sharing curricular work.4 The structure of this Tuning Protocol succeeded in reducing defensiveness in our conversations. Together with each course instructor, the WIDE World assessment team conducted an assessment inventory (reviewing the syllabus, course materials, and the work that participants would attempt) so that we could clarify the connection between the goals the instructor had identified and the things learners were asked to accomplish.

During each Tuning Protocol, our assessment inventory focused on these questions:

  1. How do the overarching instructional goals relate to key learning activities?
  2. For these key activities, what are the ways in which (a) learners assess themselves, (b) learners assess one another, and (c) coaches assess learners?
  3. What difficulties have occurred, or are anticipated, in further developing an assessment approach that provides a clear measure of impact on learners?

Second Step: Tightening Links Among Goals, Performance, and Assessment
Effective assessment, whether online or face-to-face, gives learners clear criteria, frequent feedback, and opportunities for reflection from the beginning to the end of their coursework. The three courses already employed several formal methods for providing feedback on certain assignments, including rubrics that helped sharpen learners' work and clarify the feedback process. The assessment inventory that the team conducted with instructors in step one helped the instructors clarify the criteria in the rubrics in step two, while the online discussions that instructors conducted with their coaches helped them refine these criteria further. This process led to the creation within each course of a more global rubric or reflection guide, which gave detailed instructions to learners as they completed assignments and worked toward a cumulative course product.

Asking instructors to develop and commit to such assessment instruments posed certain challenges particular to distance learning. Materials posted online cannot be recalled or revised as easily as instructions given in face-to-face classrooms. Other factors—including the prospect of using these tools to evaluate individual work and the overall success of the course, and the introduction of a more explicit assessment approach—also risked arousing some anxiety in instructors and students. However, by prompting explicit discussion of course goals and their relation to the criteria, the reflection guides clarified goals and assessment procedures for each course's team.

Checking for Reliability



Scores, ratings, and grades mean little if they are assigned unpredictably. Most classroom teachers have struggled with the tendency to judge a paper a bit differently depending on whether it is 1st or 31st in the stack, on the time of day—or on their caffeine level! Such inconsistencies cut into intrarater reliability. Most educational programs are even more concerned with interrater reliability, holding that scores are worthwhile only if different judges assign them similarly. It is natural to measure interrater similarity by simply counting the percentage of cases in which different raters agree. However, this common indicator can be surprisingly deceptive, often over- or underestimating reliability. WIDE World measures both intra- and interrater reliability, with additional checks on judges' consistency (indicated by correlation) and agreement (indicated by Cohen's kappa statistic).*

* More information on reliability in performance assessment can be found in: Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Boston: Allyn and Bacon; Herman, J. L., Aschbacher, P. R., & Winters, L. (1992). A practical guide to alternative assessment. Alexandria, VA: Association for Supervision and Curriculum Development; and, for greater detail, Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research and Evaluation, 9(4). Retrieved March 12, 2004, from http://pareonline.net/getvn.asp?v=9&n=4
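
To make these checks concrete, the following is a minimal sketch, in Python, of how the three indicators named in the sidebar—percent agreement, consistency (here Pearson correlation, as one common choice), and chance-corrected agreement (Cohen's kappa)—might be computed for two raters scoring the same set of projects on the 0–4 scale used in the scoring guides below. The scores are invented for illustration; this is not WIDE World's actual analysis code, and a real reliability check would use many more work samples.

```python
# Reliability checks for two raters scoring the same projects (illustrative data only).
from collections import Counter
from math import sqrt

rater_a = [4, 3, 3, 2, 4, 1, 0, 3, 2, 4]   # hypothetical scores from coach A
rater_b = [4, 3, 2, 2, 4, 1, 1, 3, 2, 3]   # hypothetical scores from coach B on the same work
n = len(rater_a)

# 1. Raw percent agreement: the share of projects both raters scored identically.
percent_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# 2. Consistency: Pearson correlation between the two sets of scores.
mean_a, mean_b = sum(rater_a) / n, sum(rater_b) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
var_a = sum((a - mean_a) ** 2 for a in rater_a)
var_b = sum((b - mean_b) ** 2 for b in rater_b)
correlation = cov / sqrt(var_a * var_b)

# 3. Agreement corrected for chance: Cohen's kappa.
#    kappa = (observed agreement - chance agreement) / (1 - chance agreement),
#    where chance agreement sums the products of each rater's marginal proportions.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
chance_agreement = sum(
    (counts_a[score] / n) * (counts_b[score] / n)
    for score in set(rater_a) | set(rater_b)
)
kappa = (percent_agreement - chance_agreement) / (1 - chance_agreement)

print(f"Percent agreement: {percent_agreement:.2f}")
print(f"Correlation (consistency): {correlation:.2f}")
print(f"Cohen's kappa (agreement): {kappa:.2f}")
```

Comparing the three numbers illustrates the sidebar's caution: two raters can show high raw agreement simply because most scores cluster in a few categories, which is exactly the inflation kappa corrects for.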

Third Step: Going Quantitative
Our final step, which proved to be the most challenging, was to help instructors translate each reflection guide into a scoring guide that could be used not just for ongoing assessment but to evaluate learners' final products. (See examples of scoring guides below.) The team worked on devising scoring guides with which all coaches could assign a consistent numerical rating to participant work as that work related to particular criteria. Again, instructors initially hesitated to distill evidence of learning down to a dozen or so criteria because they feared that a reductionist emphasis on obtaining a numeric score would derail learning.

Other hurdles also arose. Even seasoned educators who had written many classroom tests found themselves struggling with questions of how to allot more or fewer points to various components of the rubric, how to choose between holistic (“just size it up”) and analytic (part-by-part breakdown) scoring systems, and how many sections of the rubric required separate sets of directions. The fact that instructors' work was being openly scrutinized and that changes to the course could not simply be made “on the fly” raised anxiety. However, instructors soon saw that they could make the approach fit their own preferences for grading and scoring. Their coaches' enthusiasm for finding constructive ways to make the rubrics work also boosted instructors' confidence in the process.

Keeping the Beat
By the end of our first trial run, our team had developed a solid, systematic approach to participant assessment. Here, in condensed form, is a list of assessment fine-tuning steps on which we settled for the three courses and which we eventually used with all 10 courses we offered the following semester:

  1. Identify key performance(s) to be assessed throughout the course.
  2. Create a reflection guide for each of these pieces of work.
  3. Provide recommendations to raters (coaches or others who judge work) about how to use the guides to promote learning throughout the course.
  4. Identify a key performance to be evaluated at the end of the course.
  5. Create a scoring guide by selecting key criteria from the reflection guide that apply to the culminating performance.
  6. Present raters with the scoring guide and work samples to be used for practice scoring so all can be on the same page about how to use the guide.
  7. Conduct a pilot test with at least three raters using work samples that have not been discussed among raters.
  8. Check for reliability. (See sidebar.)

In the context of our three-step tango, these fine-tuning steps helped us to implement an effective approach that provides evidence of the learning process, while at the same time nurturing and driving learning forward.

Examples of Scoring Guides
Following are two examples of scoring guides (i.e., evaluation guides or rubrics) that might be designed by developers/instructors for coaches to use in assessing participant work. Please note that these guides are presented only to serve as examples. Neither was used in exactly this form in WIDE World's fall semester courses.

First, this example of a holistic scoring guide asks coaches to provide only one rating to summarize each section.

Differentiated Instruction Course Rubric—Instructions to Coaches:

  • Consider the submitted Work in Progress (WiP) separately with respect to each of the three sections of the scoring guide. Taking into account all of the section's criteria, but paying particular attention to the text in bold, provide a Section Score ranging from 0 to 4.
    • A rating of 0 indicates criteria that are not addressed at all or that are addressed poorly or inadequately. (If the project does not include any work by which it could be assessed, assign a “0” for the section. Participation, or the amount of work submitted, will be addressed separately from this assessment.)
    • A rating of 4 indicates work that provides varied and ample evidence of deep understanding of the ideas referred to in these criteria. It indicates work that shows careful thought about design, planning, and enactment.
  • Add up the three Section Scores to obtain a Raw Score.
  • Convert the Raw Score to an Overall Rating of 1, 2, or 3 using the following rule:
    • Raw Score 0–3 → Overall Rating 1
    • Raw Score 4–9 → Overall Rating 2
    • Raw Score 10–12 → Overall Rating 3
Name of project being assessed:
Overall Rating (from below): 1  2  3

Date of submission:
Coach (person submitting this form):
Date:

Understanding the TfU framework re: Differentiated Instruction (DI)
Section Score: 0  1  2  3  4

1. Understanding Goals focus on the essentials—understanding of key concepts, processes, uses of, and/or genres in the subject matter (often, but not always, as defined by national, state, or local standards, as well as by teacher's expertise).

2. Understanding Performances align with, address, and guide students' exploration, appreciation, and deepening understanding of Understanding Goals.

3. WiP incorporates ongoing assessment. Ongoing and culminating assessments align with goals and Understanding Performances, in that they support assessment of learners' understanding of the Understanding Goals.

4. The WiP's narrative reflects an understanding of the purpose and practical role of the Teaching for Understanding (TfU) framework. The WiP effectively uses the TfU framework in identifying and maintaining focus on “essential content, processes, and products.”


The Entry Point Approach
Section Score: 0  1  2  3  4

1. Entry points are used to develop learning experiences aimed directly at developing understanding of key concepts articulated by one or more of the Understanding Goals.

2. Entry point-based Understanding Performances are valid instantiations of the target entry point (e.g., an “aesthetic” entry point activity legitimately taps into and applies the aesthetic entry point).

3. Entry point-based learning experiences require students to engage actively and to think with and about content and concepts.

4. Learning experiences employ a range of entry points to the content (i.e., introductory or “messing about” experiences that invite students with varying backgrounds, experiences, and expertise to work thoughtfully with the content).


Grouping Strategies
Section Score: 0  1  2  3  4

1. Grouping strategies are identified based on DI goals (e.g., to address a particular student or student population, to address a range of ability levels in the topic, etc.).

2. Grouping strategies are well suited for the goals and purposes for which they were chosen.

3. Grouping strategies are well designed and align with the Understanding Performances in which they are applied.

Raw Score (sum of scores for the three sections):
Convert to Overall Rating: 0–3 → 1; 4–9 → 2; 10–12 → 3
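
The tallying rule above is simple enough to script. The following minimal sketch is not part of the WIDE World toolset—the function name and band layout are invented for illustration—but it shows how a raw score could be summed and converted to an Overall Rating; the same function covers the analytic guide that follows by passing its 11 criterion ratings and its 0–14/15–32/33–44 bands.

```python
# Sum a set of ratings and map the raw score onto an Overall Rating using
# (low, high, rating) bands. Default bands are the holistic guide's.
def overall_rating(scores, bands=((0, 3, 1), (4, 9, 2), (10, 12, 3))):
    """Return (raw score, overall rating) for a list of section or criterion scores."""
    raw = sum(scores)
    for low, high, rating in bands:
        if low <= raw <= high:
            return raw, rating
    raise ValueError(f"Raw score {raw} falls outside the defined bands")

# Holistic guide: three Section Scores, each 0-4.
print(overall_rating([3, 4, 2]))  # (9, 2)

# Analytic guide: eleven criterion ratings, each 0-4, with its own cut-points.
analytic_bands = ((0, 14, 1), (15, 32, 2), (33, 44, 3))
print(overall_rating([4, 3, 3, 2, 4, 3, 3, 2, 4, 3, 3], analytic_bands))  # (34, 3)
```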


Second, this analytical scoring guide asks for a separate rating for each criterion.

Instructions to Coaches

  • Rate each of the 11 criteria on a scale from 0 to 4 by writing in a number.
    • A rating of 0 indicates a criterion that is not addressed at all or that is addressed poorly or inadequately. (If, for certain criteria, the project does not include any work by which it could be assessed, assign a 0 for the applicable rows. Participation, or the amount of work submitted, will be addressed separately from this assessment.)
    • A rating of 4 indicates work that provides varied and ample evidence of deep understanding with respect to this criterion. It indicates work that shows careful thought about design, planning, and enactment.
  • Add up the 11 ratings to obtain a Raw Score.
  • Convert the Raw Score to an Overall Rating of 1, 2, or 3 using the following rule:
    • Raw Score 0–14 → Overall Rating 1
    • Raw Score 15–32 → Overall Rating 2
    • Raw Score 33–44 → Overall Rating 3
  • In this way, each of the three sections on the scoring guide is given approximately equal weight, with slightly more weight given to each of the first two.
Name of project being assessed:
Overall Rating (from below): 1  2  3

Date of submission:
Coach (person submitting this form):
Date:

Understanding the TfU framework re: DI
Rating for each criterion: 0, 1, 2, 3, 4
1. Understanding Goals focus on the essentials—understanding of key concepts, processes, uses of, and/or genres in the subject matter (often, but not always, as defined by national, state, or local standards, as well as by teacher's expertise).

 
2. Understanding Performances align with, address, and guide students' exploration, appreciation, and deepening understanding of Understanding Goals.

 
3. WiP incorporates ongoing assessment. Ongoing and culminating assessments align with goals and Understanding Performances, in that they support assessment of learners' understanding of the Understanding Goals.

 

4. The WiP's narrative reflects an understanding of the purpose and practical role of the Teaching for Understanding (TfU) framework. The WiP effectively uses the TfU framework in identifying and maintaining focus on “essential content, processes, and products.”


 
The Entry Point Approach
Rating for each criterion: 0, 1, 2, 3, 4
1. Entry points are used to develop learning experiences aimed directly at developing understanding of key concepts articulated by one or more of the Understanding Goals.

 
2. Entry point-based Understanding Performances are valid instantiations of the target entry point (e.g., an “aesthetic” entry point activity legitimately taps into and applies the aesthetic entry point).

 
3. Entry point-based learning experiences require students to engage actively and to think with and about content and concepts.

 
4. Learning experiences employ a range of entry points to the content (i.e., introductory or “messing about” experiences that invite students with varying backgrounds, experiences, and expertise to work thoughtfully with the content).


 
Grouping Strategies
Rating for each criterion: 0, 1, 2, 3, 4
1. Grouping strategies are identified based on DI goals (e.g., to address a particular student or student population, to address a range of ability levels in the topic, etc.).

 

2. Grouping strategies are well suited for the goals and purposes for which they were chosen.

 
3. Grouping strategies are well designed and align with the Understanding Performances in which they are applied.


 
Raw Score (sum of scores for the 11 criteria):
Convert to Overall Rating: 0–14 → 1; 15–32 → 2; 33–44 → 3

1 Institute for Higher Education Policy (2000, April). Quality on the line: Benchmarks for success in Internet-based distance education. Alexandria, VA: Association for Supervision and Curriculum Development; Massachusetts Department of Education (2003, November). Massachusetts recommended criteria for distance learning courses. Retrieved February 26, 2003, from http://www.doe.mass.edu/edtech/news03/distance_learning.pdf; National Staff Development Council (2001). E-learning for educators: Implementing the standards of staff development. Retrieved November 24, 2003, from http://www.nsdc.org/library/authors/e-learning.pdf
2 National Research Council (2000). How people learn: Brain, mind, experience, and school. Washington, DC: National Academy Press.
3 Wiske, M. S. (Ed.). (1998). Teaching for understanding (1st ed.). San Francisco: Jossey-Bass; Wiske, M.S., Rennebohm Franz, K., & Breit, L. (2004). Teaching for understanding with new techniques. San Francisco: Jossey-Bass.
4 Allen, D., & McDonald, J. (n.d.). The tuning protocol: A process for reflection on teacher and student work. Retrieved February 10, 2005, from http://www.essentialschools.org/cs/resources/view/cs_res54

David Eddy Spicer
Research Manager
Tel: 617-384-9869
Email: eddyspda@gse.harvard.edu

Roland Stark
Researcher/Statistician
Tel: 617-384-7841
Email: roland_stark@gse.harvard.edu

Martha Stone Wiske
Co-Principal Investigator
Tel: 617-495-9268
Email: wiskema@gse.harvard.edu

WIDE World
Harvard Graduate School of Education
14 Story Street, 5th floor
Cambridge, MA 02138
Website: wideworld.pz.harvard.edu


© 2016 Presidents and Fellows of Harvard College
Published by Harvard Family Research Project