ELearning/Course design/Learning activities/Assessment

From Encyclopedia of Science and Technology

Preinstruction | Content | Application | Assessment | Follow-through


<youtube>_iv8A1pHNYA</youtube>
1. Why is measuring learning so difficult?

As this video illustrates, there are many issues involved with, and ways of, assessing learning. Though not the only approach, this page provides guidelines for creating formative and summative assessments. Authentic assessment can be both formative and summative, and so is addressed separately. Common to all assessments is the need to score them in some way.

Scoring criteria

Scoring and grading criteria describe the standards against which performance is measured. They are essential for clarifying expectations and assigning scores for assignments, projects, and tests. There are two basic approaches to establishing and communicating standards: objective scoring and qualitative scoring. Within each approach, standards can focus on process and/or outcomes. Process standards focus on following established procedures such as "how to assemble a circuit board." Outcome standards focus on the finished product or results - a completed report or improved conditions, for example.

Two ongoing concerns around assessments are validity and reliability: Does the assessment measure what it purports to measure? And do we arrive at the same conclusions over time and between raters?

Objective scoring

Objective scoring assigns a numerical score to a limited number of choices - multiple-choice and fill-in-the-blank questions. The scoring is definitely objective, as far as it goes. But this conclusion does not consider many other, often subjective, factors. "Is this important enough to test on?" is one of my favorites. Item construction is also important, as discussed below. Regardless, the scoring is routine enough to be handled by a machine following very specific algorithms (rules for the program). Humans can do it too, but it's very inefficient work for them.

It's also the easiest way to grade, and the least able to probe for deep learning. It works best with declarative knowledge and mathematical calculations.
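To make the idea concrete, here is a minimal sketch of the kind of answer-key scoring such a machine applies. The question IDs, answer key, and matching rules are hypothetical examples for illustration, not the behavior of any particular testing system.

  # A minimal sketch of rule-based objective scoring, as described above.
  # Question IDs, the answer key, and the responses are hypothetical.
  ANSWER_KEY = {
      "q1": "b",                # multiple choice: correct option
      "q2": "mitochondria",     # fill-in-the-blank: expected term
      "q3": "c",
  }

  def score_objective(responses):
      """Return the number of items answered correctly.

      Fill-in-the-blank answers are compared case-insensitively after
      trimming whitespace; multiple-choice answers must match exactly.
      """
      score = 0
      for item, correct in ANSWER_KEY.items():
          given = responses.get(item, "").strip().lower()
          if given == correct.lower():
              score += 1
      return score

  print(score_objective({"q1": "b", "q2": " Mitochondria ", "q3": "a"}))  # -> 2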

Developers continue to investigate machine learning as a means of scoring written responses. At this point, the efforts are not attracting much attention in the literature or on the web. (needs expansion)

Qualitative scoring

The other forms of scoring involve judgment to varying degrees, based on qualitative criteria. Therefore, they do not yield an absolutely infallible decision. Here, the "answer" lies along continua of dimensions, each of which must be judged and then merged into a final result. Gradations divide each continuum into a manageable number of parts. Because these tools involve judgment, it is advisable to limit the number of gradations: the more gradations, the more difficult it is to tell them apart.

Holistic scoring

In holistic scoring, the rater makes an overall judgment about the quality of performance. As such, the approach is the least rigorous, the least valid, the least reliable, and the most susceptible to idiosyncratic ratings. Even so, a rubric or description of performance levels is essential for claiming any level of objectivity. Saying "this feels like an A" is not holistic grading.

Holistic scoring is best suited to tasks that can be performed or evaluated as a whole and/or those that do not require extensive feedback. Generally, small, low-stakes assignments are most appropriate for this approach.

Checklists

Checklists increase the objectivity of scoring by describing required and desired components of a performance or product. Generally, a yes/no decision is all that is required of raters - the component is evident or it is not. This approach does not reflect the reality of most assessment situations, where some items can be judged in this manner but others involve degrees of "match" between criteria and actual performance that must be judged. Rater comments can serve to explain and justify the rating, but the method is generally low in reliability due to human factors. Stufflebeam (2004) offers a checklist development procedure based on his 30 years of practice.
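The yes/no logic described above can be represented very simply. The sketch below is illustrative only: the required and desired components, and the point values attached to them, are hypothetical.

  # A minimal sketch of checklist scoring: each component gets a yes/no
  # decision. Required components gate acceptance; desired components add
  # points. Component names and point values are hypothetical.
  REQUIRED = ["title page", "references cited", "within word limit"]
  DESIRED = {"clear thesis": 2, "counterargument addressed": 2, "original examples": 1}

  def score_checklist(observed):
      """Return (all required components present?, points earned for desired ones)."""
      passes = all(item in observed for item in REQUIRED)
      points = sum(pts for item, pts in DESIRED.items() if item in observed)
      return passes, points

  print(score_checklist({"title page", "references cited", "within word limit", "clear thesis"}))
  # -> (True, 2)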

Rubrics

"The most powerful grading device since the invention of red ink" (Stevens & Levi, 2005). Rubrics increase validity and reliability by dividing expectations into component dimensions and providing a detailed description of what constitutes levels of performance for each of those dimensions. As such, they are "analytic rubrics." Note that their real value lies in the ability to vividly describe levels of performance for each dimension. Without this, they are of no more value, and perhaps less, than holistic grading. Figure 1 presents an example of poorly constructed rubrics and Figure 2 represents a more valid one. Use the links below for additional examples. Another value-added feature that can be included in rubrics is assigning different weights to the dimensions based on their relative importance to the final product or outcome. Connecting with the audience, for example, is more important to successful presentations than slick PowerPoint slides. We can also note that best results occur when the instructor discusses the rubric prior to students beginning the assignment.

1. A poorly constructed rubric.

2. A more complete rubric.

The advantages of using well-written rubrics are many for both scorers and learners (Stevens & Levi, 2005):

  • Save grading time. Anglin et al. (2008) found up to 350% faster grading with computer-assisted grading rubrics, as provided in most learning management systems.
  • Quicker turnaround of assignments. Multiple studies tell us that the timeliness of feedback is directly related to its effectiveness. The longer the delay, the less the effect of feedback on performance.
  • Provide valuable feedback without the need for extensive writing. Refer to the feedback article for a look at problems with feedback.
  • Grades are assigned more objectively and equitably.
  • Objectively monitor student progress over time with similar assignments using similar criteria.
  • Communicate problem areas for remediation.
  • Learners have a much clearer picture of expectations and the level of effort necessary to achieve a particular grade, thus increasing self monitoring.
  • Provide more useful feedback compared to written comments, which tend to be more evaluative than instructive. Refer again to problems with feedback.
  • Students perceive that rubrics increase grading transparency and fairness (Reddy & Andrade, 2009).

Follow these steps to create well-constructed rubrics:

  1. Reflect on what you want from students in the assignment.
  2. List the details of the assignment and what you want to see from students.
  3. Group similar expectations and assign a name to each group. This name becomes a dimension of the assignment. There should generally be 3 - 5 dimensions for each assignment, but this will vary.
  4. Enter the dimensions down the left column in the rubric grid, below. Determine the relative weight for each dimension (e.g., X1, X2, X3).
  5. Decide on the number of performance levels, label them and enter along the top row of the rubric grid. Again, 3 - 5 performance levels are most appropriate. Fewer than three levels defeats the granularity advantage of rubrics; more than five and we reach the outer limits of human ability to discriminate differences.
  6. Describe each level of performance for each dimension. It's often easiest to first describe the "best possible" criteria, the "less than acceptable," and finally the "competent" and "unacceptable" levels. Begin with key words and phrases describing evidence of performance before you attempt to put them into complete sentences.

Especially for constructivist and connectivist approaches to teaching, it is helpful to involve students in rubric construction. This promotes understanding, ownership, and satisfaction. Use the same approach as above, or you can ask for student input and then create the final product for each step.

Table 1. The Grading Rubric Grid (each cell contains a description of performance at that level and the points awarded)

Dimensions & Weight ⇓ / Performance Levels ⇒ | Exceptional | Successful | Needs Improvement | Unacceptable
Dimension 1 (X3) | Description (points) | Description (points) | Description (points) | Description (points)
Dimension 2 (X2) | Description (points) | Description (points) | Description (points) | Description (points)
Dimension 3 (X2) | Description (points) | Description (points) | Description (points) | Description (points)
Dimension 4 (X1) | Description (points) | Description (points) | Description (points) | Description (points)
Dimension 5 (X1) | Description (points) | Description (points) | Description (points) | Description (points)
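To make the weighting in Table 1 concrete, the sketch below converts per-dimension level ratings into a weighted score. The dimension names, weights, and points per level are hypothetical examples, not a recommended scheme.

  # A minimal sketch of weighted analytic-rubric scoring, following the
  # layout of Table 1. Dimension names, weights, and the points attached
  # to each performance level are hypothetical.
  LEVEL_POINTS = {"Exceptional": 4, "Successful": 3, "Needs Improvement": 2, "Unacceptable": 1}
  DIMENSIONS = {                 # dimension -> weight (the X1/X2/X3 multipliers)
      "Content accuracy": 3,
      "Organization": 2,
      "Audience connection": 2,
      "Mechanics": 1,
  }

  def rubric_score(ratings):
      """Convert per-dimension level ratings into a weighted percentage."""
      earned = sum(DIMENSIONS[d] * LEVEL_POINTS[ratings[d]] for d in DIMENSIONS)
      possible = sum(w * max(LEVEL_POINTS.values()) for w in DIMENSIONS.values())
      return 100 * earned / possible

  print(round(rubric_score({
      "Content accuracy": "Exceptional",
      "Organization": "Successful",
      "Audience connection": "Needs Improvement",
      "Mechanics": "Successful",
  }), 1))  # -> 78.1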


Creating formative assessments

Important: Because feedback is an essential element of formative assessment, be sure to read the feedback article in the Teaching Online section in conjunction with this one.

Formative assessments, also referred to as embedded assessment (Hunter & Murrin, 2010), include assignments, projects, self-tests, and scaffolded peer- and self-review of performance. Discussions are also useful for formative assessment. They respond to student learning on an ongoing basis - providing feedback that is timely and can be acted upon to improve learning and performance.

There is a strong body of evidence that formative assessment practices raise learning outcomes (Davison, 2011). Formative assessment helps clarify expectations, develops reflection and self-assessment, gives high-quality information to students about their learning, encourages interaction with the instructor and peers, boosts positive motivation, provides opportunities to close the gap between current and desired performance, and provides teachers with information they can use to improve instruction. Several authors advocate for the use of formative assessment exclusively, rejecting online exams as overwhelming temptations to cheat (Gilman, 2010; Harmon, 2010; Jaffee, 2012).

When creating formative online assignments, clear, explicit, step-by-step instructions are essential. We can't read confusion on student faces as we do in the classroom, so we must anticipate questions and provide the answers upfront. Gilman (2010) provides typical examples such as "briefly identify" and "resources." What constitutes "brief" for you? Are there acceptable and unacceptable resources? Says Gilman, "I know that seems like spoon-feeding and, believe me, I would never consider doing that in a face-to-face course. But somehow it hits the right note online because it cuts down on the confusion for students." Beaumont et al. (2011) identify four elements to include in your preparatory guidance: the goals of the assignment, an explanation of the grading criteria (referencing a grading rubric), an opportunity for students to ask questions, and an exemplary sample of the finished product (also advocated by Gilman).

Assignments

To be formative, assignments are best constructed as multi-stage progressions. This likely means that the involved learning goals reach beyond one or two modules. Honing research strategies, writing skills, and critical thinking are examples. The typical setup begins with an initial assignment that is completed, reviewed by the instructor or peers using a grading rubric, and returned to the student. The student can also be asked to review their own work and compare it with the instructor's or peers' evaluation.

The student is expected to use this feedback to either redraft and resubmit the assignment for grading, or use the feedback when completing the next, more demanding, assignment, or both. The feedback article provides a complete description of this approach.

Group assignments and projects

Group assignments such as case studies provide students the opportunity to share and be exposed to multiple perspectives, to negotiate, to ask and answer questions, and to work in teams. Projects, spanning half the course timeline or more, involve phases of accomplishment such as planning, organizing, managing, and controlling, as in project management. Milestone dates are established by which each phase must be completed and reviewed before moving to the next. Projects provide the best opportunity for authentic assessment, including working in groups, making presentations, and receiving feedback from peers.

For more on the subject of group assignments and projects, refer to the Group Learning article in this module.

Student interviews

Meet briefly (face-to-face, telephone, live conferencing, synchronous chat) with each student and ask a few well-planned questions. Jot down notes based on what each student says and assign a grade based on their ability to explain concepts, etc.

Self/Practice tests

More accurately called retrieval practice, self-tests provide students training in the reconstruction of their knowledge and understanding, which itself causes improved learning when followed by feedback or reviewing the original material (Karpicke & Blunt, 2011). Roediger & Butler (2011) also report gains without feedback - as long as there is repeated practice (five to seven spaced practice retrievals prior to final testing is optimal). However, they also found that practice with feedback is best. Both sets of researchers conclude that repeated testing is more effective than repeated study.

We established in the Physical basis of learning unit that human recall is not simply "replaying the scene," but rather an active reconstruction of memories subject to interference and fading. We also learned that every time a memory is processed by the brain, synaptic connections grow stronger, reinforcing the memory. However, without the correcting power of review or feedback, retrieval errors become reinforced and extended.

The increased learning found by Karpicke & Blunt (2011) came from students recalling as much information as they could after a period of study, and not by using multiple-choice (recognition) questions. Can multiple-choice self-tests achieve the same end? Reviewing 35 studies of classroom testing, Marsh et al. (2007) conclude that yes, they also produce additional learning as long as they're accompanied by feedback, as did Roediger & Butler (2011). Otherwise, the same retrieval errors may be reinforced. Feedback after the student has completed the entire test produces better results than immediate feedback after each question. We note a unanimous conclusion among the three sources: recall is more powerful than recognition in increasing learning.

Retrieval is ultimately the process that makes new memories stick. Not only does it help learners remember the specific information they retrieved, it also improves retention for related information that was not tested (Paul, 2015).
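To show what "spaced practice retrievals" might look like on a calendar, here is a minimal scheduling sketch. The research above recommends five to seven retrievals before a final test; spacing them evenly across the study window is an assumption made purely for illustration, not a prescription from the sources.

  # A minimal sketch of scheduling spaced retrieval practice. The research
  # above recommends five to seven retrievals before a final test; even
  # spacing over the available window is assumed here for illustration.
  from datetime import date, timedelta

  def retrieval_schedule(start, final_test, retrievals=6):
      """Spread the requested number of practice retrievals evenly
      between the first study session and the final test date."""
      window = (final_test - start).days
      step = window / (retrievals + 1)
      return [start + timedelta(days=round(step * i)) for i in range(1, retrievals + 1)]

  # Prints six evenly spaced practice dates between the two (hypothetical) dates.
  for d in retrieval_schedule(date(2024, 9, 2), date(2024, 10, 14)):
      print(d)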

Interpolated testing

Inserting tests between content presentation (e.g., lecture segments) significantly improves cumulative test scores (Szpunar, 2017; Szpunar et al., 2016). Given the value of practice tests, this is not surprising. However, Szpunar also demonstrated other positive outcomes that may apply to practice tests as well:

  • Reduced mind-wandering during online lectures
  • Increased note-taking during lectures
  • Reduced anxiety toward final cumulative testing
  • Reduced perception of cognitive demand during cumulative testing
  • Increased retention of content during the final (untested) lecture segment

The experiments included diversionary activities between lecture segments and testing to add an element of interference. However, the tests, including the final, were scored and came less than ten minutes after the lecture, so it remains an open question whether the results apply to ungraded tests or to longer time spans. Still, the study adds to the accumulating evidence that testing immediately after periods of study significantly increases learning retention.

Scaffolded peer- and self-review

Research tells us that when multiple peers (5 to 6) review a student's work, (1) they are consistent with expert assessments (correlations of .70 to .94), (2) the feedback is written in language better understood by the student, and (3) the student is much more likely to make improvements based on the feedback (Cho & MacArthur, 2010; Cho & Wilson, 2006; Sadler & Good, 2006). Single peer reviews, on the other hand, result in few changes leading to improved scores, and are less reliable.

When using peer and self-assessment, it's important to include a structured feedback rubric (Cartney, 2010; Cho & MacArthur, 2010; Cho & Wilson, 2006; Jones, 2011). This is so for two reasons. First, students don't possess the knowledge and skill to make adequate judgments without this guidance. Second, the rubric provides the instruction that helps students learn as they look for the identified qualities in their own or peers' work. Using rubrics with category ratings (e.g., degrees of understanding: thorough, extensive, minimal, partial) rather than numerical ratings (1, 2, 3, 4) helps raters and ratees focus on the feedback (Gibbs, 2004).

Many students protest this approach to assessment and feedback, citing a number of interpersonal and capability issues. These issues are best dealt with upfront, encouraging students to express their doubts and then discussing the advantages of peer- and self-review (Cartney, 2010). Anonymous reviews also help reduce anxiety (Cho & MacArthur, 2010). Please refer to the Feedback article for a thorough discussion. In case you're anxious about the legality of students grading each other, the Supreme Court unanimously affirmed the practice in 2002 (Owasso Independent School District v. Falvo).

Discussion boards

Equivalent to classroom participation, student contribution to online discussions is considered by most instructors as an essential aspect of education, especially from the constructivist and connectivist perspectives where students are believed to build personal knowledge from their experiences. Course discussions contain rich information about student understanding of course concepts and assignments, and the resulting information provides invaluable feedback for instructors, allowing them to respond formatively to student concerns (Ma et al., 2011).

We begin the process of assessing participation by communicating expectations, reinforcing them through feedback and additional instruction, and assessing students against them. Rubrics have proven to be the most efficient and consistent method of assessing student participation and, together with individual feedback, providing guidance for future participation. Discussion rubrics can also be used for peer- and self-assessment (Sadler & Good, 2006).

Creating summative assessments

We discussed earlier that the primary purpose of summative assessments is to measure and document student learning, and that they can also serve a formative role when feedback is included. Summative assessments include quizzes, exams and tests, portfolios, and capstone projects.

Quizzes and tests

Considerations for constructing these include defining mastery, writing assessment items, the number of items necessary to test mastery, and types of items. Except where indicated, this material is from Dick, Carey & Carey (2005) and Smith & Ragan (1999). Additionally, we take a look at multiple-choice versus essay questions. First, a useful take on quizzes:

"Frequent quizzing forces students to stay current with the course by studying more regularly. Classroom studies have shown that students who received daily quizzes performed better than those who did not. Students who were frequently quizzed felt they had learned more and reported greater satisfaction with the course, despite (or perhaps because of) the greater effort they exerted" (Roediger & Butler, 2011). A study of 900 students at the University of Texas found that grading students on quizzes at the beginning of every class (seven questions for all students plus one personalized from a wrong answer to a previous quiz), rather than midterms or finals, increased both attendance and overall performance (Carey, 2013). The study also found a reduced achievement gap due to socio-economic background. "By forcing students to stay current with reading and paying attention in class, the quizzes also taught students a fundamental lesson about how to study."

Game-based assessment

Relatively little is known about game-based assessment, but initial studies indicate its validity and usefulness. Kiili & Ketamo (2017) found that sixth graders' scores on traditional paper-based and game-based math tests were significantly correlated. Additionally, "the results revealed that game-based assessment lowered test anxiety and increased engagement which is likely to decrease assessment bias caused by test anxiety. In addition, the results show that earlier playing experience and gender did not influence the game-based test score suggesting fairness of the game-based assessment approach."

Defining mastery

There are three basic approaches for defining mastery. None are explicitly preferred, but we suggest a choice be made prior to assessment construction.

  • Performance normally expected from the best learners (norm referenced).
  • The level of mastery required to be successful in real-life settings (criterion referenced).
  • Statistical calculation of the number of opportunities necessary for students to perform or demonstrate competence so that it is nearly impossible for correct responses to be the result of chance. This approach is addressed later.

Writing assessment items

Criteria

Criteria that test items should meet to be considered adequate:

  • First and foremost, items need to be congruent with the learning objectives, matching the behavior described in the objective.

  • Learners should never miss questions because of unfamiliar terms, context, assessment format, or topic unless you are specifically testing for transfer of learning.

  • All the information necessary to answer the question is included.

  • When testing for skills, a series of questions ranging from very easy to extremely difficult is more effective than one or two trick questions (e.g., using double negatives).

  • Multiple-choice items should never include more than one obvious wrong answer, with the others being under- and over-generalizations of the correct response.

Number of items to determine mastery

To statistically rule out the chances of success by guessing, use these suggestions:

  • If correct guessing is possible, include three or more parallel items for the same learning objective.
  • If guessing correctly is unlikely, include one or two items.
  • Declarative knowledge needs only one item.
  • When covering a wide range of declarative knowledge, select a random sample of instances. Don't try to cover everything.
  • When testing cognitive abilities like using concepts, applying rules, and solving problems, provide at least three opportunities to demonstrate competence.
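To see why three or more parallel items effectively rule out lucky guessing, consider four-option multiple-choice items. The short calculation below is purely illustrative and assumes the pass rule "all parallel items answered correctly."

  # Probability of "passing" a learning objective purely by guessing on
  # four-option multiple-choice items, for 1, 2, and 3 parallel items.
  # (Illustrative arithmetic; the pass rule assumed here is all items correct.)
  p_guess = 1 / 4
  for n_items in (1, 2, 3):
      print(n_items, "item(s):", round(p_guess ** n_items, 4))
  # 1 item(s): 0.25
  # 2 item(s): 0.0625
  # 3 item(s): 0.0156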

Types of items

Dick, Carey & Carey (2005) suggest particular types of items for different behaviors identified in your learning objectives. To choose among those that are adequate, consider factors such as response time required by learners, scoring effort required, the testing environment, and the probability of guessing the correct answer.

Table 2. Matching learning objectives and test item type (Dick, Carey & Carey, 2005)

Test item types (columns): Completion | Short Answer | Matching | Multiple Choice | Essay | Product | Live Performance

Learning objective behaviors (rows): State/Name, Define, Discriminate, Select, Locate, Evaluate/Judge, Solve, Develop, Construct, Generate, Operate/Perform, Choose (affective)

(The original table marks, for each behavior, the item types best suited to assessing it.)

Remember, "The farther removed the behavior in the assessment is from the behavior specified in the objective, the less accurate is the prediction that learners can or cannot perform the behavior described."

Sequencing test items

Ideally, test items should be ordered by objective regardless of the type of question, except that essay items should come last (Dick, Carey & Carey, 2005). The picture is complicated a bit for online assessment in that randomization is a key strategy for discouraging cheating. Learning management systems can accommodate randomization within categories, however, so it becomes a matter of categorizing your test items by objective and then directing the LMS to select a number of items from each category. Refer to the Course building section for specific instructions.

Item banks

Smith (2008) advocates for creating test item banks within the LMS with at least three, and preferably five, items for each objective. Her approach is to create an initial question for each objective and then create the alternatives. With the questions ready for use, you can then direct the LMS to select the appropriate number using the mastery guidelines above.
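Exactly how randomized selection from item banks is configured differs by LMS, so the sketch below shows the idea in a platform-neutral way. The objective names, bank contents, and per-objective draw count are hypothetical.

  # A minimal, LMS-agnostic sketch of drawing a randomized test from item
  # banks organized by learning objective. Bank contents and the number of
  # items drawn per objective are hypothetical.
  import random

  ITEM_BANK = {                      # objective -> pool of item IDs
      "define-key-terms": ["q01", "q02", "q03", "q04", "q05"],
      "apply-ohms-law":   ["q11", "q12", "q13", "q14"],
      "evaluate-sources": ["q21", "q22", "q23"],
  }
  ITEMS_PER_OBJECTIVE = 3            # per the mastery guidelines above

  def build_test(seed=None):
      """Randomly select items from each objective's bank, then shuffle the whole test."""
      rng = random.Random(seed)
      selected = []
      for pool in ITEM_BANK.values():
          selected.extend(rng.sample(pool, ITEMS_PER_OBJECTIVE))
      rng.shuffle(selected)
      return selected

  print(build_test(seed=42))   # nine item IDs, three per objective, in random order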

Assessing different types of knowledge

From Smith & Ragan (1999) and Kuechler & Simkin (2010):

Assess factual knowledge -

  • Using verbatim response options determines only whether the learner has memorized the item, whereas using paraphrased forms can tap learners' understanding of the information.

Assess conceptual knowledge by asking students to -

  • Recognize instances where the concept or principle applies.
  • Describe the concept or principle.
  • Explain why a given instance is or is not an example of a concept or principle.
  • Categorize instances as examples or nonexamples, with or without an explanation of their reasoning.
  • Produce their own examples of a concept or principle, with or without explanation.
  • Determine if the concept or principle has been correctly applied.
  • Recognize or describe unstated assumptions, identify motives.
  • Predict consequences of described actions based on the concept or principle.

Assess procedural knowledge by asking students to -

  • List the steps of a procedure.
  • Recognize situations where the procedure is and is not applicable.
  • Determine whether a procedure has been correctly applied.
  • Apply a procedure across a range of situations and difficulty levels.
  • Show their work as they solve problems.

Assess metacognitive knowledge and skill underlying task performance (from Bannert & Mengelkamp, 2008) by asking students to -

  • Describe the process they will or would take to solve a problem, investigate a topic, or diagnose a situation.
  • Correctly order a series of problem solving, investigation, or learning steps.
  • Record themselves as they think aloud while working through an issue, solving a problem or learning a specific piece of material.
  • Reflect on and write the reasons they chose their particular steps of problem solving, etc. The assessment can prompt students to reflect and record at specific steps along the way. This approach is most useful in training for metacognitive skills since it forces students to focus on their own thoughts and the metacognitive learning process.

Multiple-choice v. Constructed response

Historically, summative testing has relied heavily on multiple-choice (MC) questions despite the widely held belief that constructed response (CR) questions are a superior measure of student learning (Kuechler & Simkin, 2010). A basic question revolving around their use is the relationship, if any, that exists between results using the two methods, and the nature of that relationship. While there are variations of each, the essence of MC questions is they require learners to select from a small list of possibilities, and CR questions require test-takers to create their own answers.

Multiple choice questions can be graded quickly, consistently, and accurately. They are generally viewed as more objective, can cover a wider range of subjects ("shotgun"), with results returned instantly in the online environment. CR questions better test integrative skills requiring understanding of subject matter and context. They are better probes of breadth and depth of knowledge, and provide evidence of organizational abilities, language and reasoning skills. They are also more authentic when requiring problem solving encountered in actual work situations.

Contrary to early evidence, equivalence between the two item types has been fairly well established by more recent research. Peeple et al. (2010) found a significant correlation between MC and essay question scores (p<.01), with students consistently scoring higher on MC than essay tests (p<.01). In a 2003 meta-analysis of 67 studies, Rodriguez found "stem equivalency" to be an important determinant of test item parity: how closely the prompts resemble each other largely determines performance similarity. Gender differences have been consistent, with males performing better on MC questions and females doing better on CR items (Kuechler & Simkin, 2010; Everaert & Aurhter, 2012). Unwittingly, then, instructors who rely solely on multiple-choice items risk systematic gender bias in their testing practices.

Portfolio assessment

Portfolios stress the collection of student work accompanied by reflective commentary to provide evidence of learning achievements (Klenowski et al., 2006). They appear to be especially useful in helping students develop their metacognitive skills. Some believe portfolios are best used in assessing practice-oriented fields such as teaching, medicine, and architecture (Brown, 2003), but they have also been applied to mathematics (Reynolds, 2010), information technology (Tubaishat et al., 2009), and languages (Hung, 2012). Regardless, their use is often combined with test results to arrive at final grades. Black et al. (2011) argue that portfolios enhance validity by sampling learning in a variety of ways and on many occasions. They can also increase flexibility when students are allowed a degree of choice in selecting assignments and projects reflecting their interests. Of course, consistent standards, communicated through grading rubrics, are necessary to ensure projects are focused on established learning objectives.

Klenowski et al. (2006) provide three useful approaches to portfolio use:

Professional development records

This approach operates under the belief that "professional learning is most usefully focused on professionals' practice; professional expertise and knowledge are effectively generated by individual and collaborative research; and that dialogue about learning is an essential component in generating expertise and knowledge." This approach is especially effective at promoting meta-learning of individual approaches to learning, taking personal responsibility, and integrating learning into practice. Course participants are required to:

  • Identify a focus relating to professional practice.
  • Collect evidence of competencies and skills.
  • Reflect on professional and personal learning.
  • Incorporate a relevant literature review.
  • Identify issues for professional practice.

Learning portfolio

A learning portfolio approach is designed to focus learning on developing analytical and critical thinking skills, and understanding of concepts, theories and issues related to the course subject. It asks students to:

  • Write critical analyses of readings, including the use of instructor provided questions to guide analysis.
  • Create plans for completing assignments.
  • Include working drafts of assignments as they progress to a final product.
  • Evaluate their own learning progress.
  • Reflect on recorded lectures, readings, critical reviews and the like.

A student comment about the value of learning portfolios:

"I can see that there were lots of gaps in my thinking. And I hadn't really considered the role assessment had on learning. -The portfolio has allowed me to build the steps in pushing my own learning on and now I see those steps."

Learning records

A Learning Record method de-emphasizes the collection of learning evidence in favor of an ongoing process of recording and analyzing learning experiences - stepping back to observe one's own learning. Students are asked to record evidence of change in their journals, examine their own learning and meta-learning strategies, and reflect on how they help others learn. The instructor provides initial prompts such as:

  • What strikes you as important about the lecture, readings, and learning activities?
  • What sense are you making of your experiences?
  • Have there been any surprises?
  • How have your contributions helped other group members learn?

Gradually, instructor prompts encourage students to take responsibility for developing and answering their own questions:

  • What questions do you need to ask yourself at this point?
  • What questions have not been asked that need to be?

Later in the course, the instructor encourages students to undertake a meta-analysis of their own learning:

  • What do your earlier entries tell you about your learning at that point in time?
  • In what ways do you see your entries changing?
  • How do you see your entries developing in the future?

This personal reflection can be aided by including discussions, self-assessments, and questionnaires as fodder for reflection.

When contemplating alternative assessment practices such as portfolios, it's important to remember that students have not likely experienced their use, and so may be initially uncomfortable with them. When introducing change of any sort, explaining the change and answering questions is an important success factor.

Capstone projects

Capstone activities include project-based learning, case study analysis, service learning, work placements, internships, simulations, and immersion experiences. They typically provide the culmination of theoretical approaches and applied work practice experiences (Holdsworth et al., 2009). This involves integrating scholarly capabilities and employability skills. While the involved skills may be universal, the focus is on discipline-specific knowledge applied to real-world (authentic) scenarios.

The figure below illustrates the relationship between capstone experiences, student attributes, employer-desired capabilities, and lifelong learning (Holdsworth et al., 2009, p. 3).

Capstone projects build capabilities

Outcomes assessment

As a result of continuing pressure to demonstrate learning outcomes, increasing numbers of programs are using some form of outcomes assessment (Astin, 2013). It has long been acknowledged that traditional means fall short of consistently measuring desired outcomes. Course grades, with their questionable capacity to reflect changes, growth, or improvement in student learning, along with the non-comparability of grades from instructor to instructor and institution to institution, are the most cited reasons.

Unlike course grades, Astin asserts, standardized tests can be used as yardsticks for comparing students, and can be used repeatedly to measure growth and change. They can also be used to compare groups of learners and measure group change over time. Some of the more commonly used outcome assessments in higher education at the national level include the Collegiate Learning Assessment (CLA), the Critical Thinking Assessment Test (CAT), the Collegiate Assessment of Academic Proficiency (CAAP), and the ETS Proficiency Profile (EPP). For more information, see the National Institute for Learning Outcomes Assessment website.

Criticisms: In practice, many institutions have failed to use these assessments as intended, opting instead to use them as "one-shot" measures to rate their programs. Another criticism of these measures is they assume a causal relationship between college attendance and learner growth, when in fact there are likely other contributing factors. Additionally, the tests fail to explain differing outcomes for different students in the same programs. In short, outcome assessments are a good thing, but the execution needs much improvement.

Authentic assessment

When we use authentic assessment, we measure the phenomenon itself; we don't just talk about it. We reconstruct the circumstances under which the learning is applied in the real world as much as possible. The experience is integrative rather than atomistic. This approach is widespread in vocational education and workplace training, but much less so in the academic environment. Davison (2011) challenges this status quo with four case examples in non-vocational subject areas.

In its best form, authentic assessment is both formative and summative, and fully integrated with the instruction. Formative assessment becomes part of the process toward creating a product or outcome. As in real life, product development involves phases in which ideas are tested within the "lab", presented to others for feedback, and tested again with a sample population before general release. Working toward desired outcomes involves hypothesis testing, trial and error, and working out successful processes. Iteration and feedback are key.

Spectrum of Authenticity

Gulikers (2006) describes five dimensions that together place a particular assessment within a spectrum of authenticity. The further from real-world application any dimension lies, the less authentic the assessment. However, younger students are less likely to consider physical and social context relevant, possibly due to their lack of experience in the work world.

3. Spectrum of assessment authenticity

Assessment task: What do you have to do?

  • Integration of knowledge, skills, and abilities
  • Meaningfulness, typicality, and relevance (in the student's mind)
  • Degree of autonomy and responsibility
  • Degree of complexity

Assessment criteria: How is what you've done to be evaluated or judged?

  • Based on criteria used in professional practice
  • Related to a realistic result (product/outcome)
  • Transparent and explicit
  • Criterion referenced

Result form and substance: What result has to come out of your effort?

  • Demonstration of competence
  • Observation by or presentation to others
  • Multiple indicators of learning

Physical context: Where do you have to do it?

  • Similarity to professional work space
  • Availability of professional resources
  • Similarity to professional time frame

Social context: With whom do you have to do it?

  • Similarity of social context of professional practice
  • Individual work and decision making
  • Group work and decision making

While Gulikers focused on professional practice, Davison (2011) suggests authentic assessment can be more widely applied. Assessments can be authentic to:

  • a professional practice
  • an academic discipline
  • a research discipline
  • real-life settings
  • individual lives

The common denominator is that authentic assessments require learners to apply knowledge, skills, and abilities in an immediate and relevant way. Davison (2011) goes on to define authentic activities, adding a set of criteria to the settings just described. Authentic activities:

  • have real-world relevance
  • are ill-defined, requiring students to define the tasks necessary to complete the activity
  • comprise complex issues to be investigated over a sustained period of time; the answers are not obvious or simple
  • ask students to examine the issues from different perspectives, using a variety of resources
  • provide the opportunity to collaborate
  • provide the opportunity to reflect
  • can be integrated and applied across different subject areas and go beyond strictly domain-specific outcomes
  • create polished products valuable in their own right rather than as preparation for something else
  • allow competing solutions and diversity of outcomes (no one right answer)

Creating authentic assessments

The process of creating authentic assessment is simple, but executing the process can be very involved. Multiple sources (Ashford-Rowe et al., 2014; Mueller, 2012; Neely & Tucker, 2012) use Gulikers' five dimensions to structure the process.

  1. As with any form of assessment, we must be clear about the learning objectives we wish to accomplish. "What should learners know and be able to do?"
  2. Design and/or select authentic tasks based on the relevant professional practice, academic discipline, settings, or lives.
  3. Establish the authentic physical context.
  4. Establish the authentic social context.
  5. Determine the assessment criteria. "What indicates that students have achieved the objectives?" Creating an assessment rubric assists learners and instructors alike to concretely discriminate levels of performance.

Note that Gulikers' research indicates that students, though not instructors, place less emphasis on physical and, especially, social context. The implication is that compromises in these two dimensions are less likely to taint learners' perception of authenticity.

Examples

Here are three courses using authentic assessment activities to varying degrees.

Applied Literary Translation (Fitz, 2014)

This course, conducted entirely online, utilized readings, lectures, graded discussions, and the following authentic assignments:

  • Response papers on assigned articles. Sample question: Which do you believe is most essential when it comes to producing a quality literary translation: being a close, detailed reader of the original text, or being a stylistically accurate creative writer?
  • Submission letter/reader's report for publishing a book translation
  • Query letter to a publisher
  • Plan of approach: select a specific author, living or dead, research who owns the rights, and describe whom you would approach, and in what manner, in order to broach the subject of translating that author
  • Translation of a selected poem or short story, and a query letter with which to accompany the submission (summative assessment)

Medieval Thought and Culture (Davison, 2011).

This hybrid course utilized readings, lectures, in-class discussions, and the following authentic learning activities:

  • Watching Monty Python and the Holy Grail movie to demonstrate modern misconceptions of life in the Middle Ages
  • Reading original texts and viewing paintings from the period to examine medieval culture and life
  • Comparing modern values with what students have learned about medieval values
  • Debating traditionalism versus modernism set within the 12th century Roman Catholic Church and acting as church cardinals
  • Making group presentations based on historical texts
  • Visiting a medieval cathedral
  • 3000 word summative essay on a specific question about medieval culture, with interim reviews and feedback

Urban Geography (Davison, 2011)

The aim of this face-to-face course was to "encourage reflection on students' own perceptions and representations of their own geography and environment." It included lectures, readings, in-class cooperative work, seminar-style sessions, and:

  • Photographic portfolio of the local city with critical discussion of the photos
  • Development of a reflective book/journal/comic entitled, "Your place in the city"
  • Site visits to specific sections of the city
  • A carbon footprint exercise
  • Public exhibition of the final assessed work (summative assessment)

Using media in assessment

10. Designing video for specific learning objectives (Schwartz & Hartman, 2011)

Media can add elements of authenticity to both formative and summative assessments. Use of video for instruction was introduced in Part 2: Content, and here we use the same designed learning model by Schwartz & Hartman (2007) but consider the use of multiple media.

The inner circle (1) describes four general learning outcomes and their rough alignment with Bloom's taxonomy: engaging (interest, motivation), saying (declarative knowledge), seeing (perceptual learning), and doing (interpersonal and skill learning). From Bloom's perspective, motivation and attitudes are based in the affective domain.

From circle 1, we move outward within each quadrant, looking at specific objectives we seek to achieve and video genres that can be used to meet them. We review using video for assessment within each category.

Engaging is characterized as the pull that brings people to a situation and keeps them involved. Here we are assessing the design of the instruction rather than student learning.

  • Assessing for interest can be accomplished by asking about learners' preferences, or by presenting options and recording their choices. Also, do learners request additional resources and discussion, or seek them out on their own, following basic instruction? One way to measure this is to create web pages with links and record learner use of each (see the sketch after this list).
  • Testing for the impact of contextualizing topics during content presentation (e.g., providing background information, activating prior knowledge, embedding concepts in real-life situations) can be done by comparing assessment scores between those who access the contextual material and those who do not. Rather than assessing directly what students have learned from the video, we can measure what they learn from the lessons following the video.
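One low-tech way to record learner use of optional resource links, as mentioned in the first bullet above, is to route each link through a small logging redirect. The sketch below is a stand-alone illustration using only the Python standard library; the URLs, resource names, and log format are hypothetical, and a learning management system would normally capture this kind of data for you.

  # A minimal sketch of recording learner use of optional resource links:
  # each link points at a small redirect endpoint that logs the click and
  # then forwards the browser to the real resource. URLs and resource
  # names are hypothetical.
  import csv, time
  from http.server import BaseHTTPRequestHandler, HTTPServer
  from urllib.parse import urlparse, parse_qs

  RESOURCES = {
      "reading-1": "https://example.org/optional-reading-1",
      "demo-video": "https://example.org/demo-video",
  }

  class ClickLogger(BaseHTTPRequestHandler):
      def do_GET(self):
          query = parse_qs(urlparse(self.path).query)
          resource = query.get("r", [""])[0]
          target = RESOURCES.get(resource)
          if target is None:
              self.send_error(404)
              return
          with open("clicks.csv", "a", newline="") as f:     # one row per click
              csv.writer(f).writerow([time.strftime("%Y-%m-%d %H:%M:%S"), resource])
          self.send_response(302)                            # redirect to the real resource
          self.send_header("Location", target)
          self.end_headers()

  if __name__ == "__main__":
      # Link format in the course page: http://localhost:8000/go?r=reading-1
      HTTPServer(("localhost", 8000), ClickLogger).serve_forever()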

Saying is associated with declarative knowledge like facts and concepts.

  • Factual knowledge is most easily assessed by testing for recall. This can take the form of free recall, in which learners say or write what they remember, or of cued recall, using cues such as photos and symbols and asking learners to name and define them.
  • Explanatory outcomes ask students to draw inferences by going beyond the information given and extending what they have learned by drawing logical conclusions. Problem-solving, applying concepts to new situations, making predictions, assuming a point of view, and constructing an argument are forms of inference.

Seeing is about perceiving phenomena along a continuum from familiarity to discernment. Familiarity introduces people to phenomena they have not been exposed to - exotic animals, the deep sea, world heritage sites. Discernment focuses people on the details and nuances of phenomena, developing an "enlightened eye."

  • Recognition is the simplest way to test for familiarity: showing photos, drawings, and videos and asking learners to name what they see. Recognition can be more authentically assessed by putting subjects in their natural habitat (e.g., recognizing a plant in the wild as opposed to a greenhouse or isolated within a white background), and by presenting them from different angles.
  • Discernment depends on learners' ability to perceive, and make note of, the targeted phenomenon. There are at least three ways to test for perception. The simplest is to present paired images or videos and ask learners to select the exemplary model; student teachers might watch two teachers intervene in a scuffle and select the better approach. Second, through comparison: presenting a photo, scene, graphic, etc. and asking learners to identify targeted elements - pointing out possible security risks in a schematic of a building, for example, or signs of cancer on an x-ray. Third, using compare and contrast, say, by recording two customer interactions and asking learners to describe the differences between the two and select good and bad elements of both.

Doing is associated with human behavior involving attitudes and skills. Attitudes can be directed to other people, objects, ideas, and concepts (e.g., democracy). Skills may include cognitive, interpersonal, and psychomotor performance.

  • Assessing behavior is best approached by viewing and evaluating actual physical behavior, rather than asking someone what they would do. To evaluate attitudes we can observe the manner in which people behave toward the target before and after instruction. Are they more or less aggressive? Do they establish or avoid eye contact? Although not entirely predictive of long-term behavior, it is usually better to observe shortly after instruction and provide feedback than to wait.
  • Performance assessments are ideal for evaluating skill acquisition because they directly test the relevant behavior. For some skills, such as assembling a circuit board, a full-blown performance allows us to evaluate all elements of performance such as speed, accuracy, precision, etc. Variations can also be included, such as different board shapes or connectors, to test for flexibility. Other, more complex skills can be broken down into sub-tasks which can be evaluated separately to identify aspects that need additional attention. Scaffolding can also be included for formative assessments, providing starter materials and videos for novices rather than starting cold.

Conclusion

Assessments are not only useful for measuring student learning, but are valuable learning tools in themselves. Formative assessment in its many forms is intended to provide learners with feedback that (1) informs them of their current state, and (2) extends their learning and improves their performance. Summative assessment is best accomplished through authentic tasks and in multiple ways that sample the full range of student performance.


Preinstruction | Content | Application | Assessment | Follow-through

