Bad Methods Yield Non-Actionable Answers

Originally published June 2017

Having drunk the Kool-Aid of rubrics and assessment, many an untrained academic administrator epitomizes that old saw about knowing just enough to be dangerous. Suppose a manager wants to make a decision based on multiple criteria. An academic manager, for example, might consider

  • Employee Type
  • Organization Needs and Employee Expertise
  • Employee Productivity
  • Employee Versatility
  • Engagement in Critical Roles

The plan is to rate each employee on each dimension and then add up the ratings to yield a score that permits comparison between employees when deciding whom to retain.

The individual ratings will be some variation on High, Medium, Low.

The use of rubrics such as this is all the rage in higher education. Unfortunately, they are frequently deployed in a manner that reduces, rather than improves, the quality of the decisions they are meant to inform.

Ratings are Not Normalized

If the top rating in some categories is worth 3 points and in others only 2, we introduce a distortion into the final score: type, match, and productivity “count” more than versatility and critical role. If that’s intended, fine; if not, it skews the results.
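To make the distortion concrete, here is a quick sketch with made-up point values (nothing here comes from an actual rubric): the top rating is worth 3 points in three categories and only 2 in the other two, so two employees with mirror-image rating profiles end up with different totals.

```python
# Hypothetical point values: the top rating is worth 3 in three categories
# (type, match, productivity) but only 2 in the other two (versatility,
# critical role).  Both scales share the same floor of 1.
scale_3pt = {"High": 3, "Medium": 2, "Low": 1}
scale_2pt = {"High": 2, "Medium": 1.5, "Low": 1}

def total(ratings):
    """ratings = (type, match, productivity, versatility, critical_role)"""
    return (sum(scale_3pt[r] for r in ratings[:3]) +
            sum(scale_2pt[r] for r in ratings[3:]))

a = ("High", "High", "High", "Low", "Low")   # strong on the 3-point criteria
b = ("Low", "Low", "Low", "High", "High")    # strong on the 2-point criteria

print(total(a), total(b))   # 11 vs 7: mirror-image profiles, unequal totals
```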

Ordinal Scales Do Not Contain Distance Information

Any fool, as they say, knows that “high” is more than “medium” which is more than “low” and “low” is more than “none.”  When we have a scale that has this property we call it an “ordinal” scale; the elements of the scale can unambiguously be ordered from low to high.

What we do NOT know, though, is whether the “distance” between a high rating and a medium rating is equal to the distance between a medium rating and a low rating.

Although it is extremely common to look at an ordinal scale like “high, medium, and low” and assign 3 to high, 2 to medium, and 1 to low, this is a serious methodological error.  It invents information out of thin air and inserts it into the assessment. The ways in which this distorts the answers that emerge from the measurement cannot be determined without careful analysis. Just writing 3, 2, 1 next to words is not careful analysis.
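A quick illustration of how much the invented spacing can matter: below, two hypothetical employees are rated on three criteria, and two codings that both respect the ordering high > medium > low disagree about who comes out ahead. The numbers are made up purely for illustration.

```python
# Two employees rated on three criteria.
a = ["High", "Low", "Low"]
b = ["Medium", "Medium", "Medium"]

# Both codings respect the order High > Medium > Low; only the
# (invented) spacing between the categories differs.
evenly_spaced = {"High": 3, "Medium": 2, "Low": 1}
top_heavy     = {"High": 5, "Medium": 2, "Low": 1}

for label, coding in (("3/2/1", evenly_spaced), ("5/2/1", top_heavy)):
    total_a = sum(coding[r] for r in a)
    total_b = sum(coding[r] for r in b)
    leader = "A" if total_a > total_b else "B"
    print(f"{label}: A={total_a}, B={total_b}, ahead: {leader}")

# 3/2/1 puts B ahead (5 vs 6); 5/2/1 puts A ahead (7 vs 6).
```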

Criteria Overlap Double Counts Things

Suppose some of the same underlying traits and behaviors contribute both to a needs/expertise match and to an employee’s versatility, and that this trait is one of many we would like to consider in deciding whether to retain the employee. Since it has an impact on both factors, its presence effectively gets counted twice (as would its absence).
Unless we are very careful to make sure that each rating category is separate and distinct, a rubric like this introduces distortion into the final score by unintentionally overweighting some factors and underweighting others.
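Here is a sketch of the double counting, with hypothetical traits and a made-up mapping from traits to rubric categories: because “breadth” informs both the match rating and the versatility rating, a one-level change in breadth moves the total twice as far as a one-level change in a trait that informs only one category.

```python
# Hypothetical mapping from underlying traits to the five rubric categories.
# "breadth" feeds two categories; "output" feeds only one.
def total(type_fit, breadth, output, role_fit):
    ratings = {
        "type":          type_fit,
        "match":         breadth,    # breadth counted here...
        "productivity":  output,
        "versatility":   breadth,    # ...and counted again here
        "critical_role": role_fit,
    }
    return sum(ratings.values())

base = total(type_fit=2, breadth=2, output=2, role_fit=2)
print(total(2, 3, 2, 2) - base)   # +2: breadth is effectively double-weighted
print(total(2, 2, 3, 2) - base)   # +1: output counts only once
```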

Sequence Matters

When using rubrics like this we sometimes hear that one or another criterion is used only after the others, or is used as a screen before them. This too needs to be done thoughtfully and deliberately. It is not hard to show how different sequences of applying the criteria can produce different outcomes.
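A small, entirely hypothetical example of sequence dependence: three employees, two criteria (coded numerically only for brevity), and a rule that screens on one criterion before ranking on the other. Swapping which criterion screens first changes who is retained.

```python
employees = {"Ann": {"prod": 3, "vers": 1},
             "Bo":  {"prod": 1, "vers": 3},
             "Cal": {"prod": 2, "vers": 2}}

def screen_then_rank(screen_on, rank_on, keep=2):
    """Drop anyone at the bottom of `screen_on`, then keep the top `keep` by `rank_on`."""
    survivors = {n: e for n, e in employees.items() if e[screen_on] > 1}
    ranked = sorted(survivors, key=lambda n: survivors[n][rank_on], reverse=True)
    return set(ranked[:keep])

print(screen_then_rank("prod", "vers"))   # retains Ann and Cal
print(screen_then_rank("vers", "prod"))   # retains Bo and Cal
```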

Zero is Not Nothing

A final problem with scales like these is that even if the distances between the ratings were meaningful, it is not always the case that we have a well-defined “zero” rating. Assigning zero to the lowest rating category is not the same as saying that those assigned to this category have none of whatever is being measured.
The problem this introduces is that a scale without a well-understood zero yields measurements that cannot meaningfully be multiplied or divided: we cannot say one employee’s rating is “twice” another’s, and the average ratings we so routinely compute rest on shakier ground than they appear.
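A brief sketch of why this matters, using two made-up codings of the same ordered ratings: both preserve the ordering and the spacing, but because the origin is arbitrary, the ratio between a High and a Low changes completely.

```python
# Two codings of the same ordered, evenly spaced ratings; only the origin differs.
coding_a = {"Low": 1, "Medium": 2, "High": 3}
coding_b = {"Low": 11, "Medium": 12, "High": 13}

for coding in (coding_a, coding_b):
    hi, lo = coding["High"], coding["Low"]
    print(f"difference = {hi - lo}, ratio = {hi / lo:.2f}")

# Differences agree (2 and 2); ratios do not (3.00 vs 1.18), so claims like
# "twice the rating" depend entirely on where we happened to put zero.
```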

Rankings are Just Rankings

The upshot is that ordinal scales are just rankings, just orderings, and without a better-established underlying numerical scale, rankings are very hard to compare and combine in a manner that does not obscure more than it illuminates. Decisions based on naive uses of quantification are as likely as not to be wrong: influenced by extraneous and unacknowledged factors, or simply the random consequences of choices made along the way.

Ten Reflections from the Fall Semester

Notes from this semester. Each semester I jot down observations about organizational practices, usually inspired by events at my place of employment.  Every now and then I try to distill them into advice for myself. Most are obvious, once articulated, but they come to notice, usually, because things happen just the other way round.

  1. Always treat the people you work with as if they are smart; explain why you take a stand or make a decision in a manner that demonstrates that you know they are smart, critical, and open to persuasion by evidence and argument. Set high standards for yourself. Your institutional work should be at least as smart as your scholarly work.
    • “it is better to be wrong than vague.” – Stinchcombe
    • If smart people are opposed to your idea, ask them to explain why. And listen. Remember, your goal is to get it right, not to get it your way.
  2. Do not put people in charge of cost cutting and budget reductions. Put them in charge of producing excellence within a budget constraint.
  3. Make sure everyone is able to say how many Xs one student leaving represents. How much will it cost to do the thing that reduces the chance a student will get fed up and leave?
  4. If most of what a consultant tells you is what you want to hear (or already believe), fire her.
  5. Don’t build/design systems and policies around worst cases, least cooperative colleagues, people who just don’t get it, or individuals with extraordinarily hard-luck situations. Do not let people who deal with “problem students” suggest or make rules/policy.
  6. Be wise about what you must/should put up for a vote and what you should not. And if you don’t know how a vote will turn out, you are not prepared to put it up for a vote. Do your homework, person by person.
  7. If a top reason for implementing a new academic program is because there’s lots of interest among current students, pause. Those students are already at your school. What you want are new programs that are attractive to people who previously would not have given you a second look.
  8. If you are really surprised by the reaction folks have to an announcement or decision then just start your analysis with the realization that YOU screwed up.
    • Related: don’t assume it was just about the messaging; you might actually be wrong, and you should want to know whether that’s the case.
  9. If you or someone else’s first impulse when asked to get something done is to form a committee, put someone else in charge of getting that thing done.
  10. Persuade/teach folks that teams and committees in organizations are not representative democracies. The team does not want your opinions, feelings, experiences, or beliefs; it wants you to enrich the team’s knowledge base by reporting on a part of the world you know something about. And that usually means going and finding out, in a manner that is sensitive to your availability bias. In the research phase, team members are the sense organs of the team. Be a good sense organ, not a jerking knee, a pontificator, an evangelist, or a naysayer.

Rubrics, Disenchantment, and Analysis I

There is a tendency, in certain precincts in and around higher education, to fetishize rubrics. One gets the impression at conferences and from consultants that arranging something in rows and columns with a few numbers around the edges will call forth the spirit of rational measurement, science even, to descend upon the task at hand. That said, one can acknowledge the heuristic value of rubrics without succumbing to a belief in their magic. Indeed, the critical examination of almost any of the higher education rubrics in current circulation will quickly disenchant, but one need not abandon all hope: if assessment is “here to stay,” as some say, it need not be the intellectual train wreck its regional and national champions sometimes seem inclined to produce.

Consider this single item from a rubric used to assess a general education goal in gender:

As is typical of rubric cell content, each of these is “multi-barrelled” — that is, the description in each cell is asking more than one question at a time. It’s not unlike a survey in which respondents are asked, “Are you conservative and in favor of ending welfare?” It’s a methodological no-no, and, in general, it defeats the very idea of disaggregation (i.e., “what makes up an A?”) that a rubric is meant to provide.

In addition, rubrics presented like this are notoriously hard to read. That’s not just an aesthetic issue — failure to communicate effectively leads to misuse of the rubric (measurement error) and reduces the likelihood of effective constructive critique.

Here is the same information presented in a manner that’s more methodologically sound and more intellectually legible:

At the risk of getting ahead of ourselves, there IS a serious problem when these rank ordered categories are used as scores that can be added up and averaged, but we’ll save that for another discussion.  Too, there is the issue of operationalization — what does “deep” mean, after all, and how do you distinguish it from not so deep?  But this too is for another day.

Let’s, for the sake of argument, assume that each of these judgments can be made reliably by competent judges. All told, 4 separate judgments are to be made and each has 3 values. If these knowledges and skills are, in fact, independent (if not, a whole different can of worms), then there are 3 x 3 x 3 x 3 = 81 combinations of ratings possible. Each of these 81 possible assessments is eventually mapped onto 1 of 4 ratings. Four combinations are specified, but the other 77 possibilities are not.

Now let us make a (probably invalid) assumption: that each of THESE scores is worth 1, 2, or 3 “points,” and then let’s calculate the distance between each pair of the four categories. We use standard Euclidean distance, d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + (x3 − y3)² + (x4 − y4)²), with the categories being: Mastery = (3, 3, 3, 3), Practiced = (2, 2, 2, 3), Introduced = (2, 2, 2, 2), Benchmark = (1, 1, 1, 1).
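For readers who want to check the arithmetic, here is a minimal sketch that counts the 81 possible rating combinations and computes the pairwise Euclidean distances between the four named categories (the category vectors are taken from the text; everything else is just bookkeeping).

```python
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+)

categories = {
    "Mastery":    (3, 3, 3, 3),
    "Practiced":  (2, 2, 2, 3),
    "Introduced": (2, 2, 2, 2),
    "Benchmark":  (1, 1, 1, 1),
}

# 3 possible values on each of 4 independent judgments -> 3**4 = 81 combinations.
print(len(list(product((1, 2, 3), repeat=4))))   # 81

# Pairwise distances between the four named categories.
names = list(categories)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a:>10} - {b:<10} {dist(categories[a], categories[b]):.2f}")
# Mastery-Introduced 2.00, Introduced-Benchmark 2.00, Mastery-Benchmark 4.00,
# Mastery-Practiced 1.73, Practiced-Introduced 1.00, Practiced-Benchmark 2.65
```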


So, how do these categories spread out along the dimension we are measuring here? Mastery, Introduced, and Benchmark are nicely spaced, 2 units apart (and M to B at 4 units). But then we try to fit P in. It’s about 1.7 units from Mastery and about 2.6 from Benchmark, but it’s also just 1 unit from Introduced. To represent these distances we have to locate it off to the side.
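One way to see this off-to-the-side placement precisely is classical multidimensional scaling, which recovers planar coordinates consistent with these distances. This is my addition rather than part of the original analysis, and it assumes NumPy; a minimal sketch:

```python
import numpy as np

labels = ["Mastery", "Practiced", "Introduced", "Benchmark"]
points = np.array([[3, 3, 3, 3],
                   [2, 2, 2, 3],
                   [2, 2, 2, 2],
                   [1, 1, 1, 1]])

# Classical MDS: double-center the squared-distance matrix, then take the
# leading eigenvectors (scaled by the root eigenvalues) as coordinates.
D2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
n = len(points)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]                       # largest eigenvalues first
coords = vecs[:, order[:2]] * np.sqrt(np.maximum(vals[order[:2]], 0))

for label, (x, y) in zip(labels, coords):
    print(f"{label:>10}: ({x:+.2f}, {y:+.2f})")
# Mastery, Introduced, and Benchmark fall on a single line;
# Practiced sits off that line, which is the second dimension.
```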

This little exercise suggests that this line of the rubric is measuring two dimensions.

This should provoke us into thinking about what dimensions of learning are being mixed together in this measurement operation.

It is conventional in this sort of exercise to try to characterize the dimensions along which the items are spread out. Looking back at how we defined the categories, we might speculate that one dimension has to do with skill (analysis) and the other with knowledge. But Mastery and Practiced were on the same level on analysis. What do we do?

It turns out that the orientation of a diagram like this is arbitrary — all it is showing us is relative distance. And so we can rotate it like this to show how our assessment categories for this goal relate to one another.

Now you may ask: what was the point of this exercise? First, if the point of assessment is to get teachers to think about teaching and learning, and to do so in a manner that applies the same sort of critical thinking skills we think are important for students to acquire, then a careful critique of our assessment methods is absolutely necessary.

Second, this little bit of quick and dirty analysis of a single rubric might actually help people design better rubrics AND assess the quality of existing rubrics (there’s lots more to worry about on these issues, but that’s for another time). Maybe, for example, we might conceptualize “introduce” to include knowledge but not skill, or vice versa? Maybe we’d think about whether the skill (analysis) is something that should cross GE categories and be expressed in common language. And so on.

Third, this is a first step toward showing why it makes very little sense to take the scores produced by using rubrics like this and then adding them up and averaging them out in order to assess learning.  That will be the focus of a subsequent post.