Bad Methods Yield Non-Actionable Answers

Originally published June 2017

Having drunk the Kool-Aid of rubrics and assessment, many an untrained academic administrator epitomizes that old saw about knowing just enough to be dangerous. Suppose a manager wants to make a decision based on multiple criteria. An academic manager, for example, might consider

  • Employee Type
  • Organization Needs and Employee Expertise
  • Employee Productivity
  • Employee Versatility
  • Engagement in Critical Roles

The plan is to rate each employee on each dimension and then add up the ratings to yield a score that permits comparison between employees when deciding whom to retain.

The individual ratings will be some variation on High, Medium, Low.
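
To make the arithmetic concrete, here is a minimal sketch of that plan; the dimension names and point values are stand-ins for illustration, not anyone's actual policy.

```python
# A minimal sketch of the scoring plan described above. The dimension
# names and point values are stand-ins, not anyone's actual policy.

POINTS = {"High": 3, "Medium": 2, "Low": 1}

def rubric_score(ratings):
    """Sum the point values of a {dimension: rating} dict."""
    return sum(POINTS[r] for r in ratings.values())

employee_a = {
    "Employee Type": "High",
    "Needs/Expertise Match": "Medium",
    "Productivity": "Medium",
    "Versatility": "Low",
    "Critical Roles": "High",
}
print(rubric_score(employee_a))  # 11 -- the single number decisions get hung on
```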

The use of rubrics such as this one is all the rage in higher education. Unfortunately, they are frequently deployed in a manner that reduces, rather than improves, the quality of the decisions they are meant to inform. Consider a few of the ways this happens.

Ratings Are Not Normalized

When the top rating in some categories counts for 3 points and in others only 2, we introduce a distortion into the final score: type, match, and productivity “count” more than versatility and critical role. If that’s intended, fine; but if not, it skews the results.
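
A quick illustration of how un-normalized maxima quietly become weights; the categories and point values here are invented.

```python
# Hypothetical illustration of un-normalized category maxima.
# If three categories are scored out of 3 and two out of 2,
# a "High" in Versatility moves the total less than a "High" in Type.

max_points = {"Type": 3, "Match": 3, "Productivity": 3,
              "Versatility": 2, "Critical Role": 2}

total_possible = sum(max_points.values())  # 13
for category, pts in max_points.items():
    print(f"{category}: {pts/total_possible:.0%} of the total score")
# Type, Match, and Productivity each carry ~23% of the total;
# Versatility and Critical Role only ~15% -- a weighting nobody chose on purpose.
```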

Ordinal Scales Do Not Contain Distance Information

Any fool, as they say, knows that “high” is more than “medium,” which is more than “low,” and “low” is more than “none.” A scale with this property is called an “ordinal” scale: its elements can unambiguously be ordered from low to high.

What we do NOT know, though, is whether the “distance” between a high rating and a medium rating is equal to the distance between a medium rating and a low rating.

Although it is extremely common to look at an ordinal scale like “high, medium, and low” and assign 3 to high, 2 to medium, and 1 to low, this is a serious methodological error.  It invents information out of thin air and inserts it into the assessment. The ways in which this distorts the answers that emerge from the measurement cannot be determined without careful analysis. Just writing 3, 2, 1 next to words is not careful analysis.
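
A small, invented example makes the point: two codings that are equally faithful to the ordering High > Medium > Low can rank the same two employees differently.

```python
# Two codings of High/Medium/Low, both consistent with the ordering
# High > Medium > Low, can rank the same employees differently.
# The employees and ratings are invented for illustration.

coding_a = {"High": 3, "Medium": 2, "Low": 1}   # the usual reflex
coding_b = {"High": 5, "Medium": 2, "Low": 1}   # equally "valid" ordinally

alice = ["High", "Low", "Low"]          # one standout dimension
bob = ["Medium", "Medium", "Medium"]    # uniformly middling

def total(ratings, coding):
    return sum(coding[r] for r in ratings)

print(total(alice, coding_a), total(bob, coding_a))  # 5 vs 6: Bob wins
print(total(alice, coding_b), total(bob, coding_b))  # 7 vs 6: Alice wins
```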

Criteria Overlap Double Counts Things

Suppose the same underlying trait contributes both to the needs/expertise match and to an employee’s versatility, and that this trait is one of many we would like to consider in deciding whether to retain the employee. Since it has an impact on both factors, its presence effectively gets counted twice (as would its absence).

Unless we are very careful to be sure that each rating category is separate and distinct, a rubric like this introduces distortion into the final score by unintentionally overweighting some factors and underweighting others.
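
Here is a toy sketch of the double counting; the trait, the cutoff, and the mapping to ratings are all invented for illustration.

```python
# Illustration of criteria overlap: suppose a single underlying trait
# (say, breadth of skills) drives both the needs/expertise match and
# the versatility rating. The trait then enters the total twice.

def rate(trait_breadth):
    match = "High" if trait_breadth > 7 else "Medium"        # driven by breadth
    versatility = "High" if trait_breadth > 7 else "Medium"  # also driven by breadth
    return match, versatility

points = {"High": 3, "Medium": 2, "Low": 1}
for breadth in (6, 8):
    m, v = rate(breadth)
    print(breadth, points[m] + points[v])
# Crossing the threshold on the single trait moves the total by 2 points,
# twice the swing it would get if the trait fed only one criterion.
```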

Sequence Matters

When using rubrics like this we sometimes hear that one or another criterion is only applied after the others, or is used as a screen before them. This too needs to be done thoughtfully and deliberately; it is not hard to show that different sequences of applying the criteria can produce different outcomes.
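
A contrived example, with made-up employees and screens, shows how the order of application changes who survives.

```python
# Invented example: applying the same two screens in different orders
# can keep different employees. "Keep the top 2 by productivity, then
# drop anyone Low on versatility" is not the same rule as the reverse.

points = {"High": 3, "Medium": 2, "Low": 1}
staff = {
    "Ana":  {"productivity": "High",   "versatility": "Low"},
    "Ben":  {"productivity": "Medium", "versatility": "High"},
    "Cruz": {"productivity": "Medium", "versatility": "Medium"},
}

def top2_by_productivity(pool):
    return dict(sorted(pool.items(),
                       key=lambda kv: points[kv[1]["productivity"]],
                       reverse=True)[:2])

def drop_low_versatility(pool):
    return {k: v for k, v in pool.items() if v["versatility"] != "Low"}

print(list(drop_low_versatility(top2_by_productivity(staff))))  # ['Ben']
print(list(top2_by_productivity(drop_low_versatility(staff))))  # ['Ben', 'Cruz']
```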

Zero is Not Nothing

A final problem with scales like these is that even if the distances between the ratings were meaningful, it is not always the case that we have a well-defined “zero” rating. Assigning zero to the lowest rating category is not the same as saying that those assigned to this category have none of whatever is being measured.

The problem this introduces is that a scale without a well-understood zero yields measurements that cannot meaningfully be multiplied or divided. This means we cannot think in terms of ratios or average ratings the way we often do.
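
A tiny illustration: shift where the coding starts (equally defensible, since there is no true zero) and the “so-and-so is 50% better” story evaporates.

```python
# Invented illustration: without a true zero, ratios of ratings are
# artifacts of where the coding starts. Coding Low as 1 vs. as 11 is
# equally arbitrary, but the implied ratio changes completely.

for low in (1, 11):
    coding = {"Low": low, "Medium": low + 1, "High": low + 2}
    ratio = coding["High"] / coding["Medium"]
    print(f"Low={low}: ratio High/Medium = {ratio:.2f}")
# Prints 1.50 with the first coding and 1.08 with the second.
```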

Rankings are Just Rankings

The upshot is that ordinal scales are just rankings, just orderings. Without a better-established underlying numerical scale, rankings are very hard to compare and combine in a way that does not obscure more than it illuminates. Decisions based on naive uses of quantification are as likely as not to be wrong, swayed by extraneous and unacknowledged factors, or simply the product of arbitrary choices made along the way.

House of Cards

A Facebook post called my attention to a neat little article about why swimming rules recognize only hundredths of a second even though modern timing technology allows much more precise measurements. The gist is this: construction technology limits the precision with which pools can be built to something like a few centimeters over a 50-meter pool. At top speed a swimmer covers about 2 millimeters in a thousandth of a second. So if you award places based on differences of thousandths of a second, you can’t know whether you are rewarding faster swimming or the luck of swimming in a shorter lane.
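
The arithmetic is worth checking for yourself; the speed and tolerance figures below are rough back-of-envelope numbers, not official ones.

```python
# Back-of-envelope check of the swimming example. The speed and the
# construction tolerance are rough figures, not official numbers.

swimmer_speed = 2.0    # m/s, roughly a top freestyle sprinter
pool_tolerance = 0.03  # m, a few centimeters of allowed variation

distance_per_ms = swimmer_speed * 0.001           # ~0.002 m = 2 mm
time_equivalent = pool_tolerance / swimmer_speed  # ~0.015 s

print(f"{distance_per_ms * 1000:.1f} mm covered per millisecond")
print(f"{time_equivalent * 1000:.0f} ms of 'time' hiding in the pool tolerance")
# A thousandth of a second is far inside the noise from lane length alone.
```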

This observation points to the more general phenomena of false precision, misplaced concreteness (aka reification, hypostatization), and organizational irrationality rooted in sloppy and abusive quantification.

These are endemic in higher education.

Students graduate with a GPA and it’s taken as a real, meaningful thing. But if you look at what goes into it (exams designed more or less well, subjective letter grades on essays, variable “points off” for rule infractions, quirky weighting of assignments, arbitrary conversions of points to letter grades, curves, etc.), you’d have to allow for error bars the size of a city block.
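
If you want to see how fast the noise accumulates, a toy simulation is enough; the grades and the noise level below are invented, not measured.

```python
# Toy Monte Carlo sketch with an invented noise level: if each course
# grade carries roughly half a letter grade of measurement error, a GPA
# reported to two decimals is mostly false precision.
import random

def gpa_interval(true_grades, noise=0.5, trials=10_000):
    """Rough 95% interval for the GPA if each grade carries +/- `noise` error."""
    gpas = []
    for _ in range(trials):
        observed = [min(4.0, max(0.0, g + random.uniform(-noise, noise)))
                    for g in true_grades]
        gpas.append(sum(observed) / len(observed))
    gpas.sort()
    cut = int(0.025 * trials)
    return gpas[cut], gpas[-cut - 1]

low, high = gpa_interval([3.0, 3.3, 3.7, 2.7, 4.0, 3.0, 3.3, 3.7])
print(f"That 3.34 GPA is 'really' somewhere around {low:.2f} to {high:.2f}")
```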

Instructors fret about average scores on teaching evaluations.

“Data driven” policies are built around the analysis of tiny-N samples that are neither random nor representative.

Courses are fielded or not, and faculty lines granted or not, based on enrollment numbers, with no awareness of the contributions of class scheduling, requirement finagling, course content overlap, perceptions of ease, and the wording of titles.

Budgets are built around seat-of-the-pants estimates and negotiated targets.

One could go on.

The bottom line is that decision makers need to recognize how all of these shaky numbers are aggregated to produce what they think are facts about the institution and its environment. This suggests two imperatives. First, we should reduce individual cases of crap quantification. Second, when we bring “facts” together (e.g., enrollment estimates and cost of instruction), we should adopt an “error bar” sensibility (in its simplest form, treat any number as “likely between X and Y”) so that each next step is attended by an appropriate amount of uncertainty rather than an inappropriate amount of fantasized certainty.
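
In code, that sensibility can be as simple as carrying a (low, high) range through the arithmetic instead of a single confident number; the enrollment and cost figures below are made up for illustration.

```python
# A minimal sketch of the "error bar" sensibility: propagate a
# (low, high) range instead of a single point estimate.
# The enrollment and cost figures are invented for illustration.

def interval_mul(a, b):
    """Multiply two (low, high) intervals."""
    products = [x * y for x in a for y in b]
    return min(products), max(products)

enrollment = (180, 240)         # "likely between 180 and 240 students"
cost_per_student = (900, 1200)  # "likely between $900 and $1200"

low, high = interval_mul(enrollment, cost_per_student)
print(f"Instructional cost: ${low:,} to ${high:,}")  # $162,000 to $288,000
# Reporting the range keeps the uncertainty visible at the next step,
# instead of handing a decision maker one falsely precise number.
```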