House of Cards

A Facebook post called my attention to a neat little article about why swimming rules only recognize hundredths of seconds even though modern timing technology allows much more precise measurements. The gist is this: swimming rules recognize that construction technology limits the precision with which pools can be built to something like a few centimeters in a 50-meter pool. At top speed a swimmer moves about 2 millimeters in a thousandth of a second. So, if you award places based on differences of thousandths of a second, you can’t know if you are rewarding faster swimming or the luck of swimming in a shorter lane.
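For the sake of concreteness, here is the back-of-the-envelope version of that argument in code (the speed and tolerance figures are my own rough assumptions, not numbers from the article):

```python
# Back-of-the-envelope check of the swimming-timing argument.
# The figures below are rough assumptions for illustration.

pool_tolerance_m = 0.03      # construction tolerance: ~3 cm in a 50 m pool
swimmer_speed_mps = 2.0      # a fast swimmer covers roughly 2 m per second

# Distance covered in one thousandth of a second
distance_per_ms = swimmer_speed_mps * 0.001          # ~0.002 m, i.e. ~2 mm

# The construction tolerance expressed as swimming time
tolerance_in_ms = pool_tolerance_m / swimmer_speed_mps * 1000   # ~15 ms

print(f"Distance per millisecond: {distance_per_ms * 1000:.1f} mm")
print(f"Lane-length tolerance expressed as time: {tolerance_in_ms:.0f} ms")
# A few centimeters of lane-length uncertainty is worth ~15 ms of swimming,
# so ranking by thousandths of a second rewards lane luck as much as speed.
```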

This observation points to the more general phenomena of false precision, misplaced concreteness (aka reification, hypostatization), and organizational irrationality rooted in sloppy and abusive quantification.

These are endemic in higher education.

Students graduate with a GPA and it’s taken as a real, meaningful thing. But if you look at what goes into it (exams designed more or less well, subjective letter grades on essays, variable “points off” for rule infractions, quirky weighting of assignments, arbitrary conversions of points to letter grades, curves, etc.), you’d have to allow for error bars the size of a city block.
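As a toy illustration of how big those error bars get, here is a quick sketch (the grades, credit weights, and the assumed ±0.3 of slack per course grade are all invented):

```python
# Toy illustration: how per-course grading noise turns into a GPA range.
# All grades, credits, and the assumed +/-0.3 per-course slack are invented.

courses = [
    # (grade points, credits)
    (3.7, 4),   # essay-heavy seminar, subjective letter grade
    (3.0, 3),   # curved exam course
    (4.0, 1),   # one-credit lab
    (2.7, 3),   # quirky weighting, points off for late work
]
per_course_slack = 0.3   # assumed uncertainty in each letter grade

total_credits = sum(c for _, c in courses)
gpa = sum(g * c for g, c in courses) / total_credits

# Bounds: every grade was generous vs. every grade was stingy.
gpa_low = sum(max(g - per_course_slack, 0.0) * c for g, c in courses) / total_credits
gpa_high = sum(min(g + per_course_slack, 4.0) * c for g, c in courses) / total_credits

print(f"Reported GPA: {gpa:.2f}")
print(f"Plausible range: {gpa_low:.2f} to {gpa_high:.2f}")
# Two decimal places of GPA sit inside a range more than half a point wide.
```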

Instructors fret about average scores on teaching evaluations.

“Data driven” policies are built around the analysis of tiny-N samples that are neither random nor representative.

Courses are fielded or not, and faculty lines granted or not, based on enrollment numbers, with no awareness of the contributions of class scheduling, requirement finagling, course content overlap, perceptions of ease, and the wording of course titles.

Budgets are built around seat-of-the-pants estimates and negotiated targets.

One could go on.

The bottom line is that decision makers need to recognize how all of these shaky numbers are aggregated to produce what they think are facts about the institution and its environment. This suggests two imperatives. First, we should reduce individual cases of crap quantification. Second, when we bring “facts” together (e.g., enrollment estimates and cost of instruction) we should adopt an “error bar” sensibility – in its simplest form, treat any number as being “likely between X and Y” – so that each next step is attended by an appropriate amount of uncertainty rather than an inappropriate amount of fantasized certainty.
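A minimal sketch of what that error-bar sensibility looks like in practice, carrying low/high ranges instead of point estimates (the enrollment and cost figures are made up):

```python
# Minimal sketch of "error bar" arithmetic: carry (low, high) intervals
# through a calculation instead of single numbers. Figures are invented.

def interval_mul(a, b):
    """Multiply two (low, high) intervals, assuming all values are positive."""
    return (a[0] * b[0], a[1] * b[1])

# "Facts" stated as ranges rather than point estimates
enrollment = (120, 160)          # likely between 120 and 160 students
cost_per_student = (800, 1000)   # likely between $800 and $1,000

instruction_cost = interval_mul(enrollment, cost_per_student)

print(f"Instruction cost: likely between ${instruction_cost[0]:,} and ${instruction_cost[1]:,}")
# The point estimate (140 students * $900 = $126,000) hides a spread of
# roughly $64,000; each downstream decision should carry that spread along
# rather than a fantasized single number.
```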

Student Evaluations of Teaching: WHY is this still a thing?

My institution just created a data science major. But it doesn’t care about using data in honest and robust ways any more than other institutions do.

It’s gotten to the point that it’s intellectually embarrassing and ethically troubling that we are still using student evaluations of teaching (SET) in their current form for assessing instructor job performance. It is laughable that we do so with numbers computed to two decimal places. It is scandalous that we ignore the documented biases (most especially gender-based). But we do.

Why isn’t this an active conversation between faculty and administrators?  I certainly find teaching evaluations helpful – trying to understand why I got a 3.91 on course organization but a 4.32 on inspiring interest is a useful meditation on my teaching practice.  I have to remind myself that the numbers themselves do not mean much.

Telling me where my numbers stand vis a vis my colleagues or the college as a whole FEELS useful and informative, but is it? I THINK I must be doing a better job than a colleague who has scores in the 2.0 – 3.0 range. But doing a better job at what? If you think hard about it, all you can probably take to the bank is that I am better at getting more people to say “Excellent” in response to a particular question. The connection between THAT student behavior and the quality of my work is a loose one.

Maybe I am on solid ground when I compare my course organization score to my inspires interest score. MAYBE I am on solid ground when I compare my course organization score in one class to the same score in another class the same semester, or in the same class in another year. I might, for example, think about changes I could make in how I organize a course and then see if that score moves next semester.

But getting seduced by the second decimal place is ludicrous and mad. Even fetishizing the first decimal place is folly. For that matter, even treating this as an average to begin with is bogus.
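A quick sketch of how much of a small-class mean is just sampling noise (the response counts are hypothetical):

```python
# Sketch: how much of a small-class evaluation mean is sampling noise.
# The 25 hypothetical responses on a 1-5 item are invented for illustration.
import statistics

responses = [5] * 9 + [4] * 10 + [3] * 4 + [2] * 2   # n = 25

n = len(responses)
mean = statistics.mean(responses)
sem = statistics.stdev(responses) / n ** 0.5          # standard error of the mean

print(f"Mean: {mean:.2f}")
print(f"Rough 95% interval: {mean - 2 * sem:.2f} to {mean + 2 * sem:.2f}")
# With ~25 respondents the interval is roughly +/-0.4 of a point wide, so the
# second decimal place (and often the first) is noise -- and that is before
# asking whether averaging an ordinal Excellent/Good/Fair/Poor scale means
# anything in the first place.
```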

If you also use these numbers to decide whether to promote me, you’ve gone off into the twilight zone where the presence of numbers gives the illusion of facticity and objectivity. Might as well utter some incantations while you are at it.

Some new research adds another piece of evidence to the claim that the validity of the numbers in student evaluations of teachers is probably pretty low. Validity means “do they measure what you think they measure?” The answer here is that they do not. Instead, they measure things like “what gender is your instructor?” and “what kind of grade do you expect in this course?”

These researchers even found gender differences in ratings of objective practices, like how promptly assignments were graded, and these differences persisted when students were misinformed about the gender of their instructors.

Let’s start implementing a policy we can have some respect for. No more averaging. No more use of numerical scores in personnel review. No more batteries of questions that ask more or less the same thing (thus distorting the positivity or negativity of the overall impression).
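For what it’s worth, here is a minimal sketch of what “no more averaging” could look like in a report, using hypothetical counts:

```python
# Minimal sketch of reporting a distribution instead of an average.
# Labels and counts are hypothetical.
from collections import Counter

responses = ["Excellent"] * 9 + ["Good"] * 10 + ["Fair"] * 4 + ["Poor"] * 2
counts = Counter(responses)

for label in ["Excellent", "Good", "Fair", "Poor"]:
    n = counts[label]
    print(f"{label:>9}: {'#' * n} ({n})")
# A reviewer sees the shape of the responses -- and how few there are --
# rather than a falsely precise two-decimal mean.
```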

As John Oliver asks, “why is this still a thing?”