"But even if they are not valid, they do tell you something…."

Remember, “validity” means “they measure what you think they measure.” “Data driven” can also mean driven right off the side of the road.

From Inside Higher Ed

Zero Correlation Between Evaluations and Learning

New study adds to evidence that student reviews of professors have limited validity.
September 21, 2016 By Colleen Flaherty


A number of studies suggest that student evaluations of teaching are unreliable due to various kinds of biases against instructors. (Here’s one addressing gender.) Yet conventional wisdom remains that students learn best from highly rated instructors; tenure cases have even hinged on it.
What if the data backing up conventional wisdom were off? A new study suggests that past analyses linking student achievement to high student teaching evaluation ratings are flawed, a mere “artifact of small sample sized studies and publication bias.”
“Whereas the small sample sized studies showed large and moderate correlation, the large sample sized studies showed no or only minimal correlation between [student evaluations of teaching, or SET] ratings and learning,” reads the study, in press with Studies in Educational Evaluation. “Our up-to-date meta-analysis of all multi-section studies revealed no significant correlations between [evaluation] ratings and learning.”

House of Cards

A Facebook post called my attention to a neat little article about why swimming rules only recognize hundredths of seconds even though modern timing technology allows much more precise measurements. The gist is this: swimming rules recognize that construction technology limits the precision with which pools can be built to something like a few centimeters in a 50-meter pool. At top speed a swimmer moves about 2 millimeters in a thousandth of a second. So, if you award places based on differences of thousandths of a second, you can't know whether you are rewarding faster swimming or the luck of swimming in a shorter lane.
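The arithmetic behind that claim is worth making explicit. A quick sketch in Python (the tolerance and speed figures are the article's rough numbers, not exact values):

```python
pool_tolerance_m = 0.03   # construction tolerance: roughly 3 cm over a 50 m lane
swimmer_speed_mps = 2.0   # rough top speed of an elite swimmer, meters per second

# distance covered in one thousandth of a second
dist_per_ms = swimmer_speed_mps * 0.001
print(dist_per_ms)  # 0.002 m, i.e. 2 mm

# how much time a 3 cm difference in lane length is worth
time_equiv_s = pool_tolerance_m / swimmer_speed_mps
print(time_equiv_s)  # 0.015 s, fifteen times the "precision" of a thousandth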

This observation points to the more general phenomena of false precision, misplaced concreteness (aka reification, hypostatization), and organizational irrationality rooted in sloppy and abusive quantification.

These are endemic in higher education.

Students graduate with a GPA and it's taken as a real, meaningful thing. But if you look at what goes into it (exams designed more or less well, subjective letter grades on essays, variable "points off" for rule infractions, quirky weighting of assignments, arbitrary conversions of points to letter grades, curves, etc.), you'd have to allow for error bars the size of a city block.

Instructors fret about average scores on teaching evaluations.

“Data driven” policies are built around the analysis of tiny-N samples that are neither random nor representative.

Courses are fielded or not and faculty lines granted or not based on enrollment numbers with no awareness of the contribution of class scheduling, requirement finagling, course content overlap, perceptions of ease, and the wording of titles.

Budgets are built around seat-of-the-pants estimates and negotiated targets.

One could go on.

The bottom line is that decision makers need to recognize how all of these shaky numbers are aggregated to produce what they think are facts about the institution and its environment. This suggests two imperatives. First, we should reduce individual cases of crap quantification. Second, when we bring "facts" together (e.g., enrollment estimates and cost of instruction) we should adopt an "error bar" sensibility – in its simplest form, treat any number as being "likely between X and Y" – so that each next step is attended by an appropriate amount of uncertainty rather than an inappropriate amount of fantasized certainty.
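The "likely between X and Y" habit can be made concrete with crude interval arithmetic. A minimal sketch in Python, where the enrollment and revenue figures are made-up illustrations, not institutional data:

```python
def interval_mul(a, b):
    """Multiply two 'likely between X and Y' intervals (all-positive case)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return (a_lo * b_lo, a_hi * b_hi)

enrollment = (80, 120)              # "likely between 80 and 120 students"
revenue_per_student = (900, 1100)   # "likely between $900 and $1,100"

print(interval_mul(enrollment, revenue_per_student))  # (72000, 132000)
```

Notice that the uncertainty compounds: two modest-looking ranges combine into a projection that spans $72,000 to $132,000, which is a very different planning posture than a single confident point estimate.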

Student Evaluations of Teaching: WHY is this still a thing?

My institution just created a data science major. But it doesn’t care about using data in honest and robust ways any more than other institutions.

It’s gotten to the point that it’s intellectually embarrassing and ethically troubling that we are still using student evaluations of teaching (SET) in their current form for assessing instructor job performance. It is laughable that we do so with numbers computed to two decimal places. It is scandalous that we ignore the documented biases (most especially gender-based). But we do.

Why isn’t this an active conversation between faculty and administrators?  I certainly find teaching evaluations helpful – trying to understand why I got a 3.91 on course organization but a 4.32 on inspiring interest is a useful meditation on my teaching practice.  I have to remind myself that the numbers themselves do not mean much.

Telling me where my numbers stand vis-à-vis my colleagues or the college as a whole FEELS useful and informative, but is it? I THINK I must be doing a better job than a colleague who has scores in the 2.0 – 3.0 range. But doing a better job at what? If you think hard about it, all you can probably take to the bank is that I am better at getting more people to say "Excellent" in response to a particular question. The connection between THAT student behavior and the quality of my work is a loose one.

Maybe I am on solid ground when I compare my course organization score to my inspires interest score. MAYBE I am on solid ground when I compare my course organization score in one class to the same score in another the same semester or the same class in another year. I might, for example, think about changes I could make in how I organize a course and then see if that score moves next semester.

But getting seduced by the second decimal place is ludicrous and mad. Even fetishizing the first decimal place is folly. For that matter, even treating this as an average to begin with is bogus.
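One reason treating these as averages is dubious: an average collapses very different response distributions onto the same number. The ratings below are invented for illustration:

```python
from statistics import mean

polarizing = [5, 5, 5, 1, 1, 1]   # half the class loves it, half hates it
middling   = [3, 3, 3, 3, 3, 3]   # everyone is lukewarm

# Both "score" exactly the same, though they describe very different courses.
print(mean(polarizing), mean(middling))  # both average to 3
```

A single mean cannot distinguish a polarizing course from a uniformly mediocre one, yet the mean is the number that ends up in personnel files.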

If you also use these numbers to decide whether to promote me, you’ve gone off into the twilight zone where the presence of numbers gives the illusion of facticity and objectivity. Might as well utter some incantations while you are at it.

Some new research adds another piece of evidence to the claim that the validity of the numbers in student evaluations of teachers is probably pretty low. Validity means “do they measure what you think they measure?” The answer here is that they do not. Instead, they measure things like “what gender is your instructor?” and “what kind of grade do you expect in this course?”

These researchers even found gender differences in ratings of objective practices, such as how promptly assignments were graded, and these differences persisted even when students were misinformed about the gender of their instructors.

Let’s start implementing a policy we can have some respect for. No more averaging. No more use of numerical scores in personnel review. No more batteries of questions that ask more or less the same thing (thus distorting the positivity or negativity of the overall impression).

As John Oliver asks, “why is this still a thing?”

Rubrics, Disenchantment, and Analysis I

There is a tendency, in certain precincts in and around higher education, to fetishize rubrics. One gets the impression at conferences and from consultants that arranging something in rows and columns with a few numbers around the edges will call forth the spirit of rational measurement, science even, to descend upon the task at hand. That said, one can acknowledge the heuristic value of rubrics without succumbing to a belief in their magic. Indeed, the critical examination of almost any of the higher education rubrics in current circulation will quickly disenchant, but one need not abandon all hope: if assessment is "here to stay," as some say, it need not be the intellectual train wreck its regional and national champions sometimes seem inclined to produce.

Consider this single item from a rubric used to assess a general education goal in gender:

As is typical of rubric cell content, each of these is “multi-barrelled” — that is, the description in each cell is asking more than one question at a time. It’s not unlike a survey in which respondents are asked, “Are you conservative and in favor of ending welfare?”  It’s a methodological no-no, and, in general, it defeats the very idea of dis-aggregation (i.e., “what makes up an A?”) that a rubric is meant to provide.

In addition, rubrics presented like this are notoriously hard to read. That's not just an aesthetic issue — failure to communicate effectively leads to misuse of the rubric (measurement error) and reduces the likelihood of effective constructive critique.

Here is the same information presented in a manner that’s more methodologically sound and more intellectually legible:

At the risk of getting ahead of ourselves, there IS a serious problem when these rank ordered categories are used as scores that can be added up and averaged, but we’ll save that for another discussion.  Too, there is the issue of operationalization — what does “deep” mean, after all, and how do you distinguish it from not so deep?  But this too is for another day.

Let's, for the sake of argument, assume that each of these judgments can be made reliably by competent judges. All told, 4 separate judgments are to be made and each has 3 values. If these knowledges and skills are, in fact, independent (if not, a whole different can of worms), then there are 3 x 3 x 3 x 3 = 81 combinations of ratings possible. Each of these 81 possible assessments is eventually mapped onto 1 of 4 ratings. Four combinations are specified, but the other 77 possibilities are not:

Now let us make a (probably invalid) assumption: that each of THESE scores is worth 1, 2, or 3 "points," and then let's calculate the distance between each of the four categories. We use standard Euclidean distance in four dimensions, d = sqrt((a1-b1)^2 + (a2-b2)^2 + (a3-b3)^2 + (a4-b4)^2), with the categories scored as: Mastery = (3, 3, 3, 3), Practiced = (2, 2, 2, 3), Introduced = (2, 2, 2, 2), Benchmark = (1, 1, 1, 1)
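Both the counting and the distance calculations can be checked with a short sketch in Python, using the score vectors just listed:

```python
from itertools import combinations, product
from math import sqrt

# 4 independent judgments, each taking one of 3 values
print(len(list(product([1, 2, 3], repeat=4))))  # 81 possible rating profiles

scores = {
    "Mastery":    (3, 3, 3, 3),
    "Practiced":  (2, 2, 2, 3),
    "Introduced": (2, 2, 2, 2),
    "Benchmark":  (1, 1, 1, 1),
}

def euclidean(a, b):
    """Standard Euclidean distance between two score vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    print(f"{name_a} - {name_b}: {euclidean(a, b):.2f}")
# Mastery - Practiced:     1.73
# Mastery - Introduced:    2.00
# Mastery - Benchmark:     4.00
# Practiced - Introduced:  1.00
# Practiced - Benchmark:   2.65
# Introduced - Benchmark:  2.00
```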


So, how do these categories spread out along the dimension we are measuring here? Mastery, Introduced, and Benchmark are nicely spaced, 2 units apart (and M to B at 4 units). But then we try to fit P in. It's about 1.7 units from Mastery and about 2.6 from Benchmark, but it's also just 1 unit from Introduced. To represent these distances we have to locate it off to the side.
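To see exactly where Practiced has to sit, place Benchmark, Introduced, and Mastery on a line and solve for the point at the right distances from each. A sketch; the coordinates are just one convenient embedding, not anything intrinsic to the rubric:

```python
from math import sqrt

# Put the collinear categories on the x-axis:
# Benchmark at (0, 0), Introduced at (2, 0), Mastery at (4, 0).
d_PB = sqrt(7)  # Practiced-Benchmark distance, about 2.6
d_PI = 1.0      # Practiced-Introduced distance

# Intersect the circle of radius d_PB around (0, 0) with the circle
# of radius d_PI around (2, 0): standard two-circle intersection.
x = (d_PB**2 - d_PI**2 + 2**2) / (2 * 2)
y = sqrt(d_PB**2 - x**2)
print(round(x, 2), round(y, 2))  # 2.5 0.87 -- Practiced sits off the line
```

As a check, the distance from this point to Mastery at (4, 0) comes out to sqrt(3) ≈ 1.73, matching the pairwise distances, so the four categories genuinely require two dimensions.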

This little exercise suggests that this line of the rubric is measuring two dimensions.

This should provoke us into thinking about what dimensions of learning are being mixed together in this measurement operation.

It is conventional in this sort of exercise to try to characterize the dimensions along which the items are spread out. Looking back at how we defined the categories, we might speculate that one dimension has to do with skill (analysis) and the other with knowledge. But Mastery and Practiced were on the same level on analysis. What do we do?

It turns out that the orientation of a diagram like this is arbitrary — all it is showing us is relative distance. And so we can rotate it like this to show how our assessment categories for this goal relate to one another.

Now you may ask: what was the point of this exercise? First, if the point of assessment is to get teachers to think about teaching and learning, and to do so in a manner that applies the same sort of critical thinking skills we think are important for students to acquire, then a careful critique of our assessment methods is absolutely necessary.

Second, this little bit of quick-and-dirty analysis of a single rubric might actually help people design better rubrics AND assess the quality of existing rubrics (there's lots more to worry about on these issues, but that's for another time). Maybe, for example, we might conceptualize "Introduced" to include knowledge but not skill, or vice versa? Maybe we'd think about whether the skill (analysis) is something that should cross GE categories and be expressed in common language. And so on.

Third, this is a first step toward showing why it makes very little sense to take the scores produced by using rubrics like this and then adding them up and averaging them out in order to assess learning.  That will be the focus of a subsequent post.