Assessing Assessment?

The appalling legacy of “assessment” goes on and on and on. This “frank discussion” at a recent WASC conference is a classic bit of “too little, too late.”

I’m someone keenly interested in the organizational aspects of higher education, especially in questions of how we know we are being as effective and as productive as the world needs us to be. But for most of my career I’ve watched millions of person-dollars squandered on misguided efforts to “document” and “measure” learning. Alongside that I’ve watched the erosion of the intellectual integrity of institutions and individuals as they winked and went through the motions of methods they knew (or should have known) were bogus and would never produce actionable, valid knowledge. We watched as individual faculty members sold their souls for small stipends or to keep on the good side of a dean who might have input into their tenure or promotion case. And those of us who dared to apply our professional training to point out the inanity of the methodological manure being sold to us endured being dressed down for not being team players or having our commitment to students questioned by arrogant small-minded assessment consultants.

A real underlying pathology exposed by the ongoing assessment debacle is the monopoly power of the accreditation agencies. For the last two decades they ranted about accountability in higher education – the one standard they would never have to meet. The hypocrisy of agencies like WASC being immune to serious criticism should be an embarrassment to people who care about higher education.

The simple move of forcing national education accreditation agencies to compete rather than allowing them to enjoy geography-based monopolies would do more for higher education than a thousand conference presentations from people who live off the problem rather than for its solution.

"But even if they are not valid, they do tell you something…."

Remember, “validity” means “they measure what you think they measure.” “Data driven” can also mean driven right off the side of the road.

From Inside Higher Ed

Zero Correlation Between Evaluations and Learning

New study adds to evidence that student reviews of professors have limited validity.
September 21, 2016 By Colleen Flaherty


A number of studies suggest that student evaluations of teaching are unreliable due to various kinds of biases against instructors. (Here’s one addressing gender.) Yet conventional wisdom remains that students learn best from highly rated instructors; tenure cases have even hinged on it.
What if the data backing up conventional wisdom were off? A new study suggests that past analyses linking student achievement to high student teaching evaluation ratings are flawed, a mere “artifact of small sample sized studies and publication bias.”
“Whereas the small sample sized studies showed large and moderate correlation, the large sample sized studies showed no or only minimal correlation between [student evaluations of teaching, or SET] ratings and learning,” reads the study, in press with Studies in Educational Evaluation. “Our up-to-date meta-analysis of all multi-section studies revealed no significant correlations between [evaluation] ratings and learning.”
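The mechanism the study describes (small samples plus publication bias) can be illustrated with a toy simulation. This is only a sketch under assumed conditions, not the study's method: ratings and learning are generated as completely unrelated random numbers, a "small" study has 8 sections, a "large" one has 200, and the 0.5 cutoff for an "impressive" result is arbitrary.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_studies(n_sections, n_studies, rng):
    """Correlations from studies where ratings and learning are truly unrelated."""
    results = []
    for _ in range(n_studies):
        ratings = [rng.gauss(0, 1) for _ in range(n_sections)]
        learning = [rng.gauss(0, 1) for _ in range(n_sections)]
        results.append(pearson_r(ratings, learning))
    return results

rng = random.Random(2016)  # fixed seed so the sketch is repeatable
small = simulate_studies(n_sections=8, n_studies=2000, rng=rng)
large = simulate_studies(n_sections=200, n_studies=2000, rng=rng)

# Crude stand-in for publication bias: only impressive positive results get reported.
published_small = [r for r in small if r > 0.5]

print(f"spread of r, small studies: {statistics.pstdev(small):.2f}")
print(f"spread of r, large studies: {statistics.pstdev(large):.2f}")
print(f"small studies with r > 0.5: {len(published_small)} of {len(small)}")
```

Even with a true correlation of zero, the small studies scatter widely enough that a noticeable share clear the cutoff, while the large studies cluster near zero.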

House of Cards

A Facebook post called my attention to a neat little article about why swimming rules only recognize hundredths of seconds even though modern timing technology allows much more precise measurements. The gist is this: swimming rules recognize that construction technology limits the precision with which pools can be built to something like a few centimeters in a 50 meter long pool.  At top speed a swimmer moves about 2 millimeters in a thousandth of a second.  So, if you award places based on differences of thousandths of a second, you can’t know if you are rewarding faster swimming or the luck of swimming in a shorter lane.
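The arithmetic behind the rule is easy to check. In this sketch the 2 m/s speed follows from the "2 millimeters in a thousandth of a second" figure, and the 3 cm lane tolerance is my assumed stand-in for "a few centimeters."

```python
# Assumed figures: ~2 m/s top speed (2 mm per millisecond, as above)
# and a 3 cm lane-length tolerance standing in for "a few centimeters".
SPEED_M_PER_S = 2.0
LANE_TOLERANCE_M = 0.03

def distance_covered(seconds: float) -> float:
    """Distance in meters a swimmer covers in the given time."""
    return SPEED_M_PER_S * seconds

def time_equivalent(meters: float) -> float:
    """Time in seconds needed to swim the given distance."""
    return meters / SPEED_M_PER_S

mm_per_thousandth = distance_covered(0.001) * 1000
ms_hidden_by_tolerance = time_equivalent(LANE_TOLERANCE_M) * 1000

print(f"Distance swum in 0.001 s: {mm_per_thousandth:.1f} mm")
print(f"Time equivalent of lane tolerance: {ms_hidden_by_tolerance:.0f} ms")
```

Under these assumptions, a lane that is 3 cm short is worth about 15 thousandths of a second, far more than the thousandths a modern timer could resolve.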

This observation points to the more general phenomena of false precision, misplaced concreteness (aka reification, hypostatization), and organizational irrationality rooted in sloppy and abusive quantification.

These are endemic in higher education.

Students graduate with a GPA and it’s taken as a real, meaningful thing. But if you look at what goes into it (exams of uneven design quality, subjective letter grades on essays, variable “points off” for rule infractions, quirky weighting of assignments, arbitrary conversions of points to letter grades, curves, etc.), you’d have to allow for error bars the size of a city block.

Instructors fret about average scores on teaching evaluations.

“Data driven” policies are built around the analysis of tiny-N samples that are neither random nor representative.

Courses are fielded or not and faculty lines granted or not based on enrollment numbers with no awareness of the contribution of class scheduling, requirement finagling, course content overlap, perceptions of ease, and the wording of titles.

Budgets are built around seat-of-the-pants estimates and negotiated targets.

One could go on.

The bottom line is that decision makers need to recognize how all of these shaky numbers are aggregated to produce what they think are facts about the institution and its environment. This suggests two imperatives. First, we should reduce individual cases of crap quantification. Second, when we bring “facts” together (e.g., enrollment estimates and cost of instruction) we should adopt an “error bar” sensibility – in its simplest form, treat any number as being “likely between X and Y” – so that each next step is attended by an appropriate amount of uncertainty rather than an inappropriate amount of fantasized certainty.
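Here is one minimal way the “likely between X and Y” habit could look in practice. The `Estimate` class and the enrollment and cost figures are hypothetical, invented purely for illustration; simple interval arithmetic is only one of several ways to carry uncertainty forward.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Estimate:
    """A number carried as 'likely between low and high'."""
    low: float
    high: float

    def __add__(self, other: "Estimate") -> "Estimate":
        return Estimate(self.low + other.low, self.high + other.high)

    def __mul__(self, other: "Estimate") -> "Estimate":
        # Check all corner products so the result holds even for negative bounds.
        corners = [a * b for a in (self.low, self.high)
                   for b in (other.low, other.high)]
        return Estimate(min(corners), max(corners))

    def relative_width(self) -> float:
        """Width of the range as a fraction of its midpoint."""
        mid = (self.low + self.high) / 2
        return (self.high - self.low) / mid

# Hypothetical inputs, for illustration only.
enrollment = Estimate(900, 1100)              # students
cost_per_student = Estimate(18_000, 22_000)   # dollars of instruction

total_cost = enrollment * cost_per_student
print(total_cost)
print(f"{enrollment.relative_width():.0%} -> {total_cost.relative_width():.0%}")
```

Each input here is uncertain by roughly 10 percent either way, and their product is uncertain by roughly 20 percent either way. The uncertainty grows at every step of aggregation, which is exactly what a single “fact” in a budget spreadsheet hides.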

Student Evaluations of Teaching: WHY is this still a thing?

My institution just created a data science major. But it doesn’t care about using data in honest and robust ways any more than other institutions.

It’s gotten to the point that it’s intellectually embarrassing and ethically troubling that we are still using student evaluations of teaching (SET) in their current form for assessing instructor job performance. It is laughable that we do so with numbers computed to two decimal places. It is scandalous that we ignore the documented biases (most especially gender-based). But we do.

Why isn’t this an active conversation between faculty and administrators?  I certainly find teaching evaluations helpful – trying to understand why I got a 3.91 on course organization but a 4.32 on inspiring interest is a useful meditation on my teaching practice.  I have to remind myself that the numbers themselves do not mean much.

Telling me where my numbers stand vis a vis my colleagues or the college as a whole FEELS useful and informative, but is it? I THINK I must be doing a better job than a colleague who has scores in the 2.0 – 3.0 range. But doing a better job at what? If you think hard about it, all you can probably take to the bank is that I am better at getting more people to say “Excellent” in response to a particular question. The connection between THAT student behavior and the quality of my work is a loose one.

Maybe I am on solid ground when I compare my course organization score to my inspires interest score. MAYBE I am on solid ground when I compare my course organization score in one class to the same score in another the same semester or the same class in another year. I might, for example, think about changes I could make in how I organize a course and then see if that score moves next semester.

But getting seduced by the second decimal place is ludicrous and mad. Even fetishizing the first decimal place is folly. For that matter, even treating this as an average to begin with is bogus.
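A quick sketch shows why. The two sets of responses below are made up (20 students each, 1 to 5 scale); they are not real evaluation data.

```python
from collections import Counter
from statistics import fmean

# Hypothetical responses on a 1-5 scale from two classes of 20 students.
polarized = [5] * 12 + [4] * 5 + [1] * 3   # many fans, a few very unhappy students
consistent = [4] * 17 + [5] * 3            # nearly everyone says "good"

def summarize(label, responses):
    """Print the mean next to the full distribution of responses."""
    counts = Counter(responses)
    distribution = {score: counts.get(score, 0) for score in range(1, 6)}
    print(f"{label}: mean={fmean(responses):.2f} distribution={distribution}")

summarize("polarized", polarized)
summarize("consistent", consistent)

# One student out of 20 answering 5 instead of 4 moves the mean by 0.05.
one_changed = polarized.copy()
one_changed[one_changed.index(4)] = 5
shift = fmean(one_changed) - fmean(polarized)
print(f"shift from one changed response: {shift:.2f}")
```

Both classes average 4.15, yet one has three students who were deeply unhappy and the other has none. And in a class of 20, a single student's mood on evaluation day is enough to move the second decimal place.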

If you also use these numbers to decide whether to promote me, you’ve gone off into the twilight zone where the presence of numbers gives the illusion of facticity and objectivity. Might as well utter some incantations while you are at it.

Some new research adds another piece of evidence to the claim that the validity of the numbers in student evaluations of teachers is probably pretty low. Validity means “do they measure what you think they measure?” The answer here is that they do not. Instead, they measure things like “what gender is your instructor?” and “what kind of grade do you expect in this course?”

These researchers even found gender differences in ratings of objective practices like “how promptly were assignments graded,” and these persisted when the students were misinformed about the gender of their instructors.

Let’s start implementing a policy we can have some respect for. No more averaging. No more use of numerical scores in personnel review. No more batteries of questions that ask more or less the same thing (thus distorting the positivity or negativity of the overall impression).

As John Oliver asks, “why is this still a thing?”

Real "Competencies" for the 21st Century

Music to my ears. Sarah Lawrence, long known for its innovative approach to liberal arts education (still using narrative evaluations – something we could adopt at Mills to great effect IMHO), crafts a simple response to assessment madness and places it where it should be: at the student-advisor nexus.

Imagine: six goals that are about skill not ideological content; evaluated every semester in every course; tracked over time by student and advisor. Throw all the rest of the baroque apparatus away and get on with educating.

H/T to Mark Henderson

From MarketPlace Education

At Sarah Lawrence College in Bronxville, N.Y., about ten students — all women but one — sit at a round table discussing Jane Austen’s “Northanger Abbey.”

The 88-year-old college has a reputation for doing things differently. Most classes are small seminars like this one. There are no majors. Students do a lot of independent projects. And grades aren’t as important as the long written evaluations professors give every student at the end of every semester. It’s no surprise, then, that professor James Horowitz is skeptical of any uniform college rating system, like the one being proposed by the Obama administration.

“The goals that we are trying to achieve in instructing our students might be very different from what the University of Chicago or many other schools or a state school or a community college might be striving to achieve,” Horowitz says.

The Obama administration is due out this spring with details of its controversial plan to rate colleges on measures like value and affordability. The idea is that if students can compare schools on cost, graduation rates and even how much money students earn after they graduate — colleges might have to step up their game. Especially if, as proposed, poor performers risk losing access to federal financial aid.

All that, naturally, makes colleges just a bit nervous. Sarah Lawrence is fighting back with its own way of measuring value. The faculty came up with six abilities they think every Sarah Lawrence graduate should have. They include the ability to write and communicate effectively, to think analytically, and to accept and act on critique.

“We don’t believe that there’s like 100 things you should know when you graduate,” says computer science professor Michael Siff, who helped develop the tool. “It’s much more about are you a good learner? Do you know how to enter into a new domain and attack it with an open mind, but also an organized mind?”

Faculty advisors can use the results to track students’ progress over time and help them address any weaknesses. A student who’s struggling with communication could take a class with a lot of oral presentations, for example, or make an appointment at the campus writing center.

But Siff says the tool is also about figuring out what the college can do better.
“This tool will allow us to assess ourselves as an institution,” he says. “Are we imparting what we believe to be these critical abilities?”

So how is the school doing? So far there are only data for two semesters, but on every measure seniors do better than juniors. Sophomores do better than freshmen.

Starting next fall, advisors will meet with their students at the beginning of each semester to talk over their progress. In sort of a trial run, Siff goes over the results so far with one of his advisees, junior Zachary Doege.

On a scale from “not yet developed” to “excellent,” he’s mostly at the top end. Doege says he likes seeing his own growth.

“I think the thing I like the most about this is just the fact that I can look back at how I was doing in previous semesters and sort of chart my own progress,” he says. “Not comparing me towards other students—just me to myself.”

That’s a different measure of the value of an education than, say, student loan debt or earnings after graduation — the sorts of things the Obama administration is considering as part of its ratings plan. Students and parents are right to ask if they’re getting their money’s worth, says the college’s president, Karen Lawrence. After financial aid, the average cost of a Sarah Lawrence education is almost $43,000 a year.

“People are worried about cost,” Lawrence says. “We understand that.”

And they’re worried about getting jobs after graduation. But she says the abilities that the new assessment measures—critical thinking and innovation and collaboration—are the same ones employers say they’re looking for.

“We think these are abilities that students are going to need both right after graduation and in the future, and so it could be an interesting model.”

One she hopes other schools will take a look at as they figure out how to answer the national debate about the value of college.

Does Assessment Work?

A short commentary on assessment from a respected sociologist who has done a big assessment project funded by the Mellon Foundation and who served several years on Middle States (the WASC of the mid-Atlantic region).  Chambliss spoke at Mills in ~2005.

Evaluating and Assessing Short Intensive Courses

Two articles on the topic of assessing and evaluating short, intensive courses.  Most of the results appear positive in terms of learning outcomes, but there are a number of factors associated with variations in outcomes that appear worth paying attention to.

Using a database of over 45,000 observations from Fall, Spring, and Summer semesters, we investigate the link between course length and student learning. We find that, after controlling for student demographics and other characteristics, intensive courses do result in higher grades than traditional 16 week semester length courses and that this benefit peaks at about 4 weeks. By looking at future performance we are also able to show that the higher grades reflect a real increase in knowledge and are not the result of a “lowering of the bar” during summer. We discuss some of the policy implications of our findings.


Altogether, we found roughly 100 publications that, in varying degrees, addressed intensive courses. After reviewing the collective literature, we identified four major lines of related inquiry: 1) time and learning studies; 2) studies of educational outcomes comparing intensive and traditional formats; 3) studies comparing course requirements and practices between intensive and traditional 

Scott and Conrad finish their literature review with several sets of open research questions suggested by their research:


  1. How do course requirements and faculty expectations of students compare between intensive and traditional formats and, if different, how does this affect the learning environment and student learning outcomes?
  2. How do students’ study patterns compare between intensive and traditional length courses?

Learning Outcomes

  1. How do pedagogical approaches compare between intensive and traditional length courses and, if different, how do these variations affect learning?
  2. How does the amount of time-on-task (i.e., productive class time) compare between intensive and traditional-length courses?
  3. How do stress and fatigue affect learning in intensive courses?
  4. Are intensive courses intrinsically rewarding and if so, how does that affect the classroom experience and learning outcomes?
  5. How do the immediate (short-term) and long-term learning outcomes compare between intensive and traditional-length courses?
  6. How do different student groups compare in their ability to learn under intensive conditions? For example, do older and younger students learn equally well in intensive courses?
  7. How does the degree of intensity influence student achievement? Do three week courses yield equivalent results to eight-week courses?
  8. How does the subject matter influence outcomes in intensive courses?
  9. Which kinds and levels of learning are appropriate for intensive formats?
  10. How do course withdrawals and degree completion rates compare between students who enroll in intensive versus traditional courses?
  11. How do intensive courses influence a student’s attitude toward learning?

Optimizing Factors and Conditions

  1. What disciplines and types of courses are best suited for intensive formats?
  2. What type of students are best suited for intensive formats?
  3. What types of pedagogical styles and instructional practices are best suited for intensive formats? Must teaching strategies change for intensive courses to be effective?
  4. Can certain instructional practices optimize learning?
  5. Do learning strategies differ between intensive and traditional-length courses and if so, can students effectively “learn how to learn” in time compressed formats? In other words, can students be taught effective learning strategies for intensive courses that would enhance achievement outcomes?

See Also

John V. Kucsera & Dawn M. Zimmaro, “Comparing the Effectiveness of Intensive and Traditional Courses,” College Teaching, Volume 58, Issue 2, 2010, pages 62-68

What If Administrator Pay Were Tied to Student Learning Outcomes?

The recent negotiation in Chicago (“Performance Pay for College Faculty”) of a tie between student performance and college instructor pay brought this accolade from an administrator: it gets faculty “to take a financial stake in student success.”

It got me wondering why we don’t hear more about directly tying administrator pay to student success.  If we did, I’ll bet the students would have a lot more success.  At least, that’s what the data released to the public (and Board of Trustees) would show.  There’d be far less of a crisis in higher education.

Thought experiment: what would happen if we were to tie administrator pay to student success, much the way corporate CEOs have their pay packages designed, especially for administrators of large multi-campus systems?

Prediction 1.  The immediate response to the very proposal would be “oh, no, you can’t do that because we do not have the same kind of authority to hire and fire and reward and punish that a corporate CEO has.”  But think about this…

  1. Private sector management has a lot less flexibility than those looking in from the outside think.  Almost all of the organizational impediments to simple, rational management are endemic to all organizations.
  2. Leadership is not primarily about picking the members of your team. It’s about what you manage to get the team you have to accomplish.
  3. Educational administrators do not start the job ignorant of how these educational institutions work. It is tremendously disingenuous to say “if only I had a different set of tools.”  People who do not think they can manage with the tools available and within the culture as it exists should not take these jobs in the first place.
  4. This, it turns out, is what some people mean when they say that schools should be run like a business. The first impulse of unsuccessful leaders is to blame the led. The second one is to engage in organizational sector envy: “if I had the tools they have over in X industry….”  What this ignores is the obvious evidence that others DO succeed in your industry with your tools.  And plenty of leaders “over there” fail too.  It is not the tools’ fault.

Prediction 2.  Learning would be redefined in terms of things produced by inputs administrators had more control over.  And resources would flow in that direction too.

Prediction 3. Administrators would get panicky when they looked at the rubrics in the assessment plans they exhort faculty to participate in and that are included in reports they have signed off on for accreditation agencies.  They’d suddenly start hearing the critics who raise questions about methodologies.  They would start to demand that smart ideas should drive the process and that computer systems should accommodate good ideas rather than being a reason for implementing bad ones.

Prediction 4. In some cases it would motivate individuals to start really thinking “will this promote real learning for students” each time they make a decision.  And they’ll look carefully at all that assessment data they’ve had the faculty produce and mutter, “damned if I know.”

Prediction 5. Someone will argue that the question is moot because administrators are already held responsible for institutional learning outcomes.   Someone else will say “Plus ça change, plus c’est la même chose.”

Better Teaching Through a Financial Stake in the Outcome

In an Inside Higher Ed article this week (“Performance Pay for College Faculty”), K. Basu and P. Fain describe how the new contract signed between City Colleges of Chicago and a union representing 459 adult education instructors links pay raises to student outcomes.

Administrators lauded the move in part because it gets faculty “to take a financial stake in student success.” The details of the plan are not clear from the article, but the basic framework is to use student testing to determine annual bonus pay for groups of instructors working in various areas. That is, in this particular plan it does not sound like the incentive pay is at the level of individual instructors.

Still, should the rest of higher education be paying attention? Adult education at CCC is, after all, a markedly different beast than full time liberal arts institutions or 4 year state schools or research universities. One reason we should is that the tendency to elide institutional differences is precisely one of the hallmarks of the style of thought endemic among some higher education “reformers.” Those who think it’s a good idea for adult education institutions are likely to champion it elsewhere.

But most germane for the subject of this blog is the question of what data would inform such pay for performance decisions when they are proposed for other parts of American higher education. Likely it will be something that grows out of what we now know as learning assessment. I ask the reader: given what you have seen of assessment of learning outcomes in your college, how do you feel about having decisions about your pay check based upon it?

But, your opinion aside, there are several fundamental questions here. One is whether you become a more effective teacher by having a financial stake in the outcome. The industry where this incentive logic has been most extensively deployed is probably the financial services industry, especially investment banking.  How has that worked for society?  It would be easy to cook up scary stories of how this could distort the education process, but that’s not even necessary to debunk the idea.  The amounts at play in the teacher pay realm are so small that one can barely imagine even a nudge effect on how people approach their work.

But what about the data? Consider the prospect of assessment as we know it as input to ANY decision process, let alone personnel decisions. Anyone who has spent any time at all looking at how assessment is implemented knows that the error bars on any datum emerging from it dwarf the underlying measurement. The conceptual framework is thrown together on the basis of a dubious theoretical model of teaching and learning and forced collaboration between instructors and assessment professionals. The process sacrifices methodological rigor in the name of pragmatism, a culture of presentation (vis a vis accreditation agencies), and the tail of design limitations of software systems that wags the dog of pedagogy and common sense. At every step of the process information is lost and distorted. But it seems that the more Byzantine that process is, the more its champions think they have scientific fact as product.

It could well be that the arrangement agreed to in Chicago will lead to instructors talking to one another about teaching, coordinating their classroom practices, and all sorts of other things that might improve the achievements of their students.  But it will likely be a rather indirect effect via the social organization of teachers (if I understood the article, the good thing about the Chicago plan is that it rewards entire categories of instructors for the aggregate improvement).  To sell it at the level of individual incentive is silly and misleading.  And, if we think more broadly about higher education, the notion that you can take the kinds of discrimination you get from extremely fuzzy data and multiply it by tiny amounts of money to produce positive change at the level of the individual instructor is probably best called bad management 101.

Rubrics, Disenchantment, and Analysis I

There is a tendency, in certain precincts in, and around, higher education, to fetishize rubrics. One gets the impression at conferences and from consultants that arranging something in rows and columns with a few numbers around the edges will call forth the spirit of rational measurement, science even, to descend upon the task at hand. That said, one can acknowledge the heuristic value of rubrics without succumbing to a belief in their magic. Indeed, the critical examination of almost any of the higher education rubrics in current circulation will quickly disenchant, but one need not abandon all hope: if assessment is “here to stay,” as some say, it need not be the intellectual train wreck its regional and national champions sometimes seem inclined to produce.

Consider this single item from a rubric used to assess a general education goal in gender:

As is typical of rubric cell content, each of these is “multi-barrelled” — that is, the description in each cell is asking more than one question at a time. It’s not unlike a survey in which respondents are asked, “Are you conservative and in favor of ending welfare?”  It’s a methodological no-no, and, in general, it defeats the very idea of dis-aggregation (i.e., “what makes up an A?”) that a rubric is meant to provide.

In addition, rubrics when they are presented like this are notoriously hard to read. That’s not just an aesthetic issue: failure to communicate effectively leads to misuse of the rubric (measurement error) and reduces the likelihood of effective constructive critique.

Here is the same information presented in a manner that’s more methodologically sound and more intellectually legible:

At the risk of getting ahead of ourselves, there IS a serious problem when these rank ordered categories are used as scores that can be added up and averaged, but we’ll save that for another discussion.  Too, there is the issue of operationalization — what does “deep” mean, after all, and how do you distinguish it from not so deep?  But this too is for another day.

Let’s, for the sake of argument, assume that each of these judgments can be made reliably by competent judges. All told, 4 separate judgments are to be made and each has 3 values. If these knowledges and skills are, in fact, independent (if not, a whole different can of worms), then there are 3 x 3 x 3 x 3 = 81 combinations of ratings possible. Each of these 81 possible assessments is eventually mapped onto 1 of 4 ratings. Four combinations are specified, but the other 77 possibilities are not.
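The count is easy to verify by brute force. In this sketch the four named combinations use the point values assigned in this post; which of the four judgments carries the lone 3 in “Practiced” is my assumption, since the order of the judgments is not fixed here.

```python
from itertools import product

LEVELS = (1, 2, 3)   # three possible values per judgment
N_JUDGMENTS = 4      # four separate judgments in this rubric line

# The four combinations the rubric names, using the point values assumed in this post.
# The position of the 3 in "Practiced" is an assumption.
SPECIFIED = {
    (3, 3, 3, 3): "Mastery",
    (2, 2, 2, 3): "Practiced",
    (2, 2, 2, 2): "Introduced",
    (1, 1, 1, 1): "Benchmark",
}

all_combinations = list(product(LEVELS, repeat=N_JUDGMENTS))
unspecified = [combo for combo in all_combinations if combo not in SPECIFIED]

print(len(all_combinations), "possible combinations")
print(len(unspecified), "with no rating assigned, for example", unspecified[:3])
```

A student judged 3 on two dimensions and 1 on the other two, for instance, has no home in the rubric at all; the rater has to improvise, and that improvisation is where measurement error creeps in.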

Now let us make a (probably invalid) assumption: that each of THESE scores is worth 1, 2 or 3 “points” and then let’s calculate the distance between each pair of the four ratings. We use standard Euclidean distance, $r = \sqrt{x^2 + y^2}$ in two dimensions, extended here to four coordinates, with the categories being: Mastery = 3 3 3 3, Practiced = 2 2 2 3, Introduced = 2 2 2 2, Benchmark = 1 1 1 1


So, how do these categories spread out along the dimension we are measuring here? Mastery, Introduced, and Benchmark are nicely spaced, 2 units apart (and M to B at 4 units). But then we try to fit P in. It’s 1.7 units from Mastery and 2.6 from Benchmark, but it’s also 1 unit from Introduced. To represent these distances we have to locate it off to the side.
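For anyone who wants to check the geometry, here is a minimal sketch that computes the pairwise distances from the point values given above and then solves for where Practiced has to sit once the other three are laid out on a line. The variable names are mine; the point values are the (probably invalid) ones already assumed.

```python
from itertools import combinations
from math import dist, sqrt

# Point values as assigned in the text above (itself a "probably invalid" assumption).
CATEGORIES = {
    "Mastery": (3, 3, 3, 3),
    "Practiced": (2, 2, 2, 3),
    "Introduced": (2, 2, 2, 2),
    "Benchmark": (1, 1, 1, 1),
}

# Euclidean distance between every pair of categories.
distances = {
    frozenset((a, b)): dist(CATEGORIES[a], CATEGORIES[b])
    for a, b in combinations(CATEGORIES, 2)
}
for pair, d in distances.items():
    print(" - ".join(sorted(pair)), round(d, 2))

# Put Benchmark at 0 with Introduced and Mastery on the same line, then solve
# for where Practiced must sit to honor its distances to Introduced and Mastery.
x_i = distances[frozenset(("Benchmark", "Introduced"))]
x_m = distances[frozenset(("Benchmark", "Mastery"))]
d_pi = distances[frozenset(("Practiced", "Introduced"))]
d_pm = distances[frozenset(("Practiced", "Mastery"))]

x_p = (d_pi**2 - d_pm**2 + x_m**2 - x_i**2) / (2 * (x_m - x_i))
y_p = sqrt(d_pi**2 - (x_p - x_i) ** 2)
print("Practiced sits at", (round(x_p, 2), round(y_p, 2)))
```

Practiced lands off the line on which the other three sit, which is the second dimension showing itself. Its distance to Benchmark works out to the square root of 7, roughly 2.65, and the two-dimensional layout reproduces that distance as well.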

This little exercise suggests that this line of the rubric is measuring two dimensions.

This should provoke us into thinking about what dimensions of learning are being mixed together in this measurement operation.

It is conventional in this sort of exercise to try to characterize the dimensions in which the items are spread out. Looking back at how we defined the categories we speculate that one dimension might have to do with skill (analysis) and the other knowledge. But Mastery and Practiced were on the same level on analysis. What do we do?

It turns out that the orientation of a diagram like this is arbitrary — all it is showing us is relative distance. And so we can rotate it like this to show how our assessment categories for this goal relate to one another.

Now you may ask: what was the point of this exercise? First, if the point of assessment is to get teachers to think about teaching and learning, and to do so in a manner that applies the same sort of critical thinking skills that we think are important for students to acquire, then a careful critique of our assessment methods is absolutely necessary.

Second, this little bit of quick and dirty analysis of a single rubric might actually help people design better rubrics AND to assess the quality of existing rubrics (there’s lots more to worry about on these issues, but that’s for another time).  Maybe, for example, we might conceptualize “introduce” to include knowledge but not skill or vice versa?  Maybe we’d think about whether the skill (analysis) is something that should cross GE categories and be expressed in common language.  And so on.

Third, this is a first step toward showing why it makes very little sense to take the scores produced by using rubrics like this and then adding them up and averaging them out in order to assess learning.  That will be the focus of a subsequent post.