Data Exhaust and Informational Efficiency

Heard an interesting talk by Paul Kedrosky a few weeks ago at PARC titled Data Exhaust, Ladders, and Search.

The gist of the talk is that human behaviors of all kinds leave traces which constitute latent datasets about that activity. Social scientists have long had a name for gathering this type of data: unobtrusive observation. Perhaps the most famous example is looking at carpet wear in a museum as a way of figuring out which exhibits captured the most visitor attention; another is garbology and the related “trace measures” used by the anthropologist W. Rathje in the 1970s and 80s.

One of Kedrosky’s nicer examples was comparing an aerial view of Wimbledon’s center court at the end of a recent tournament with one from the 1970s. The total disappearance of the net game from professional tennis was clearly visible in the wear patterns on the grass court.

In addition to a number of neat examples of using various techniques to capture “data exhaust” (ladders found on highways as an indicator of the housing bubble was a favorite; indeed, he suggests, it’s the entire principle behind Google), he asks the question: What are the consequences of an instrumented planet? That is, a planet on which more and more data exhaust is captured and analyzed, permitting better decisions and more efficient choices.

In fact, one of the comments on Kedrosky’s blog post about the talk (by one J Slack) suggests a continuing move toward “informational efficiency” — with more and more instrumentation generating data and more and more connectivity, he suggests, “we’ll be continuously approaching an asymptotic efficiency, though never quite getting there.”

A standard definition of informational efficiency is “the speed and accuracy with which prices reflect new information.” But there is some circularity here — in this context it’s only information if it does affect the price; otherwise it’s mere noise. And so we’re still left with the challenge of sorting the signal from the noise even after the data has been extracted from the exhaust. And the more there is of everything, the bigger that job becomes.

Bottom line: I think “data exhaust” is a great concept, but I don’t think perfecting its capture and analysis gets you to a fully efficient use of information about the world (even asymptotically). The second law of thermodynamics kicks in along the way, for starters, but the boundedness of human cognition finishes the job.

Somebody is probably going to point out that evolution already does this (that is, it’s the most unobtrusive data collection method of all).  But it takes big numbers and lots of time to do it and the result, though beautiful, is messy.

More to think about here, to be sure.

See Also (2014)

Johnson, Steven. “What a Hundred Million Calls to 311 Reveal About New York.” Wired Magazine 11.01.10

The Ontology of Mis-Information

An editorial, “We’ll Have to Check, Sir,” in this morning’s Times criticizes the apparent difficulties in purging incorrect information from the government’s terrorist “watch list.” It’s not, at first glance, at all counter-intuitive that organizations put less energy into deleting data than they put into collecting and storing it. On further thought, though (and I think this is a point that’s implicit in the editorial), databases that contain errors may be more costly than you would think. A catalog company, for example, that has stale addresses in its database wastes money when it sends catalogs to nowhere. We might guess that there’s an equilibrium somewhere — the cost of identifying and purging bad addresses stands in some relation to the cost of producing and sending an advertisement to a landfill. The problem comes when the costs are invisible or externalized, or when a powerful entity operates on a “better safe than sorry” logic or has the “luxury” of imputing an infinite price to false negatives (e.g., since the cost of letting a terrorist get away is infinite, any effort wasted on a false positive is worth it).

But one wonders whether the Homeland Security policy would stand up to a thorough cost-benefit analysis. Unless the department has access to infinite resources, every time it detains a Ted Kennedy, or some other unfortunately named individual, it diverts resources from its important duties.

But all this brings up a more general sociology of information question: how should we think about mis-information? I don’t mean the act of spreading mis-information. Rather, how should we think about recorded information that is false? This includes incorrect information recorded in databases as well as information in circulation that is either false or has been stripped of context in a way that renders it uninformative. It looks like information, but it is not.

One approach is to think like a statistician or electrical engineer and describe it as error or noise. That’s helpful, but we probably need to distinguish random noise — high-entropy, “signal-less” gibberish — from non-random, incorrect signals born of lies, mistakes, or unrecorded change (as when I move and don’t update everyone who has me in their database).
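The statistical version of that distinction can be sketched in a few lines. This is a toy illustration with made-up numbers, not anything from Kedrosky’s talk: random noise scatters around the true value and averages away, while a stale record (the non-random kind of error) is consistently wrong in one direction, and no amount of averaging fixes it.

```python
import random

random.seed(0)

true_value = 100.0  # e.g., someone's actual street number (hypothetical)

# Random noise: readings scatter symmetrically around the truth,
# so the mean of many readings converges back to the true value.
noisy_readings = [true_value + random.gauss(0, 5) for _ in range(10_000)]
noise_mean = sum(noisy_readings) / len(noisy_readings)

# Non-random error: a stale database record is wrong by the same
# amount every time it is copied (the person moved; the bias persists).
stale_readings = [true_value + 40 for _ in range(10_000)]
stale_mean = sum(stale_readings) / len(stale_readings)

print(round(noise_mean, 1))  # close to 100: averaging recovers the signal
print(stale_mean)            # 140.0: the bias survives any amount of data
```

The point of the toy: more data exhaust helps with the first kind of error but does nothing for the second, which is one reason “asymptotic informational efficiency” is harder than it sounds.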

How are the data-miners thinking about this? What will we learn about the sociology of information through trying to port concepts from EE? More to come….