Note: this is Part 9 in a series of blog posts about becoming data driven in business. You may want to read the prior parts before reading this essay.
In our previous instalment of the Data Driven in Business series, I explained the process that I was testing with Commoncog. I wrote:
So here we have a bunch of questions: does this work? Is such a project worth doing for a tiny content business like Commoncog? What is hard about this? What’s likely to go wrong? Perhaps media businesses can’t possibly be data-driven; every blog post is a unique snowflake with unique properties that appeal to unique individuals. Perhaps the causal structures behind certain businesses are unknowable below a certain size.
But in the spirit of the WBR, and in the grand tradition of Deming: all of this sounds empirical. I can sit around pontificating ... or I can find out. The Becoming Data Driven in Business series will pause for a bit — barring case studies and some book summaries — while I run out this experiment; you’re likely to see some website changes.
I’ll tell you when I have results.
I don’t yet have results that are worthy of publication. But I do have some notes from practice. This is one of them.
A few weeks ago I was catching up with a friend, and I walked him through the Weekly Business Review that I was doing. He laughed, said that he liked what he saw, and proceeded to tell me a story.
“We have a new product at work that’s been out for a few months, and I saw our active users jump up, not too long ago. I didn’t think much of it at first — I thought the team was simply executing well. But then recently I learnt that what had actually happened was that the team changed the definition of active users.”
“How do you mean?” I asked.
“Well, I think the old active user definition was something like ‘customer that’s spent at least $100 with us in the past two weeks’, and in the new active user definition they just changed that to $10. So of course overall active users would go up! And I was like, ‘oh that’s smart.’”
“Oh god,” I said. “Do you at least know when they changed the definition?”
“And so it just looked good suddenly?”
“And is this active user definition recorded anywhere?”
“No, of course not.”
I covered my face with my hands.
“It’s actually a pretty smart move,” my friend said. “Useful trick to have in your back pocket — though not if you actually own the company, of course. If that’s the case then it’s terrible. But if you’re an employee and you find yourself responsible for some P&L, then, by all means, good trick to use to hit your quarterly targets, eh?”
Record the Definitions of Your Metrics
In Executing on Becoming Data Driven: The Technicals I wrote a section titled ‘Record Your Operational Definitions’. I said that when embarking on an analytics project, it is important to store definitions of every metric you’re tracking in an easily-accessible location; I also said that I used the format that Statistical Process Control (SPC) practitioners recommended — something that W. Edwards Deming calls an ‘operational definition.’
A good operational definition consists of three things:
- A criterion. The thing you want to measure.
- A test procedure. What process are you going to use to measure the thing?
- A decision rule. How will you decide if the thing you’re looking at should be included (or excluded) from the count?
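To make the three parts concrete, here's a hypothetical sketch (names and fields are my own, not from any SPC text) of how an operational definition might be stored as a structured record in a central metrics registry:

```python
from dataclasses import dataclass

# Hypothetical sketch: an operational definition stored as a
# structured record. Field names are illustrative.
@dataclass(frozen=True)
class OperationalDefinition:
    metric: str          # name of the metric
    criterion: str       # the thing you want to measure
    test_procedure: str  # the process you use to measure it
    decision_rule: str   # what gets included in (or excluded from) the count

active_user = OperationalDefinition(
    metric="active_user",
    criterion="Customers actively transacting with us",
    test_procedure="Query the orders table for spend per customer "
                   "over the trailing 14 days",
    decision_rule="A customer counts as active if they have spent "
                  "at least $100 with us in the past two weeks",
)
```

Storing definitions this way forces every metric to answer all three questions, and gives you one obvious place to look when someone asks what a number means.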
If this seems more complicated than you’re used to, that’s because it is. At the time, I wrote:
This seems overly rigorous, and indeed operational definitions might not be as critical when dealing with things like ‘page views’, or ‘number of unique visitors’ (seeing as I’m just going to use the definitions of my analytics tool). But it becomes very important when measuring even slightly fuzzier concepts, and it matters when you want multiple teams in your org to measure things in the exact same way. You want to be very clear with what you’re measuring, and why — in practice, this means documenting every operational definition in some central place.
In truth, I didn’t really understand what I was writing. I told my friend about the ‘operational definitions’ idea, and said that it was a potential prevention mechanism for the problem he’d described. I pointed out that recording and using ODs seems to be a basic practice recommended in most SPC texts. But I had trouble articulating the structure of an OD, or explaining how you would enforce one. In fact, how (or why) does recording operational definitions for every metric help you, really?
What that conversation told me was that I didn’t really understand ODs. So I took a second look. And I realised that whilst I’ve spent a large amount of time in this series on the consumption of data, I’d spent very little on the collection of it.
The Principles of Counting
Once you look into what SPC has to say about data collection, you’ll quickly realise that operational definitions fall out of a small handful of fundamental principles. These principles are simpler than you might think. This is part of a pattern: over the course of the Becoming Data Driven series I’ve been increasingly struck by how ... basic all the ideas really are. You do not need fancy technology to solve the problem my friend described, nor do you need to reach for some sophisticated set of ideas — no, the problems in my friend’s story are preventable if you follow a small handful of ideas. Three of them, to be precise.
The principles we’re about to look at are drawn from Donald Wheeler’s Making Sense of Data. According to Wheeler, there are only two rules for collecting good data:
1. Unless all the values in a set of values are collected in a consistent manner, the values will not be comparable.
2. It is rarely satisfactory to use data which have been collected for one purpose for a different purpose.
To these two principles I would add a third, which is really an actionable implication of the two:
3. Every metric you use needs to come in the form of an ‘operational definition’.
Wheeler continues: “Collecting good and focused data is a prerequisite for making sense of those data. The issues of discovery, investigation, communication, and persuasion are all promoted by useful and effective data analysis, but the analysis cannot be better than the original data.”
These principles are deceptively simple, and for good reason: principles are compressed bits of knowledge, and their implications are more useful and more interesting to examine. So I think it’s worth talking through some of the implications of the two rules before we discuss what an operational definition is, or why it’s so important to the topic of measurement.
Rule 1: Values Are Comparable Only If They Are Collected in a Consistent Manner
The first rule seems like common sense, so I’ll ask a series of application questions instead.
- You have been using Google Analytics for a few years. You decide to switch away from Google Analytics to something more powerful. Can you export the data from the old Google Analytics system to use as historical data for web traffic measurement in your new system?
- Your company wants to administer work-life balance surveys. The last time they ran one was a year ago. Your HR department decides to use a different vendor this year. As a result of setting up on the new platform, the head of the project decides to modify the survey questions from the prior year. At the end of the survey period, the team presents a selection of charts comparing changes in numerical values from a bunch of ‘sentiment’ questions administered in the survey this year and last year (e.g. How happy are you with your work life balance? Pick a score from 1-7). However, some of the questions — whilst attempting to capture the same intent — are phrased differently from the prior year’s questions. For those questions, can you trust such comparisons?
- The head of HR decides to measure office utilisation. However, not all of the office buildings that the company uses across the country have employee badge access systems (some of them have turnstiles with employee badge access, but others have no access restrictions). The HR head decides to measure the buildings without such turnstiles by doing a walking count executed twice a day, once at 11am, and once at 3pm. Daily utilisation is then computed by averaging the two counts. For the buildings with employee badge access, data will be collected using the access system. Is this data comparable?
Going by the first rule, the answer to all of the questions above is ‘no’. You would think this principle is simple and obvious, but apparently in practice it is not. It’s actually rather common to go “sure, we can’t compare the two sets of values, but it’s close enough, and it gives us some information, so let’s not throw it out, shall we?” — and yes, you may very well do that. But you should not pretend you’re being rigorous when you do.
The application to my friend’s problem is clear: the instant that team changed the definition of their active user metric, they should’ve deleted all references to the old data. They don’t get to say “oh, we performed X% better than the previous quarter” — because active users the previous quarter was defined differently.
A second implication is also clear. If you want to compare survey data from this year to the prior year, or if you want to compare utilisation rates measured by different teams in different buildings — you need to ensure the measurements are taken in the exact same way. This means that writing down and tracking changes to metric definitions are very important for the practice of analysis. And it also implies your metrics definitions must be sufficiently detailed so that measurements may be taken in the exact same way.
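One way to track changes to definitions, sketched below with made-up data and my own field names, is to record every revision of a metric's definition with an effective date. An analyst can then check mechanically whether two measurements were taken under the same definition, and so whether they are comparable at all:

```python
from datetime import date

# Hypothetical sketch: every change to a metric's definition is
# recorded with an effective date, so an analyst can tell which
# stretches of historical data are comparable with which.
definition_history = {
    "active_user": [
        {"effective": date(2023, 1, 1),
         "decision_rule": "spent >= $100 in the past two weeks"},
        {"effective": date(2023, 9, 1),
         "decision_rule": "spent >= $10 in the past two weeks"},
    ],
}

def definition_on(metric: str, day: date) -> dict:
    """Return the definition in force for `metric` on `day`."""
    versions = sorted(definition_history[metric],
                      key=lambda v: v["effective"])
    current = None
    for v in versions:
        if v["effective"] <= day:
            current = v
    if current is None:
        raise ValueError(f"No definition of {metric} in force on {day}")
    return current

def comparable(metric: str, day_a: date, day_b: date) -> bool:
    """Rule 1: two measurements are comparable only if the same
    definition was in force when each was taken."""
    return definition_on(metric, day_a) == definition_on(metric, day_b)

print(comparable("active_user", date(2023, 6, 1), date(2023, 10, 1)))  # False
```

Under a scheme like this, the quarter-over-quarter comparison in my friend's story would have been flagged immediately: the two quarters sit on opposite sides of a definition change.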
Rule 2: Don’t Use Data Collected for One Purpose for a Different Purpose (Most of the Time)
Wheeler tells a real story of a plant manager who has the worst productivity figures in the company. According to his calculations, his plant is doing better than several other plants. Yet every month, his plant shows up in last place in the company’s operational reports. After a particularly bad month, he reaches out to the Comptroller to find out how, exactly, these productivity numbers are calculated. It turns out that plant productivity is taken by dividing the revenues for a plant by the total headcount for the site. Since this manager’s plant also houses the engineering division, his overall productivity figures are unfairly deflated.
What is the core issue here? The headcount numbers were not measured for the purposes of the productivity calculation. They were measured for some other reason, and repurposed for this measure.
Wheeler then continues (all emphasis mine):
A healthy appreciation of operational definitions is fundamental to being able to do business. (…) An operational definition puts communicable meaning into an adjective. Adjectives like good, reliable, uniform, prompt, on-time, safe, unsafe, finished, consistent, and unemployed have no communicable meaning until they are expressed in operational terms of criterion, test and decision rule. Whatever they mean to one person cannot be known to another person until they both agree on a communicable definition. A communicable definition is operational, because it involves doing something, carrying out a test, recording the result, and interpreting the result according to the agreed upon decision rule. This means there can be different operational definitions for the same thing. It is not a matter of being right or wrong, but rather a matter of agreement between parties involved. The result of an operational definition is consistency. The result of not having an operational definition is lack of consistency.
One implication of this second rule is that metric definitions should be available to operators. If the plant manager was aware that productivity figures were calculated in this manner, he could’ve raised the issue earlier.
A second implication is that you need to tie your metric definitions to the purpose for those measurements. This could be done implicitly (everyone knows what the metric is used for) or explicitly (in the metric definition itself). And this is more important than you might think, because the way you intend to use the data will affect the way the data is collected!
This gets a tad philosophical, but Wheeler gives a pretty good illustration:
While you may think this example is silly (how to measure employees in office buildings), this is the problem faced by the Census Bureau every decade. How many people live in Atlanta? While this question is easy to ask, there is no exact answer to this question. Which is why it is nonsense to claim that ‘this census was the most accurate ever.’ The census uses a method to obtain a count. Change the method and you will change the count. A particular method may do a better job, or a poorer job, of counting people in a given segment of the population than will an alternative method. Nevertheless, there is no way to claim that one method is more accurate than another method. The number you get will be the result of the method you use.
To which I would add: the method you use is informed by the purpose you have for your data.
The two rules leave us with an obvious question: what form should our metric definitions take? Here the argument is clear: Deming and his ilk argue that metric definitions should take the form of an ‘operational definition’.
Let’s return to the shape of an OD. Recall that an operational definition consists of three parts: a criterion, a test procedure, and a decision rule. I’ve mentioned above that the definition seems excessive, but after using it in practice for a few months, I’m starting to realise why ODs are this way.
I have spent my career working on software products. Sure, an earlier stint of mine involved selling and distributing Point of Sale Systems, which required some assembly (a touch of ‘manufacturing’, if you will). But the assembly process was so low volume that we did not instrument our ‘manufacturing’ operations — and I was not involved in tracking sales, measuring customer support, or taking inventory.
As a result, the bulk of my metrics were digital, and could be defined by merely using a criterion and a decision rule. For instance, in a web-based product I could define an ‘engaged user’ as a ‘user that has liked at least three posts in the previous seven days’. Or recall the definition from my friend’s story: an ‘active user’ is ‘a customer that’s spent at least $100 with us in the past two weeks’. For these types of digital metrics, there is no need to come up with a test procedure: you may track app interactions directly.
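For a purely digital metric, the decision rule can be applied directly to tracked interactions. Here's a minimal sketch of the ‘engaged user’ definition above, with an invented event shape (the `events` list stands in for whatever your analytics pipeline records):

```python
from datetime import datetime, timedelta

# Hypothetical event log; the shape is illustrative.
events = [
    {"user": "alice", "action": "like", "at": datetime(2024, 5, 10)},
    {"user": "alice", "action": "like", "at": datetime(2024, 5, 11)},
    {"user": "alice", "action": "like", "at": datetime(2024, 5, 12)},
    {"user": "bob",   "action": "like", "at": datetime(2024, 5, 1)},
]

def engaged_users(events, as_of: datetime, min_likes: int = 3) -> set:
    """Decision rule: a user is 'engaged' if they liked at least
    `min_likes` posts in the seven days before `as_of`."""
    window_start = as_of - timedelta(days=7)
    counts = {}
    for e in events:
        if e["action"] == "like" and window_start <= e["at"] <= as_of:
            counts[e["user"]] = counts.get(e["user"], 0) + 1
    return {user for user, n in counts.items() if n >= min_likes}

print(engaged_users(events, as_of=datetime(2024, 5, 14)))  # {'alice'}
```

Notice that there is no test procedure here at all: the ‘measurement’ is just a query over data the product already collects, which is exactly what makes digital metrics a degenerate, easy case.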
This is not the case with pretty much any other type of measurement. As a more involved example, let’s return to one of the scenarios above: say you want to measure office utilisation for a hybrid workforce. Employees are allowed to work from home twice a week, and are expected to come into the office the rest of the days. Management is thinking about revising this policy, and wants some data on office usage. How would you measure this? Do you count employee badge check-in data? Some buildings have turnstiles with badge access, and others do not. Ok, you want to be consistent in your data collection. Do you measure the number of employees who walk into the offices by counting at the front desk? But not all offices have a front desk. Do you do a count by walking the halls with a physical counter? If so, what time do you do that walkabout, and how often?
Here is one way of doing it:
- Criterion: The number of employees that are physically present in company offices.
- Test procedure: A count is conducted twice a day, by using a hand tally counter, counting all employees starting from the top office floor on down at 11am and at 3pm. The daily count is an average of the two counts.
- Decision Rule: An employee is counted if they are wearing a company badge (which they should be wearing at all times).
One week later, HR comes back with a number of complaints. It is labour intensive to do two counts a day. Could they just do one count a day? Plus not all employees wear their company badges prominently (some wear lanyards, but others clip theirs in a trouser pocket, and others don’t wear them at all). Checking for the badge makes the count more time consuming. Could we do away with that rule, pretty please?
At this point the head of IT hears about your problem, and proposes a different method of counting. All employees are assigned company phones. The devices each have a unique MAC address, and each MAC address is tied to employee IDs. When in the office, these devices usually connect to office wifi. The head of IT tells you it is possible for his team to collect the logs from office wifi across all the offices, grab the MAC addresses that are tied to employee IDs, and do a count that way. Better yet, since the logs are stored for 60 days, you could do retrospective data analysis.
Notice two things: first, this trips Rule 2, but the purpose of collecting wifi access logs and doing an employee count are close enough that it should be alright. Second, this method of measurement has different problems — an employee might leave their phone behind in the office, or an employee might forget to bring their phone to work, or an employee might leave their phone on airplane mode. But the number you get will be the result of the method you use — you accept this as an ok tradeoff, given the purpose of the data collection.
The OD for this will look as follows:
- Criterion: The number of employees that are physically present in company offices.
- Test procedure: Office wifi access logs are collected, and unique MAC addresses are taken from those logs. These addresses are filtered against a list of employee phone MAC addresses, which are then cross-checked against employee IDs. Only unique employee IDs will be counted (to cater for the situation where an employee holds multiple company phones).
- Decision rule: An employee is counted to be physically present in the office only if they have brought their company-issued phone to the office and connect it to wifi.
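The wifi-log test procedure above is mechanical enough to sketch in code. All data below is made up, and the log format is an assumption for illustration:

```python
# Hypothetical sketch of the wifi-log test procedure: take unique MAC
# addresses from the day's access logs, filter them against the
# registry of employee phone MACs, and count unique employee IDs.
wifi_log_macs = ["aa:01", "aa:02", "bb:01", "aa:01", "cc:99"]

# MAC address -> employee ID; one employee may hold several phones.
employee_phones = {
    "aa:01": "E100",
    "aa:02": "E100",   # second company phone, same employee
    "bb:01": "E200",
}

def count_present(log_macs, phone_registry) -> int:
    """Decision rule: an employee counts as present if any of their
    company phones appears in the office wifi logs. Counting unique
    employee IDs caters for employees holding multiple phones."""
    seen_ids = {phone_registry[mac]
                for mac in set(log_macs)
                if mac in phone_registry}
    return len(seen_ids)

print(count_present(wifi_log_macs, employee_phones))  # 2
```

The unknown MAC (`cc:99`, perhaps a visitor's device) is excluded by the decision rule, and employee E100's two phones collapse into a single count.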
Again, the point here isn’t that counting via wifi logs is better than counting manually; each method has different problems. The point is that counts collected in the first way cannot be compared to counts collected in the second way. And notice: regardless of which method you pick, the operational definition listed above contains sufficient information for a future analyst, who might want to do a comparative count, or who might want to use the data set for some other purpose (with obvious caveats here, of course!).
If the terms ‘criterion, test procedure and decision rule’ are a little difficult to remember, here’s the way I try to remember it:
- Thing — as in what Thing do you want to measure (and why?)
- Procedure — how are you going to measure it?
- Rule — what counts as that Thing?
Feel free to come up with your own mnemonic.
At this point in the series, we have already gone through a number of simple principles, all of them with fairly profound implications. In Goodhart’s Law Isn’t As Useful as You Think I introduced David Chamber’s ‘three ways to respond to a quantitative target’ as an alternative to Goodhart’s Law, which is really just ... common sense, if you pause to think about it for a bit. In How To Become Data Driven we looked at the idea — repeatedly preached in the field of Statistical Process Control — that the key to becoming data driven is to understand variation. And then finally, in Operational Rigour is the Pursuit of Knowledge we walked through the concept that the purpose of data is really ‘knowledge’ — which is specifically defined as ‘theories or models that allow you to predict the outcomes of your business actions.’ Some people call this “having a causal model of the business in your head”. Deming had a shorthand way of talking about this: he liked to say “management is prediction”. Conversely, if you aren’t able to predict the outcomes of your business actions, then you are essentially running your business on superstition.
I’m increasingly convinced that the ideas in the Data Driven Series are really the contents of a Data Literacy 101 course — a class that none of us seem to have taken. This essay is about collecting data. It contains a discussion about counting. Shouldn’t we have been taught these ideas in school?
This is Part 9 of the Becoming Data Driven in Business series. Read the next part here: Beck's Measurement Model, or Why Software Development is So Hard To Measure.