The perfect score: how to quantify user behavior and build more accurate UX personas

There is so much value in qualitative data, but it can easily be misinterpreted when analyzed by different people. We each bring our own perspectives and biases when we unpack data from interviews, usability tests, and surveys. This is especially true when we analyze user interview data for the purpose of building UX personas. In this blog post, I’ll talk about the process of implementing a scoring guide and address some of the issues we encountered when quantifying qualitative user data.

I would like to clarify that this post will focus on using a scoring method to help interpret qualitative user data for the purpose of persona development.

My first experience on a research team was at IDC Herzliya and the Weizmann Institute, conducting behavioral research with autistic participants. I was in charge of gathering and scoring motor and behavioral tasks for each participant. We recorded measurements like the length of eye contact and the number of smiles in a conversation. But there was also valuable subjective data that had to be measured: what was the quality of their conversational flow, or their affective engagement during the conversation? One of the more fascinating parts of our research was working to quantify that qualitative data and ensure that all of our subjective data was measured consistently. To do this, we created a standard scoring guide that defined each behavioral measurement and standardized how to analyze the data from a short conversation in order to give each participant a behavioral score.

Now, at a new company and in a new industry, I find myself facing the same problem. When Sam, our Senior UX Researcher, started diving into interview data for persona development, she created dimensions for different travel behaviors, which we examined for each user. These dimensions ranged from frequency of travel to travel savviness (check out her blog post “How to do persona mapping with 50+ users” to learn more). Each dimension was scored on a scale from 1 to 5 based on our qualitative user data. Take frequency of travel, for example: a score of 1 meant that the user traveled on a monthly basis, while a score of 5 meant that they traveled once per year. This type of dimension is quantitative, since the score is based on a specific number of trips a user took.

What about a dimension like travel savviness? This is where scoring became complicated. What did it mean for a user to be travel savvy? Did they travel frequently, have a ton of memberships, or both? When there was just one UX Researcher on the team scoring, this kind of information existed only in her head, and scores were determined more by a feeling than by a definition. That doesn’t mean scoring based on a feeling is necessarily bad. When I was scoring affective engagement in a conversation, I found that I could feel when a participant should be scored a 1 or a 5. But it’s not enough to go on a hunch, because there is no way to ensure that you score every user the exact same way, especially once more UX Researchers start contributing to the user data. Without consistency, the scores 1–5 will mean something different for each user. In order to score (and therefore measure) these dimensions accurately, we needed a way to quantify the qualitative data. As a result, we created a standard scoring guide: a document that defined in detail how to translate behavioral data into quantifiable, numerical data.

What is a standard scoring guide?

Simply put, a standard scoring guide is like a dictionary. It documents how you define the numerical range within each dimension, and it provides examples and quotes to give those definitions context.

Here is a simple quantitative dimension that scores a user’s frequency of travel based on how often they travel per year. The range was shaped by the responses we got during user interviews.
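To make the dictionary analogy concrete, here’s a minimal sketch of what one guide entry might look like if you stored it as data (shown in Python). The intermediate ranges and the quotes are assumptions for illustration; the post only defines the endpoints (1 = travels monthly, 5 = travels once per year):

```python
# A minimal sketch of one scoring-guide entry stored as data.
# Scores 2-4 and the benchmark quotes are illustrative assumptions,
# not the team's actual guide.
FREQUENCY_OF_TRAVEL = {
    "definition": "How often the user travels per year",
    "scores": {
        1: "Travels on a monthly basis (12+ trips a year)",
        2: "Travels most months (6-11 trips a year)",
        3: "Travels a few times a year (3-5 trips)",
        4: "Travels twice a year",
        5: "Travels once a year",
    },
    # Benchmark examples give the definitions context (see step 3 below).
    "examples": {
        1: "Hypothetical P1: \"I'm on a plane at least once a month for work.\"",
        5: "Hypothetical P2: \"We do one big trip every summer.\"",
    },
}
```

The real value of a standardized scoring guide shows up with more complicated dimensions, like travel savviness.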

A dimension like travel savviness can be vague without specific guidelines that outline how different factors play into a user’s score. In this case, we decided that there were 4 different factors to consider when calculating travel savviness.

1. Does the user collect and use membership points?
2. Do they travel more than 6 times a year?
3. Do they travel both internationally and domestically?
4. Do they know their way around travel websites? Do they have an established system?

Now that we have several factors that influence travel savviness, we can score users based on how many of these criteria they meet. A user who doesn’t have or use membership points and doesn’t travel outside of the US, but is comfortable booking travel and travels to California frequently, meets 2 of the 4 criteria and is considered a 3 for travel savviness (see the sketch below). Even with this outline, there is still room for error when interpreting a factor like “having an established booking system.” To account for that, we added example participants who fit the defined score ‘perfectly.’ This way, if there is any confusion, we can go back and look at how a user compares to the example user.
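Counting criteria maps naturally onto the 1–5 scale: meeting none of the four factors is a 1, and meeting all four is a 5. Here’s a minimal sketch of that logic; the factor names are shorthand I’ve made up for illustration:

```python
# A minimal sketch of criteria-counting for travel savviness.
# Factor names are illustrative shorthand, not the team's wording.
SAVVINESS_CRITERIA = (
    "uses membership points",
    "travels more than 6 times a year",
    "travels internationally and domestically",
    "has an established booking system",
)

def travel_savviness_score(criteria_met: set) -> int:
    """Map the number of criteria met (0-4) onto the 1-5 scale."""
    met = sum(1 for criterion in SAVVINESS_CRITERIA if criterion in criteria_met)
    return met + 1  # 0 criteria -> 1, all 4 -> 5

# The user described above: a frequent domestic traveler who is
# comfortable booking travel meets 2 of 4 criteria, so they score a 3.
print(travel_savviness_score({
    "travels more than 6 times a year",
    "has an established booking system",
}))  # 3
```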

Another way to define scores is to treat each one within a dimension as an “if…then…” statement. Let’s consider spending habit…

If user A likes to splurge, prefers 4–5 star hotels, and flies premium economy, then user A would score a 2 for spending habit.

Here, we still look at several factors that influence spending habit, but instead of saying “this user needs to check off 2 of the 4 factors,” we look at where they fall within each factor (like hotel star rating and travel class) to determine their score.
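One way to encode those “if…then…” statements is an ordered set of rules, where the first rule a user matches determines their score. Apart from the score-2 rule taken from the user A example, the thresholds below are assumptions for illustration:

```python
# A minimal sketch of rule-based ("if...then...") scoring for spending
# habit. Only the score-2 rule comes from the post's example.
from dataclasses import dataclass

@dataclass
class User:
    likes_to_splurge: bool
    preferred_hotel_stars: int  # 1-5
    travel_class: str           # "economy", "premium economy", "business", "first"

def spending_habit_score(user: User) -> int:
    # Rules are ordered from biggest spender (1) to most frugal (5);
    # the first rule that matches wins.
    if user.travel_class in ("business", "first") and user.preferred_hotel_stars == 5:
        return 1
    if user.likes_to_splurge and user.preferred_hotel_stars >= 4:
        return 2  # user A: splurges, 4-5 star hotels, premium economy
    if user.preferred_hotel_stars == 3:
        return 3
    if not user.likes_to_splurge and user.travel_class == "economy":
        return 4
    return 5

print(spending_habit_score(User(True, 4, "premium economy")))  # 2
```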

How to start building your standard scoring guide

1. Define your range based on your user interviews

You could create your dimensions and define each score within them before you even start your user interviews, but that won’t do you much good. Without knowing how your users actually behave, it’s easy to skew the range. Say you define your range first and then interview your users: you might find that most users score a 5, even though the users within that score have noticeably different behaviors. A range like that isn’t an effective division of behaviors. For the guide to be effective, every user, regardless of their behavior, has to match a score without diluting the data with loose matches. It’s much better to evaluate how responses vary among your users and shape the definitions from there, as in the sketch below.
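As one way to do this, you can collect the raw numbers first and let their distribution suggest the cut points. This sketch assumes you’ve recorded a trips-per-year figure for each interviewee and uses quantiles, which is an illustrative choice rather than a prescribed method:

```python
# A minimal sketch of shaping a 1-5 range from observed interview data.
# The data and the quantile-based cut points are illustrative.
from statistics import quantiles

trips_per_year = [1, 1, 2, 2, 3, 4, 4, 6, 8, 12, 12, 20]  # hypothetical responses

# Cut points at the 20th/40th/60th/80th percentiles split users into five buckets.
cuts = quantiles(trips_per_year, n=5)

def frequency_score(trips: int) -> int:
    """1 = most frequent traveler, 5 = least frequent, matching the scale above."""
    bucket = sum(trips > cut for cut in cuts)  # 0-4; more trips -> higher bucket
    return 5 - bucket                          # invert so frequent travel scores low

print(cuts)                 # the data-driven range boundaries
print(frequency_score(12))  # a monthly-ish traveler lands near a 1 or 2
```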

2. Determine the different use cases and work them into your definitions

After you conduct a chunk of your interviews, you’ll have a better idea of what your range should look like for different dimensions. Start with a framework of what you think makes the most sense based on your responses. When we defined the “importance of reviews” dimension, it started out very simply.

As we went through all of our user responses, we were able to define each score more clearly.
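We can’t reproduce the team’s actual wording here, but a hypothetical before-and-after for a dimension like this might look as follows, with the refined version spelling out the behavior behind each score:

```python
# Hypothetical first-draft vs. refined definitions for "importance of
# reviews". Both versions are illustrative, not the team's actual guide.
FIRST_DRAFT = {
    1: "Doesn't read reviews",
    3: "Sometimes reads reviews",
    5: "Always reads reviews",
}

REFINED = {
    1: "Never reads reviews; books on price and location alone",
    2: "Glances at the overall star rating, but not individual reviews",
    3: "Reads a few recent reviews before booking",
    4: "Reads reviews across multiple sites and filters by topic",
    5: "Won't book without extensive review research; a bad review can veto a trip",
}
```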

3. Find the best examples! They are your benchmark!

Written definitions alone aren’t enough to ensure consistent scoring. Even with specific guidelines, there is still room for interpretation. The most helpful thing to do here is to choose an example participant (and a quote or clip) that represents the score perfectly, or very closely. Adding contextual quotes will help maintain consistency when scoring.

4. It will evolve!

As you interview more users, you’ll find a few cases where a user just doesn’t fit into any score. It’s important to remember that a scoring guide is a living, breathing document that will change over time. That doesn’t mean you change a definition completely; rather, you add a note that addresses the more specific use case. A good example of this was when Sam interviewed users in the UK and found that the way we defined international vs. domestic trips in the US didn’t apply the same way to UK users. Instead, we had to add a use case for those users.
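The post doesn’t show the exact note that was added, but a region-specific use case might be encoded along these lines; the UK rule below is an assumption for illustration:

```python
# A hypothetical region-aware tweak to the "travels internationally"
# factor. The UK short-haul rule is an illustrative assumption, not
# the team's actual note.
EU_SHORT_HAUL = {"FR", "ES", "IE", "NL", "DE"}  # illustrative subset

def counts_as_international(home_country: str, destination_country: str) -> bool:
    if destination_country == home_country:
        return False
    # For a UK user, a short European hop can be closer in effort and cost
    # to a US domestic trip, so this rule chooses not to count it.
    if home_country == "GB" and destination_country in EU_SHORT_HAUL:
        return False
    return True
```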

It’s important to remember that the standard scoring guide isn’t a perfect science, and there is still room for bias. However, our team now has more trust in the quality and consistency of our data analysis. While this process originally came from behavioral research, it can be applied to any qualitative data set. In UX, many of us come from different academic and professional backgrounds, and we can draw on those perspectives and methods to improve our approach to the UX process. In our case, it has helped us more accurately represent a spectrum of consistent, real behaviors in our personas, which we use to improve our designs.
