Scientific Rigor and User Testing: How to Remain Unbiased in a World that Wants Results
A little while back, a study was published that examined replications of published findings in cognitive and social psychology, and found that only 36% of the original results held up when the studies were repeated. While disappointing to many researchers in psychology and beyond, it is a useful reminder to be careful with our conclusions.
So what’s happening with these studies? Why can’t they be replicated? There are a few possible explanations: some environmental, some due to the randomness of participant samples, and some, unfortunately, due to researchers manipulating the data. Setting aside those who have been found to wholly fabricate their data, there are other ways a researcher might knowingly or unknowingly confound their results:
a) Removing “outliers” for unfounded reasons
Sometimes analysts remove data points that fall far outside the rest of the data set. Say a participant falls asleep halfway through your test and scores only 10 out of 100. This person is a legitimate outlier, because they failed to complete the test for a reason unrelated to the task itself. But if someone stays awake throughout the test and still scores a 10 out of 100, that person should not be removed, even if they had a rough night and looked a little worse for wear. When removing outliers, researchers need to be careful about why and how they remove them. A tired participant is still representative of your real users, some of whom will also be tired. Unless tiredness is a controlled variable or a requirement in your study, it shouldn’t be used as grounds for removing outliers.
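To make the habit concrete, here is a minimal sketch in Python of encoding exclusions as explicit, documented reasons rather than as “the score looked too low.” The participant IDs, scores, and moderator notes are entirely hypothetical:

```python
import statistics

# Hypothetical task scores (0-100) with notes recorded by the moderator.
sessions = [
    {"participant": "P01", "score": 82, "note": "completed"},
    {"participant": "P02", "score": 10, "note": "fell asleep, did not finish"},
    {"participant": "P03", "score": 14, "note": "completed"},  # struggled, but finished
    {"participant": "P04", "score": 77, "note": "completed"},
    {"participant": "P05", "score": 69, "note": "completed"},
]

# Exclude a session only for a documented, pre-agreed reason
# (here: not completing the protocol), never just because the score is low.
EXCLUSION_REASONS = {"fell asleep, did not finish"}

kept = [s for s in sessions if s["note"] not in EXCLUSION_REASONS]
dropped = [s for s in sessions if s["note"] in EXCLUSION_REASONS]

print("Kept:", [s["participant"] for s in kept])
print("Dropped (with reason):", [(s["participant"], s["note"]) for s in dropped])
print("Mean score of kept sessions:", statistics.mean(s["score"] for s in kept))
```

Notice that P03’s low score survives: a struggling participant is data, not noise.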
b) Trying to find something in the data that isn’t there
When an original study doesn’t produce statistically significant results, there is a temptation to slice the data in other ways in search of an unexpected conclusion. The problem is that this usually means overworking the data, or analyzing variables that weren’t controlled as closely as the original ones. Running many two-sample t-tests, for example, inflates the chance of a type I error, otherwise known as a false positive: finding a result that isn’t truly there. To illustrate, say we are looking at a drug trial and the reduction of symptoms of an illness. If we compare the test group to the control group on a large enough number of symptoms, chances are the groups will appear to differ on at least one of them purely by accident. That would be a type I error. There are statistical corrections that control this risk, such as tightening the significance threshold when making multiple comparisons, but the more an analyst re-slices the data, the more likely they are to commit a type I error.
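As a rough illustration of how quickly this error accumulates, here is a small simulation sketch in Python. It assumes two groups drawn from the same distribution (so there is no real effect at all) compared on 20 made-up “symptoms”; the sample sizes and thresholds are arbitrary, but the pattern is the point:

```python
import random
import statistics
from math import sqrt

# Monte Carlo sketch: two truly identical groups compared on many "symptoms".
# With enough comparisons at a 0.05 threshold, at least one spurious
# "significant" difference becomes likely. All numbers here are made up.
random.seed(1)

N_PER_GROUP = 30
N_SYMPTOMS = 20
N_TRIALS = 1000

def t_statistic(a, b):
    """Two-sample t statistic (pooled, equal-variance form)."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2)
    return (ma - mb) / sqrt(pooled * (1 / len(a) + 1 / len(b)))

# Two-sided critical value for alpha = 0.05 with 58 degrees of freedom.
T_CRIT = 2.00

false_positive_runs = 0
for _ in range(N_TRIALS):
    for _ in range(N_SYMPTOMS):
        control = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
        treated = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]  # no real effect
        if abs(t_statistic(control, treated)) > T_CRIT:
            false_positive_runs += 1
            break

print(f"Runs with at least one 'significant' symptom out of {N_SYMPTOMS}: "
      f"{false_positive_runs / N_TRIALS:.0%}")
# Expected to land near 1 - 0.95**20, about 64% -- far above the nominal 5%.
```

Roughly two runs in three turn up at least one “difference” that doesn’t exist, which is exactly the trap described above.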
So how does this relate to user testing? How do we make sure that our studies and conclusions accurately reflect your system, your users, and their interaction? Here are five ways that we make sure our results are meaningful, unbiased, and useful to our clients:
1. A True Environment
We conduct usability studies in a true environment: if your product or system will be used outside, we test outside, and if your product is used inside, we test inside. If we’re unable to test in the actual use environment, we replicate it as best we can, including lighting, sound levels, and likely distractions. If your system will be used in a relatively quiet, indoor environment like an office building, we might conduct a remote test over the phone, but we wouldn’t try this if your system is used in a bustling restaurant kitchen.
2. Target the Right Participants
Those taking part in the study should be representative of your target users. Guidelines around target users should be as loose or as strict as your study requires; the stricter your target user criteria, the more specific you can be in your conclusions about that group, though it will be harder to recruit participants. And a more specific target user isn’t always what you want - for example, any publicly available software inherently reaches a very broad spectrum of users. Once the target user criteria are set, then, and only then, can we use them to remove any outliers that do not meet the criteria.
3. Collect Meaningful Demographic Data
When collecting demographic data, we include variables that might help us interpret the results. For example, asking whether a participant has used your product before, or how often they use it, allows us to generate weighted analyses around exposure to the system. Those results can inform decisions about new users, training, and other user experience factors. This is better than removing a set of participants altogether because they perform poorly or are an unknown quantity.
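As a sketch of what that segmentation might look like (the records and exposure categories below are invented for illustration), the analysis keeps every participant and simply reports each exposure group on its own:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical session records: task time (seconds) plus a screening
# question about prior exposure to the product.
records = [
    {"participant": "P01", "task_time": 48, "used_before": "weekly"},
    {"participant": "P02", "task_time": 95, "used_before": "never"},
    {"participant": "P03", "task_time": 52, "used_before": "weekly"},
    {"participant": "P04", "task_time": 110, "used_before": "never"},
    {"participant": "P05", "task_time": 63, "used_before": "monthly"},
]

# Segment by exposure instead of discarding the slower, newer users.
by_exposure = defaultdict(list)
for r in records:
    by_exposure[r["used_before"]].append(r["task_time"])

for group, times in sorted(by_exposure.items()):
    print(f"{group:>8}: n={len(times)}, mean task time = {mean(times):.0f}s")
```

The first-time users stay in the analysis; they simply tell a different part of the story, such as what onboarding or training might need to cover.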
4. See the Good in the Bad
Know that bad results or no results tell us just as much about a study as good results do. When a system isn’t usable, we don’t want to tweak the data to make it look good; we want to show just how bad it is so that it can be fixed. In an academic study, a finding of no correlation can be just as important a conclusion as a positive one. Any headline that starts “Study shows no link between…” is proof that these studies matter. We make sure to remain unbiased and report the true results.
5. Be Realistic with Results
User testing is typically a more fluid process than an academic study. We present our findings at the appropriate level of significance, and if our clients or customers try to draw exaggerated conclusions, we gently remind them of the nature of the data. We help our clients understand complexities such as how sample size affects confidence intervals, and why metrics like click counts or time on task aren’t always a direct measure of usability.
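For instance, here is a small Python sketch of how the same observed success rate carries very different uncertainty at different sample sizes, using a standard Wilson score interval; the 4-of-5 and 40-of-50 figures are hypothetical:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson 95% confidence interval for an observed task success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# Same observed success rate (80%), very different certainty.
for successes, n in [(4, 5), (40, 50)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n} tasks succeeded -> 95% CI roughly {low:.0%} to {high:.0%}")
```

With 5 participants the interval spans roughly 38% to 96%; with 50 it narrows to roughly 67% to 89%. Both say “80% succeeded,” but only one of them supports a confident claim.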
In a world that wants results, it can be hard to remain unbiased and present less-than-stellar findings, but we owe it to you and your end users to present information in an honest way. So stop trying to look at that spreadsheet sideways, and start planning the next study.