Unfiltered Friday: Statistical Significance in Golf Equipment Tests
In the first edition of our new weekly feature, Unfiltered Friday, I'll offer a few thoughts on the larger conversation around golf equipment reviews and testing methods, and on why following proper testing protocols matters if we want to educate each other.
As always, I welcome you to offer your take in the comment section below. You can also reach us via social media at @golfunfiltered, or via email at firstname.lastname@example.org.
Golf Equipment Reviews and Improvement
Having the opportunity to interview and meet many of the biggest names in golf equipment has been extremely exciting for me over the years. Learning about the thought processes that go into the design of a new club or golf ball is fascinating, and the money that's invested to help us enjoy the game can be staggering.
Unfortunately, I continually see golf equipment tests and experiments that leave more questions than answers. Gaps in data analysis, questionable testing methods, and confusion over results breed frustration more than education, and at times hurt sales for our favorite brands. Conversations with multiple OEMs confirm I'm not the only one who feels this way, and it's no secret which websites do a better job than others.
That's why I think it is so important, when consumers test those products, whether via a blog, club demo, or custom fitting, to have a fundamental understanding of what it actually means to "improve."
Long-time readers of GU know when I'm not littering the internet with my golf thoughts, I work in process improvement during the day. A big part of that deals with statistics, specifically the concept of proving something through hypothesis testing and statistical significance. This can only be done through the establishment of a sound measurement system, used in a properly designed experiment, which includes appropriate sample sizes and testing methods.
For example, let's say you go to your local demo day where a company rep has a Trackman set up on the range. You grab the newest driver from a bouquet of options and step up to the tee. You've got your current driver as well to compare to the new one.
How many golf balls should you hit with each driver before you can safely say you've got the appropriate sample size? Furthermore, how do you analyze the comparison data to determine if one driver truly outperforms the other?
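One rough way to think about the sample-size question is through the margin of error of an average: with a 95% confidence level, the mean carry you measure is only pinned down to within roughly 1.96 standard deviations divided by the square root of the number of swings. Here's a minimal sketch of that math; the 10-yard swing-to-swing standard deviation and the 3-yard precision target are made-up numbers for illustration, not measured values.

```python
import math

def swings_needed(stdev_yards, margin_yards, z=1.96):
    """Smallest sample size n such that the 95% margin of error
    for the mean carry, z * stdev / sqrt(n), is within the target."""
    return math.ceil((z * stdev_yards / margin_yards) ** 2)

# Assumed numbers: ~10 yards of swing-to-swing variation,
# and we want the average carry nailed down to within 3 yards.
print(swings_needed(10, 3))  # -> 43 swings per driver
```

The point of the sketch isn't the exact answer; it's that the required count grows with the square of the precision you want, which is why a handful of range swings per club rarely settles anything.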
The Metrics We Read Are Misleading
More often than not, most of us will focus on carry distance and shot dispersion. Those are real metrics we can gather from Trackman, and the data is usually reported to us as an average.
But there's a problem with averages. They are awfully sensitive to outliers, especially when the sample size is small (fewer than 30 shots). I contend this is the worst way to compare two products against each other, preferring the median (50% of the data above, 50% below), but that's controversial and I digress.
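A quick sketch shows what one mishit does to each statistic. The carry distances below are invented for illustration: nine solid strikes and a single topped drive.

```python
from statistics import mean, median

# Hypothetical carry distances in yards: nine good swings, one mishit.
carries = [252, 255, 249, 258, 251, 254, 250, 256, 253, 180]

print(mean(carries))    # -> 245.8 (one bad swing drags the average down)
print(median(carries))  # -> 252.5 (the middle of the data barely moves)
```

One topped drive costs the average about seven yards, while the median still describes how the club performed on a typical swing.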
Once you have your data sets ready, the next best thing is to test for statistical significance. This brings up the concepts of a null hypothesis and p-value.
At a high level, the null hypothesis says there is no difference between the two samples being compared. It's the assumption you carry into every experiment so as not to bias yourself toward one sample or the other. What you are testing, then, is whether you should reject or fail to reject the null hypothesis. To do that, you need to find the p-value.
A p-value is "the level of marginal significance within a statistical hypothesis test representing the probability of the occurrence of a given event." In plainer terms, it's the probability of seeing a difference at least as large as the one in front of you purely by chance, assuming the null hypothesis is true. For many industries the standard threshold is 0.05 (in healthcare it is often 0.01), meaning we tolerate at most a 5% chance that the difference we observe is a fluke. A p-value below that threshold means we reject the null hypothesis and can say, with some level of certainty, that there is a real difference between the two samples.
To go back to our example, we go into the driver test with the null hypothesis that there is no difference between the two drivers. With an appropriate number of drives recorded for each club, we can then collect the data and compare the two means (carry distance numbers, for example). This process -- called hypothesis testing -- results in a p-value. If the p-value is less than 0.05, we can suggest there is a statistically significant difference between the two drivers being tested.
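The comparison above can be sketched in a few lines. A proper analysis would typically use a two-sample t-test; the version below uses a permutation test instead, since it needs nothing beyond the standard library and makes the logic of the null hypothesis visible: if the two drivers really were interchangeable, shuffling the labels shouldn't matter. All the carry numbers are invented for illustration.

```python
import random
from statistics import mean

def permutation_p_value(a, b, trials=10_000, seed=0):
    """Two-sided p-value: how often a mean gap at least as large as the
    observed one appears when the club labels are shuffled at random,
    i.e. under the null hypothesis that the clubs are interchangeable."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical carry distances (yards) for the current and new driver.
current = [248, 252, 250, 247, 253, 249, 251, 250, 248, 252]
new     = [255, 259, 254, 257, 260, 256, 253, 258, 255, 257]

p = permutation_p_value(current, new)
print(p)  # well below 0.05 for this made-up data: reject the null hypothesis
```

With this particular made-up data the gap between the clubs (about 6 yards) dwarfs the swing-to-swing noise, so nearly no random shuffle reproduces it and the p-value lands far under 0.05. Tighten the gap or add noisy swings and the p-value climbs, which is exactly the "due to chance" scenario the test guards against.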
But all of that is boring and confusing, right? It also sounds expensive and time consuming. So why bother?
Instead, what many choose to do is simply hit a bunch of drives and make a determination on which club is better based on what we see in a very limited amount of time. What other choice do we have?
One choice would be to educate yourself on the importance of appropriate testing methods, to understand that very few golf sites are free of bias (let alone funding pressures), and to remember that if it sounds too good to be true, it probably is.
We can be better than that.