Experiment Segmentation: Avoiding Old Dogs and Watered Down Results
One of the biggest growth bets we placed during my time at Shopkick was on geofenced notifications: location-based alerts users received when they were near one of our partner stores. To drive more in-store visits, the notification told users how many reward points were available at the store and reminded them to pull out the app. Since iOS and Android support for geofencing was fairly new at the time, we had to spend a lot of engineering effort building the feature and fine-tuning it to strike the right balance between accuracy and battery life. We made such a big investment because we believed it could increase store visits by 20%–30%.

When we launched the experiment, however, we were disappointed. The initial results showed only a 3% increase in store visits, far below our expectations. We knew something was wrong because we really believed geofencing could be a game changer, so we spent the next several weeks on a major effort to debug the feature and figure out what the problem was. The team even went as far as building a standalone iOS app for the sole purpose of testing and debugging geofencing, and driving all over the Bay Area to do field tests. After all this work, we found a few minor issues but still couldn't pinpoint any major problems.

Finally, we took a step back and took a second look at our experiment data. This time, we isolated our analysis to just the new users who had joined in the weeks since the experiment started. Only then did we see that geofencing had increased store visits by over 20% among new users and substantially improved new user activation.
When it comes to experiments aimed at increasing user activity or engagement, it is critical to segment your experiment analysis to get the full and accurate picture of how the experiment is performing. There are two main effects to watch out for:
Old Dogs: We’ve all heard the idiom “you can’t teach an old dog new tricks”. The first effect to watch out for is that existing users have a strong bias toward using the product the way it was before the experiment. They learned the product before the experiment existed, they found enough value to stick around without it, and they will most likely continue the usage patterns they developed before it launched. New users, however, have no preconceived notions; as far as they know, the experiment has always been part of the product. Looking at new users gives you valuable insight into the experiment from an unbiased population.
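A minimal sketch of this kind of cohort split, using made-up numbers and field names (none of these figures or schemas come from Shopkick's actual data): compute the treatment-vs-control lift once over everyone, then again restricted to users who signed up after the experiment began.

```python
from datetime import date

# Illustrative experiment records; every field name and value here is hypothetical.
EXPERIMENT_START = date(2013, 3, 1)

users = [
    {"signup": date(2012, 11, 5), "group": "treatment", "visits": 9},
    {"signup": date(2013, 3, 10), "group": "treatment", "visits": 6},
    {"signup": date(2013, 3, 12), "group": "control",   "visits": 5},
    {"signup": date(2012, 9, 20), "group": "control",   "visits": 9},
]

def mean_visits(rows, group):
    vals = [r["visits"] for r in rows if r["group"] == group]
    return sum(vals) / len(vals)

def lift(rows):
    """Relative lift of treatment over control in average store visits."""
    return mean_visits(rows, "treatment") / mean_visits(rows, "control") - 1

# Overall lift mixes old-dog users in with new ones.
overall = lift(users)

# Restrict to users who joined after the experiment started: an unbiased cohort.
new_users = [r for r in users if r["signup"] >= EXPERIMENT_START]
new_user_lift = lift(new_users)
```

In this toy data the overall lift comes out around 7% while the new-user lift is 20% — the existing users, still following their old habits, mask most of the effect.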
Watered Down Results: The second effect to look out for is that an established userbase of highly active, highly engaged users can dilute the results of experiments aimed at increasing engagement. The reason is that it is very difficult to take someone who is already hyper-engaged with the product and increase their level of engagement, whereas it is much easier to get a less engaged user or a new user to become more engaged. This effect was illustrated in an experiment I ran at Pinterest, in which we sent a new push notification to a group of users. Overall, the experiment showed a 3% lift in weekly active users (WAUs) amongst the target population.
Experiment results amongst all users
However, when we segmented our analysis and looked at how the experiment performed amongst less engaged users (users who usually use the app fewer than 4 times a month), we saw a 10% lift in WAUs within that particular group.
Experiment results from users who usually use the app <4 times a month
Sure enough, when we looked at how the experiment performed amongst Core users (users who usually use the app multiple times a week), we saw it had no impact on the WAU metric.
Experiment results from users who usually use the app multiple times a week
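The watering-down arithmetic can be sketched directly. The counts below are invented for illustration (they are not Pinterest's actual numbers): a large, flat Core segment pulls the pooled lift well below the casual segment's lift.

```python
def lift(treated, control):
    """Relative lift of the treatment group over the control group."""
    return treated / control - 1

# Hypothetical WAU counts per engagement segment (illustrative values only).
segments = {
    "casual": {"control": 1000, "treatment": 1100},  # <4 sessions/month pre-experiment
    "core":   {"control": 4000, "treatment": 4000},  # multiple sessions/week pre-experiment
}

# Per-segment lift: casual users show a 10% lift, core users show none.
per_segment = {name: lift(s["treatment"], s["control"]) for name, s in segments.items()}

# Pooled lift across all users hides the per-segment story.
pooled = lift(
    sum(s["treatment"] for s in segments.values()),
    sum(s["control"] for s in segments.values()),
)
```

Here the pooled number works out to a 2% lift even though the casual segment moved 10% — the larger, already-saturated Core segment waters the result down, which is exactly why the macro-level number alone is misleading.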
On the surface, A/B experimentation sounds easy. However, experiments rarely affect all users equally, and looking only at macro-level results can be misleading. Segmenting experiments by country, gender, and the user’s level of engagement (prior to the experiment starting, of course), while staying alert to Old Dogs and Watered Down Results, is crucial to fully understanding the impact an experiment had. You may even discover segments of the userbase that don’t need the experiment at all, where it may actually be doing more harm than good.