How Pinterest increased MAUs with one simple trick

For many areas of growth, presenting your message with the right hook to pique a user’s interest and get them to engage is critical. Copy is especially important in areas such as landing pages, email subject lines, and blog post titles, where users make split-second decisions about whether to engage with the content based on a short phrase. Companies like BuzzFeed have built multi-billion-dollar businesses in part by getting this phrasing down to a science and doing it more effectively than their competitors.

At Pinterest, we knew copy testing could be impactful, but we weren’t regularly running copy experiments because they were tedious to set up and analyze in our existing systems. This made it difficult to do the kind of iteration necessary to optimize a piece of copy. Last year, however, we built a framework called Copytune to address these issues. The framework has helped us optimize copy across numerous languages and significantly boost MAUs (Monthly Active Users). In this post, we’ll cover how we built Copytune, the strategy we’ve found most effective for optimizing copy, and some important lessons we learned along the way.

 

Building Copytune

When we decided to build Copytune, we had a few goals in mind:

1)   Optimize copy on a per-language basis by running an independent experiment for each language. What performs best in English won’t necessarily perform the best in German.

2)   Make copy experiments easy to set up, and eliminate the need to change code in order to set up an experiment.

3)   Have copy experiments auto-resolve themselves. When you’re running 30+ independent experiments (one experiment for each language), each with 15 different variants, it becomes too much analysis overhead to have a human go in and pick the winning variant for each language.

Copytune dashboard showing different winners among languages

To achieve these goals, we built a framework that mimicked the API of Tower, the translation library that every string passes through. We first had every string pass through Copytune, which would check the database to see if there was an experiment set up for that string. If so, it would return one of the variants. If the string was not in an experiment, Copytune would then pass the string to Tower to get the correct translation. A nightly job would then compile statistics on all the copy experiments and automatically shut down experiments when there was enough data to declare a winner.
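To make the flow concrete, here’s a minimal sketch of what a wrapper like this could look like. The table, field, and function names (copy_experiments, tower.translate, etc.) are illustrative assumptions, not Copytune’s actual internals:

```python
def get_copy(string_key, language, user_id, db, tower):
    """Return the copy to render: an experiment variant if one is running
    for this string and language, otherwise the normal translation.

    Hypothetical sketch -- the table and API names are assumptions, not
    Copytune's real implementation.
    """
    # Check whether an experiment is set up for this string in this language.
    experiment = db.query(
        "SELECT variants FROM copy_experiments "
        "WHERE string_key = %s AND language = %s AND active = TRUE",
        (string_key, language),
    )
    if experiment:
        # Deterministically bucket the user so they always see the same variant.
        variants = experiment["variants"]
        return variants[hash((user_id, string_key)) % len(variants)]

    # No experiment running: fall back to the normal translation path (Tower).
    return tower.translate(string_key, language)
```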

Copy optimization strategy

Testing copy requires an iterative process to achieve the best results. It’s almost impossible to identify the ‘best’ copy in one go, so we took an incremental approach to discover it.

  1. Explore Phase: You can’t know for sure what will work, so we started by testing many variants that touch on very different themes, tones, etc. We typically brainstorm 15 – 20 different variants. For example:

    • The latest Pins in Home Decor
    • Come see the top Pins in Home Decor for 12/3/2015
    • We found a few Pins you might like
  2. Refine Phase: After the Explore Phase, we began to see which tones and phrasings were performing best. Then we could refine by testing different components of the winning variants from the Explore Phase.

Let’s say that in Explore Phase, the winner was “We found some {pin_keyword} and {pin_topic} Pins and boards for you!”. There are many possible optimizations we can test in this example.

Example of component variations

We can try adding “Hey Emma!” at the beginning to catch the Pinner’s attention. We can even test whether “Hi Emma!” or just “Emma!” is better than “Hey Emma!”. We can test some phrases like “we found” vs. “we picked.” We can test if “Pins and boards” is better than just having “Pins” or “Boards.” In this example there are at least 10 components we can test. We treat them as independent components and test each of them against the winner.

  3. Combine Phase: Let’s say “Hi Emma!”, “we picked” and “{pin_topic}” were winners in the Refine Phase. We can now test whether the combination of Refine Phase winners (a) performs better than the original winner (b):

(a) “Hi Emma! We picked some {pin_topic1} and {pin_topic2} Pins and boards for you!”

(b) “We found some {pin_keyword} and {pin_topic} Pins and boards for you!”

Note that it’s possible that some components are not independent, so we also tested other combinations that seemed promising.
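As a small illustration of how Combine Phase candidates could be enumerated from the winning (and runner-up) components, here’s a sketch; the component lists are made up for the example:

```python
from itertools import product

# Winning and runner-up options for each independent component from the
# Refine Phase -- the values here are purely illustrative.
greetings = ["Hi Emma!", "Hey Emma!", ""]
verbs = ["we picked", "we found"]
objects = ["{pin_topic} Pins and boards", "{pin_keyword} Pins"]

# Enumerate candidate combinations to test against the original winner.
candidates = []
for greeting, verb, obj in product(greetings, verbs, objects):
    text = f"{greeting} {verb.capitalize()} some {obj} for you!".strip()
    candidates.append(text)

for candidate in candidates:
    print(candidate)
```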

In one of our highest-volume emails, the winning variant from the Explore Phase showed only a one percent gain in open rate. By the end of the whole iteration, optimizing the subject line on that one email had boosted the gain to 11 percent, adding hundreds of thousands more active Pinners each week.

 

Lessons learned

Copytune has been in place for almost a year now, and we’ve learned some lessons along the way:

Defining Success: When we initially started testing email subject lines, we defined the success criteria as driving an email open. This seemed the most straightforward choice, since the Pinner reads the email subject line and the next action is to either open the email or not. What we found, however, was that defining success with metrics further downstream (i.e. clicking on the content in the email) was more effective. Some subject lines were great at getting opens, but there was a mismatch between the expectations set by the subject line and the actual content of the email, so, net-net, they actually resulted in fewer clicks.

Picking Variants: The original vision for Copytune was to use a multi-armed bandit framework for picking variants and auto-resolving experiments. The difficulty we ran into was that feature owners wanted to see how the experiment performed across a variety of metrics and to be able to report concrete MAU gains from the experiment. To accommodate these needs, we ultimately needed to integrate Copytune with our internal A/B testing framework.
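For context, here’s a minimal epsilon-greedy sketch of the kind of bandit-based variant picking we originally envisioned; the class and the reward definition are illustrative assumptions rather than anything Copytune actually shipped:

```python
import random

class EpsilonGreedyPicker:
    """Minimal epsilon-greedy bandit over copy variants (illustrative only)."""

    def __init__(self, variants, epsilon=0.1):
        self.variants = list(variants)
        self.epsilon = epsilon
        self.shows = {v: 0 for v in self.variants}
        self.successes = {v: 0 for v in self.variants}

    def pick(self):
        # Explore with probability epsilon, otherwise exploit the best variant so far.
        if random.random() < self.epsilon:
            choice = random.choice(self.variants)
        else:
            choice = max(
                self.variants,
                key=lambda v: self.successes[v] / self.shows[v] if self.shows[v] else 0.0,
            )
        self.shows[choice] += 1
        return choice

    def record_success(self, variant):
        # Call when the user opened (or clicked) the message containing this variant.
        self.successes[variant] += 1
```

A bandit like this shifts traffic toward winners automatically, but as noted above it makes it harder to report clean per-variant metrics, which is part of why we ended up integrating with the A/B framework instead.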

 

Acknowledgements: Koichiro Narita for co-writing this post, helping develop Copytune, and running the subject line experiments covered in this post. Devin Finzer and Sangmin Shin for helping develop Copytune.

This post was originally published on the Pinterest Engineering Blog

4 Steps To Develop Your Push Notification Strategy

Startups often struggle with how to develop their push notification strategy. While email has been around for decades and is fairly mature, push has only been around a few years and people are still trying to get a handle on it. In this post, I’ll cover the basics of how to develop a messaging strategy that applies to both push and email and how to take advantage of some of the unique aspects of push.

Step 1: Define the product’s core value proposition

Push notifications should be an extension of the product’s core value proposition. I can’t emphasize this enough. One of the biggest mistakes I see startups make (and I’ve made myself) is sending emails/notifications about things that are not strongly tied to the value proposition. The value proposition is the reason people engage with the product and is what sets your product apart. Push notifications should further that engagement and make it easier for users to derive that core value. For users who don’t yet “get” the product, push should help them understand the value. For users who do “get” the product, push notifications should help them engage with it even further.

Step 2: Figure out what you can send that is tied to that core value proposition

Content generally falls into one of three broad buckets, each with its own pros and cons. The type of content you send depends both on what makes sense for the product and on what your resource constraints are.

Marketing Driven: These are notification blasts sent out by the marketing team to most or all users. A lot of ecommerce and brick-and-mortar retailers fall into this bucket.

Pros:

  • Coverage: Can send to every single user
  • Minimal engineering effort required, which makes it great for early stage startups
Cons:

  • Content is not personalized, which leads to low engagement rates
  • Users have lower tolerance to these types of notifications, which means you have to use them sparingly to avoid high unsubscribe and app deletion rates

Transactional: These notifications are triggered by users’ actions on the service. They inform other users about those actions. Facebook and LinkedIn are great examples of this.

Pros:

  • Generally good engagement since the content is relevant by virtue of the fact that the user has a direct connection to the action
  • Higher level of tolerance since users understand what is triggering the notifications
Cons:

  • Need to have enough engagement on the site, or be connected to enough users, to get the flywheel going

 

Content Driven: Content driven notifications connect users with relevant and interesting content. They generally use some amount of personalization to figure out which content to recommend. Twitter, for example, will send emails/notifications to less engaged users about popular tweets it thinks the user will be interested in.

Pros:

  • Can get good engagement rates by sending highly personalized, relevant content.
  • Can get good coverage by sending trending and popular content to users for whom you don’t have enough signal to personalize recommendations.
Cons:

  • Engagement rates get worse the less the user has engaged with the site
  • Expensive to build out recommendation algorithms from an engineering effort perspective

Step 3: Figure out your user segments

Once you’ve figured out what content to send, you then need to figure out who you want to send it to. Not all notifications are good for all users. Notifications should be targeted based on where the user is in their lifecycle. A very simple but powerful segmentation is classifying users into new, engaged, and unengaged (a rough sketch in code follows the list below).

  • New Users – Send notifications that help reinforce the product’s value and help them figure out how to get more value out of the app.
  • Engaged Users – These users are already engaged and understand the product, so only send them the best, most useful notifications that help them engage even further.
  • Unengaged Users – Unengaged users are always the toughest nut to crack since they have already shown a bias towards not engaging with the product. The signals you have on them may or may not be accurate so sending a mix of personalized notifications and broader non-personalized notifications is necessary to try and re-engage them.
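Here is that rough sketch of the new/engaged/unengaged classification; the 7-day window, the 4-sessions threshold, and the field names are assumptions for the example, not a prescription:

```python
from datetime import datetime, timedelta

def segment_user(signup_date, last_active_date, sessions_last_30d,
                 now=None, new_window_days=7, active_sessions=4):
    """Classify a user as 'new', 'engaged', or 'unengaged'.

    Thresholds are illustrative assumptions -- tune them to your product.
    """
    now = now or datetime.utcnow()
    if now - signup_date <= timedelta(days=new_window_days):
        return "new"
    if sessions_last_30d >= active_sessions and now - last_active_date <= timedelta(days=30):
        return "engaged"
    return "unengaged"
```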

Step 4: Think about what makes push unique

Up to this point, everything we’ve talked about can apply just as much to email as it does to push. However, there are a few things that really differentiate push from email and may change your approach to push.

1)   Timeliness – Since most people have their phones on them at all times, push notifications allow you to reach users more immediately than email.

2)   Location Based – Both iOS and Android have good support for geofenced notifications that allow you to notify the user when they are near a certain latitude and longitude point.

3)   Badging – Badging is a way to give the user an indicator that there is something new in the app in a way that is less intrusive than sending an email or normal push, and still triggers a lot of engagement.

The final step is to ask yourself if there is any way these attributes naturally dovetail with your product’s value proposition. For example, geofencing notifications are a great fit for location based apps, but can feel out of place if location is not a core feature of the app.

Wrap Up

As with anything Growth-related, push notifications require a lot of trial and error, iteration, and experimentation. However, I’ve found that thinking of push notifications as an extension of the app’s value proposition and then working through this framework has helped me a lot when crafting a push notification strategy.

Experiment Segmentation: Avoiding Old Dogs and Watered Down Results

One of the biggest growth bets we placed during my time at Shopkick was on geofenced notifications. Geofenced notifications are location-based alerts users received when they were near one of our partner stores. To drive more in-store visits, the notification would tell users how many reward points were available at the store and remind them to pull out the app. Since iOS and Android support for geofencing was pretty new at the time, we had to spend a lot of engineering effort building out the feature and fine-tuning it to strike the right balance between accuracy and battery life. We chose to make such a big investment because we believed it could increase store visits by 20%-30%.

When we launched the experiment, however, we were pretty disappointed. The initial results showed only a 3% increase in store visits, far less than our expectations. We knew something was wrong because we really believed geofencing could be a game changer, so we spent the next several weeks on a major effort to debug and figure out what the problem was. The team even went as far as building a standalone iOS app for the sole purpose of testing and debugging geofencing and driving all over the Bay Area to do field tests. After all this work, we found a few minor issues but still couldn’t pinpoint any major problems.

Finally, we took a step back and took a second look at our experiment data. This time, however, we chose to isolate our analysis to just the new users who had joined in the weeks since the experiment started. It was then that we saw that geofencing had increased store visits by over 20% amongst new users and substantially improved new user activation.

When it comes to experiments aimed at increasing user activity or engagement, it is critical to segment your experiment analysis to get the full and accurate picture of how the experiment is performing. There are two main effects to watch out for:

Old Dogs: We’ve all heard the idiom “you can’t teach an old dog new tricks.” The first effect to watch out for is that existing users have a strong bias towards using the product the way it was before the experiment. They learned how to use the product before the experiment existed, they got enough value to stick around without it, and they will most likely continue to use the product in the patterns they developed before it. New users, however, have no preconceived notions; as far as they know, the experiment has always been part of the product. Looking at new users gives you valuable insight into the experiment from an unbiased population.

Watered Down Results: The second effect to look out for is that an established userbase of highly active and highly engaged users can dilute the results for experiments aimed at increasing engagement. The reason is that it can be very difficult to take someone who is already hyper-engaged with the product and increase their level of engagement. However, it can be much easier to take a less engaged user or a new user and get them to become more engaged. This effect was illustrated in an experiment I ran at Pinterest. The experiment was to send a new push notification to a group of users. Overall, the experiment showed a 3% lift in WAUs amongst the target population.

Experiment results amongst all users

However, when we segmented our analysis and looked at how the experiment performed amongst less engaged users (users who usually use the app <4 times a month), we saw that it resulted in a lift of 10% in WAUs amongst that particular group.

Experiment results from users who usually use the app <4 times a month

Sure enough, when we looked at how the experiment performed amongst core users (users who usually use the app multiple times a week), we saw that it had no impact on moving the WAU metric.

Experiment results from users who usually use the app multiple times a week

A/B experimentation sounds easy on the surface. However, experiments rarely affect all users equally, and looking only at the macro-level results can be misleading. Segmenting experiments by country, gender, and the user’s level of engagement (prior to the experiment starting, of course), and being aware of Old Dogs and Watered Down Results, is crucial to fully understanding the impact an experiment had. You may even discover certain segments of the userbase that don’t need the experiment at all, where the experiment may actually be doing more harm than good.
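As a concrete illustration of this kind of segmented analysis, here’s a minimal sketch that computes the lift per segment from raw experiment rows; the field names and segment labels are assumptions for the example:

```python
from collections import defaultdict

def lift_by_segment(rows):
    """Compute percent lift of treatment over control for each segment.

    `rows` is an iterable of dicts like:
        {"segment": "marginal", "group": "treatment", "active": True}
    Field names and segment labels are illustrative assumptions.
    """
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for row in rows:
        active, total = counts[row["segment"]][row["group"]]
        counts[row["segment"]][row["group"]] = [active + int(row["active"]), total + 1]

    lifts = {}
    for segment, groups in counts.items():
        c_active, c_total = groups["control"]
        t_active, t_total = groups["treatment"]
        if c_total and t_total and c_active:
            control_rate = c_active / c_total
            treatment_rate = t_active / t_total
            lifts[segment] = 100.0 * (treatment_rate - control_rate) / control_rate
    return lifts
```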

When do features drive growth?

As I mentioned in my previous post, I often see a belief in product development that adding new features to a product will help spur growth. The thinking is basically that more features == more value == more growth. I disagreed with that thinking in my previous post and received a few questions about it, so I will expand on it in this post.

Hypothesis

First, I want to define what a core product feature is. A core product feature is a feature that is part of the normal, everyday use of the product. So I’m excluding typical growth features such as new user flows, invite referrals, sharing to social networks, SEO, etc., where the immediate impact on growth is very clear but the feature is not part of the core usage of the product. My hypothesis is that new core product features only help change a company’s growth trajectory if they create an engagement loop or a step-change improvement in the amount of value the average user gets from the product (preferably both). Specifically, for a feature to accelerate growth, I think it needs to meet the following criteria:

A) Most important is mass adoption by the user base: over 50% of users need to interact with the feature on a regular (i.e. daily or weekly) basis.

B) The feature creates an engagement loop that allows you to email or notify users on a regular basis. The content in those emails/notifications also needs to be compelling enough that it maintains a high click through rate over time.

C) The feature is a step-change improvement in the core value of the product for a majority of the userbase.

Facebook: A Case Study 

To put this hypothesis to the test, Facebook serves as a great case study of which types of features drive Growth and which do not. Over the years, they have launched Photos, News Feed, Platform, and Chat, which have all been major drivers of growth and engagement. However, they have also launched Questions, Places, Deals, Gifts, and Timeline, which have not fared as well in terms of driving growth and engagement.

To define what Facebook’s core product value is, I’ll use one of their more succinct mission statements from 2008, which is: “Facebook helps you connect and share with the people in your life.” [1]

Features That Did Drive Growth and Engagement

Photos: Photos was launched in 2005 and was one of the first major new features Facebook added after launch. Facebook has stated before how successful the Photos product was at driving engagement. If we look through the lens of the criteria laid out above, we can start to understand why.
A) The majority of users upload photos or are tagged in photos.
B) Tagging allows Facebook to re-engage users and has been so successful that Facebook has heavily invested in facial recognition to make tagging easier.
C) Photos provides a significant increase in the core product value by allowing people to much more easily share important moments and events in their lives and allowing people to be much better connected with what their friends and family are doing.

News Feed: News Feed was very controversial when it first launched in 2006, but ultimately led to a significant uptick in user engagement. Looking at our criteria:
A) Pretty much 100% of users view their News Feed.
B) News Feed dramatically improved the content sharing engagement loop. After News Feed, when someone shared a set of photos or a status update, those shares had dramatically higher visibility compared to when they were only visible by navigating to a user’s profile. This meant that these posts now received many more likes and comments that Facebook would then notify the user about.
C) News Feed provided another step-change improvement to Facebook’s mission. It was now much easier to passively stay connected to friends by seeing their posts and updates. As previously mentioned, it also fundamentally altered Facebook’s value proposition. It was no longer just a way to keep in touch, but it was now a way to broadcast and communicate with your social graph.

Platform: In a bold move at the time, Facebook opened up their platform to third party developers in 2006 [2]. Although Facebook has since clamped down significantly on the platform in the interest of user experience, Facebook Platform was a big engagement driver for a period of a few years.
A) For the first few years, a significant percentage of users interacted directly with Facebook apps or would get notifications from others using the apps.
B) Facebook outsourced the work of constructing the engagement loop to third-parties. Apps like Farmville, etc. had to create their own engagement loops to survive by using gamification mechanics such as crop harvesting to bring users back. By bringing a user back to Farmville, Zynga also brought a user back to Facebook. They also generated billions of notifications to Facebook users by getting users to spam app requests out to friends until Facebook finally clamped down on the app request spam.
C) Social gaming was the primary category of app that achieved significant traction on Facebook Platform. Facebook’s product value is connecting with friends and entertainment, which social gaming helped boost for a period of time until the novelty wore off and Facebook started to clamp down on the free distribution apps were getting in the News Feed.

Chat: Messages have been part of Facebook since its initial launch in 2004. However, in 2008 Facebook launched Chat [3]. Examining Chat, we can see that:
A) To ensure mass adoption, Facebook made Chat highly visible by including it as a sidebar overlay on every page.
B) Chat creates a very compelling high-frequency engagement loop.
C) Chat again significantly expanded on the ability for friends and family to stay connected through Facebook and eventually evolved into Facebook Messenger.

Features That Didn’t Drive Growth and Engagement

Questions: Launched in 2010 [4], possibly as a jab at Quora co-founder and former Facebook CTO Adam D’Angelo, Questions never really gained much traction.
A) The number of users using questions rapidly dropped off after the initial launch.
B) Questions did create an engagement loop, but the frequency was relatively low.
C) Questions was an incremental improvement and did not significantly expand on Facebook’s value proposition since people could already ask questions by just posting a status update.

Places: Also launched in 2010 [5], Places was Facebook’s response to the surging popularity of Foursquare. Places promised to enable people to share where they were.
A) Facebook probably did get over half of its users to use Places, through both check-ins and location tagging in photos.
B) Places did not create an engagement loop.
C) Places was an incremental improvement and did not significantly expand on the value proposition with regards to sharing since people could already share where they were through status updates or picture descriptions.

Deals and Gifts: Deals and Gifts are both pretty similar and were aimed more at monetization than growth, but they are still worth covering. Deals launched in 2010 (must have been a busy year) [6]. Gifts first launched in 2007 [7] and then re-launched in 2012 following Facebook’s acquisition of Karma [8]. Looking at our criteria:
A) Fewer than 50% of users bought a deal/gift.
B) Deals and Gifts did have the potential to create an engagement loop via a daily email, but Facebook didn’t capitalize on it. Even if they did, they would have discovered the same thing Groupon and LivingSocial did, which was that click through rates on the emails decay significantly over time as users grow tired of receiving deal after deal (or gift) they are not interested in buying.
C) Deals were not tied to Facebook’s value proposition at all. Gifts at least helped people be more connected, but the frequency was too low and the consumer adoption wasn’t there.

Timeline: Introduced in 2011, Timeline served two purposes. First, it was meant to help users rediscover things they had shared in the past, and second, it allowed users to more easily share from other apps [9]. However, according to insiders, it did not lead to any significant gains in growth or engagement.
A) Over 50% of users interacted with Timeline.
B) Timeline did not create a significant engagement loop. It generated a lot more posts from apps, but due to the trivial nature of the information being shared, they did not receive many likes and comments compared to organic shares and posts.
C) Timeline was more of an incremental improvement in the core product value rather than a step-change. In terms of being connected to friends and family, sure, you could now more easily scroll through hundreds of posts your friends had made over the years, but no one does that on a regular basis. In terms of sharing, the data shared from apps was trivial information such as a song you listened to or an article you read, but it didn’t necessarily mean you liked the song or thought the article was interesting, so the share had little value.

So What Does This All Mean?

Should we stop all core product feature development that doesn’t meet these criteria? Of course not. Core product teams should continue to develop features to make the product incrementally better and make users happy. However, if you are contemplating adding a new feature on the basis of expecting it to drive growth and engagement, you should ruthlessly evaluate it against these criteria.

Agree? Disagree? Discuss this post on growthhackers.com

Acknowledgements: Special thanks to Casey Winters and Stephanie Egan for helping review and refine this post.

How Can Reddit Solve its Growth Problem?

Reddit has been undergoing a lot of turmoil lately. CEO Ellen Pao resigned, ostensibly because she felt she couldn’t deliver the growth numbers the board wanted to see in the next six months. A question was posed yesterday on /r/Entrepreneur asking, “What would you do as Reddit’s CEO to grow the user base in the next 6 months?” The comments were filled with ideas for new features that could be added or existing features that could be tweaked. For instance, suggestions included tweaking the upvote/downvote system, building improved moderator tools, or giving greater visibility to underused features such as multireddits.

Do Features Drive Growth?

I think one mistake people often make is thinking that new features can help spur growth. They think more features == more user value == more growth. Whether you’re a tiny startup just getting off the ground or a mature product used by hundreds of millions of people, I think new features rarely lead to a significant change in growth trajectory. I believe this is because for a new feature to drive more growth it can’t just add incrementally more value; it has to create a step change in the amount of value that the average user gets from the product.

Reddit has a few rough edges, but it couldn’t have grown to 130 million monthly unique visitors if it didn’t have solid product/market fit and wasn’t already delivering a ton of value to users. So I don’t think adding new features is the answer.

So What Should Reddit Do?

I hypothesize Reddit derives a majority of its traffic from core users who check Reddit daily out of habit, from referral traffic from blogs, and from SEO. I think the biggest thing Reddit is missing is an engagement loop to bring non-core users back. Reddit currently does a very poor job of utilizing email, push notifications, and other social media platforms to re-engage users. My guess is that they’re worried about being spammy, which is a huge mistake, since these channels can be leveraged in a non-spammy way that actually puts users first.

So what would I do?

1) Re-engagement emails for non-daily active users that give them a digest of the top 10 posts of the previous day or the previous week. I think less active redditors would get a ton of value because it would allow them to discover content they may have otherwise missed.

2) Push notifications for trending posts where timeliness matters (ex: AMAs, breaking news, etc). Often users complain that they discover a post too late after it already has thousands of comments and they feel any comment they make at that point would just get lost in the crowd.

3) Currently, Reddit has about 130MM monthly uniques but only about 9MM registered accounts. In order to make the engagement loop work, Reddit would need more aggressive signup prompts for unauthenticated users. Reddit’s user base is pretty anti-signup, but I think communicating the value of creating an account would help convince users. Once a user is signed up, they can start curating their subreddit subscriptions, engaging in discussions, and submitting links (and they can also start receiving the previously mentioned re-engagement hooks).

4) Finally, BuzzFeed, 9GAG, theCHIVE, etc. leech a ton of traffic from Reddit by repackaging its content and posting it to social media (namely Facebook). Reddit should invest in making sharing much more prominent so it can start to capture some of that traffic. This doubles as both an acquisition strategy and a re-engagement strategy.

What would you do?

Share your thoughts in the discussion on growthhackers.com. If anyone from Reddit (or any other startup for that matter) wants some Growth advice, feel free to drop me a line.

Growth Tools & Frameworks

We recently held a meetup at Pinterest’s offices to discuss some of the tools & frameworks that some of the most successful companies have built in-house to enable them to drive growth at scale. We had a great turnout to hear speakers from Dropbox, Pinterest, and Facebook.

A few of the tools & frameworks we heard about were:

– Gandalf, a framework to target marketing messages & campaigns to users

– How Dropbox gets new users to bridge the gap between desktop and mobile

– Copytune, a framework for optimizing copy on a per language basis

– How to build an SEO Experimentation framework

– How Facebook uses “quick experiments” to assess the impact of even the smallest changes, such as bug fixes

If you’re an engineer interested in growth, join the SF Growth Engineering Meetup to find out more about these events in the future.

About the Speakers:

Darius Contractor – Darius works on Growth at Dropbox. He was previously VP of Engineering at Bebo (acquired by AOL) and a PM/senior engineer at Tickle.com (acquired by Monster). He focuses on building the right product as simply as possible, iterative engineering, and having fun. Occasionally, he blogs about psychology at http://darius.com

Viraj Mody – Viraj is an Engineering Manager and has been at Dropbox for 2.5 years where he focuses on onboarding/education/engagement initiatives & building infrastructure for growth. Before Dropbox, Viraj was a founder of Audiogalaxy (acquired by Dropbox in 2012).

John Egan – John is a lead engineer on the Growth team at Pinterest, where he leads efforts on emails & notifications. Prior to Pinterest, he led the Growth engineering team at Shopkick (acquired by SK Planet). You can read his thoughts on growth at http://jwegan.com

Julie Ahn – Julie is a software engineer on the Growth team at Pinterest where she focuses on search engine optimization. She built out the SEO experimentation framework which allows Pinterest to demystify SEO and help drive millions of incremental visits a day to Pinterest. Prior to Pinterest, she was a mechanical engineer in South Korea.

Ran Makavy – Ran is a Director of Product Management at Facebook. He spent his first three years there on the Growth team, focused on mobile and emerging markets. Today, he runs Facebook’s Local and Entities teams, building consumer products around places and location. Before Facebook, he co-founded Snaptu and grew it to over 100 million active users before it was acquired by Facebook.

 

Why You Should Be A/B Testing Your Infrastructure

The benefits of using a data-driven approach to product development are widely known. Most companies understand the benefits of running an A/B experiment when adding a new feature or redesigning a page. But while engineers and product managers have embraced a data-driven approach to product development, few think to apply it to backend development. We’ve applied A/B testing to major infrastructural changes at Pinterest and have found it extremely helpful in validating that those changes have no negative user-facing impact.

Bugs are simply unavoidable when it comes to developing complex software. It’s often hard to prove you’ve covered all possible edge cases, all possible error cases and all possible performance issues. However, when replacing or re-architecting an existing system, you have the unique opportunity to prove that the new system is at least as good as the one it’s replacing. For rapidly growing companies like Pinterest, the need to re-architect or replace a component of our infrastructure arises relatively frequently. We rely heavily on logging, monitoring and unit tests to ensure we’re creating quality code. However, we also run A/B experiments whenever possible as a final step of validation to ensure there’s no unintended impact on Pinners. The way we run the experiment is pretty simple: half of Pinners are sent down the old code path and hit the old system, and the other half use the new system. We then monitor the results to make sure there’s no impact across any of our key metrics for Pinners in the treatment group. Here are the results of three such experiments.
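The routing itself can be as simple as a deterministic bucket on the user ID, something like the sketch below (the helper and backend names are assumptions, not our experiment framework’s actual API):

```python
import hashlib

def in_treatment(user_id, experiment_name, treatment_pct=50):
    """Deterministically assign a user to the treatment (new system) group.

    Hashing on (experiment_name, user_id) keeps the assignment stable across
    requests and independent across experiments. Illustrative sketch only.
    """
    digest = hashlib.sha1(f"{experiment_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < treatment_pct

# Usage sketch: route traffic between the legacy and new code paths.
def fetch_user(user_id, legacy_backend, new_backend):
    if in_treatment(user_id, "user_service_migration"):
        return new_backend.get_user(user_id)
    return legacy_backend.get_user(user_id)
```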

2013: A new web framework

Our commitment to A/B testing infrastructural changes was forged in early 2013 when we rewrote our web framework. Our legacy code had grown increasingly unwieldy over time, and its functionality was beginning to diverge from that of our mobile apps because it ran through completely independent code paths. So we built a new web framework (code-named Denzel) that was modular and composable and consumed the same API as our mobile clients. At the same time we redesigned the look and feel of the website.

When it came time to launch, we debated extensively whether we should run an experiment at all, since we were fully organizationally committed to the change and hadn’t yet run many experiments on changes of this magnitude. But when we ran the experiment on a small fraction of our traffic, we discovered not only major bugs in some clients we hadn’t fully tested but also that some minor features we hadn’t ported over to the new framework were in fact driving significant user engagement. We reinstated these features and fixed the bugs before fully rolling out the new website, which gave Pinners a better experience and allowed us to understand our product better at the same time.

This first trial by fire helped us establish a broad culture of experimentation and data-driven decision-making, as well as learn to break down big changes into individually testable components.

2014: Pyapns

We rely on an open-source library called pyapns for sending push notifications via Apple’s servers. The project was written several years ago and wasn’t well maintained. Based on our data and what we’d heard from other companies, we had concerns about its reliability. We decided to test out using a different library called PyAPNs, which seemed better written and better maintained. We set up an A/B experiment, monitored the results and found that there was a 1 percent decrease in our visitors with PyAPNs. We did some digging and couldn’t determine the cause for the drop, so we eventually decided to roll back and stick with pyapns.

Figure 1: Experiment results for replacing pyapns

2015: User Service

We’ve slowly been moving towards a more service-oriented architecture. Recently we extracted a lot of our code for managing users and encapsulated it in a new UserService. We took an iterative approach to building the service, extracting one piece of functionality at a time. With such a major refactor of how we handle all user-related data, we wanted to ensure nothing broke. We set up an experiment for each major piece of functionality that was extracted, for a total of three experiments. Each experiment completed successfully, showing no drop in any metric. The results have given us strong confidence that the new UserService is at parity with the previous code.

We’ve had a lot of success with A/B testing our infrastructure. It’s helped us identify changes that caused a serious negative impact we probably wouldn’t have noticed otherwise. When the experiments go well, they also give us confidence that a new system is performing as expected. If you’re not A/B testing your infrastructure changes, you really should be.

By John Egan, a growth engineer, and Andrea Burbank, a data scientist at Pinterest

Acknowledgements: Dan Feng, Josh Inkenbrandt, Nadine Harik, Vy Phan, John Egan and Andrea Burbank for helping run the experiments covered in this post.

Originally published on the Pinterest Engineering Blog

Long-term Impact of Badging

When it comes to growth, one potential pitfall is over optimizing for short-term wins. Growth teams operate at a pretty fast pace, and our team is no exception. We’re always running dozens of experiments at any given time, and once we find something that works, we ship it and move on to the next experiment. However, sometimes it’s important to take a step back and validate that a new tweak or feature really delivers long-term sustainable growth and isn’t just a short-term win that users will get tired of after prolonged exposure. In this post I’ll cover how we optimize for long-term sustainable growth.

Last year, we started including a badge number with all our push notifications. For many people, when an app has a badge number on it, the impulse to open and clear it is irresistible.

Figure 1: Example of badging on the Pinterest iOS app

When we launched badges, we ran an A/B experiment, as we do with any change, and the initial results were fantastic. Badging showed a 7 percent lift in daily active users (DAUs) and a significant lift in other key engagement metrics such as Pin close-ups, repins and Pin click-throughs. With such fantastic results, we quickly shipped the experiment. However, we had a nagging question about the long-term effectiveness of badging. Is badging effective long-term, or does user fatigue eventually set in and make users immune to it?

To answer this question, we created a 1 percent holdout group. A holdout group is an A/B experiment where you ship a feature to 99 percent of Pinners (users) and keep 1 percent from seeing the feature in order to measure the long-term impact. We will typically run a holdout group whenever we have questions about the long-term impact or effectiveness of a particular feature.
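Mechanically, a holdout can use the same kind of deterministic bucketing as any A/B experiment, just with a small slice that never gets the feature; a rough sketch (names and sizes are illustrative, not our actual framework):

```python
import hashlib

def holdout_group(user_id, feature="badging", holdout_bps=100):
    """Assign a user to 'holdout' or 'enabled' for a long-running holdback.

    holdout_bps is the holdout size in basis points (100 = 1%). Deterministic
    hashing keeps a user in the same group for the life of the experiment.
    Names and sizes here are illustrative assumptions.
    """
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    return "holdout" if int(digest, 16) % 10000 < holdout_bps else "enabled"

def should_show_badge(user_id):
    # The 1% holdout never sees badges, so long-term impact can be measured.
    return holdout_group(user_id) == "enabled"
```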

We ran a holdout experiment for a little over one year. What we found was that the initial lift of 7 percent in DAUs settled on a long-term baseline of a 2.5 percent lift in DAUs after a couple months (see mobile views in Figure 2). Then last fall we launched a new feature, Pinterest News, a digest of recent activity of the Pinners you follow. As part of News, we would also badge Pinners when there were new News items. As a result, News helped increase the long-term lift of badging from 2.5 percent to 4 percent.

Figure 2: Lift of the badging group over the holdout group for key engagement metrics

We also found that badging was effective at increasing engagement levels. We classify Pinners into core, casual, marginal, etc., and we found badging had a statistically significant impact on attracting those who would have fallen in the marginal or dormant bucket to instead become core or casual users. This finding was compelling since it proved that badging is effective at improving long-term retention.

Figure 3: The badging group drove more Pinners into higher engagement buckets when compared to the holdout group

Holdout groups have been an effective way for us to ensure we’re building for long-term growth. We also have holdout groups for features like ads, user education, etc. In general, holdout groups should be used anytime there is a question about the long-term impact of a feature. In the case of badging, it allowed us to understand how Pinners responded to badging over a prolonged period of time, which will help inform our notification strategy going forward.

This article was also published to the Pinterest Engineering Blog

Growth Hacker TV Episode

Here is my Growth Hacker TV episode. Be sure to view their other great interviews at http://www.growthhacker.tv

 

Data Driven Growth

Here is the video of a talk I gave recently at the Weapons of Mass Distribution conference run by 500 Startups. I spoke about how you can use data to help drive your growth strategy and figure out which areas of growth you should be focusing on.