Webinar: A/B Testing 101 by Booking.com Product Manager, Saurav Roy

Hello everyone. My name is Saurav and I'm a product manager at Booking.com, in the payments department. Today I'll be sharing my thoughts on A/B experimentation 101. It's public knowledge that Booking.com relies heavily on a data-driven approach: we do all our rollouts and changes through experimentation, and we're widely known for that. So I'd like to give a basic class on A/B experimentation for those of you who aren't very familiar with it. I'm also happy to deep-dive into specific topics; if there's something you'd like to know more about, you can always reach out to me through LinkedIn, and you can find my details on the Product School page.

Let's start with the objective of this presentation. Before COVID times I usually ran this as a workshop. What you'll get out of it is a basic understanding of A/B testing: what it means, when it should be used, how it can be used as a form of validation for your changes, how to collect data and interpret the results, and how to formulate a good hypothesis and measure the outcome. By the end you should be able to better understand the business objectives set by product teams, translate them into experiments, break them down into smaller, multiple experiments, and formulate relevant hypotheses to prove your impact. This is relevant not just for product managers but also for designers, developers, and anyone who is part of a product team. At Booking we don't restrict experimentation to product managers; yes, we're at the forefront of it, but everyone in the company is familiar with experimentation and the tooling, and everyone has the right to run experiments.

Let's start with an example. I have two screenshots here from the Decathlon website. Take the left side as your base and the right side as your variant, and look at the differences. You'll find there are three things that differ: the bike is different (a road bike on the right versus a city bike on the left), the image is significantly larger, and if you look closely there's also a change in the order of the text, with "road bikes" shifted to the top on the right. So: is this a valid experiment to run? That's what we're trying to answer here.

The thing is, a simple A/B comparison is not possible in this case, because there are too many changes and too many variables. The idea of A/B experimentation is to isolate a single change so you can understand the impact of that change in particular. In this specific case, suppose your primary metric does well in the variant: you cannot attribute that to the image, to the size of the image, or to the text. It could be that if you ran these three changes as separate experiments, two would have been positive and one negative, which means you could have had more impact with just those two changes than with all three together. This is just an example.

Here's another example of what I just changed: now the only difference is that the image size is increased and everything else stays the same. This is a good example of an A/B test, because you can really compare base versus variant and measure the impact the size of the image has on its own. Cool.
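Booking.com runs this on internal tooling, but as an illustrative sketch (all counts and the 50/50 split below are hypothetical, not from the talk), a base-versus-variant comparison of conversion rates can be done with a standard two-proportion z-test:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# hypothetical 50/50 split: base converts 300/10,000, variant 360/10,000
z, p = two_proportion_z(300, 10_000, 360, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at the usual 5% level if p < 0.05
```

The point of isolating a single change is that a significant `p` here can be attributed to that one change and nothing else.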

Now, there are a few things we need to know when we talk about experimentation. There's always a primary metric, ideally a North Star metric; it could be something set by your company or organization, it could differ between experiments, and some companies don't have North Star metrics at all. Then there are always some supporting (or secondary, as we call them) metrics, then health metrics and binomial goals. I'll walk you through all of these.

Let's start with the primary metric. In the earlier example with the two bikes, the primary metric could be whatever you want to achieve at the end of the day: say, more sales, or conversion. But you could also run that test with the primary metric being the number of clicks on that category; it depends on the approach taken by your company. Ideally a primary metric should be a North Star metric and should be able to show business impact, and when you run a test you should always aim for a conclusive change in the primary metric. By conclusive I mean statistically significant in terms of the data you gather.

So what are supporting metrics? While you have only one primary metric per A/B experiment, you can have several supporting metrics that support your hypothesis: for example the number of clicks, time spent, funnel conversion, and so on. Good examples of e-commerce supporting metrics are customer service tickets, number of returns, and cancellations: how you're doing overall in terms of the hypothesis and the health of the product.

Then we have the health metrics, which again differ from company to company; they're usually defined by the organization or a tech team. These are typically related to the health of your product: performance, page load times, the speed of your website, the errors you see, the app crashes you have, and so on. These are good to monitor in all experiments, so you can confirm there are no unexpected changes in them.

Finally, there are binomial goals. You could say many metrics are binomial goals; these are goals that are measured in all experiments, like your bounce rate or some of the health metrics. You don't need to add them as part of your experiment criteria, but they're also good to monitor, because they give you an idea of how your changes impact the overall website.

Moving on, I have an example from a very old Booking.com UI; I can't share very recent material because of our guidelines, so I'll just go through this example. This is how our search results used to look some time back, and these are two screenshots.


If we were doing this in person I'd ask you to identify the change, but I'll give it away here: it's just the spacing between the location and the description. Believe it or not, this experiment increased conversion to a great extent, and we can attribute the result to this change because it was a single change that we measured. You can ask how someone came up with the idea; it looks more like a bug that was missed by a developer. Those are all valid points, but this is just an example of how a small change can lead to such a big impact. Cool.


The next question I get a lot is: how long should an A/B test, or an experiment in general, be run? There are two things to consider for the runtime of a test: how much traffic you need, and what is the minimum change you want to detect. The answer is not straightforward, so let me give an example. Say you're looking at the conversion rate from your final checkout page to actual conversions, and that base conversion rate is 70%. From that 70%, do you want to be able to detect a change from 70% to 71%, or only from 70% to 75%? With the same traffic, those imply two different runtimes, because the more granular the change you want to be able to identify, the more time you need, given that the traffic stays the same. And vice versa: for the same detectable change, the more traffic you have, the less time you need to run your test.

As a rule of thumb, though, even if you have plenty of traffic and the change you need to detect is not too small, a test should always run for one or two full week cycles. This accounts for any weekly seasonality you might have in customer behavior. You might say people also behave differently in the holiday season compared to summer; those are valid points, but it's just not feasible to run experiments for a whole year, so this is the best trade-off we can make.

Here's an example of a power calculator that can tell you how long you need to run a test; you can find one online from Optimizely. On the left and right I'm showing the same setup: a base conversion rate of 30%, with exactly the same traffic and number of variants. If you want to detect a change as small as 30% to 31%, you need 37 days to run that experiment. But if you only want to conclusively detect a change as big as 30% to 35%, you can technically run it for just a day. What this means is that if you run it for a day and, say, improve conversion by four percentage points, you will never know conclusively that the improvement is real; as long as the impact is less than five points, you won't have a conclusive result. In the other case, if you run for thirty-seven days, you can detect changes down to one point of granularity. Cool.
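Calculators like Optimizely's are built on standard power analysis. As a rough sketch (the standard two-proportion sample-size formula, not Optimizely's exact internals, so the numbers won't match the slide exactly), you can reproduce the intuition that detecting 30% to 31% takes vastly more traffic than detecting 30% to 35%:

```python
import math

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect a lift from p1 to p2
    with a two-sided test at the given significance and power."""
    z_alpha = 1.96    # two-sided 5% significance
    z_beta = 0.8416   # 80% power
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# detecting 30% -> 31% needs far more users than detecting 30% -> 35%
print(sample_size_per_variant(0.30, 0.31))
print(sample_size_per_variant(0.30, 0.35))
```

Dividing the required sample size by your daily eligible traffic gives the runtime, which is exactly the trade-off described above.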


Next we move on to the hypothesis. Every experiment should have a hypothesis before you run it. A hypothesis is basically why you want to run an experiment and what outcome you expect. There's a nice saying that testing without a hypothesis is like throwing spaghetti at a wall and seeing what sticks: you're throwing stones in the dark and you don't know what you're aiming for. The hypothesis is the tool that protects us from our own biases.

So how do you formulate a good hypothesis? It could look something like this. Based on certain evidence (past user behavior, data you've collected, conversations with users, research, and so on), we believe that making this change, for this type of user (it could be everyone, or something very niche like logged-in users from the Netherlands), will create this impact for them. That's your core hypothesis. Then you add a validation: how will you know the hypothesis holds true? We will know it's true if we see, say, a reduction in customer support calls, an increase in conversion, a reduction in drop-off rate, or more people going from the search results to the product page; that is, if we see this change in your primary metric. All the examples I just gave could be good primary metrics. And finally: why is this good for the business? It's good for the business because if you move this primary metric, it affects certain business KPIs; a primary metric should always correlate with business impact. That's your cheat code for forming a good hypothesis.

You can of course put many more details into your hypothesis, but ideally this is the bare minimum you need. And remember, a good hypothesis protects you from your own biases: once you see the experiment results, it's very easy to say "I expected this" or "I did not expect this", being biased by the results in front of you, and people do it even unconsciously. So it's very important to formulate a good hypothesis before running the experiment. Cool.
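One way to keep yourself honest is to write the hypothesis down in a fixed format before the test starts. This is just an illustrative template (the field names are mine, not a Booking.com standard), filled in with a hypothetical example:

```python
# Illustrative hypothesis template; fill it in and commit to it
# before looking at any results.
HYPOTHESIS = """\
Based on {evidence},
we believe that {change} for {users}
will {impact}.
We will know this is true if we see {validation}.
This is good for the business because {business_link}."""

print(HYPOTHESIS.format(
    evidence="support tickets asking for receipts",
    change="adding a 'Print receipt' link on the confirmation page",
    users="all paid bookings",
    impact="let guests self-serve their receipts",
    validation="a drop in receipt-related support contacts (primary metric)",
    business_link="support contacts cost money and hurt satisfaction",
))
```

Writing it out this way forces you to name the evidence, the audience, the primary metric, and the business link up front, which is exactly what protects you from post-hoc rationalization.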


Next, a question I get a lot: how many experiments do we run at the same time? The answer is hundreds, sometimes thousands. And a lot of people ask: isn't there interaction between these experiments? There's no straightforward answer. Yes, if some things are very closely related there could be impact from one experiment on another, but typically, over a large enough data set, that impact is distributed equally across both variants. Let me explain with an example. Say we're running two experiments in parallel. In experiment one, the variant has a different font than the base: a sans-serif font (Open Sans) on the left versus a serif font on the right. Experiment two changes the color of a button: green in the base, yellow in the variant.

It's true that these experiments are related, but over a large enough set of users and data, the impact of experiment two (the colors) will be distributed equally across the base and variant of the font experiment. If I look at the base of the font experiment, 25% of all users will see it with the green button and 25% with the yellow button, and the same treatment happens for people in the font variant. So over enough users and data, the effect of green versus yellow should be equal across the variants of experiment one. I hope that's clear.
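I don't know Booking's or Amazon's internal assignment code, but a common way to get this independence is to bucket users by hashing their ID with a per-experiment salt; a minimal sketch:

```python
import hashlib
from collections import Counter

def bucket(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to base/variant by hashing
    the user id together with a per-experiment salt."""
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "variant" if h[0] % 2 else "base"

# two experiments running in parallel on the same 100,000 users
combos = Counter(
    (bucket(str(u), "font"), bucket(str(u), "button-color"))
    for u in range(100_000)
)
for combo, n in sorted(combos.items()):
    print(combo, f"{n / 100_000:.1%}")   # each combination lands near 25%
```

Because each experiment uses its own salt, the two assignments are effectively independent, which is why the color change washes out evenly across both arms of the font test.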

Finally, A/B experiments are typically comparative. So what if you need to do a feature rollout, an improvement, or a bug fix where you're not necessarily expecting things to get better, but you just need to do it? There's a form of experiment you can use for that too: the non-inferiority test. You can use it for feature rollouts, new products, and a bunch of other things. Non-inferiority tests are experiments where you claim that your variant is not worse than your base: it's at par, or at least not worse by more than a certain margin.

Let me give an example, again from Booking. A couple of years back we introduced a feature called the payment receipt; this was actually done by my team, so I'm taking the example from payments. You can see that in the base there's no option for the user to get a receipt, and in the variant you see a "Print receipt" option. The missing feature was creating a lot of customer service inbound, we had requests from guests, and we also needed to ship it in some geographies, like Japan, for legal reasons. So you see here that we set this up as a non-inferiority test. What this means is we're claiming that our variant is not worse than our base. My primary metric in this case was customer service tickets (contacts by guests), and I'm saying that the variant will not have more than two percent more customer service contacts than the base. That two percent is the non-inferiority threshold.
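The exact statistics behind the slide aren't spelled out here, so the numbers below are hypothetical, but the mechanic of a one-sided non-inferiority check can be sketched as:

```python
def non_inferior(diff, se, margin, z=1.6449):
    """One-sided non-inferiority check: the variant passes if the upper
    confidence bound of its change in the bad direction stays below the
    margin. z=1.6449 gives a one-sided 95% bound."""
    upper = diff + z * se
    return upper < margin, upper

# hypothetical numbers: observed +0.5% in customer service contacts,
# standard error 0.6%, non-inferiority margin 2%
ok, upper = non_inferior(0.5, 0.6, 2.0)
print(ok, round(upper, 2))  # passes: upper bound 1.49 < 2.0
```

Note the test only checks one side: the variant is allowed to be better by any amount, it just must not be worse than the margin.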

In the screenshot you can see the limit we set, and the result: a 0.78% increase, which is within the threshold, with a standard deviation of 0.92. That's a good enough result, and that's why significance shows up as "yes": it's a one-sided test, and the result is valid on that one side. Cool.

Finally, I'd like to give you one more exercise. If we were doing this in person I'd do it with you, but I think you can do it by yourselves, so I'll just give you the problem statement: try to identify at least two experiments currently running on amazon.com. What you need to do is open amazon.com in an incognito browser, so that your cookies and details aren't saved and you see the website as a new user would. Experiments are typically allocated randomly to different cookies, so each time you close the incognito window and reopen it, you might be able to spot different experiments; doing it on different computers works as well. That's what I mentioned: experiments are randomly assigned based on cookies because you're not logged in; sometimes they're assigned based on your user account or email address, or, in an app, on the device ID. So if you can, pause the video, go to amazon.com, and try to identify two experiments that are currently being run. I'll give you an example (my screenshots were taken a few weeks back). The goal of the exercise is to identify two experiments, formulate a good hypothesis for why amazon.com would be running each of them in the format we discussed, and work out what the primary, secondary, and health metrics for that experiment could be.

Since we're doing this offline, I'll give you two examples that I found. I did this in my browser in incognito mode a few weeks back; by opening the site multiple times I saw different experiences and took screenshots. In both cases I'm not signed in. On the left you see a carousel of recommended (or most popular) books on Kindle, and on the right that carousel is missing. That's one experiment; I don't know if it's still ongoing, so you might find the same one or others. Then there's another experiment: on the right there's a prompt to sign in, because I'm not signed in. In case some of you are wondering, this prompt did not show up after some time on the page; it was there the minute I opened the website. So this is very likely an experiment being run by Amazon.

These are the kinds of A/B experiments companies run, and of course they have their own hypotheses and metrics. Let's take the carousel. The hypothesis for introducing it could be: by introducing, say, the top-selling books on Kindle for non-logged-in users, we help them navigate the website better, which will lead to more people purchasing with us, or more sign-ins, whatever your primary metric is. In this case your primary metric could be more Kindle purchases; more sign-ins could be a secondary metric; your health metrics could include a lower bounce rate, and so on and so forth.

Last but not least, if you're well versed with the browser console, on some websites (Amazon, Facebook, and others) it's not that difficult to find which experiments you've been put into: there are cookies assigned to different experiments, and you might be able to locate them.

Finally, if you have any questions, you can ask in the comment section, and feel free to reach out to me via LinkedIn; my handle is Saurav and there's also a link here. I'm always open for interesting product discussions, be it on experimentation or otherwise. Thank you so much for listening to the talk and spending your time with me. I hope you have a good day. Thanks, bye!
