The person on build rotation, or the nightly schlimazel I suppose, went into a hot 5'x8' closet containing an ancient computer. This happened after everyone else had left, so around 8:30PM. Although in crunch time that was more like 11:30PM. And we were in crunch time at one point for a stretch of a year and a half. "That release left a mark," my friend Matt used to say. In a halfhearted attempt at fairness to those who will take this post as a grave insult, I'll concede that my remembrance of these details is the work of The Mark.
Anyway, the build happened after quitting time. This guaranteed that if anything went wrong, you were on your own. Failure in giving birth to the test build implied that the 20 people in Gurgaon comprising the QA department would show up for work in a matter of hours having nothing to do.
You used a tool called "VBBuild." This was a GUI tool, rumored to be written by Russians:
VBBuild did mysterious COM stuff to create the DLLs that nobody at the time understood properly. It presented you with dozens of popups even when it was working perfectly, and you had to be present to dismiss each of them. The production of executable binary code was all smoke and lasers. And, apparently, popups.
Developers wrote code using the more familiar VB6 IDE. The IDE could run interpreted code as an interactive debugger, but it could not produce finished libraries in a particularly repeatable or practical way. So the release compilation was different in many respects from what programmers were doing at their desks. Were there problems that existed in one of these environments but not the other? Yes, sometimes. I recall that we had a single function that weighed in at around 70,000 lines. The IDE would give up and execute this function even if it contained clear syntax errors. That was the kind of discovery which, while exciting, was wasted in solitude somewhere past midnight as you attempted to lex and parse the code for keeps.
Developers weren't really in the habit of doing complete pulls from source control. And who could blame them, since doing this whitescreened your machine for half an hour. They were also never in any particular hurry to commit, at least until it was time to do the test build. As there was no continuous integration at the time, this was the first time that all of the code was compiled in several days.
Often [ed: always] there were compilation errors to be resolved. We were using Visual SourceSafe, so people could be holding an exclusive lock on files containing the errors. Typically, this problem was addressed by walking around the office an hour before build time and reminding everyone to check their files in. In the event that someone forgot [ed: every time], there was an administrative process for unlocking locked files. Not everyone had the necessary rights to do this, but happily, I did.
By design, the build tried to assume an exclusive lock on all of the code. As a result, nobody could work while the build was in progress. Sometimes, the person performing the build would check all of the files out and not check them back in. So your first act the morning after a build might be to walk over to the build closet and release the source files from their chains.
Deployment required dozens of manual steps that I will never be able to remember. When the build was done, you copied DLLs over to the test machines and registered them there. By "copied" I mean that you selected them in an Explorer window, pressed "Ctrl-C," and then pressed "Ctrl-V" to paste them into another. There was no batch script worked out to do this more efficiently. Ok, this is a slight lie. There had been a script, but it was put out to pasture on account of a history of hideous malfunction. And popups. On remote machines sometimes, where they could only be dismissed by wind and ghosts.
Registration involved connecting to each machine with Remote Desktop and right-clicking all the DLLs. You could skip a machine or just one library, and things would be very screwy indeed.
The production release, which happened roughly twice a year under ideal conditions, was identical to this but with the added complexity of about eight more servers receiving the build. And we might take the opportunity to add completely new machines, which would not necessarily have the same patch levels for, oh, like 700,000 Windows components that were relied upon.
Given eight or ten machines, the probability of a mistake on at least one of the servers approached unity. So the days and weeks following a production release were generally spent sussing out all of the minute differences and misconfigurations on the production machines. There would be catastrophic bugs that affected a tiny sliver of requests, under highly specific server conditions, and only if executed on one server out of eight. I was an expert at debugging in disassembly at the time. Upon leaving the job, I thought that this was pretty badass. But in the seven years since--do you know what? It's never come up.
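To put a rough number on "approached unity," here is a minimal sketch. The per-step error rate and the step count are invented for illustration (the post only says "dozens of manual steps"), and the steps are assumed to fail independently.

```python
def p_any_server_wrong(p_step_error: float, steps_per_server: int, servers: int) -> float:
    """Probability that at least one server ends up misconfigured.

    Assumes every manual step fails independently with the same probability.
    """
    # A server is fine only if every one of its manual steps went right.
    p_server_ok = (1.0 - p_step_error) ** steps_per_server
    # The release is fine only if every server is fine.
    return 1.0 - p_server_ok ** servers


# Hypothetical inputs: a 1% slip rate per step, 30 steps per machine.
for servers in (1, 8, 10):
    print(servers, round(p_any_server_wrong(0.01, 30, servers), 2))
```

With those made-up inputs the chance of at least one bad machine is about 0.26 for a single server, and about 0.91 and 0.95 for eight and ten.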
At one point I wrote a new script to perform the deployment. It was an abomination of XML to be sure, but it got the job done without all of the popups. I started using it for the test build, with some success, and suggested that we use it for the production release. This was out of the question, I was told by one of my closer allies in the place. The production release was "too important to use a script."
The operating systems and supporting libraries on the machines were also set up by hand, by a separate team, working from printed notes. The results were similar. This is kind of another story.
For four months ending in early 2011, I worked on a team of six to redesign Etsy's homepage. I don't want to overstate the weight of this in the grand scheme of things, but hopes flew high. The new version was to look something like this:
But perhaps worst of all, we publicized the experiment. Well, "publicized" does not accurately convey the magnitude of what we did. We allowed visitors to join the treatment group using a magic URL. We proactively told our most engaged users about this. We tweeted the magic URL from the @Etsy account, which at that point had well over a million followers.
This project was a disaster for many reasons. Nearly all of the core hypotheses turned out to be completely wrong. The work was thrown out as a total loss. Everyone involved learned valuable life lessons. I am here today to elaborate on one of these: telling users about the experiment as it was running was a big mistake.
The Diamond-Forging Pressure to Disclose Experiments
If you operate a website with an active community, and you do A/B testing, you might feel some pressure to disclose your work. And this seems like a proper thing to do, if your users are invested in your site in any serious way. They may notice anyway, and the most common reaction to change on a beloved site tends to be varying degrees of panic.
As an honest administrator, your wish is to reassure your community that you have their best interest at heart. Transparency is the best policy!
Except in this case. I think there's a strong argument to be made against announcing the details of active experiments. It turns out to be easier for motivated users to overturn your experiment than you may believe. And disclosing experiments is work, and work that comes before real data should be minimized.
Online Protests: Not Necessarily A Waste of Time
A fundamental reason that you should not publicize your A/B tests is that this can introduce bias that can affect your measurements. This can even overturn your results. There are many different ways for this to play out.
Most directly, motivated users can just perform positive actions on the site if they believe that they are in their preferred experiment bucket. Even if the control and treatment groups are very large, the number of people completing a goal metric (such as purchasing) may be just a fraction of that. And the anticipated difference between any two treatments might be slight. It's not hard to imagine how a small group of people could determine an outcome if they knew exactly what to do.
One bucket: 50 organic conversions + 10 gamed conversions
Other bucket: 50 organic conversions + 0 gamed conversions
Figure 1: In some cases a small group of motivated users can change an outcome, even if the sample sizes are large.
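The arithmetic behind Figure 1 can be sketched in a few lines. The conversion counts come from the figure; the bucket size is an assumption made up for the example, and it turns out not to matter for the relative lift.

```python
def conversion_rate(organic: int, gamed: int, visitors: int) -> float:
    """Observed conversion rate for one bucket."""
    return (organic + gamed) / visitors


VISITORS_PER_BUCKET = 10_000  # assumption: not stated in the figure

gamed_bucket = conversion_rate(organic=50, gamed=10, visitors=VISITORS_PER_BUCKET)
other_bucket = conversion_rate(organic=50, gamed=0, visitors=VISITORS_PER_BUCKET)

# Relative difference that an analyst would see between the two buckets.
apparent_lift = (gamed_bucket - other_bucket) / other_bucket
print(f"apparent lift: {apparent_lift:.0%}")
```

Ten motivated people produce an apparent 20% lift, which is far larger than the differences most experiments are hoping to detect.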
As the scope and details of an experiment become more fully understood, this gets easier to accomplish. But intentional, organized action is not the only possible source of bias.
Even if users have no preference as to which version of a feature wins, some will still be curious. If you announce an experiment, visitors will engage with the feature immediately who otherwise would have stayed away. This well-intentioned interest could ironically make a winning feature appear to be a loss. Here's an illustration of what that looks like.
Control: 500 oblivious visits + 50 rubbernecking visits = 550 total visits
Treatment: 500 oblivious visits + 250 rubbernecking visits = 750 total visits
Figure 2: An example in which 100 engaged users are told about a new experiment. They are all curious and seek out the feature. Those seeing the new treatment visit the new feature more often just to look at it, skewing measurement.
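Here is one way the visit counts in Figure 2 could turn a winner into an apparent loser. The visit counts are from the figure. The conversion counts are hypothetical, as is the assumption that rubbernecking visits never convert.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    oblivious_visits: int
    rubberneck_visits: int
    conversions: int  # hypothetical: the figure gives visits only

    @property
    def total_visits(self) -> int:
        return self.oblivious_visits + self.rubberneck_visits

    @property
    def measured_rate(self) -> float:
        # What the experiment reports: conversions over all visits.
        return self.conversions / self.total_visits

    @property
    def undisturbed_rate(self) -> float:
        # What it would have reported had nobody been told.
        return self.conversions / self.oblivious_visits


control = Bucket(oblivious_visits=500, rubberneck_visits=50, conversions=50)
treatment = Bucket(oblivious_visits=500, rubberneck_visits=250, conversions=55)

for name, bucket in (("control", control), ("treatment", treatment)):
    print(name, bucket.total_visits,
          round(bucket.undisturbed_rate, 3), round(bucket.measured_rate, 3))
```

Under these assumptions the treatment converts better among oblivious visitors, but the extra sightseeing visits dilute its measured rate to below the control's.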
Good experimental practice requires that you isolate the intended change as the sole variable being tested. To accomplish this, you randomly assign visitors the new treatment or the old, controlling for all other factors. Informing visitors that they're part of an experiment places this central assumption in considerable jeopardy.
Predicting Bias is Hard
"But," you might say, "most users aren't paying attention to our communiqués." You may think that you can announce experiments, and only a small group of the most engaged people will notice. This is very likely true. But as I have already shown, the behavior of a small group cannot be dismissed out of hand.
Obviously, this varies. There are experiments in which a vocal minority cannot possibly bias results. But determining if this is true for any given experiment in advance is a difficult task. There is roughly one way for an experiment to be conducted correctly, and there are an infinite number of ways for it to be screwed.
A/B tests are already complicated: bucketing, data collection, experimental design, experimental power, and analysis are all vulnerable to mistakes. From this point of view, "is it safe to talk about this?" is just another brittle moving part.
Communication Plans are Real Work
Something I have come to appreciate over the years is the role of product marketing. I have been involved in many releases for which the act of explaining and gaining acceptance for a new feature constituted the majority of the effort. Launches involve a lot more than pressing a deploy button. This is a big deal.
It also seems to be true that people who are skilled at this kind of work are hard to come by. You will be lucky to have a few of them, and this imposes limits on the number of major changes that you can make in any given year.
It makes excellent sense to avoid wasting this resource on quite-possibly-fleeting experiments. It will delay their deployment, steal cycles from launches for finished features, and it will do these things in the service of work that may never see the light of day!
Users will tend to view any experiment as presaging an imminent release, regardless of your intentions. Therefore, you will need to put together a relatively complete narrative explaining why the changes are positive at the outset. A "minimum viable announcement" probably won't do. And you will need to execute this without the benefit of quantitative results to bolster your case.
Your Daily Reminder that Experiments Fail
Doing data-driven product work really does imply that you will not release changes that don't meet some quantitative standard. In such an event you might tweak things and start over, or you might give up altogether. Announcing your running experiments is problematic given this reality.
Obviously, product costs will be compounded by communication costs. Every time you retool an experiment, you will have to bear the additional weight of updating your community. Adding marginal effort makes it more difficult for humans to behave rationally and objectively. We have a name for this well-known pathology: the sunk cost fallacy. We've put so much into this feature, we can't just give up on it now.
Announcing experiments also has a way of raising the stakes. The prospect of backtracking with your users (and being perceived as admitting a mistake) only makes killing a bad feature less palatable. The last thing you need is additional temptation to delude yourself. You have plenty of this already. The danger of living in public is that it will turn a bad release that should be discarded into an inevitability.
Consistency and Expectations
Let's say you've figured out workarounds for every issue I've raised so far. You are still going to want to run experiments that are not publicly declared.
Some experiments are inherently controversial or exploratory. It may be perfectly legitimate to try changes that you would never release to learn more about your site. Removing a dearly beloved feature temporarily for half of new registrations is a good example of this. By doing so, you can measure the effect of that feature on lifetime value, and make better decisions with your marketing budget.
Other experiments work only when they're difficult to detect. Search ranking is a high-stakes arms race, and complete transparency can just make it easier for malicious users to gain unfair advantages. It's likely you're going to want to run experiments on search ranking without disclosing them.
It would be malpractice to give users the expectation that they will always know the state of running experiments. They will not have the complete picture. Leading them to believe otherwise can do more harm to your relationship than just having a consistent policy of remaining silent until features are ready for release.
What can you share?
Sharing too much too soon can doom your A/B tests. But this doesn't mean that you are doomed to be locked in a steel cage match with your user base over them.
You can do rigorous, well-controlled experiments and also announce features in advance of their release. You can give people time to acclimate to them. You can let users preview new functionality, and enable them at a slower pace. These practices all relate to how a feature is released, and they are not necessarily in conflict with how you decide which features should be released. It is important to decouple these concerns.
You can and should share information about completed experiments. "What happened in the A/B test" should be a regular feature of your release notes. If you really have determined that your new functionality performs better than what it replaces, your users should have this data.
Counterintuitively, perhaps, trust is also improved by sharing the details of failed experiments. If you only tell users about your victories, they have no reason to believe that you are behaving objectively. Who's to say that you aren't just making up your numbers? Showing your scars (as I tried to do with my homepage story above) can serve as a powerful declaration against interest.
Successful Testing is Good Stewardship
Your job in product development, very broadly, is to make progress while striking a balance between short and long term concerns.
Users should be as happy as possible in the short term.
Your site should continue to exist in the long term.
The best interest of your users is ultimately served by making the correct changes to your product. Talking about experiments can break them, leading to both quantitative errors and mistakes of judgment.
I firmly believe that A/B tests in any organization should be as free, easy, and cheap as humanly possible. After all, running A/B tests is perhaps the only way to know that you're making the right changes. Disclosing experiments as they are running is a policy that can alleviate some discontent in the short term. But the price of this is making experiments harder to run in the long term, and ultimately making it less likely that measurement will be done at all.
I have been very fortunate in a number of respects. I have access to the twisted and talented mind of Eric Beug. And not only do I have a broad mandate to behave like a lunatic, but I also have dozens of like-minded coworkers. I got paid to make this. That makes me a professional actor.
The question of how long an A/B test needs to run comes up all the time. And the answer is that it really depends. It depends on how much traffic you have, on how you divide it up, on the base rates of the metrics you're trying to change, and on how much you manage to change them. It also depends on what you deem to be acceptable rates for Type I and Type II errors.
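A minimal sketch of how those inputs combine, using one common normal-approximation formula for comparing two proportions. This is not the tool mentioned in this post, and the function names, traffic numbers, and 50/50 split between two buckets are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_bucket(base_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate visitors needed in each of two equally sized buckets."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided Type I error rate
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - Type II error rate
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(daily_visitors: int, fraction_in_test: float,
                base_rate: float, relative_lift: float) -> int:
    """Days of traffic needed when the test traffic is split evenly in two."""
    per_bucket_per_day = daily_visitors * fraction_in_test / 2
    return ceil(visitors_per_bucket(base_rate, relative_lift) / per_bucket_per_day)


# Hypothetical site: 100,000 visitors a day, 20% of them in the experiment,
# a 5% base conversion rate, and a hoped-for 10% relative improvement.
print(visitors_per_bucket(0.05, 0.10))
print(days_to_run(100_000, 0.20, 0.05, 0.10))
```

Halving the hoped-for lift roughly quadruples the required sample, which is why small expected changes and small test fractions make for very long experiments.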
In the face of this complexity, community concerns ("we don't want too many people to see this until we're sure about it") and scheduling concerns ("we'd like to release this week") can dominate. But this can be setting yourself up for failure, by embarking on experiments that have little chance of detecting positive or negative changes. Sometimes adjustments can be made to avoid this. And sometimes adjustments aren't possible.
To help with this, I built a tool that will let you play around with all of the inputs. You can find it here:
Many have asked me if Etsy does bandit testing. The short answer is that we don't, and as far as I know nobody is seriously considering changing that anytime soon. This has come up often enough that I should write down my reasoning.
First, let me be explicit about terminology. When we do tests at Etsy, they work like this:
We have a fixed number of treatments that might be shown.
We assign the weighting of the treatments at the outset of the test, and we don't change them.
We pick a sample size ahead of time that makes us likely to notice differences of consequential magnitude.
In addressing "bandit testing," I'm referring to any strategy that might involve adaptively re-weighting an ongoing test or keeping experiments running for indefinitely long periods of time.
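For concreteness, here is a sketch of one simple member of that family, an epsilon-greedy bandit. It is offered only to illustrate what adaptive re-weighting means; the conversion rates are invented, and nothing here describes a system Etsy runs.

```python
import random


def epsilon_greedy(true_rates, pulls, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; return how often each arm was shown."""
    rng = random.Random(seed)
    shown = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: show a treatment chosen uniformly at random.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: show whichever treatment looks best so far.
            observed = [w / s if s else 0.0 for w, s in zip(wins, shown)]
            arm = observed.index(max(observed))
        shown[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return shown


# Hypothetical treatments converting at 5% and 10%.
print(epsilon_greedy([0.05, 0.10], pulls=20_000))
```

Unlike the fixed-weight tests described above, the share of traffic each treatment receives drifts as results come in, and the apparent loser keeps receiving a small trickle for as long as the test runs.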
Noel Welsh at Untyped has written a high-level overview of bandit testing, here. It's a reasonable introduction to the concept and the problems it addresses, although I view the benefits to be more elusive than it presents. "It is well known in the academic community that A/B testing is significantly sub-optimal," it says, and I have no reason to doubt that this is true. But as I hope to explain, the domain in which this definition of "sub-optimal" applies is narrowly constrained.
Gremlins: The Ancient Enemy
At Etsy, we practice continuous deployment. We don't do releases in the classical sense of the word. Instead, we push code live a few dozen lines at a time. When we build a replacement for something, it lives beside its antecedent in production code until it's finished. And when the replacement is ready, we flip a switch to make it live. Cutting and pasting an entire class, making some small modifications, and then ramping it up is not unheard of at all. Actually, it's standard practice, the aesthetics of the situation be damned.
This methodology is occasionally attacked for its tendency to leave bits of dead code lying around. I think that this criticism is unfair. We do eventually excise dead code, thank you. And all other methods of operating a consumer website are inferior. That said, if you twist my arm and promise to quote me anonymously I will concede that yes, we do have a pretty epic pile of dead code at this point. I'm fine with this, but it's there.
My experience here has revealed what I take to be a fundamental law of nature. Given time, the code in the "off" branch no longer works. Errors in a feature ramped up to small percentages of traffic also have a way of passing unnoticed. For practitioners of continuous deployment, production traffic is the lifeblood of working code. Its denial is quickly mortal.
This relates to the discussion at hand in that bandit testing will ramp the losers of experiments down on its own, and keep them around at low volume indefinitely. The end result is a philosophical conundrum, of sorts. Are the losers of experiments losing because they are broken, or are they broken because they are losing?
The beauty of Etsy's A/B testing infrastructure lies in its simplicity.
Experiments are initiated with minimal modifications of our config file.
Visitors are bucketed based on a single, fixed-width value in a persistent cookie.
One of the advantages of this parsimony is that new tests are "free," at least in the engineering sense of the word. They're not free if we are measuring the mass of their combined cognitive overhead. But they are free in that there are no capacity implications of running even hundreds of experiments at once. This is an ideal setup for those of us who maintain that the measurement of our releases ought to be the norm.
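A minimal sketch of stateless bucketing in this spirit. The post says only that a single fixed-width cookie value is used; hashing that value together with an experiment name, and every identifier below, is an assumption for illustration rather than Etsy's implementation.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Map a cookie value to a variant with no server-side state.

    `weights` maps variant names to fractions that sum to 1.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # guard against floating-point rounding in the weights


# Hypothetical experiment name and cookie value.
print(assign_variant("a1b2c3d4", "new_header", {"control": 0.5, "treatment": 0.5}))
```

Because the answer is a pure function of the cookie and the config, any web server can compute it on any request, which is what keeps additional experiments cheap. It also shows why changing the weights mid-test moves some existing visitors to a different bucket.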
Bandit testing upsets this situation in an insidious way. As I explained above, once we weight our tests we don't tweak the proportions later. The reason for this is to maintain the consistency of what visitors are seeing.
Imagine the flow of traffic immediately before and after the initiation of an experiment on Etsy's header. For visitors destined for the new treatment, at first the header looks as it has for several years. Then in their next request, it changes without warning. Should we attribute the behavior of that visitor to the old header or to the new one? Reconciling this is difficult, and in our case we dodge it by throwing out visits that have switched buckets. (We are not even this precise. We just throw out data for the entire day if it's the start of the experiment.)
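The blunt version of that rule is easy to sketch. The record layout and field names here are hypothetical.

```python
from datetime import date


def usable_visits(visits, experiment_start: date):
    """Drop every visit from the day the experiment started.

    Visitors on that day may have seen both the old and new behavior,
    so their data is discarded rather than attributed to either.
    """
    return [v for v in visits if v["day"] > experiment_start]


# Hypothetical visit records.
visits = [
    {"visitor": "a", "day": date(2013, 3, 4)},
    {"visitor": "b", "day": date(2013, 3, 5)},
    {"visitor": "a", "day": date(2013, 3, 6)},
]
print(usable_visits(visits, experiment_start=date(2013, 3, 4)))
```

This costs one day of data per experiment. A test that re-weights continuously would have no such clean boundary to cut at.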
Bandit testing, in adjusting weights much more aggressively, exacerbates this issue. We would be forced to deal with it in one way or another.
We could try to establish a rule for what to do with visits that see inconsistent behavior. A universally-applicable heuristic for this is not straightforward. And even if feasible, this approach would necessitate making the analysis more complicated. Increasing complexity in analysis increases the likelihood of it being incorrect.
We could continue to ignore visits that see inconsistent behavior. Depending on specifics, this could discard a large amount of data. This decreases the power of the experiment, and undermines its ability to reach a correct conclusion.
We could attempt to ensure that visits only ever see one treatment, while re-weighting the test for fresh visitors. This sounds like a great idea, but ruins the notion of tests as an operational free lunch. Test variant membership, for Etsy, is independent across web requests. Introducing dependence brings tradeoffs that developers should be familiar with. We could keep test membership in a larger cookie, but if the cookie gets too large it will increase the number of packets necessary for user requests. We could record test membership on the server, but we would have to build, maintain, and scale that infrastructure. And every time we added an experiment, we would have to ask ourselves if it was really worth the overhead.
On the Ridiculous Expectations of Runaway Victory
When we release any new feature, it is our hope that it will be a gigantic and undeniable success. Sadly (and as I have discussed at length before), this is practically never what happens. Successful launches are almost always characterized by an incremental improvement in some metric other than purchases or registrations.
Wins in terms of purchases do happen occasionally, and they make life considerably more bearable when they do. But they're exceedingly rare. What is not rare is the experience of releasing something that makes purchase conversion worse. This turns out to be very easy in an annoyingly asymmetrical way.
What we are usually aiming for with our releases is tactical progress on our longer-term strategic goals. Modest gains or even just "not extremely broken" is what we can rationally hope for. Given this background, bandit testing would be wildly inappropriate.
Regret Approaches Zero
Let me point out something that may not be obvious: when we test features on Etsy, we are not typically testing the equivalent of banner advertisements with a limited shelf life. Not that I am suggesting that there is anything wrong with doing so. Nor do I think this is the only scenario in which bandit testing is called for.
But new features and the redesign of existing features are different in several important ways. The unlikelihood of purchase or registration conversion wins means that "regret" in the vernacular sense is minimal to begin with, obviating the need for an algorithm that minimizes regret in the technical sense. And the fact that we are building features for the longer term implies that any regret accumulated during the course of an experiment is minor from the perspective of all history. From this vantage point, the elegant simplicity of not bandit testing wins out.
Is bandit testing right for you? I believe it is a question worth asking. It may be the case that you should (to borrow Noel's imagery) "join their merry band." And if so, master, be one of them; it is an honourable kind of thievery.
In the absence of practical constraints, I have no argument against this. But reality is never lacking in practical constraints.
My first several years out of college were spent building a financial data website. The product, and the company, were run by salesmen. Subscribers paid tens of thousands of dollars per seat to use our software. That entitled them to on-site training and, in some cases, direct input on product decisions. We did giant releases that often required years to complete, and one by one we were ground to bits by long stretches of hundred-hour weeks.
Whatever I might think of this as a worthwhile human endeavor generally, as a business model it was on solid footing. And experimental rigor belonged nowhere near it. For one thing, design was completely beside the point: in most cases the users and those making the purchasing decisions weren't the same people. Purchases were determined by a comparison of our feature set to that of a competitor's. The price point implied that training in person would smooth over any usability issues. Eventually, I freaked out and moved to Brooklyn.
When I got to Etsy in 2007, experimentation wasn't something that was done. Although I had some awareness that the consumer web is a different animal, the degree to which this is true was lost on me at the time. So when I found the development model to be the same, I wasn't appropriately surprised. In retrospect, I still wouldn't rank waterfall methodology (with its inherent lack of iteration and measurement) in the top twenty strangest things happening at Etsy in the early days. So it would be really out of place to fault anyone for it.
So anyway, in my first few years at Etsy the releases went as follows. We would plan something ambitious. We'd spend a lot of time (generally way too long, but that's another story) building that thing (or some random other thing; again, another story). Eventually it'd be released. We'd talk about the release in our all-hands meeting, at which point there would be applause. We'd move on to other things. Etsy would do generally well, more than doubling in sales year over year. And then after about two years or so we would turn off that feature. And nothing bad would happen.
Some discussion about why this was possible is warranted. The short answer is that this could happen because Etsy's growth was an externality. This is still true today, in 2013. We have somewhere north of 800,000 sellers, thousands of whom are probably attending craft fairs as we speak and promoting themselves. And also, our site. We're lucky, but any site experiencing growth is probably in a similar situation: there's a core feature set that is working for you. Cool. This subsidizes anything else you wish to do, and if you aren't thinking about things very hard you will attribute the growth to whatever you did most recently. It's easy to declare yourself to be a genius in this situation and call it a day. The status quo in our working lives is to confuse effort with progress.
But I had stuck around at Etsy long enough to see behind the curtain. Eventually, the support tickets for celebrated features would reach critical mass, and someone would try to figure out if they were even worth the time. For a shockingly large percentage, the answer to this was "no." And usually, I had something to do with those features.
I had cut my teeth at one job that I considered to be meaningless. And although I viewed Etsy's work to be extremely meaningful, as I still do, I couldn't suppress the idea that I wasn't making the most of my labor. Even if the situation allowed for it, I did not want to be deluded about the importance and the effectiveness of my life's work.
Measurement is the way out of this. When growth is an externality, controlled experiments are the only way to distinguish a good release from a bad one. But to measure is to risk overturning the apple cart: it introduces the possibility of work being acknowledged as a regrettable waste of time. (Some personalities you may encounter will not want to test purely for this reason. But not, in my experience, the kind of personalities that wind up being engineers.)
Through my own experimentation, I have uncovered a secret that makes this confrontation palatable. Here it is: nearly everything fails. As I have measured the features I've built, it's been humbling to realize how rare it is for them to succeed on the first attempt. I strongly suspect that this experience is universal, but it is not universally recognized or acknowledged.
If someone claims success without measurement from an experiment, odds are pretty good that they are mistaken. Experimentation is the only way to separate reality from the noise, and to learn. And the only way to make progress is to incorporate the presumption of failure into the process.
Don't spend six months building something if you can divide it into smaller, measurable pieces. The six month version will probably fail. Because everything fails. When it does, you will have six months of changes to untangle if you want to determine which parts work and which parts don't. Small steps that are validated not to fail and that build on one another are the best way, short of luck, to actually accomplish our highest ambitions.
To paraphrase Marx: the demand to give up illusions is the demand to give up the conditions that require illusions. I don't ask people to test because I want them to see how badly they are failing. I ask them to test so that they can stop failing.
In my post about real-time analysis I shared a screenshot of part of Etsy's deployment dashboard. This is the dashboard that every engineer watches as he or she pushes code to production. A bunch of alert readers noticed some odd things about it:
The screenshot is not doctored, so yes we do graph "Three-Armed Sweaters" and "Screwed Users." I can explain. In fact, I can give you excruciating detail about it, if you're interested! Here goes.
"Three-Armed Sweaters" refers to our error pages, which feature one of my favorite drawings in the world. It was done by Anda Corrie:
So the graph on the dashboard is just counting the number of times this page is shown. But in order to reduce the frequency of false alarms, the graph is actually based on requests to an image beacon hidden on the page. This excludes most crawlers and vulnerability scanners. Those constituencies have a habit of generating thousands of errors when nothing is malfunctioning. But lucky for us, they almost never waste bandwidth on images.
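A sketch of what the server side of such a beacon might do when the image is requested: emit a StatsD counter over UDP. The `name:value|c` packet format and port 8125 are StatsD's documented defaults; the metric name, the host, and the function names are assumptions and not Etsy's actual configuration.

```python
import socket


def statsd_counter(metric: str, count: int = 1) -> bytes:
    """Format a StatsD counter packet."""
    return f"{metric}:{count}|c".encode("ascii")


def record_beacon_hit(host: str = "127.0.0.1", port: int = 8125,
                      metric: str = "errors.sweater_beacon") -> bytes:
    """Fire-and-forget a counter increment when the beacon image is served."""
    packet = statsd_counter(metric)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP: if nothing is listening, the datagram is silently lost.
        sock.sendto(packet, (host, port))
    return packet


print(statsd_counter("errors.sweater_beacon"))
```

Counting at the beacon rather than at the error page is what filters out clients that never fetch images.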
Now, there are many reasons why Etsy might not be working, and they don't all result in our machines serving a sweater page. If our CDN provider can't reach our production network, it will show an error page of its own instead. In these cases, our infrastructure may not even be seeing the requests. But we can still graph these errors by situating their image beacon on a wholly separate set of web machines.
The "screwed users" graph is the union of all of these conditions. So-called, presumably, because all of this nuance is relatively meaningless to outsiders. "Screwed users" also attempts to only count unique visitors over a trailing interval. This has the nice property of causing the screwed users and sweaters graphs to diverge in the event that a single person is generating a lot of errors. The internet, after all, is full of weird people who occasionally do weird things with scripts and browsers.
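Counting unique visitors over a trailing interval can be sketched with a queue and a tally. This is an illustration under stated assumptions (events arrive in time order, and the class name, window length, and visitor identifiers are hypothetical), not a description of how Etsy computes the graph.

```python
from collections import deque


class TrailingUniqueCounter:
    """Count distinct visitors with at least one error in the last N seconds."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()  # (timestamp, visitor_id), oldest first
        self.counts = {}       # visitor_id -> events still inside the window

    def record(self, timestamp: float, visitor_id: str) -> None:
        # Assumes timestamps arrive in non-decreasing order.
        self.events.append((timestamp, visitor_id))
        self.counts[visitor_id] = self.counts.get(visitor_id, 0) + 1

    def unique(self, now: float) -> int:
        cutoff = now - self.window
        # Expire events that have fallen out of the trailing window.
        while self.events and self.events[0][0] <= cutoff:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return len(self.counts)


counter = TrailingUniqueCounter(window_seconds=300)
for t in (0, 1, 2):
    counter.record(t, "script-kid")  # one visitor, many errors
counter.record(10, "someone-else")
print(len(counter.events), counter.unique(now=20))
```

Four error events but only two screwed users: this is the divergence between the two graphs described above.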
You now know exactly as much as I do about the graphing of web errors in real time. I assume that this is a tiny fraction of the world's total knowledge pertaining to the graphing of web errors in real time. So you would be ill-advised to claim expert status on the basis of grasping everything I have explained here.
By the way, most of the software Etsy uses to produce these graphs is freely available. Here's StatsD and Logster.