I have been very fortunate in a number of respects. I have access to the twisted and talented mind of Eric Beug. And not only do I have a broad mandate to behave like a lunatic, but I also have dozens of like-minded coworkers. I got paid to make this. That makes me a professional actor.
The question of how long an A/B test needs to run comes up all the time. And the answer is that it really depends. It depends on how much traffic you have, on how you divide it up, on the base rates of the metrics you're trying to change, and on how much you manage to change them. It also depends on what you deem to be acceptable rates for Type I and Type II errors.
In the face of this complexity, community concerns ("we don't want too many people to see this until we're sure about it") and scheduling concerns ("we'd like to release this week") can dominate. But letting them dominate can set you up for failure, by embarking on experiments that have little chance of detecting positive or negative changes. Sometimes adjustments can be made to avoid this. And sometimes adjustments aren't possible.
To help with this, I built a tool that will let you play around with all of the inputs. You can find it here:
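If you would rather see the arithmetic than take a widget on faith, here is a minimal sketch of the standard two-proportion sample size calculation. The baseline rate, target rate, and error rates below are invented for illustration, and the code assumes SciPy:

```python
from scipy.stats import norm

def sample_size_per_variant(p_base, p_target, alpha=0.05, power=0.8):
    """Visitors needed in each bucket to detect a change from p_base to
    p_target with Type I rate alpha and power (1 - beta), using the usual
    normal approximation for comparing two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    p_bar = (p_base + p_target) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_base * (1 - p_base)
                             + p_target * (1 - p_target)) ** 0.5) ** 2
    return numerator / (p_target - p_base) ** 2

# Detecting a lift from 2.0% to 2.2% conversion takes a lot of traffic:
print(round(sample_size_per_variant(0.020, 0.022)))  # ~80,000 per bucket
```

The punchline is in the denominator: halving the effect you want to detect quadruples the sample you need, which is why scheduling concerns and statistics collide so often.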
Many have asked me if Etsy does bandit testing. The short answer is that we don't, and as far as I know nobody is seriously considering changing that anytime soon. This has come up often enough that I should write down my reasoning.
First, let me be explicit about terminology. When we do tests at Etsy, they work like this:
We have a fixed number of treatments that might be shown.
We assign the weighting of the treatments at the outset of the test, and we don't change them.
We pick a sample size ahead of time that makes us likely to notice differences of consequential magnitude.
In addressing "bandit testing," I'm referring to any strategy that might involve adaptively re-weighting an ongoing test or keeping experiments running for indefinitely long periods of time.
Noel Welsh at Untyped has written a high-level overview of bandit testing, here. It's a reasonable introduction to the concept and the problems it addresses, although I consider the benefits to be more elusive than it suggests. "It is well known in the academic community that A/B testing is significantly sub-optimal," it says, and I have no reason to doubt that this is true. But as I hope to explain, the domain in which this definition of "sub-optimal" applies is narrowly constrained.
Gremlins: The Ancient Enemy
At Etsy, we practice continuous deployment. We don't do releases in the classical sense of the word. Instead, we push code live a few dozen lines at a time. When we build a replacement for something, it lives beside its antecedent in production code until it's finished. And when the replacement is ready, we flip a switch to make it live. Cutting and pasting an entire class, making some small modifications, and then ramping it up is not unheard of at all. Actually, it's standard practice, the aesthetics of the situation be damned.
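The "switch" here is nothing exotic. In spirit it looks like the sketch below; the names and the config mechanism are invented for illustration, not Etsy's actual code:

```python
def render_listing_page_v1(listing):
    return f"<old layout for {listing}>"

def render_listing_page_v2(listing):
    return f"<new layout for {listing}>"

# Flipped to True at cutover, via a one-line config change.
config = {"use_new_listing_page": False}

def render_listing_page(listing):
    # Both implementations live side by side in production until the
    # replacement is finished; the config flag decides which one runs.
    if config["use_new_listing_page"]:
        return render_listing_page_v2(listing)
    return render_listing_page_v1(listing)

print(render_listing_page("wool-scarf"))
```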
This methodology is occasionally attacked for its tendency to leave bits of dead code lying around. I think that this criticism is unfair. We do eventually excise dead code, thank you. And all other methods of operating a consumer website are inferior. That said, if you twist my arm and promise to quote me anonymously I will concede that yes, we do have a pretty epic pile of dead code at this point. I'm fine with this, but it's there.
My experience here has revealed what I take to be a fundamental law of nature. Given time, the code in the "off" branch no longer works. Errors in a feature ramped up to small percentages of traffic also have a way of passing unnoticed. For practitioners of continuous deployment, production traffic is the lifeblood of working code. Its denial is quickly mortal.
This relates to the discussion at hand in that bandit testing will ramp the losers of experiments down on its own, and keep them around at low volume indefinitely. The end result is a philosophical conundrum, of sorts. Are the losers of experiments losing because they are broken, or are they broken because they are losing?
The beauty of Etsy's A/B testing infrastructure lies in its simplicity.
Experiments are initiated with minimal modifications of our config file.
Visitors are bucketed based on a single, fixed-width value in a persistent cookie.
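Etsy's actual implementation isn't reproduced here, but deterministic hash-based bucketing along these lines is the standard way to get stable assignment out of nothing but a cookie. A minimal sketch, with invented names:

```python
import hashlib

def variant(cookie_id, test_name, weights):
    """Deterministically assign a visitor to a test variant.

    The visitor's persistent cookie ID is hashed together with the test
    name, reduced to a number in [0, 100), and mapped onto the cumulative
    variant weights. The same visitor always lands in the same variant,
    with no server-side state."""
    digest = hashlib.md5(f"{test_name}:{cookie_id}".encode()).hexdigest()
    slot = int(digest, 16) % 100
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if slot < cumulative:
            return name
    return "control"  # weights should sum to 100; fall back to control

# A fixed 90/10 split for the lifetime of the experiment:
print(variant("visitor-cookie-abc123", "header_redesign",
              {"control": 90, "new_header": 10}))
```

Note that assignment is a pure function of the cookie and the weights. There is no stored state anywhere, which is what makes this cheap; it is also why changing the weights mid-test silently reassigns visitors, a problem I'll return to below.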
One of the advantages of this parsimony is that new tests are "free," at least in the engineering sense of the word. They're not free if we are measuring the mass of their combined cognitive overhead. But they are free in that there are no capacity implications of running even hundreds of experiments at once. This is an ideal setup for those of us who maintain that the measurement of our releases ought to be the norm.
Bandit testing upsets this situation in an insidious way. As I explained above, once we weight our tests we don't tweak the proportions later. The reason for this is to maintain the consistency of what visitors are seeing.
Imagine the flow of traffic immediately before and after the initiation of an experiment on Etsy's header. For visitors destined for the new treatment, at first the header looks as it has for several years. Then on their next request, it changes without warning. Should we attribute the behavior of that visitor to the old header or to the new one? Reconciling this is difficult, and in our case we dodge it by throwing out visits that have switched buckets. (We are not even this precise. We just throw out the entire first day of the experiment's data.)
Bandit testing, in adjusting weights much more aggressively, exacerbates this issue. We would be forced to deal with it in one way or another.
We could try to establish a rule for what to do with visits that see inconsistent behavior. A universally-applicable heuristic for this is not straightforward. And even if feasible, this approach would necessitate making the analysis more complicated. Increasing complexity in analysis increases the likelihood of it being incorrect.
We could continue to ignore visits that see inconsistent behavior. Depending on specifics, this could discard a large amount of data. This decreases the power of the experiment, and undermines its ability to reach a correct conclusion.
We could attempt to ensure that visits only ever see one treatment, while re-weighting the test for fresh visitors. This sounds like a great idea, but it ruins the notion of tests as an operational free lunch. Test variant membership, for Etsy, is recomputed independently on every web request, with no stored state. Introducing state brings tradeoffs that developers should be familiar with. We could keep test membership in a larger cookie, but if the cookie gets too large it will increase the number of packets necessary for user requests. We could record test membership on the server, but we would have to build, maintain, and scale that infrastructure. And every time we added an experiment, we would have to ask ourselves if it was really worth the overhead.
On the Ridiculous Expectations of Runaway Victory
When we release any new feature, it is our hope that it will be a gigantic and undeniable success. Sadly (and as I have discussed at length before), this is practically never what happens. Successful launches are almost always characterized by an incremental improvement in some metric other than purchases or registrations.
Wins in terms of purchases do happen occasionally, and they make life considerably more bearable when they do. But they're exceedingly rare. What is not rare is the experience of releasing something that makes purchase conversion worse. This turns out to be very easy in an annoyingly asymmetrical way.
What we are usually aiming for with our releases is tactical progress on our longer-term strategic goals. Modest gains or even just "not extremely broken" is what we can rationally hope for. Given this background, bandit testing would be wildly inappropriate.
Regret Approaches Zero
Let me point out something that may not be obvious: when we test features on Etsy, we are not typically testing the equivalent of banner advertisements with a limited shelf life. Not that I am suggesting that there is anything wrong with doing so. Nor do I think this is the only scenario in which bandit testing is called for.
But new features and the redesign of existing features are different in several important ways. The unlikelihood of purchase or registration conversion wins means that "regret" in the vernacular sense is minimal to begin with, obviating the need for an algorithm that minimizes regret in the technical sense. And the fact that we are building features for the longer term implies that any regret accumulated during the course of an experiment is minor from the perspective of all history. From this vantage point, the elegant simplicity of not doing bandit testing wins out.
Is bandit testing right for you? I believe it is a question worth asking. It may be the case that you should (to borrow Noel's imagery) "join their merry band." And if so, master, be one of them; it is an honourable kind of thievery.
In the absence of practical constraints, I have no argument against this. But reality is never lacking in practical constraints.
My first several years out of college were spent building a financial data website. The product, and the company, were run by salesmen. Subscribers paid tens of thousands of dollars per seat to use our software. That entitled them to on-site training and, in some cases, direct input on product decisions. We did giant releases that often required years to complete, and one by one we were ground to bits by long stretches of hundred-hour weeks.
Whatever I might think of this as a worthwhile human endeavor generally, as a business model it was on solid footing. And experimental rigor belonged nowhere near it. For one thing, design was completely beside the point: in most cases the users and those making the purchasing decisions weren't the same people. Purchases were determined by a comparison of our feature set to a competitor's. The price point implied that training in person would smooth over any usability issues. Eventually, I freaked out and moved to Brooklyn.
When I got to Etsy in 2007, experimentation wasn't something that was done. Although I had some awareness that the consumer web is a different animal, the degree to which this is true was lost on me at the time. So when I found the development model to be the same, I wasn't appropriately surprised. In retrospect, I still wouldn't rank waterfall methodology (with its inherent lack of iteration and measurement) in the top twenty strangest things happening at Etsy in the early days. So it would be really out of place to fault anyone for it.
So anyway, in my first few years at Etsy the releases went as follows. We would plan something ambitious. We'd spend a lot of time (generally way too long, but that's another story) building that thing (or some random other thing; again, another story). Eventually it'd be released. We'd talk about the release in our all-hands meeting, at which point there would be applause. We'd move on to other things. Etsy would do generally well, more than doubling in sales year over year. And then after about two years or so we would turn off that feature. And nothing bad would happen.
Some discussion about why this was possible is warranted. The short answer is that this could happen because Etsy's growth was an externality. This is still true today, in 2013. We have somewhere north of 800,000 sellers, thousands of whom are probably attending craft fairs as we speak and promoting themselves. And also, our site. We're lucky, but any site experiencing growth is probably in a similar situation: there's a core feature set that is working for you. Cool. This subsidizes anything else you wish to do, and if you aren't thinking about things very hard you will attribute the growth to whatever you did most recently. It's easy to declare yourself to be a genius in this situation and call it a day. The status quo in our working lives is to confuse effort with progress.
But I had stuck around at Etsy long enough to see behind the curtain. Eventually, the support tickets for celebrated features would reach critical mass, and someone would try to figure out if they were even worth the time. For a shockingly large percentage, the answer to this was "no." And usually, I had something to do with those features.
I had cut my teeth at one job that I considered to be meaningless. And although I viewed Etsy's work to be extremely meaningful, as I still do, I couldn't suppress the idea that I wasn't making the most of my labor. Even if the situation allowed for it, I did not want to be deluded about the importance and the effectiveness of my life's work.
Measurement is the way out of this. When growth is an externality, controlled experiments are the only way to distinguish a good release from a bad one. But to measure is to risk overturning the apple cart: it introduces the possibility of work being acknowledged as a regrettable waste of time. (Some personalities you may encounter will not want to test purely for this reason. But not, in my experience, the kind of personalities that wind up being engineers.)
Through my own experimentation, I have uncovered a secret that makes this confrontation palatable. Here it is: nearly everything fails. As I have measured the features I've built, it's been humbling to realize how rare it is for them to succeed on the first attempt. I strongly suspect that this experience is universal, but it is not universally recognized or acknowledged.
If someone claims success without having measured it in an experiment, odds are pretty good that they are mistaken. Experimentation is the only way to separate reality from the noise, and to learn. And the only way to make progress is to incorporate the presumption of failure into the process.
Don't spend six months building something if you can divide it into smaller, measurable pieces. The six month version will probably fail. Because everything fails. When it does, you will have six months of changes to untangle if you want to determine which parts work and which parts don't. Small steps that are validated not to fail and that build on one another are the best way, short of luck, to actually accomplish our highest ambitions.
To paraphrase Marx: the demand to give up illusions is the demand to give up the conditions that require illusions. I don't ask people to test because I want them to see how badly they are failing. I ask them to test so that they can stop failing.
In my post about real-time analysis I shared a screenshot of part of Etsy's deployment dashboard. This is the dashboard that every engineer watches as he or she pushes code to production. A bunch of alert readers noticed some odd things about it:
The screenshot is not doctored, so yes we do graph "Three-Armed Sweaters" and "Screwed Users." I can explain. In fact, I can give you excruciating detail about it, if you're interested! Here goes.
"Three-Armed Sweaters" refers to our error pages, which feature one of my favorite drawings in the world. It was done by Anda Corrie:
So the graph on the dashboard is just counting the number of times this page is shown. But in order to reduce the frequency of false alarms, the graph is actually based on requests to an image beacon hidden on the page. This excludes most crawlers and vulnerability scanners. Those constituencies have a habit of generating thousands of errors when nothing is malfunctioning. But lucky for us, they almost never waste bandwidth on images.
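Mechanically, a beacon is just a tiny image whose requests get counted. Here's a sketch of the idea, assuming Flask and the statsd client package; Etsy's actual stack differs:

```python
from flask import Flask, Response
import statsd

app = Flask(__name__)
stats = statsd.StatsClient("localhost", 8125)

# A 1x1 transparent GIF, the traditional beacon payload.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
         b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
         b"\x00\x00\x02\x02D\x01\x00;")

@app.route("/beacon/sweater.gif")
def sweater_beacon():
    # Only clients that actually render the error page fetch this image,
    # which filters out most crawlers and vulnerability scanners.
    stats.incr("errors.three_armed_sweaters")
    return Response(PIXEL, mimetype="image/gif")
```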
Now, there are many reasons why Etsy might not be working, and they don't all result in our machines serving a sweater page. If our CDN provider can't reach our production network, it will show an error page of its own instead. In these cases, our infrastructure may not even be seeing the requests. But we can still graph these errors by situating their image beacon on a wholly separate set of web machines.
The "screwed users" graph is the union of all of these conditions. So-called, presumably, because all of this nuance is relatively meaningless to outsiders. "Screwed users" also attempts to only count unique visitors over a trailing interval. This has the nice property of causing the screwed users and sweaters graphs to diverge in the event that a single person is generating a lot of errors. The internet, after all, is full of weird people who occasionally do weird things with scripts and browsers.
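Counting uniques over a trailing interval can be done naively by remembering when each visitor last hit an error. A toy sketch with an invented class; a production version would want bounded memory (a HyperLogLog per time bucket, say):

```python
import time

class TrailingUniques:
    """Count distinct visitors seen within the last `window` seconds."""
    def __init__(self, window=300):
        self.window = window
        self.last_seen = {}

    def record(self, visitor_id, now=None):
        self.last_seen[visitor_id] = now or time.time()

    def count(self, now=None):
        now = now or time.time()
        # Drop visitors outside the window, then count the rest.
        self.last_seen = {v: t for v, t in self.last_seen.items()
                          if now - t <= self.window}
        return len(self.last_seen)

uniques = TrailingUniques(window=300)
uniques.record("visitor-a")
uniques.record("visitor-a")
uniques.record("visitor-b")
print(uniques.count())  # 2: repeat errors from one weird visitor count once
```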
You now know exactly as much as I do about the graphing of web errors in real time. I assume that this is a tiny fraction of the world's total knowledge pertaining to the graphing of web errors in real time. So you would be ill-advised to claim expert status on the basis of grasping everything I have explained here.
By the way, most of the software Etsy uses to produce these graphs is freely available. Here's StatsD and Logster.
Homer: There's three ways to do things. The right way, the wrong way, and the Max Power way!
Bart: Isn't that the wrong way?
Homer: Yeah. But faster!
- "Homer to the Max"
Every few months, I try to talk someone down from building a real-time product analytics system. When I'm lucky, I can get to them early.
The turnaround time for most of the web analysis done at Etsy is at least 24 hours. This is a ranking source of grousing. Decreasing this interval is periodically raised as a priority, either by engineers itching for a challenge or by others hoping to make decisions more rapidly. There are companies out there selling instant usage numbers, so why can't we have them?
Here's an excerpt from a manifesto demanding the construction of such a system. This was written several years ago by an otherwise brilliant individual, whom I respect. I have made a few omissions for brevity.
We believe in...
Timeliness. I want the data to be at most 5 minutes old. So this is a near-real-time system.
Comprehensiveness. No sampling. Complete data sets.
Accuracy (how precise the data is). Everything should be accurate.
Accessibility. Getting to meaningful data in Google Analytics is awful. To start with it's all 12 - 24 hours old, and this is a huge impediment to insight & action.
Performance. Most reports / dashboards should render in under 5 seconds.
Durability. Keep all stats for all time. I know this can get rather tough, but it's just text.
The 23-year-old programmer inside of me is salivating at the idea of building this. The burned out 27-year-old programmer inside of me is busy writing an email about how all of these demands, taken together, probably violate the CAP theorem somehow and also, hey, did you know that accuracy and precision are different?
But the 33-year-old programmer (who has long since beaten those demons into a bloody submission) sees the difficulty as irrelevant at best. Real-time analytics are undesirable. While there are many things wrong with our infrastructure, I would argue that the waiting is not one of those things.
Engineers might find this assertion more puzzling than most. I am sympathetic to this mindset, and I can understand why engineers are predisposed to see instantaneous A/B statistics as self-evidently positive. We monitor everything about our site in real time. Real-time metrics and graphing are the key to deploying 40 times daily with relative impunity. Measure anything, measure everything!
This line of thinking is a trap. It's important to divorce the concepts of operational metrics and product analytics. Confusing how we do things with how we decide which things to do is a fatal mistake.
So what is it that makes product analysis different? Well, there are many ways to screw yourself with real-time analytics. I will endeavor to list a few.
The first and most fundamental way is to disregard statistical significance testing entirely. This is a rookie mistake, but it's one that's made all of the time. Let's say you're testing a text change for a link on your website. Being an impatient person, you decide to do this over the course of an hour. You observe that 20 people in bucket A clicked, but 30 in bucket B clicked. Satisfied, and eager to move on, you choose bucket B. There are probably thousands of people doing this right now, and they're getting away with it.
This is a mistake because there's no measurement of how likely it is that the observation (20 clicks vs. 30 clicks) was due to chance. Suppose that we weren't measuring text on hyperlinks, but instead we were measuring two quarters to see if there was any difference between the two when flipped. As we flip, we could see a large gap between the number of heads received with either quarter. But since we're talking about quarters, it's more natural to suspect that that difference might be due to chance. Significance testing lets us ascertain how likely it is that this is the case.
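To make that concrete, here is the significance test the impatient person skipped. The story doesn't say how many visitors saw each bucket, so assume 1,000 apiece; the conclusion is the point, not the exact numbers:

```python
from scipy.stats import chi2_contingency

# Assumed exposure counts: 1,000 visitors per bucket (a guess, for illustration).
clicked = [20, 30]
not_clicked = [1000 - 20, 1000 - 30]

chi2, p, dof, expected = chi2_contingency([clicked, not_clicked])
print(p)  # well above 0.05: a 20-vs-30 split is entirely consistent with chance
```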
A subtler error is to do significance testing, but to halt the experiment as soon as significance is measured. This is always a bad idea, and the problem is exacerbated by trying to make decisions far too quickly. Funny business with timeframes can coerce most A/B tests into statistical significance.
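The peeking problem is easy to demonstrate in simulation. Below, both buckets convert at identical rates by construction, yet stopping the first time p dips under 0.05 declares a winner far more often than the nominal 5%. The rates and sizes here are arbitrary:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
DAYS, VISITORS_PER_DAY, RATE = 14, 1000, 0.02
TRIALS = 1000

stopped_early = 0
for _ in range(TRIALS):
    # Both buckets convert at exactly the same rate: the null is true.
    a = rng.binomial(VISITORS_PER_DAY, RATE, DAYS).cumsum()
    b = rng.binomial(VISITORS_PER_DAY, RATE, DAYS).cumsum()
    n = VISITORS_PER_DAY * np.arange(1, DAYS + 1)
    for day in range(DAYS):
        table = [[a[day], b[day]], [n[day] - a[day], n[day] - b[day]]]
        if chi2_contingency(table)[1] < 0.05:
            stopped_early += 1  # "we have a winner!" (we don't)
            break

print(stopped_early / TRIALS)  # far above the nominal 5% false positive rate
```

Exactly how inflated the error rate gets depends on how often you peek, but it only goes up.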
Depending on the change that's being made, making any decision based on a single day of data could be ill-conceived. Even if you think you have plenty of data, it's not farfetched to imagine that user behavior has its own rhythms. A conspicuous (if basic) example of this is that Etsy sees 30% more orders on Tuesdays than it does on Sundays.
While the sale count itself might not skew a random test, user demographics could be different day over day. Or very likely, you could see a major difference in user behavior immediately upon releasing a change, only to watch it evaporate as users learn to use new functionality. Given all of these concerns, the conservative and reasonable stance is to only consider tests that last a few days or more.
One could certainly have a real-time analytics system without making any of these mistakes. (To be clear, I find this unlikely. Idle hands stoked by a stream of numbers are the devil's playthings.) But unless the intention is to make decisions with this data, one might wonder what the purpose of such a system could possibly be. Wasting the effort to erect complexity for which there is no use case is perhaps the worst of all of these possible pitfalls.
For all of these reasons I've come to view delayed analytics as positive. The turnaround time also imposes a welcome pressure on experimental design. People are more likely to think carefully about how their controls work and how they set up their measurements when there's no promise of immediate feedback.
Real-time web analytics is a seductive concept. It appeals to our desire for instant gratification. But the truth is that there are very few product decisions that can be made in real time, if there are any at all. Analysis is difficult enough already, without attempting to do it at speed.
If there's anything people are good at, it's retrofitting events with a coherent narrative. Even, or maybe especially, when the ultimate causes of past events are the forces of chance. If you look closely you can see this everywhere.
Ancient examples abound. In Moralia, Plutarch relates an incident in which a vestal virgin was struck by lightning and killed. Although this is the definition of randomness to our ears, to the Romans this was an event of profound significance. Clearly, some sin of the surviving virgins was yet uncovered, and in time several were duly convicted of illicit offenses. Punishment for this, by the way, usually involved the offenders being buried alive. Not quite confident that they had set the universe right again, the Romans consulted their Sibylline books (for a contemporary analogue, consider Nostradamus or the Bible Code). Having done this, for good measure, they elected to also bury two live Gauls and two live Greeks.
The modern reader might be tempted to chuckle at this display of superstitious lunacy. But I don't think we should be so confident that we've conquered this species of irrationality. It would be much too depressing for me to go on at length about the case of Cameron Todd Willingham. "Comedy equals tragedy plus time," they say, and while ancient cruelty fits the bill, Texan cruelty doesn't.
So let me hastily change gears and point out that a large percentage of CNBC programming is a more benign manifestation of this kind of thinking. The stock market never merely fluctuates up or down on a given day; it sinks or rallies for a reason. And that reason invariably involves the actions of heroes and villains. One article pulled out of a hat reveals arcane symbology and slapstick levels of claimed precision:
A classic bearish head and shoulders pattern seems to suggest that further declines may be ahead. For this scenario to occur the Dow Jones will need to see a break below 12,471 to confirm we are heading lower.
And if you want another example, go check the Facebook profile of every ostensibly-atheist Brooklyn asshole you know the next time Mercury is in retrograde.
There is a school of thought in biology (don't ask me how widely this is accepted) that evolution favors Type I errors (false positives: seeing a pattern that isn't there) over Type II errors (false negatives: missing a pattern that is). The argument goes like this: while there's not much immediate consequence to believing that your dancing caused the rain, there is probably a lot of selection pressure working against animals that can't make the connection between hearing a rattle and being bitten by a snake. This eminently practical adaptation misfires in our present situation, and that makes us susceptible to the category of cognitive bias I've been describing. We are prone to find meaning in everything. Add a dash of salt, stir, let simmer for a millennium or two and you get the Catholic Church. Please note that I like to repeat this sermon at every opportunity, since it makes such a good story.
I am going somewhere with this, of course. For those of us attempting to do something resembling science on the web, a keen awareness of this tendency is crucial if we are to do our jobs well.
To illustrate what I mean, let me show you some real and recent Etsy data. I ran an A/B test on Etsy search pages for a while, and then I poked around to see what metrics had changed with 95% confidence or better. Here's what I found:
[Table: metrics with statistically significant changes between groups A and B, including "Followed another user" and "Viewed a listing."]
Given this, we can try to piece together what happened.
We can see that while fewer people in group A followed another user, many more signed up for Etsy. (This is a very interesting result, because we know that at least half of people register in order to purchase.) We also find that fewer people in group B ever managed to look at an item on Etsy. So the result seems to hint at a fundamental tradeoff between social features and commerce. We can encourage engaged users to follow each other, but this is a fringe activity, and it will put off people that show up on the site to buy things.
Right? No. Completely wrong. What I neglected to explain was that this A/B test does nothing. It randomly assigns visitors into group A or group B, and nothing else. Any changes that we observe in this A/B test are, by definition, due to chance. But that didn't stop me from telling you a story about the result that I think was at least a little bit convincing. Imagine how well I might have done if we were measuring a real feature.
The statistical mistake I've just made is to fast-talk my way past "95% confidence" and to hide data. If we're 95% confident of a change, then we're 5% unsure of it. Or to put it another way, about one null metric in twenty measured this way will look significant purely by chance. I had about a hundred metrics, and I cherrypicked four whose changes were calculated to be statistically significant.
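You can reproduce this effect with a few lines of simulation: run an A/A test over a hundred metrics that differ only by chance, and roughly five of them will clear the 95% bar anyway. The metric count and distributions below are invented:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
METRICS, VISITORS = 100, 5000

significant = 0
for _ in range(METRICS):
    # Both groups are drawn from the same distribution: the test does nothing.
    group_a = rng.normal(0, 1, VISITORS)
    group_b = rng.normal(0, 1, VISITORS)
    if ttest_ind(group_a, group_b).pvalue < 0.05:
        significant += 1

print(significant)  # about 5 of 100 null metrics look "significant"
```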
From there, the flaws in my own wetware take over and explain why this data shows that the virgin was struck by lightning because the goddess of the hearth is pissed about unchaste behavior within her sacred temple, and in order to rectify this situation we have to throw some hapless barbarians in a pit. Obviously.
Binding Oneself to the Mast
So to bring this meditation to a close, let's reflect on how we can reduce our chances of making these mistakes. Well, first, we have to realize that by our very nature we are struggling against it every day. Check. And when our enemy is ourselves, one of the most effective countermeasures known is a Ulysses contract.
After investing some time into building a feature and testing it, we know that we're going to have some level of emotional involvement in it. We're only human. If we can identify specific tendencies that we'll have in our less sober moments, we can drag ourselves back to honest assessment. Kicking and screaming, if necessary.
We may be tempted to cherrypick numbers that look impressive and appeal to our biases.
We are prone to retrofit a story onto any numbers we have. And we will believe it, even if intellectually we understand that this is a common mistake.
We can address these problems with tooling and discipline.
First, we can avoid building tooling that enables fishing expeditions. Etsy's AB analyzer tool forces you to reveal each metric that you think might be worth looking at one at a time. There's no "show me everything" button. Now, this may have arisen entirely by accident and laziness, but it was for the best.
Second, we can limit our post-hoc rationalization by explicitly constraining it before the experiment. Whenever we test a feature on Etsy, we begin the process by identifying metrics that we believe will change if we 1) understand what is happening and 2) get the effect we desire. A few months ago, we ran an experiment suggesting replacements for sold or unavailable items still in user carts.
The product manager wrote the following down at the time of launch:
We expect this experiment to reduce dead-ends on the site and to give cart visitors an avenue for finding replacements. We expect an increase in listings viewed, a decrease in exits from the cart page, and an increase in items added to carts.
Listing a combination of several expected observations is key, since every additional prediction reduces the odds that all of them will come true by random chance. If the experiment results in the expected changes, we can have a high degree of confidence that we understand what is happening.
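A back-of-the-envelope calculation shows why. If each metric independently has a 5% chance of a spurious significant change, and about half of those spurious changes would land in the predicted direction, then three correct predictions co-occurring by luck alone is roughly a one-in-64,000 event. The independence assumption is doing real work here:

```python
# Assumes the three predicted metrics are independent, which is generous.
p_spurious = 0.05                        # chance a null metric looks "significant"
p_predicted_direction = p_spurious / 2   # half of those move the predicted way
print(p_predicted_direction ** 3)        # ~1.6e-05, about 1 in 64,000
```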
Humans are at once flawed and remarkable animals. Much as we might imagine ourselves to be rational actors, we aren't. But we can erect frameworks in which we can compel ourselves to behave rationally.
In 2010ish, we tried to roll out a feature (Treasury) using MongoDB. It was an interesting experience. I learned quite a bit in the process. I wrote about what I was thinking at the time here. But for the most part it was an abject failure and Ryan Young wound up porting the entire thing to the MySQL shards which had come to maturity in the meantime.
Before you get too excited, the reason for the failure is probably not any of the ones you're imagining. Mainly it's this: adding another kind of production database was a huge waste of time.
If you want to make Mongo your only database, it might work out well for you. I can't personally say it will definitely work out. I know that there's plenty of talk on the internet about Mongo's running-with-scissors defaults and lack of single-server durability and rumors about data loss or what have you, but none of that ever affected us. Those concerns may or may not have merit; I personally have no experience with them.
But what I can say is that if you are considering Mongo plus another database like MySQL, then in all likelihood you shouldn't do it. The benefits of being schemaless are negated by the pain you will feel figuring out:
Monitoring and graphing (Ganglia integration, for one; see below).
Slow query optimization.
Probably like 50 other things Allspaw knows about that we developers don't have to care about.
For two databases. In practice, you will do this for one of your databases but not the other. The other one will be a ghetto.
Substitute "figure out" with "deploy" if you want, since I'm sure people now aren't starting from scratch on these points as we were in 2010. We were the first people in the world to attempt several of these bullet points, and that certainly didn't help. But regardless, deployment takes real effort. The mere fact that Ganglia integration for Mongo might already exist now doesn't mean that you will be handed a tested and working Mongo+Ganglia setup on a silver platter. Everything is significantly more settled in the MySQL world, but it didn't take us zero time or energy to get our MySQL sharding to where it is today.
Mongo tries to make certain things easier, and sometimes it succeeds, but in my experience these abstractions mostly leak. There is no panacea for your scaling problem. You still have to think about how to store your data so that you can get it out of the database. You still have to think about how to denormalize and how to index.
"Auto-sharding" or not, which is something I don't have direct experience with, you have to choose your sharding key correctly or you are screwed. Does your shard setup cluster the newest users onto a single shard? Congratulations, you just figured out how to send the majority of your queries to a single box.
Keep in mind that almost none of this is specific to MongoDB. I wouldn't discourage anyone from trying Mongo out if they're starting a new site, or if they're using it for some offline purpose where these kinds of concerns can be glossed over. But if you're trying to mix Mongo (or almost anything else) into an established site where it won't be your only database, and doesn't accomplish something really novel, you're probably wasting your time.
Does it ever make sense to add another kind of database? Sure, if the work it saves you outweighs all of the work I just described. For most purposes, it's pretty hard to argue that MySQL and Mongo are different enough for that to be true.