On May 5, 3:51 pm, Jay Ashworth <j...@baylink.com> wrote:
----- Original Message -----
From: "Ryan Malayter" <malay...@gmail.com> I like to bag on my developers for not knowing anything about the infrastructure, but sometimes you just can't do it right because of physics. Or you can't do it right without writing your own OS, networking stacks, file systems, etc., which means it is essentially "impossible" in the real world.
"Physics"?
Isn't that an entirely inadequate substitute for "desire"?
Not really. For some applications, it is physics: 1) You need two or more locations separated by say 500km for disaster protection (think Katrina, or Japan Tsunami). 2) Those two locations need to be 100% consistent, with in-order "serializable" ACID semantics for a particular database entity. An example would be some sort of financial account - the order of transactions against that account must be such that an account cannot go below a certain value, and debits to and from different accounts must always happen together or not at all. The above implies a two-phase commit protocol. This, in turn, implies *at least* two network round-trips. Given a perfect dedicated fiber network and no switch/router/CPU/disk latency, this means at least 10.8 ms per transaction, or at most 92 transactions per second per affected database entity. The reality of real networks, disks, databases, and servers makes this perfect scenario unachievable - often by an order of magnitude. I don't have inside knowledge, but I suspect this is why Wall Street firms have DR sites across the river in New Jersey, rather than somewhere "safer". Amazon's EBS service is network-based block storage, with semantics similar to the financial account scenario: data writes to the volume must happen in-order at all replicas. Which is why EBS volumes cannot have a replica a great distance away from the primary. So any application which used the EBS abstraction for keeping consistent state were screwed during this Amazon outage. The fact that Amazon's availability zones were not, in fact, very isolated from each other for this particular failure scenario compounded the problem.