Are you working on a distributed application or client-server system that spans across the internet, across different timezones, branch offices, numerous different companies even?
Have you heard about the Eight Fallacies of Distributed Computing? You should have. There is a brief description of them on James Gosling’s blog as well as at Wikipedia.
Some time ago I found that chapter 17 of the “Software Architecture” book had decent explanations for each of the fallacies and could be used to help hammer the point home to those who just don’t get it.
I’m shamelessly lifting the entire first part of that chapter and putting it here for your viewing pleasure. (The entire book is online, and is recommended reading if you’re a bit-pushing code twiddler with delusions of creating enterprise systems.)
The network myths or the Eight Fallacies of Distributed Computing
Deutsch claims that everyone who first builds a distributed application makes the following assumptions, all of which turn out to be false.
The network is reliable
Usually, the network is reliable. It is, in modern terminology, up for a significant percentage of the time, usually measured as a percentage of recurring nines. If a network is up 99.99% of the time, it might appear to be highly adequate. The trouble is that 99.99% means the network will be down, on average, 8.64 seconds per day.
In itself, 8.64 seconds a day is quite manageable. Applications will generally wait far longer than that before timing out. Unfortunately, networks are usually up for months on end, and when the down time comes, it makes up for each of those 8.64 seconds in one hit. This means a straight run of three months will be followed by network down for 13 minutes. During those 13 minutes, the network managers will run around panicking and pressing buttons, the helpdesk phones will reach a mutual cacophony matched only by repeating proclamations from helpdesk employees that the network is, indeed, down.
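The arithmetic behind those figures is easy to check. The following is a minimal sketch, not part of the book's text; the function name `downtime_seconds` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds(uptime_percent, days=1.0):
    """Expected downtime in seconds over `days` for a given uptime percentage."""
    return (1 - uptime_percent / 100) * SECONDS_PER_DAY * days


per_day = downtime_seconds(99.99)
per_quarter = downtime_seconds(99.99, days=90)

print(f"{per_day:.2f} seconds per day")             # 8.64 seconds per day
print(f"{per_quarter / 60:.1f} minutes in 90 days")  # 13.0 minutes in 90 days
```

Saving up 8.64 seconds a day for 90 days gives 777.6 seconds, which is the 13 minutes quoted above.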
What terrible things happen during down time? Can you roll back the invoicing run and the bank update? Can those whose applications have hung get back to where they were before the dreaded down? Will they have to retype data they have not enjoyed typing once already? Will they be happier with their software systems?
Should you, as architect, demand 99.9999% uptime from the network people, even though each additional nine is increasing the cost of the network exponentially? Should you write your applications to be able to recover from a disappearing network, and if so, how long should they wait before doing something about a lost connection?
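One common answer to the second question is to retry with a growing delay rather than hang or give up at once. The sketch below is only an illustration under assumptions: the name `call_with_retry`, the four attempts and the half-second starting delay are all invented here, and the right values depend on your application.

```python
import time


def call_with_retry(operation, attempts=4, first_delay=0.5, sleep=time.sleep):
    """Run operation(); on ConnectionError wait and retry, doubling the wait each time."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)  # wait before trying again
            delay *= 2    # back off: 0.5s, 1s, 2s, ...
```

With the defaults, the caller waits at most 0.5 + 1 + 2 = 3.5 seconds in total before the error is raised. Retrying is only safe when the operation can be repeated without harm; something like an invoicing run needs more than a retry loop.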
Latency is zero
Latency can be thought of simply as the delay between a request and a reply. Across wide area networks, it is most often measured by a ping, or sending a small packet on a round trip to and from a remote server. A typical ping to the opposite side of the world across the internet on ground based lines takes around 250ms. Straighter wires or dedicated satellites may improve this number slightly, but until we master entanglement the measure is not likely to improve significantly.
Entangled particles are the result of quantum research. If two particles are entangled, then what happens to one, happens to the other instantaneously regardless of the distance between them. Entanglement will allow us to communicate across the galaxy without waiting for the traditional delays associated with speed of light transmission.
Entanglement is how those who have conquered space communicate. The SETIs are wasting their time looking for radio waves.
The causes of latency are:
1. Switches and bridges
Even within the local network, devices with store and forward mechanisms will slightly delay data packets.
2. Routers
Routers change header information. To do so, they must store, manipulate and then forward. A router is a gateway; there may be other gateways slowing the packet before it reaches the internet.
3. Transmission medium
Copper cannot transmit quite as fast as fibre-optics. Fat fibre-optics cannot transmit as fast as thin fibre-optics and require more signal boosting stations, each of which introduces delay.
4. Packet size
The larger the packet, the longer it takes to arrive. Smaller packets might arrive quicker, but you may need more of them. More than one packet usually takes longer to arrive than one packet. How do you recommend and control packet sizes when they are typically between one and four kilobytes?
5. Propagation
The speed of light is finite. A trip to the other side of the world and back at the speed of light on straight, ground based lines, takes 100ms. Propagation alone can have quite an effect on a user in Hong Kong accessing a US web application.
Time in seconds    What’s happening
0                  User clicks link. Request travels to server.
0.1                Server receives request, builds page and sends out first packet.
0.2                First packet arrives. User’s computer sends out an acknowledgement.
0.3                Acknowledgement arrives, next packet is sent out.
0.4                Packet arrives at user, acknowledgement is sent.
0.5                Acknowledgement is received.
6.5                30 more packets are sent, each of them requiring an acknowledgement.
This simplistic approach assumes an immediate response from the user and server machines, and no other network delays. The propagation delay means the Hong Kong user will have to wait 6.5 seconds for 32 packets whereas the US users will receive them instantaneously.
In reality, the other latency and network effects magnify the propagation latency, and the Hong Kong user is more likely to wait far longer than 6.5 seconds for 32 packets of data.
6. Firewalls
The firewall stores, analyses and forwards. It analyses for malicious content, and receives many more hits than it lets through, so is not fully employed speeding through your packets.
7. Virus checkers
A virus checker has to scan through each incoming chunk of information and compare it to the signatures of thousands of viruses. This takes time.
Latency often escapes notice because it is not usually present in development, test or demonstration environments. Only when an application rolls out to live does latency begin to take effect.
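The 6.5 second figure from the propagation example above can be reproduced in a few lines. This is a minimal sketch using the same simplifications as the text (0.1 seconds per one-way trip, one acknowledgement per packet, no processing time); the function name `stop_and_wait_seconds` is made up for illustration.

```python
def stop_and_wait_seconds(packets, one_way_delay):
    """Time until the last acknowledgement reaches the server.

    One leg for the initial request, then for every packet one leg
    for the data and one leg for its acknowledgement.
    """
    request = one_way_delay
    per_packet = 2 * one_way_delay
    return request + packets * per_packet


hong_kong = stop_and_wait_seconds(packets=32, one_way_delay=0.1)
print(f"{hong_kong:.1f} seconds")  # 6.5 seconds
```

The wait grows linearly with both the number of packets and the one-way delay, which is why it goes unnoticed on a local network where the delay is close to zero.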
Bandwidth is infinite
The bandwidth of your network may be 100Mbit or 1Gbit on paper, but when it comes down to moving messages around, you simply cannot use all of it.
Data must be divided up into packets, and each packet has a header and a footer, each taking up part of your bandwidth. With the size of acknowledgements, and the collision detection/retransmit nature of most networks, you will be lucky to get 70Mbit/s of pure data through a 100Mbit network. This is the first problem.
The second problem is that if you are streaming data with a specific minimum data rate, you must allow for an overlap between incoming packets, and played or displayed packets. You must ensure that packets are transmitted fast and frequently enough not to break the stream. It’s no good stopping a video or audio stream on a client while the next packet comes in. You will frustrate your viewer or listener beyond their ability to view and listen.
Streaming does not require acknowledgements, but internet communication usually requires one acknowledgement per packet. This can be changed, so that a number of packets can be sent out per acknowledgement, which reduces a few latency delays.
However, if we use fewer acknowledgements, there is more traffic as packets are fired off in the hope, rather than the semi-guarantee, of being received. More traffic means there is less available bandwidth.
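To get a feel for how much sending several packets per acknowledgement can save, here is a rough sketch. It assumes 32 packets and a 0.1 second one-way delay, and it further assumes that all packets in a batch leave back to back with negligible transmission time and that nothing is lost. The function name `windowed_transfer_seconds` is invented for this illustration.

```python
import math


def windowed_transfer_seconds(packets, one_way_delay, packets_per_ack):
    """Request leg, then one data leg plus one acknowledgement leg per batch."""
    batches = math.ceil(packets / packets_per_ack)
    return one_way_delay + batches * 2 * one_way_delay


for window in (1, 4, 8, 32):
    seconds = windowed_transfer_seconds(32, 0.1, window)
    print(f"{window:2d} packets per acknowledgement: {seconds:.1f} seconds")
```

The sketch ignores the cost described above: when a batch is lost, all of it may have to be sent again, and that extra traffic eats into the available bandwidth.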
The network is secure
There is always a way to get into your network and onto your servers. It can be done by technological know-how, persistence and by detective work.
Technological know-how will exploit a security hole in your operating system and/or software.
Persistence means looping through endless variations of access such as login names and passwords until access is granted. L0phtCrack is one of the best known tools for persistence supported by data.
Finally, detective work will look in notepads, gather social data, search through rubbish, observe building entry and egress patterns and get into the building with a username and password.
Topology doesn’t change
A piece of software might work fine one day, then not the next, because the bandwidth it requires to do its work has been reduced by a topology change. Suppose the Hong Kong office has been reduced in size and much of their work moved to the Beijing office. To avoid changing the Beijing line, Hong Kong is routed through Sydney as this is the cheapest option. Sydney has a slower line, and a different daily usage profile than the Hong Kong Office. Something has stopped working…
There is one administrator
Over the course of its life, a network will have many administrators. Each one will have their own way of doing things. If your software comes without an ikiwisi wysiwyg setup and relies on some obscure administrative tweak, then it is absolutely guaranteed that one day it will stop working.
The downtime while the magic tweak is found will quickly erode the availability below 99.99%.
Transport cost is zero
It costs a great deal of money to move information around the world. Bandwidth may be getting cheaper, but that just means we use more of it. Thus, we maintain the cost of running our applications.
Software designed with cost in mind must take into account operational costs, which are often far higher than development costs, and that includes large and/or increasing data rates.
The network is homogeneous
Most large organisations have non-homogeneous networks. They may use Windows and Linux desktops talking to IBM or Unix servers. Every step of non-homogeneity is a headache for someone. If your applications must work across these networks, then they will rely on both ends of the network being up, and the bridge in between them being up. If each of the networks and the bridge have individual uptimes of 99.99%, the total uptime for one application being able to talk to its server through both networks and across the bridge is 99.99% x 99.99% x 99.99%. Suddenly, your network is only 99.97% reliable. A continuous up of only one month will mean a down of 13 minutes.
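The compound figure can be checked by multiplying the individual uptimes. This is a minimal sketch that, like the text, assumes the three parts fail independently and that a month is 30 days; the function name `serial_availability` is made up for illustration.

```python
def serial_availability(*availabilities):
    """Availability of a chain in which every link must be up at once."""
    total = 1.0
    for availability in availabilities:
        total *= availability
    return total


combined = serial_availability(0.9999, 0.9999, 0.9999)
downtime_minutes = (1 - combined) * 30 * 24 * 60

print(f"{combined * 100:.2f}% combined uptime")        # 99.97% combined uptime
print(f"{downtime_minutes:.1f} minutes down per month")  # 13.0 minutes down per month
```

Each extra component in the path multiplies in another factor below one, so the chain is always less reliable than its weakest link.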
Here it is in a shorter, more easily digestible audio form:
Laws of Physics.mp3
Bandwidth is futile. You will be queued.
trever
February 27th, 2006