I have a home office underneath an addition we made to the house five or so years ago. My office is on the far side of the basement, beyond the boiler and generally behind a closed door. When I’m in my office, especially when I’m working, I like to play music. The primary consequence of this ever-so-far-away office beyond the sound wall of the boiler is isolation from the noise upstairs (screaming kids, etc.). As you can imagine, this is really good for getting work done [computer programming and systems analysis is what I do]. However, my wife failed to see the benefit of this setup at times, because it was very inconvenient when she wanted to ask me a quick question. She would have to yell down the stairs and hope I’d hear her (I never did if the boiler was running or if I was playing music), come downstairs, or send a kid down (if they weren’t at school). Yelling didn’t work out because I couldn’t hear her, or she couldn’t hear what I yelled back unless I got up and went to the door of my office (which was aggravating because I was interrupted, I had to get up and I was yelling). Going up and down the stairs is a pain for quick questions. Sending the kids to relay messages can work for simple things, but can quickly devolve into a not-so-fun telephone game. The answer seemed obvious. “Let’s get an intercom system!” I said. So, I went down to RadioShack (when they were still a thing) and bought an intercom system with three intercom units. Problem solved, right? Sort of.
Fast-forward a few years and we only have two functioning intercoms left: one in my office and one in the kitchen. They work well for the most part, but they are not full-duplex and sometimes get flaky or don’t work at all (solar flares? gremlins?). While certainly more convenient, the intercoms have proven to be an unreliable means of communication. For example, not too long ago my wife asked me a question over the intercom: “Did you turn off [unintelligible]?” “What?!” I replied. “Did you turn off the heat trace?” “Yes,” I replied. No response. “Yes, I did,” I said again, expecting our traditional “OK” in return, but still getting no response. “Hello, can you hear me?” I said to the intercom in vain. Again, no response. “Is this thing working?” I said to myself. Agitated, I literally ran upstairs and said to my wife, “Yes, I turned off the heat trace! Is the intercom working? Could you hear me?” We had an unspoken protocol of replying with “OK” to everything the other person said [though even in face-to-face conversations this would be considered odd] because the intercoms were an unreliable means of communication. It is not unlike the “radio talk” (or so we called it) system I learned while in the Army, which exists because radio communication is unreliable (especially on the battlefield). She had heard me, but had walked away after hearing my initial “Yes,” without respecting our unspoken protocol of replying so that I would know she had heard my answer to her question.
All this thinking about communication came as a result of an issue I had while developing a program for a client. The program needed to communicate with a web service that was run by a third party. The program I wrote, using the Microsoft .NET Framework, automated a time-consuming manual process of consuming some inventory. My program would send the details (what, why, etc.) of the inventory item to be consumed in a “consume inventory request” XML message to the web service. The web service should respond with a “return message” (also in XML format) that would either indicate success or give a reason why the request was rejected (i.e., an error message). Once in a great while, the web service would fail to respond and an exception would be thrown by the .NET Framework after a timeout period. This wasn’t unexpected, so my program would catch the exception, log it and try again later. Normally that would work just fine, except that often the web service was getting the “consume inventory request” and acting on it, but the return message was apparently getting lost. This meant that inventory would get double consumed if my program simply tried again later. I dug into the .NET Framework documentation and found a setting that would allow me to increase the amount of time my program would wait for a response. Increasing the timeout helped mitigate the problem, but didn’t fix it. I found a point after which increasing the timeout had no effect, because another internal maximum in the .NET Framework had been reached, so the request still timed out before the limit I had set [this is a good example of a “leaky abstraction”, not to go off on a tangent. Go ahead and follow that link, it’s an interesting and related topic. I’ll wait here for you].
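To make the double-consumption failure concrete, here is a minimal sketch in Python (not the .NET code from the project). Everything in it is made up for illustration: the `FakeInventoryService` class, its `drop_next_response` flag and the `consume_with_retry` helper are hypothetical stand-ins, and the real XML messages are left out.

```python
class FakeInventoryService:
    """Hypothetical stand-in for the third-party web service."""

    def __init__(self, stock, drop_first_response=True):
        self.stock = stock
        self.drop_next_response = drop_first_response

    def consume(self, quantity):
        # The service acts on the request first...
        self.stock -= quantity
        # ...but the reply never reaches the client, which sees a timeout.
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("no return message received")
        return "success"


def consume_with_retry(service, quantity, attempts=2):
    """Naive client: treat a timeout as 'nothing happened' and resend."""
    for _ in range(attempts):
        try:
            return service.consume(quantity)
        except TimeoutError:
            continue  # log it and try again later
    return "gave up"


service = FakeInventoryService(stock=10)
result = consume_with_retry(service, quantity=1)
print(result, service.stock)  # one item requested, but stock dropped by two
```

From the client’s side the retry looks like a success. Only the stock count on the service’s side shows that the first attempt was acted on too.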
I began thinking, how can I guarantee or at least confirm that my program’s “consume inventory request” message got through? Will yet another layer of abstraction help? Perhaps another library or protocol? As I did my research, I ran into another “problem” of computer science. You see, in computer science there are many infamous problems: the halting problem, the traveling salesman problem and the Tower of Hanoi, just to name a few of the most well-known ones. Let me try to explain the problem with my own telling of a story. In antiquity, two generals command armies that are located on either side of a city in a valley. They must conquer the city in the valley, and to do so successfully they must attack simultaneously or they will surely be defeated by the city’s defenders. To coordinate their attack, they agree to exchange a set of messages, sent by messengers, who must travel through the valley. One general will propose to attack: “Let’s attack at dawn tomorrow.” The second general will respond, “Agreed, we’ll attack at dawn tomorrow.” The first general must then reply to the second with “It is so, attack at dawn tomorrow.” The second general says, “The journey through the valley is treacherous. Surely one of our three messengers will be killed by the city’s defenders, disrupting our plans.” The first general states, “Then we shall send three messengers apiece, so that one will get through.” The second general worries aloud, “I fear that even three may not be enough for such a dangerous task. I cannot stand to send more than 10 men, but even a phalanx might not suffice if fortune is against us.” The two generals eventually come to the conclusion that no number of messengers can guarantee that the messages get through, and that sending more than 10 men is too high a cost.
This is called the Two Generals’ Problem. Basically, the Two Generals’ Problem demonstrates that it’s impossible to guarantee you can coordinate an activity by communicating over an unreliable connection. However, by accepting some level of risk that messages won’t get through, you can engineer a solution that mitigates problems and is good enough in most situations.
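A back-of-the-envelope way to see the “accept some risk” trade-off: if each messenger is lost independently with probability p, the chance that at least one of n gets through is 1 − pⁿ, which climbs toward 1 but never reaches it. The sketch below is Python; the independence assumption and the 50% loss rate are mine, chosen purely for illustration. It only models a single message going one way, whereas the full problem also needs the acknowledgement to arrive, and the acknowledgement of that, and so on.

```python
def delivery_probability(loss_rate, messengers):
    """Chance that at least one of `messengers` gets through,
    assuming each is lost independently with probability `loss_rate`."""
    return 1 - loss_rate ** messengers


for n in (1, 3, 10):
    print(n, delivery_probability(0.5, n))
```

With a 50% loss rate, three messengers give an 87.5% chance of delivery and ten give roughly 99.9%. That is better, but still not a guarantee.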
Much like my intercom, the internet can be an unreliable means of communication. This isn’t to say the internet is unreliable. Overall, it’s quite reliable; it’s just that at times it can be unreliable. The internet came about as a result of a DARPA project called ARPANET, which set out to prove that you could build a reliable packet-switched network, one that is more efficient than a circuit-switched network and lacks its single point of failure. One of the key pieces of technology to come out of the ARPANET project was the TCP/IP networking protocol, which is used on the majority of computer networks today. TCP/IP accepts that problems will occur and mitigates them.
Back to the issue with my program’s problems communicating with the web service: I realized it was impossible to guarantee that my program’s “consume inventory request” messages were getting through, and that there was already a lot of engineering trying to make sure that they did. I realized that in all likelihood the messages were getting through and that the return message wasn’t getting lost. It wasn’t being generated in the first place, or at least not in a timely enough fashion, by the third-party web service. I modified my program to create its own return message in the case of a timeout exception (i.e., when it didn’t get a return message). This pretend return message would contain all the relevant data and would emulate a rejection-type return message. All rejection-type return messages show up on a report that is reviewed daily by the staff at my client, so that corrections can be made and the “consume inventory request” flagged to be resent by my program. However, in the case of a timeout-caused pretend rejection, they need to pick up the telephone and verify with another human being on the other end that the “consume inventory request” was processed correctly.
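The shape of that change can be sketched in a few lines of Python (again, not the actual .NET code). The dictionary fields, the `send_consume_request` helper and the `needs_phone_check` flag are hypothetical names invented for this illustration; the real return messages are XML and are not reproduced here.

```python
def send_consume_request(request_id, send):
    """Call `send`; on timeout, fabricate a rejection instead of retrying."""
    try:
        return send(request_id)
    except TimeoutError:
        # We cannot tell whether the service acted on the request, so we
        # do not resend. The pretend rejection lands on the daily report
        # and is flagged for a human to verify by telephone.
        return {
            "request_id": request_id,
            "accepted": False,
            "reason": "timeout: no return message received",
            "needs_phone_check": True,
        }


def timing_out_send(request_id):
    """Simulated service call that never answers in time."""
    raise TimeoutError("simulated timeout")


message = send_consume_request("REQ-1", timing_out_send)
print(message)
```

The point of the design is that a timeout is treated as “unknown” rather than “failed”, and the decision to resend is left to a person who has checked the other side.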
So, what’s the moral of the story? Communication is hard? Don’t get involved in a land war in Asia? I think the moral is that if we accept some amount of risk in life we can mitigate most problems.