UPDATE: This English translation was produced by ChatGPT on 2025-05-17 from the original blog post dated 2006-06-24. The original Czech version appears below the translation.
We often talk about a “working application,” but today’s applications, or more precisely software systems, are no longer as simple as single-user DOS applications once were. Today we expect a software system to handle a high number of transactions per second, support hundreds or ideally thousands of users, and be built as multiple tiers of interconnected subsystems glued together with the best integration tools, all of which is supposed to make it scalable and robust. You may have heard these terms before, but who can really make sense of it all? You practically need a team of experts to run such a system.
Who exactly do you need? A software architect, analyst, developer, operating system administrator, database administrator, and network administrator. Great—but now you’re facing monthly costs so high that you’ll need a really solid business plan to cover the overhead of all these people and still make a profit. But even if you’re lucky enough to have such a team, are they capable of quickly and reliably identifying every operational issue in the system?
Let me use a specific case to illustrate that detecting a malfunction in a software system is not always straightforward. For years, we have been developing and operating an online airline ticket reservation system for a customer. The system is divided into several mutually communicating layers. The layer closest to the customer (let’s call it Layer 1) is a classic web system that uses object-oriented techniques (inheritance) to extend a simpler web layer (Layer 2). Layer 2 contains all of the user-interface logic but none of the final visual appearance, texts, or other customizations; those are handled entirely by Layer 1. This setup is ideal, especially because many customized websites are built on the same foundation.
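To give a purely illustrative idea of that split, here is a minimal sketch of the pattern: a shared web layer carries the UI logic, and each customized site subclasses it to supply its own texts and styling. The class and method names below are my own assumptions, not the system’s actual code.

```python
# Sketch of the Layer 2 / Layer 1 split described above: a generic web layer
# holds the UI logic, and each customized site subclasses it to supply its own
# look and texts. All names here are illustrative assumptions.
class BookingWebLayer:                       # "Layer 2": shared UI logic
    def search_form(self) -> str:
        """Render the flight-search form using site-specific texts and styling."""
        return (f"<div class='{self.css_class()}'>"
                f"<h1>{self.heading()}</h1>"
                f"<form action='/search'>...</form></div>")

    # Hooks that customized sites are expected to override.
    def heading(self) -> str:
        return "Flight search"

    def css_class(self) -> str:
        return "default-theme"


class AirlineXSite(BookingWebLayer):         # "Layer 1": one customized website
    def heading(self) -> str:
        return "Book your Airline X flight"

    def css_class(self) -> str:
        return "airlinex-theme"


if __name__ == "__main__":
    print(AirlineXSite().search_form())
```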
Both of these layers provide only the graphical user interface, while the reservation logic itself resides on the application server (Layer 3). Layer 2 communicates with Layer 3 using XML over TCP/IP. Layer 3 simulates the work of a live travel operator and, based on user requests, tries to find optimal flight connections in the international AMADEUS reservation system. Layer 3 communicates with AMADEUS via its proprietary API over TCP/IP. It also needs to store certain information, for which it uses a database server, again over TCP/IP.
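For a concrete picture of the Layer 2 to Layer 3 exchange, here is a minimal sketch of an XML request sent over a raw TCP connection. The host, port, and XML elements are illustrative assumptions only; the real protocol of the system is not described in this post.

```python
# Minimal sketch of an XML-over-TCP/IP request from Layer 2 to Layer 3.
# The host, port, and XML schema below are illustrative assumptions only;
# the real system's protocol is not documented in this post.
import socket
import xml.etree.ElementTree as ET

APP_SERVER = ("layer3.example.com", 9000)   # hypothetical application server address

def build_request(origin: str, destination: str, date: str) -> bytes:
    """Serialize a flight-search request as a small XML document."""
    root = ET.Element("searchRequest")
    ET.SubElement(root, "origin").text = origin
    ET.SubElement(root, "destination").text = destination
    ET.SubElement(root, "date").text = date
    return ET.tostring(root, encoding="utf-8")

def send_request(payload: bytes) -> bytes:
    """Open a TCP connection, send the XML payload, and read the reply."""
    with socket.create_connection(APP_SERVER, timeout=10) as sock:
        sock.sendall(payload)
        sock.shutdown(socket.SHUT_WR)        # signal end of request
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
        return b"".join(chunks)

if __name__ == "__main__":
    reply = send_request(build_request("PRG", "LHR", "2006-07-01"))
    print(reply.decode("utf-8", errors="replace"))
```

In practice the message framing matters (a length prefix, a delimiter, or simply closing the connection); the sketch takes the simplest route and closes its write side, then reads until the server closes the connection.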
You’ll agree that this can no longer be considered a simple software system. From a hardware perspective, at least three servers are used: one web server for Layers 1 and 2, one application server for Layer 3, and one database server. All of them must be connected via a network. And this doesn’t even include the AMADEUS servers located on the other side of the globe, or additional servers used for load balancing, backups, automatic email notifications, DNS servers, and so on.
I’ve described this “relatively simple” system just to show how difficult it can be to diagnose a suddenly appearing issue. We had been developing and maintaining the software for about five years, while a third party was responsible for its operation. About a year ago, after a period of dissatisfaction, the customer decided to end the cooperation with the third-party provider and asked us to take over system operations as well.
Anyone who has ever taken over an existing information system knows that it’s no walk in the park. For us as developers, it was a bit easier because we understood the system architecture, but even so, some “improvements” had been made on the production servers during operations, and no one really knew much about them. Fortunately, we were able to arrange a gradual handover of the servers and document those “enhancements” with the former operator.
I won’t hide the fact that the server migration had already been going on for nine months, but since the system worked, there was no rush. The migration involved preparing new hardware and reinstalling everything from scratch, which also meant upgrading the hardware, the operating system, and the required libraries.
Now we’re getting to the core of the story. At one point, the web and application servers were still in the old provider’s data center, while the database server had already been successfully moved to our own. The system operated in this setup for several weeks without issues.
Then the client called, complaining that the web reservation system was occasionally not working. After further diagnostics, we found that sometimes the system worked fine, sometimes it was slow, and sometimes it didn’t work at all. Log files showed issues between the application server (Layer 3) and the database server. The application server logs had many errors about failed database connections.
I started reviewing what had changed in the hours and days leading up to the incident. The day before, we had deployed a new version of the application server, but the changes were minor and should not have affected the database connection. Still, I reverted to the older version just in case—but that didn’t help.
So I focused on the database. Its logs showed numerous failed connection attempts from the application server. After too many failures, the database blocked further access from that source. I then tried modifying some configuration parameters—increasing the number of allowed connections, increasing the client connection timeout, and so on. After about two hours of trying everything, there was still no improvement.
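In hindsight, what would have helped at this point was a quick way to tell a database-side limit from a flaky network path. The sketch below is my own illustrative tool, not something we actually ran back then: it simply opens plain TCP connections to the database port in a loop and counts how many fail. Sporadic failures against an otherwise idle server point away from configuration limits and toward the network.

```python
# Sketch of a TCP connectivity probe against the database server.
# The host and port are illustrative assumptions (3306 would be MySQL,
# 5432 PostgreSQL); the post does not say which database was used.
import socket
import time

DB_HOST, DB_PORT = "db.example.com", 3306   # hypothetical database address
ATTEMPTS = 200

failures = 0
for i in range(ATTEMPTS):
    try:
        with socket.create_connection((DB_HOST, DB_PORT), timeout=2):
            pass                             # connection opened and closed cleanly
    except OSError:
        failures += 1
    time.sleep(0.05)                         # don't hammer the server

print(f"{failures}/{ATTEMPTS} connection attempts failed "
      f"({100.0 * failures / ATTEMPTS:.1f}%)")
```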
Then I remembered an email from our ISP a week earlier, warning of a five-minute outage scheduled for 5:00 AM. I immediately checked my inbox and confirmed that the outage had indeed taken place that very day, and that its time matched exactly the first connection errors in the application server logs.
We had acknowledged the email at the time, but our systems typically handle such short outages gracefully. Our central monitoring also reported no issues. But you’ve probably guessed it: since the application and database servers were in different data centers (with different ISPs), it turned out to be a network problem.
I launched a more aggressive ping test (ping -s 1000 -i 0.01) from the application server to the database server—and discovered about 10% packet loss. That explained everything. With traceroute and ping, I quickly located the issue: it was between our data center’s router and our ISP’s switch. In 99% of such cases, the culprit is mismatched duplex settings on the network interfaces.
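If you want to script this kind of check, here is a rough sketch that wraps the Linux ping command and pulls the packet-loss percentage out of its summary line. The host name is an assumption, and note that intervals below 0.2 s (like the -i 0.01 above) normally require root.

```python
# Sketch: measure packet loss to a host by wrapping the Linux `ping` command.
# Note: intervals below 0.2 s (like -i 0.01) normally require root privileges.
import re
import subprocess

def packet_loss(host: str, count: int = 100, size: int = 1000, interval: float = 0.2) -> float:
    """Return the packet-loss percentage reported by ping's summary line."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-s", str(size), "-i", str(interval), host],
        capture_output=True, text=True,
    )
    match = re.search(r"([\d.]+)% packet loss", result.stdout)
    if not match:
        raise RuntimeError(f"could not parse ping output:\n{result.stdout}")
    return float(match.group(1))

if __name__ == "__main__":
    loss = packet_loss("db.example.com")     # hypothetical database server
    print(f"packet loss: {loss:.1f}%")
```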
Our interface was set to auto-negotiate and had settled on 10 Mbps half-duplex. A call to our ISP confirmed that their side was set to 10 Mbps full-duplex. I changed our interface to 10 Mbps full-duplex—and the packet loss disappeared. Instantly, the software system was working again.
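On Linux the duplex setting can be inspected and forced with ethtool; the sketch below only illustrates the idea. The interface name eth0 is an assumption, the commands require root, and matching the switch (or re-enabling auto-negotiation on both sides, as we had originally agreed) is of course the part that actually fixes the problem.

```python
# Sketch: inspect and force speed/duplex on a Linux interface via ethtool.
# Requires root; "eth0" is an assumed interface name.
import subprocess

IFACE = "eth0"

def show_link_settings(iface: str) -> str:
    """Return ethtool's report of current speed, duplex, and auto-negotiation."""
    return subprocess.run(["ethtool", iface], capture_output=True, text=True).stdout

def force_10_full(iface: str) -> None:
    """Force 10 Mbps full duplex with auto-negotiation off, to match the ISP side."""
    subprocess.run(
        ["ethtool", "-s", iface, "speed", "10", "duplex", "full", "autoneg", "off"],
        check=True,
    )

if __name__ == "__main__":
    print(show_link_settings(IFACE))
    # force_10_full(IFACE)   # uncomment to apply; check the peer's settings first
```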
I wiped the sweat off my forehead and reflected on the situation. The root cause was that the ISP had not saved the switch configuration we had agreed upon months earlier (auto-negotiation enabled). After the power outage, the switch reverted to the old settings.
I’m lucky to be from a generation of IT professionals with a broad skill set—able to handle all these steps on my own. I can’t imagine a similar problem being resolved efficiently by a team of separate OS admins, DBAs, application developers, and network engineers.
Long live simple systems! I felt like shouting, “WHO CAN POSSIBLY UNDERSTAND ALL THIS?” But we’ll simply have to accept it—and all of us in IT must continuously educate ourselves so we can quickly identify the true causes of problems.
The worst situation is when you don’t know what’s causing an issue and just say, “Well, I guess it’s just something between heaven and earth.”