Event Description: At 12:30pm EST today, our system administrator received a text message from Nagios, our network monitoring software, warning us that something was wrong.
We quickly discovered that one of our data centers, in Toronto, was unreachable from the Internet.
This particular data center is in a secure building, in a large facility operated by Cogent Communications. It has backup generators, several days of diesel fuel, and racks and racks of batteries to keep the whole thing running for the few minutes it takes to start the generators. It has massive amounts of air conditioning, multiple high speed connections to the Internet, and the kind of “right stuff” down-to-earth engineers who always do things the boring, plodding, methodical way instead of the flashy cool trendy way, so everything is pretty reliable.
Internet providers like Cogent Communications like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like “99.99% uptime.” When you do the math, let’s see, there are 525,949 minutes in a year (or 525,600 if you are in the cast of Rent), so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty, but honestly, it’s often rather trivial… like, you get your money back for the minutes they were down. I remember once getting something like $10 off the bill from a T1 provider because of a two day outage that cost us thousands of dollars. SLAs can be a little bit meaningless that way, and given how low the penalties are, a lot of network providers have just started advertising 100% uptime.
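For anyone who wants to check that math, here is a minimal sketch in Python. The function name `downtime_budget_minutes` is made up for illustration, and the 525,949-minute year is simply the figure used above.

```python
# Average year length in minutes, as used in the text (365.2425 days).
MINUTES_PER_YEAR = 525_949


def downtime_budget_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed in the period at a given uptime percentage."""
    return period_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Compare a few common SLA levels side by side.
    for sla in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(sla)
        print(f"{sla}% uptime allows {budget:.2f} minutes of downtime per year")
```

At 99.99% this works out to the same 52.59 minutes per year quoted above.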
Within 7 minutes we contacted the Cogent Communications Network Operations Center (NOC) in Toronto. They ran some tests, started investigating, and couldn’t find anything wrong, so at 12:45pm EST we sent our technician on site to investigate the problem as a precaution.
The servers seemed to be up. The problem was something with the Cogent network switch.
Suddenly, at 1:19pm EST, all services were restored. At 1:22pm EST we contacted Cogent again to inquire about the problem, and they advised us of the following Reason For Outage.
“Cogent Communications network experienced an event that occurred on June 22nd, 2015 at 12:30pm EST, affecting our primary Cogent Internet circuit. Service was interrupted due to a human error by a Cogent Communications employee. We have discussed the incident with the responsible group to ensure that this does not happen again. An internal investigation has been requested.”
We are humbled by this interruption, as we have recently been celebrating a significant improvement in our uptime. Really high availability becomes extremely costly. The proverbial “six nines” availability (99.9999% uptime) means roughly 30 seconds of downtime per year. That’s really kind of ridiculous. Even the people who claim that they have built some big multi-million dollar super-duper ultra-redundant six nines system are going to wake up one day, I don’t know when, but they will, and something completely unusual will have gone wrong in a completely unexpected way, say three EMP bombs, one at each data center, and they’ll smack their heads and have fourteen days of outage.
Think of it this way: If your six nines system goes down mysteriously just once and it takes you an hour to figure out the cause and fix it, well, you’ve just blown your downtime budget for the next century. Even the most notoriously reliable systems, like Bell’s long distance service, have had long outages (six hours in 1991) which put them at a rather embarrassing three nines … and Bell’s long distance service is considered “carrier grade,” the gold standard for uptime.
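The arithmetic behind those two claims can be sketched in a few lines of Python. The helper names here are hypothetical, and the year length is the same 525,949 minutes used earlier in this post.

```python
import math

# Average year in seconds, based on the 525,949-minute year used in this post.
SECONDS_PER_YEAR = 525_949 * 60


def yearly_budget_seconds(nines):
    """Allowed downtime per year, in seconds, at an availability of N nines."""
    return SECONDS_PER_YEAR * 10 ** -nines


def years_of_budget_used(outage_seconds, nines):
    """How many years of downtime budget a single outage consumes."""
    return outage_seconds / yearly_budget_seconds(nines)


def nines_achieved(outage_seconds, period_seconds=SECONDS_PER_YEAR):
    """Whole number of nines reached for a total outage within a period."""
    return math.floor(-math.log10(outage_seconds / period_seconds))


if __name__ == "__main__":
    print(f"Six nines budget: {yearly_budget_seconds(6):.1f} seconds per year")
    print(f"One-hour outage: {years_of_budget_used(3600, 6):.0f} years of budget")
    print(f"Six-hour outage in one year: {nines_achieved(6 * 3600)} nines")
```

A one-hour outage eats a little over a century of six nines budget, and a six-hour outage in a single year lands at three nines.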
During 2015, up until today, MeloTel had maintained a four nines record of 99.99% uptime. As of today, regrettably, we are at 99.9% uptime.
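As a rough check on that figure, here is a small Python sketch. The outage times come from the event description above; the measurement window starting on January 1st, 2015, and the idea that this was the only downtime in that window, are assumptions made purely for illustration.

```python
import math
from datetime import datetime

# Outage times are taken from this report; the window start is an assumption.
window_start = datetime(2015, 1, 1, 0, 0)     # assumed: year-to-date measurement
outage_start = datetime(2015, 6, 22, 12, 30)  # first Nagios alert
outage_end = datetime(2015, 6, 22, 13, 19)    # services restored

outage_minutes = (outage_end - outage_start).total_seconds() / 60
window_minutes = (outage_end - window_start).total_seconds() / 60

# Assumes this was the only downtime in the window.
availability_percent = 100 * (1 - outage_minutes / window_minutes)
nines = math.floor(-math.log10(outage_minutes / window_minutes))

print(f"Outage length: {outage_minutes:.0f} minutes")
print(f"Year-to-date availability: {availability_percent:.2f}% ({nines} nines)")
```

Under those assumptions, a single 49 minute outage is enough to drop a year-to-date record from four nines to three.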
In the meantime, our customer service folks have been authorized to credit customers’ accounts if they feel like they were significantly affected by an outage. We let the customer decide how much they want to be credited, up to a whole month, because not every customer is even going to notice the outage, let alone suffer from it. I hope this system will improve our reliability to the point where the only outages we suffer are the truly unexpected ones.
We regret any service-impacting event, but unfortunately they can happen. We pick ourselves up, dust ourselves off, and continue to work to make your service with MeloTel better every day. We appreciate your patience and value your business.
John M. Meloche