Event Description: At 12:30pm EST today, our system monitoring tools alerted our system administrator. The alert was a text message from Nagios, our network monitoring software, warning us that something was wrong.
We quickly discovered that one of our data centers, in Toronto, was unreachable from the Internet.
This particular data center is in a secure building, in a large facility operated by Cogent Communications. It has backup generators, several days of diesel fuel, and racks and racks of batteries to keep the whole thing running for the few minutes it takes to start the generators. It has massive amounts of air conditioning, multiple high speed connections to the Internet, and the kind of “right stuff” down-to-earth engineers who always do things the boring, plodding, methodical way instead of the flashy cool trendy way, so everything is pretty reliable.
Internet providers like Cogent Communications like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like “99.99% uptime.” When you do the math, let’s see, there are 525,949 minutes in a year (or 525,600 if you are in the cast of Rent), so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty, but honestly, it’s often rather trivial… like, you get your money back for the minutes they were down. I remember once getting something like $10 off the bill from a T1 provider because of a two day outage that cost us thousands of dollars. SLAs can be a little bit meaningless that way, and given how low the penalties are, a lot of network providers have just started advertising 100% uptime.
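For anyone who wants to check the SLA math above, here is a minimal sketch. It assumes the 525,949 minutes-per-year figure used in the text; the function name `downtime_budget_minutes` is ours, purely for illustration, and not part of any monitoring tool.

```python
# Downtime allowed per year by an uptime guarantee.
# Assumption: a year is 525,949 minutes, the figure quoted in the text.
MINUTES_PER_YEAR = 525_949


def downtime_budget_minutes(uptime_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime an SLA of `uptime_percent` allows."""
    return period_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(pct)
        print(f"{pct}% uptime allows {budget:.2f} minutes of downtime per year")
```

Each extra nine divides the allowance by ten, which is why the cost of each step up climbs so quickly.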
Within 7 minutes we contacted the Cogent Network Operations Center (NOC) in Toronto. They ran some tests and started investigating but couldn’t find anything wrong, so by 12:45pm EST we had sent our technician on site to investigate the problem as a precaution.
The servers seemed to be up. The problem was something with the Cogent network switch.
Suddenly, at 1:19pm EST, all services were restored. At 1:22pm EST we contacted Cogent again to inquire about the problem, and they advised us of the following Reason For Outage.
“Cogent Communications network experienced an event that occurred on June 22nd, 2015 at 12:30pm EST, affecting our primary Cogent Internet circuit. Service was interrupted due to a human error by a Cogent Communications employee. We have discussed the incident with the responsible group to ensure that this does not happen again. An internal investigation has been requested.”
We are humbled by this interruption, as we have recently been celebrating a significant improvement in our uptime. Really high availability becomes extremely costly. The proverbial “six nines” availability (99.9999% uptime) means no more than about 30 seconds of downtime per year. That’s really kind of ridiculous. Even the people who claim that they have built some big multi-million dollar super-duper ultra-redundant six nines system are going to wake up one day, I don’t know when, but they will, and something completely unusual will have gone wrong in a completely unexpected way (say, three EMP bombs, one at each data center), and they’ll smack their heads and have fourteen days of outage.
Think of it this way: If your six nines system goes down mysteriously just once and it takes you an hour to figure out the cause and fix it, well, you’ve just blown your downtime budget for the next century. Even the most notoriously reliable systems, like Bell’s long distance service, have had long outages (six hours in 1991) which put them at a rather embarrassing three nines … and Bell’s long distance service is considered “carrier grade,” the gold standard for uptime.
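The “next century” remark can be checked with a few lines. This is only a sketch under the same 525,949 minutes-per-year assumption as the text; the helper names are made up for illustration.

```python
import math

# Assumption: a year is 525,949 minutes, as in the text above.
SECONDS_PER_YEAR = 525_949 * 60


def yearly_budget_seconds(nines: int) -> float:
    """Downtime allowed per year at the given number of nines (6 = 99.9999%)."""
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


def years_of_budget_used(outage_seconds: float, nines: int) -> float:
    """How many years of downtime allowance a single outage consumes."""
    return outage_seconds / yearly_budget_seconds(nines)


def nines_achieved(outage_seconds: float) -> int:
    """Whole nines of availability left in a year containing this much downtime."""
    unavailability = outage_seconds / SECONDS_PER_YEAR
    return math.floor(-math.log10(unavailability))


if __name__ == "__main__":
    print(f"Six nines allows {yearly_budget_seconds(6):.1f} seconds per year")
    print(f"A one-hour outage uses {years_of_budget_used(3600, 6):.0f} years of that budget")
    print(f"A six-hour outage leaves {nines_achieved(6 * 3600)} nines for the year")
```

Six nines works out to roughly 31.6 seconds per year, so a single one-hour outage eats a bit more than a century of allowance, and a six-hour outage leaves a year at three nines.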
During 2015 and up until today, MeloTel had maintained a four nines record of 99.99% uptime. However, as of today, we regretfully have to count ourselves at 99.9% uptime.
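As a rough illustration of why a single outage moves us down a bracket, the sketch below computes year-to-date uptime from the times reported in this notice (down at 12:30pm, restored at 1:19pm on June 22nd, 2015). It assumes, purely for the sake of the example, that this was the only downtime so far in 2015, which this notice does not state; the function name is hypothetical.

```python
from datetime import datetime

# Times taken from this notice; the year is measured from January 1st, 2015.
YEAR_START = datetime(2015, 1, 1)
OUTAGE_START = datetime(2015, 6, 22, 12, 30)
OUTAGE_END = datetime(2015, 6, 22, 13, 19)


def uptime_percent(period_start, period_end, outages):
    """Percent of the period spent up, given a list of (start, end) outages."""
    period_minutes = (period_end - period_start).total_seconds() / 60
    down_minutes = sum((end - start).total_seconds() / 60 for start, end in outages)
    return 100.0 * (1 - down_minutes / period_minutes)


outage_minutes = (OUTAGE_END - OUTAGE_START).total_seconds() / 60
# Assumption: no other downtime in 2015 before this event.
ytd_uptime = uptime_percent(YEAR_START, OUTAGE_END, [(OUTAGE_START, OUTAGE_END)])

if __name__ == "__main__":
    print(f"Outage length: {outage_minutes:.0f} minutes")
    print(f"Year-to-date uptime: {ytd_uptime:.3f}%")
```

Under that assumption, the 49 minute outage puts the year to date a little above 99.98%, just short of the 99.99% mark, which is why it now falls in the three nines bracket.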
In the meantime, our customer service folks have been authorized to credit customers’ accounts if they feel they were significantly affected by an outage. We let the customer decide how much they want to be credited, up to a whole month, because not every customer is even going to notice the outage, let alone suffer from it. I hope this system will improve our reliability to the point where the only outages we suffer are the truly unexpected ones.
We regret any service-impacting event, but unfortunately they can happen. We pick ourselves up, dust ourselves off, and continue to work to make your service with MeloTel better every day. We appreciate your patience and value your business.
John M. Meloche