Building an outage-proof network with SD-WAN

On December 28, 2018, CenturyLink experienced a major network outage, disrupting residential, commercial and 911 services in many areas across the nation. The widespread outage affected many customers for more than 24 hours. We recently spoke with Scott Moeller, director of business solutions at Portfolio Communications, about what it was like for businesses affected by the service disruption and what others can learn from their experience.

When did you first get a sense something big was happening to CenturyLink’s network?

I didn’t know it had even happened. My cell phone rang when the first customer called me. They didn’t have SD-WAN. He said, “Hey, Scott. Sorry to bother you, but is CenturyLink having a major problem, because we get our DNS from them and our DNS is down and now I’m starting to see our circuits drop.”

While I was on the phone with him, I saw an alert from CenturyLink pop up on my screen saying they had a major fiber cut somewhere. When a customer tells me their DNS is going down, that means something bigger than a breadbox is wrong. I knew this wasn’t a fiber cut. I did some digging and got the word out to every customer I knew that this was going on.

How did the outage affect your customers?

It was a disaster for the ones who didn’t have SD-WAN. That outage — and remember that was a global outage — killed their websites, it killed their toll-free services. Customers couldn’t even call them to report a repair ticket. Their email was down.

What were some of those initial calls like?

They were frantic calls for the people who didn’t have SD-WAN because the first thing they had to do was report it. When something goes wrong in an organization, everyone points to IT and says, “What did you do?” And then IT has to explain what’s going on. When they could quickly point to CenturyLink and say it’s nothing we did, they were relieved, but they were also like, okay, we’ve got to talk more about this because whatever we’re doing isn’t working.

What about the people who had SD-WAN already?

The people who had SD-WAN were totally relaxed. They weren’t affected in a negative way because, while they still had outages, their users didn’t know it. My contact at a large Northwest-based family drugstore said to me, “Scott, we can’t thank you enough. We see some CenturyLink services down, but due to our Bigleaf SD-WAN, we don’t have a store down. Thank you for designing our network like this.” I got the same response from a large regional electrical parts retailer. I had those conversations all day long.

Second connections must have helped a lot of SD-WAN customers weather the storm.

Except for the ones who had put all of their network eggs into CenturyLink’s basket. I had heard through the grapevine that some customers had gone down as a whole regardless of their second connection because their SD-WAN was CenturyLink and CenturyLink’s platform went down. I purposely don’t pitch a behemoth as an SD-WAN for that very reason. I would prefer to have my customers know that they have two different connections from two different carriers and they have an independent party as their SD-WAN box and hopefully another independent party as their firewall.

It’s not just fiber cuts, though, right?

One of the reasons to have an SD-WAN is planned maintenance windows. Every carrier has them. They have to upgrade their software and their hardware. And customers have no control over when they occur. They are at the mercy of the carriers. If payroll’s going out, you can’t have a maintenance outage. But if you have a major carrier that says, “Hey, we’re doing it anyway,” well, then you’re stuck from a business perspective. It’s a disaster.

I also remind companies that CenturyLink had an outage like this latest one three years ago, too. And Comcast had one last summer. And I went through who knows how many outages with Level3 before the merger with CenturyLink. There were probably six or seven a year that were nationwide.

Networks go down. If you’re a customer, it doesn’t matter if it’s because a squirrel chewed through the line or your carrier’s maintenance schedule conflicts with your business schedule. Down is down.

The business impact of an outage must be considerable.

CenturyLink had everybody down for 40 hours for the most part. I looked at one customer and I just said, you’ve got 17 sites that were down for 40 hours. How much business did you lose? Your SLA is going to get you $49 per store, but how much did you lose because you couldn’t sell, you couldn’t do any repair, you couldn’t do any marketing, you couldn’t do any business whatsoever. You had everybody sitting on their hands for 40 hours.

And the promise of an SLA does little in the heat of an outage.

Every single IT manager I work with does not want an SLA because they know that if their network goes down they get $5.70 because it’s prorated by the number of minutes you were down. And a $5.70 credit does not get your CEO off your back about an outage that shouldn’t have happened.
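
To make the proration concrete, here is a minimal sketch, assuming the credit is simply the monthly circuit fee multiplied by the fraction of the month the circuit was down. The $900 monthly fee, the 30-day month and the function name are illustrative assumptions, not figures from the interview; real SLA terms vary by carrier and contract.

```python
def prorated_sla_credit(monthly_fee: float,
                        downtime_minutes: float,
                        days_in_month: int = 30) -> float:
    """Return an SLA credit prorated by minutes of downtime.

    Assumption: credit = monthly fee x (minutes down / minutes in the month),
    capped at one full month's fee.
    """
    minutes_in_month = days_in_month * 24 * 60
    fraction_down = min(downtime_minutes / minutes_in_month, 1.0)
    return round(monthly_fee * fraction_down, 2)


# Hypothetical example: a $900/month circuit that was down for 40 hours.
credit = prorated_sla_credit(monthly_fee=900.00, downtime_minutes=40 * 60)
print(f"Credit for a 40-hour outage: ${credit:.2f}")
```

Under these assumptions, a 40-hour outage earns back only about 5.6 percent of one month's fee, which is why the credit looks so small next to a site's lost sales over the same 40 hours.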

Any advice for IT managers on how to avoid the pain of these types of outages?

They have to find someone they trust, someone who has experienced these outages. Their carrier’s job is to sell them more services at the highest price possible. It’s not to help them fix whatever problem they are facing. Which means they need to find someone who can give them an independent analysis of what’s happening and present real business solutions based on real-world experience with those solutions.
