This is a follow-up to the first post of this two-post series on our Dynamic QoS Prioritization. This post is a deeper technical dive into QoS and how our implementation works.
Let’s dive into the details, covering all five concepts discussed in the previous post.
Legacy network appliances (routers, firewalls, load-balancers) are self-contained devices that attempt to provide useful control of traffic at one point in the network path. These devices offer high efficiency (there is no tunneling overhead) and sometimes low cost for basic versions, yet sacrifice in almost every other area. For more details on how they compare, check out this comparison against Bigleaf.
Then there are the newer Software Defined Networking (SDN) entrants in this space such as Bigleaf. Some have adopted the term “SD-WAN” to describe use of SDN across Wide Area Networks (WANs). Unfortunately, just like “Cloud” can mean many things from private VMs to public-facing SaaS services to Hosted VoIP, SDN and SD-WAN are marketing terms that vary widely in meaning. Some use them to describe simple features like cloud-based device administration, while others use them to mean fully separated control/data plane architectures, and everything in between.
So the question you need to ask is: what are the sacrifices or tradeoffs they are making? Buzzwords don’t matter; the experience for your users does. Unlike other offerings, we at Bigleaf sacrifice a little bit of speed and latency for vastly improved reliability, performance, and user experience.
We do this by tunneling all user traffic through our gateway clusters. This means there’s tunnel overhead (typically about 8%) and a geography-dependent latency increase (typically 5-20ms). Internet-based applications don’t even notice the tiny latency increase, and with broadband circuits so prevalent, the tunnel overhead is basically meaningless. However, what this tradeoff gains us is Seamless Failover of all applications, effective QoS across the public internet, and everything else you read about on this website, without caveats.
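To make that ~8% overhead figure concrete, here is a back-of-the-envelope calculation, assuming a generic IP/UDP-based tunnel encapsulation. The header sizes are illustrative assumptions, not Bigleaf’s actual packet format:

```python
# Illustrative per-packet tunnel overhead for a generic IP/UDP encapsulation.
# Header sizes are assumptions, not Bigleaf's actual tunnel format.
OUTER_IPV4 = 20   # outer IPv4 header, bytes
OUTER_UDP = 8     # outer UDP header, bytes
TUNNEL_HDR = 8    # hypothetical tunnel/session header, bytes

def overhead_pct(payload_bytes: int) -> float:
    """Percentage of wire bytes consumed by encapsulation headers."""
    extra = OUTER_IPV4 + OUTER_UDP + TUNNEL_HDR
    return 100.0 * extra / (payload_bytes + extra)

# Average internet packets are much smaller than the 1500-byte MTU, so the
# effective overhead is noticeable: exactly 8% at a 414-byte payload.
print(round(overhead_pct(414), 1))   # → 8.0
```

The key point is that overhead is a fixed per-packet cost, so small real-time packets pay a higher percentage than large bulk-transfer packets.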
Typical load-balancers and firewalls decide if an internet circuit is up or down by pinging Google or some other IP address out the circuit. If the pings stop coming back, the circuit is declared down.
First issue here: Up or down, on or off, that’s the granularity available. Real-time applications like VoIP and VDI require far more delicate treatment than this, as they are sensitive to even 1% packet loss.
Second issue: Varying internet paths. Thanks to internet routing protocols like BGP, once traffic leaves your office it can take many internet paths, it’s “The Web”! This is a neat tool for viewing how widely internet paths can vary. Below is a screenshot showing an example of why this is an issue.
The big dot is your ISP, and some of those other dots are the things you’re trying to interact with on the internet. Notice how there are a gazillion paths? Just because the path to Google is clean does not mean the path to your business-critical applications is clean, or even up!
So SD-WAN fixes this right? Not in many cases. With most other offerings, the providers will tunnel some of your traffic back to their cloud servers, but not other traffic. This is a huge issue when quality comes into play. As this visualization shows, the path tunneled back to their cloud datacenter(s) may be clean, while other paths are nasty or even offline.
Here at Bigleaf we recognized that we can’t sacrifice visibility of what the internet is doing to your application traffic. We absolutely have to know what’s going on at all times for all traffic. Because of this, we tunnel all traffic back through our gateway clusters, your traffic and our monitoring traffic. This ensures that we have fine-grained details on performance of the full internet path that your traffic is taking into the core of the internet. With Bigleaf, the path our monitoring traffic takes is the same as almost the entire path to your VoIP provider, to Google, to Salesforce, and everywhere else.
We monitor that path 10 times per second with custom monitoring packets that our on-site router and gateway clusters pass back and forth. This gives our SDN algorithms packet-loss, latency, jitter, and capacity data for each direction along the whole path, updated in real-time.
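As a sketch of how loss, latency, and jitter can be derived from such a probe stream, consider the following. The probe record format and field names here are illustrative, not Bigleaf’s actual implementation:

```python
# Sketch: deriving per-direction path metrics from timestamped probes sent
# 10x/second. Record format is illustrative, not Bigleaf's actual protocol.
from statistics import mean

def path_metrics(probes):
    """probes: list of dicts with 'seq', 'sent_ms', 'recv_ms' ('recv_ms' is None if lost)."""
    received = [p for p in probes if p["recv_ms"] is not None]
    loss_pct = 100.0 * (len(probes) - len(received)) / len(probes)
    latencies = [p["recv_ms"] - p["sent_ms"] for p in received]
    latency_ms = mean(latencies) if latencies else None
    # Jitter as the mean absolute delta between consecutive one-way latencies.
    jitter_ms = (mean(abs(a - b) for a, b in zip(latencies, latencies[1:]))
                 if len(latencies) > 1 else 0.0)
    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}

probes = [
    {"seq": 0, "sent_ms": 0,   "recv_ms": 12},
    {"seq": 1, "sent_ms": 100, "recv_ms": 114},
    {"seq": 2, "sent_ms": 200, "recv_ms": None},   # a lost probe
    {"seq": 3, "sent_ms": 300, "recv_ms": 311},
]
print(path_metrics(probes))   # 25% loss, ~12.3ms latency, 2.5ms jitter
```

Because the probes ride the same tunnel as user traffic in each direction, metrics computed this way reflect what your applications actually experience on each circuit.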
There is a small portion of the internet path that we don’t fully see and control – the path between our gateway clusters and the endpoints your traffic is flowing to. Typically that path is just a few hops away on the backbone of the internet (which tends to be the most reliable portion), and with many networks it’s only 1 hop away over connections that we control.
The state of QoS on most internet-facing routers and firewalls is sadly very broken. Users think they can check an “enable QoS” checkbox, put in a few rules, and have something that works. As mentioned in the previous post, inbound QoS is basically uncontrolled with on-prem-only solutions due to UDP traffic (and often TCP traffic too).
To get around this issue, we implement control at both ends of the internet path. For upload traffic we control everything at our on-premise router, nothing too special there. For download traffic though, we control all traffic in the core of the internet, at our gateway clusters. These gateway clusters are located in carrier hotels, essentially datacenters that are core internet peering points. We operate our own network rather than using cloud providers like Amazon where resources are shared. These decisions ensure that customers have the lowest latency to the endpoints they are trying to reach, and that we have complete autonomy to run the network in a way that provides maximum performance with no compromises.
In our gateway clusters and on-premise routers we classify user traffic into 6 different categories, rate-limit and queue traffic as needed to ensure proper QoS prioritization, and then send it out through our tunnels. Those categories are:
Because this is happening at both ends (your office and the core of the internet), we have full QoS control over almost the entire internet path. When we say that our QoS works you can believe it, and we’re glad to help you test it if you’d like.
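To illustrate the classify-then-queue step described above, here is a minimal strict-priority shaper in Python. The class names, drain policy, and rate-limit mechanics are simplifications for illustration, not Bigleaf’s actual queueing discipline or category set:

```python
# Sketch: classify packets into priority classes, then drain strictly by
# priority under a rate limit. Class names are illustrative only.
import heapq

PRIORITIES = {"realtime": 0, "interactive": 1, "bulk": 2}  # lower drains first

class PriorityShaper:
    def __init__(self, rate_bytes_per_sec):
        self.rate = rate_bytes_per_sec
        self.queue = []   # heap of (priority, arrival_seq, packet)
        self.seq = 0

    def enqueue(self, packet: bytes, klass: str):
        heapq.heappush(self.queue, (PRIORITIES[klass], self.seq, packet))
        self.seq += 1

    def drain(self, interval_sec: float):
        """Send as many packets as the rate limit allows, highest priority first."""
        budget = self.rate * interval_sec
        sent = []
        while self.queue and budget >= len(self.queue[0][2]):
            _, _, pkt = heapq.heappop(self.queue)
            budget -= len(pkt)
            sent.append(pkt)
        return sent

shaper = PriorityShaper(rate_bytes_per_sec=1000)
shaper.enqueue(b"B" * 600, "bulk")       # arrives first...
shaper.enqueue(b"R" * 300, "realtime")   # ...but realtime drains first
print([p[:1] for p in shaper.drain(1.0)])
```

The point of the sketch: because the shaper’s rate limit is below the circuit’s capacity, the buffering (and therefore the prioritization decision) happens here rather than at some congested hop in the ISP’s network.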
The six QoS priorities above are useless without rules to classify traffic into them. There tend to be three widely used philosophies for QoS rules:

1. No rules at all, relying on whatever the device defaults do.
2. A handful of rules covering only business-critical applications.
3. Exhaustive rulesets with thousands of per-application rules.
#1 obviously is no good. #2 is getting better, but there are lots of basics it leaves uncovered. Maybe business critical applications will work OK, but users may hate the rest of their internet and cloud experience. #3 could be effective, but do you want to maintain that, and do you want to pay for hardware powerful enough to run each traffic flow through thousands of rules?
We’ve come up with a better, more creative method. We have a base ruleset that covers almost all applications, not solely with specific rules but also with other methods that identify traffic beyond basic ports and protocols (but without the overhead of DPI). This ruleset provides an excellent experience for almost every customer and application situation.
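As a sketch of what port/protocol rules plus lightweight flow heuristics (short of DPI) can look like, consider the following. All ports, thresholds, and category names below are illustrative assumptions, not Bigleaf’s actual ruleset:

```python
# Sketch: match explicit port/protocol rules first, then fall back to cheap
# flow heuristics (packet size and rate) instead of deep packet inspection.
# All rules and thresholds are illustrative, not Bigleaf's base ruleset.
def classify(flow):
    """flow: dict with 'proto', 'dst_port', 'avg_pkt_bytes', 'pkts_per_sec'."""
    # Explicit rules for well-known applications.
    if flow["proto"] == "udp" and 5060 <= flow["dst_port"] <= 5061:
        return "voip-signaling"            # SIP
    if flow["proto"] == "udp" and 10000 <= flow["dst_port"] <= 20000:
        return "voip-media"                # a common RTP port range
    # Heuristic: small, steady UDP packets look like real-time media,
    # even on non-standard ports.
    if (flow["proto"] == "udp"
            and flow["avg_pkt_bytes"] < 300
            and flow["pkts_per_sec"] > 20):
        return "realtime"
    return "default"

print(classify({"proto": "udp", "dst_port": 40000,
                "avg_pkt_bytes": 180, "pkts_per_sec": 50}))   # → realtime
```

The heuristic fallback is what lets a compact ruleset catch traffic that explicit port rules miss, without paying the CPU cost of inspecting payloads.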
However, we acknowledge that any fixed ruleset won’t meet every need, and it needs to change over time. That’s one huge benefit of Bigleaf’s SDN technology – it evolves. When we update the ruleset with new optimizations, those get implemented on your service automatically. You get the benefits, with no additional cost or work. And if you need something custom that our base ruleset doesn’t handle then we can also implement custom per-site rules.
This part is pretty crucial. Without real-time adaptation, nothing described above matters. If the network devices at each end of a path don’t have accurate speeds set, then they can’t buffer traffic and prioritize it – other hops along the path will do that, almost surely without regard to your desired QoS priorities.
Pretty much all routers/firewalls/load-balancers are rather dumb about speeds for QoS. They either assume that the speed or throughput capacity of a given network path is equivalent to the speed of the port that it’s connected to (e.g. a 100Mbps ethernet port), or that if a speed is set in the UI for the port (e.g. 40Mbps) that the speed will never change. Internet paths are often congested though. Cable circuits experience heavy congestion in the last-mile. DSL and Ethernet-Over-Copper circuits often experience middle-mile backhaul congestion, and all circuits are prone to varying bandwidth due to network failures and peering congestion.
So how should this be fixed? We spent a lot of time back when we started Bigleaf working on this problem, because it’s not easy to solve. A few SDN-type solutions run a bandwidth test at boot-up or device set-up to evaluate the circuit throughput. The problem with that is that throughput changes! Consider a typical 50M/10M Cable circuit. At varying times its real capacity might be the full 50M/10M off-peak, somewhat less during business hours, and only 39M/6M during evening peak congestion.
Theoretically you could just set the QoS rate-limiting settings to 39M/6M for this circuit and have success, but what if you set it wrong? And what about all the bandwidth you’re wasting during better times? That’s not good enough for us.
We created a patent-pending mechanism that automatically adjusts the QoS rate-limiting settings as circuit capacity changes. This ensures that for both download and upload, you get the most possible speed from each internet circuit without sacrificing constant QoS prioritization, even during times of ISP congestion. Our devices at each end are the only devices buffering traffic along the path, so we control the QoS priority.
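One iteration of such an adaptive loop could be sketched as an AIMD-style controller. The thresholds and step sizes below are made up for illustration; Bigleaf’s patent-pending mechanism is not public, so this is purely a conceptual sketch:

```python
# Sketch: adjust the QoS rate limit each measurement interval, backing off
# when probes show congestion and probing upward when the path is clean.
# Thresholds and step sizes are illustrative, not Bigleaf's algorithm.
def adjust_rate(current_mbps, loss_pct, latency_ms, baseline_latency_ms,
                floor_mbps=1.0, ceiling_mbps=50.0):
    congested = loss_pct > 0.5 or latency_ms > baseline_latency_ms * 1.5
    if congested:
        new_rate = current_mbps * 0.85   # multiplicative decrease under congestion
    else:
        new_rate = current_mbps + 0.5    # additive increase to reclaim capacity
    return max(floor_mbps, min(ceiling_mbps, new_rate))

print(adjust_rate(40.0, loss_pct=2.0, latency_ms=60.0, baseline_latency_ms=20.0))  # → 34.0
print(adjust_rate(40.0, loss_pct=0.0, latency_ms=22.0, baseline_latency_ms=20.0))  # → 40.5
```

The back-off keeps the shaper’s limit below the circuit’s true momentary capacity (so buffering stays on our devices), while the gentle upward probing reclaims bandwidth as congestion clears.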
If an ISP circuit is so congested that there’s no “clean” bandwidth available (just constant packet loss, heavy latency, or bad jitter), then we’ll move your traffic off that circuit using our Intelligent Load Balancing. But for most situations Dynamic QoS is a game-changing feature that enables effective use of over-the-top services like VoIP and VDI across the public internet.