Kubernetes Application Resilience

Eliminate single points of failure, fast and cost-effectively

Were any of your Kubernetes applications affected by recent AWS public cloud outages? Did you lose revenue? Did you lose customers? Many developers were angry and disappointed. Even some of the best application developers, including Netflix and Slack, were affected. (See publicized outages: https://downdetector.com/insights/ )

 

Why do outages like these affect small and large organizations alike, despite their best attempts to be fault tolerant? There are many reasons, but the most common problem is that Kubernetes clusters, the cloud instances or VPCs in which they reside, and the regions where the VPCs are deployed are all single points of failure. Yes, Kubernetes control and data planes are designed to be fault tolerant, but even multiple master and worker nodes distributed across multiple availability zones are not protected against large failures common to all of them, such as a regional outage or a cluster-wide misconfiguration.

Kubernetes can survive the failure of one or several compute nodes, but it cannot survive the failure or misconfiguration of a whole cluster, cloud region, or data center. Kubernetes isn’t magic, and is still governed by the ancient unbreakable law, “You can’t rely on X to protect against X failing”. In other words, you can’t rely on a Kubernetes cluster to protect against a Kubernetes cluster failure. Likewise, you can’t rely on a cloud region to protect against a cloud region failure. But don’t despair. Almost as old as that unbreakable law is a time-tested solution, “To protect X, make a backup, and use the backup when X breaks”. This is also called 1:1 redundancy.
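As a rough illustration of why a backup helps, the short sketch below computes the chance that at least one of several copies is up. It assumes the copies fail independently of each other, which is only approximately true in practice and is the reason for placing the backup in a different region or provider. The function name and the 99% figure are illustrative assumptions, not measured values.

```python
def combined_availability(single: float, copies: int = 2) -> float:
    """Probability that at least one of `copies` replicas is available.

    Assumes each replica is up with probability `single` and that
    failures are statistically independent (an idealization).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical example: each cluster is available 99% of the time.
    print(f"one cluster:  {combined_availability(0.99, 1):.4%}")
    print(f"two clusters: {combined_availability(0.99, 2):.4%}")
```

Under these assumptions, two clusters that are each up 99% of the time are together up roughly 99.99% of the time. If both clusters share a failure cause, such as the same region, the real figure is lower.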

CHALLENGES

How can one set up 1:1 redundancy with Kubernetes clusters?

Step-1: Create a copy of the ‘primary’ K8s cluster, called ‘backup’.  ‘Backup’ will be in a different region, maybe even in a different cloud provider than ‘primary’ to reduce the chances that they fail at the same time.

Step-2: When someone tries to access ‘primary’ and can’t, for any reason, send them to ‘backup’.
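The idea behind Step-2 can be sketched in a few lines of Python: try the primary endpoint first and fall back to the backup when the primary is unreachable. This is a minimal client-side sketch, not how Kubernetes or any particular product implements failover. The function name, the `request` callable, and the endpoint names in the usage example are all hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], T]) -> T:
    """Call `request` on each endpoint in order; return the first success.

    `request` is a caller-supplied function (hypothetical here) that
    raises ConnectionError or TimeoutError when an endpoint is unreachable.
    """
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except (ConnectionError, TimeoutError) as err:
            # This endpoint is unreachable: remember why, then try the next one.
            last_error = err
    raise ConnectionError(f"all endpoints failed: {list(endpoints)}") from last_error


if __name__ == "__main__":
    def fake_request(endpoint: str) -> str:
        # Stand-in transport for the demo: pretend the primary cluster is down.
        if endpoint.startswith("primary"):
            raise ConnectionError(f"{endpoint} unreachable")
        return f"response from {endpoint}"

    print(call_with_failover(["primary.example.internal", "backup.example.internal"], fake_request))
```

The ordering of `endpoints` is the policy: listing ‘primary’ first means ‘backup’ is used only when ‘primary’ fails. In a real deployment this logic usually lives in the network or proxy layer rather than in application code.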

But let’s dig a little deeper. Who or what is the ‘someone’ trying to access Kubernetes? Often, that someone is a microservice: one container (a pod, actually) trying to access another container within the Kubernetes cluster. So when any container fails to reach a container in the same cluster, it should try to reach that container in the other (e.g. backup) cluster. That sounds straightforward and pragmatic. The problem is that Kubernetes doesn’t natively support this inter-cluster microservice connectivity. More fundamentally, your cloud VPCs in different regions or providers don’t even know how to reach each other. Even worse, you are very likely using the internet to connect the two clouds, so you also have to figure out how to secure and encrypt microservice communication data in flight. Kubernetes does not do this for you. Some of these challenges are:

  • Primary and backup clouds must be connected by a network, which requires IP gateway and router configuration.

  • Those networks must be secured, which requires firewall configuration for each microservice, and knowledge of source and destination IP addresses.

  • Microservice container communication must be encrypted, which requires key/cert creation, distribution, and encryption/decryption.

  • Traffic management and global application load balancing must be done to ensure that requests are sent to working containers, which requires keep-alive monitoring.

  • Kubernetes API Gateway Ingresses must be programmed to route incoming requests.
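To make the keep-alive monitoring in the list above more concrete, here is a minimal Python sketch of a monitor that probes each endpoint and steers traffic away from one that has failed several probes in a row. The class name, the `probe` callable, and the threshold of three failures are assumptions chosen for illustration. Real load balancers add probe timeouts, probe intervals, and recovery delays.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class KeepAliveMonitor:
    """Tracks consecutive probe failures per endpoint (illustrative only)."""

    probe: Callable[[str], bool]   # returns True when the endpoint answers
    endpoints: List[str]           # in order of preference, e.g. primary first
    failure_threshold: int = 3     # consecutive misses before marking unhealthy
    _failures: Dict[str, int] = field(default_factory=dict)

    def poll(self) -> None:
        """Probe every endpoint once and update its failure count."""
        for endpoint in self.endpoints:
            if self.probe(endpoint):
                self._failures[endpoint] = 0  # any success resets the count
            else:
                self._failures[endpoint] = self._failures.get(endpoint, 0) + 1

    def is_healthy(self, endpoint: str) -> bool:
        return self._failures.get(endpoint, 0) < self.failure_threshold

    def pick(self) -> Optional[str]:
        """Return the first healthy endpoint, or None if all are unhealthy."""
        for endpoint in self.endpoints:
            if self.is_healthy(endpoint):
                return endpoint
        return None
```

Requiring several consecutive failures before switching avoids flapping between clusters on a single dropped probe, at the cost of a slower reaction to a real outage.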

SOLUTION 

To solve those challenges, you might consider integrating several traditional methods and tools, such as:

  • Multi-Cloud Network (MCN)

  • Application Service Mesh

  • API Gateway or Kubernetes Ingress

 

These new components are shown in red, in the diagram below:

However, doing it this way means that you have to integrate and support the solution over the long term. Your DevOps team is unlikely to be able to do this alone, since it doesn’t control site or cloud networking and security. Even if it did, this would take multidisciplinary teams, an integration effort, and likely more time and money than you have.

Well, there is good news!


Introducing Nethopper!  Nethopper Multi-Cloud Application Network (MAN) allows you to easily distribute and securely connect your existing applications and services across multiple clouds. A MAN is a pre-integrated solution that solves all of the challenges mentioned above in a single tool. MAN is made for Application Ops teams, so you don’t need to involve or integrate with other teams and tools to make it work. Even better, Nethopper is offered as a Service, which is easy to use.  

You can see in the diagram below how simple the Nethopper service is compared to the traditional approach described above:

Nethopper works with all data centers, cloud networks, and Kubernetes.

 

How does this work?

  1. Register for Nethopper MAN as a Service (at www.nethopper.io)

  2. Create your first MAN, and add both the ‘primary’ site and the ‘backup’ site

  3. Deploy copies of all your containers to the backup cluster (do this yourself, or use Nethopper to do it).

  4. Create a Multi-Cloud service for each ‘responding’ container

  5. Control and monitor your microservices communication from a single dashboard

Done! Now, if any container or microservice is unreachable, Nethopper will balance the communication over to a working container in the other cluster. A MAN solves the Kubernetes limitation of connecting microservice communication between clusters. Nethopper also enforces security by creating and distributing keys and by encrypting and decrypting all microservice communication. And your app developers don’t have to change a single line of code. Now you are truly protected against cloud VPC, region, and provider failures.

 

BENEFITS

Simple, No-Code Kubernetes Cluster Resilience 

  • Active-Active 1:1 or N:1 protection

  • No extra software to develop, integrate or support

  • Ops team self-service

  • Centralized control and monitoring (single pane of glass)

 

Built-In Security

  • Secure your data and microservices communication

  • Authentication and mTLS encryption

 

Fast and Cost-Effective

  • Simplicity = Speed

  • Lower OpEx and Cloud costs

  • Pays for itself in as little as 2 weeks (ROI)

 

SUMMARY

Cloud instances, Kubernetes clusters, regions, and cloud providers are all single points of failure that can, and do, take applications down. You can eliminate these single points of failure by protecting your microservices. A Multi-Cloud Application Network can help you do it in the simplest, fastest, most secure, and lowest-cost way possible.