April 4, 2024
What is K8s Cost Observability and how to achieve it
5 min read
Kubernetes (K8s) has become the predominant choice for compute in modern technology stacks. With compute accounting for a significant portion of cloud costs, typically between 30% and 50%, businesses are turning towards cost optimization and more efficient adoption of Kubernetes within their organizations.
There are clear advantages that act as a catalyst for this adoption. K8s plays a crucial role in enabling scalable, fault-tolerant, and standardized deployment and management workflows for engineering teams. However, adopting it also introduces complexity and management overhead, which can result in configuration challenges, unexpected costs, and security vulnerabilities if not properly addressed.
Kubernetes cost observability acts as a bridge, mitigating these configuration and cost challenges by enabling effective management of expenditure and configuration. It provides a 360-degree view of existing workloads, combining cost optimization with performance management.
Components of a cluster: Setting the stage
A Kubernetes cluster has various components managed partly by cloud providers and partly by platform teams. Essential cluster assets include:
Nodes (or virtual machines): Physical or virtual machines in a Kubernetes cluster where containers are deployed and managed.
Storage (block storage or network file systems): Persistent storage solutions used by Kubernetes pods for data persistence, including block storage like AWS EBS, or network file systems like NFS.
Network gateways/load balancers: Components that manage traffic routing and load balancing within a Kubernetes cluster, directing requests to appropriate services and nodes.
Network ingress/egress services: Kubernetes resources that control inbound and outbound network traffic to and from the cluster, managing access to services from external networks and governing traffic leaving the cluster.
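As an illustration, the sketch below enumerates these billable assets using the official Kubernetes Python client. It is a minimal sketch, assuming a local kubeconfig with read access to the cluster; the node label and service type used are standard Kubernetes conventions rather than anything provider-specific.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
core = client.CoreV1Api()

# Nodes: the physical or virtual machines the cluster (and the bill) runs on
for node in core.list_node().items:
    labels = node.metadata.labels or {}
    print("node:", node.metadata.name,
          "instance-type:", labels.get("node.kubernetes.io/instance-type", "unknown"))

# Persistent volumes: block storage or network file systems backing the pods
for pv in core.list_persistent_volume().items:
    print("volume:", pv.metadata.name, "capacity:", pv.spec.capacity.get("storage"))

# LoadBalancer services: the network gateways / load balancers fronting workloads
for svc in core.list_service_for_all_namespaces().items:
    if svc.spec.type == "LoadBalancer":
        print("load balancer:", f"{svc.metadata.namespace}/{svc.metadata.name}")
```

An inventory like this is the starting point for mapping cluster assets back to the line items in the cloud bill.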
Additionally, Kubernetes entails overheads such as cluster management costs, especially notable when utilizing managed Kubernetes services like Amazon EKS, Azure Kubernetes Service (AKS) or Google Kubernetes Engine (GKE). Other ancillary services include:
Container registries, with prominent examples being Amazon ECR, Azure Container Registry, and Google Container Registry
Logging and metrics stores such as Amazon CloudWatch and Azure Monitor
Security and uptime monitoring services, which are also integral to Kubernetes operations
Cluster shared services
Within the cluster are services, workloads, and operators used by all the workloads hosted on the cluster. These components include the Kubernetes management plane (e.g., kube-system), data collectors such as Prometheus, and a variety of operators and workloads like istio-system and argo-cd, which ensure efficient management and operation of services and workloads.
Key ingredients to achieving cost observability on K8s
To achieve cost observability, a melange of configurations, services, and metrics needs to come together to provide granular visibility into K8s workloads. Each of these has its own role to play in the broader cost optimization exercise; enabling them requires some effort from teams, but it pays off significantly over the long run.
Cluster Configuration Data: Understanding the configuration details of the cluster, such as the types and categories of virtual machines (VMs), their architecture, and the nature of their pricing, i.e., whether they are reserved, spot, or on-demand instances.
Cost Reports for Cluster Assets and Ancillary Services: The AWS Cost and Usage Report (CUR), Azure Cost Management + Billing, and Cloud Billing reports in GCP provide insight into billing- and usage-related metrics across cloud platforms.
Usage Metrics of the Cluster: Metrics provided by services like kube-state-metrics offer data points related to the control and data plane of the Kubernetes cluster (see the sketch after this list).
Usage Metrics of Ancillary Services: Metrics related to network ingress/egress, management services, and associated costs.
Distribution Metrics for Shared Costs: Allocation of shared costs to different workloads using heuristics, which may range from simple proportional allocation to more sophisticated telemetry-based methods.
Standard Metadata of Workloads: Tags, labels, ownership information, and other metadata associated with workloads enable granular cost attribution and allocation.
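To make the usage-metrics ingredient concrete, here is a minimal sketch that pulls per-namespace requested vs. used CPU out of Prometheus, assuming it already scrapes kube-state-metrics and the kubelet/cAdvisor endpoints; the Prometheus URL is a placeholder for your own environment.

```python
import requests

PROMETHEUS = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster address

QUERIES = {
    # CPU cores requested by pods, per namespace (from kube-state-metrics)
    "cpu_requested": 'sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})',
    # CPU cores actually consumed over the last 5 minutes, per namespace (from cAdvisor)
    "cpu_used": 'sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))',
}

def instant_query(expr: str) -> dict:
    """Run an instant PromQL query and return a {namespace: value} mapping."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=30)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("namespace", "unknown"): float(r["value"][1]) for r in results}

requested = instant_query(QUERIES["cpu_requested"])
used = instant_query(QUERIES["cpu_used"])
for ns in sorted(requested):
    print(f"{ns}: requested={requested[ns]:.2f} cores, used={used.get(ns, 0.0):.2f} cores")
```

Joined with per-node pricing from the cluster configuration data and the cost reports, numbers like these are what turn raw billing data into per-workload costs.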
The 'secret' recipe to achieving cloud cost observability
While on the surface cloud cost observability might seem easy, the nuances of the moving parts involved add layers of complexity. Effectively executing a cost optimization exercise is often a herculean task for teams, and successfully navigating it requires aggregating and disaggregating cost data using these ingredients so that the true cost of each workload, namespace, etc. can be understood.
This process involves ingesting, standardizing, and transforming data streams, tagging them with appropriate business metadata, distributing shared costs fairly, and managing variable cost allocation heuristics intelligently.
Note: A good primer to start thinking about these heuristics is to build on the incredible template created by OpenCost.
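As a minimal illustration of one such heuristic, the sketch below spreads a shared cost across tenant namespaces in proportion to their direct costs; the dollar figures are made up, and tools like OpenCost support far more sophisticated allocation methods.

```python
def allocate_shared_costs(direct_costs: dict, shared_cost: float) -> dict:
    """Spread `shared_cost` across namespaces in proportion to their direct costs."""
    total_direct = sum(direct_costs.values())
    return {
        ns: cost + shared_cost * (cost / total_direct)
        for ns, cost in direct_costs.items()
    }

# Hypothetical monthly figures: direct cost per tenant namespace, plus the cost
# of shared services (kube-system, Prometheus, istio-system, argo-cd, ...)
direct = {"payments": 420.0, "search": 180.0, "batch-jobs": 600.0}
shared = 150.0

for ns, total in allocate_shared_costs(direct, shared).items():
    print(f"{ns}: ${total:.2f}/month")
```

Proportional allocation is the simplest defensible default; telemetry-based methods refine it by weighting namespaces by their actual consumption of the shared services.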
For businesses getting started, platforms like Amnic aid in automating this process and injecting intelligence, resulting in a granular cluster cost stream that forms the basis of cost observability.
Measuring cluster cost performance using the cluster cost stream
A quick glance at provisioned vs. requested vs. used resources can provide insight into the overall efficiency of the cluster, and this is where platforms like Amnic play a significant role. They offer quantitative measures of efficiency, enabling tracking over time across the organization and facilitating an iterative approach to effective cluster management. Engineering teams can leverage this data to reconfigure, retool, or re-architect their services to achieve better cost scalability.
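As a rough sketch of the kind of efficiency ratios such a platform might compute, the snippet below derives allocation, utilization, and overall efficiency from provisioned, requested, and used CPU; the input figures are placeholders.

```python
# Placeholder figures; in practice, provisioned capacity comes from the node
# inventory and requested/used figures from queries like the ones shown earlier.
provisioned_cores = 64.0   # total CPU across all nodes in the cluster
requested_cores = 40.0     # sum of pod CPU requests
used_cores = 18.0          # measured CPU consumption

allocation_efficiency = requested_cores / provisioned_cores   # how full the cluster is on paper
utilization_efficiency = used_cores / requested_cores         # how honest the requests are
overall_efficiency = used_cores / provisioned_cores           # what is actually used per provisioned core

print(f"allocation efficiency:  {allocation_efficiency:.0%}")
print(f"utilization efficiency: {utilization_efficiency:.0%}")
print(f"overall efficiency:     {overall_efficiency:.0%}")
```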
The analysis of cluster cost streams also provides meaningful data for engineering teams, who can fine-tune the configuration of their services and the cluster. By extracting insights from this process, cloud operations teams can fully exploit the elasticity of Kubernetes and the cloud ecosystem. Tools like the Horizontal Pod Autoscaler (HPA) and Karpenter can swiftly translate these insights into tangible benefits.
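For instance, a utilization insight might be turned into an HPA like the one sketched below, applied here with the official Kubernetes Python client; the deployment name, namespace, replica bounds, and 70% CPU target are placeholders rather than recommendations.

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical HPA: scale the `payments-api` Deployment between 2 and 10 replicas,
# targeting 70% average CPU utilization.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "payments-api", "namespace": "payments"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "payments-api"},
        "minReplicas": 2,
        "maxReplicas": 10,
        "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 70}},
        }],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(namespace="payments", body=hpa)
```

The same manifest could equally be applied with kubectl; the point is that the cost stream tells you where such levers are worth pulling.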
This data can then be used by cloud operations teams for the continuous optimization of their clusters, whether through manual intervention, semi-automated processes, or 'auto-magically'.
Analyzing Resource Utilization: It's essential to understand how your organization uses its cloud resources. This involves analyzing data usage patterns and identifying areas where resources are underutilized or wasted.
Right-Sizing Resources: Right-sizing involves matching the capacity of your resources to workload demand. This prevents over-provisioning (and therefore overspending) and under-provisioning, which could lead to performance issues (a sketch of one such heuristic follows below).
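One possible right-sizing heuristic, sketched below, compares each workload's CPU request against its observed p95 usage and suggests a new request with some headroom; the usage figures are placeholders and would in practice come from your metrics store.

```python
HEADROOM = 1.2  # keep roughly 20% above observed p95 usage

# Hypothetical workloads: (current CPU request in cores, observed p95 usage in cores)
workloads = {
    "payments-api": (2.0, 0.6),
    "search-indexer": (4.0, 3.5),
    "report-cron": (1.0, 0.05),
}

for name, (requested, p95_usage) in workloads.items():
    suggested = round(p95_usage * HEADROOM, 2)
    if suggested < requested:
        print(f"{name}: over-provisioned, consider lowering the request from {requested} to {suggested} cores")
    elif suggested > requested:
        print(f"{name}: under-provisioned, consider raising the request from {requested} to {suggested} cores")
    else:
        print(f"{name}: request looks about right ({requested} cores)")
```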
No better time to start your cluster cost observability than now
Cluster cost observability is not a trivial problem; however, if addressed effectively, it can lead to substantial benefits for the business and promote accountability within engineering organizations regarding the cost of their products.
This often leads to a predicament: build vs. buy. The first step is to gain a comprehensive view of the suite of solutions available in the market and their features. The choice is not as black and white as fully managed versus fully in-house. Rather than ceding control to third-party cost optimization platforms that operate ‘auto-magically’, engineering teams have the option of agentless platforms such as Amnic. These platforms enable the generation of cost data streams while maintaining complete control over how optimization measures are implemented. Given the rapid advancements in the open-source ecosystem around bin-packing, auto-scalers, and provisioners like Karpenter, organizations can strike the balance appropriate to their specific requirements.
The journey to successfully understanding K8s costs requires multiple checkpoints across infrastructure, cost optimization, and workflow management. However, given the array of solutions in the market, getting started should not take teams more than a few minutes.
If you’d like to try a solution, click here to see what Amnic has to offer.