What ChatGPT doesn’t say about Kubernetes in production

Generative AI is already proving helpful across many relatively basic use cases, but how does it hold up when tasked with more technical guidance?


When ChatGPT was publicly released, we, like many technology organizations, wanted to compare its answers to those of a regular web search. We experimented by asking technical questions and requesting specific content. Not all answers were efficient or correct, but our team appreciated the ability to provide feedback to improve responses.

We then got more specific and asked ChatGPT for advice on using Kubernetes. ChatGPT provided a list of 12 best practices for Kubernetes in production, and most of them were correct and relevant. But when asked to expand that list to 50 best practices, it quickly became clear that the human element remains extremely valuable.

How we use Kubernetes

As background, JFrog has run its entire platform on Kubernetes for more than six years, utilizing managed Kubernetes services from major cloud providers including AWS, Azure, and Google Cloud. We operate in more than 30 regions globally, each with multiple Kubernetes clusters.

In our case, Kubernetes is primarily used to run workloads and runtime tasks rather than storage. For storage, we use managed databases and object storage services from the cloud providers. The Kubernetes infrastructure consists of thousands of nodes, and the number dynamically scales up or down based on auto-scaling configurations.

JFrog's production environment includes hundreds of thousands of pods, the smallest unit of deployment in Kubernetes. The exact number fluctuates as pods are created or terminated; there are currently around 300,000 pods running globally in our production setup, which is a substantial workload to manage.

We frequently release new application versions, patches, and bug fixes. We’ve implemented a built-in system to roll out these updates, including proper canary testing before full deployment, allowing us to maintain a continuous release cycle and ensure service stability.

As most who have used the service know, ChatGPT clearly displays a disclaimer that the data it’s based on isn’t completely up-to-date. Knowing that and considering the above backdrop to illustrate our needs, here are 10 things ChatGPT won’t tell you about managing Kubernetes in production (until OpenAI updates its data and algorithms, that is).

Node sizing is an art

Node sizing involves finding a balance between using smaller nodes to reduce “blast radius” and using larger nodes for better application performance. The key is to use different node types based on workload requirements, such as CPU or memory optimization. Adjusting container resources to match the CPU-to-memory ratio of the nodes optimizes resource utilization.
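To make the CPU-to-memory ratio point concrete, here is a minimal arithmetic sketch. The node size and the container requests are illustrative assumptions, and the helper `pods_per_node` is hypothetical; it also ignores the capacity a real node reserves for the kubelet and system daemons.

```python
# Hypothetical helper: how many copies of a container fit on a node, and
# how much of each resource is left stranded once the node is "full".
# Simplification: uses raw node capacity, not allocatable capacity.

def pods_per_node(node_cpu_m, node_mem_mi, pod_cpu_m, pod_mem_mi):
    """Return (pod_count, stranded_cpu_m, stranded_mem_mi) for one node."""
    by_cpu = node_cpu_m // pod_cpu_m
    by_mem = node_mem_mi // pod_mem_mi
    count = min(by_cpu, by_mem)
    return count, node_cpu_m - count * pod_cpu_m, node_mem_mi - count * pod_mem_mi

# Illustrative node: 16 vCPU (16000m) and 64 GiB (65536 Mi), a 1:4 ratio.
NODE_CPU_M, NODE_MEM_MI = 16000, 65536

# Requests of 1 vCPU : 2 GiB (1:2) do not match the node, so memory is stranded.
mismatched = pods_per_node(NODE_CPU_M, NODE_MEM_MI, 1000, 2048)

# Requests of 500m : 2 GiB (1:4) match the node, so nothing is stranded.
matched = pods_per_node(NODE_CPU_M, NODE_MEM_MI, 500, 2048)

print("mismatched:", mismatched)
print("matched:   ", matched)
```

In the mismatched case the node runs out of CPU after 16 pods while half of its memory sits idle; with matching ratios the same node hosts 32 pods and both resources are fully used.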

That said, finding the right number of pods per node is a balancing act, considering the varying resource consumption patterns of each application or service. Spreading the load across nodes using techniques like pod topology spread constraints or pod anti-affinity to optimize resource usage helps accommodate shifting workload intensities. Load balancing and load spreading are vital for larger enterprises using Kubernetes-based cloud services.
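As an illustration of pod topology spread constraints, the sketch below builds the relevant fragment of a pod spec as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The `app: web` label and the helper name `spread_constraint` are assumptions made for the example; the field names and the two topology keys are standard Kubernetes ones.

```python
import json

def spread_constraint(app_label, topology_key, max_skew=1):
    """Build one entry for a pod spec's topologySpreadConstraints list."""
    return {
        "maxSkew": max_skew,                    # max allowed pod-count difference
        "topologyKey": topology_key,            # node label that defines a domain
        "whenUnsatisfiable": "ScheduleAnyway",  # prefer spreading, do not block
        "labelSelector": {"matchLabels": {"app": app_label}},
    }

# Spread the hypothetical "web" pods across zones first, then across nodes.
pod_spec_fragment = {
    "topologySpreadConstraints": [
        spread_constraint("web", "topology.kubernetes.io/zone"),
        spread_constraint("web", "kubernetes.io/hostname"),
    ]
}

print(json.dumps(pod_spec_fragment, indent=2))
```

`ScheduleAnyway` treats the constraint as a preference; switching to `DoNotSchedule` makes it a hard rule, at the cost of pods staying pending when the skew cannot be satisfied.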

How to protect the control plane

Monitoring the Kubernetes control plane is crucial, even in managed Kubernetes services. Cloud providers operate the control plane for you, but their offerings still have limits that you need to be aware of. Put monitoring and alerting in place to ensure the control plane performs well; a slow control plane can significantly affect cluster behavior, including scheduling, upgrades, and scaling operations.

Overuse of the managed control plane can lead to a catastrophic crash. Many of us have been there, and it serves as a reminder that control planes can become overwhelmed if they’re not properly monitored and managed.

How to maintain application uptime

Prioritizing critical services optimizes application uptime. Pod priorities and quality of service classes identify high-priority applications that need to run at all times; understanding priority levels informs the optimization of stability and performance.
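The sketch below shows the two mechanisms side by side: a PriorityClass manifest, and a simplified version of the rule Kubernetes uses to assign a quality of service class. The class name, priority value, and the `qos_class` helper are assumptions for illustration. The helper compares quantities as plain strings, so it would treat `500m` and `0.5` as different, which the real API server does not.

```python
import json

# A PriorityClass that critical workloads reference via priorityClassName.
# The name and value are illustrative.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "business-critical"},
    "value": 1000000,
    "globalDefault": False,
    "description": "Pods that must keep running under resource pressure.",
}

def qos_class(containers):
    """Simplified QoS classification for a list of container specs."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        if limits or requests:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            request = requests.get(name, limit)  # requests default to limits
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

critical = [{"name": "api", "resources": {
    "requests": {"cpu": "500m", "memory": "1Gi"},
    "limits": {"cpu": "500m", "memory": "1Gi"}}}]
bursty = [{"name": "batch", "resources": {"requests": {"cpu": "250m"}}}]
unbounded = [{"name": "scratch"}]

print(json.dumps(priority_class, indent=2))
print(qos_class(critical), qos_class(bursty), qos_class(unbounded))
```

Under node memory pressure, BestEffort pods are the first candidates for eviction and Guaranteed pods the last, so setting requests equal to limits on critical services complements a high priority value.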

Meanwhile, pod anti-affinity prevents multiple replicas of the same service from being deployed on the same node. This avoids a single point of failure, meaning if one node experiences issues, other replicas won’t be affected.
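A minimal sketch of such a rule follows, again built as a Python dictionary that can be dumped to JSON. The `app: web` label and the helper name `required_anti_affinity` are assumptions; the field names are the standard ones for a pod spec's `affinity` section.

```python
import json

def required_anti_affinity(app_label, topology_key="kubernetes.io/hostname"):
    """Build an affinity section that keeps same-labeled pods on separate nodes."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": topology_key,
                }
            ]
        }
    }

affinity = required_anti_affinity("web")
print(json.dumps({"affinity": affinity}, indent=2))
```

Because this is a required rule, a deployment with more replicas than eligible nodes will leave the extra pods pending; the `preferredDuringSchedulingIgnoredDuringExecution` form is the softer alternative.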

You should also embrace the practice of creating dedicated node pools for mission-critical applications. For example, a separate node pool for ingress pods and other important services like Prometheus can significantly improve service stability and the end-user experience.
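One common way to dedicate a node pool is to label and taint its nodes, then give the intended pods a matching node selector and toleration. The sketch below assumes a hypothetical label and taint key, `example.com/node-pool`, and a hypothetical helper `pin_to_pool`; the image name is a placeholder.

```python
import json

# Hypothetical key, assumed to be applied to the pool's nodes both as a
# label and as a NoSchedule taint (example.com/node-pool=<pool>:NoSchedule).
POOL_KEY = "example.com/node-pool"

def pin_to_pool(pod_spec, pool):
    """Return a copy of pod_spec that only schedules onto the given pool."""
    pinned = dict(pod_spec)
    # nodeSelector steers the pod toward the pool's nodes.
    pinned["nodeSelector"] = {**pod_spec.get("nodeSelector", {}), POOL_KEY: pool}
    # The toleration lets the pod past the taint that keeps other pods out.
    pinned["tolerations"] = list(pod_spec.get("tolerations", [])) + [
        {"key": POOL_KEY, "operator": "Equal", "value": pool, "effect": "NoSchedule"}
    ]
    return pinned

base_spec = {"containers": [{"name": "ingress", "image": "registry.example.com/ingress:1.0"}]}
ingress_spec = pin_to_pool(base_spec, "ingress")
print(json.dumps(ingress_spec, indent=2))
```

The taint keeps general workloads off the pool, while the node selector keeps the critical pods on it; using only one of the two gives only half of the isolation.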

You need to plan to scale

Is your organization prepared to handle double its current number of deployments, and the capacity growth that requires, without any negative impact? Cluster auto-scaling in managed services can help on this front, but it’s important to understand cluster size limits. For us, a typical cluster is around 100 nodes; if that limit is reached, we spin up another cluster instead of forcing the existing one to grow.
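The capacity question reduces to simple arithmetic once a per-cluster node cap is fixed. The sketch below uses the roughly 100-node figure mentioned above as the cap; the starting node count of 250 and the helper name `clusters_needed` are illustrative assumptions.

```python
import math

def clusters_needed(total_nodes, max_nodes_per_cluster=100):
    """Number of clusters required if no cluster may exceed the node cap."""
    return math.ceil(total_nodes / max_nodes_per_cluster)

current_nodes = 250                 # illustrative figure for one region
doubled_nodes = current_nodes * 2   # the "double the deployments" scenario

print(clusters_needed(current_nodes), "clusters today")
print(clusters_needed(doubled_nodes), "clusters after doubling")
```

With these numbers, doubling the load means going from three clusters to five, which is the kind of step change worth planning for before the auto-scaler reaches the cap.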

Application scaling, both vertical and horizontal, should also be considered. The key is to find the right balance to better utilize resources without overconsumption. Horizontal scaling and replicating or duplicating workloads is generally preferable, with the caveat that it could impact database connections and storage.
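For horizontal scaling, the Kubernetes Horizontal Pod Autoscaler documents its core rule as desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule and then checks the database-connection caveat; the pool size and connection limit are hypothetical numbers, and `desired_replicas` and `fits_database` are illustrative helper names.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count suggested by the documented HPA ratio rule."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do not scale
    return math.ceil(current_replicas * ratio)

def fits_database(replicas, pool_size_per_replica, max_db_connections):
    """True if every replica can open its full connection pool."""
    return replicas * pool_size_per_replica <= max_db_connections

# Four replicas at 90% average CPU with a 60% target.
scaled = desired_replicas(4, 90, 60)
print("replicas:", scaled)

# Hypothetical limits: 20 connections per replica, 100 allowed by the database.
print("database ok:", fits_database(scaled, 20, 100))
```

Here the autoscaler would ask for six replicas, which would need 120 connections against a limit of 100, so the scale-out succeeds at the pod level while the sixth replica starves at the database.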

You also need to plan to fail

Planning for failures has become a way of life across various aspects of application infrastructure. To make sure you’re prepared, develop playbooks to address different failure scenarios such as application failures, node failures, and cluster failures. Implementing strategies like high-availability application pods and pod anti-affinity helps ensure coverage in case of failures.

Every organization needs a detailed disaster recovery plan for cluster failures, and they should also practice that plan periodically. When recovering from failures, controlled and gradual deployment helps to avoid overwhelming resources.
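One way to picture a controlled, gradual recovery is to bring workloads back in batches that start small and grow, so a problem surfaces before everything is redeployed. The sketch below is a hypothetical planning helper, `recovery_batches`, with made-up service names; it only computes the batches and does not talk to a cluster.

```python
def recovery_batches(workloads, first_batch=1, growth=2):
    """Split workloads into batches whose size grows by `growth` each step."""
    batches = []
    size = first_batch
    index = 0
    while index < len(workloads):
        batches.append(workloads[index:index + size])
        index += size
        size *= growth
    return batches

# Hypothetical services, ordered from most to least critical.
services = ["ingress", "auth", "api", "search", "reports", "billing", "batch"]

for step, batch in enumerate(recovery_batches(services), start=1):
    print(f"step {step}: {batch}")
```

Between steps, a playbook would typically wait for health checks and for load on shared dependencies to settle before releasing the next batch.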

How to secure your delivery pipeline

The software supply chain is continuously vulnerable to errors and malicious actors. You need control over each step of the pipeline. By the same token, you must resist relying on external tools and providers without carefully considering their trustworthiness.

Maintaining control over external sources involves measures such as scanning binaries that originate from remote repositories and validating them with a software composition analysis (SCA) solution. Teams should also apply quality and security gates throughout the pipeline to build trust, both among users and within the pipeline itself, and to raise the quality of the delivered software.

How to secure your runtime

Using admission controllers to enforce rules, such as blocking the deployment of blacklisted versions, helps secure your Kubernetes runtime. Tools such as OPA Gatekeeper help enforce policies like allowing only controlled container registries for deployments.
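The decision such a policy makes can be sketched in a few lines. This is plain Python showing the logic only; it is not OPA Gatekeeper's Rego language or a working admission webhook, and the registry name, blocked image, and function names are assumptions. The registry detection follows the usual image reference convention: the first path segment is a registry host if it contains a dot or a colon or is `localhost`, otherwise the image comes from Docker Hub.

```python
# Hypothetical policy data.
ALLOWED_REGISTRIES = {"registry.example.com"}
BLOCKED_IMAGES = {"registry.example.com/team/app:1.0.3"}

def image_registry(image):
    """Return the registry host of an image reference (docker.io if implicit)."""
    first, separator, _ = image.partition("/")
    if separator and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"

def admit(image):
    """Return (allowed, reason) for one container image."""
    if image in BLOCKED_IMAGES:
        return False, "image version is on the block list"
    registry = image_registry(image)
    if registry not in ALLOWED_REGISTRIES:
        return False, f"registry {registry} is not allowed"
    return True, "ok"

for candidate in (
    "registry.example.com/team/app:1.0.4",
    "registry.example.com/team/app:1.0.3",
    "nginx:latest",
):
    print(candidate, "->", admit(candidate))
```

A real policy would evaluate every container and init container in the pod and would usually match blocked versions by digest as well as by tag, since tags can be moved.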

Role-based access control is also recommended for securing access to Kubernetes clusters, and other runtime protection solutions can identify and address risks in real time. Namespace isolation and network policies help block lateral movement and protect workloads within namespaces. You may also consider running critical applications on isolated nodes to mitigate the risk of container escape scenarios.
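A common starting point for namespace isolation is a default-deny NetworkPolicy plus an explicit rule that allows traffic between pods in the same namespace. The sketch below builds both manifests as Python dictionaries; the namespace and policy names are illustrative, and the policies only take effect on clusters whose network plugin enforces NetworkPolicy.

```python
import json

def default_deny(namespace):
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches every pod
            "policyTypes": ["Ingress", "Egress"],
        },
    }

def allow_same_namespace(namespace):
    """Allow ingress only from pods that live in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }

policies = [default_deny("payments"), allow_same_namespace("payments")]
print(json.dumps(policies, indent=2))
```

Note that denying all egress also blocks DNS lookups, so a practical setup adds a further egress rule for the cluster's DNS service and for any external dependencies.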

How to secure your environment

Securing your environment means assuming that the network is always under attack. Auditing tools are recommended to detect suspicious activities in the clusters and infrastructure, as are runtime protections with full visibility and workload controls.

Best-of-breed tools are great, but a strong incident response team with a clear playbook in place is required in case of alerts or suspicious activities. Similar to disaster recovery, regular drills and practices should be conducted. Many organizations also offer bug bounties, or employ external researchers who attempt to compromise the system to uncover vulnerabilities. The external perspective and objective research can provide valuable insights.

Continuous learning is a must

As systems and processes evolve, teams should embrace continuous learning by collecting historical performance data to evaluate and apply action items. Look for small, continuous improvements; what was relevant in the past may not be relevant anymore.

Proactively monitoring performance data can help identify a memory or CPU leak in one of your services or a performance bug in a third-party tool. By actively evaluating data for trends and abnormalities, you can improve the understanding and performance of your system. This proactive monitoring and evaluation leads to more effective results than reacting to real-time alerts.
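As a small example of trend evaluation, the sketch below fits a least-squares line to evenly spaced memory samples and flags a service whose usage keeps climbing. The sample values, the 5 MiB-per-sample threshold, and the function names are assumptions; in practice the samples would come from your metrics store, and a steady slope is only a hint that needs a human look.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

def looks_like_leak(samples, threshold_per_sample=5.0):
    """Flag a series whose fitted growth exceeds the threshold."""
    return slope(samples) > threshold_per_sample

# Hypothetical hourly memory readings in MiB for two services.
steady = [512, 520, 509, 515, 511, 514]
climbing = [512, 530, 551, 569, 590, 611]

print("steady slope:  ", round(slope(steady), 2), looks_like_leak(steady))
print("climbing slope:", round(slope(climbing), 2), looks_like_leak(climbing))
```

A check like this runs on historical data, so it can surface a slow leak days before the service would hit its memory limit and trigger a real-time alert.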

Humans are the weakest link

Automation where possible minimizes human involvement, and sometimes that’s a good thing—humans are the weakest link when it comes to security. Explore a range of available automation solutions and find the best match for your individual processes and definitions.

GitOps is a popular approach to introduce changes from development to production, providing a well-known contract and interface for managing configuration changes. A similar approach uses multiple repositories for different types of configurations, but it’s vital to maintain a clear separation between development, staging, and production environments, even though they should be similar to each other.

Looking to the future

AI-powered solutions hold promise for the future because they help to alleviate operational complexity and they automate tasks related to managing environments, deployments, and troubleshooting. Even so, human judgment is irreplaceable and should always be taken into account.

Today’s AI engines rely on public knowledge, which may contain inaccurate, outdated, or irrelevant information, ultimately leading to incorrect answers or recommendations. Using common sense and remaining mindful of the limitations of AI is paramount.

Stephen Chin is VP of developer relations at JFrog, chair of the CDF governing board, member of the CNCF governing board, and author of The Definitive Guide to Modern Client Development, Raspberry Pi with Java, Pro JavaFX Platform, and the upcoming DevOps Tools for Java Developers title from O’Reilly. He has keynoted numerous conferences around the world including swampUP, Devoxx, JNation, JavaOne, Joker, and Open Source India. Stephen is an avid motorcyclist who has done evangelism tours in Europe, Japan, and Brazil, interviewing hackers in their natural habitat. When he is not traveling, he enjoys teaching kids how to do embedded and robot programming together with his teenage daughter.

Generative AI Insights provides a venue for technology leaders—including vendors and other third parties—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

Copyright © 2023 IDG Communications, Inc.