
Troubleshooting Guide for Kubernetes

This guide provides systematic instructions for diagnosing and resolving common alerts and issues encountered within Kubernetes environments. It covers problems related to Pod statuses and resource utilization, offering step-by-step diagnostic procedures and potential solutions.

1. Pod Status Issues

1.1 Pod in Pending State

When a pod remains in the Pending state, it means Kubernetes is unable to schedule it onto a node. This can happen due to resource constraints, storage availability issues, or node selector/affinity requirements that cannot be satisfied. The Pending state indicates that the Kubernetes scheduler is aware of the pod but hasn't been able to place it on a suitable node in the cluster.

Diagnosis:

  1. Check pod events:
    Bash
    kubectl describe pod <pod-name> -n <namespace>
    

Common Causes and Solutions:

  • Insufficient Resources: If no node has enough unreserved CPU or memory to satisfy the pod's requests.
    • Check cluster capacity: kubectl describe nodes.
    • Consider scaling up the cluster or reducing pod resource requests.
  • Storage Issues: If the pod is waiting for a PersistentVolumeClaim to be provisioned or bound.
    • Verify PVC status: kubectl get pvc -n <namespace>.
    • Check storage class availability.
  • Node Selector/Affinity: If the pod's node selector or affinity rules do not match any available node.
    • Review the pod's node selector and affinity rules.
    • Adjust node labels if needed, as shown in the example below.
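
For example, to inspect scheduling events and node labels, and to add a missing label (a minimal sketch using placeholder names):

    Bash
    # Show recent scheduling events for the pod
    kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
    # List node labels to compare against the pod's nodeSelector/affinity rules
    kubectl get nodes --show-labels
    # Add a label to a node so the selector can match
    kubectl label nodes <node-name> <key>=<value>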

1.2 Pod in Error State

When a pod is in the Error state, it indicates that the pod has failed to start correctly or has encountered a critical issue during runtime. This could be due to application errors, missing dependencies, configuration problems, or insufficient permissions. Unlike Pending state, an Error state means the pod was scheduled to a node but couldn't start or run properly.

Diagnosis:

  1. Check pod logs:
    Bash
    kubectl logs <pod-name> -n <namespace>
    

Common Solutions:

  • Review application logs for specific errors.
  • Check container configuration.
  • Verify environment variables and secrets (see the example after this list).
  • Ensure all required dependencies are available.
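
As an example of verifying that the configuration a pod references actually exists (placeholder names; substitute the ConfigMaps and Secrets your pod uses):

    Bash
    # List the environment variable definitions declared on the pod's containers
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].env}'
    # Confirm that the referenced Secret and ConfigMap exist in the namespace
    kubectl get secret <secret-name> -n <namespace>
    kubectl get configmap <configmap-name> -n <namespace>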

1.3 Pod in CrashLoopBackOff State

A pod in CrashLoopBackOff state indicates that it is repeatedly crashing after starting, and Kubernetes is enforcing a progressively longer delay between restart attempts. This usually points to a recurring application failure that prevents the container from running stably. Common causes include application errors during startup, misconfiguration, memory problems, or dependency issues that cause the application to exit shortly after initializing.

Diagnosis:

  1. Check recent pod logs:
    Bash
    kubectl logs <pod-name> -n <namespace> --previous
    

Common Solutions:

  • Review application startup logs and the container's last exit code (see the example after this list).
  • Check resource limits and requests.
  • Verify configuration files and environment variables.
  • Ensure all required services are accessible.
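
A quick way to see why the container last exited (a sketch assuming a single-container pod, hence index 0):

    Bash
    # Exit code and reason of the last terminated container instance
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
    # Current restart count
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].restartCount}'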

2. Resource Utilization Issues

2.1 High Cluster CPU Utilization

High cluster CPU utilization occurs when the overall CPU usage across all nodes approaches maximum capacity. This can lead to performance degradation, scheduling delays, and potential service disruptions. When CPU usage is consistently high, it may indicate that the cluster needs additional resources or that workloads need to be optimized.

Diagnosis:

  1. Check current CPU usage:
    Bash
    kubectl top nodes
    

Recommended Actions:

  • Scale up the cluster by adding more nodes.
  • Review and optimize resource requests or limits.
  • Identify and address CPU-intensive workloads (see the example after this list).
  • Consider horizontal pod autoscaling.
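
To find the heaviest CPU consumers and enable basic horizontal autoscaling (illustrative thresholds and replica counts; tune them for your workload):

    Bash
    # Highest CPU consumers across all namespaces
    kubectl top pods --all-namespaces --sort-by=cpu | head -20
    # Autoscale a deployment between 2 and 5 replicas at 70% CPU
    kubectl autoscale deployment <deployment-name> -n <namespace> --cpu-percent=70 --min=2 --max=5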

2.2 High Cluster Memory Utilization

High cluster memory utilization happens when the collective memory usage across all nodes nears maximum capacity. This can result in pods being evicted, OOM (Out of Memory) kills, and degraded cluster performance. Sustained high memory usage indicates either a need for additional cluster resources or the presence of memory-intensive applications that may require optimization.

Diagnosis:

  1. Check memory usage:
    Bash
    kubectl top nodes
    

Recommended Actions:

  • Scale up cluster nodes.
  • Review memory requests/limits.
  • Check for memory leaks in applications.
  • Consider implementing memory limits.
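
To identify the heaviest memory consumers and check for recent OOM activity (a minimal sketch):

    Bash
    # Highest memory consumers across all namespaces
    kubectl top pods --all-namespaces --sort-by=memory | head -20
    # Look for OOM-related events
    kubectl get events --all-namespaces | grep -i oom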

2.3 High Pod CPU Utilization

High pod CPU utilization occurs when a specific pod consumes excessive CPU resources, potentially affecting its performance and the performance of other workloads on the same node. This may be due to application inefficiencies, unexpected workload spikes, or inadequate resource allocation for the pod's requirements.

Diagnosis:

  1. Check pod CPU usage:
    Bash
    kubectl top pods -n <namespace>
    

Recommended Actions:

  • Increase the pod's CPU requests/limits:
    • Using Privacera Manager: Set the variables in custom-vars using the format <SERVICE_NAME>_K8S_CPU_REQUEST and <SERVICE_NAME>_K8S_CPU_LIMIT. Example: PORTAL_K8S_CPU_LIMIT or SOLR_K8S_CPU_LIMITS.
  • Implement horizontal pod autoscaling.
  • Optimize application performance.
  • Consider distributing load across more pods.
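
To see which container inside the pod is consuming the CPU before adjusting requests/limits or adding replicas:

    Bash
    # Per-container CPU and memory usage for pods in the namespace
    kubectl top pods -n <namespace> --containers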

2.4 High Pod Memory Utilization

High pod memory utilization happens when a specific pod's memory consumption approaches or exceeds its allocated limits. This can lead to the pod being terminated by the OOM killer, causing application disruptions and restart cycles. It typically indicates either memory leaks, inadequate memory allocation, or unexpected application behavior.

Diagnosis:

  1. Check pod memory usage:
    Bash
    kubectl top pods -n <namespace>
    

Recommended Actions:

  • Increase the pod's memory requests/limits:
    • Using Privacera Manager: Set the variables in custom-vars using the format <SERVICE_NAME>_K8S_MEM_REQUEST and <SERVICE_NAME>_K8S_MEM_LIMIT. Example: PORTAL_K8S_MEM_LIMIT or SOLR_K8S_MEM_LIMITS.
  • Check for memory leaks.
  • Optimize application memory usage.
  • Ensure memory requests and limits are set for every container in the pod.
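
To confirm whether the container was terminated by the OOM killer rather than exiting for another reason (assuming a single-container pod, hence index 0):

    Bash
    # Prints "OOMKilled" if the last termination was due to memory
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'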

Finding Service-Specific Variable Names

To locate the correct variable names for a specific service's resource configuration:

  1. Navigate to the service's Kubernetes template directory:

    Bash
    ~/privacera/privacera-manager/ansible/privacera-docker/roles/templates/<SERVICE_NAME>/kubernetes
    
    For example, for the Portal service:
    Bash
    ~/privacera/privacera-manager/ansible/privacera-docker/roles/templates/portal/kubernetes
    

  2. In this directory, you'll find deployment or statefulset template files that contain the service's resource configuration variables.

  3. Look for variables following these naming patterns:

    • CPU configuration: <SERVICE_NAME>_K8S_CPU_REQUEST and <SERVICE_NAME>_K8S_CPU_LIMIT
    • Memory configuration: <SERVICE_NAME>_K8S_MEM_REQUEST and <SERVICE_NAME>_K8S_MEM_LIMIT
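
A quick way to surface these variables is to grep the service's template directory, assuming the templates reference them by the naming patterns above (Portal shown as an example):

    Bash
    grep -rE "K8S_(CPU|MEM)_(REQUEST|LIMIT)" ~/privacera/privacera-manager/ansible/privacera-docker/roles/templates/portal/kubernetes/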

2.5 High Node Storage Utilization

High node storage utilization occurs when the local storage (typically used for the container runtime, logs, and ephemeral storage) on a node is nearing capacity. This can prevent new containers from being created, cause image pull failures, and lead to node instability. Common causes include accumulated container images, large application logs, or extensive ephemeral storage usage by pods.

Diagnosis:

  1. Check the ephemeral-storage allocation on the nodes:
    Bash
    kubectl describe nodes | grep -A 5 "Allocated resources"
    

Recommended Actions:

  • Identify which images are currently in use before removing unused ones: kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" | tr -s '[[:space:]]' '\n' | sort | uniq -c (an example cleanup sequence follows this list).
  • Implement image garbage collection.
  • Consider increasing node storage.
  • Review and clean up old logs.
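
To check whether a node is reporting disk pressure, and to reclaim space from unused images (the cleanup commands must be run on the affected node, and which one applies depends on the container runtime):

    Bash
    # Node conditions, including DiskPressure
    kubectl describe nodes | grep -i -A 1 "DiskPressure"
    # On the node: prune unused images (containerd/CRI-O via crictl, or Docker)
    crictl rmi --prune
    docker image prune -a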

2.6 High PVC Utilization

High Persistent Volume Claim (PVC) utilization indicates that the storage allocated to a persistent volume is approaching capacity. This can lead to application failures when the storage becomes full, preventing writes to the volume. Applications using databases, message queues, or those that generate significant data are particularly vulnerable to PVC space constraints.

Diagnosis:

  1. Check PVC status and capacity:
    Bash
    kubectl get pvc -n <namespace>
    

Recommended Actions:

  • Clean up unnecessary data.
  • Increase the PVC size if the storage class supports volume expansion (see the example after this list).
  • Implement data retention policies.
  • Consider implementing storage quotas.
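
Because kubectl get pvc reports requested capacity rather than actual consumption, usage can be checked from a pod that mounts the volume, and the claim can be expanded if the storage class allows it (placeholder names and size):

    Bash
    # Filesystem usage as seen inside a pod that mounts the volume
    kubectl exec <pod-name> -n <namespace> -- df -h
    # Request a larger size (only works when the storage class has allowVolumeExpansion: true)
    kubectl patch pvc <pvc-name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'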

General Troubleshooting Tips

  1. Always start by checking pod events and logs.
  2. Verify resource requests and limits.
  3. Ensure all required services and dependencies are running.
  4. Check network policies and service connectivity.
  5. Review application configuration.
  6. For cluster-related issues, consult your Kubernetes Cluster Administrator.
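
As a quick first pass for most of the issues above, recent events and pod placement in the affected namespace are usually the fastest things to check:

    Bash
    # Recent events, oldest first
    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    # Pod status with node placement
    kubectl get pods -n <namespace> -o wide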
