Cassandra troubleshooting guide

This topic discusses steps you can take to troubleshoot and fix problems with the Cassandra datastore. Cassandra is a persistent datastore that runs in the cassandra component of the hybrid runtime architecture. See also Runtime service configuration overview.

Cassandra pods are stuck in the Pending state

Symptom

When starting up, the Cassandra pods remain in the Pending state.

Error message

When you use kubectl to view the pod states, you see that one or more Cassandra pods are stuck in the Pending state. The Pending state indicates that Kubernetes is unable to schedule the pod on a node: the pod cannot be created. For example:

kubectl get pods -n NAMESPACE

NAME                                     READY   STATUS      RESTARTS   AGE
adah-resources-install-4762w             0/4     Completed   0          10m
apigee-cassandra-default-0               0/1     Pending     0          10m
...
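
In a long listing it can help to pick out only the pending pods. The following is a minimal sketch, not part of the product tooling: the helper name `pending_pods` is hypothetical, and it assumes the default `kubectl get pods` column layout in which STATUS is the third column.

```shell
# pending_pods: read `kubectl get pods` output on stdin and print the
# names of pods whose STATUS column (third field) is "Pending".
# Hypothetical helper; assumes the default column layout shown above.
pending_pods() {
  # NR > 1 skips the header row.
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}
```

A possible use is `kubectl get pods -n NAMESPACE | pending_pods`. Alternatively, `kubectl get pods -n NAMESPACE --field-selector=status.phase=Pending` asks the API server to do the filtering.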

Possible causes

A pod stuck in the Pending state can have multiple causes. For example:

Cause	Description
Insufficient resources	There is not enough CPU or memory available to create the pod.
Volume not created	The pod is waiting for the persistent volume to be created.
Missing Amazon EBS CSI driver	For EKS installations, the required Amazon EBS CSI driver is not installed.

Diagnosis

Use kubectl to describe the pod to determine the source of the error:

kubectl -n NAMESPACE describe pods POD_NAME

For example:

kubectl describe pods apigee-cassandra-default-0 -n apigee

The output may show one of these possible problems:

If the problem is insufficient resources, you will see a Warning message that indicates insufficient CPU or memory.
If the error message indicates that the pod has unbound immediate PersistentVolumeClaims (PVCs), the persistent volume for the pod could not be created.
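
The two checks above can be sketched as a small filter over the describe output. This is an illustration only: the helper name `classify_pending` and the labels it prints are hypothetical, and it recognizes only the two messages discussed in this section.

```shell
# classify_pending: read `kubectl describe pods` output on stdin and print
# a short label for the first known scheduling problem found.
# Hypothetical helper; any other cause falls through to "unknown".
classify_pending() {
  input=$(cat)
  if printf '%s\n' "$input" | grep -qiE 'Insufficient (cpu|memory)'; then
    echo "insufficient-resources"
  elif printf '%s\n' "$input" | grep -qi 'unbound immediate PersistentVolumeClaims'; then
    echo "volume-not-created"
  else
    echo "unknown"
  fi
}
```

For example: `kubectl -n NAMESPACE describe pods POD_NAME | classify_pending`.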

Resolution

Insufficient resources

Modify the Cassandra node pool so that it has sufficient CPU and memory resources. See Resizing a node pool for details.

Persistent volume not created

If you determine a persistent volume issue, describe the PersistentVolumeClaim (PVC) to determine why it is not being created:

List the PVCs in the cluster:

kubectl -n NAMESPACE get pvc

NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-apigee-cassandra-default-0   Bound    pvc-b247faae-0a2b-11ea-867b-42010a80006e   10Gi       RWO            standard       15m
...

Describe the PVC for the pod that is failing. For example, the following command describes the PVC bound to the pod apigee-cassandra-default-0:

kubectl -n apigee describe pvc cassandra-data-apigee-cassandra-default-0

Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  3m (x143 over 5h)  persistentvolume-controller  storageclass.storage.k8s.io "apigee-sc" not found

Note that in this example, the StorageClass named apigee-sc does not exist. To resolve this problem, create the missing StorageClass in the cluster, as explained in Change the default StorageClass.
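
To pull the missing StorageClass name out of the events, a filter like the following may help. It is a sketch under assumptions: the helper name `missing_storageclass` is hypothetical, and it matches only the "not found" message format shown in the example above.

```shell
# missing_storageclass: read `kubectl describe pvc` output on stdin and
# print the name of each StorageClass reported as "not found".
# Hypothetical helper; relies on the event message format shown above.
missing_storageclass() {
  sed -n 's/.*storageclass\.storage\.k8s\.io "\([^"]*\)" not found.*/\1/p' | sort -u
}
```

You can compare the result with the classes that actually exist in the cluster, which `kubectl get storageclass` lists.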

Missing Amazon EBS CSI driver

If the hybrid instance is running on an EKS cluster, make sure the EKS cluster is using the Amazon EBS container storage interface (CSI) driver. See Amazon EBS CSI migration frequently asked questions for details.
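
One way to check for the driver is to look for its CSIDriver object in the cluster. The sketch below assumes the driver registers under the name `ebs.csi.aws.com`; the helper name `has_ebs_csi_driver` is hypothetical.

```shell
# has_ebs_csi_driver: read `kubectl get csidriver` output on stdin and
# succeed (exit status 0) only if a driver named ebs.csi.aws.com is listed.
# Hypothetical helper; assumes the driver name is in the first column.
has_ebs_csi_driver() {
  awk 'NR > 1 && $1 == "ebs.csi.aws.com" { found = 1 } END { exit !found }'
}
```

For example: `kubectl get csidriver | has_ebs_csi_driver && echo "driver present"`.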

Cassandra pods are stuck in the CrashLoopBackOff state

Symptom

When starting up, the Cassandra pods remain in the CrashLoopBackOff state.

Error message

When you use kubectl to view the pod states, you see that one or more Cassandra pods are in the CrashLoopBackOff state. This state indicates that the container in the pod starts and then exits repeatedly, so Kubernetes waits progressively longer before each restart. For example:

kubectl get pods -n NAMESPACE

NAME                                     READY   STATUS            RESTARTS   AGE
adah-resources-install-4762w             0/4     Completed         0          10m
apigee-cassandra-default-0               0/1     CrashLoopBackOff  0          10m
...

Possible causes

A pod stuck in the CrashLoopBackOff state can have multiple causes. For example:

Cause	Description
Data center differs from previous data center	This error indicates that the Cassandra pod has a persistent volume that has data from a previous cluster, and the new pods are not able to join the old cluster. This usually happens when stale persistent volumes persist from the previous Cassandra cluster on the same Kubernetes node. This problem can occur if you delete and recreate Cassandra in the cluster.
Kubernetes upgrade	A Kubernetes upgrade may affect the Cassandra cluster. This can happen when the Anthos worker nodes hosting the Cassandra pods are upgraded to a new OS version.

Diagnosis

Check the Cassandra error log to determine the cause of the problem.

List the pods to get the ID of the Cassandra pod that is failing:
```
kubectl get pods -n NAMESPACE
```
Check the failing pod's log:
```
kubectl logs POD_ID -n NAMESPACE
```

Resolution

Look for the following clues in the pod's log:

Data center differs from previous data center

If you see this log message:

Cannot start node if snitch's data center (us-east1) differs from previous data center

Check whether there are any stale or old PVCs in the cluster and delete them.

If this is a fresh install, delete all the PVCs and re-try the setup. For example:

kubectl -n NAMESPACE get pvc
kubectl -n NAMESPACE delete pvc cassandra-data-apigee-cassandra-default-0
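
Before deleting anything, it can be useful to list only the Cassandra data PVCs for review. The following sketch assumes the PVC names start with `cassandra-data-`, as in the example above; the helper name `cassandra_pvcs` is hypothetical.

```shell
# cassandra_pvcs: read `kubectl get pvc` output on stdin and print the
# names of PVCs whose name starts with "cassandra-data-".
# Hypothetical helper; it only lists names and deletes nothing.
cassandra_pvcs() {
  # NR > 1 skips the header row.
  awk 'NR > 1 && $1 ~ /^cassandra-data-/ { print $1 }'
}
```

For example: `kubectl -n NAMESPACE get pvc | cassandra_pvcs`. Review the list before you pass any name to `kubectl delete pvc`, because deleting a PVC can remove the data stored on its volume.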

Anthos upgrade changes security settings

Check the Cassandra logs for this error message:

/opt/apigee/run.sh: line 68: ulimit: max locked memory: 
  cannot modify limit: Operation not permitted

If the hybrid instance is multi-region, decommission the impacted hybrid instance and re-expand into the impacted region.
If the hybrid instance is single-region, perform a rolling restart on each Cassandra pod in the hybrid instance.
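
The two log signatures described in this section can be checked with a small filter. This is a sketch only: the helper name `classify_crash_log` and its labels are hypothetical, and it knows about just the two messages shown above.

```shell
# classify_crash_log: read a Cassandra pod log on stdin and print a label
# for the first known error signature found.
# Hypothetical helper; any other error falls through to "unknown".
classify_crash_log() {
  input=$(cat)
  if printf '%s\n' "$input" | grep -q 'differs from previous data center'; then
    echo "stale-data-center"
  elif printf '%s\n' "$input" | grep -q 'ulimit: max locked memory'; then
    echo "ulimit-not-permitted"
  else
    echo "unknown"
  fi
}
```

For example: `kubectl logs POD_ID -n NAMESPACE | classify_crash_log`.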

Create a client container for debugging

This section explains how to create a client container from which you can access Cassandra debugging utilities such as cqlsh. These utilities allow you to query Cassandra tables and can be useful for debugging purposes.

Create the client container

To create the client container, follow these steps:

The container must use the TLS certificate from the apigee-cassandra-user-setup pod. This is stored as a Kubernetes secret. Fetch the name of the secret that stores this certificate:
```
kubectl get secrets -n apigee --field-selector type=kubernetes.io/tls | grep apigee-cassandra-user-setup | awk '{print $1}'
```
This command returns the name of the secret. For example: apigee-cassandra-user-setup-rg-hybrid-b7d3b9c-tls. You will use this below in the secretName field in the YAML file.

Open a new file and paste the following pod spec into it:

apiVersion: v1
kind: Pod
metadata:
  labels:
  name: CASSANDRA_CLIENT_NAME   # For example: my-cassandra-client
  namespace: apigee
spec:
  containers:
  - name: CASSANDRA_CLIENT_NAME
    image: "gcr.io/apigee-release/hybrid/apigee-hybrid-cassandra-client:YOUR_APIGEE_HYBRID_VERSION" # For example, 1.10.4.
    imagePullPolicy: Always
    command:
    - sleep
    - "3600"
    env:
    - name: CASSANDRA_SEEDS
      value: apigee-cassandra-default.apigee.svc.cluster.local
    - name: APIGEE_DML_USER
      valueFrom:
        secretKeyRef:
          key: dml.user
          name: apigee-datastore-default-creds
    - name: APIGEE_DML_PASSWORD
      valueFrom:
        secretKeyRef:
          key: dml.password
          name: apigee-datastore-default-creds
    volumeMounts:
    - mountPath: /opt/apigee/ssl
      name: tls-volume
      readOnly: true
  volumes:
  - name: tls-volume
    secret:
      defaultMode: 420
      secretName: YOUR_SECRET_NAME    # For example: apigee-cassandra-user-setup-rg-hybrid-b7d3b9c-tls
  restartPolicy: Never

Save the file with a .yaml extension. For example: my-spec.yaml.

Apply the spec to your cluster:

kubectl apply -f YOUR_SPEC_FILE.yaml -n apigee

Connect to the container:

kubectl exec -n apigee CASSANDRA_CLIENT_NAME -it -- bash

Connect to the Cassandra cqlsh interface with the following command. Enter the command exactly as shown:
```
cqlsh ${CASSANDRA_SEEDS} -u ${APIGEE_DML_USER} -p ${APIGEE_DML_PASSWORD} --ssl
```
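
If you create client pods often, the placeholders in the pod spec can be filled in with a small filter instead of by hand. The following is a sketch under assumptions: the helper name `render_client_spec` is hypothetical, and the values you pass must not contain the characters `|` or `&`, which have special meaning to sed in this position.

```shell
# render_client_spec NAME VERSION SECRET: read the pod spec template on
# stdin and replace the three placeholders used in this section.
# Hypothetical helper; argument values must not contain "|" or "&".
render_client_spec() {
  sed -e "s|CASSANDRA_CLIENT_NAME|$1|g" \
      -e "s|YOUR_APIGEE_HYBRID_VERSION|$2|g" \
      -e "s|YOUR_SECRET_NAME|$3|g"
}
```

For example, assuming the unedited spec is saved as template.yaml: `render_client_spec my-cassandra-client 1.10.4 SECRET_NAME < template.yaml > my-spec.yaml`.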

Deleting the client pod

Use this command to delete the Cassandra client pod:

kubectl delete pods -n apigee CASSANDRA_CLIENT_NAME

Misconfigured region expansion: all Cassandra nodes under one datacenter

This situation occurs in a multi-region expansion on GKE and GKE on-prem (Anthos) platforms. Avoid creating all your Cassandra nodes in the same datacenter.

Symptom

Cassandra nodes fail to create in the datacenter for the second region.

Error message

failed to rebuild from dc-1: java.lang.RuntimeException : Error while rebuilding node: Stream failed

Resolution

Repair the misconfigured region expansion with the following steps:

Update the Cassandra replicaCount to 1 in the overrides.yaml file for the second datacenter. For example:
```
cassandra:
  . . .
  replicaCount: 1
```
Apply the setting with apigeectl apply:
```
$APIGEECTL_HOME/apigeectl apply -f 2ND_DATACENTER_OVERRIDES.yaml
```
Use kubectl exec to access the remaining Cassandra pod with the following command:
```
kubectl exec -it -n apigee apigee-cassandra-default-0 -- /bin/bash
```

Decommission the remaining Cassandra pod with the following command:

nodetool -u CASSANDRA_DB_USER -pw CASSANDRA_DB_PASSWORD decommission

Delete the Cassandra pods from the second datacenter using apigeectl delete with the --datastore argument. For example:
```
$APIGEECTL_HOME/apigeectl delete -f 2ND_DATACENTER_OVERRIDES.yaml --datastore
```
Change your Kubernetes context to the cluster for your first datacenter:
```
kubectl config use-context FIRST_DATACENTER_CLUSTER
```
Verify there are no Cassandra nodes in a down state in the first datacenter:
```
nodetool -u CASSANDRA_DB_USER -pw CASSANDRA_DB_PASSWORD status
```

Verify the misconfigured Cassandra nodes (intended for the second datacenter) have been removed from the first datacenter. Make sure the IP addresses that are displayed in the nodetool status output are only the IP addresses for the Cassandra pods intended for your first datacenter. For example, in the following output the IP address 10.100.0.39 should be for a pod in your first datacenter.

kubectl exec -it -n apigee apigee-cassandra-default-0 -- /bin/bash
nodetool -u CASSANDRA_DB_USER -pw CASSANDRA_DB_PASSWORD status

  Datacenter: dc-1
  ================
  Status=U/D (Up/Down) | State=N/L/J/M (Normal/Leaving/Joining/Moving)
  --  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
  UN  10.100.0.39  4.21 MiB  256     100.0%            a0b1c2d3-e4f5-6a7b-8c9d-0e1f2a3b4c5d  ra-1

Verify that the overrides.yaml file for the second datacenter contains the datacenter name setting under the cassandra section. For example:
```
cassandra:
  datacenter: DATA_CENTER_2
  rack: "RACK_NAME" # "ra-1" is the default value.
  . . .
```
Update the cassandra:replicaCount setting in the overrides.yaml file for the second datacenter to the desired number. For example:
```
cassandra:
  datacenter: DATA_CENTER_2
  . . .
  replicaCount: 3
```
Note: The value of cassandra:replicaCount must be a multiple of 3. Use the same value for replicaCount that you specified for your first datacenter.
Apply the overrides.yaml file for the second datacenter with the --datastore argument. For example:
```
$APIGEECTL_HOME/apigeectl apply -f 2ND_DATACENTER_OVERRIDES.yaml --datastore
```
Use kubectl exec to access one of the new Cassandra pods in the second datacenter and verify there are two datacenters:
```
nodetool -u CASSANDRA_DB_USER -pw CASSANDRA_DB_PASSWORD status
```
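
To confirm the number of datacenters without reading the whole output, the datacenter header lines can be extracted. This is a minimal sketch; the helper name `datacenters` is hypothetical, and it assumes each datacenter section begins with a `Datacenter: NAME` line.

```shell
# datacenters: read `nodetool status` output on stdin and print the name
# of each datacenter, taken from the "Datacenter: NAME" header lines.
# Hypothetical helper.
datacenters() {
  awk '$1 == "Datacenter:" { print $2 }'
}
```

After a successful expansion, piping the nodetool status output through `datacenters` should print two names, one for each datacenter.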

Additional resources

See Introduction to Apigee X and Apigee hybrid playbooks.