Error Accessing Datastore

Symptom

Deployment of API proxy revisions via the Edge UI or an Edge management API call fails with the error "Error while accessing datastore".

Error Messages

Error in deployment for environment qa.

The revision is deployed, but traffic cannot flow. Error while accessing datastore;Please retry later

Possible Causes

The typical causes for this issue, all of which apply to Edge Private Cloud users, are:

  1. Network Connectivity Issue between Message Processor and Cassandra: communication failure between the Message Processor and Cassandra due to network connectivity issues or firewall rules.
  2. Deployment errors due to Cassandra restarts: Cassandra node(s) were unavailable because they were restarted as part of routine maintenance.
  3. Spike in read request latency on Cassandra: if the Cassandra node(s) are performing a large number of concurrent reads, they may respond slowly due to a spike in read request latency.
  4. API Proxy Bundle larger than 15MB: Cassandra has been configured to not allow API proxy bundles larger than 15MB in size.

Network Connectivity Issue between Message Processor and Cassandra

Diagnosis

Note: Only Edge Private Cloud users can perform the following steps. If you are on Edge Public Cloud, contact Apigee Support.

  1. Undeploy and redeploy the API proxy. If there was a temporary connectivity issue between the Message Processor and Cassandra, the error might go away.

    WARNING: Don't undeploy if the errors are seen in the Production environment.

  2. If the problem persists, execute the following management API call to check the deployment status and check whether there are errors on any components:
    curl -u sysadmin@email.com https://management:8080/v1/o/<org>/apis/<api>/deployments

    Sample deployment status output showing "Error while accessing datastore" on one of the Message Processors:

    {
      "environment" : [ {
        "aPIProxy" : [ {
          "name" : "simple-python",
          "revision" : [ {
            "configuration" : {
              "basePath" : "/",
              "steps" : [ ]
            },
            "name" : "1",
            "server" : [ {
              "status" : "deployed",
              "type" : [ "message-processor" ],
              "uUID" : "2acdd9b2-17de-4fbb-8827-8a2d4f3d7ada"
            }, {
              "error" : "Error while accessing datastore;Please retry later",
              "errorCode" : "datastore.ErrorWhileAccessingDataStore",
              "status" : "error",
              "type" : [ "message-processor" ],
              "uUID" : "42772085-ca67-49bf-a9f1-c04f2dc1fce3"
            } ],
            "state" : "error"
          } ]
        } ]
      } ]
    }

  3. Restart the Message Processor(s) that show the deployment error. If there was a temporary network issue, the error should go away:
    /opt/apigee/apigee-service/bin/apigee-service edge-message-processor restart

  4. Repeat step #2 to see if the deployment succeeds on the Message Processor that was restarted. If no errors are found, the issue is resolved.
  5. Check whether the Message Processor can connect to each Cassandra node on ports 9042 and 9160:
    1. If telnet is available, use telnet:
      telnet <Cassandra_IP> 9042
      telnet <Cassandra_IP> 9160

    2. If telnet is not available, use netcat to check the connectivity as follows:
      nc -vz <Cassandra_IP> 9042
      nc -vz <Cassandra_IP> 9160

    3. If you get the response "Connection refused" or "Connection timed out", engage your network operations team.
  6. If the problem persists, check whether each Cassandra node is listening on ports 9042 and 9160:
    netstat -an | grep LISTEN | grep 9042
    netstat -an | grep LISTEN | grep 9160

  7. If a Cassandra node is not listening on port 9042 or 9160, restart that node:
    /opt/apigee/apigee-service/bin/apigee-service apigee-cassandra restart

  8. If the problem persists, engage your network operations team.
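The port checks in steps 5 through 7 can be scripted so they run against every Cassandra node in one pass. This is a minimal sketch, assuming bash with the coreutils timeout command; it uses bash's built-in /dev/tcp so it works even where telnet and nc are unavailable, and CASSANDRA_NODES is a placeholder you must set to your own node IPs:

```shell
#!/bin/bash
# Sketch: check connectivity from a Message Processor to Cassandra on the
# CQL (9042) and Thrift (9160) ports.
# CASSANDRA_NODES is a placeholder: set it to your own node IPs,
# e.g. CASSANDRA_NODES="10.0.0.1 10.0.0.2 10.0.0.3"
CASSANDRA_NODES="${CASSANDRA_NODES:-}"

check_port() {
  local host="$1" port="$2"
  # timeout guards against firewalls that silently drop packets
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK   ${host}:${port}"
  else
    echo "FAIL ${host}:${port} (connection refused or timed out)"
    return 1
  fi
}

for node in $CASSANDRA_NODES; do
  for port in 9042 9160; do
    check_port "$node" "$port" || true
  done
done
```

Any FAIL line here corresponds to the "Connection refused" / "Connection timed out" responses in step 5 and means the network operations team should be engaged.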

Resolution

Work with your network operations team to fix the network connectivity issue between the Message Processor and Cassandra.

Deployment errors due to Cassandra restarts

Cassandra nodes are usually restarted periodically as part of routine maintenance. If API proxies are deployed during the Cassandra maintenance work, then the deployments fail because the Cassandra datastore is inaccessible.

Note: Only Edge Private Cloud users can perform the following steps. If you are on Edge Public Cloud, contact Apigee Support.

Diagnosis

  1. Check whether the Cassandra nodes were restarted at the time of the deployment. This can be done by checking the Cassandra log, or the most recent startup time, of each Cassandra node:

    grep "shutdown" /opt/apigee/var/log/apigee-cassandra/system.log
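To correlate a failed deployment with a restart, grep for both the shutdown marker and the subsequent startup marker and compare their timestamps with the deployment time. A minimal sketch; the log lines below are illustrative samples, not output from a real node (on a node, grep /opt/apigee/var/log/apigee-cassandra/system.log directly):

```shell
# Illustrative sample of Cassandra system.log entries around a restart.
# On a real node, grep /opt/apigee/var/log/apigee-cassandra/system.log.
cat > /tmp/sample-system.log <<'EOF'
INFO  [StorageServiceShutdownHook] 2016-03-23 18:40:02,113 Announcing shutdown
INFO  [main] 2016-03-23 18:41:10,527 Starting listening for CQL clients on /10.0.0.1:9042
EOF

# Shutdown and startup markers; if the failed deployment's timestamp falls
# between them, the Cassandra restart is the likely cause.
grep -E "shutdown|Starting listening for CQL clients" /tmp/sample-system.log
```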

Resolution

  1. Ensure Cassandra is up and running.
  2. Check whether the Message Processors are able to connect to the Cassandra datastore on ports 9042 and 9160.

Spike in read request latency on Cassandra

The number of reads on Cassandra depends on the individual use cases and traffic patterns of the proxies that contain policies requiring read access to Cassandra.

For example, if the refresh_token grant type is used with OAuth policies, and the refresh token is associated with many access tokens, then this may result in a high number of reads from Cassandra. This can increase the read request latency on Cassandra.

Diagnosis

Note: Only Edge Private Cloud users can perform the following steps. If you are on Edge Public Cloud, contact Apigee Support.

  1. If you have installed the Beta Monitoring dashboard, open the Cassandra dashboard and review the "Read Requests" and "Read Request Latencies" charts for the period of the problem.
  2. An alternative way to check the read requests and read latencies is the nodetool cfstats command. See the Cassandra documentation for details on how to use this command.
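If you capture the nodetool cfstats output to a file, the read figures can be pulled out quickly with grep. A minimal sketch; the sample output below is illustrative, not from a real cluster:

```shell
# Illustrative nodetool cfstats output; on a real Cassandra node capture
# it with:  nodetool cfstats > /tmp/cfstats-sample.txt
cat > /tmp/cfstats-sample.txt <<'EOF'
		Read Count: 1048576
		Read Latency: 42.317 ms.
		Write Count: 524288
		Write Latency: 0.871 ms.
EOF

# Extract the read figures; a persistently high read latency here points
# at the spike described above.
grep -E "Read (Count|Latency)" /tmp/cfstats-sample.txt
```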

Resolution

Note: Only Edge Private Cloud users can perform the following steps. If you are on Edge Public Cloud, contact Apigee Support.

  1. Try the deployment again once Cassandra performance is back to normal. Make sure the entire Cassandra ring is normal.
  2. (Optional) Do a rolling restart on Message Processors to be sure connectivity is established.
  3. For a long-term solution, review the API traffic patterns that could contribute to higher reads on the Cassandra datastore. Contact Apigee Support for assistance in troubleshooting this issue.
  4. If the existing Cassandra node(s) are not adequate to handle the incoming traffic, then either increase the hardware capacity or the number of the Cassandra datastore nodes appropriately.

API Proxy Bundle larger than 15MB

The size of an API proxy bundle is restricted to 15MB on Cassandra. If an API proxy bundle is larger than 15MB, you will see the error "Error while accessing datastore" when you attempt to deploy it.
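Since the limit applies to the uploaded bundle, you can check a locally built bundle's size before deploying. A minimal sketch; BUNDLE is a placeholder that defaults to a generated stand-in file, so point it at your real apiproxy zip:

```shell
# BUNDLE is a placeholder; point it at your real API proxy bundle zip.
BUNDLE="${BUNDLE:-/tmp/apiproxy.zip}"
# Generate a 1MB stand-in file if no bundle is present (illustration only).
[ -f "$BUNDLE" ] || head -c 1048576 /dev/zero > "$BUNDLE"

LIMIT=$((15 * 1024 * 1024))     # the 15MB cap enforced on Cassandra
SIZE=$(wc -c < "$BUNDLE")

if [ "$SIZE" -gt "$LIMIT" ]; then
  echo "bundle is $SIZE bytes: over the 15MB limit, deployment will fail"
else
  echo "bundle is $SIZE bytes: within the 15MB limit"
fi
```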

Diagnosis

Note: Only Edge Private Cloud users can perform the following steps. If you are on Edge Public Cloud, contact Apigee Support.

  1. Check the Message Processor logs (/opt/apigee/var/log/edge-message-processor/logs/system.log) for any errors that occurred during deployment of the specific API proxy.
  2. If you see an error similar to the one shown below, then the deployment failed because the API proxy bundle size is greater than 15MB.
    2016-03-23 18:42:18,517 main ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.fetchDynamicCompositeColumns() : Error while querying columnfamily : [api_proxy_revisions_r21, adevegowdat@v1-node-js] for rowkey:{}
    com.netflix.astyanax.connectionpool.exceptions.TransportException: TransportException: [host=None(0.0.0.0):0, latency=159(486), attempts=3]org.apache.thrift.transport.TTransportException: Frame size (20211500) larger than max length (16384000)!
            at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:197) ~[astyanax-thrift-1.56.43.jar:na]
            at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) ~[astyanax-thrift-1.56.43.jar:na]
    ...<snipped>
            Caused by: org.apache.thrift.transport.TTransportException: Frame size (20211500) larger than max length (16384000)!
            at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:137) ~[libthrift-0.9.1.jar:0.9.1]
            at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) ~[libthrift-0.9.1.jar:0.9.1]
            at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) ~[libthrift-0.9.1.jar:0.9.1]
    ...<snipped>
    

Resolution

An API proxy bundle becomes large when it contains too many resource files. Use the following solutions to address this issue:

Solution #1: Move resource files to the Environment or Organization level

  1. Move resource files, such as NodeJS script files and modules, JavaScript files, and JAR files, to the environment or organization level. For more information on resource files, see the Edge documentation.
  2. Deploy the API proxy and see if the error goes away.

If the problem persists, or you cannot move the resource files to the environment or organization level for some reason, then apply Solution #2.

Solution #2: Increase the API proxy bundle size on Cassandra

Note: Only Edge Private Cloud users can perform the following steps. If you are on Edge Public Cloud, contact Apigee Support.

Follow these steps to increase the Cassandra thrift framed transport size property, which controls the maximum size of API proxy bundle allowed in Edge:

  1. Create the following file, if it does not exist:
    /opt/apigee/customer/application/cassandra.properties
    
  2. Add the following line to the file, replacing <size> with the maximum bundle size you need, in megabytes:
    conf_cassandra_thrift_framed_transport_size_in_mb=<size>
    
  3. Restart Cassandra:
    /opt/apigee/apigee-service/bin/apigee-service apigee-cassandra restart
    
  4. Repeat the steps #1 through #3 on all Cassandra nodes in the cluster.
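Because steps #1 through #3 must be repeated on every Cassandra node, it helps to make the property update idempotent so it can be rerun safely on each node. A minimal sketch; PROPS defaults to a scratch file for illustration, and SIZE_MB=30 is an example value, not a recommendation:

```shell
# PROPS defaults to a scratch file for illustration; on a Cassandra node
# use /opt/apigee/customer/application/cassandra.properties instead.
PROPS="${PROPS:-/tmp/cassandra.properties}"
SIZE_MB="${SIZE_MB:-30}"    # example value; use the size your bundles need
KEY="conf_cassandra_thrift_framed_transport_size_in_mb"

touch "$PROPS"
if grep -q "^${KEY}=" "$PROPS"; then
  # Property already present: update the existing line in place
  sed -i "s/^${KEY}=.*/${KEY}=${SIZE_MB}/" "$PROPS"
else
  # Property absent: append it
  echo "${KEY}=${SIZE_MB}" >> "$PROPS"
fi
grep "^${KEY}=" "$PROPS"
# After editing, restart Cassandra on this node:
#   /opt/apigee/apigee-service/bin/apigee-service apigee-cassandra restart
```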

If the problem persists, contact Apigee Support for further assistance.