Unable to Start ZooKeeper

Symptom

Unable to start the ZooKeeper process.

Error messages

After you attempt to start the ZooKeeper process, checking its status returns the following output, which indicates that ZooKeeper could not be started:

+ apigee-service apigee-zookeeper status
apigee-service: apigee-zookeeper: Not running (DEAD)
apigee-all: Error: status failed on [apigee-zookeeper]

Possible causes

The following table lists possible causes of this issue:

Cause                                                For
Misconfigured ZooKeeper myid                         Edge Private Cloud users
ZooKeeper port in use                                Edge Private Cloud users
Incorrect process ID in apigee-zookeeper.pid file    Edge Private Cloud users
ZooKeeper Leader Election Failure                    Edge Private Cloud users

See the section for each cause below for its diagnosis and possible resolutions.

Misconfigured ZooKeeper myid

The following sections provide an overview of the myid file and describe how to diagnose and resolve misconfiguration issues.

Overview of the myid file

On each ZooKeeper node, there are two files:

  1. The /opt/apigee/apigee-zookeeper/conf/zoo.cfg file, which contains a list of IPs for all the ZooKeeper nodes in the cluster.

    For example, for a cluster of 3 ZooKeeper nodes, the /opt/apigee/apigee-zookeeper/conf/zoo.cfg file contains the IPs of the nodes as follows:

    server.1=11.11.11.11:2888:3888
    server.2=22.22.22.22:2888:3888
    server.3=33.33.33.33:2888:3888
    
  2. The /opt/apigee/data/apigee-zookeeper/data/myid file, which contains a single line of text corresponding to the server number of that particular ZooKeeper node. The myid file of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.

    For example, on ZooKeeper server.1, the /opt/apigee/data/apigee-zookeeper/data/myid file should just contain the text 1 as shown below:

    $ cat myid
    1
    

Diagnosis

  1. Check the ZooKeeper log /opt/apigee/var/log/apigee-zookeeper/zookeeper.log for errors.
  2. If you see a WARN message similar to “Connection broken for id #, my id = #”, as shown below, then a possible cause is that the server number in the myid file is misconfigured or corrupted.
    [myid:2] - WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@762] -
      Connection broken for id 2, my id = 2, error = java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.
          run(QuorumCnxManager.java:747)
    
  3. Check the /opt/apigee/apigee-zookeeper/conf/zoo.cfg file and note down the server.# for the current ZooKeeper node.
  4. Check the /opt/apigee/data/apigee-zookeeper/data/myid file and see if the text in this file matches the server.# noted in step 3.
  5. If there is a mismatch, then you have identified the cause for ZooKeeper failing to start.
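
Steps 3 to 5 can be sketched as a small shell check. This is only an illustration, assuming zoo.cfg lists nodes by IP in the server.#=IP:2888:3888 form shown earlier; the function names server_id_for_ip and check_myid are hypothetical helpers and are not part of apigee-service.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of apigee-service): compare the server.#
# entry for a node in zoo.cfg with the contents of its myid file.

# Print the server number that zoo.cfg assigns to the given IP address.
server_id_for_ip() {
  local cfg="$1" ip="$2"
  awk -F'[.=:]' -v ip="$ip" '
    /^server\.[0-9]+=/ {
      addr = $0
      sub(/^server\.[0-9]+=/, "", addr)   # drop the "server.N=" prefix
      sub(/:.*/, "", addr)                # drop the ":2888:3888" ports
      if (addr == ip) { print $2; exit }  # $2 is the N in "server.N"
    }' "$cfg"
}

# Return 0 on a match, 1 on a mismatch, 2 if the IP is not listed.
check_myid() {
  local cfg="$1" myid_file="$2" ip="$3" expected actual
  expected="$(server_id_for_ip "$cfg" "$ip")"
  actual="$(tr -d '[:space:]' < "$myid_file")"
  if [ -z "$expected" ]; then
    echo "no server.# entry for $ip in $cfg"
    return 2
  fi
  if [ "$expected" = "$actual" ]; then
    echo "OK: myid=$actual matches server.$expected"
  else
    echo "MISMATCH: myid=$actual but zoo.cfg has server.$expected for $ip"
    return 1
  fi
}
```

On a node you would pass the two file paths from this section and the node's own IP, for example: check_myid /opt/apigee/apigee-zookeeper/conf/zoo.cfg /opt/apigee/data/apigee-zookeeper/data/myid 22.22.22.22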

Resolution

If the myid file is incorrectly configured, edit it and replace its contents with the number from the server.# parameter that zoo.cfg assigns to this node.
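
As a minimal sketch of that edit, the hypothetical helper below (write_myid is an illustrative name, not an Apigee tool) validates the 1 to 255 range mentioned in the overview before writing the file. It does not handle file ownership or permissions, which you may need to check separately.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a validated server number to a myid file.
write_myid() {
  local file="$1" id="$2"
  # Reject empty or non-numeric input.
  case "$id" in
    ''|*[!0-9]*) echo "id must be a number: $id" >&2; return 1 ;;
  esac
  # The id should be between 1 and 255.
  if [ "$id" -lt 1 ] || [ "$id" -gt 255 ]; then
    echo "id out of range (1-255): $id" >&2
    return 1
  fi
  # The file holds the number on a single line and nothing else.
  printf '%s\n' "$id" > "$file"
}
```

For example, on the node listed as server.3: write_myid /opt/apigee/data/apigee-zookeeper/data/myid 3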

ZooKeeper port in use

Diagnosis

  1. Check ZooKeeper log /opt/apigee/var/log/apigee-zookeeper/zookeeper.log for errors.
  2. If you see the exception java.net.BindException: Address already in use while ZooKeeper is binding to port 2181, as shown below, it indicates that port 2181 is being used by another process and therefore ZooKeeper could not be started.
    2017-04-26 07:00:10,420 [myid:3] - INFO  [main:NIOServerCnxnFactory@94] -
      binding to port 0.0.0.0/0.0.0.0:2181
    2017-04-26 07:00:10,421 [myid:3] - ERROR [main:QuorumPeerMain@89] -
      Unexpected exception, exiting abnormally
      java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:433)
        at sun.nio.ch.Net.bind(Net.java:425)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:130)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
    
  3. Use the following netstat command to confirm that the ZooKeeper port 2181 is being used by another process:
    netstat -an | grep 2181
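
A plain grep for 2181 also matches other ports that contain those digits (such as 21810) and client connections to a remote port 2181. The sketch below is one way to narrow the check to a local listener. It assumes Linux-style netstat output with the local address in the fourth column and the state in the sixth; port_listening is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `netstat -an` style output on stdin and
# succeed only if some socket is in LISTEN state on the given local port.
port_listening() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      n = split($4, parts, ":")        # e.g. 0.0.0.0:2181 or :::2181
      if (parts[n] == port) found = 1  # last piece is the port number
    }
    END { exit (found ? 0 : 1) }'
}
```

Usage: netstat -an | port_listening 2181 && echo "port 2181 is in use"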
    

Resolution

If the ZooKeeper port 2181 is still in use, follow the steps below to address this issue:

  1. Use the netstat command to find the process that is holding onto port 2181. Kill the process that is using the ZooKeeper port 2181:
    $ netstat -antp | grep 2181
    tcp        0      0 0.0.0.0:2181            0.0.0.0:*               LISTEN      28016/java <defunct>
    $ kill -9 28016
    
  2. Remove the pid and lock files if they exist:
    /opt/apigee/var/run/apigee-zookeeper/apigee-zookeeper.pid
    /opt/apigee/var/run/apigee-zookeeper/apigee-zookeeper.lock
    
  3. Restart ZooKeeper:
    /opt/apigee/apigee-service/bin/apigee-service apigee-zookeeper restart
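
The first two steps above can be sketched as shell helpers. This is an illustration under assumptions: pid_on_port and remove_stale_files are hypothetical names, and the parsing assumes the netstat -antp column layout shown in step 1, where the seventh column is PID/program. If netstat is run without sufficient privileges, that column may show "-" instead of a PID.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `netstat -antp` output on stdin and print the
# PID of the process listening on the given local port, if any.
pid_on_port() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      n = split($4, addr, ":")      # local address, port is the last piece
      if (addr[n] == port) {
        split($7, proc, "/")        # "28016/java" -> "28016"
        print proc[1]
        exit
      }
    }'
}

# Hypothetical helper: delete the pid and lock files in the given run
# directory; rm -f does nothing for files that do not exist.
remove_stale_files() {
  rm -f "$1/apigee-zookeeper.pid" "$1/apigee-zookeeper.lock"
}

# Example (commented out so that sourcing this file changes nothing):
#   pid="$(netstat -antp | pid_on_port 2181)"
#   [ -n "$pid" ] && kill -9 "$pid"
#   remove_stale_files /opt/apigee/var/run/apigee-zookeeper
```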
    

Incorrect process ID in apigee-zookeeper.pid file

When you attempt to stop or restart ZooKeeper, the operation may fail because the apigee-zookeeper.pid file contains an old or incorrect process ID rather than that of the currently running ZooKeeper process. This can happen if the ZooKeeper process terminated unexpectedly and the apigee-zookeeper.pid file was not deleted.

Diagnosis

  1. Get the process id of the currently running ZooKeeper process by running the ps command:
    ps -ef | grep zookeeper
    
  2. Check if the /opt/apigee/var/run/apigee-zookeeper/apigee-zookeeper.pid file exists. If it exists, then note down the process id written into this file.
  3. Compare the process ids taken from step #1 and #2. If they are different, then the cause for this issue is having the incorrect process id in the apigee-zookeeper.pid file.
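
The comparison in step 3 can be sketched as follows. pidfile_matches is a hypothetical helper name. The example in the comment uses pgrep -f QuorumPeerMain to find the running process, which assumes the ZooKeeper command line contains that class name (it appears in the stack trace earlier in this document) and that only one such process is running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the pid recorded in a pid file with the
# pid of the running process.
# Returns 0 on a match, 1 on a mismatch, 2 if the pid file is missing.
pidfile_matches() {
  local pidfile="$1" running="$2" recorded
  if [ ! -f "$pidfile" ]; then
    echo "pid file not found: $pidfile"
    return 2
  fi
  recorded="$(tr -d '[:space:]' < "$pidfile")"
  if [ "$recorded" = "$running" ]; then
    echo "OK: pid file matches running process $running"
  else
    echo "STALE: pid file has $recorded, running process is $running"
    return 1
  fi
}

# Example (commented out; the pgrep pattern is an assumption):
#   pidfile_matches /opt/apigee/var/run/apigee-zookeeper/apigee-zookeeper.pid \
#     "$(pgrep -f QuorumPeerMain)"
```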

Resolution

  1. Edit the apigee-zookeeper.pid file and replace the incorrect process id with the correct process id obtained from the ps command (step 1 of the diagnosis above).
  2. Restart ZooKeeper:
    /opt/apigee/apigee-service/bin/apigee-service apigee-zookeeper restart
    

ZooKeeper Leader Election Failure

Diagnosis

To diagnose:

  1. Check the ZooKeeper log /opt/apigee/var/log/apigee-zookeeper/zookeeper.log for errors.
  2. Check if there were any configuration changes which may cause ZooKeeper election of the leader to fail.
  3. Check the /opt/apigee/apigee-zookeeper/conf/zoo.cfg file and make sure that every ZooKeeper node in the cluster has the correct number and IP address in its server.# parameter. Also note that for the leader election to succeed there must be at least 3 voters, and the number of voters should be odd. If there are too few voters, for example only 2, the ensemble cannot reach a quorum to elect a leader.
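
The zoo.cfg checks in step 3 can be partly automated. The sketch below counts the server.# entries and flags duplicate server numbers, fewer than 3 voters, and an even voter count. It assumes every server.# line is a voter (no observers are configured); check_ensemble is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check the server.# entries in a zoo.cfg file.
# Prints the voter count plus any problems found; returns 1 if any exist.
check_ensemble() {
  awk -F'[.=]' '
    /^server\.[0-9]+=/ {
      count++
      if (seen[$2]++) dup = dup " server." $2   # same number used twice
    }
    END {
      printf "voters=%d\n", count
      if (dup != "")      { print "duplicate ids:" dup;     bad = 1 }
      if (count < 3)      { print "fewer than 3 voters";    bad = 1 }
      if (count % 2 == 0) { print "even number of voters";  bad = 1 }
      exit (bad ? 1 : 0)
    }' "$1"
}
```

Usage: check_ensemble /opt/apigee/apigee-zookeeper/conf/zoo.cfg. Running it on each node and comparing the output is a quick way to spot a node whose file differs from the rest.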

Resolution

Typically, a ZooKeeper election failure is caused by a misconfigured myid. Use the resolution in the Misconfigured ZooKeeper myid section above to address the election failure.

If the problem persists and further diagnosis is needed, contact Apigee Support.