ZooKeeper connection loss errors

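Symptom

ZooKeeper connectivity issues can manifest as different symptoms, such as:

- API proxy deployment errors
- Management API calls fail with 5XX errors
- Routers or Message Processors fail to start
- Analytics components report ZooKeeper connection loss in system.logs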

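Error messages

The following are examples of error messages that may appear when the connection to one or more ZooKeeper nodes is lost.

The following error is returned in the Management Server logs when an API proxy deployment fails because of ZooKeeper connection loss: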

org: env: main INFO ZOOKEEPER - ZooKeeperServiceImpl.exists() :
Retry path existence path:
  /regions/dc-1/pods/analytics/servers/692afe93-8010-45c6-b37d-e4e05b6b2eb5/reachable,
  reason: KeeperErrorCode = ConnectionLoss
org: env: main ERROR ZOOKEEPER - ZooKeeperServiceImpl.exists() :
  Could not detect existence of path:
  /regions/dc-1/pods/analytics/servers/692afe93-8010-45c6-b37d-e4e05b6b2eb5/reachable ,
  reason: KeeperErrorCode = ConnectionLoss
org: env: main ERROR KERNEL.DEPLOYMENT - ServiceDeployer.startService() :
  ServiceDeployer.deploy() : Got a life cycle exception while starting service
  [ServerRegistrationService, Error while checking path existence for path :
  /regions/dc-1/pods/analytics/servers/692afe93-8010-45c6-b37d-e4e05b6b2eb5/reachable] :
  com.apigee.zookeeper.ZooKeeperException{ code = zookeeper.ErrorCheckingPathExistence,
  message = Error while checking path existence for path :
  /regions/dc-1/pods/analytics/servers/692afe93-8010-45c6-b37d-e4e05b6b2eb5/reachable,
  associated contexts = []} 2015-03-25 10:22:39,811
org: env: main ERROR KERNEL - MicroKernel.deployAll() : MicroKernel.deployAll() :
Error in deploying the deployment : EventService com.apigee.zookeeper.ZooKeeperException:
Error while checking path existence for path :
  /regions/dc-1/pods/analytics/servers/692afe93-8010-45c6-b37d-e4e05b6b2eb5/reachable
  at com.apigee.zookeeper.impl.ZooKeeperServiceImpl.exists(ZooKeeperServiceImpl.java:339)
  ~[zookeeper-1.0.0.jar:na] at com.apigee.zookeeper.impl.ZooKeeperServiceImpl.exists(
  ZooKeeperServiceImpl.java:323) ~[zookeeper-1.0.0.jar:na] at ... snipped

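During startup, the Routers and Message Processors connect to ZooKeeper. If there are connectivity issues with ZooKeeper, these components fail to start with the following error: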

2017-08-01 23:20:00,404  CuratorFramework-0 ERROR o.a.c.f.i.CuratorFrameworkImpl
  - CuratorFrameworkImpl.logError() : Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:710) [curator-framework-2.5.0.jar:na]
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:827) [curator-framework-2.5.0.jar:na]
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:793) [curator-framework-2.5.0.jar:na]
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:57) [curator-framework-2.5.0.jar:na]
  at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:275) [curator-framework-2.5.0.jar:na]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_131]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]

The Edge UI may display the following error, indicating that it was unable to check the deployment status of the API proxies:
```
Error Fetching Deployments
Error while checking path existence for path: path
```
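
All of the error messages above share the string `KeeperErrorCode = ConnectionLoss`. As a minimal sketch, assuming the component logs are plain text, the following Python scans log lines for that marker. The function name `find_connection_loss` and the sample lines are illustrative and not part of Apigee Edge.

```python
import re

# Matches the ZooKeeper error code that appears in the log excerpts above.
CONNECTION_LOSS = re.compile(r"KeeperErrorCode\s*=\s*ConnectionLoss")


def find_connection_loss(lines):
    """Return (line_number, text) pairs for lines that report ConnectionLoss."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if CONNECTION_LOSS.search(line):
            hits.append((number, line.strip()))
    return hits


# Sample lines abbreviated from the excerpts above.
sample = [
    "org: env: main INFO ZOOKEEPER - ZooKeeperServiceImpl.exists() :",
    "  reason: KeeperErrorCode = ConnectionLoss",
    "org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss",
]

for number, text in find_connection_loss(sample):
    print(number, text)
```

To scan a real log, pass an open file object instead of the sample list. The location of each component's log file depends on the installation.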

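Possible causes

The following table lists possible causes of this issue:

Cause	Applies to
Network connectivity issue across different data centers	Edge Private Cloud users
ZooKeeper node not serving requests	Edge Private Cloud users

Each cause is covered in its own section below, along with possible resolutions.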

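Network connectivity issue across different data centers

Diagnosis

A ZooKeeper cluster may have nodes that span multiple regions/data centers, such as DC-1 and DC-2. A typical Apigee Edge 2 DC topology has:

- ZooKeeper servers 1, 2, and 3 as voters in DC-1
- ZooKeeper servers 4 and 5 as voters and ZooKeeper server 6 as an observer in DC-2

If the DC-1 region goes down, or network connectivity between DC-1 and DC-2 is broken, the ZooKeeper nodes in DC-2 can no longer communicate with the leader node and cannot elect a new leader. Observers do not vote, and the two remaining voters in DC-2 fall short of the quorum of at least 3 voter nodes needed to elect a new leader. As a result, the ZooKeeper nodes in DC-2 cannot process any requests. They keep looping, retrying the connection to the ZooKeeper voters to find the leader.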
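
The quorum arithmetic behind this failure can be checked with a small sketch. It assumes the standard majority rule (more than half of the configured voters) and counts voters only, because observers do not take part in elections. The function names are illustrative.

```python
def quorum_size(total_voters: int) -> int:
    """Smallest strict majority of the configured voters."""
    return total_voters // 2 + 1


def can_elect_leader(total_voters: int, reachable_voters: int) -> bool:
    """True when the reachable voters form a majority of all configured voters."""
    return reachable_voters >= quorum_size(total_voters)


# Topology from the text: 5 voters in total (3 in DC-1, 2 in DC-2).
# The observer in DC-2 is not counted.
TOTAL_VOTERS = 5

print(quorum_size(TOTAL_VOTERS))          # 3
print(can_elect_leader(TOTAL_VOTERS, 2))  # False: DC-2 alone has only 2 voters
print(can_elect_leader(TOTAL_VOTERS, 3))  # True: 3 voters are enough
```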

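Resolution

Apply the following solutions in the specified order to address this issue.

If you are unable to resolve the problem after trying these solutions, contact Apigee Support.

Solution 1

Work with your network administrators to repair the network connectivity issue between the data centers.
When the ZooKeeper ensemble is able to communicate across the data centers and elect a ZooKeeper leader, the nodes should become healthy and be able to process requests.

Solution 2

If the network connectivity will take time to repair, a workaround is to reconfigure the ZooKeeper nodes in the region where they are down. For example, reconfigure the ZooKeeper cluster in DC-2 so that the 3 ZooKeeper nodes in this region are all voters, and remove the server.# entries for the DC-1 ZooKeeper nodes from zoo.cfg.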
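1. In the following example, zoo.cfg configures nodes for 2 regions, where DC-1 uses us-ea hostnames (denoting the US-East region) and DC-2 uses us-wo hostnames (denoting the US-West region). (Note: only the relevant configuration is shown):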
```
server.1=zk01ea.us-ea.4.apigee.com:2888:3888
server.2=zk02ea.us-ea.4.apigee.com:2888:3888
server.3=zk03ea.us-ea.4.apigee.com:2888:3888
server.4=zk04wo.us-wo.4.apigee.com:2888:3888
server.5=zk05wo.us-wo.4.apigee.com:2888:3888
server.6=zk06wo.us-wo.4.apigee.com:2888:3888:observer
```
  In the above example, reconfigure zoo.cfg as follows:
```
server.1=zk04wo.us-wo.4.apigee.com:2888:3888
server.2=zk05wo.us-wo.4.apigee.com:2888:3888
server.3=zk06wo.us-wo.4.apigee.com:2888:3888
```
2. Using the code with config technique, create a file named /opt/apigee/customer/application/zookeeper.properties with the following contents:
```
conf_zoo_quorum=server.1=zk04wo.us-wo.4.apigee.com:2888:3888\
\nserver.2=zk05wo.us-wo.4.apigee.com:2888:3888\
\nserver.3=zk06wo.us-wo.4.apigee.com:2888:3888\
```
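In the above, the nodes from US-East are removed, and the US-West nodes are promoted to voters once the :observer annotation is removed.
3. Back up /opt/apigee/apigee-zookeeper/conf/zoo.cfg and the old /opt/apigee/customer/application/zookeeper.properties.
These files will be used to restore the defaults when network connectivity between the data centers is back up.
4. Disable the observer notation for the observer node. To do this, add the following configuration to the top of /opt/apigee/customer/application/zookeeper.properties:
```
conf_zoo_peertype=
```
5. Edit the /opt/apigee/data/apigee-zookeeper/data/myid file as follows:
- For server.1, change the entry inside myid from 4 to 1.
- For server.2, change the entry inside myid from 5 to 2.
- For server.3, change the entry inside myid from 6 to 3.
6. Restart the ZooKeeper nodes in the region where you reconfigured the ZooKeeper cluster.
7. Apply the same configuration on all ZooKeeper nodes in DC-2.
8. Validate that the nodes are up with a leader: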
```
$ echo srvr | nc zk04wo.us-wo.4.apigee.com 2181
$ echo srvr | nc zk05wo.us-wo.4.apigee.com 2181
$ echo srvr | nc zk06wo.us-wo.4.apigee.com 2181
```
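The output of each command contains a line that says "Mode" followed by "leader" if the node is the leader, or "follower" if it is a follower.

When the network between the data centers is reestablished, the ZooKeeper configuration changes can be reverted on the ZooKeeper nodes in DC-2.

Note: After network connectivity is reestablished and the ZooKeeper cluster needs to reconnect across both data centers, restore the defaults by reverting the changes you made to /opt/apigee/customer/application/zookeeper.properties and /opt/apigee/data/apigee-zookeeper/data/myid, and then restart ZooKeeper.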
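
The renumbering in Solution 2 (keep only the DC-2 servers, drop the :observer suffix, and number them from 1) can be sketched in Python. This is an illustration of the bookkeeping only, not an Apigee tool. The function name `rebuild_quorum` and the use of ".us-wo." as a region marker are assumptions based on the example hostnames.

```python
def rebuild_quorum(server_lines, region_marker):
    """Return (new_lines, myid_map) for the servers in the surviving region.

    new_lines: renumbered server.N entries with any ':observer' suffix removed.
    myid_map:  {host: (old_id, new_id)}, showing how each myid file changes.
    """
    new_lines = []
    myid_map = {}
    for line in server_lines:
        key, _, value = line.strip().partition("=")
        # Skip anything that is not a server entry for the chosen region.
        if not key.startswith("server.") or region_marker not in value:
            continue
        old_id = int(key.split(".", 1)[1])
        parts = value.split(":")
        host, ports = parts[0], parts[1:3]  # drops a trailing 'observer' field
        new_id = len(new_lines) + 1
        new_lines.append(f"server.{new_id}={host}:{':'.join(ports)}")
        myid_map[host] = (old_id, new_id)
    return new_lines, myid_map


# The original zoo.cfg entries from the example above.
ORIGINAL = [
    "server.1=zk01ea.us-ea.4.apigee.com:2888:3888",
    "server.2=zk02ea.us-ea.4.apigee.com:2888:3888",
    "server.3=zk03ea.us-ea.4.apigee.com:2888:3888",
    "server.4=zk04wo.us-wo.4.apigee.com:2888:3888",
    "server.5=zk05wo.us-wo.4.apigee.com:2888:3888",
    "server.6=zk06wo.us-wo.4.apigee.com:2888:3888:observer",
]

new_lines, myid_map = rebuild_quorum(ORIGINAL, ".us-wo.")
for entry in new_lines:
    print(entry)
for host, (old_id, new_id) in myid_map.items():
    print(f"{host}: myid {old_id} -> {new_id}")
```

The printed entries match the reconfigured zoo.cfg shown in step 1, and the myid mapping matches the 4 to 1, 5 to 2, and 6 to 3 changes in the myid step.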

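Solution 3

If any ZooKeeper node in the cluster is not started, restart it.

Check the ZooKeeper logs to determine why the ZooKeeper node went down.

ZooKeeper logs are available in the following directory: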

$ cd /opt/apigee/var/log/apigee-zookeeper
$ ls -l
total 188
-rw-r--r--. 1 apigee apigee   2715 Jul 22 19:51 apigee-zookeeper.log
-rw-r--r--. 1 apigee apigee  10434 Jul 17 19:51 config.log
-rw-r--r--. 1 apigee apigee 169640 Aug  1 19:51 zookeeper.log

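Contact Apigee Support and provide the ZooKeeper logs to troubleshoot the cause of any ZooKeeper node that may have been stopped.

ZooKeeper node not serving requests

A ZooKeeper node in the ensemble may become unhealthy and be unable to respond to client requests. This could be because:

- The node was stopped without being restarted.
- The node was rebooted without auto-start enabled.
- System load on the node caused it to go down or become unhealthy.

Diagnosis

Execute the following ZooKeeper health check commands on each of the ZooKeeper nodes and check the output: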

$ echo "ruok" | nc localhost 2181

Example output:

$ echo "ruok" | nc localhost 2181
imok
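
The same check can be made without nc. The sketch below opens a TCP connection, sends the four-letter command, and reads until the server closes the connection, which is what `echo ruok | nc localhost 2181` does. The function names are illustrative. The host and port in the usage comment are the ones used in this section.

```python
import socket


def send_four_letter_word(host, port, command, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_ok(reply):
    """A node that is running without error answers 'ruok' with 'imok'."""
    return reply.strip() == "imok"


# Example usage against a local node (not executed here):
#   reply = send_four_letter_word("localhost", 2181, "ruok")
#   print(is_ok(reply))
```

A refused connection or a timeout raises an exception from `socket.create_connection`, which is itself a sign that the node is down or unreachable.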

$ echo srvr | nc localhost 2181

Check the mode to determine whether the ZooKeeper node is a leader or a follower.

Example output for an all-in-one, single ZooKeeper node:

$ echo srvr | nc localhost 2181
ZooKeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Latency min/avg/max: 0/0/88
Received: 4206601
Sent: 4206624
Connections: 8
Outstanding: 0
Zxid: 0x745
Mode: standalone
Node count: 282
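
The `Mode` line can be extracted programmatically. As a minimal sketch, the following Python turns the "Key: value" lines of the srvr output into a dictionary. The sample text is copied from the example output above, and the function name `parse_srvr` is illustrative.

```python
def parse_srvr(output):
    """Turn the 'Key: value' lines of srvr output into a dict of strings."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """ZooKeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Latency min/avg/max: 0/0/88
Received: 4206601
Sent: 4206624
Connections: 8
Outstanding: 0
Zxid: 0x745
Mode: standalone
Node count: 282"""

fields = parse_srvr(SAMPLE)
print(fields["Mode"])  # standalone
```

In a multi-node ensemble the same field reads leader, follower, or observer.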

$ echo mntr | nc localhost 2181

This command lists the ZooKeeper variables that can be used to check the health of the ZooKeeper cluster.

Example output:

$ echo mntr | nc localhost 2181
zk_version 3.4.5-1392090, built on 09/30/2012 17:52 GMT
zk_avg_latency 0
zk_max_latency 88
zk_min_latency 0
zk_packets_received     4206750
zk_packets_sent 4206773
zk_num_alive_connections 8
zk_outstanding_requests 0
zk_server_state standalone
zk_znode_count 282
zk_watch_count 194
zk_ephemerals_count 1
zk_approximate_data_size 22960
zk_open_file_descriptor_count 34
zk_max_file_descriptor_count 4096
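
The mntr variables lend themselves to simple automated checks. The sketch below parses the output into a dictionary and derives the fraction of file descriptors in use. Which variables to watch and what thresholds to alert on are left open here, since the source does not specify them. The function names are illustrative.

```python
def parse_mntr(output):
    """Turn 'name value' lines of mntr output into a dict of strings."""
    metrics = {}
    for line in output.splitlines():
        # The name and value are separated by whitespace (tabs or spaces).
        parts = line.split(None, 1)
        if len(parts) == 2:
            metrics[parts[0]] = parts[1].strip()
    return metrics


def fd_usage(metrics):
    """Fraction of the file descriptor limit that is currently in use."""
    used = int(metrics["zk_open_file_descriptor_count"])
    limit = int(metrics["zk_max_file_descriptor_count"])
    return used / limit


# A subset of the example output above.
SAMPLE = """zk_version 3.4.5-1392090, built on 09/30/2012 17:52 GMT
zk_packets_received     4206750
zk_outstanding_requests 0
zk_server_state standalone
zk_open_file_descriptor_count 34
zk_max_file_descriptor_count 4096"""

metrics = parse_mntr(SAMPLE)
print(metrics["zk_server_state"])
print(f"{fd_usage(metrics):.2%} of file descriptors in use")
```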

$ echo stat | nc localhost 2181

This command lists statistics about performance and connected clients.

Example output:

$ echo stat | nc localhost 2181
ZooKeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Clients:
 /10.128.0.8:54152[1](queued=0,recved=753379,sent=753385)
 /10.128.0.8:53944[1](queued=0,recved=980269,sent=980278)
 /10.128.0.8:54388[1](queued=0,recved=457094,sent=457094)
 /10.128.0.8:54622[1](queued=0,recved=972938,sent=972938)
 /10.128.0.8:54192[1](queued=0,recved=150843,sent=150843)
 /10.128.0.8:44564[1](queued=0,recved=267332,sent=267333)
 /127.0.0.1:40820[0](queued=0,recved=1,sent=0)
 /10.128.0.8:53960[1](queued=0,recved=150844,sent=150844)

Latency min/avg/max: 0/0/88
Received: 4206995
Sent: 4207018
Connections: 8
Outstanding: 0
Zxid: 0x745
Mode: standalone
Node count: 282

$ echo cons | nc localhost 2181

This command gives extended details on ZooKeeper connections.

Example output:

$ echo cons | nc localhost 2181
/127.0.0.1:40864[0](queued=0,recved=1,sent=0)
/10.128.0.8:54152[1](queued=0,recved=753400,sent=753406,sid=0x15d521a96d40007,
  lop=PING,est=1500321588647,to=40000,lcxid=0x972e9,lzxid=0x745,lresp=1502334173174,
  llat=0,minlat=0,avglat=0,maxlat=26)
/10.128.0.8:53944[1](queued=0,recved=980297,sent=980306,sid=0x15d521a96d40005,
  lop=PING,est=1500321544896,to=40000,lcxid=0xce92a,lzxid=0x745,lresp=1502334176055,
  llat=0,minlat=0,avglat=0,maxlat=23)
/10.128.0.8:54388[1](queued=0,recved=457110,sent=457110,sid=0x15d521a96d4000a,
  lop=PING,est=1500321673852,to=40000,lcxid=0x4dbe3,lzxid=0x745,lresp=1502334174245,
  llat=0,minlat=0,avglat=0,maxlat=22)
/10.128.0.8:54622[1](queued=0,recved=972967,sent=972967,sid=0x15d521a96d4000b,
  lop=PING,est=1500321890175,to=40000,lcxid=0xccc9d,lzxid=0x745,lresp=1502334182417,
  llat=0,minlat=0,avglat=0,maxlat=88)
/10.128.0.8:54192[1](queued=0,recved=150848,sent=150848,sid=0x15d521a96d40008,
  lop=PING,est=1500321591985,to=40000,lcxid=0x8,lzxid=0x745,lresp=1502334184475,
  llat=3,minlat=0,avglat=0,maxlat=19)
/10.128.0.8:44564[1](queued=0,recved=267354,sent=267355,sid=0x15d521a96d4000d,
  lop=PING,est=1501606633426,to=40000,lcxid=0x356e2,lzxid=0x745,lresp=1502334182315,
  llat=0,minlat=0,avglat=0,maxlat=35)
/10.128.0.8:53960[1](queued=0,recved=150848,sent=150848,sid=0x15d521a96d40006,
  lop=PING,est=1500321547138,to=40000,lcxid=0x5,lzxid=0x745,lresp=1502334177036,
  llat=1,minlat=0,avglat=0,maxlat=20)
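
Each connection entry can be split into named fields for closer inspection. The sketch below assumes one connection per line; the entries above are wrapped for display, so they would need to be joined first. The field names come from the example output, and the function name `parse_cons_line` is illustrative.

```python
import re

# /<address>[<n>](<key=value,key=value,...>)
CONN = re.compile(r"^\s*/(?P<addr>[^\[\s]+)\[\d+\]\((?P<fields>[^)]*)\)\s*$")


def parse_cons_line(line):
    """Parse one cons entry into a dict, or return None if it does not match."""
    match = CONN.match(line)
    if match is None:
        return None
    info = {"addr": match.group("addr")}
    for pair in match.group("fields").split(","):
        key, sep, value = pair.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


# First client entry from the example output, joined onto a single line.
LINE = (
    "/10.128.0.8:54152[1](queued=0,recved=753400,sent=753406,"
    "sid=0x15d521a96d40007,lop=PING,est=1500321588647,to=40000,"
    "lcxid=0x972e9,lzxid=0x745,lresp=1502334173174,"
    "llat=0,minlat=0,avglat=0,maxlat=26)"
)

info = parse_cons_line(LINE)
print(info["addr"], info["lop"], info["maxlat"])
```

Parsed entries can then be sorted or filtered, for example by `maxlat` to find the connections with the highest recorded latency.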

If any of the last 3 health check commands shows the following message:

$ echo stat | nc localhost 2181
    This ZooKeeper instance is not currently serving requests

This indicates that the specific ZooKeeper node is not serving requests.
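
When checking several nodes, the replies can be screened for this message in one pass. The sketch below assumes the reply text from each node has already been collected, for example with nc. The node names are hypothetical placeholders.

```python
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def is_serving(reply):
    """False for an empty reply or one that carries the 'not serving' message."""
    return bool(reply.strip()) and NOT_SERVING not in reply


def unhealthy_nodes(replies):
    """replies maps node name to reply text; return the nodes not serving."""
    return sorted(name for name, reply in replies.items() if not is_serving(reply))


# Hypothetical node names with abbreviated reply text.
replies = {
    "zk-node-a": "Mode: leader",
    "zk-node-b": "    This ZooKeeper instance is not currently serving requests",
    "zk-node-c": "Mode: follower",
}

print(unhealthy_nodes(replies))  # ['zk-node-b']
```

A node that refuses the connection produces no reply at all, so it needs to be handled by the code that collects the replies.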

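Check the ZooKeeper logs on the specific node and try to locate any errors that caused ZooKeeper to go down. ZooKeeper logs are available in the following directory: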

$ cd /opt/apigee/var/log/apigee-zookeeper
$ ls -l
total 188
-rw-r--r--. 1 apigee apigee   2715 Jul 22 19:51 apigee-zookeeper.log
-rw-r--r--. 1 apigee apigee  10434 Jul 17 19:51 config.log
-rw-r--r--. 1 apigee apigee 169640 Aug  1 19:51 zookeeper.log

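Resolution

Restart all other ZooKeeper nodes in the cluster one by one.
Re-run the ZooKeeper health check commands on each node and check whether you get the expected output.

Contact Apigee Support to troubleshoot the cause, or if restarting does not resolve the issue.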