Monitoring best practices

Edge for Private Cloud v4.18.05

Monitoring alerts

Apigee Edge lets you forward alerts to syslog or to external monitoring systems and tools when an error or failure occurs. Alerts can be system-level or application-level. Application-level alerts are mostly custom alerts based on generated events; the network administrator usually configures the custom conditions. For more information on alerts, contact Apigee Support.

Setting alert thresholds

Set a threshold beyond which an alert is generated. The right value depends on your hardware configuration, so set each threshold in relation to your capacity. For example, a threshold might be too low for Apigee Edge if you have only 6GB of capacity. You can assign a threshold with an equal to (=) or greater than (>) criterion. You can also specify a time interval between two consecutive alerts, in hours, minutes, or seconds.
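The threshold criterion and the interval between consecutive alerts can be sketched as follows. This is a minimal illustration, not part of Apigee Edge: the AlertRule class, its fields, and the five-minute interval are all hypothetical names and values chosen for the example.

```python
import operator
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical helper: none of these names come from Apigee Edge.
CRITERIA = {"=": operator.eq, ">": operator.gt}


@dataclass
class AlertRule:
    name: str
    criterion: str            # "=" or ">"
    threshold: float
    min_interval: timedelta   # minimum gap between two consecutive alerts
    last_fired: Optional[datetime] = None

    def should_alert(self, value: float, now: datetime) -> bool:
        """Return True if the value meets the criterion and the interval has elapsed."""
        if not CRITERIA[self.criterion](value, self.threshold):
            return False
        if self.last_fired is not None and now - self.last_fired < self.min_interval:
            return False  # still inside the quiet period after the previous alert
        self.last_fired = now
        return True


rule = AlertRule("High load", ">", 3.0, timedelta(minutes=5))
start = datetime(2024, 1, 1, 12, 0, 0)
print(rule.should_alert(3.5, start))                         # True: threshold exceeded
print(rule.should_alert(3.5, start + timedelta(minutes=1)))  # False: suppressed by the interval
print(rule.should_alert(3.5, start + timedelta(minutes=6)))  # True: interval has elapsed
```

Keeping the last firing time on the rule itself means each alert has its own quiet period, so a noisy alert does not suppress unrelated ones.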

Criteria for Setting System-level Alerts

The following table describes the criteria:

Alert                       Suggested threshold                     Description
Low memory                  500MB                                   Memory is too low to start a component.
Low disk space (/var/log)   8GB                                     Disk space has fallen too low.
High load                   3+                                      Processes waiting to run have increased unexpectedly.
Process stopped             N/A (Boolean value of true or false)    An Apigee Java process in the system has stopped.
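As a rough sketch, the suggested thresholds can be applied to a set of readings as shown below. The function names are hypothetical, and the comparison directions (memory and disk below the threshold, load at or above 3) are one reading of the table rather than documented Apigee behavior. Only the disk and load helpers use real measurements, taken from the Python standard library.

```python
import os
import shutil

# Suggested thresholds from the table above.
LOW_MEMORY_MB = 500
LOW_DISK_GB = 8
HIGH_LOAD = 3.0


def evaluate(free_memory_mb, free_disk_gb, load_average, process_running):
    """Return the names of the alerts triggered by the given readings."""
    alerts = []
    if free_memory_mb < LOW_MEMORY_MB:
        alerts.append("Low memory")
    if free_disk_gb < LOW_DISK_GB:
        alerts.append("Low disk space (/var/log)")
    if load_average >= HIGH_LOAD:       # assumption: "3+" means 3 or more
        alerts.append("High load")
    if not process_running:
        alerts.append("Process stopped")
    return alerts


def free_disk_gb(path="/var/log"):
    """Free space on the filesystem holding `path`, in GB."""
    return shutil.disk_usage(path).free / 1024 ** 3


def load_average_1min():
    """One-minute load average (available on Unix-like systems)."""
    return os.getloadavg()[0]


print(evaluate(free_memory_mb=350, free_disk_gb=20, load_average=0.25, process_running=True))
# ['Low memory']
```

Separating the evaluation from the measurement keeps the threshold logic easy to test and lets you feed it readings from whichever monitoring agent you already run.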

Checking on Apigee-specific and Third-party Ports

Monitor the following ports to make sure they're active:

  • Ports 4526, 4527, and 4528 on the Management Server, Router, and Message Processor, respectively
  • Ports 1099, 1100, and 1101 on the Management Server, Router, and Message Processor, respectively
  • Ports 8081 and 15999 on Routers
  • Ports 8082 and 8998 on Message Processors
  • Port 8080 on the Management Server

Check the following third-party ports to make sure they’re active:

  • Qpid port 5672
  • Postgres port 5432
  • Cassandra ports 7000, 7199, 9042, and 9160
  • ZooKeeper port 2181
  • OpenLDAP port 10389
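One simple way to confirm that the ports listed above are active is to attempt a TCP connection to each one. The sketch below uses only the Python standard library. The host names in CHECKS are hypothetical placeholders, and a successful connection shows only that something is listening on the port, not that the component behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical hosts: replace with the nodes in your own installation.
CHECKS = {
    "ms-host": [8080, 4526, 1099],             # Management Server
    "router-host": [8081, 15999, 4527, 1100],  # Router
    "mp-host": [8082, 8998, 4528, 1101],       # Message Processor
}


def closed_ports(checks):
    """Return the (host, port) pairs that did not accept a connection."""
    return [
        (host, port)
        for host, ports in checks.items()
        for port in ports
        if not port_is_open(host, port)
    ]
```

Calling closed_ports(CHECKS) returns an empty list when every port accepted a connection; any pair it returns is a candidate for an alert.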

To determine which port each Apigee component listens on for API calls, issue the following API calls to the Management Server (generally on port 8080). Quote the URL so that the shell does not interpret the & character:

curl -v -u username:password "http://host:port/v1/servers?pod=gateway&region=dc-1"
curl -v -u username:password "http://host:port/v1/servers?pod=central&region=dc-1"
curl -v -u username:password "http://host:port/v1/servers?pod=analytics&region=dc-1"

The output of these commands contains sections similar to the one shown below. The http.management.port property gives the port number for the specified component.

{
  "externalHostName" : "localhost",
  "externalIP" : "111.222.333.444",
  "internalHostName" : "localhost",
  "internalIP" : "111.222.333.444",
  "isUp" : true,
  "pod" : "gateway",
  "reachable" : true,
  "region" : "default",
  "tags" : {
    "property" : [ {
      "name" : "Profile",
      "value" : "Router"
    }, {
      "name" : "rpc.port",
      "value" : "4527"
    }, {
      "name" : "http.management.port",
      "value" : "8081"
    }, {
      "name" : "jmx.rmi.port",
      "value" : "1100"
    } ]
  },
  "type" : [ "router" ],
  "uUID" : "2d4ec885-e20a-4173-ae87-10be38b35750"
}
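To pull the port out of such a section programmatically, you can flatten the tags.property list into a dictionary. The following sketch assumes each server entry has the shape of the sample above; the function names are hypothetical, and the sample is abbreviated.

```python
import json


def server_properties(server):
    """Flatten the tags.property list of one server entry into a dict."""
    return {p["name"]: p["value"] for p in server.get("tags", {}).get("property", [])}


def management_port(server):
    """Return http.management.port as an int, or None if the entry lacks it."""
    value = server_properties(server).get("http.management.port")
    return int(value) if value is not None else None


# Abbreviated copy of the sample entry shown above.
sample = json.loads("""
{
  "isUp": true,
  "pod": "gateway",
  "tags": {"property": [
    {"name": "Profile", "value": "Router"},
    {"name": "rpc.port", "value": "4527"},
    {"name": "http.management.port", "value": "8081"},
    {"name": "jmx.rmi.port", "value": "1100"}
  ]},
  "type": ["router"]
}
""")

print(sample["type"][0], management_port(sample))  # router 8081
```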

Viewing Logs

Log files record messages about the events and operations of the system. Messages appear in the log when processes begin and complete, or when an error condition occurs. By viewing log files, you can obtain information about system components (for example, CPU, memory, disk, load, and processes) before and after they reach a failed state. This lets you identify and diagnose the source of current system problems and helps you predict potential ones.

For example, a typical system log of a component contains entries such as the following:

TimeStamp = 25/01/13 19:25 ; NextDelay = 30
Memory
HeapMemoryUsage = {used = 29086176}{max = 64880640} ;
NonHeapMemoryUsage = {init = 24313856}{committed = 57278464} ;
Threading
PeakThreadCount = 53 ; ThreadCount = 53 ;
OperatingSystem
SystemLoadAverage = 0.25 ;
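Entries in this form are regular enough to parse, for example to track heap usage over time. The sketch below assumes the exact "{key = value}" layout of the sample entry; the function names are hypothetical, and real log files may contain other layouts that this pattern does not cover.

```python
import re

# Matches "Name = {key = 123}{key = 456}" groups as in the sample entry above.
_FIELD = re.compile(r"(\w+)\s*=\s*((?:\{[^}]*\})+)")
_PAIR = re.compile(r"\{\s*(\w+)\s*=\s*(\d+)\s*\}")


def parse_usage(entry):
    """Return {field: {key: int}} for every braced field in a log entry."""
    return {
        name: {key: int(value) for key, value in _PAIR.findall(groups)}
        for name, groups in _FIELD.findall(entry)
    }


def heap_used_fraction(entry):
    """Fraction of the maximum heap in use, taken from HeapMemoryUsage."""
    heap = parse_usage(entry)["HeapMemoryUsage"]
    return heap["used"] / heap["max"]


entry = """\
HeapMemoryUsage = {used = 29086176}{max = 64880640} ;
NonHeapMemoryUsage = {init = 24313856}{committed = 57278464} ;
"""
print(f"{heap_used_fraction(entry):.1%}")
```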

You can edit the /opt/apigee/conf/logback.xml file to control the logging mechanism without having to restart a server. The logback.xml file contains the following property, which sets how often the logging mechanism checks the logback.xml file for configuration changes:

<configuration scan="true" scanPeriod="30 seconds" >

By default, the logging mechanism checks for changes every minute; the example above sets the interval to 30 seconds. If you omit the time unit in the scanPeriod attribute, the value is interpreted as milliseconds.
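The rules above (a one-minute default, and milliseconds when no unit is given) can be expressed as a small helper that reports the effective scan period of a logback.xml file. This is a simplified sketch: the function name is hypothetical, the unit table is deliberately small, and the parser inside Logback itself may accept more spellings.

```python
import re
import xml.etree.ElementTree as ET

# Simplified unit table; an empty unit means milliseconds.
_UNIT_SECONDS = {"": 0.001, "millisecond": 0.001, "second": 1, "minute": 60, "hour": 3600}
DEFAULT_SCAN_SECONDS = 60.0  # used when scan="true" but scanPeriod is absent


def scan_period_seconds(logback_xml):
    """Return the effective scan period in seconds, or None if scanning is off."""
    root = ET.fromstring(logback_xml)
    if root.get("scan", "false").lower() != "true":
        return None
    period = root.get("scanPeriod")
    if period is None:
        return DEFAULT_SCAN_SECONDS
    # Number, optional unit, optional plural "s".
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*?)s?\s*", period)
    if not match or match.group(2).lower() not in _UNIT_SECONDS:
        raise ValueError(f"unrecognized scanPeriod: {period!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2).lower()]


print(scan_period_seconds('<configuration scan="true" scanPeriod="30 seconds"/>'))  # 30.0
```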

The following table lists the log file locations of the Apigee Edge for Private Cloud components.

Component                 Location
Management Server         /opt/apigee/var/log/edge-management-server
Router                    /opt/apigee/var/log/edge-router
Message Processor         /opt/apigee/var/log/edge-message-processor
Qpid Server               /opt/apigee/var/log/edge-qpid-server
Apigee Postgres Server    /opt/apigee/var/log/edge-postgres-server
Edge UI                   /opt/apigee/var/log/edge-ui
ZooKeeper                 /opt/apigee/var/log/apigee-zookeeper
OpenLDAP                  /opt/apigee/var/log/apigee-openldap
Cassandra                 /opt/apigee/var/log/apigee-cassandra
Qpidd                     /opt/apigee/var/log/apigee-qpidd
PostgreSQL database       /opt/apigee/var/log/apigee-postgresql

Enabling debug logs for the Message Processor and Edge UI

To enable debug logs for Message Processor:

  1. On the Message Processor node, edit /opt/apigee/customer/application/message-processor.properties. If that file does not exist, create it.
  2. Add the following property to the file:
    conf_system_log.level=DEBUG
  3. Restart the Message Processor:
    /opt/apigee/apigee-service/bin/apigee-service edge-message-processor restart

To enable debug logs for Edge UI:

  1. On the Edge UI node, edit /opt/apigee/customer/application/ui.properties. If that file does not exist, create it.
  2. Add the following property to the file:
    conf_application_logger.application=DEBUG
  3. Restart the Edge UI:
    /opt/apigee/apigee-service/bin/apigee-service edge-ui restart
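If you script these changes, the edit-or-create step can be sketched as a small helper that sets one property in a .properties file. The function name is hypothetical and the helper is not an Apigee tool; it handles only simple key=value lines, and you still need to restart the component afterward as described above.

```python
from pathlib import Path


def set_property(path, key, value):
    """Set key=value in a .properties file, creating the file if it is missing."""
    file = Path(path)
    lines = file.read_text().splitlines() if file.exists() else []
    entry = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == key:
            lines[i] = entry          # replace an existing setting in place
            break
    else:
        lines.append(entry)           # key not present yet
    file.parent.mkdir(parents=True, exist_ok=True)
    file.write_text("\n".join(lines) + "\n")
```

For example, set_property("/opt/apigee/customer/application/ui.properties", "conf_application_logger.application", "DEBUG") adds the Edge UI setting, or updates it if the key is already present.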

Monitoring Tools

Monitoring tools such as Nagios, Collectd, Graphite, Splunk, Sumo Logic, and Monit can help you monitor your entire enterprise environment and business processes.

Components and events to monitor (table columns: Component, Nagios, Collectd, Splunk):

  • System-level checks: CPU utilization, free/used memory, disk space usage, network statistics, processes
  • API checks
  • JMX
  • Java
  • Log files
  • Critical events: rate limit hit; backend server (Hybris or SharePoint) cannot be reached; FaaS (STS) cannot be reached
  • Warning events: SMTP server cannot be reached; SLAs violated
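When you feed the critical and warning events listed above into a tool such as Nagios, a check typically reports the worst severity it observed. The sketch below follows the usual Nagios plugin exit-code convention (0 for OK, 1 for WARNING, 2 for CRITICAL). The event identifiers and the "EDGE" prefix are hypothetical names chosen for the example.

```python
# Exit codes follow the usual Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2

# Events are taken from the lists above; the keys are hypothetical identifiers.
SEVERITY = {
    "rate_limit_hit": CRITICAL,
    "backend_unreachable": CRITICAL,
    "faas_unreachable": CRITICAL,
    "smtp_unreachable": WARNING,
    "sla_violated": WARNING,
}


def overall_status(events):
    """Return the worst severity among the observed events (OK if there are none)."""
    return max((SEVERITY[event] for event in events), default=OK)


def status_line(events):
    """Build a one-line summary in the style of a monitoring plugin."""
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[overall_status(events)]
    return f"EDGE {label} - {len(events)} event(s): {', '.join(events) or 'none'}"


print(status_line(["smtp_unreachable", "rate_limit_hit"]))
# EDGE CRITICAL - 2 event(s): smtp_unreachable, rate_limit_hit
```

A wrapper script would print the status line and exit with overall_status(events) so that the monitoring tool can pick up the severity.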