Monitoring Best Practices

Edge for Private Cloud v. 4.17.01

Monitoring Alerts

Apigee Edge allows you to forward alerts to syslog or to external monitoring systems and tools when an error or failure occurs. These alerts can be system-level or application-level. Application-level alerts are mostly custom alerts created from generated events, and the network administrator usually configures the custom conditions. For more information on alerts, contact Apigee Support.

Setting Alert Thresholds

Set a threshold beyond which an alert is generated. The value you choose depends on your hardware configuration and should be set in relation to your capacity. For example, a threshold might be too low if you have only 6GB of capacity. You can define a threshold with an equal to (=) or greater than (>) criterion. You can also specify a minimum time interval between two consecutive alerts, expressed in hours, minutes, or seconds.
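The following minimal sketch illustrates the idea of a threshold with an equal to (=) or greater than (>) criterion and a minimum interval between consecutive alerts. It is not part of Apigee Edge; the ThresholdAlert class and its parameters are hypothetical names chosen for illustration.

```python
import operator
import time

# Comparison criteria named in the text: equal to (=) and greater than (>).
CRITERIA = {"=": operator.eq, ">": operator.gt}


class ThresholdAlert:
    """Hypothetical helper: fires when a value meets a criterion,
    but no more often than once per minimum interval."""

    def __init__(self, name, criterion, threshold, min_interval_seconds):
        self.name = name
        self.compare = CRITERIA[criterion]
        self.threshold = threshold
        self.min_interval = min_interval_seconds
        self.last_fired = None  # time of the most recent alert, if any

    def check(self, value, now=None):
        """Return True if an alert should be generated for this reading."""
        now = time.monotonic() if now is None else now
        if not self.compare(value, self.threshold):
            return False
        if self.last_fired is not None and now - self.last_fired < self.min_interval:
            return False  # still inside the quiet period after the previous alert
        self.last_fired = now
        return True


alert = ThresholdAlert("High load", ">", 3, min_interval_seconds=300)
print(alert.check(4.2, now=0))    # True: above threshold, first alert
print(alert.check(4.5, now=60))   # False: only 60 seconds since the last alert
print(alert.check(4.5, now=400))  # True: the interval has elapsed
```

Passing the time in explicitly (the now argument) keeps the sketch easy to test; in a real monitor you would normally let it default to the system clock.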

Criteria for Setting System-level Alerts

The following table describes the criteria:

Alert	Suggested Threshold	Description
Low memory	500MB	Memory is too low to start a component
Low disk space (/var/log)	8GB	Disk space has fallen too low.
High load	3+	Processes waiting to run have increased unexpectedly
Process stopped	N/A, a Boolean value of true or false	Apigee Java process in the system has stopped
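As a rough illustration, the criteria in the table above could be evaluated as follows. This is a sketch under stated assumptions: the evaluate and read_disk_and_load functions are hypothetical, "Low memory" and "Low disk space" are interpreted as free capacity falling below the suggested value, and MB/GB are taken as binary units. Free memory and process state are passed in by the caller because the Python standard library has no portable call for them.

```python
import os
import shutil

MB = 1024 ** 2
GB = 1024 ** 3

# Suggested thresholds from the table (units assumed to be binary MB/GB).
LOW_MEMORY_BYTES = 500 * MB
LOW_DISK_BYTES = 8 * GB
HIGH_LOAD = 3


def evaluate(free_memory_bytes, free_disk_bytes, load_average, process_running):
    """Return the names of the system-level alerts triggered by the readings."""
    alerts = []
    if free_memory_bytes < LOW_MEMORY_BYTES:
        alerts.append("Low memory")
    if free_disk_bytes < LOW_DISK_BYTES:
        alerts.append("Low disk space (/var/log)")
    if load_average >= HIGH_LOAD:
        alerts.append("High load")
    if not process_running:
        alerts.append("Process stopped")
    return alerts


def read_disk_and_load(log_path="/var/log"):
    """Read free disk space for log_path and the 1-minute load average (Unix only)."""
    free_disk = shutil.disk_usage(log_path).free
    load_1min = os.getloadavg()[0]
    return free_disk, load_1min


print(evaluate(300 * MB, 20 * GB, 3.4, True))
# ['Low memory', 'High load']
```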

Checking on Apigee-specific and Third-party Ports

Monitor the following ports to make sure they’re active:

Port 4526, 4527 and 4528 on Management Server, Router and Message Processor, respectively
Port 1099, 1100 and 1101 on Management Server, Router and Message Processor, respectively
Port 8081 and 15999 on Routers
Port 8082 and 8998 on Message Processors
Port 8080 on Management Server

Check the following third-party ports to make sure they’re active:

Qpid port 5672
Postgres port 5432
Cassandra port 7000, 7199, 9042, 9160
ZooKeeper port 2181
OpenLDAP port 10389
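One simple way to check that a port is active is to attempt a TCP connection to it. The sketch below does this for the third-party ports listed above; the port_is_open and check_ports helpers and the "localhost" host name are assumptions for illustration. A successful connection only shows that something is listening on the port, not that the service behind it is healthy.

```python
import socket

# Third-party ports listed above.
THIRD_PARTY_PORTS = {
    "Qpid": [5672],
    "Postgres": [5432],
    "Cassandra": [7000, 7199, 9042, 9160],
    "ZooKeeper": [2181],
    "OpenLDAP": [10389],
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports_by_service):
    """Map each (service, port) pair to True (reachable) or False."""
    return {
        (service, port): port_is_open(host, port)
        for service, ports in ports_by_service.items()
        for port in ports
    }


if __name__ == "__main__":
    # "localhost" is a placeholder; use the host that runs each component.
    for (service, port), is_open in check_ports("localhost", THIRD_PARTY_PORTS).items():
        print(f"{service} port {port}: {'active' if is_open else 'not reachable'}")
```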

To determine which port each Apigee component listens on for API calls, issue the following API calls to the Management Server (which is generally on port 8080):

curl -v -u <username>:<password> "http://<host>:<port>/v1/servers?pod=gateway&region=dc-1"
curl -v -u <username>:<password> "http://<host>:<port>/v1/servers?pod=central&region=dc-1"
curl -v -u <username>:<password> "http://<host>:<port>/v1/servers?pod=analytics&region=dc-1"

The output of these commands contains sections similar to the one shown below. The "http.management.port" property gives the port number for the specified component.

{
  "externalHostName" : "localhost",
  "externalIP" : "111.222.333.444",
  "internalHostName" : "localhost",
  "internalIP" : "111.222.333.444",
  "isUp" : true,
  "pod" : "gateway",
  "reachable" : true,
  "region" : "default",
  "tags" : {
    "property" : [ {
      "name" : "Profile",
      "value" : "Router"
    }, {
      "name" : "rpc.port",
      "value" : "4527"
    }, {
      "name" : "http.management.port",
      "value" : "8081"
    }, {
      "name" : "jmx.rmi.port",
      "value" : "1100"
    } ]
  },
  "type" : [ "router" ],
  "uUID" : "2d4ec885-e20a-4173-ae87-10be38b35750"
}
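If you want to extract the port from the response programmatically, a minimal sketch such as the following can help. It assumes the response body is either a single server object like the one above or a JSON array of such objects; the property_value and management_ports helpers are hypothetical names, and the sample is a trimmed copy of the output shown above.

```python
import json


def property_value(server, name):
    """Return the value of a named entry under tags.property, or None if absent."""
    for prop in server.get("tags", {}).get("property", []):
        if prop.get("name") == name:
            return prop.get("value")
    return None


def management_ports(response_text):
    """Return (type, http.management.port) for each server in the response."""
    servers = json.loads(response_text)
    if isinstance(servers, dict):  # accept a single object as well as a list
        servers = [servers]
    return [
        (server.get("type", []), property_value(server, "http.management.port"))
        for server in servers
    ]


sample = """
{
  "isUp" : true,
  "pod" : "gateway",
  "tags" : {
    "property" : [
      { "name" : "Profile", "value" : "Router" },
      { "name" : "rpc.port", "value" : "4527" },
      { "name" : "http.management.port", "value" : "8081" },
      { "name" : "jmx.rmi.port", "value" : "1100" }
    ]
  },
  "type" : [ "router" ]
}
"""

print(management_ports(sample))
# [(['router'], '8081')]
```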

Viewing Logs

Log files keep track of messages about the events and operation of the system. Messages appear in the log when processes begin and complete, or when an error condition occurs. By viewing log files, you can obtain information about system components (for example, CPU, memory, disk, load, and processes) before and after they reach a failed state. This allows you to identify and diagnose the source of current system problems and can help you predict potential ones.

For example, a typical system log of a component contains entries such as the following:

TimeStamp = 25/01/13 19:25 ; NextDelay = 30
Memory
HeapMemoryUsage = {used = 29086176}{max = 64880640} ;    
NonHeapMemoryUsage = {init = 24313856}{committed = 57278464} ;
Threading
PeakThreadCount = 53 ; ThreadCount = 53 ;
OperatingSystem
SystemLoadAverage = 0.25 ;
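Entries in this form can be parsed to track values such as heap usage over time. The sketch below assumes the exact layout of the sample entry above; the parse_entry function is a hypothetical helper, and real log entries may contain additional or differently formatted fields.

```python
import re

# \b keeps the pattern from matching inside "NonHeapMemoryUsage".
HEAP = re.compile(r"\bHeapMemoryUsage\s*=\s*\{used\s*=\s*(\d+)\}\{max\s*=\s*(\d+)\}")
LOAD = re.compile(r"\bSystemLoadAverage\s*=\s*([\d.]+)")


def parse_entry(text):
    """Extract heap usage and load average from one system log entry."""
    result = {}
    heap = HEAP.search(text)
    if heap:
        used, maximum = int(heap.group(1)), int(heap.group(2))
        result["heap_used"] = used
        result["heap_max"] = maximum
        result["heap_ratio"] = used / maximum
    load = LOAD.search(text)
    if load:
        result["load_average"] = float(load.group(1))
    return result


entry = """TimeStamp = 25/01/13 19:25 ; NextDelay = 30
Memory
HeapMemoryUsage = {used = 29086176}{max = 64880640} ;
NonHeapMemoryUsage = {init = 24313856}{committed = 57278464} ;
Threading
PeakThreadCount = 53 ; ThreadCount = 53 ;
OperatingSystem
SystemLoadAverage = 0.25 ;
"""

parsed = parse_entry(entry)
print(f"Heap used: {parsed['heap_ratio']:.1%}")   # Heap used: 44.8%
print(f"Load average: {parsed['load_average']}")  # Load average: 0.25
```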

You can edit the /opt/apigee/conf/logback.xml file to control the logging mechanism without having to restart a server. The logback.xml file contains the following property, which sets how often the logging mechanism checks the logback.xml file for configuration changes:

<configuration scan="true" scanPeriod="30 seconds" >

By default, the logging mechanism checks for changes every minute. If you omit the time unit from the scanPeriod attribute, the value is interpreted as milliseconds.
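To make the scanPeriod rules concrete, the following sketch computes the effective scan interval from a configuration element. It only illustrates the rules described above and is not how the logging mechanism itself parses the file; the scan_period_ms function is a hypothetical helper, and the set of unit names it accepts is an assumption.

```python
import xml.etree.ElementTree as ET

# Unit names accepted by this sketch, in milliseconds (an assumption).
UNIT_MS = {"millisecond": 1, "second": 1000, "minute": 60_000, "hour": 3_600_000}
DEFAULT_SCAN_MS = 60_000  # default described above: check every minute


def scan_period_ms(logback_xml):
    """Return the scan interval in milliseconds, or None if scanning is off."""
    root = ET.fromstring(logback_xml)
    if root.get("scan", "false").lower() != "true":
        return None
    period = root.get("scanPeriod")
    if period is None:
        return DEFAULT_SCAN_MS
    parts = period.split()
    value = float(parts[0])
    # No unit given: the value is interpreted as milliseconds.
    unit = parts[1].lower().rstrip("s") if len(parts) > 1 else "millisecond"
    return value * UNIT_MS[unit]


print(scan_period_ms('<configuration scan="true" scanPeriod="30 seconds"></configuration>'))
# 30000.0
```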

The following table lists the log file locations for Apigee Edge for Private Cloud components.

Component	Location
Management Server	/opt/apigee/var/log/edge-management-server
Router	/opt/apigee/var/log/edge-router
Message Processor	/opt/apigee/var/log/edge-message-processor
Qpid Server	/opt/apigee/var/log/edge-qpid-server
Apigee Postgres Server	/opt/apigee/var/log/edge-postgres-server
Edge UI	/opt/apigee/var/log/edge-ui
ZooKeeper	/opt/apigee/var/log/apigee-zookeeper
OpenLDAP	/opt/apigee/var/log/apigee-openldap
Cassandra	/opt/apigee/var/log/apigee-cassandra
Qpidd	/opt/apigee/var/log/apigee-qpidd
PostgreSQL database	/opt/apigee/var/log/apigee-postgresql

Enabling debug logs for the Message Processor and Edge UI

To enable debug logs for Message Processor:

On the Message Processor node, edit /opt/apigee/customer/application/message-processor.properties. If that file does not exist, create it.
Add the following property to the file:
conf_system_log.level=DEBUG
Restart the Message Processor:
> /opt/apigee/apigee-service/bin/apigee-service edge-message-processor restart

To enable debug logs for Edge UI:

On the Edge UI node, edit /opt/apigee/customer/application/ui.properties. If that file does not exist, create it.
Add the following property to the file:
conf_application_logger.application=DEBUG
Restart the Edge UI:
> /opt/apigee/apigee-service/bin/apigee-service edge-ui restart

Monitoring Tools

Monitoring tools such as Nagios, Collectd, Graphite, Splunk, Sumo Logic, and Monit can help you monitor your entire enterprise environment and business processes.

Component	Check	Nagios	Collectd	Splunk
System-level checks	CPU utilization	✓	✓
	Free/used memory	✓	✓
	Disk space usage	✓	✓
	Network statistics	✓	✓
Processes		✓
API checks		✓
JMX		✓
Java			✓
Log files				✓
Critical events	Rate Limit hit			✓
	Backend server (Hybris or SharePoint) cannot be reached			✓
	FaaS (STS) cannot be reached			✓
Warning events	SMTP server cannot be reached			✓
	SLAs violated			✓