View UDCA metrics

The Prometheus service collects and processes metrics (as described in Metrics collection) for the UDCA service just as it does for other Hybrid services.

The following table describes labels that Prometheus uses in the UDCA metrics data. These labels are used in the metrics log entries.

Label Description
organization The name of the organization in which the UDCA service is running.
environment The name of the environment in which the UDCA service is running.
dataset The type of data. Possible values are:
  • api: Analytics data
  • event: Deployment status data
  • trace: Trace data
service The name of the upstream service. UDCA uses these services when performing actions such as file upload and authentication. Possible values include:
  • DATALOCATION: An endpoint that provides the location to which UDCA should upload data.
  • CLOUD_STORAGE: The actual storage location where data is uploaded by the UDCA.
  • TOKEN_GENERATOR: The endpoint from which the UDCA gets its JWT access and refresh tokens.
state The state of a data file. Possible values include:
  • FAILED: The UDCA attempted to upload the file but experienced an error.
  • READY_TO_UPLOAD: The file is on disk and ready to be uploaded. Does not include files that are currently being uploaded.
  • UPLOADED: The UDCA successfully uploaded this file.
status_code HTTP status codes returned by upstream services used by UDCA.
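
To make the label layout concrete, the following minimal sketch splits one metric entry into its name, labels, and value. It assumes the entry follows the Prometheus text exposition format; the sample line, the organization name, and the environment name are hypothetical, and the parser does not handle escaped quotes inside label values.

```python
import re

# Matches one labeled sample line: metric_name{label="value",...} sample_value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# Matches one label="value" pair (no support for escaped quotes).
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_sample(line):
    """Split a sample line into (metric name, label dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labeled sample line: {line!r}")
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels"))
    }
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample; "my-org" and "test" are placeholder names.
line = 'uploaded_file_count{dataset="api",organization="my-org",environment="test"} 42'
name, labels, value = parse_sample(line)
print(name, labels["dataset"], value)
```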

The following table describes some common UDCA metric log entries:

Metric Description
uploaded_file_count{dataset=, organization=, environment=} count

A count of the files that UDCA uploaded to the Apigee services.

Note that:

  • The event dataset value should keep growing.
  • The api dataset value should keep growing if the organization and environment receive constant traffic.
  • The trace dataset value should increase when you use the Apigee trace tools to debug or inspect your requests.
upstream_http_error_count{dataset=, organization=, environment=, service=, status_code=} count

A count of the number of upstream HTTP errors that UDCA encounters.

The counts for 4xx and 5xx status codes should be close to 0 and should not increase over time. A few errors, such as 5xx or 429, might occur over time, but they should not occur constantly.
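
One way to check whether error counts are increasing is to compare the counter values from two scrapes. The sketch below is illustrative only: the function name and the sample values are hypothetical, and it ignores counter resets (for example, after a pod restart), which would appear as a negative difference.

```python
def error_increases(previous, current):
    """Return the label sets whose counter grew between two scrapes.

    Both arguments map a (service, status_code) tuple to a counter value.
    """
    increases = {}
    for key, value in current.items():
        delta = value - previous.get(key, 0)
        if delta > 0:
            increases[key] = delta
    return increases


# Hypothetical counter values from two scrapes taken some time apart.
earlier = {("CLOUD_STORAGE", "503"): 2, ("TOKEN_GENERATOR", "429"): 1}
later = {("CLOUD_STORAGE", "503"): 9, ("TOKEN_GENERATOR", "429"): 1}
print(error_increases(earlier, later))
```

In this example only the CLOUD_STORAGE 503 count grew; the 429 count stayed flat, which matches the "occasional but not constant" pattern described above.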

local_file_count{dataset=,state=} value

A count of the number of files on disk in the data collection pod.

Ideally, the value will be close to 0. A consistently high value indicates that files are not being uploaded or that the UDCA is not able to upload them fast enough.

This value is computed every 60 seconds and does not reflect the state of the UDCA in real time.
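
As a rough illustration of what "consistently high" could mean in practice, the sketch below checks whether the most recent readings all exceed a threshold. The threshold, the window size, and the sample readings are assumptions chosen for the example, not values defined by UDCA.

```python
def consistently_high(samples, threshold, min_samples):
    """Return True when the last `min_samples` readings all exceed `threshold`."""
    if len(samples) < min_samples:
        return False
    return all(value > threshold for value in samples[-min_samples:])


# Hypothetical local_file_count readings, one per 60-second computation.
readings = [0, 3, 120, 150, 140, 160, 155]
print(consistently_high(readings, threshold=100, min_samples=5))
```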

local_file_latest_ts{dataset=,state=} value

The timestamp, in milliseconds since the start of the Unix Epoch, of the latest file on disk, by state.

This is computed every 60 seconds and does not reflect the state in real time. If the UDCA is up to date and there are no files waiting to be uploaded when this metric is computed, then this value will be 0.

local_file_oldest_ts{dataset=,state=} value

The timestamp, in milliseconds since the start of the Unix Epoch, for the oldest file in the dataset.

This is computed every 60 seconds and does not reflect the state in real time. If the UDCA is up to date and there are no files waiting to be uploaded when this metric is computed, then this value will be 0.

If this value keeps increasing, then old files are still on disk.
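
Because the timestamp metrics are expressed in milliseconds since the Unix Epoch, it can help to convert them to an age. The following sketch does that and treats 0 as "no files waiting", as described above. The function name and the sample timestamps are hypothetical.

```python
import time


def oldest_file_age_seconds(oldest_ts_ms, now_ms=None):
    """Return the age of the oldest file in seconds, or None when the metric is 0."""
    if oldest_ts_ms == 0:
        return None  # no files were waiting when the metric was computed
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - oldest_ts_ms) / 1000.0


# Hypothetical values: a file written 90 seconds before "now".
now_ms = 1_700_000_090_000
print(oldest_file_age_seconds(1_700_000_000_000, now_ms))
print(oldest_file_age_seconds(0, now_ms))
```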

disk_usage_bytes{dataset=,state=} value

The space occupied by the data files on the data collection pod's disk, in bytes.

The meaning of an increase in this value over time depends on the state:

  • ready_to_upload: The UDCA agent is lagging behind.
  • failed: Files are piling up on disk and are not being uploaded.

This value is computed every 60 seconds.

retry_cache_size{dataset=} value

A count of the number of files, by dataset, that UDCA is retrying to upload.

After 3 retries for each file, UDCA moves the file to the /failed subdirectory and removes it from this cache.

The cache is cleared when files are moved to the /failed subdirectory after 3 retries, so an increase in this value over time implies that the cache is not being cleared.

upstream_http_latency_seconds_bucket{service=, dataset=, organization=, environment=, le=value_in_seconds} count

The upstream latency of services, in seconds.

Buckets will be 100ms, 250ms, 500ms, 1s, 2s, 4s, 8s, 16s, 32s, and 64s.

Histogram for latency from upstream services.

upload_latency_seconds_bucket{dataset=, organization=, environment=, le=value_in_seconds} count

The total time, in seconds, that UDCA spent uploading a data file.

Buckets will be 100ms, 250ms, 500ms, 1s, 2s, 4s, 8s, 16s, 32s, and 64s.

The metrics will display a histogram for total upload latency, including all upstream calls.

total_latency_seconds_bucket{dataset=, organization=, environment=, le=value_in_seconds} count

The time interval, in seconds, between the data file being created and the data file being successfully uploaded.

Buckets will be 100ms, 250ms, 500ms, 1s, 2s, 4s, 8s, 16s, 32s, and 64s.

Histogram for total latency from file creation time to successful upload time.
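
The latency histograms above are exposed as cumulative bucket counts, so a percentile has to be estimated rather than read directly. The sketch below shows one common approach, linear interpolation inside the bucket that contains the target rank, which is similar in spirit to how Prometheus estimates quantiles. The function name and the bucket counts are hypothetical; only the bucket boundaries come from the descriptions above.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the open-ended bucket
            if count == lower_count:
                return upper_bound  # empty bucket; avoid dividing by zero
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the documented bucket boundaries (seconds).
buckets = [
    (0.1, 0), (0.25, 10), (0.5, 40), (1, 80), (2, 95), (4, 100),
    (8, 100), (16, 100), (32, 100), (64, 100), (math.inf, 100),
]
print(estimate_quantile(0.5, buckets))  # 0.625
```

With these sample counts, the median falls in the bucket between 500ms and 1s, so the estimate is only as precise as that bucket's width.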

uploaded_file_size_bucket{dataset=, organization=, environment=, le=value_in_bytes} count

The size of the file being uploaded to the Apigee services, in bytes.

Buckets will be 1KB, 10KB, 100KB, 1MB, 10MB, 100MB, and 1GB.

Histogram for file size by dataset, organization and environment.
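
Histogram buckets of this kind are cumulative: each bucket counts every file at or below its upper bound. To see how many files fall into each individual size range, subtract the previous bucket's count, as in the sketch below. The counts are hypothetical, and the byte boundaries assume decimal units (1KB = 1000 bytes), which the description above does not specify.

```python
import math


def per_bucket_counts(cumulative):
    """Convert cumulative (upper_bound, count) pairs into per-bucket counts."""
    result = []
    previous = 0
    for upper_bound, count in sorted(cumulative):
        result.append((upper_bound, count - previous))
        previous = count
    return result


# Hypothetical cumulative counts; bounds assume 1KB = 1000 bytes.
cumulative = [
    (1e3, 5), (1e4, 20), (1e5, 50), (1e6, 58),
    (1e7, 60), (1e8, 60), (1e9, 60), (math.inf, 60),
]
for upper_bound, count in per_bucket_counts(cumulative):
    print(upper_bound, count)
```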