
Monitoring

Here you will find information about the monitoring available for services deployed with SF-Operator.

  1. Concepts
  2. Accessing the metrics
  3. Statsd
  4. Predefined alerts

Concepts

SF-Operator uses the prometheus-operator to expose and collect service metrics, and automatically creates a PodMonitor for each monitored service.

Below is a table of the available metrics per service (1).

Service    | Statsd metrics | Prometheus metrics
-----------|----------------|-------------------
Git Server |                | ✅ (node exporter only)
Log Server |                | ✅ (node exporter only)
MariaDB    |                | ✅ (node exporter only)
Nodepool   | ✅             | ✅
ZooKeeper  |                | ✅ (node exporter only)
Zuul       | ✅             | ✅

  1. Metrics exposed by Node Exporter can be used to monitor disk usage.

Each PodMonitor is set with the label key sf-monitoring (its value is the name of the monitored service); this key can be used for filtering metrics.

You can list the PodMonitors this way:

kubectl get podmonitors
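
Since every PodMonitor created by SF-Operator carries the sf-monitoring label key, a label selector can narrow the listing down. The service name used below ("zuul") is only an example value:

# only the PodMonitors created by SF-Operator
kubectl get podmonitors -l sf-monitoring

# only the PodMonitor of a specific service, for example Zuul
kubectl get podmonitors -l sf-monitoring=zuul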

For services that expose statsd metrics, a sidecar container running Statsd Exporter is added to the service pod, so that these metrics can be consumed by a Prometheus instance.
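
The sidecar shows up as an extra container in the service pod. One way to check for it is to list a pod's containers; the pod name below is a placeholder:

kubectl get pod <service-pod-name> -o jsonpath='{.spec.containers[*].name}'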

Accessing the metrics

If enabled in your cluster, metrics will automatically be collected by the cluster-wide Prometheus instance. Check with your cluster admin about getting access to your metrics.

If this feature isn't enabled in your cluster, you will need to deploy your own Prometheus instance to collect the metrics. To do so, follow the prometheus-operator's documentation.

You will then need to set the proper PodMonitorSelector in the Prometheus instance's manifest:

  # assuming Prometheus is deployed in the same namespace as SF
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchExpressions:
    - key: sf-monitoring
      operator: Exists
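
For reference, here is a minimal sketch of a Prometheus resource embedding that selector. The resource name and service account are placeholders; refer to the prometheus-operator documentation for the full spec:

  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: sf-prometheus                # placeholder name
  spec:
    # a service account with permissions to scrape pods is assumed to exist
    serviceAccountName: prometheus
    podMonitorNamespaceSelector: {}
    podMonitorSelector:
      matchExpressions:
      - key: sf-monitoring
        operator: Exists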

Statsd

Statsd Exporter mappings

Statsd Exporter sidecars are preconfigured to map every statsd metric issued by Zuul (1) and Nodepool (2) into Prometheus-compatible metrics.

  1. Zuul's statsd_mapping.yaml
  2. Nodepool's statsd_mapping.yaml
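
As an illustration of the statsd_exporter mapping format used in those files (the metric and label names below are made up, not taken from the shipped mappings):

  mappings:
    # hypothetical mapping: a statsd metric "example.builds.<result>" becomes
    # a Prometheus metric "example_builds" with a "result" label
    - match: "example.builds.*"
      name: "example_builds"
      labels:
        result: "$1"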

Forwarding

The relayAddress property in the SoftwareFactory CRD can be used to define a different statsd collector for Zuul and Nodepool, for example an external Graphite instance.
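
A minimal sketch of what this could look like is shown below; the exact placement of relayAddress in the spec should be checked against the SoftwareFactory CRD reference, and the hostname is a placeholder:

  # fragment of a SoftwareFactory resource (field placement assumed)
  spec:
    # point Zuul's and Nodepool's statsd output at an external collector
    relayAddress: "graphite.example.com"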

Predefined alerts

SF-Operator defines some metrics-related alert rules for the deployed services.

Note

The alert rules are defined for Prometheus. Handling these alerts (typically sending out notifications) requires another service called AlertManager. How to manage AlertManager is out of scope for this documentation. You may need to configure or install an AlertManager instance on your cluster, or configure Prometheus to forward alerts to an external AlertManager instance.

The following alerting rules are created automatically at deployment time:

Alert name                        | Severity | Description
----------------------------------|----------|------------
OutOfDiskNow                      | critical | The Log server has less than 10% free storage space left
OutOfDiskInThreeDays              | warning  | Assuming a linear trend, the Log server's storage space will fill up in less than three days
ConfigUpdateFailureInPostPipeline | critical | A config-update job failed in the post pipeline, meaning a configuration change was not applied properly to the Software Factory deployment's services
NotEnoughExecutors                | warning  | A lack of resources has been throttling performance for the last hour: some jobs are waiting for an available executor to run on
NotEnoughMergers                  | warning  | A lack of resources has been throttling performance for the last hour: some merge jobs are waiting for an available merger to run on
NotEnoughTestNodes                | warning  | A lack of resources has been throttling performance for the last hour: Nodepool could not fulfill node requests
DIBImageBuildFailure              | warning  | The disk-image-builder service (DIB) failed to build an image
HighFailedStateRate               | critical | Triggers when more than 5% of nodes on a provider are in a failed state over a period of one hour
HighNodeLaunchErrorRate           | critical | Triggers when more than 5% of node launch events end in an error state over a period of one hour
HighOpenStackAPIError5xxRate      | critical | Triggers when more than 5% of API calls to OpenStack return a 5xx status code (server-side error) over a period of 15 minutes

If statsd metric prefixes are set for the clouds defined in Nodepool's clouds.yaml, SF-Operator will also create the following alert for each cloud with a prefix set:

Alert name                                | Severity | Description
------------------------------------------|----------|------------
HighOpenStackAPIError5xxRate_<CLOUD NAME> | critical | Triggers when more than 5% of API calls to that cloud return a 5xx status code (server-side error) over a period of 15 minutes
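
Assuming these rules are deployed as prometheus-operator PrometheusRule resources (the exact kind and object names are not documented on this page), they can be listed with kubectl:

kubectl get prometheusrules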

Note that these alerts are generic and might not be relevant to the specifics of your deployment. For instance, it may be normal to hit the NotEnoughTestNodes alert if resource quotas are in place on your Nodepool providers.

You are encouraged to create your own alerts, using these as a starting point.
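
With prometheus-operator, custom alerts are typically shipped as PrometheusRule objects. The sketch below is illustrative only: the rule name, expression, thresholds and labels are placeholders, and the object's labels should be aligned with your Prometheus instance's ruleSelector:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: my-custom-alerts             # placeholder name
  spec:
    groups:
    - name: custom.rules
      rules:
      - alert: ExampleLowDiskSpace     # placeholder alert
        # placeholder expression based on node exporter filesystem metrics
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.20
        for: 30m
        labels:
          severity: warning
        annotations:
          description: "Less than 20% free storage space left"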