Monitoring
This page describes the monitoring available for services deployed with SF-Operator.
Concepts
SF-Operator uses the prometheus-operator to expose and collect service metrics. SF-Operator automatically creates a PodMonitor for each of the following services.
The table below lists the metrics available per service. Metrics exposed by Node Exporter can be used to monitor disk usage.
Service | Statsd metrics | Prometheus metrics |
---|---|---|
Git Server | ❌ | ✅ (node exporter only) |
Log Server | ❌ | ✅ (node exporter only) |
MariaDB | ❌ | ✅ (node exporter only) |
Nodepool | ✅ | ✅ |
ZooKeeper | ❌ | ✅ (node exporter only) |
Zuul | ✅ | ✅ |
The PodMonitor is set with the label key sf-monitoring (and a value equal to the monitored service name); that key can be used for filtering metrics.
You can list the PodMonitors with kubectl get podmonitors, run in the namespace of your deployment.
For services that expose statsd metrics, a sidecar container running Statsd Exporter is added to the service pod, so that these metrics can be consumed by a Prometheus instance.
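As a rough illustration of the labelling described above, a PodMonitor carrying the sf-monitoring label might look like the sketch below. The resource name, the label value, the pod selector, and the port name are all hypothetical; the manifests actually generated by SF-Operator may differ.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-zuul-monitor   # hypothetical name
  labels:
    sf-monitoring: zuul        # key from this page; the value shown is an assumption
spec:
  selector:
    matchLabels:
      app: zuul                # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics            # hypothetical named container port
```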
Accessing the metrics
If enabled in your cluster, metrics will automatically be collected by the cluster-wide Prometheus instance. Check with your cluster admin about getting access to your metrics.
If this feature isn't enabled in your cluster, you will need to deploy your own Prometheus instance to collect the metrics. To do so, follow the prometheus-operator's documentation.
You will then need to set the proper podMonitorSelector in the Prometheus instance's manifest:
```yaml
  # assuming Prometheus is deployed in the same namespace as SF
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchExpressions:
    - key: sf-monitoring
      operator: Exists
```
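For context, the selector fragment above belongs under the spec of a Prometheus resource managed by the prometheus-operator. The following is a minimal sketch only: the resource name, the namespace, and the service account are placeholders, and the service account is assumed to already have the RBAC permissions described in the prometheus-operator documentation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sf-prometheus              # hypothetical name
  namespace: sf                    # assumed: the namespace where SF is deployed
spec:
  serviceAccountName: prometheus   # assumed to exist with suitable RBAC
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchExpressions:
      - key: sf-monitoring
        operator: Exists
```

To scrape only some services, the Exists operator could be swapped for In, with the wanted service names listed under values.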
Statsd
Statsd Exporter mappings
Statsd Exporter sidecars are preconfigured to map every statsd metric issued by Zuul and Nodepool into Prometheus-compatible metrics, using the following mapping files:
- Zuul's statsd_mapping.yaml
- Nodepool's statsd_mapping.yaml
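To give an idea of what such a mapping file contains, here is an illustrative rule in Statsd Exporter's mapping format. It is not copied from the files listed above; the statsd metric pattern and the resulting Prometheus metric name are assumptions chosen for the example.

```yaml
mappings:
  # Glob match: each * captures one dot-separated component of the statsd name.
  - match: "zuul.nodepool.requests.*.total"
    name: "zuul_nodepool_requests_total"
    labels:
      # $1 expands to the component captured by the first *.
      state: "$1"
```

With a rule of this shape, the variable part of the statsd name becomes a Prometheus label instead of being baked into the metric name, which makes the metric easier to aggregate and filter.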
Forwarding
It is possible to use the relayAddress property in a SoftwareFactory custom resource to define a different statsd collector for Zuul and Nodepool, for example an external Graphite instance.
Predefined alerts
SF-Operator defines some metrics-related alert rules for the deployed services.
Note
The alert rules are defined for Prometheus. Handling these alerts (typically sending out notifications) requires another service called AlertManager. How to manage AlertManager is out of scope for this documentation. You may need to configure or install an AlertManager instance on your cluster, or configure Prometheus to forward alerts to an external AlertManager instance.
The following alerting rules are created automatically at deployment time:
Alert name | Severity | Description |
---|---|---|
OutOfDiskNow | critical | The Log server has less than 10% free storage space left |
OutOfDiskInThreeDays | warning | Assuming a linear trend, the Log server's storage space will fill up in less than three days |
ConfigUpdateFailureInPostPipeline | critical | A config-update job failed in the post pipeline, meaning a configuration change was not applied properly to the Software Factory deployment's services |
NotEnoughExecutors | warning | Lack of resources is throttling performance in the last hour; in that case some jobs are waiting for an available executor to run on |
NotEnoughMergers | warning | Lack of resources is throttling performance in the last hour; in that case some merge jobs are waiting for an available merger to run on |
NotEnoughTestNodes | warning | Lack of resources is throttling performance in the last hour; in that case Nodepool could not fulfill node requests |
DIBImageBuildFailure | warning | The disk-image-builder service (DIB) failed to build an image |
HighFailedStateRate | critical | Triggers when more than 5% of nodes on a provider are in failed state over a period of one hour |
HighNodeLaunchErrorRate | critical | Triggers when more than 5% of node launch events end in an error state over a period of one hour |
HighOpenStackAPIError5xxRate | critical | Triggers when more than 5% of API calls on OpenStack return a status code of 5xx (server-side error) over a period of 15 minutes |
If statsd metrics prefixes are set for clouds defined in Nodepool's clouds.yaml, SF-Operator will also create the following alert for each cloud with a set prefix:
Alert name | Severity | Description |
---|---|---|
HighOpenStackAPIError5xxRate_<CLOUD NAME> | critical | Triggers when more than 5% of API calls on cloud <CLOUD NAME> return a status code of 5xx (server-side error) |
Note that these alerts are generic and might not be relevant to the specifics of your deployment. For instance, it may be normal to hit the NotEnoughTestNodes alert if resource quotas are in place on your Nodepool providers.
You are encouraged to create your own alerts, using these ones as a base.
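As a starting point, a custom alert can be declared with a PrometheusRule resource from the prometheus-operator. The sketch below is only an example under stated assumptions: the resource name, the labels, the alert name, the mountpoint, and the 20% threshold are all hypothetical, and the metadata labels must match whatever ruleSelector your Prometheus instance uses.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-sf-alerts          # hypothetical name
  labels:
    role: alert-rules             # hypothetical; must match your Prometheus ruleSelector
spec:
  groups:
    - name: custom-disk.rules     # hypothetical group name
      rules:
        - alert: DiskBelow20Percent   # hypothetical alert name
          # node_filesystem_* are standard Node Exporter metrics;
          # the mountpoint and threshold are assumptions.
          expr: |
            (node_filesystem_avail_bytes{mountpoint="/data"}
              / node_filesystem_size_bytes{mountpoint="/data"}) * 100 < 20
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Less than 20% free storage space left"
```

The for clause keeps the alert from firing on short-lived dips: the condition has to hold for the whole duration before the alert becomes active.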