Considerations

The teams that develop and support our products have a limited capacity to do work. We seek to maintain a balance between Rate of Delivery and Quality of Service within our capacity constraints. By automating key aspects of the teams' work, we can use our limited capacity more effectively.

Building instrumentation into the Solutions that comprise our Products allows teams to automate monitoring and alerting for the Product. Automatic alerts give early warning of nascent or potential problems in the operation of the Product. The teams can act immediately to correct the problem. In many cases they will be able to resolve the problem before any operational impact has been noticed.
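As an illustration of early warning, the sketch below evaluates a metric reading against two levels: a warning level set below the point of operational impact, and a critical level. This is a minimal sketch only. The metric name, the threshold values and the names Threshold and evaluate are assumptions made for the example and do not describe any particular Product or monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert levels for one instrumented metric."""
    metric: str
    warn_at: float      # early-warning level, below the point of operational impact
    critical_at: float  # level at which operational impact is expected


def evaluate(threshold: Threshold, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a single metric reading."""
    if value >= threshold.critical_at:
        return "critical"
    if value >= threshold.warn_at:
        return "warning"
    return None


# Assumed example values: warn at 80% disk usage, critical at 95%.
disk = Threshold(metric="disk_used_percent", warn_at=80.0, critical_at=95.0)
readings = [62.0, 81.5, 96.0]
levels = [evaluate(disk, reading) for reading in readings]
```

The gap between the warning level and the critical level is the window in which the team can act before any impact is noticed.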

The Product Teams spend less time manually monitoring the system and responding to operational impacts. Other teams, including first-line support teams, have reduced call loads because the problem is resolved before it becomes noticeable. Quality of Service is improved directly through increased availability. Transparency and the potential need for corrective action mean that we should still count the problem as an incident for the system. Rate of Delivery can be sustained at a higher level because the team spends less time on operational support.

Levels


Green

Effective Automated Alerting

Instrumentation has been designed into the Product and its Solutions. The team identifies critical alerts and their characteristics. Events are filtered to allow the team to quickly identify and understand the critical alerts when they occur.

The team routinely monitors its critical alerts. Where false positives are identified, these are used to refine event filtering. Where critical alerts are missed, filtering is changed to expose new types of events.
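The feedback loop described above can be sketched as a filter that selects critical alerts from an event stream and is adjusted when a false positive or a missed alert is found. This is a minimal sketch under stated assumptions: the event fields ("type", "source"), the event names and the AlertFilter class are hypothetical and are not taken from any specific alerting product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Event = Dict[str, str]  # assumed shape: {"type": ..., "source": ...}


@dataclass
class AlertFilter:
    """Hypothetical filter that separates critical alerts from other events."""
    critical_types: Set[str] = field(default_factory=set)
    suppressed: Set[Tuple[str, str]] = field(default_factory=set)  # (type, source)

    def is_critical(self, event: Event) -> bool:
        key = (event["type"], event["source"])
        return event["type"] in self.critical_types and key not in self.suppressed

    def select(self, events: List[Event]) -> List[Event]:
        """Keep only the events the team currently treats as critical."""
        return [e for e in events if self.is_critical(e)]

    def record_false_positive(self, event: Event) -> None:
        """Refine filtering: stop alerting on this type from this source."""
        self.suppressed.add((event["type"], event["source"]))

    def record_missed_alert(self, event: Event) -> None:
        """Expose a new type of event as a critical alert."""
        self.critical_types.add(event["type"])


events = [
    {"type": "disk_full", "source": "db-1"},
    {"type": "disk_full", "source": "test-box"},
    {"type": "queue_backlog", "source": "worker-2"},
    {"type": "login_ok", "source": "web-1"},
]
alert_filter = AlertFilter(critical_types={"disk_full"})
before = alert_filter.select(events)

# Review outcome (assumed): the test-box alert was a false positive,
# and the queue backlog was a critical alert that the filter missed.
alert_filter.record_false_positive(events[1])
alert_filter.record_missed_alert(events[2])
after = alert_filter.select(events)
```

Keeping the suppressions and the critical types as explicit data gives the team a record of each refinement, which is one possible way to keep filtering reviewable over time.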

The team responds to critical alerts so that operational impact can be avoided or substantially reduced. The team seeks the root causes of alerts and works to remove them so that the alerts cannot recur.


Amber

Limited Automated Alerting

Some instrumentation has been designed into the Product and its Solutions. The team identifies some common critical alerts and their characteristics. Events are partially filtered to allow the team to quickly identify and understand these common critical alerts when they occur.

The team routinely monitors its critical alerts. False positives are more likely because of the high volume of unfiltered alerts. Critical alerts are more likely to be missed because of inadequate filtering. Improving filtering is difficult because of the volume of alerts to be considered.

The team normally responds to identified critical alerts so that operational impact can be avoided or substantially reduced. The team inconsistently seeks to remove the root causes of alerts so that the alerts cannot recur.


Red

No Automated Alerting

There is little or no reliable instrumentation of the Product and its Solutions. There is no consistent analysis of critical alerts. Events are not filtered, making it hard to identify specific types of alert.

The overwhelming volume of alerts means the team is unable to monitor critical alerts. Critical alerts are missed with resulting operational impact. There is no data that can be used to help improve filtering of alerts.

The team does not respond to critical alerts until they have created an incident. The team inconsistently seeks to remove the root causes of incidents so that the incidents cannot recur.