EOC - Engineer On Call
IMOC - Incident Manager On Call
CMOC - Communications Manager On Call
#alerts
and #alerts-general
are an important source of information about the health of the environment and should be monitored during working hours.
production
tracker. See production queue usage for more details.
The Situation Room Permanent Zoom
. The Zoom link is in the #incident-management
topic.
The Situation Room Permanent Zoom
as soon as possible.
#production
. If the alert is flappy, create an issue and post a link to it in the thread. This issue might end up as part of the RCA or require a change to the alert rule.
/pd trigger
in the #production
channel
@advocates
handle at the start of an incident.
@sre-oncall
- at-mention this user group in Slack and it will ping whoever is currently on call.
#production
Slack Channel will tell you this with /chatops run oncall prod
.
/incident report
in Slack (e.g. #production
) and follow the prompts. Please ensure that the severity of the incident warrants paging the EOC and that you, as the reporter, stay online until the EOC has had a chance to come online and get up to speed.
/incident declare
in Slack (e.g. #production
) and follow the prompts. The incident declaration is orchestrated through IMA (incident management automation) and has the following capabilities:
Degraded
as any sustained 5-minute period where a service is below its documented Apdex SLO or above its documented error ratio SLO.
Outage
(Status = Disruption) as a 5-minute sustained error rate above the Outage line on the error ratio graph.
#incident-management
room in Slack.
#incident-management
channel for internal updates
Near misses are like a vaccine. They help the company better defend against more serious errors in the future, without harming anyone or anything in the process.
~Near Miss
label.
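
The Degraded and Outage definitions given earlier can be illustrated with a minimal Python sketch. It is not the tooling actually used. It assumes one sample per minute, and the `Sample` shape, the `classify` function, and all threshold values are hypothetical names and numbers chosen for illustration. It also assumes that "sustained" means every sample in a run of five consecutive minutes violates the same threshold.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

WINDOW_MINUTES = 5  # the "sustained 5-minute period" in the definitions


@dataclass(frozen=True)
class Sample:
    """One per-minute measurement for a service (hypothetical shape)."""
    apdex: float
    error_ratio: float


def sustained(samples: Sequence[Sample],
              violates: Callable[[Sample], bool],
              window: int = WINDOW_MINUTES) -> bool:
    """Return True if `violates` holds for `window` consecutive samples."""
    run = 0
    for sample in samples:
        run = run + 1 if violates(sample) else 0
        if run >= window:
            return True
    return False


def classify(samples: Sequence[Sample], apdex_slo: float,
             error_slo: float, outage_line: float) -> str:
    """Classify a service as 'outage', 'degraded', or 'ok'."""
    # Outage: error ratio above the Outage line for 5 sustained minutes.
    if sustained(samples, lambda s: s.error_ratio > outage_line):
        return "outage"
    # Degraded: Apdex below its SLO, or error ratio above its SLO,
    # for 5 sustained minutes.
    if (sustained(samples, lambda s: s.apdex < apdex_slo)
            or sustained(samples, lambda s: s.error_ratio > error_slo)):
        return "degraded"
    return "ok"


# Example with made-up thresholds: five slow minutes in a row.
slow = [Sample(apdex=0.90, error_ratio=0.0001)] * 5
status = classify(slow, apdex_slo=0.995, error_slo=0.001, outage_line=0.1)
```

Checking the Outage condition first means a service that crosses both thresholds is reported with the more severe status.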