There’s no worse feeling than when your production system goes down. The business relies on your system’s availability. Something happened: a bug, a bad code push, a customer inserting data you never anticipated, or something else entirely.
Now everyone is looking at you to fix it. You are completely dependent on your team, operations and engineering, to come together, diagnose the problem, address the root cause, and deploy a fix ASAP.
Your ass is on the line and you are pretty much helpless.
What can you do to help?
Here are my tips:
- Make sure you have the right people on the scene. Have at least one engineer and one ops person on the issue together. Open a dedicated Skype room or Google Hangout where information can flow freely.
- Quickly assess the severity of the service degradation.
- Notify your management chain, product team, and any other relevant internal stakeholders ASAP. Be honest.
- Provide cover for the team diagnosing the issue. Limit distractions.
- Get out of the way. Your job is to ensure the right people are on the issue, and the org is up to date on the status.
- Once the issue is identified and a patch is deployed, communicate out to the org what happened.
- Afterwards, gather the team together and hold a quick post-mortem to find out what went wrong. Some key questions:
  - What services were affected?
  - What actually happened?
  - What is the root cause?
  - How can this be prevented in the future? Is additional logging or instrumentation needed to diagnose the issue more quickly next time?
- Thank the team for their teamwork and quick resolution.
- Send out a transparent service incident report to the company. Describe the information gathered from the post-mortem and explain it in simple terms. Remember, the rest of the company wants to know that you have things under control and that you are taking the necessary steps to ensure it won’t happen again. Most people understand that things go wrong and people make mistakes.
What other steps do you take?