I’ve been openly skeptical of many newer fads in workplace culture, including unlimited vacation time (see my rant here), working remotely (no articles yet, but I just don’t think it works for many types of product teams), and open office floor plans.
It makes sense that as the market demand for top software talent increases, so too will the office perks, including chefs, beer, games, etc. But why is so much emphasis placed on open environments vs. giving engineers a dedicated space where they can have quiet and focus?
Continue reading “Open office floor plans are a bad idea”
I’m now on my second company that offers a policy of ‘unlimited vacation time’. To an outsider coming from a company with a rigid time-off policy and time card system, this sounds very alluring. But does it actually work?
Continue reading “Top 5 Reasons Why ‘Unlimited Vacation Time’ Policy is a Scam”
There’s no worse feeling than when your production system goes down. The business relies on your system’s availability. Something happened: a bug, a bad code push, a customer inserting crazy data, or whatever.
Now everyone is looking at you to fix it. You are completely dependent upon your team, both operations and engineering, to come together, diagnose the problem, address the root cause, and deploy a fix ASAP.
Your ass is on the line and you are pretty much helpless.
What can you do to help?
Here are my tips:
- Make sure you have the right people on the scene. Have at least one engineer and one ops person on the issue together. Open a dedicated Skype room or Google Hangout where information can flow freely.
- Quickly assess the severity of the service degradation.
- Notify your management chain, product team, and various other relevant internal stakeholders ASAP. Be honest.
- Provide cover for the team diagnosing the issue. Limit distractions.
- Get out of the way. Your job is to ensure the right people are on the issue, and the org is up to date on the status.
- Once the issue is identified and a patch is deployed, communicate out to the org what happened.
- Afterwards, gather the team together and hold a quick post mortem to find out what went wrong. Some key questions:
  - What services were affected?
  - What actually happened?
  - What is the root cause?
  - How can this be prevented in the future? Is additional logging or instrumentation needed to diagnose similar issues more quickly?
- Thank the team for their teamwork and quick resolution.
- Send a transparent service incident report to the company. Describe the information gathered from the post mortem and explain it in simple terms. Remember, the rest of the company wants to know that you have things under control and that you are taking the necessary steps to ensure it won’t happen again. Most people understand that things go wrong and people make mistakes.
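
If you want your incident reports to keep a consistent shape, the post mortem questions above map naturally onto a small record. Here is a minimal Python sketch of that idea. The `IncidentReport` class, its field names, and the `summary` method are all hypothetical names made up for illustration, and the example values are invented rather than taken from a real incident.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Hypothetical record mirroring the post mortem questions."""

    services_affected: List[str]
    what_happened: str
    root_cause: str
    prevention_steps: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render a plain-text report that could go in a company-wide email."""
        lines = [
            "Services affected: " + ", ".join(self.services_affected),
            "What happened: " + self.what_happened,
            "Root cause: " + self.root_cause,
            "Prevention:",
        ]
        # One indented bullet per follow-up action.
        lines += [f"  - {step}" for step in self.prevention_steps]
        return "\n".join(lines)


# Example values are invented for illustration only.
report = IncidentReport(
    services_affected=["checkout"],
    what_happened="Checkout returned errors for 20 minutes.",
    root_cause="A bad code push skipped a database migration.",
    prevention_steps=["Add a migration check to the deploy script."],
)
print(report.summary())
```

Keeping the fields this plain is deliberate: the report is for the whole company, so each field should hold a sentence a non-engineer can read.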
What other steps do you take?