When ramping up new hires, it’s very tempting to throw them straight into the fire: fixing bugs, building features, and so on. After they’ve completed orientation and filled out their paperwork, what better way for them to learn the system?
It’s critically important that your engineers know how the business operates, who the customers are, what their needs are, and how your product meets those needs.
The company I currently work for provides a SaaS offering that is VERY workflow intensive. We have 20+ roles in the system and around 5 major personas across 3 different applications. I made the mistake described in the first paragraph, and I’m regretting it now. We were in high-growth mode at the time, hiring as fast as we could while our backlog kept growing.
Now, these engineers have been on board for several months and know nothing about the product. When building new features, they don’t have the customer in mind.
Bottom line: when onboarding new employees, focus on the product and end users first, THEN have them learn the code. This may take a week or more, depending on your product, but it will pay dividends down the road.
There’s no worse feeling than your production system going down. The business relies on your system’s availability. Something happened: a bug, a bad code push, a customer inserting crazy data, or whatever.
Now everyone is looking at you to fix it. You are completely dependent on your team, operations and engineering together, to diagnose the issue, address the root cause, and deploy a fix ASAP.
Your ass is on the line and you are pretty much helpless.
What can you do to help?
Here are my tips:
- Make sure you have the right people on the scene: at least one engineer and one ops person working the issue together. Open a dedicated Skype room or Google Hangout where information can flow freely.
- Quickly assess the severity of the service degradation.
- Notify your management chain, product team, and various other relevant internal stakeholders ASAP. Be honest.
- Provide cover for the team diagnosing the issue. Limit distractions.
- Get out of the way. Your job is to ensure the right people are on the issue and that the org is kept up to date on its status.
- Once the issue is identified and a patch is deployed, communicate out to the org what happened.
- Afterwards, gather the team together and hold a quick post-mortem to find out what went wrong. Some key questions:
- What services were affected?
- What actually happened?
- What is the root cause?
- How can this be prevented in the future? Is additional logging or instrumentation needed to diagnose similar issues more quickly?
- Thank the team for their teamwork and quick resolution.
- Send out a transparent service incident report to the company. Describe what you learned in the post-mortem and explain it in simple terms. Remember, the rest of the company wants to know that you have things under control and that you are taking the necessary steps to ensure it won’t happen again. Most people understand that things go wrong and people make mistakes.
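On the post-mortem question about logging and instrumentation: a minimal sketch of what "enough context to diagnose quickly" can look like, using Python's standard `logging` module. The function and field names (`process_order`, `request_id`, `customer_id`) are illustrative assumptions, not anything from a real system.

```python
import logging

# Structured context in every log line makes it much faster to trace an
# incident back to a specific customer, request, or deploy.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("orders")

def process_order(order, request_id, customer_id):
    # Log enough context up front that a post-mortem can reconstruct
    # the exact request without guessing. (Names here are hypothetical.)
    log.info(
        "processing order id=%s request_id=%s customer_id=%s",
        order["id"], request_id, customer_id,
    )
    try:
        total = sum(item["price"] * item["qty"] for item in order["items"])
    except (KeyError, TypeError) as exc:
        # "Crazy data" from a customer surfaces here with full context,
        # instead of as a mystery stack trace in production.
        log.error(
            "bad order data id=%s request_id=%s error=%s",
            order.get("id"), request_id, exc,
        )
        raise
    log.info("order id=%s total=%s", order["id"], total)
    return total
```

The point isn’t this particular code; it’s that the cheapest time to add diagnostic context is before the incident, and the post-mortem is where you find the gaps.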
What other steps do you take?