10 Questions to Ask Before You Release To Production

Below are the top 10 questions I try to ensure I have answered before we release a new feature into production.

New Feature Release Checklist

    1. What is the release strategy?  Here are 3 options:
      • Big Bang – This is the simplest form of release, where you roll out the feature to your entire user base.  This can either be done with a Feature Gate (or switch), or via the deployment itself.
      • Incremental – This is where you initially roll out the feature to a small subset of users.  You may choose to roll it out to a specific region, user role, or other demographic.  Once you have some confidence after the initial rollout, you increase the user base (at whatever rate you desire) until it eventually gets to 100%.
      • Parallel – This one isn’t always an option, but it’s great to use when possible.  Let’s say you are making some performance improvements to a complex DB query.  Rather than using the Big Bang or Incremental approaches described above, or implementing some crazy regression testing strategy, you can use the parallel live technique.  With this approach, you run the query simultaneously through the legacy and new code paths.  If the results are the same, you just return them.  If they are different, you return the legacy result to the user and log the difference for review later.  This approach allows you to test out complex refactorings with no risk to your users.
    2. What about feature gating (a.k.a. feature flags or dark launch)?  How will the feature be enabled in production?  Can it be turned on for additional users without a redeploy?  Forbes has an article with more info on how Google and Facebook leverage feature gating, and HBR has another interesting one on the topic.
    3. What is the rollback plan?  What happens if this feature causes issues in production?  Can it be disabled with a feature switch?  Or is another deployment necessary?  If so, what’s the estimated downtime?  How will we know if there is an issue?  Will an alarm go off or do we need to manually test?
    4. Was the code peer reviewed?  This one might sound like a no-brainer to some, but you’d be surprised how many companies still don’t perform peer reviews.  If you aren’t using GitHub, which has reviews built right into its pull request mechanism, there are still countless other tools out there that make peer reviews painless.  One I’ve seen used successfully in the past is Code Collaborator.  Also, it’s a good idea to rotate the reviewers or spot check the reviews, as some developers will just rubber-stamp them.
    5. Did you write any unit tests?  If you are writing a new feature, you should add in unit tests where possible.  If you are making tweaks or refactoring an existing feature, it would be great to add in some tests now that you have spent the time to grok this section of the code base.
    6. Was any load testing performed?  Besides standalone load tests (e.g. JMeter, and many others), another great technique to use is HTTP replay, where you replay your production traffic through a staging environment.  Countless times, our canned JMeter tests have missed corner cases that real production users’ data would have exposed.
    7. Did QA approve the feature?  Ideally the answer to this question is automated by using some project tracking tool like JIRA.
    8. Did the PO (product owner) and UI/UX review the implementation?  I can’t tell you how many times I’ve seen a seemingly simple feature get misinterpreted between the various stakeholders, including product owner, UI/UX, customer, and engineer.  It’s a great idea to just pull these folks over to your desk for a quick review of what is being rolled out.
    9. What about DevOps?  Is any new infrastructure necessary for the release?  Are there any release timing dependencies?  What about config settings?
    10. Do we need to communicate release notes to customers?  In many companies, communication with the customer is handled by product management.  However, if you’re in a startup you may need to handle this one yourself.
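
To make questions 1 and 2 a bit more concrete, here is a minimal Python sketch of an incremental gate and a parallel run.  The names `in_rollout` and `parallel_run` are my own, made up for illustration, not from any particular feature flag library.  For simplicity the sketch runs the two code paths one after the other rather than truly simultaneously, and it assumes results can be compared with `!=`.

```python
import hashlib
import logging

logger = logging.getLogger("release")


def in_rollout(user_id: str, percent: int) -> bool:
    """Incremental release: put each user in a stable bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    # The same user always lands in the same bucket, so raising `percent`
    # only ever adds users to the rollout.
    return bucket < percent


def parallel_run(legacy, candidate, *args):
    """Parallel release: run both code paths, always return the legacy result."""
    legacy_result = legacy(*args)
    try:
        candidate_result = candidate(*args)
    except Exception:
        # A crash in the new path must never reach the user.
        logger.exception("candidate path raised for args=%r", args)
        return legacy_result
    if candidate_result != legacy_result:
        # Log the difference for review later.
        logger.warning(
            "mismatch for args=%r: legacy=%r candidate=%r",
            args, legacy_result, candidate_result,
        )
    return legacy_result
```

In practice the rollout percentage would come from config that can change without a redeploy, which is exactly what question 2 is asking about.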

 

That’s it!  I try to make sure I have a good idea of the answers to these questions for all new features that get rolled out.  Unfortunately sometimes things slip through the cracks.  However, I’ve noticed that most issues that arise downstream could have been avoided by first getting answers to the questions above before you ship.

EDIT:  I recently created a Production Readiness Checklist that condenses this blog post into a single, good looking sheet of paper that you can pass out to your team or hang on the wall in the office.  You can get a copy of this checklist by signing up for our mailing list on the right!

 

 

Monitoring Tools for Cloud Applications

There are a million tools out there for monitoring cloud applications.  The following is a list of the tools my current company has selected to monitor all levels of our stack.  Overall we are very happy with this suite.  Please post in the comments if you can recommend others!

Machine Level

Nagios – Nagios is used to monitor your IT infrastructure.  This tool will generate alerts (via email, text, etc.) when your CPU, threads, disk space, etc. exceed a given threshold.  The great thing about Nagios is that it’s completely free!
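
To give a feel for what a threshold check looks like, here is a rough Python sketch of a disk space check written in the style of a Nagios plugin.  It follows the plugin convention of exit code 0 for OK, 1 for WARNING and 2 for CRITICAL, with a one-line status message.  The 80% and 90% thresholds and the function names are assumptions picked for the example, not our actual settings.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # exit codes in the Nagios plugin convention
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_percent: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a usage percentage onto a plugin status code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0):
    """Return (status code, status line) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {used_percent:.1f}% used on {path}"


def main() -> int:
    status, message = check_disk()
    print(message)
    return status

# Run as a plugin, the script would end with: raise SystemExit(main())
```

The monitoring server runs a check like this on a schedule and decides who to alert based on the exit code.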

Logging

Papertrail – Papertrail is a log aggregator and search tool.  Like most applications, our system runs across a wide variety of EC2 instances on AWS, across many different deployment environments.  Papertrail is great because it aggregates all your logs together in one place.  No need to figure out which machine your user is on, SSH into it, etc.  Just use the web search on Papertrail and then easily drill down into your logs.  Well worth the money!

Exceptions

Sentry – Sentry is a great tool that will parse your system logs and aggregate exception errors.  It creates a nice dashboard of all of your system exceptions, grouping similar ones together.  It allows you to track who is looking into the exceptions, assign them to developers, and even export them to JIRA.  Overall it’s just a great way to give visibility to the system exceptions without having to go through your logs manually.




Performance

New Relic – New Relic is a beast.  They have been rolling out tool after tool.  The one we use most is APM (Application Performance Monitoring), which gives you alerting on system errors and performance bottlenecks.  It provides a great way to drill down into your most time-consuming transactions and helps guide refactoring.  They provide a host of other tools, but we don’t use them that often.

API

Runscope – Runscope is freakin’ awesome!  It provides an easy way to create tests that validate REST APIs.  We’ve found this tool to be key for all integration points across our organization.  If a team provides a service that others can take a dependency on, we ensure that a Runscope test is in place.  If the API is ever broken, everyone will know immediately.  The tests can be run globally, and there is some performance monitoring included as well.
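
Runscope tests are set up in their web UI, so there is no code of ours to show, but the idea behind an API check is simple enough to sketch in plain Python.  This is only an illustration of the concept, not Runscope’s API.  The function names, the expectation of a 200 status, and the required JSON keys are all assumptions for the example.

```python
import json
import urllib.error
import urllib.request
from typing import Iterable, List


def validate(status: int, body: str, required_keys: Iterable[str]) -> List[str]:
    """Return a list of failures; an empty list means the check passed."""
    failures = []
    if status != 200:
        failures.append(f"expected status 200, got {status}")
    try:
        payload = json.loads(body)
    except ValueError:
        return failures + ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return failures + ["body is not a JSON object"]
    for key in required_keys:
        if key not in payload:
            failures.append(f"missing key: {key}")
    return failures


def check_endpoint(url: str, required_keys: Iterable[str],
                   timeout: float = 5.0) -> List[str]:
    """Fetch `url` (supplied by the caller) and validate the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return validate(response.status,
                            response.read().decode("utf-8"), required_keys)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; validate the error response the same way.
        return validate(err.code,
                        err.read().decode("utf-8", "replace"), required_keys)
```

A scheduled job that calls something like `check_endpoint` and alerts on a non-empty result gives you the same “everyone will know immediately” behavior, minus the global test locations and dashboards.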

 

I know I’m probably missing a bunch of tools, but these are the ones we currently use, and they seem to cover all the bases.  Also, check out my other article, “When your production system goes down”, for my personal strategy for handling prod issues when the alarms do start ringing!

Please comment and tell me what tools you like to use.

 

Firedrill! What to do when your production system goes down

There’s no worse feeling than when your production system goes down.  The business relies on your system’s availability.  Something happened: a bug, a bad code push, a customer inserting crazy data, or whatever.

Now everyone is looking at you to fix it.  You are completely dependent upon your team, both operations and engineering, to come together, diagnose the issue, address the root cause, and deploy a fix ASAP.

Your ass is on the line and you are pretty much helpless.

What can you do to help?

Here are my tips:

  • Make sure you have the right people on the scene.  Have at least 1 engineer and 1 ops person on the issue together.  Open a dedicated Skype room or Google Hangout where information can flow freely.
  • Quickly assess the severity of the service degradation.
  • Notify your management chain, product team, and various other relevant internal stakeholders ASAP.  Be honest.
  • Provide cover for the team diagnosing the issue.  Limit distractions.
  • Get out of the way.  Your job is to ensure the right people are on the issue, and the org is up to date on the status.
  • Once the issue is identified and a patch is deployed, communicate out to the org what happened.
  • Afterwards, gather the team together and hold a quick post mortem to find out what went wrong.  Some key questions:
    • What services were affected?
    • What actually happened?
    • What is the root cause?
    • How can this be prevented in the future?  Is additional logging or instrumentation needed to diagnose the issue more quickly in the future?
  • Thank the team for their teamwork and quick resolution.
  • Send out a service incident report to the company that is transparent.  Describe the information gathered from the post mortem and explain it in simple terms.  Remember, the rest of the company wants to know that you have things under control, and you are taking the necessary steps to ensure it won’t happen again.  Most people understand that things go wrong and people make mistakes.

What other steps do you take?