There are a million tools out there for monitoring cloud applications. The following is a list of the tools my current company has selected to monitor all levels of our stack. Overall we are very happy with this suite. Please post in the comments if you can recommend others!
Machine Level
Nagios – Nagios is used to monitor your IT infrastructure. This tool will generate alerts (via email, text, etc.) when your CPU, threads, disk space, etc. exceed a given threshold. The great thing about Nagios is that Nagios Core is free and open source!
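To make the threshold idea concrete, here is a minimal sketch of a disk-space check written in the style of a Nagios plugin. The exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) follow the standard plugin convention; the function names and the 80%/90% thresholds are my own assumptions for illustration, not something taken from our actual configuration.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are illustrative)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status, message) for the filesystem that contains `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {used_percent:.1f}% used on {path}"


def main():
    """Print the one-line status message and return the exit code."""
    status, message = check_disk()
    print(message)
    return status
```

To run it as a plugin you would end the script with `sys.exit(main())`, so the scheduler can read the exit code and decide whether to alert.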
Logging
Papertrail – Papertrail is a log aggregator and search tool. Like many cloud applications, our system runs across a wide variety of EC2 instances on AWS and across many different deployment environments. Papertrail is great because it aggregates all your logs together in one place. No need to figure out which machine your user is on, SSH into it, etc. Just use the web search on Papertrail and then easily drill down into your logs. Well worth the money!
Exceptions
Sentry – Sentry is a great tool that will parse your system logs and aggregate exceptions. It will create a nice dashboard of all of your system exceptions and group similar ones together. It allows you to track who is looking into the exceptions, assign them to developers, and even export them to JIRA. Overall it’s just a great way to give visibility to the system exceptions without having to go through your logs manually.
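The "group similar ones together" part is the piece that saves the most time, so here is a rough sketch of the general idea. This is not Sentry's actual grouping algorithm; it is a simplified illustration that assumes two exceptions are "similar" when they share an exception type and were raised from the same function. The names `fingerprint` and `group_exceptions` are hypothetical.

```python
import traceback
from collections import defaultdict


def fingerprint(exc):
    """Build a grouping key from the exception type and the raising function."""
    frames = traceback.extract_tb(exc.__traceback__)
    if frames:
        last = frames[-1]  # innermost Python frame, where the error surfaced
        location = f"{last.filename}:{last.name}"
    else:
        location = "<no traceback>"
    return (type(exc).__name__, location)


def group_exceptions(exceptions):
    """Bucket exceptions that share a fingerprint."""
    groups = defaultdict(list)
    for exc in exceptions:
        groups[fingerprint(exc)].append(exc)
    return dict(groups)
```

With a grouping like this, a thousand occurrences of the same failing call show up as one dashboard entry with a count, instead of a thousand separate log lines.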
Performance
New Relic – New Relic is a beast. They have been rolling out tool after tool. The one we use most is APM (Application Performance Monitoring), which gives you alerting on system errors and performance bottlenecks. It provides a great way to drill down into your most time-consuming transactions and helps guide refactoring. They provide a host of other tools, but we don’t use them that often.
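If you are wondering what "most time-consuming transactions" means in practice, the sketch below shows the basic bookkeeping: time each named transaction, add up the totals, and rank them. It is a toy illustration of the concept only and does not use or imitate the New Relic agent; the names `timed`, `record`, and `slowest` are made up for this example.

```python
import time
from collections import defaultdict
from functools import wraps

# Accumulated wall-clock seconds and call counts per transaction name.
_totals = defaultdict(float)
_calls = defaultdict(int)


def record(name, seconds):
    """Add one timing sample for a named transaction."""
    _totals[name] += seconds
    _calls[name] += 1


def timed(name):
    """Decorator that records how long each call of the wrapped function takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the sample even when the call raises.
                record(name, time.perf_counter() - start)
        return wrapper
    return decorator


def slowest(limit=5):
    """Return (name, total_seconds, call_count), largest total first."""
    ranked = sorted(_totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, total, _calls[name]) for name, total in ranked[:limit]]
```

Ranking by total time rather than by average is a deliberate choice here: a moderately slow transaction that runs constantly is usually a better refactoring target than a very slow one that runs once a day.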
API
Runscope – Runscope is freakin’ awesome! It provides an easy way to create tests that validate REST APIs. We’ve found this tool to be key for all integration points across our organization. If a team provides a service that others can take a dependency on, we ensure that a Runscope test is in place. If the API is ever broken, everyone will know immediately. The tests can be run globally, and some performance monitoring is included as well.
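For a feel of what such an API test checks, here is a small hand-rolled sketch using only the Python standard library. It is not Runscope's API or test format; it just shows the kind of assertions involved (status code, valid JSON, required fields). The function names and the expectation of HTTP 200 with a JSON object body are assumptions for illustration.

```python
import json
import urllib.error
import urllib.request


def validate_response(status, body, required_keys):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    try:
        payload = json.loads(body)
    except ValueError:
        problems.append("body is not valid JSON")
        return problems
    if not isinstance(payload, dict):
        problems.append("body is not a JSON object")
        return problems
    for key in required_keys:
        if key not in payload:
            problems.append(f"missing field: {key}")
    return problems


def run_check(url, required_keys, timeout=10):
    """Fetch `url` and validate the response (performs a real HTTP request)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status, body = response.status, response.read()
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; validate the error response the same way.
        status, body = exc.code, exc.read()
    return validate_response(status, body, required_keys)
```

Scheduling a check like this against every shared endpoint, and alerting whenever the returned list is non-empty, is the homemade version of what the hosted tool gives us out of the box.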
I know I’m probably missing a bunch of tools, but these are the ones we currently use, and they seem to cover all the bases. Also, check out my other article, “When your production system goes down”, for my personal strategy for handling prod issues when the alarms do start ringing!
Please comment and tell me what tools you like to use!