Rules for alerting and automating responses
Mist.io rules address common needs like these:
- I’d like my ops team to get notified when my production cluster is overloaded.
- I’d like to restart service X when it stops responding.
- I’d like to get notified when specific files on my servers are accessed.
Every rule consists of three parts:
- Scope: the machine or group of machines the rule applies to.
- Condition: which metric to check, how often, and the decision threshold.
- Action: what to do when the condition is met.
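To make the anatomy concrete, here is a rule expressed as a plain Python dict. The field names are illustrative only; they are not the actual Mist.io rule schema (see the API reference at the end of this section for that):

```python
# Illustrative only: these field names are made up for this sketch and
# do not reflect the real Mist.io API schema.
rule = {
    # Scope: which machines the rule applies to (here, by tag).
    "scope": {"tags": ["production"]},
    # Condition: metric, threshold, and evaluation window.
    "condition": {
        "metric": "load.shortterm",
        "operator": ">",
        "value": 2,
        "window_seconds": 120,   # evaluate over a 2-minute window
    },
    # Action: what happens when the condition is met.
    "action": {"type": "alert", "emails": ["ops@example.com"]},
}
```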
The scope of a rule can be one of the following:
- Every machine you monitor through Mist.io. This type applies to machines that exist when you set a rule, as well as new machines you might provision over time. Such rules are ideal for basic checks and save you the trouble of explicitly setting them each time you provision a new machine.
- A specific machine. Some machines are more special than others, e.g. they might run specific applications, have a unique hardware configuration, etc. Rules on specific machines help you deal with such cases on a machine-by-machine basis.
- Machines with a specific Mist.io tag. Machines are usually organized into logical groups, e.g. machines that belong to the QA team, machines that are part of the production setup, etc. Mist.io makes no assumptions here and lets you organize and manage such groups using tags. You can view more details on tags here. Once you have set your tags, you can take advantage of them in rules and gain fine-grained control over who should receive alerts and what actions should be taken.
The condition of a rule has three key components:
- Metric: the monitoring metric that will be checked, e.g. load, CPU usage, disk I/O, etc.
- Threshold: a simple operator and a value to compare against, e.g. load greater than 2.
- Frequency: how often the threshold check must hold within a time window for the rule to trigger. For example, to avoid false positives when checking load, you might widen the check to the average load over a 2-minute window instead of a single data point.
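To illustrate how widening the check over a window suppresses false positives, here is a minimal sketch (not Mist.io's actual implementation) that triggers only when the average of the sampled values exceeds the threshold:

```python
from statistics import mean

def should_trigger(samples, threshold):
    """Trigger only if the window average exceeds the threshold,
    so a single spike among otherwise normal values does not fire."""
    return mean(samples) > threshold

# One brief spike to 5.0 does not fire when averaged over the window...
assert not should_trigger([0.5, 0.6, 5.0, 0.4], threshold=2)
# ...but sustained high load does.
assert should_trigger([2.5, 3.1, 2.8, 2.6], threshold=2)
```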
When a rule is triggered, Mist.io can take any of the following actions:
- Alert: Mist.io will email an entire team in your Mist.io organization, specific members of your teams, and/or an email address outside your Mist.io organization. The last option is especially useful for sending alerts to pager addresses, e.g. those from PagerDuty, VictorOps, etc., which can apply existing, more complicated paging policies.
- Reboot: sends a reboot call to the relevant machine. This option requires caution, since the services running on that machine will be unavailable until it comes back online. For this reason, the reboot action is recommended for systems that are either highly available or not mission critical.
- Destroy: kills the relevant machine. As above, this should be treated with caution. It is mostly relevant when you want to be proactive with your spending by killing machines that are not utilized, e.g. an XL instance that someone on your development team left running without actually using it.
- Run: executes a script from Mist.io’s script section. For example, you could track your web server’s memory usage and restart it gracefully whenever it goes over a certain threshold.
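A script wired to a Run action could look like the following sketch. The service name, threshold, and restart command are all assumptions here; substitute whatever is appropriate for your service and init system:

```python
import subprocess

# Assumed threshold for this sketch; tune it for your workload.
MEMORY_THRESHOLD_PERCENT = 90

def maybe_restart(memory_percent, service="nginx"):
    """Gracefully restart the service when memory usage crosses the
    threshold. Returns True if a restart was issued, False otherwise."""
    if memory_percent < MEMORY_THRESHOLD_PERCENT:
        return False
    # 'reload' asks nginx to re-exec its workers gracefully; swap in the
    # command appropriate for your service and init system.
    subprocess.run(["systemctl", "reload", service], check=True)
    return True
```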
Rules come with their own RESTful API. You can check out the details of the relevant API calls at https://mist.io/swagger/#/rules.
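Rules can also be managed programmatically. The following is a hedged sketch of creating a rule over HTTP using only the Python standard library; the endpoint path, payload fields, and authorization header are assumptions for illustration, so consult the Swagger reference above for the actual schema before using it:

```python
import json
import urllib.request

API_TOKEN = "YOUR_MIST_API_TOKEN"  # placeholder credential

# Hypothetical payload; field names are illustrative, not the real schema.
payload = {
    "queries": [{"target": "load.shortterm", "operator": "gt", "threshold": 2}],
    "window": {"period": "minutes", "start": 2},
    "actions": [{"type": "notification", "emails": ["ops@example.com"]}],
}

req = urllib.request.Request(
    "https://mist.io/api/v1/rules",  # assumed endpoint; see the Swagger docs
    data=json.dumps(payload).encode(),
    headers={"Authorization": API_TOKEN, "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would submit the rule; it is left out here
# so the sketch runs without credentials.
```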