I have over 100 instances of web servers with php usingapc, and we sometimes (about once a week throughout the fleet) see that damage occurs in one of the caches, which leads to an error reporting message.
As soon as this happens, the application will be dead on this node, any transactions directed to it will fail.
I wrote a simple wrapper around tail -Fthat can detect a pattern anytime it appears in a log file, and evaluate the shell command (using bash eval ) to respond. I have this with the commandsalt-call salt stackto start processing a custom module that disables nginx , heats (updates) the cache and, of course, restarts the web server. (Actually, I have two forms of this shell, bashand Python).
This is good, and the frequency of events is such that it is unlikely to be a problem. However, my boss, quite reasonably, is concerned about the general mode rejection pattern ... that a regular expression may appear in too many of these magazines at once and occupy the city of the entire site.
My first thought was to wrap mine salt-callinredis(we already have the Redis infrastructure used for caching and some other data structures). This will be implemented as an expiration integer. The check will call INCR, check the result, and sleep if more than N is returned (or if the Redis server is unavailable). If the result was below the threshold, then it will be sent salt-call, and after the server is restored and started, the decrement will be called. (The expiration of the Redis key will kill any outdated increments, possibly in a day or even several hours ... our alert system has already informed us about servers with downstream servers, and our response time is more than enough for such a time frame).
Saltstack , . (: redis-cli Python Redis, , , salt-call ). - . ( Redis PHP script).
HOWTO Saltstack? , , - . , ( , , , .., ).