This is a quick knowledge sharing post before it gets out of my head :)

I’m sure many (if not most) of you use Alertmanager as the go-to alerting system with Prometheus.

I really like the simplicity of Alertmanager’s configuration file and how nicely you can plug it into your configuration generation.

The deployment pattern, however, is a recurring source of confusion for new adopters. I am going to try to clear up some of that confusion in this post.

One Alertmanager to rule them all

You start with one Prometheus and a corresponding Alertmanager. The alerting flow looks like this:

Prometheus -> Alertmanager -> Slack

This is where life is simple.
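For completeness, here is a minimal sketch of that simple setup. The hostname alertmanager:9093, the rule file name and the Slack webhook URL are placeholders I made up for illustration, not values from a real setup:

```yaml
# prometheus.yml (fragment) - point Prometheus at the single Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed hostname and default port

rule_files:
  - 'alert.rules.yml'                      # where your alerting rules live
```

```yaml
# alertmanager.yml - send everything to one Slack channel
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'job']

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook
        channel: '#alerts'
```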

Next, you add some HA to your Prometheus setup. No problems here either:

Prometheus 1 and 2 -> Alertmanager -> Slack
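The usual way to do this is to run two identical Prometheus replicas evaluating the same rules. A sketch, assuming each replica carries an external replica label that is dropped from outgoing alerts, so Alertmanager sees identical label sets and deduplicates them:

```yaml
# prometheus.yml (fragment) - identical on both replicas except the replica value
global:
  external_labels:
    replica: prometheus-1            # 'prometheus-2' on the second instance

alerting:
  # Strip the replica label from outgoing alerts so both replicas produce
  # the same label set and Alertmanager can deduplicate the notifications.
  alert_relabel_configs:
    - regex: replica
      action: labeldrop
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```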

Oh, but of course you also need HA for your alerting system:

Prometheus 1 and 2 -> Alertmanager 1,2,3 -> Slack

At this stage, you have introduced a new piece of Alertmanager functionality: gossip, which the instances use to form a cluster and deduplicate notifications. Some of the interesting defaults for this protocol can be found here: https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L98-L106

It is important to note that for this trick to work, each Prometheus must individually fire its alerts to all Alertmanagers; do not hide the cluster behind a load balancer. The Alertmanagers then gossip about which notifications have already been sent, so you get one Slack message instead of three.
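In Prometheus terms, that simply means listing every Alertmanager explicitly under alerting.alertmanagers. A sketch with assumed hostnames:

```yaml
# prometheus.yml (fragment) - identical on every Prometheus replica
alerting:
  alertmanagers:
    - static_configs:
        - targets:                   # list all peers explicitly; no load balancer
            - 'alertmanager-1:9093'
            - 'alertmanager-2:9093'
            - 'alertmanager-3:9093'
```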

Gossip Protocol Deep Dive

Alertmanager’s clustering is built on HashiCorp’s memberlist library, which is based on the SWIM protocol (Scalable Weakly-consistent Infection-style Process Group Membership Protocol). Key features include:

  • Failure detection through periodic ping/ack messages
  • State synchronization through gossip messages
  • Configurable parameters for tuning propagation speed vs. consistency (see the flag sketch after this list)
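
The cluster is configured entirely through command-line flags. Below is a docker-compose-style sketch of one node of a three-node cluster; the hostnames and image tag are assumptions, and the tuning flags are shown with what I believe are their defaults from the cluster.go file linked above:

```yaml
# docker-compose.yml (sketch) - one node of a three-node Alertmanager cluster
services:
  alertmanager-1:
    image: prom/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'
      # Gossip tuning knobs (values here should match the defaults):
      - '--cluster.gossip-interval=200ms'   # how often local state is gossiped
      - '--cluster.pushpull-interval=1m0s'  # full state sync between two peers
      - '--cluster.probe-interval=1s'       # failure-detection probe frequency
      - '--cluster.probe-timeout=500ms'     # how long to wait for a probe ack
  # alertmanager-2 and alertmanager-3 look the same, each pointing
  # --cluster.peer at the other two instances.
```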

Advanced Deployment Patterns

Multi-Datacenter Setup

For organizations with multiple datacenters, Alertmanager can be deployed in a way that:

  • Maintains local alerting within each DC
  • Forwards critical alerts to a global Alertmanager cluster
  • Implements cross-DC redundancy (a configuration sketch follows this list)
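
One way to get all three properties is to have the Prometheus servers in each DC fan alerts out to both their local Alertmanager cluster and a global one, and to let the global cluster’s routing tree act only on critical alerts. A sketch, with hostnames and the severity label being assumptions on my part:

```yaml
# prometheus.yml in dc1 (fragment) - send alerts to both the local and the global cluster
alerting:
  alertmanagers:
    - static_configs:                                   # local, in-DC cluster
        - targets: ['am-dc1-1:9093', 'am-dc1-2:9093']
    - static_configs:                                   # global cluster, cross-DC redundancy
        - targets: ['am-global-1:9093', 'am-global-2:9093']
```

```yaml
# alertmanager.yml on the global cluster - only act on severity=critical
route:
  receiver: 'blackhole'                    # default: discard non-critical alerts
  routes:
    - matchers: ['severity="critical"']
      receiver: 'global-pager'

receivers:
  - name: 'blackhole'                      # a receiver with no notifiers drops alerts
  - name: 'global-pager'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook
        channel: '#global-critical'
```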

Federation Setup

In large organizations with multiple Prometheus instances, Alertmanager can be deployed in a federated manner:

  • Each team maintains their own Alertmanager instance
  • A central Alertmanager handles organization-wide alerts
  • Implements hierarchical alert routing (sketched below)
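
A common way to express the hierarchy is a routing tree on the central Alertmanager keyed on a team label, which I am assuming each team’s alert rules set. The matchers syntax needs a reasonably recent Alertmanager; older versions use match instead. A sketch:

```yaml
# alertmanager.yml (sketch) on the central instance - route on an assumed 'team' label
route:
  receiver: 'default-catchall'
  group_by: ['alertname', 'team']
  routes:
    - matchers: ['team="payments"']
      receiver: 'payments-slack'
    - matchers: ['team="infra"']
      receiver: 'infra-slack'

receivers:
  - name: 'default-catchall'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook
        channel: '#alerts-unrouted'
  - name: 'payments-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#payments-alerts'
  - name: 'infra-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#infra-alerts'
```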

Best Practices

  1. Dead man’s switch: Implement a dead man’s switch with a constantly firing alert to prove the whole alerting pipeline works end to end. If the notification stops arriving, something between Prometheus and your on-call tooling is broken (a rule sketch follows this list)
  2. Monitoring Alertmanager: Always monitor your Alertmanager instances using Prometheus itself
  3. Configuration Management: Use version control and CI/CD for Alertmanager configurations
  4. Alert Deduplication: Configure proper grouping and inhibition rules to reduce alert noise
  5. Capacity Planning: Monitor alert volume and scale Alertmanager accordingly
  6. Disaster Recovery: Implement backup and restore procedures for Alertmanager state
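
For the dead man’s switch in particular, the usual trick is a rule that always evaluates to true, routed to an external heartbeat service that pages you when the notifications stop. A sketch of the Prometheus side, with the rule name and labels being my own choices:

```yaml
# alert.rules.yml (sketch) - a watchdog alert that is meant to fire forever
groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)            # constant expression, so the alert never resolves
        labels:
          severity: none
        annotations:
          summary: Alerting pipeline heartbeat; this alert should always be firing
          description: If this notification stops arriving, the path from Prometheus to Alertmanager (or onwards) is broken
```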

Troubleshooting Common Issues

  • Silent Alerts: Check Prometheus rule evaluation and the Alertmanager logs, and make sure no silence or inhibition rule is swallowing the alert
  • Duplicate Alerts: Verify the cluster (gossip) configuration and network connectivity between peers; if the instances cannot gossip, each one notifies on its own
  • Delayed Alerts: Monitor Alertmanager processing latency and scale as needed, keeping in mind that group_wait and group_interval add delay by design
  • Configuration Errors: Use amtool check-config to validate configuration files before reloading

Conclusion

Proper Alertmanager deployment is crucial for maintaining reliable alerting in your monitoring stack. By understanding these patterns and best practices, you can build a robust alerting system that scales with your organization’s needs.

Remember that the right deployment pattern depends on your specific requirements and infrastructure. Start simple and evolve your architecture as your needs grow.