Debian Conference 2013 - Munin

When in doubt, just graph it!

Steve Schnepp

Munin Project Lead

Agenda

A Brief History
Design principles
New features in 2.0

CGI
SSH Transport
Async

Scalability (master/nodes/data)
Limitations of 2.0
Roadmap of 2.2

A Brief History

2002 - Born as LRRD

2004 - Renamed as Munin

2007 - Hacked zooming for 1.2

2009 - 1.4 came out

2010-2011 - Slowly took over leadership
2012 - Released 2.0

2.0 for its 10th anniversary!
In wheezy since Sept 2012

2013 - Released 2.1

2.1 is unstable
This means internals will change between minor versions
Oct 2013 is the target for 2.2

Design Principles

"Simple things should be simple, complex things should be possible." -- Alan Kay

Very easy to use

Sane out-of-box behaviour
Complete plug-and-play

Our users: mostly the 1 server+node type...
... but some are running bigger installs

This is the growing market

New features in 2.0

Full CGI implementation

FastCGI
RRDcacheD

Native SSH transport

Avoids opening new ports
Secure, usually even more integrated than TLS

Async proxy

Loose connections
Speeds up polls
Various update rates

Scalability

Scaling the master

Handling more munin-nodes

Scaling the nodes

Handling a huge number of plugins
Handling slow plugins

Scaling the data

Keep more RRD data
Increase RRD precision (sub 5 min)

Scaling the master (1/2)

Use FastCGI
Use RRDcacheD

to escape the I/O hell
... even on SSD!
never read from the RRD files in cron

Have RAM. Lots of it.

RRDcacheD can make use of big buffers
Multiply the number of workers...
... but do not swap. Just limit the workers

Scaling the master (2/2)

Beware of shared hardware

Munin loves to annihilate any hardware
It is designed to be highly scalable...
... but not in a very efficient manner

Use the async proxy

Enables a very fast collection
Lowers the number of update workers needed
Avoids data loss when munin-update is too slow
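
The store-and-forward idea behind the async proxy can be sketched as follows. This is a minimal illustration, not Munin's actual implementation: the `Spool` class and its method names are hypothetical. The node-side collector appends timestamped samples at its own pace, and the master later asks for everything newer than its last fetch, so a slow master delays data instead of losing it.

```python
from collections import deque


class Spool:
    """Hypothetical node-side buffer of (epoch, field, value) samples."""

    def __init__(self, max_samples=10000):
        # Bounded buffer: the oldest samples are dropped first if the master never comes.
        self._samples = deque(maxlen=max_samples)

    def record(self, epoch, field, value):
        """Called by the local collector, independently of the master."""
        self._samples.append((epoch, field, value))

    def fetch_since(self, last_epoch):
        """Called on behalf of the master: return every sample newer than last_epoch."""
        return [s for s in self._samples if s[0] > last_epoch]


spool = Spool()
for epoch in (300, 600, 900):      # three local polls, 5 minutes apart
    spool.record(epoch, "load", 0.5)

# The master was slow and has only seen the sample at t=300 so far;
# it still receives both samples it missed.
missed = spool.fetch_since(300)
```

Because collection and retrieval are decoupled, each update worker on the master only reads a ready-made buffer, which is why fewer workers are needed.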

Scaling the node

Handling a huge number of plugins

The async proxy has a --fork option, which lets it fetch all plugins in parallel

Handling slow plugins

The plugins can poll themselves ...
... or just use the --fork option :)
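
Why running plugins side by side helps with slow ones can be shown with a small sketch. It is not how the async proxy is implemented: `run_plugin`, the plugin names and their delays are made up for illustration, and threads stand in for forked processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical plugins: name -> seconds the plugin takes to answer.
PLUGINS = {"cpu": 0.01, "df": 0.01, "slow_snmp": 0.3}


def run_plugin(name):
    """Stand-in for executing one plugin; sleeps to simulate its runtime."""
    time.sleep(PLUGINS[name])
    return name, f"{name}.value 42"


def fetch_all_parallel(names):
    """Run every plugin concurrently; total time is roughly that of the slowest one."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return dict(pool.map(run_plugin, names))


start = time.monotonic()
results = fetch_all_parallel(list(PLUGINS))
elapsed = time.monotonic() - start
```

Run sequentially, the total time is the sum of all plugin runtimes; run concurrently, one slow plugin no longer delays the fast ones.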

Scaling the data

Keep more data in the RRD

Configured via custom graph_data_size (on RRD create)
Handled automatically by RRD
Very fast, but can use quite a lot of space.
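
The space cost can be estimated with a little arithmetic. The sketch below is a rough model under stated assumptions: each stored value is an 8-byte double, three consolidation functions (AVERAGE, MIN, MAX) are kept, and the file header is ignored. The two retention policies are illustrative examples, not a statement of Munin's presets.

```python
BYTES_PER_VALUE = 8          # RRD stores each consolidated value as a double
CONSOLIDATION_FUNCTIONS = 3  # assumption: AVERAGE, MIN and MAX are all kept


def rrd_rows(step_seconds, retention_seconds):
    """Number of rows needed to keep `retention_seconds` at `step_seconds` resolution."""
    return retention_seconds // step_seconds


def estimate_bytes(archives):
    """Rough data size of one single-value RRD file, ignoring the file header."""
    rows = sum(rrd_rows(step, keep) for step, keep in archives)
    return rows * BYTES_PER_VALUE * CONSOLIDATION_FUNCTIONS


DAY = 86400
# Hypothetical retention policies, as (step, retention) pairs in seconds.
coarse = [(300, 2 * DAY), (1800, 9 * DAY), (7200, 45 * DAY), (DAY, 450 * DAY)]
fine = [(300, 400 * DAY)]    # full 5-minute resolution for 400 days

coarse_bytes = estimate_bytes(coarse)
fine_bytes = estimate_bytes(fine)
```

Under these assumptions, the coarse policy needs roughly 48 kB per field and the fine one roughly 2.8 MB per field. The difference only matters once it is multiplied by every field on every node.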

Increase RRD precision (sub 5 min)

This is called supersampling: the plugin polls itself and sends all the accumulated data back at each poll
The async proxy can also be used for that. It should work out of the box just by setting a different update_rate
Always use the default RRA precision, so that 1 px in the default graphs maps to 1 RRA step
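
What a supersampling plugin sends back can be sketched as below. This is a minimal sketch, assuming the Munin 2.0 convention that a plugin may return several `field.value EPOCH:VALUE` lines in one fetch and declares its sampling interval with `update_rate`. The helper names, the field name and the sample values are hypothetical.

```python
def format_config(field, update_rate):
    """Config-side lines: declare the finer sampling interval for this graph."""
    return [
        "graph_title Supersampled example",
        f"update_rate {update_rate}",
        f"{field}.label {field}",
    ]


def format_fetch(field, samples):
    """Fetch-side lines: one 'field.value EPOCH:VALUE' line per buffered sample."""
    return [f"{field}.value {epoch}:{value}" for epoch, value in samples]


# Hypothetical samples the plugin gathered by itself, 10 seconds apart.
samples = [(1376000000, 0.42), (1376000010, 0.47), (1376000020, 0.51)]

config_lines = format_config("load", update_rate=10)
fetch_lines = format_fetch("load", samples)
print("\n".join(config_lines + fetch_lines))
```

The master still connects at its usual pace. The finer resolution comes from the timestamps carried with each value.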

Limitations of 2.0

The CGI implementation for HTML is still very ugly

Using Storable is very slow on reload

The UI itself doesn't really scale

The node "namespace" is essentially "flat"
The UI is very static, not what one expects in 2013
Comparison pages are useless on large installs
It also lacks proper ACLs. No filtering either.

Roadmap of 2.2

Move from Storable to SQL

DBI-based : SQLite by default, PostgreSQL possible
Enables dynamic HTML UI, and ACL
... will require a deep rewrite of core code
RRD will stay as RRD. Only meta-data is concerned
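
To illustrate why SQL-backed metadata enables a dynamic UI, here is a minimal sketch using SQLite. The schema, table names and sample rows are entirely hypothetical and are not the planned Munin schema. Only metadata lives here, and the measurements themselves would stay in the RRD files.

```python
import sqlite3

# In-memory database for the sketch; schema and names are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, grp TEXT NOT NULL, name TEXT NOT NULL);
    CREATE TABLE service (
        id INTEGER PRIMARY KEY,
        node_id INTEGER NOT NULL REFERENCES node(id),
        name TEXT NOT NULL,
        graph_title TEXT
    );
""")

db.execute("INSERT INTO node (id, grp, name) VALUES (1, 'example.org', 'web1')")
db.executemany(
    "INSERT INTO service (node_id, name, graph_title) VALUES (?, ?, ?)",
    [(1, "cpu", "CPU usage"), (1, "df", "Disk usage")],
)

# A dynamic UI can ask for exactly what one page needs,
# instead of deserializing the whole state on every reload.
titles = [
    row[0]
    for row in db.execute(
        "SELECT s.graph_title FROM service s JOIN node n ON n.id = s.node_id "
        "WHERE n.name = ? ORDER BY s.name",
        ("web1",),
    )
]
```

The same kind of query could also carry a filter or an access check in its WHERE clause, which is hard to do against a single serialized blob.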

Full async-aware updates

No more mandatory 5 min polls, though that will remain the default.
Nodes can push their data directly to the master
Real-time monitoring!
Collectd, beware! We are coming your way.

New, full HTML5 UI

Grouping of nodes. Custom and auto-hinted
Node & graph aliasing

Feels like a Xmas list. Let's make it happen.

Thanks & Questions

Thanks
Questions