Large scale performance graphing
Distributed Cacti
(About collecting performance data with 1+N poller servers across the network)
Standard open source download
Cacti is an open-source, web-based network monitoring and graphing tool designed as a front-end application for the open-source, industry-standard data logging tool RRDtool. Cacti allows a user to poll services at predetermined intervals and graph the resulting data. It is generally used to graph time-series data of metrics such as CPU load and network bandwidth utilization.
The front end can handle multiple users, each with their own graph sets, so it is sometimes used by web hosting providers (especially dedicated server, virtual private server, and colocation providers) to display bandwidth statistics for their customers. It can be used to configure the data collection itself, allowing certain setups to be monitored without any manual configuration of RRDtool. Cacti can be extended to monitor any source via shell scripts and executables.
Architecture
A standard installation uses a MySQL database on the server and provides a web front end built from PHP scripts. Graph data is saved and retrieved by RRDtool.
Internals
Configuring and creating a graph
1. Create a data query template to get data either from SNMP OIDs or by running a script on the server. These are also known as data source templates.
Methods:
- scripts, called Data Input Methods in Cacti,
- script server scripts, which give better performance,
- SNMP data queries, to query SNMP tables (e.g. interfaces and CPUs),
- script data queries, to query non-SNMP table structures.
2. Create a data template, which defines the structure for storing data.
This template will be applied to specific hosts to create the real RRD files; a data template maps more or less directly onto an rrdtool create statement (see the sketch below).
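For illustration only (the file name, data source name, and RRA choices here are hypothetical), the RRD file Cacti builds from such a data template corresponds to something like:
rrdtool create host1_traffic_in_42.rrd --step 300 \
    DS:traffic_in:COUNTER:600:0:U \
    RRA:AVERAGE:0.5:1:600 \
    RRA:AVERAGE:0.5:6:700
Here --step 300 matches the 5-minute polling interval, the DS line defines one counter data source with a 600-second heartbeat, and the RRA lines define how samples are consolidated over time.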
3. Create a graph template, which is used to build a “raw rrdtool graph statement”.
This template will be applied to specific hosts to create the real RRD graphs. Each graph template contains graph items, and this is the complex part: each item generates a piece of an rrdtool graph statement (see the example after this list). Typically, this will include
- the (reference to the) DEF needed,
- a LINEx/AREA/STACK along with a color for graph elements,
- GPRINTs for legends,
- a (reference to a) CDEF (this is the math) and textual elements.
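Put together, the items above produce an rrdtool graph statement roughly like this (file, variable names, and color values are hypothetical):
rrdtool graph traffic.png --start -86400 --title "Traffic" \
    DEF:inoctets=host1_traffic_in_42.rrd:traffic_in:AVERAGE \
    CDEF:inbits=inoctets,8,* \
    AREA:inbits#00CF00:"Inbound" \
    GPRINT:inbits:LAST:"Current\: %8.2lf %s"
The CDEF is the math (octets to bits), the AREA draws the graph element, and the GPRINT prints the legend value.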
4. Create a device template, which is a collection of graph templates and/or data source templates meant to be applied to a device as a whole (useful when one source yields multiple graph items, like storage volume information).
The graph is the result of a real rrdtool graph statement, created when a data template is applied to a device. The simple way to apply it is via a device template, or by manually selecting graph templates and/or data templates for a device.
Possible operational issues in a large enterprise environment
- The entire process is complex. Creating a device template (for example, one that produces the full collection of graphs for a Solaris or Linux device) requires real technical knowledge.
- Even when a user selects a device template (one already created by some guru), the graphs are not generated until the user selects the list of graphs to create and applies it. For a big system, like a storage array with a pile of volumes or a switch with a ton of interfaces, that takes some patience.
- Each device needs to be added manually, and on a busy schedule nobody has time to add multiple devices one by one. It would be easier if we could just scan a network and have the devices added automatically.
- Finally, the bigger the list of devices and related services, the more data there is to collect. This becomes a bottleneck because we need graph resolution at a 5 (five) minute interval, and the result is a polling-process overrun in a big enterprise environment.
- Each device, once added, also needs to be placed on the right branch of a graph tree for grouping, to make it easier for viewers to find.
Features to consider
- A system that scans for and detects devices on the network when we supply a subnet.
- A system that applies device templates and creates the related graphs automatically.
- A system that even puts a device on a particular branch of the tree according to a rule.
- A system that scales as the number of devices on the network grows.
- A system that is scalable in the sense that it can be expanded to guarantee a polling/collection cycle (for all devices) of less than 5 (five) minutes, without overlapping polling cycles and without the same device being polled by multiple poller servers.
- A system that provides a summary of various utilization figures (CPU, memory, etc.) for all kinds of devices, like an online inventory of performance.
- A system that can be distributed across multiple small virtual machines.
New Design (Large Scale Distributed Cacti)
How? (the concept)
#1 Master Poller: The main problem with distributed polling is that the poller does not just collect data; it also does other things before and after the polling cycle. A practical example is the set of tasks demanded by other plugins via api_plugin_hook: some must run before the polling cycle and some after.
So if all servers share the same code (and they must), every server would perform the same tasks at almost the same time (every good setup has NTP and time sync, after all).
That forces us to define one poller as the master poller. We chose poller server id “2” as the master: any server that has id 2 in the database table becomes the master poller.
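In code terms the idea is just a guard on the poller id; a minimal sketch (the full versions appear under Rule #2 and Rule #3 below):
if ($poller_server_id['id'] == 2) {
    api_plugin_hook('poller_top');     // pre-polling plugin tasks, master only
}
/* ... every poller collects its own share of data here ... */
if ($poller_server_id['id'] == 2) {
    api_plugin_hook('poller_bottom');  // post-polling plugin tasks, master only
}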
#2 One-time RRD update: Next, each poller runs spine (a compiled binary instead of a script) to collect data quickly. We keep those responses in a MEMORY table so the poller can move on to the next item quickly. But if each poller started updating the RRD files on the NFS share after every polling cycle, it would waste its time managing several thousand file handles (every access to the file system slows things down a bit) and would be difficult to control and debug.
So we use Boost, and api_plugin_hook executes on the master poller only. That means all updates from the poller cache are flushed to the RRD files centrally, on the master server only, after each polling cycle.
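On the storage side this comes down to one statement; a sketch, assuming the poller_output_boost table named in the Boost settings below (your Boost version may already create it this way):
-- keep spine's raw output in RAM so pollers never block on disk;
-- Boost later drains this table into the RRD files on the master only
ALTER TABLE poller_output_boost ENGINE=MEMORY;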
#3 Managing time: We only have 5 (five) minutes to do all the tasks. So, to give the master poller some time to execute its pre- and post-polling tasks, we decided that actual data collection must not exceed 4 minutes.
The master poller therefore cleans up the old poller cache and waits a few seconds to let the others start polling (NTP may not be that accurate sometimes).
The master poller starts the post-polling hook only after 3 ½ minutes of the polling cycle have passed, and not before. If it finishes collecting its own data before that time, it must wait. This also means we have to distribute hosts such that every poller completes polling within 4 minutes. After 4 minutes, Boost will have flushed all data to the RRDs, and any data collected after that is ignored and cleaned up before the next polling cycle.
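To make the timing concrete with the numbers from Rule #3 below: if the master's own run takes 120 seconds, then 300 - 120 = 180 seconds remain in the cycle; since 180 > 90, it sleeps 180 - 90 = 90 seconds, so poller_bottom fires at the 210-second (3 ½ minute) mark and Boost has the final 90 seconds of the cycle for the RRD flush.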
#4 Who polls what: The next job is to distribute devices/hosts among the pollers. This is required so that the data belonging to a particular device is never polled in more than one place; duplicate data in the poller cache, extra traffic to the device, and so on are too many complications.
I found that someone had already worked on a “Multipoller Server” plugin, but it is incomplete. I took that code and used its device/host distribution functionality after some modification. Not a bad place to start; it now provides the basic functionality.
What changed technically?
Install the following as per standard documented process:
- Apache/PHP and MySQL on a Linux system with a time server/NTP client.
- Cacti version 0.8.8a, Plugin Architecture version 3.1, configured with Spine polling every 5 minutes.
- Settings 7
- Boost 1, with the Poller configured to “Enable direct population of poller_output_boost table by spine” and with “Enable Boost Server” turned on.
- Hmib 1.4
- Autom8 0.35
- Aggregate 0.75
- Discovery 1.5
- Multipollerserver 0.2.2a
- Create an NFS share on a storage system and mount it as the RRD path.
- Create another NFS share on a storage system and mount it as the Cacti log path (see the example mounts below).
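For illustration (the storage host and export names are hypothetical; /var/www/cacti matches the cron entry later in this post), the corresponding /etc/fstab lines on every server might be:
# both NFS shares must be mounted identically on every poller server
storage01:/vol/cacti_rra   /var/www/cacti/rra   nfs   rw,hard,intr   0 0
storage01:/vol/cacti_log   /var/www/cacti/log   nfs   rw,hard,intr   0 0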
Once everything is ready, clone it to a VM and create a template to fulfil future poller server requirements. Remember to configure the database credentials on the clone to connect to the MySQL database of the primary server (configure the database connection with the FQDN, not as localhost), and also remember to configure the same NFS mounts for the RRD path and the Cacti log path. That is, all RRD files and Cacti logs from the different servers should go to the same place. Then shut down and disable the MySQL service/daemon and the Apache httpd service/daemon on the cloned image before you create the poller server template.
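On the clone this boils down to include/config.php. A minimal sketch, with hypothetical credentials, pointing at the primary server named in the next paragraph:
<?php
/* all pollers share the one central database on the primary server */
$database_type     = "mysql";
$database_default  = "cacti";
$database_hostname = "cacti-poller01.example.com"; /* FQDN, never localhost */
$database_username = "cactiuser";                  /* hypothetical */
$database_password = "secret";                     /* hypothetical */
$database_port     = "3306";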
Now, if your primary server is cacti-poller01 and you create two VMs to work as poller servers named cacti-poller02 and cacti-poller03, we need to change some code to get it working, as follows. Please register all servers in DNS and ping the FQDN from each.
The code will be the same across all servers. The only difference is that a poller only runs a scheduled cron job and no other related service/daemon.
You can modify the files on one server and then distribute/copy them to the others. It is better to disable all related services on the master server while you edit.
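One way to do the copy, as a sketch (host name from the example above; the rra and log directories are excluded because they are the shared NFS mounts):
# push the edited Cacti tree from the master to a poller
rsync -av --exclude=rra/ --exclude=log/ /var/www/cacti/ cacti-poller02:/var/www/cacti/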
File: poller.php:
Rule #1 The poller server name is just the host name, NOT the FQDN:
/* Check the poller server */
$server_remote_name = gethostname();
/* keep only the short host name: strip everything from the first dot onward */
$server_remote_name = strtok($server_remote_name, '.');
Rule #2 The master poller, the one with id=2, does the extra tasks for the plugins:
// Only the master poller does the pre-polling routine
if ($poller_server_id['id'] == 2) {
    // Clean up the poller cache
    $issues = db_fetch_assoc("SELECT DISTINCT poller_output.local_data_id, poller_output.rrd_name, poller_item.poller_id
        FROM poller_output
        INNER JOIN poller_item ON poller_item.local_data_id = poller_output.local_data_id
        WHERE poller_item.poller_id = " . $poller_server_id['id']);
    $count = db_fetch_cell("SELECT COUNT(*) FROM poller_output");
    if (sizeof($issues)) {
        cacti_log("INFO: Pollerserver: " . gethostname() . " Poller Output Table Clean Up. Issues Found: $count", TRUE, "POLLER");
        db_execute("TRUNCATE TABLE poller_output");
    }
    // Run the other plugins' pre-polling tasks, then wait for the other pollers to catch up.
    api_plugin_hook('poller_top');
    sleep(2);
} else {
    sleep(1); // delay a second in case another poller's clock is slightly ahead
} // end routine for master poller
Rule #3 The master poller waits until 3 ½ minutes of the polling cycle have passed before starting the other (post-polling) tasks:
// for the master poller
if ($poller_server_id['id'] == 2) {
    // give the other pollers some time to complete
    $time_remain  = 300 - $loop_time;
    $time_to_wait = $time_remain - 90;
    if ($time_remain > 90) {
        cacti_log("INFO: Pollerserver: " . $server_remote_name . " Waiting on Master for " . $time_to_wait . " seconds", TRUE, "POLLER");
        sleep($time_to_wait);
    }
    // process the other (post-polling) commands
    api_plugin_hook('poller_bottom');
} // if poller master
Understanding the log
The comprehensive log of all pollers can be viewed from System Utilities -> View Cacti Log File.
The event at the top is the most recent one. Every poller logs the number of hosts and data sources it probed and in how many seconds (it must not show any RRDsProcessed, as that work was separated out and given to Boost for a single round of RRD updates at the end of polling).
Only the master poller adds three more pieces of information. First, it logs how long it will wait to pass the 3 ½ minute mark. Second, it logs how many RRD files it updated at the end of polling using Boost. Third, it logs how many rows of issues it cleaned out of the poller cache (this should always be at least 1 less than the Boost update count, as summary data is not polled but generated). So you can verify that Boost writes the RRDs only after all pollers have finished; if you see a poller finishing its polling after the Boost update, reduce the number of hosts on that faulting poller by moving some to a less loaded poller, or add more poller servers and move some hosts to the new one. Remember, the load we are measuring here is the polling time reported in the log.
Example of log:
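An illustrative stats line, with made-up numbers, in the format Cacti writes:
06/16/2014 10:03:33 AM - SYSTEM STATS: Time:208.7 Method:spine Processes:1 Threads:16 Hosts:160 HostsPerProcess:160 DataSources:5120 RRDsProcessed:0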
Always watch for WARNING: entries in the log. The Host ID is normally linked to the actual host, so you can go to the device definition using that link or the host name reported. Errors in polling must be rectified, or the faulting host must be removed, to reduce wasted polling time.
Example of Warning:
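An illustrative warning of the kind spine emits (address and timing values are made up):
06/16/2014 10:01:12 AM - SPINE: Poller[0] WARNING: SNMP timeout detected [500 ms], ignoring host '10.10.20.5'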
Weekend jobs:
Job#1 Scan given subnets (Crontab entry: 0 0 * * 6 cacti /usr/bin/php /var/www/cacti/plugins/discovery/findhosts.php)
File: plugins/discovery/findhosts.php
The host description needs the host name (this is code I added):
// set the description from DNS, sysName, or the raw hostname, in that order of preference
// ($snmp_sysName is assumed to hold the sysName already split on dots)
$resolved_hostname = gethostbyaddr($ip);
if ($snmp_sysName[0] != '') {
    $description = strtolower($snmp_sysName[0] . '.' . $snmp_sysName[1] . '.' . $snmp_sysName[2] . '.' . $snmp_sysName[3]);
} else {
    $description = strtolower($device['hostname']);
}
// gethostbyaddr() returns the IP unchanged on failure, so test against $ip as well
if ($resolved_hostname != "" && $resolved_hostname != $ip) {
    $description = $resolved_hostname;
}
// done setting the hostname as the description
Adding subnets to scan (restricted to weekends only and run from cron):
Go to the Settings menu and the Misc tab.
Scroll down to the Discovery plugin section and add the required IP/subnet ranges.
Things to do after the weekend scan (Monday):
#1. Assign a poller to the discovered hosts
1.a Go to the Multi-Pollerserver menu from the Console.
Note the number of hosts already assigned to each poller to estimate the load, then select ‘Choose Devices..’.
1.b Sort by the pollerserver column (just click on it) to get all hosts with NONE as their poller, or search/filter if you know the names.
1.c Select the required devices/hosts using the check-box at the right of each row.
1.d Once selected, go to the bottom right of the page, select the poller to assign, and click ‘Go’.
Repeat this process, going through the Next pages, until all required hosts are assigned to a poller.
#2. It is recommended that you remove the subnet from scanning (from the Misc ‘Discovery’ settings) once the required hosts have been added.