In the real estate world, the mantra is location, location, location. In the network and server administration world, the mantra is visibility, visibility, visibility. If you don't know what your network and servers are doing at every second of the day, you're flying blind. Sooner or later, you're going to meet with disaster.
Fortunately, many good tools, both commercial and open source, are available to shine much-needed light into your environment. Because good and free always beat good and costly, I've compiled a list of my favorite open source tools that prove their worth day in and day out in networks of any size. From network and server monitoring to trending, graphing, and even switch and router configuration backups, these utilities will see you through.
First, there was MRTG. Back in the heady 1990s, Tobi Oetiker saw fit to write a simple graphing tool built on a round-robin database scheme that was perfectly suited to displaying router throughput. MRTG begat RRDTool, which is the self-contained round-robin database and graphing solution in use in a staggering number of open source tools today. Cacti is the current standard-bearer of open source network graphing, and it takes the original goals of MRTG to whole new levels.
Cacti is a LAMP application that provides a complete graphing framework for data of nearly every sort. In some of my more advanced installations of Cacti, I collect data on everything from fluid return temperatures in data center cooling units to free space on filer volumes to FLEXlm license utilization. If a device or service returns numeric data, it can probably be integrated into Cacti. There are templates to monitor a wide variety of devices, from Linux and Windows servers to Cisco routers and switches -- basically anything that speaks SNMP. There are also collections of contributed templates for an even greater array of hardware and software.
While Cacti's default collection method is SNMP, local Perl or PHP scripts can be used as well. The framework deftly separates data collection and graphing into discrete instances, so it's easy to rework and reorganize existing data into different displays. In addition, you can easily select specific timeframes and sections of graphs simply by clicking and dragging. In some of my installations, I have data going back several years, which proves invaluable when determining if current behavior of a network device or server is truly anomalous or, in fact, occurs regularly.
Using the PHP Network Weathermap plug-in for Cacti, you can easily create live network maps showing link utilization between network devices, complete with graphs that appear when you hover over a depiction of a network link. In many places where I've implemented Cacti, these maps wind up running 24/7 on 42-inch LCD monitors mounted high on the wall, providing the IT staff with at-a-glance updates on network utilization and link status.
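The utilization figure behind a map like this is typically derived from two successive readings of an interface octet counter. What follows is a minimal sketch in Python, assuming 32-bit counters of the kind SNMP exposes; the function name and sample values are hypothetical, and this is not a description of how Weathermap itself is implemented.

```python
def link_utilization(prev_octets, curr_octets, interval_s, speed_bps, counter_bits=32):
    """Return link utilization as a percentage between two counter readings.

    prev_octets / curr_octets: octet counter values from two polls
    interval_s: seconds between the polls
    speed_bps: nominal link speed in bits per second
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction copes with a single counter wrap between polls.
    delta_octets = (curr_octets - prev_octets) % modulus
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / speed_bps
```

For a 100Mbps link polled every 300 seconds, a counter delta of 3,187,500,000 octets works out to 85 percent. On faster links a 32-bit counter can wrap more than once between polls, which this sketch cannot detect, so 64-bit counters are the safer choice there.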
Cacti is an extensive performance graphing and trending tool that can be used to track nearly any monitored metric that can be plotted on a graph. It's also infinitely customizable, which means it can get complex in places.
Nagios is a mature network monitoring framework that's been in active development for many years. Written in C, it's almost everything that system and network administrators could ask for in a monitoring package. The Web GUI is fast and intuitive, and the back end is extremely robust.
As with Cacti, a very active community supports Nagios, and plug-ins exist for a massive array of hardware and software. From basic ping tests to integration with plug-ins like WebInject, you can constantly monitor the status of servers, services, network links, and basically anything that speaks IP. I use Nagios to monitor server disk space, RAM and CPU utilization, FLEXlm license utilization, server exhaust temperatures, and WAN and Internet link latency. It can be used to ensure that Web servers are not only answering HTTP queries, but that they're returning the expected pages and haven't been hijacked, for example.
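As a rough illustration of the "expected pages" check, here is a sketch of the logic a custom plug-in might use, written in Python. It follows the common Nagios plug-in convention of status codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN) plus a one-line message; the function names are hypothetical, and a real plug-in would print the message and exit with the returned code.

```python
import urllib.error
import urllib.request

# Conventional Nagios plug-in status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_page(status_code, body, expected_text):
    """Decide the check result from an HTTP status and page body."""
    if status_code != 200:
        return CRITICAL, f"CRITICAL - HTTP status {status_code}"
    if expected_text not in body:
        # The server answered, but not with the page we expect.
        return CRITICAL, "CRITICAL - expected text not found in page"
    return OK, "OK - page returned expected text"


def run_check(url, expected_text, timeout=10):
    """Fetch the URL and return a (status code, message) pair."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_page(response.status, body, expected_text)
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx responses; treat them as a failed check.
        return evaluate_page(exc.code, "", expected_text)
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, f"CRITICAL - cannot fetch {url}: {exc}"
    except ValueError as exc:
        return UNKNOWN, f"UNKNOWN - bad URL: {exc}"
```

Keeping the decision logic in `evaluate_page` separate from the network call makes the check easy to test without a live Web server.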
Network and server monitoring is obviously incomplete without notifications. Nagios has a full email/SMS notification engine and an escalation layout that can be used to make intelligent decisions on whom to notify and when, which can save plenty of sleep if used correctly. In addition, I've integrated Nagios notifications with Jabber, so the instant an exception is thrown, I get an IM from Nagios detailing the problem in addition to an SMS or email, depending on the escalation settings for that object. The Web GUI can be used to quickly suspend notifications or acknowledge problems when they occur, and it can even record notes entered by admins.
As if this weren't enough, a mapping function displays all the monitored devices in a logical representation of their placement on the network, with color-coding to show problems as they occur.
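The escalation idea can be sketched in a few lines of Python. This is a simplified model loosely based on the notification-number ranges used in Nagios escalation definitions, not Nagios code; the class, function, and contact names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    first_notification: int
    last_notification: int  # 0 is treated as "no upper limit"
    contacts: tuple


def contacts_for(notification_number, default_contacts, escalations):
    """Return who should receive the Nth notification for a problem."""
    matched = []
    for esc in escalations:
        in_range = esc.first_notification <= notification_number and (
            esc.last_notification == 0
            or notification_number <= esc.last_notification
        )
        if in_range:
            for contact in esc.contacts:
                if contact not in matched:
                    matched.append(contact)
    # If no escalation applies, fall back to the object's normal contacts.
    return matched or list(default_contacts)
```

With one escalation covering notifications 3 through 5 for the on-call engineer and another from 6 onward that adds a manager, the first two alerts go to the regular admins and later ones climb the ladder.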
The downside to Nagios is the configuration. The config is best done via command line and can present a significant learning curve for newbies, though folks who are comfortable with standard Linux/Unix config files will feel right at home. As with many tools, the capabilities of Nagios are immense, but the effort to take advantage of some of those capabilities is equally significant.
Don't let the complexity discourage you -- Nagios has saved my bacon more times than I can possibly recall. The benefits of the early-warning systems provided by this tool for so many different aspects of the network cannot be overstated. It's easily worth your time and effort.
Icinga started out as a fork of Nagios, but has recently been rewritten as Icinga 2. Both versions are under active development and available today, and Icinga 1.x is backward-compatible with Nagios plug-ins and configurations. Icinga 2 has been developed to be smaller and sleeker, and it offers distributed monitoring and multithreading frameworks that aren't present in Nagios or Icinga 1. You can migrate from Nagios to Icinga 1 and from Icinga 1 to Icinga 2.
Like Nagios, Icinga can be used to monitor anything that speaks IP, as deep as you can go with SNMP and custom plug-ins and add-ons.
There are several Web UIs for Icinga, and one major differentiator from Nagios is the configuration, which can be done via the Web UI rather than through configuration files. For those who'd rather manage their configurations outside of the command line, this is a significant benefit.
Icinga integrates with a variety of graphing and monitoring packages such as PNP4Nagios, inGraph, and Graphite, providing solid performance visualizations. Icinga also has extended reporting capabilities.
If you've ever had to search for a device on your network by telnetting into switches and doing MAC address lookups, or you simply wish you could tell where a certain device is physically located (or, perhaps more important, where it was located), then you should take a good look at NeDi.
NeDi is a LAMP application that regularly walks the MAC address and ARP tables on your network switches, cataloging every device it discovers in a local database. It's not as well-known as some other projects, but it can be a very handy tool in corporate networks where devices are moving around constantly.
You can log into the NeDi Web GUI and conduct searches to determine the switch, switch port, or wireless AP of any device by MAC address, IP address, or DNS name. NeDi collects as much information as possible from every network device it encounters, pulling serial numbers, firmware and software versions, current temps, module configurations, and so forth. You can even use NeDi to flag MAC addresses of devices that are missing or stolen. If they appear on the network again, NeDi will let you know.
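To make the lookup idea concrete, here is a small Python sketch of a MAC-to-port catalog with a watch list. It is an illustration only, not NeDi's actual schema or code; the class name, switch names, and port names are hypothetical.

```python
import re

_MAC_RE = re.compile(r"[0-9a-f]{12}")


def normalize_mac(mac):
    """Accept 00:1A:2B:.., 00-1a-2b-.., or 001a.2b3c.4d5e and return aa:bb:.. form."""
    digits = re.sub(r"[:.\-\s]", "", mac).lower()
    if not _MAC_RE.fullmatch(digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


class DeviceLocator:
    """Remember where each MAC address was last seen and flag watched ones."""

    def __init__(self):
        self._last_seen = {}
        self._flagged = set()

    def flag(self, mac):
        """Mark a MAC address as missing or stolen."""
        self._flagged.add(normalize_mac(mac))

    def record(self, mac, switch, port):
        """Store a sighting; return True if the MAC is on the watch list."""
        key = normalize_mac(mac)
        self._last_seen[key] = (switch, port)
        return key in self._flagged

    def locate(self, mac):
        """Return (switch, port) for the last sighting, or None."""
        return self._last_seen.get(normalize_mac(mac))
```

Normalizing addresses first matters because different vendors print MAC addresses in different notations.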
Discovery runs from cron at set intervals. Configuration is straightforward, with a single config file that allows for a significant amount of customization, including the ability to skip devices based on regular expressions or network-border definitions. You can even include seed lists of devices to query if the network is separated by undiscoverable boundaries, as in the case of an MPLS network. NeDi usually uses Cisco Discovery Protocol or Link Layer Discovery Protocol, discovering new switches and routers as it rolls through the network, then connecting to them to collect their information. Once the initial configuration has been set, running a discovery is fairly quick.
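The crawl described above amounts to a breadth-first walk from one or more seed devices, with a filter for devices to skip. A minimal Python sketch, assuming a hypothetical `get_neighbors` callable that stands in for a CDP or LLDP neighbor query (this is not NeDi's code):

```python
import re
from collections import deque


def discover(seeds, get_neighbors, skip_pattern=None):
    """Walk the network breadth-first from the seed devices.

    get_neighbors: callable returning neighbor names for a device
                   (a stand-in for a CDP/LLDP query)
    skip_pattern:  optional regular expression; matching devices are
                   neither recorded nor queried
    """
    skip = re.compile(skip_pattern) if skip_pattern else None
    discovered = []
    seen = set()
    queue = deque(seeds)
    while queue:
        device = queue.popleft()
        if device in seen:
            continue
        seen.add(device)
        if skip and skip.search(device):
            continue
        discovered.append(device)
        queue.extend(get_neighbors(device))
    return discovered
```

Passing several seeds covers the case where parts of the network cannot be reached by neighbor discovery alone.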
NeDi integrates with Cacti to some degree, and if provided with the credentials to a functional Cacti installation, device discoveries will link to the associated Cacti graphs for that device.
The Ntop project -- now known as Ntopng, for "next generation" -- has come a long way over the past decade. Call it Ntop or Ntopng, what you get is a top-notch network traffic monitor married to a fast and simple Web GUI. It's written in C and completely self-contained. You run a single process configured to watch a specific network interface, and that's about all there is to it.
Ntop provides easily digestible graphs and tables showing current and past network traffic, including protocol, source, destination, and history of specific transactions, as well as the hosts on either end. You'll also find an impressive array of network utilization graphs, live maps, and trends, along with a plug-in framework for an array of add-ons such as NetFlow and sFlow monitors. There's even the Nbox, a hardware monitor that embeds Ntop.
Ntop even incorporates a lightweight Lua API framework that can be used to support extensions via scripting languages. Ntop can also store host data in RRD files for persistent data collection.
One of the handiest uses of Ntopng is on-the-spot traffic checkups. When one of my Cacti-driven PHP Weathermaps suddenly shows a collection of network links running in the red, I know that those links exceed 85 percent utilization, but I don't know why. By switching to an Ntopng process watching that network segment, I can pull a minute-by-minute report of the top talkers and immediately know which hosts are responsible and what traffic they're pushing.
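A top-talkers report boils down to summing bytes per host and sorting. Here is a short Python sketch of that aggregation, assuming flow records arrive as hypothetical (source, destination, bytes) tuples; it illustrates the idea rather than Ntopng's internals.

```python
from collections import Counter


def top_talkers(flows, n=5):
    """Return the n source hosts that sent the most bytes.

    flows: iterable of (source_ip, destination_ip, byte_count) tuples
    """
    totals = Counter()
    for source, _destination, byte_count in flows:
        totals[source] += byte_count
    return totals.most_common(n)
```

Feeding it the flows seen in the last minute produces the kind of ranked list that points straight at the hosts saturating a link.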
That kind of visibility is invaluable, and it's very easy to come by. Essentially, you can run Ntopng on any interface that's been configured at the switch level to monitor another port or VLAN. That's it.
Zabbix is a full-scale network- and system-monitoring tool that combines several functions into a single Web-based console. It can be configured to monitor and collect data from a wide variety of servers and network gear, offering service and performance monitoring of each object.
Zabbix works with agents running on monitored systems, though it can also run agentless using SNMP or other monitoring methods such as remote checks on open services like SMTP and HTTP. It explicitly supports VMware and other virtualization hypervisors, producing in-depth data on hypervisor performance and activity. Special attention is also paid to monitoring Java application servers, Web services, and databases.
Hosts can be added manually or through an autodiscovery process. An extensive set of default templates applies to the most common use cases, such as Linux, FreeBSD, and Windows servers; well-known services such as SMTP and HTTP; and ICMP and IPMI devices for in-depth hardware monitoring. In addition, custom checks written in Perl, Python, or nearly any language can be integrated into Zabbix.
Zabbix also offers customizable dashboards and Web UI displays to focus attention on your most critical components. Notifications and escalations can draw on customizable actions that can be applied to hosts or groups of hosts. Actions can even be configured to trigger remote commands, so a script can be run on a monitored host if certain event criteria are observed.
Zabbix graphs performance data such as network throughput and CPU utilization, and it collects those graphs in customizable displays. Further, Zabbix supports customizable maps, screens, and even slideshows that display the current status of monitored devices.
Zabbix can be daunting to implement initially, but prudent use of templates and autodiscovery can ease the integration hassles. In addition to an installable package, Zabbix is available as a virtual appliance for several popular hypervisors.
Observium is a network and host monitor that can scan ranges of addresses for systems to monitor using common SNMP credentials. Packaged as a LAMP application, Observium is relatively easy to set up and configure, requiring the usual installations of Apache, PHP, and MySQL, database creation, Apache configuration, and the like. It is designed to be installed as its own server with a dedicated URL, rather than under a larger Web tree.
From there, you can log into the GUI and start adding hosts and networks, as well as autodiscovery ranges and SNMP data to have Observium crawl around the network and gather data on each system discovered. Observium can also discover network devices via CDP, LLDP, or FDP, and host agents can be deployed to Linux systems to aid in data collection.
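Range-based autodiscovery can be pictured as expanding each address range and probing every host in it. A minimal Python sketch, assuming a hypothetical `probe` callable that stands in for an SNMP query with the shared credentials; it does not reflect Observium's implementation.

```python
import ipaddress


def discover_hosts(cidr_ranges, probe):
    """Return the addresses in the given ranges that answer the probe.

    cidr_ranges: iterable of networks such as "192.0.2.0/29"
    probe:       callable taking an IP string and returning True or False
                 (a stand-in for an SNMP request)
    """
    found = []
    seen = set()
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # hosts() skips the network and broadcast addresses.
        for address in network.hosts():
            ip = str(address)
            if ip in seen:
                continue
            seen.add(ip)
            if probe(ip):
                found.append(ip)
    return found
```

A real scan would probe hosts in parallel and with timeouts, since walking a large range one address at a time is slow.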
All of this data is presented in an easily navigated user interface that provides a multitude of statistics, charts, and graphs. This includes everything from ping and SNMP response times to graphs of IP throughput, fragmentation, packet counts, and so forth. Depending on the device, this data will be available for every port discovered and include an inventory of modular devices.
For servers, Observium will display CPU, RAM, storage, swap, temperature, and event log status. You can incorporate data collection and performance graphing on services as well, including Apache, MySQL, BIND, Memcached, Postfix, and others.
Observium plays nice as a VM, so it can quickly become a go-to tool for server and network status information. It's a great way to bring autodiscovery and charting to a network of any size.
Too often, IT administrators think they can't color outside the lines. Whether we're dealing with a custom application or an "unsupported" piece of hardware, many of us believe that if a monitoring tool can't handle it immediately, it can't be handled. That's simply not the case, and with a little bit of elbow grease, almost anything can be monitored, cataloged, and made more visible.
An example might be a custom application with a database back end, like a Web store or an internal finance application. Management wants to see pretty graphs and charts depicting usage data in some form or another. If you're using, say, Cacti already, you have several ways to bring this data into the fold, such as constructing a simple Perl or PHP script to run queries on the database and pass counts back to Cacti or even an SNMP call to the database server using private MIBs (management information bases). It can be done, and it can generally be done easily.
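The paragraph above mentions Perl or PHP; the same idea is sketched here in Python. The script queries the application database and returns the counts as space-separated `name:value` pairs, the form Cacti's script-based data input expects for multiple fields. The `orders` table, its `status` column, and the use of SQLite are all assumptions made for illustration.

```python
import sqlite3


def order_counts(conn):
    """Return usage counts as 'name:value' pairs for a graphing collector.

    Assumes a hypothetical table: orders(id, status).
    """
    total, open_orders = conn.execute(
        # In SQLite, (status = 'open') evaluates to 1 or 0, so SUM counts matches.
        "SELECT COUNT(*), COALESCE(SUM(status = 'open'), 0) FROM orders"
    ).fetchone()
    return f"total:{total} open:{open_orders}"
```

A wrapper script would open the real database connection, print the returned string, and exit; the collector then stores each named value as its own data source.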
If it's unsupported hardware, as long as it speaks SNMP, you can most likely get at the data you need, though it may take a little research. Once you have the right MIBs to query, you can then use that information to write or adapt plug-ins to collect that data. In many cases, you can even integrate your cloud services into this monitoring by using standard SNMP on those instances, or by using an API provided by your cloud vendor. Just because you have cloud services doesn't mean you should trust all your monitoring to your cloud provider. The provider doesn't know your application and service stack as well as you do.
Getting most of these tools running usually isn't much of a challenge. They typically have packages available to download for most popular Linux distributions, if they aren't already in the package list. In some cases, they may come preconfigured as a virtual server. Configuring and tweaking the tools can take quite a while depending on the size of the infrastructure, but getting them going initially is usually a cinch. At the very least, they're worth a test-drive.
No matter which of these tools you use to keep tabs on your infrastructure, it will essentially provide the equivalent of at least one more IT admin -- one that can't necessarily fix anything, but one that watches everything, 24/7/365. The up-front time investment is well worth the effort, no matter which way you cut it. Be sure to run a small set of autonomous monitoring tools on another server, watching the main monitoring server. This is a case where it's always best to ensure the watcher is being watched.
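Watching the watcher can be as simple as checking that the main monitoring server has reported in recently. A minimal Python sketch, assuming the monitoring server records a heartbeat timestamp somewhere the second server can read it; the function name and the five-minute threshold are hypothetical.

```python
import time


def heartbeat_is_stale(last_heartbeat_epoch, max_age_s=300, now=None):
    """Return True if the last heartbeat is older than max_age_s seconds.

    last_heartbeat_epoch: Unix timestamp of the most recent heartbeat
    now: override for the current time, useful in tests
    """
    now = time.time() if now is None else now
    return (now - last_heartbeat_epoch) > max_age_s
```

Run from cron on the second server, a check like this can send its own alert through a path that does not depend on the main monitoring server.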