Health tips for Unix systems

Things to watch and tools to use to make sure your Unix/Linux systems maintain their health and vigor

Periodic health checks can help ensure that your Unix systems are going to be available when they're needed most. In this post, we're going to look at some aspects of performance that should be included in your system check-ups, along with handy commands that can provide especially useful insights.

CPU load

Probably the most obvious health check for a Unix/Linux system is to take a look at its CPU load -- the heartbeat of the system. A healthy system will have CPU power to spare, and one of the best commands for giving you a quick and easy view of how hard your CPU is working is the top command. There are several measurements worth focusing on when you use it.

Providing a lot of information on your system's performance, top manages to be surprisingly concise in how it displays the measurements that it reports. In particular, the load average measurements can give you a clear view of how busy the CPU is. The load average counts processes that are running or waiting for their turn on a processor, so it tells you how hard the system is working to keep up with demand. A load average of 0.50 means that, on average, half a process was runnable at any given moment -- roughly, the CPU had work queued up half the time. The three figures provided show the load averages over the last one, five, and fifteen minutes -- so you get some perspective and can also get a feel for whether the load is getting heavier or lighter. Once these numbers climb toward the number of CPU cores on the system -- 1.00 per core, especially in the fifteen-minute average -- the system is saturated. If the load persists above that level, performance will be noticeably poor. (On Linux, processes stuck in uninterruptible disk waits are counted as well, so heavy I/O can inflate the numbers.) But, again, we're only looking at 15 minutes' worth of data at most.
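If you'd rather check this from a script than interactively, the same load averages are exposed in /proc/loadavg on Linux. Here's a minimal sketch that compares the 1-minute figure against the core count (the "saturated"/"ok" labels are just illustrative choices):

```shell
# Read the 1-minute load average and compare it to the number of CPU cores.
# A sustained load at or above the core count suggests CPU saturation.
read load1 _ < /proc/loadavg
cores=$(nproc)
awk -v l="$load1" -v c="$cores" \
    'BEGIN { printf "load %.2f on %d cores: %s\n", l, c,
             ((l + 0 >= c + 0) ? "saturated" : "ok") }'
```

A check like this is easy to drop into a cron job that mails you when the label flips to "saturated".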

The top command also displays the total number of processes (196 in the listing below, only one of which is actually running) and usage stats for both memory and swap space. On the system displayed below, swap is not being used at all. In fact, looking at the third line, you'll see that the CPU is idle more than 99% of the time. This system is obviously only lightly used.

The memory and swap stats are shown on the fourth and fifth lines of top's output. With no swap in use and significant free memory, this system is clearly having an easy day -- at least a very easy 15 minutes.

If there were any processes dominating the CPU, we'd see them in the list of tasks shown after the five summary lines. By default, top ranks its process list in order of CPU usage (highest first).

top - 20:47:17 up  4:25,  3 users,  load average: 0.54, 0.15, 0.05
Tasks: 196 total,   1 running, 195 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2017064 total,   662924 free,   448904 used,   905236 buff/cache
KiB Swap:  3635904 total,  3635904 free,        0 used.  1091240 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1775 shs       20   0  273460  79668  48676 S   0.3  3.9   0:49.03 compiz
 3811 shs       20   0    9944   3640   3092 R   0.3  0.2   0:00.52 top
    1 root      20   0   27360   6592   5132 S   0.0  0.3   0:02.48 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+
    6 root      20   0       0      0      0 S   0.0  0.0   0:00.02 ksoftirqd/0
    7 root      20   0       0      0      0 S   0.0  0.0   0:00.49 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/0
   10 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+
   11 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   14 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 watchdog/1
   15 root      rt   0       0      0      0 S   0.0  0.0   0:00.12 migration/1
   16 root      20   0       0      0      0 S   0.0  0.0   0:00.03 ksoftirqd/1
   18 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:+
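You don't have to run top interactively to capture this view; batch mode makes it easy to grab the summary lines from a script or cron job. For example:

```shell
# Run top once in batch mode (-b, -n1) and keep only the five summary lines:
# uptime/load, task counts, CPU breakdown, memory, and swap.
top -bn1 | head -n 5
```

Redirect that output to a dated log file and you have a lightweight record of load over time.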

Using the sar command, you can get an idea of whether what you see in your top output has held true over a considerably longer period of time. In the example below, sar has been collecting data every ten minutes for almost an hour and a half.

stinkbug# sar
Linux 4.10.0-19-generic (stinkbug)      05/08/2017      _i686_  (2 CPU)

19:32:20     LINUX RESTART      (2 CPU)

07:35:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:45:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.80
07:55:01 PM     all      0.14      0.00      0.02      0.02      0.00     99.82
08:05:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.81
08:15:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.80
08:25:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.82
08:35:01 PM     all      0.14      0.00      0.02      0.02      0.00     99.83
08:45:01 PM     all      0.22      0.00      0.06      0.05      0.00     99.67
08:55:01 PM     all      0.55      0.00      0.70      2.70      0.00     96.05
Average:        all      0.21      0.00      0.11      0.36      0.00     99.33

In this example, it's clear that this system is consistently only lightly used.

One of the key benefits of sar is that it can collect information around the clock, so that you can see how your system is performing even when you're not available to look. You can also use it to look at how the system is running right now. In the example below, we're asking for three data samples, taken 5 seconds apart.

stinkbug# sar -u 5 3
Linux 4.10.0-19-generic (stinkbug)      05/08/2017      _i686_  (2 CPU)

09:04:09 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:04:14 PM     all      0.20      0.00      0.20      0.00      0.00     99.60
09:04:19 PM     all      0.10      0.00      0.20      0.00      0.00     99.70
09:04:24 PM     all      0.20      0.00      0.10      0.00      0.00     99.70
Average:        all      0.17      0.00      0.17      0.00      0.00     99.67

Both the top and sar commands shown above provide data on how the system's CPU is spending its time. While idle 99% or more of the time, the CPU on this system still spends a small amount of time running user processes ("%user" or "us") and handling system tasks ("%system" or "sy"). On a busy system, these numbers can help you determine where the CPU time is going.

Memory Usage

To look just at memory and swap space, the free command is the most convenient one to use. It will display the same variety of data that top provides, but just the memory stats.

stinkbug# free
              total        used        free      shared  buff/cache   available
Mem:        2017064      449024      662756      209204      905284     1091120
Swap:       3635904           0     3635904

The take-homes for this system are that swap space is not being used and a good amount of memory is free and available (nearly 1/3 of it not in use).
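For scripted checks, the same numbers free reports come straight from /proc/meminfo, which you can parse directly. A minimal sketch that reports how much memory is still available:

```shell
# Report MemAvailable as a percentage of MemTotal
# (both fields are in kB in /proc/meminfo).
awk '/^MemTotal:/    { t = $2 }
     /^MemAvailable:/ { a = $2 }
     END { printf "%.0f%% of memory available\n", a / t * 100 }' /proc/meminfo
```

MemAvailable is the better field to watch than MemFree, since it accounts for cache memory the kernel can reclaim on demand.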

Paging and swapping

When the memory on a system is in high demand, the system has to use paging and swapping -- the processes that move process data out of memory and off to the swap device and back when needed. This allows the system to behave as if it has more physical memory than it does, but comes at some cost in terms of performance. A system that is doing a lot of swapping will likely slow down considerably. The columns to focus on are si (kilobytes of memory swapped in from disk per second) and so (kilobytes of memory swapped out to disk per second). These numbers are all 0 in the example below, but imagine them populated with two- or three-digit values.

stinkbug# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 663184 132276 772964    0    0    15     2   16   53  0  0 99  0  0
stinkbug# vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 662936 132284 772996    0    0    15     2   16   53  0  0 99  0  0
 0  0      0 662928 132284 772996    0    0     0     0   27   52  0  0 100  0  0
 0  0      0 662928 132284 772996    0    0     0     1   28   55  0  0 100  0  0
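To watch for swapping from a script, you can have awk flag any sample in which si or so (columns 7 and 8 in the output above) is nonzero. A sketch:

```shell
# Sample vmstat three times, five seconds apart, skipping the two header
# lines, and flag any interval with swap-in (si) or swap-out (so) activity.
vmstat 5 3 | awk 'NR > 2 && ($7 + $8) > 0 \
                  { print "swapping detected: si=" $7 " so=" $8 }'
```

On a healthy system this prints nothing; any output is worth investigating.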

Disk IO

The iostat command (particularly iostat -x) is useful for observing device input/output loading. Sometimes this information is used to justify changing the system configuration to better balance the load between devices. To make use of this information, you have to be able to translate the space-saving acronyms hovering over the device measurements -- like rrqm/s and rkB/s.

rrqm/s, wrqm/s -- number of merged read and write requests queued per second
r/s, w/s -- number of read and write requests per second
rkB/s -- number of kilobytes read from the device per second
wkB/s -- number of kilobytes written to the device per second
avgrq-sz -- average request size (in sectors)
avgqu-sz -- number of requests waiting in the device’s queue
await -- average time (milliseconds) for I/O requests to be served
r_await, w_await -- average time (milliseconds) for read and write requests to be served
svctm -- average service time (milliseconds) per request; this field is considered unreliable and is deprecated in newer versions of iostat
%util -- percentage of elapsed time during which I/O requests were issued to the device; values approaching 100% mean the device is busy nearly all the time

Of these, avgqu-sz is one of the most important. A consistently low value generally indicates that requests aren't backing up and that your system is not heavily loaded.

stinkbug# iostat -x 5 3
Linux 4.10.0-19-generic (stinkbug)      05/08/2017      _i686_  (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.24    0.01    0.09    0.33    0.00   99.33

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.67     0.17    1.75    0.11    29.38     3.46    35.45     0.02   10.78    8.94   41.38   2.93   0.54

Disk space

Disks can fill up fast depending on what's happening on a system. Be aware of disks that might be getting close to filling up. I've often set up systems that I managed to send me warnings when the used space reached particular thresholds -- like 75% full, 90% full, and 98% full. In the example below, we see a couple of disks that are getting close.

dragonfly# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             78361192  23185840  51130588  32% /
/dev/sda2             24797380  22273432   1243972  95% /home
/dev/sda3             29753588  25503792   2713984  91% /data
/dev/sda4               295561     21531    258770   8% /boot
tmpfs                   257476         0    257476   0% /dev/shm
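A simple version of that kind of threshold warning can be scripted with df and awk. This sketch (the 90% threshold is just an example value) prints a warning line for each filesystem at or above the limit:

```shell
# Warn about any filesystem at or above the usage threshold (90% here).
# df -P guarantees one line per filesystem; $5 is the Use% column.
threshold=90
df -P | awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= t) print "WARNING:", $6, "is", use "% full"
}'
```

Run from cron and piped to mail, a check like this covers the 75/90/98% escalation described above with three different threshold values.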

The hardware

Don't depend on the command line to tell you everything you need to know to ensure that the systems you manage are in good shape. Check them from time to time in person. Look for warning lights and fans that might not be working as well as expected. Make sure that critical systems are plugged into UPS devices whenever possible.

Backups

Also remember that usable backups are an important part of system health. A system that cannot be fully resuscitated after a data disaster is not in good shape. Check your backups regularly to ensure that they are usable.

Wrap-up

Being proactive can help you ward off system problems long before they threaten operations. Periodic health checks can also help you to be familiar with how a system is generally performing and this can help you recognize when a system is undergoing an unusual problem.
