The problem with wait I/O

Most Linux system monitoring tools, including top, report a CPU statistic called iowait. There seems to be a common misconception about how to interpret this statistic.

The statistic is presented as a percentage of CPU time and this probably leads a lot of people into believing that these are lost CPU cycles. I have witnessed co-workers requesting either more CPUs, faster CPUs or both. But this is not the solution for the observed condition. There might not even be a real problem behind the whole story.

First of all let’s see how the CPU statistics are created. There is no accurate accounting in UNIX for every single CPU cycle. Instead, the accounting uses a sampling approach. At regular intervals a timer interrupt suspends whatever the CPU is currently doing. One of the things happening during the interrupt is the accounting of CPU time. The interrupt handler examines the state of execution the CPU was in when the interrupt occurred. At this level there are only three possible states: executing user code, executing a system call or being idle.

If the CPU was running user code before the interrupt, then the time will be accounted as user time. When the interrupt happened during the execution of a system call, then the time will be declared system time. So where does the iowait come into play? Let’s have a look at the Linux kernel sources to investigate this. Current kernel sources contain the following function in kernel/sched.c:

/*
 * Account for idle time.
 * @cputime: the cpu time spent in idle wait
 */
void account_idle_time(cputime_t cputime) {
  struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
  cputime64_t cputime64 = cputime_to_cputime64(cputime);
  struct rq *rq = this_rq();

  if (atomic_read(&rq->nr_iowait) > 0)
    cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
  else
    cpustat->idle = cputime64_add(cpustat->idle, cputime64);
}

For some people it may be a surprise that the accounting for iowait happens in the same function that measures the idle time. In fact there is only one simple condition that determines if the time will show up as idle or iowait. The condition just checks for the existence of at least one process waiting for I/O in the current run queue.
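The accounting rule described above can be modeled in user space. The following is only a simplified sketch, not kernel code: the names cpu_state, tick_stats and account_tick are made up for illustration, and the nr_iowait parameter stands in for the run queue counter the kernel checks.

```c
/* What the CPU was doing when the timer tick fired (simplified model). */
enum cpu_state { CPU_USER, CPU_SYSTEM, CPU_IDLE };

/* Tick counters, analogous to the user/system/idle/iowait columns. */
struct tick_stats {
    unsigned long user;
    unsigned long system;
    unsigned long idle;
    unsigned long iowait;
};

/*
 * Charge one tick to a bucket. nr_iowait is the number of tasks
 * currently blocked on I/O; it only matters when the CPU is idle.
 */
void account_tick(struct tick_stats *st, enum cpu_state state, int nr_iowait)
{
    switch (state) {
    case CPU_USER:
        st->user++;
        break;
    case CPU_SYSTEM:
        st->system++;
        break;
    case CPU_IDLE:
        if (nr_iowait > 0)
            st->iowait++;   /* idle, but some task is waiting for I/O */
        else
            st->idle++;     /* idle with nothing pending */
        break;
    }
}
```

Note that a tick taken while the CPU is busy never lands in the iowait bucket, no matter how many tasks are waiting for I/O. In this model iowait is only a relabeling of idle ticks.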

So we can clearly see that the time accounted as iowait is really idle time. With more or faster CPUs you will have even more idle CPU time, and if you don’t change your I/O subsystem at the same time you will still observe iowait in the CPU statistics. On the other hand there is an easy way to get rid of your iowait. Just use slower and fewer CPUs.
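A bit of arithmetic makes this effect visible. The sketch below rests on a deliberately crude assumption: a single task on a single CPU that strictly alternates between computing and blocking on I/O, with nothing else running. The function name iowait_percent and its parameters are invented for this example.

```c
/*
 * Toy model (an assumption, not a measurement): one task on one CPU
 * needs cpu_seconds of computation and io_seconds of blocking I/O,
 * never overlapping. cpu_speedup scales only the computation.
 * Returns the share of wall-clock time reported as iowait, in percent.
 */
double iowait_percent(double cpu_seconds, double io_seconds, double cpu_speedup)
{
    double busy = cpu_seconds / cpu_speedup;  /* compute phase shrinks */
    double wall = busy + io_seconds;          /* I/O phase stays the same */

    return 100.0 * io_seconds / wall;
}
```

With 6 seconds of computation and 4 seconds of I/O the model reports 40% iowait. A CPU twice as fast (cpu_speedup = 2.0) finishes the job in 7 seconds instead of 10, yet the reported iowait rises to roughly 57%. A CPU half as fast (cpu_speedup = 0.5) needs 16 seconds and shows only 25% iowait.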

The explanation for this observation is easy. You can’t have iowait if you don’t have any idle time in the system. If you have a machine suffering from iowait you can use the following simple Perl one-liner as a cure:

$ perl -e 'setpriority(0,0,20); while(1) {}'

On a system with more than one CPU or core you might have to start more than one copy depending on the load of the machine. Every copy will use the available CPU cycles by running an endless loop at reduced priority. There won’t be any idle time left and therefore there won’t be any iowait. Terminating the Perl program returns the system to the previous state, where you will witness the iowait again.
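To watch this effect without top, you can sample the kernel counters directly. On Linux the first line of /proc/stat starts with "cpu" and lists cumulative ticks in the order user, nice, system, idle, iowait, followed by further columns. The sketch below parses such a line and computes the iowait share between two samples. The names cpu_times, parse_cpu_line and iowait_share are made up for this example, and ignoring the columns after iowait is a simplification.

```c
#include <stdio.h>

/* Counters from the aggregate "cpu" line of /proc/stat, in clock ticks. */
struct cpu_times {
    unsigned long long user, nice, system, idle, iowait;
};

/*
 * Parse the aggregate "cpu ..." line (not the per-CPU "cpu0 ..." lines).
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_cpu_line(const char *line, struct cpu_times *t)
{
    int n = sscanf(line, "cpu %llu %llu %llu %llu %llu",
                   &t->user, &t->nice, &t->system, &t->idle, &t->iowait);
    return n == 5 ? 0 : -1;
}

/* Share of iowait among the ticks elapsed between two samples, in percent. */
double iowait_share(const struct cpu_times *before, const struct cpu_times *after)
{
    unsigned long long total =
        (after->user - before->user) + (after->nice - before->nice) +
        (after->system - before->system) + (after->idle - before->idle) +
        (after->iowait - before->iowait);

    if (total == 0)
        return 0.0;
    return 100.0 * (double)(after->iowait - before->iowait) / (double)total;
}
```

Reading the first line of /proc/stat with fgets twice, a few seconds apart, and passing both samples to iowait_share should show the iowait share collapsing while the endless loops are running, because those ticks are now counted as nice time.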

So we now have an easy way of getting rid of iowait by taking a simple software action. Or maybe not? Think about it. You start additional programs competing for CPU resources and that will improve your performance by avoiding iowait? Obviously not. Maybe this clarifies that iowait is not a CPU statistic and changing your CPU configuration won’t improve your overall performance. There are some tools available to look for problems with the I/O subsystem but the iowait metric is not among them. I plan to write more about this in future articles.

Looking at other operating systems is interesting. Sun Solaris used to report a wio statistic. Then probably someone at Sun realized that the conclusions drawn from the statistic were mostly wrong. As a result the metric is no longer reported in Solaris 10. So in the case of Solaris you can simply upgrade to Solaris 10 to get rid of all your wio problems. And FreeBSD does not report any iowait statistic as a percentage of CPU time in the first place.