Linux ata hard disk access race condition. WWW.Smythies.com

The problem: Older, pretty slow computer has occasional ATA disk access timeout errors. This issue first appeared very late in the Ubuntu version 12.04 pre-release cycle.
The issue persists with 12.10 and udev 175-0ubuntu13.
The issue persists with 13.04 (Development) and udev 175-0ubuntu17.
Under very intensive disk I/O situations, the system can lock up. Eventually, after about 30 seconds, the system does realize it is frozen and manages to resume. It appears as though the computer is waiting for some data from the disk, but the disk doesn't think it has anything to do. It is possible that my CPU is the problem, being below the minimum server edition specifications (200 Mhz, whereas mininmum spec is 300 Mhz). However, the CPU is largely idle with these tests, as it mostly waits for disk I/O. (O.K., it also does have some pretty busy periods.)

The difference between the problem occuring or not occuring has been isolated down to the udev package:
udev 175-0ubuntu6 rule-based device node and kernel event manager - does not have the problem.
udev 175-0ubuntu7 rule-based device node and kernel event manager - has the problem. (as do all later versions. Tested up to version udev 175-0ubuntu17)
Further, the issue has been isolated to the single line change of launchpad revision 2760 (while revision 2761 edits the same line again, the results are the same).

Specifically: in the file /lib/udev/rules.d/60-persistent-storage.rules, this line:
ACTION=="add", KERNEL=="sr*", ATTR{events_poll_msecs}=="0", ATTR{events_poll_msecs}="2000"
works fine, but when it is changed to this (2760):
ACTION=="add", KERNEL=="sr*", ATTR{events_poll_msecs}=="-1", ATTR{events_poll_msecs}="2000"
or this (2761 and subsequent):
ACTION=="add", ATTR{removable}=="1", ATTR{events_poll_msecs}=="-1", ATTR{events_poll_msecs}="2000"
the problem exists.

Why? The rule is for "sr" or "removeable" devices. The hard disk is an "sd" device. Why does the change result in issues with the hard disk?

The reason is that the single udev rule change that "introduced" this issue, actually just created a new, and more probable, way to demonstrate a pre-existing issue.

The motherboard has two IDE controllers, the primary uses interrupt 14 and the secondary uses interrupt 15. It turns out that those two interrupts do not co-habitat well, and perhaps never did. The single line udev rule change that started this whole saga, also created a steady stream of interrupt 15's, even if the CD-ROM drive was not being used. The hard drive was on the primary IDE controller, using interrupt 14. The CD-ROM drive was on the secondary IDE controller using interrupt 15. Before, the single line udev rule change, there was never an interrupt 15 if the CD-ROM drive was not being used.

So now the question becomes why do the two interrupts not work well together, particularly when one considers that they are so very basic to IDE and PATA systems?

An example error message:

Jul  5 18:00:08 test-smy kernel: [173606.194006] ata1: lost interrupt (Status 0x58)
Jul  5 18:00:08 test-smy kernel: [173606.197866] ata1: drained 65536 bytes to clear DRQ
Jul  5 18:00:08 test-smy kernel: [173606.199340] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul  5 18:00:08 test-smy kernel: [173606.202256] ata1.00: failed command: READ DMA
Jul  5 18:00:08 test-smy kernel: [173606.208481] ata1.00: cmd c8/00:08:e0:0a:b6/00:00:00:00:00/e1 tag 0 dma 4096 in
Jul  5 18:00:08 test-smy kernel: [173606.208502]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jul  5 18:00:08 test-smy kernel: [173606.213384] ata1.00: status: { DRDY }
Jul  5 18:00:08 test-smy kernel: [173606.215564] ata1: soft resetting link
Jul  5 18:00:08 test-smy kernel: [173606.386449] ata1.00: configured for MWDMA2

Note that the first time this error occurs after re-boot, there will also be:

Jul  3 19:36:26 test-smy kernel: [ 6584.055786] Clocksource tsc unstable (delta = 63999858 ns)
Jul  3 19:36:26 test-smy kernel: [ 6584.056302] Switching to clocksource pit

I do not think the clock was actually unstable. I think it got out of sync because of the freeze and then was interpreted as unstable. I have no real proof, though.

Note: Typically this type of error indicates a failing disk or bad cable or other hardware problem. That is not the case here, as I have repeated the problem with multiple disks, cables and power supplies. Also, I can make it appear or disappear at will via version of udev.

Method of test 1: "sudo update-apt-xapian-index --force" results is pretty heavy disk I/O and will occasionally show the issue. As a repeating script:

#!/bin/dash

echo "Hard disk I/O"
date
while [ 1 ];
do
  sudo update-apt-xapian-index --force
  date
  sleep 3.0
done

Method of test 2: (preferred) A program was written to read bytes from random locations within a large file. The result is the disk spending most of its time seeking and will occasionally show the issue.
A program to generate the test file.
The test program.
Example program output.

Workaround: If not required, unplug the CD-ROM (or DVD-ROM). The issue will not occur.

Workaround: Merely change the one line in /lib/udev/rules.d/60-persistent-storage.rules back to what it was. Be sure to do "sudo update-initramfs -u" after changing the rule. (Note: This is not a great workaround, if the CD-ROM drive is ever used, including for new installs from an ISO.)

Workaround: Put the CD-ROM drive as the slave drive on the primary IDE controller, thus eliminating interrupt 15 from ever occuring.

Verification Experiments:

Experiment 1: If it is true that the issue is due to some issue where certain timing of interrupt 15 occurring can cause a missed interrupt 14, then one should be able to increase the number of occurrences per unit time of the lost interrupt 30 second freeze by increasing the number of interrupt 15's per unit time. So, if we change the events_poll_msecs from 2000 to 500 we should get 4 times interrupt 15 rate and about 4 times the error rate. The random disk read test program was run (test method 2). The lost interrupt 30 second freeze rate chnaged from about one per hour to about 4 per hour.

Experiment 2: O.K., using the original, pre revision 2760 udev rule can the lost interrupt 30 second freeze still be made to occur via using the CD-ROM drive at the same time, thus creating interrupt 15's? A DVD was created, with one big 4 gigabyte file. These commands were entered (for my own memory and future reference):

doug@test-smy:~$ sudo mount /dev/sr0 /media/cdrom
[sudo] password for doug:
mount: block device /dev/sr0 is write-protected, mounting read-only
doug@test-smy:~$ cd /media/cdrom
doug@test-smy:/media/cdrom$ cp big /dev/null

A high rate of interrupt 15s were observed. Meanwhile the random disk read test program was run (test method 2). The lost interrupt 30 second freeze occurred occasionally.

Experiment 3: If the CD-ROM drive is moved to be the second, slave, device on the primary IDE interface, thus eliminating interrupt 15, does the issue not occur? The udev rule used was the higher interrupt rate one from above. Otherwise, the experiment execution was as per experiment 2 above. There was never a lost interrupt 30 second freeze and never an interrupt 15. Performance is greatly hampered by having the CD-ROM drive and HDD on the same IDE cable/interface. Seeks rates dropped as low as 6 per second while the CD-ROM was also copying the big file to /dev/null. On separate IDE interfaces, the seek rate stayed as high as 84 seeks per second. The seek rate was typically 89 seeks per second with nothing else going on at the same time. With the CD-ROM drive on the same cable, but doing nothing the HDD seek rate was typically 84 seeks per second. I only use the CD-ROM drive to install a new ISO, so it will just take longer. It already took several hours on the old pathetic computer anyhow.

Experiment 4: Same as experiment 2, but with Ubuntu 10.04.4 as the OS. The lost interrupt 30 second freeze seemed to occur at a higher rate than experiment 2, however longer tests would be required for a statistically valid conclusion.

References:
The launchpad bug.
The Ubuntu forums thread. (which got closed after 12.04 was released)
The Ubuntu forums thread for raring ringtail.
The launchpad udev core dev area.
.
Doug's rough and disjointed working notes.
.
Ubuntu debugging IRQ problems wiki.

Linux ata disk race condition. WWW.Smythies.com emaildoesnotwork@smythies.com 2012.07.06 Updated 2013.02.03