Friday, April 30, 2010

Elusive Crashing Problem Solved

Four times in the last month Kimmy or I have woken up to a dead Linux computer. At first I thought it was a kernel panic (Linux BSOD), that took out my screen. It was always happening during the job runs in the middle of the night, backups, virus scans, etc. So my first step was to turn on all debugging and wait for the problem to happen again.

Well it did but there was still nothing written to any of the log areas. Then I decided to reload Fedora on a new machine, and moved all the hard drives and eSATA cages over to the new machine. Well the problem happened there as well.

So I got this crazy idea. If the drives had a problem and went in read-only mode it would explain the absense of logs. So I put the /var /tmp and main root drive on different disks.

Then the problem happened again, this time at 3:00. I had started a F13 respin build just before I left to pick up my daughter from school. Although this time the problem was a little different. Instead of a black screen with nothing on it, I had a screen saver that wouldn't wake up. Turns out the /tmp drive had gone read only. The reason nothing would wake up is programs were all waiting for the /tmp to go back into read/write. It was like a train where the engine suddenly stopped dead, and all the other cars came piling into it.

Apr 29 14:38:00 liandra kernel: ata8.00: exception Emask 0x10 SAct 0xd SErr 0x100 action 0x6
Apr 29 14:38:00 liandra kernel: ata8.00: irq_stat 0x02060002, device error via SDB FIS
Apr 29 14:38:00 liandra kernel: ata8.00: SError: { UnrecovData }
Apr 29 14:38:00 liandra kernel: ata8.00: failed command: READ FPDMA QUEUED
Apr 29 14:38:00 liandra kernel: ata8.00: cmd 60/80:00:bf:a8:b9/00:00:23:00:00/40 tag 0 ncq 65536 in
Apr 29 14:38:00 liandra kernel: res 60/80:00:bf:a8:b9/00:00:23:00:00/40 Emask 0x10 (ATA bus error)
Apr 29 14:38:00 liandra kernel: ata8.00: status: { DRDY DF }
Apr 29 14:38:00 liandra kernel: ata8.00: error: { ICRC }
Apr 29 14:38:00 liandra kernel: ata8.00: failed command: READ FPDMA QUEUED
Apr 29 14:38:00 liandra kernel: ata8.00: cmd 60/80:10:3f:a8:b9/00:00:23:00:00/40 tag 2 ncq 65536 in
Apr 29 14:38:00 liandra kernel: res 41/84:2f:90:a8:b9/04:00:23:00:00/40 Emask 0x410 (ATA bus error)
Apr 29 14:38:00 liandra kernel: ata8.00: status: { DRDY ERR }
Apr 29 14:38:00 liandra kernel: ata8.00: error: { ICRC ABRT }
Apr 29 14:38:00 liandra kernel: ata8.00: failed command: READ FPDMA QUEUED
Apr 29 14:38:00 liandra kernel: ata8.00: cmd 60/80:18:3f:a9:b9/00:00:23:00:00/40 tag 3 ncq 65536 in
Apr 29 14:38:00 liandra kernel: res 60/80:18:3f:a9:b9/00:00:23:00:00/40 Emask 0x10 (ATA bus error)
Apr 29 14:38:00 liandra kernel: ata8.00: status: { DRDY DF }
Apr 29 14:38:00 liandra kernel: ata8.00: error: { ICRC }
Apr 29 14:38:00 liandra kernel: ata8.02: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Apr 29 14:38:00 liandra kernel: ata8.02: failed command: FLUSH CACHE EXT
Apr 29 14:38:00 liandra kernel: ata8.02: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 1
Apr 29 14:38:00 liandra kernel: res 0e/12:02:02:00:00/00:00:00:10:0e/00 Emask 0x2 (HSM violation)
Apr 29 14:38:00 liandra kernel: ata8.02: status: { DRQ }
Apr 29 14:38:00 liandra kernel: ata8.02: error: { IDNF }
Apr 29 14:38:00 liandra kernel: ata8.00: hard resetting link
Apr 29 14:38:01 liandra kernel: ata8.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Apr 29 14:38:01 liandra kernel: ata8.02: hard resetting link
Apr 29 14:38:01 liandra kernel: ata8.02: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Apr 29 14:38:01 liandra kernel: ata8.00: configured for UDMA/100
Apr 29 14:38:01 liandra kernel: ata8.02: configured for UDMA/100
Apr 29 14:38:01 liandra kernel: end_request: I/O error, dev sdg, sector 0


My best guess, under heavy load the controller card on my eSATA drive cage heats up and pause for a minute or two. That's what causes the glitch. It also explains why the problem followed me when I re-installed Fedora and put it on a new machine. It also explains why I could never re-produce the problem. By the time I got back to it the unit had cooled down.

Guess I'm looking for a new external drive cage. I'm starting to think I should just build my own.

2 comments:

  1. The error is related to SATA Native Command Queueing (NCQ). FPDMA = First Party DMA. This is a newish performance feature on SATA drives.
    I'd recommend a couple checks:
    1) Update to the latest driver if you haven't already (seems several ubuntu users have also seen this - https://bugs.launchpad.net/ubuntu/+bug/550559)
    2) If you're using new drives with a motherboard that's an older generation, there might be a SATA spec compatibility issue. You can sometimes jumper the SATA drives to legacy mode (sometimes called SATA2 or 1.5). Also, you may be able to set your hardware (in BIOS) to a legacy mode. This should fix the above error, but might impose a perf penalty. There may also be a BIOS update to better support SATA.
    3) Check if there's any driver options to disable NCQ support. Though the queueing provides a perf boost, its not the end of the world to go without it.

    ReplyDelete
  2. Hmm also see discussion in this bug: https://bugzilla.kernel.org/show_bug.cgi?id=14543

    ReplyDelete