Ocfs2

OCFS2 fun yet again

I came back today from a six-day vacation in the warm south (that is, Stuttgart), and back at work I found three sheets of paper on my desk. Two tell me about something I haven’t done yet, the other one tells me about something I haven’t seen yet. One of my colleagues had to restart one of our web nodes, and now the thing can’t mount the logging volume (and thus logrotate and awstats failed to do their jobs). OCFS2 ain’t spitting out any error messages; when trying to mount the volume, you can see it joining the domain the volume belongs to on the other nodes. So at first glance .. nothing is wrong? ...

OCFS2 follow-up

OK, it turned out that said colleague wasn’t responsible at all. The real trigger was me creating a new volume on our SAN, on the same array that houses the OCFS2 volume. Apparently, while an additional SAN volume is being created, all other volumes in that array are either read-only or delayed, as you can see from the following log:

kernel: (13,3):o2hb_write_timeout:242 ERROR: Heartbeat write timeout to device sdd1 after 12000 milliseconds
kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 4):
kernel: Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 4)
kernel: Index 5: took 0 ms to do submit_bio for read
kernel: Index 6: took 0 ms to do waiting for read completion
kernel: Index 7: took 0 ms to do bio alloc write
kernel: Index 8: took 0 ms to do bio add page write
kernel: Index 9: took 0 ms to do submit_bio for write
kernel: Index 10: took 0 ms to do checking slots
kernel: Index 11: took 0 ms to do waiting for write completion
kernel: Index 12: took 2002 ms to do msleep
kernel: Index 13: took 0 ms to do allocating bios for read
kernel: Index 14: took 0 ms to do bio alloc read
kernel: Index 15: took 0 ms to do bio add page read
kernel: Index 16: took 0 ms to do submit_bio for read
kernel: Index 17: took 0 ms to do waiting for read completion
kernel: Index 18: took 0 ms to do bio alloc write
kernel: Index 19: took 0 ms to do bio add page write
kernel: Index 20: took 0 ms to do submit_bio for write
kernel: Index 21: took 0 ms to do checking slots
kernel: Index 22: took 0 ms to do waiting for write completion
kernel: Index 23: took 2004 ms to do msleep
kernel: Index 0: took 0 ms to do allocating bios for read
kernel: Index 1: took 0 ms to do bio alloc read
kernel: Index 2: took 0 ms to do bio add page read
kernel: Index 3: took 0 ms to do submit_bio for read
kernel: Index 4: took 9995 ms to do waiting for read completion
kernel: (13,3):o2hb_stop_all_regions:1682 ERROR: stopping heartbeat on all active regions.
kernel: Kernel panic - not syncing: *** ocfs2 is very sorry to be fencing this system by panicing ***
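For anyone staring at a similar dump: the interesting entry is the one step that ate nearly ten seconds waiting for a read to complete. Here is a minimal Python sketch that picks such entries out of the heartbeat log. The function name `slow_operations`, the 1000 ms cutoff, and the choice to skip `msleep` (which I assume is just the regular pause between heartbeat rounds) are my own illustration, not anything OCFS2 ships.

```python
import re

# A few entries copied from the log above; the full dump parses the same way.
LOG = """\
kernel: Index 12: took 2002 ms to do msleep
kernel: Index 23: took 2004 ms to do msleep
kernel: Index 3: took 0 ms to do submit_bio for read
kernel: Index 4: took 9995 ms to do waiting for read completion
"""

# Matches "Index N: took M ms to do <operation>". The operation name ends at
# the end of the line or right before the next "kernel:" prefix, so the
# pattern also copes with a log that was pasted as one long line.
ENTRY = re.compile(
    r"Index\s+(\d+):\s+took\s+(\d+)\s+ms\s+to\s+do\s+(.+?)(?=\s+kernel:|$)",
    re.MULTILINE,
)


def slow_operations(log_text, min_ms=1000, ignore=("msleep",)):
    """Return (index, ms, operation) tuples at or above min_ms, slowest first.

    Operations listed in `ignore` are skipped. Both defaults are hypothetical
    choices for this sketch.
    """
    hits = []
    for index, ms, operation in ENTRY.findall(log_text):
        operation = operation.strip()
        if operation in ignore or int(ms) < min_ms:
            continue
        hits.append((int(index), int(ms), operation))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for index, ms, operation in slow_operations(LOG):
        print(f"index {index}: {ms} ms in '{operation}'")
```

Run against the sample entries, the only thing left after filtering is index 4, the read that never came back in time while the array was busy creating the new volume.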

OCFS2 fun

Turns out said colleague had been playing with NFS on one of the web nodes, thus apparently rendering the remaining nodes offline (or semi-offline). After all the web nodes hung themselves, we had to hard reset them. Now everything is tingly again .. yay for a great first day …