DISCLAIMER. English is used here only for compatibility (ASCII only), so any corrections to my grammar (and to anything else) are greatly appreciated.

Saturday, August 20, 2016

Which luks, lvm and mdadm layouts are bootable?

The only layout supported by the default configuration is mdadm at the lowest
layer, with lvm and luks above it in either order.

(tested on Debian 8, initramfs-tools)

`mdadm` installs only one boot script,
`/usr/share/initramfs-tools/scripts/local-top/mdadm`, which has only
"multipath" in its prereqs. The `lvm2` boot script, in turn, has "mdadm" in
its prereqs, so lvm runs after mdadm.
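
One way to check this ordering by hand is to ask each installed boot script for
its prereqs. The sketch below relies on the initramfs-tools convention that a
boot script prints its prerequisites when called with the single argument
`prereqs`; the helper name `list_prereqs` is made up for illustration.

```sh
#!/bin/sh
# Print "<script name>: <prereqs>" for every executable boot script in a
# directory. list_prereqs is a hypothetical helper, not part of any package.
list_prereqs() {
    dir=$1
    for script in "$dir"/*; do
        # Skip the unexpanded glob, subdirectories and non-executable files.
        [ -f "$script" ] && [ -x "$script" ] || continue
        printf '%s: %s\n' "${script##*/}" "$("$script" prereqs)"
    done
}

# Prints nothing if the directory does not exist on this system.
list_prereqs /usr/share/initramfs-tools/scripts/local-top
```

The same call with the 'local-block' directory should show the second set of
scripts installed by lvm and luks.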

The order of lvm and luks is not that straightforward: both packages install
boot scripts into two directories, 'local-top' and 'local-block'.
`local-top/cryptroot` generates its prereqs dynamically so that it runs last.
Also, `local-top/cryptroot` tries to activate lvm after opening a luks
device. That makes both the 'lvm->luks' and 'luks->lvm' layouts work.
Moreover, if lvm contains several luks-encrypted PVs, trying to activate lvm
while not all PVs have been opened results in a non-zero exit code from
`vgchange -ay` (it is called without '--activationmode partial') in
`activate_vg()`, and the `setup_mapping()` function then returns immediately.
`cryptroot` then tries the next mapping, and so on, and once all PVs are
available, activation succeeds.

The problem with lvm activation in `local-top/cryptroot` is this: if a luks
mapping is marked as 'rootdev' in the generated 'conf/conf.d/cryptroot' (all
PVs are marked so when the root fs is inside lvm on luks-encrypted PVs),
`local-top/cryptroot` goes further and, after lvm activation, tries to verify
(with `blkid`) the type of the root device specified on the kernel command
line. When `activate_vg()` exits with a non-zero status, this verification is
skipped (the case when not all PVs are opened yet), but when the entire vg
finally activates, `local-top/cryptroot` reaches the verification and, if
'root=/dev/md/root', the verification obviously fails (because there is no
such device yet). Then `local-top/cryptroot` tries to close the mapping, which
also fails, because lvm has already been activated on that PV. Finally, after
being stuck for a few seconds, it proceeds.

To make mdadm work at the highest level (over lvm or luks), I can just
create a boot script `/etc/initramfs-tools/scripts/local-top/mdadm_again.sh`
with 'PREREQ="cryptroot lvm2"', which just assembles the md array. But I
will still have to wait out the failed attempts to verify root in
`local-top/cryptroot`.

Alternatively, I can just comment out lvm activation in `local-top/cryptroot`.
Since lvm also installs a boot script into the `local-block` location, the
root LV (if there is no mdadm on it) will still be activated in the
'luks->lvm' layout. However, unlike `local-top/lvm2` and `activate_vg()` in
`local-top/cryptroot`, `local-block/lvm2` activates only the required LV (not
the entire vg). One side effect of this "only required LV" activation is that
if root is on raid ('root=/dev/md/root'), no LV is activated at all, because
`local-block/lvm2` simply cannot tell what it needs to activate.

Thus, wherever I place the `mdadm` assemble boot script (in 'local-top' or in
'local-block'), if I comment out lvm activation in `local-top/cryptroot` to
avoid the boot delay, I need to activate lvm manually in that script too:
`/sbin/lvm vgscan; /sbin/lvm vgchange -a y; mdadm --assemble --scan`.
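
Put together, such a boot script might look roughly like the sketch below. It
follows the usual initramfs-tools layout (a `PREREQ` variable plus a `prereqs`
case) and runs the three commands mentioned above. The checks for missing
tools and the function name `activate_and_assemble` are my own additions for
illustration, and I have not tested this exact file on every layout.

```sh
#!/bin/sh
# Sketch of /etc/initramfs-tools/scripts/local-top/mdadm_again.sh

PREREQ="cryptroot lvm2"

prereqs() {
    echo "$PREREQ"
}

# initramfs-tools calls each boot script with "prereqs" to compute the order.
case "${1:-}" in
    prereqs)
        prereqs
        exit 0
        ;;
esac

activate_and_assemble() {
    # Assumption: do nothing if the tools are not present.
    [ -x /sbin/lvm ] || return 0
    command -v mdadm >/dev/null 2>&1 || return 0

    # Activate the whole vg first, so that LVs used as raid members exist.
    /sbin/lvm vgscan
    /sbin/lvm vgchange -a y
    # Then assemble every array mdadm can find.
    mdadm --assemble --scan
}

activate_and_assemble || echo "mdadm_again: activation or assembly reported an error" >&2
```

The file has to be executable, and the initramfs has to be rebuilt (for
example with `update-initramfs -u`) before the script takes effect.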


Note that partial lvm activation with raid over lvm may lead to a broken raid.

Assume that the raid has two drives, A and B. Drive B goes missing and the
raid runs on drive A alone for some time. Then drive B appears again. At
assemble time, the drive that was missing (drive B) is not added to the raid
at all, and the raid runs on drive A only. I need to add drive B to the raid
manually.

Assume that the raid has two drives, A and B. Drive B was marked as failed
and removed from the raid. But when the raid is assembled in the initramfs at
boot, only drive B is available. Then the raid is started on the available
drive, drive B, regardless of the fact that this drive was marked as failed.
It seems there is no "failed" mark in mdadm metadata at all; it is just a
trick for removing a drive from a running raid, because otherwise `mdadm`
refuses to remove it. Then, if the fs on that "failed" drive is fine, it is
used as usual. And if later, when both drives A and B are available, I stop
and re-assemble the raid, it is started from the device with the *higher*
event counter. That may be either of them, depending on how many changes were
made to each. What happens if the event counters are equal, I don't know.
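
To see in advance which member would win, the event counters can be compared
by hand. The sketch below assumes that `mdadm --examine` prints a line of the
form "Events : N" for a member device; the helper names `parse_events` and
`event_count` are made up, and the device names in the usage comment are only
placeholders.

```sh
#!/bin/sh
# Extract the event counter from `mdadm --examine` output given on stdin.
# Assumption: the output contains a line like "         Events : 42".
parse_events() {
    awk -F' : ' '/^ *Events/ { print $2 }'
}

# Print the event counter of one raid member device.
event_count() {
    mdadm --examine "$1" | parse_events
}

# Usage (placeholder device names, run as root):
#   event_count /dev/sda1
#   event_count /dev/sdb1
```

If I read the scenario above correctly, the member that reports the larger
number is the one the array will be started from.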
