DISCLAIMER. English is used here only for compatibility (ASCII only), so any suggestions about my bad grammar (and not only that) are greatly appreciated.

Thursday, February 4, 2010

LVM notes [draft]

Status: Many chapters are missing and have not been posted here yet. The formatting may contain errors and omissions.

UPD-01-03-10_14-32:
+ Title changed.
+ Metadata circular buffer description (draft) added.
+ LVM extent and stripe difference added (originally in Russian).
UPD:[2010.02.10]: minor formatting improvements.

FIXME: line length >= 80.
FIXME: indentation.
FIXME: <> used as notation, not as HTML tags.
FIXME: spacing between words (see `od` output).
FIXME: and the font is silly: zeros are for some reason smaller than the other characters :D

The text below may look much better in vim with folding enabled ('fdm=marker', 'fmr={{{,}}}'), though the indentation probably remains incorrect. Perhaps someday I'll fix this :-)

Draft of metadata sectors layout:
  1. LVM label sector (0-r1) {{{

     Location: {{{
     - By default, `pvcreate` places the physical volume label
       in sector 1 (the 2nd 512-byte block).
     - The label can optionally be placed in any of the first
       four sectors ('--labelsector' option). Because the LVM
       tools scan these first four sectors for a PV label,
       zeroing them ('-Zy' option) is recommended.

     }}}
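     A minimal sketch of such a scan, assuming the label is
     recognized by the 'LABELONE' string that opens the label
     sector in the dump below. The names `find_label_sector`
     and `SECTOR_SIZE` are made up for this illustration and
     are not taken from the LVM tools.

```python
SECTOR_SIZE = 512
LABEL_MAGIC = b"LABELONE"  # first 8 bytes of the label sector in the dump


def find_label_sector(head: bytes, scan_sectors: int = 4):
    """Return the index of the first sector starting with the magic, or None."""
    for index in range(scan_sectors):
        start = index * SECTOR_SIZE
        if head[start:start + len(LABEL_MAGIC)] == LABEL_MAGIC:
            return index
    return None


# Synthetic 4-sector image with the label in sector 1 (the pvcreate default).
image = bytearray(4 * SECTOR_SIZE)
image[SECTOR_SIZE:SECTOR_SIZE + len(LABEL_MAGIC)] = LABEL_MAGIC
print(find_label_sector(bytes(image)))
```

     On a real PV the first 2048 bytes of the device would be
     passed instead of the synthetic image (reading a block
     device usually requires root).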
     Format: {{{
     Let's consider an example:

> `pvcreate -vvv -Zy -M2 --metadatacopies=[012] --uuid=.. /dev/sdb4` {{{
> 
> /dev/sdb4: size is 58589055 sectors
> with mcopies=0: 58588927 available sectors
> with mcopies=1: 58588671 available sectors
>   metadata area at sector 8 size 376 sectors
> with mcopies=2: 58588416 available sectors
>   metadata area at sector 8 size 376 sectors
>   metadata area at sector 58588800 size 255 sectors
> 
> Area sizes (available sectors) in hex:
> data area size = 
>   (with mcopies=0) = 58588927 sectors = 0x37dfeff sectors = 0x06fbfdfe00 bytes
>   (with mcopies=1) = 58588671 sectors = 0x37dfdff sectors = 0x06fbfbfe00 bytes
>   (with mcopies=2) = 58588416 sectors = 0x37dfd00 sectors = 0x06fbfa0000 bytes
> meta area size =
>   (1st meta area) = 376 sectors = 192512 bytes = 0x02f000 bytes
>   (2nd meta area) = 255 sectors = 130560 bytes = 0x01fe00 bytes
> 
> Area offsets in hex:
> data area offset =
>   (with mcopies=0)   = 128 sectors = 65536 bytes = 0x010000 bytes
>   (with mcopies=1,2) = 384 sectors = 196608 bytes = 0x030000 bytes
> meta area offset =
>   (1st meta area) = 8 sectors = 4096 bytes = 0x1000 bytes
>   (2nd meta area) = 58588800 sectors = 0x37dfe80 sectors = 0x06fbfd0000 bytes
> 
> PV UUID = 'pesv0I-D0Ok-cVts-73Pg-vIaN-IRz2-LSldOn'
> 
> Sector 1 dump: {{{
> 
> Below, for each 16-byte row, the 1st line is for mcopies=0, the 2nd for
> mcopies=1, and the 3rd for mcopies=2.
> 
> 000200 4c 41 42 45 4c 4f 4e 45 01 00 00 00 00 00 00 00
>          L   A   B   E   L   O   N   E soh nul nul nul nul nul nul nul
> 000200 4c 41 42 45 4c 4f 4e 45 01 00 00 00 00 00 00 00
>          L   A   B   E   L   O   N   E soh nul nul nul nul nul nul nul
> 000200 4c 41 42 45 4c 4f 4e 45 01 00 00 00 00 00 00 00
>          L   A   B   E   L   O   N   E soh nul nul nul nul nul nul nul
> --
> 000210 e3 bb 4a cb 20 00 00 00 4c 56 4d 32 20 30 30 31
>          c   ;   J   K  sp nul nul nul   L   V   M   2  sp   0   0   1
> 000210 3c a9 89 2c 20 00 00 00 4c 56 4d 32 20 30 30 31
>          <   )  ht   ,  sp nul nul nul   L   V   M   2  sp   0   0   1
> 000210 76 4b 22 d4 20 00 00 00 4c 56 4d 32 20 30 30 31
>          v   K   "   T  sp nul nul nul   L   V   M   2  sp   0   0   1
> --
> 000220 70 65 73 76 30 49 44 30 4f 6b 63 56 74 73 37 33
>          p   e   s   v   0   I   D   0   O   k   c   V   t   s   7   3
> 000220 70 65 73 76 30 49 44 30 4f 6b 63 56 74 73 37 33
>          p   e   s   v   0   I   D   0   O   k   c   V   t   s   7   3
> 000220 70 65 73 76 30 49 44 30 4f 6b 63 56 74 73 37 33
>          p   e   s   v   0   I   D   0   O   k   c   V   t   s   7   3
> --
> 000230 50 67 76 49 61 4e 49 52 7a 32 4c 53 6c 64 4f 6e
>          P   g   v   I   a   N   I   R   z   2   L   S   l   d   O   n
> 000230 50 67 76 49 61 4e 49 52 7a 32 4c 53 6c 64 4f 6e
>          P   g   v   I   a   N   I   R   z   2   L   S   l   d   O   n
> 000230 50 67 76 49 61 4e 49 52 7a 32 4c 53 6c 64 4f 6e
>          P   g   v   I   a   N   I   R   z   2   L   S   l   d   O   n
> --
> 000240 00 fe fd fb 06 00 00 00 00 00 01 00 00 00 00 00
>        nul   ~   }   { ack nul nul nul nul nul soh nul nul nul nul nul
> 000240 00 fe fb fb 06 00 00 00 00 00 03 00 00 00 00 00
>        nul   ~   {   { ack nul nul nul nul nul etx nul nul nul nul nul
> 000240 00 00 fa fb 06 00 00 00 00 00 03 00 00 00 00 00
>        nul nul   z   { ack nul nul nul nul nul etx nul nul nul nul nul
> --
> 000250 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000250 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000250 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> --
> 000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000260 00 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul dle nul nul nul nul nul nul
>        +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +a +b +c +d +e +f  
> 000260 00 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul dle nul nul nul nul nul nul
> --
> 000270 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000270 00 f0 02 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul   p stx nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000270 00 f0 02 00 00 00 00 00 00 00 fd fb 06 00 00 00
>        nul   p stx nul nul nul nul nul nul nul   }   { ack nul nul nul
> --
> 000280 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000280 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
>        +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +a +b +c +d +e +f  
> 000280 00 fe 01 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul   ~ soh nul nul nul nul nul nul nul nul nul nul nul nul nul
> --
> 000290 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000290 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> 000290 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
> --
> (the last all-NULL row repeats up to the end of the sector)
> 
> }}}
> 
> }}}
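     The sector/byte/hex conversions in the example above can be
     re-checked with a few lines of Python. This is only a
     sketch: the constants are copied from the `pvcreate`
     output, and the helper name `to_bytes` is made up.

```python
SECTOR_SIZE = 512

TOTAL_SECTORS = 58589055                              # "/dev/sdb4: size is ..."
AVAILABLE = {0: 58588927, 1: 58588671, 2: 58588416}   # keyed by mcopies value


def to_bytes(sectors: int) -> int:
    """Convert a count of 512-byte sectors to bytes."""
    return sectors * SECTOR_SIZE


for mcopies, sectors in AVAILABLE.items():
    reserved = TOTAL_SECTORS - sectors   # sectors not available for data
    print(f"mcopies={mcopies}: {sectors} sectors = {sectors:#x} sectors "
          f"= {to_bytes(sectors):#012x} bytes, reserved {reserved} sectors")

# Metadata areas: size and offset in bytes.
print(hex(to_bytes(376)), hex(to_bytes(255)), hex(to_bytes(58588800)))
```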
     Notes: {{{
     - Sizes and offsets below are in bytes.
     - All ranges below include their boundaries.
     - Numbers (such as sizes and offsets) are written on disk
       with the least significant byte first, i.e. the least
       significant byte of a number has a smaller offset from
       the beginning of the disk than the most significant one.
     - All area locations are aligned to a sector boundary
       (512-byte block), hence all size and offset values are
       divisible by 512 (0x200), hence in hex they always have
       a NULL least significant byte (0x00).
     - I don't know for sure whether this NULL least significant
       byte is actually written on disk, but more likely it is
       than not. The most confusing place with this byte is the
       beginning of the data area size after the UUID: the NULL
       byte at 0x240+0 may be a NULL terminator after the UUID
       as well as the least significant byte of the data area
       size value. I assume that it is the NULL least
       significant byte (see below).

     }}}
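     As an illustration of the byte order, the 8 bytes at
     0x240+0 -> 0x240+7 of the mcopies=0 dump can be decoded
     like this (a sketch; it assumes the field is 8 bytes wide,
     as discussed in the assumptions below):

```python
# Bytes 0x240+0 .. 0x240+7 of the mcopies=0 dump, in on-disk order.
raw = bytes.fromhex("00 fe fd fb 06 00 00 00")

# Least significant byte first, so decode as little-endian.
value = int.from_bytes(raw, byteorder="little")

print(hex(value))        # the value in bytes
print(value // 512)      # the same value in 512-byte sectors
print(value % 512 == 0)  # sector-aligned, so the lowest byte is 0x00
```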
     Assumption: {{{
     - The NULL least significant byte is written on disk as is.
     - A size value always occupies the +0 -> +7 range of a row.
     - An offset value always occupies the +8 -> +f range of a
       row.
     - So size and offset can each be up to a 16-digit hex
       number, and together they occupy exactly one row.

     }}}
     Some sort of proof for the assumptions: {{{
     - Dropping part of a number (the NULL least significant
       byte) and not writing it to disk would not save any space
       allocated for metadata, but might cause problems in the
       future, when this least significant byte may become
       non-NULL. So I don't see any sane reason for this.
     - Not writing a NULL terminator after the UUID looks
       plausible, because the UUID size is fixed (I think).
     - If the offset and size fields on disk occupied different
       numbers of bytes, we either could not address the whole
       allocated size (through the offset) or would have offset
       values that are never used; both only introduce future
       incompatibilities. Hence the maximum allowed values of
       offset and size should be the same, and they should
       occupy the same number of bytes on disk.
     - Splitting one 16-byte row into two equal 8-byte parts
       for size (first) and offset (second) seems very
       reasonable: the size and offset fields occupy the same
       number of bytes and, as the example shows, the entire
       values (including the NULL least significant byte) are
       written into these fields.
     - In the end, I did not find any conflict with any of my
       assumptions :-)

     }}}
     Short Physical Volume label sector format: {{{
     0x200  <LVM_magic_string>
     0x210  <smth_unknown_and_LVM_version>
     0x220  <PV_UUID>
     0x230  <PV_UUID(continue)>
     0x240  <data_size(8b)>      <data_offset(8b)>
     0x250  <NULL>
     0x260  <NULL(8b)>           <1st_meta_offset(8b)>
     0x270  <1st_meta_size(8b)>  <2nd_meta_offset(8b)>
     0x280  <2nd_meta_size(8b)>  <NULL(8b)>
     0x290  <NULL_(up_to_0x400)>

     }}}
     Detailed Physical Volume label sector format: {{{
     0x200+0 -> 0x200+f: LVM magic string. Identical for all
        three PVs.
     0x210+0 -> 0x210+7: Unknown. In the example, +0 -> +3
        differs between the three PVs, while +4 -> +7 is the
        same in all of them (0x20).
     0x210+8 -> 0x210+b: LVM version (I suppose).
     0x210+c -> 0x210+c: Simply a separator (I suppose).
     0x210+d -> 0x210+f: LVM PV label sector (this sector)
        format version (I suppose).
     0x220+0 -> 0x230+f: PV UUID.
     0x240+0 -> 0x240+7: Data area size (number of available
        bytes). Always present.
     0x240+8 -> 0x240+f: Data area offset. Always present.
     0x250+0 -> 0x250+f: NULL. Why?
     0x260+0 -> 0x260+7: NULL. Why?
     0x260+8 -> 0x260+f: 1st metadata circular buffer offset.
        Only for mcopies=1,2.
     0x270+0 -> 0x270+7: 1st metadata circular buffer size.
        Only for mcopies=1,2.
     0x270+8 -> 0x270+f: 2nd metadata circular buffer offset.
        Only for mcopies=2.
     0x280+0 -> 0x280+7: 2nd metadata circular buffer size.
        Only for mcopies=2.
     0x280+8 -> 0x3f0+f: NULL, seems to be unused.

     }}}
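     The layout above can be tried out on the mcopies=2 dump.
     This is a sketch under the assumptions of these notes (the
     field positions are my guesses, not an official format
     description); `parse_label` and `format_uuid` are
     hypothetical helper names.

```python
import struct

# Rows 0x200..0x280 of the mcopies=2 dump; the rest of the sector is zero.
ROWS = [
    "4c 41 42 45 4c 4f 4e 45 01 00 00 00 00 00 00 00",
    "76 4b 22 d4 20 00 00 00 4c 56 4d 32 20 30 30 31",
    "70 65 73 76 30 49 44 30 4f 6b 63 56 74 73 37 33",
    "50 67 76 49 61 4e 49 52 7a 32 4c 53 6c 64 4f 6e",
    "00 00 fa fb 06 00 00 00 00 00 03 00 00 00 00 00",
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "00 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00",
    "00 f0 02 00 00 00 00 00 00 00 fd fb 06 00 00 00",
    "00 fe 01 00 00 00 00 00 00 00 00 00 00 00 00 00",
]
sector = bytes.fromhex(" ".join(ROWS)).ljust(512, b"\0")


def format_uuid(raw: str) -> str:
    """Insert dashes in the 6-4-4-4-4-4-6 pattern of the printed PV UUID."""
    parts, pos = [], 0
    for length in (6, 4, 4, 4, 4, 4, 6):
        parts.append(raw[pos:pos + length])
        pos += length
    return "-".join(parts)


def parse_label(sector: bytes) -> dict:
    """Decode the label sector using the row layout proposed in these notes."""
    # Nine little-endian 64-bit values cover 0x240+0 -> 0x280+7
    # (offsets 0x40..0x87 relative to the start of the sector).
    values = struct.unpack_from("<9Q", sector, 0x40)
    return {
        "magic": sector[0x00:0x08].decode("ascii"),    # printable part only
        "version": sector[0x18:0x20].decode("ascii"),
        "uuid": format_uuid(sector[0x20:0x40].decode("ascii")),
        "data_size": values[0],
        "data_offset": values[1],
        "meta1_offset": values[5],
        "meta1_size": values[6],
        "meta2_offset": values[7],
        "meta2_size": values[8],
    }


label = parse_label(sector)
for name, value in label.items():
    print(name, hex(value) if isinstance(value, int) else value)
```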
     Last notes: {{{
     - The metadata area size and location (the sector number
       where the metadata circular buffer begins) can be
       obtained from the `pvcreate -vvv` output. If no metadata
       areas (circular buffers) are selected during PV creation
       (`pvcreate --metadatacopies=0`), then the metadata buffer
       area size is set to 120 sectors (this seems to be the
       smallest size), though it is not used (there are no
       pointers to a metadata buffer in the label sector, and
       sector 8 remains unchanged as well). This can be
       determined by subtracting the number of available
       sectors from the total number of sectors on the disk
       (you get 128 = 120 + 8).

     }}}

     }}}

  }}}
  2. LVM circular buffer (0-r1) {{{

     Short sector 8 format: {{{

     0x1000  <Unknown>
     0x1010  <Unknown(8b)>            <1st_meta_offset(8b)>
     0x1020  <1st_meta_size(8b)>      <latest_entry_offset(8b)>
     0x1030  <latest_entry_size(8b)>  <Unknown(8b)>
     0x1040  <NULLs_(up_to_0x1200)>

     }}}
     Detailed sector 8 format: {{{

     0x1000+0 -> 0x1000+f:  Unknown (2).
     0x1010+0 -> 0x1010+7:  Unknown (2).
     0x1010+8 -> 0x1010+f:  1st metadata circular buffer (this
          buffer) offset.
     0x1020+0 -> 0x1020+7:  1st metadata circular buffer (this
          buffer) size.
     0x1020+8 -> 0x1020+f:  Offset of the latest metadata entry
          in the 1st circular buffer (this buffer), counted
          from the beginning of the buffer, NOT from the
          beginning of the PV.
     0x1030+0 -> 0x1030+7:  Size of the latest metadata entry in
          the 1st circular buffer (this buffer), including the
          null terminator.
     0x1030+8 -> 0x1030+f:  Unknown (1).
     0x1040+0 -> 0x11f0+f:  NULLs, seems to be unused.

     (1): This number is most likely not:
       - the first unallocated PE (checked against the value in
         the example).
       - the PE size (checked against the value in the example).
       - the PV size (checked against the value in the example).
       - the 2nd metadata buffer offset (it is present even in
         PVs with a single metadata buffer).
       - the latest metadata entry timestamp or something else
         related to the latest metadata entry (it differs
         between PVs in the same VG, while the latest metadata
         entry is the same for all PVs of the same VG).

       Also:
       - this value is not divisible by 1024 or 512 without a
         remainder.
     (2): Notes:
       - the range 0x1000+4 -> 0x1010+7 seems to be the same for
         all PVs (even from different VGs).

     }}}
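     A sketch of reading these fields, assuming the layout above
     is right. No real sector 8 dump is shown in these notes, so
     the sample header is synthetic: the buffer offset and size
     are taken from the label example (0x1000 and 0x2f000),
     while the entry offset and size are made-up values. The
     names `HEADER` and `parse_buffer_header` are hypothetical.

```python
import struct

# Rows 0x1000..0x1030: 16 + 8 unknown bytes, four little-endian
# 64-bit values, then 8 more unknown bytes (64 bytes in total).
HEADER = struct.Struct("<16s8sQQQQ8s")


def parse_buffer_header(sector: bytes) -> dict:
    """Decode sector 8 using the layout proposed in these notes."""
    (_unknown_a, _unknown_b, buf_offset, buf_size,
     entry_offset, entry_size, _unknown_c) = HEADER.unpack_from(sector)
    return {
        "buffer_offset": buf_offset,
        "buffer_size": buf_size,
        "entry_offset": entry_offset,   # relative to the buffer start
        "entry_size": entry_size,       # includes the null terminator
        # The entry offset is relative to the buffer, so the offset from
        # the beginning of the PV is the sum of the two offsets.
        "entry_absolute_offset": buf_offset + entry_offset,
    }


# Synthetic header; 0x200 and 1234 are made-up entry values.
sample = HEADER.pack(b"\0" * 16, b"\0" * 8,
                     0x1000, 0x2F000, 0x200, 1234, b"\0" * 8)
sample = sample.ljust(512, b"\0")
header = parse_buffer_header(sample)
print({name: hex(value) for name, value in header.items()})
```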
     Metadata entries location: {{{
     - An LVM metadata entry on disk is aligned (roughly) to a
       sector boundary (512-byte block), i.e. text such as

   vg_mp3 {
   id = "YWvzHx-M1X5-TWtl-vCD1-w2zn-y0da-qui6PK"
   seqno = 50
   status = ["RESIZEABLE", "READ", "WRITE"]

       is placed only at the beginning of a sector.
     - An LVM metadata entry can occupy several sectors. If the
       last occupied sector is not completely filled, the rest
       of that sector is not cleared and hence can contain
       garbage (namely, part of the data from a previous record
       that occupied this sector).
       
     }}}
     Metadata entries format: {{{
     - Each metadata entry on disk ends with a null terminator.
     - On disk, the metadata timestamp (information about how
       and when the metadata entry was created) is written
       after the VG description (information about the volume
       group structure, see below) to which it relates, in
       contrast with the metadata backup file produced by
       `vgcfgbackup`, where the timestamp is written first.
     - On disk, the 'description' field of the metadata
       timestamp is empty (but in the metadata backup file
       produced by `vgcfgbackup` it is not). (Why?)

     }}}
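     Given an entry offset and a size that includes the null
     terminator, the entry text could be cut out of a buffer
     dump roughly as follows. This is a sketch only: it assumes
     the entry does not wrap around the end of the circular
     buffer, the sample entry is shortened and made up, and
     `read_entry` is a hypothetical name.

```python
def read_entry(buffer: bytes, entry_offset: int, entry_size: int) -> str:
    """Return the entry text without its trailing null terminator."""
    raw = buffer[entry_offset:entry_offset + entry_size]
    if len(raw) != entry_size or not raw.endswith(b"\0"):
        raise ValueError("entry is truncated or not null-terminated")
    return raw[:-1].decode("ascii")


# Synthetic buffer: one header sector, a shortened entry with its
# terminator, then leftover bytes from an older record.
text = b'vg_mp3 {\nseqno = 50\n}\ndescription = ""\n'
buffer = bytes(512) + text + b"\0" + b"leftover garbage"
entry = read_entry(buffer, 512, len(text) + 1)
print(entry)
```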
     Last notes: {{{

     - When a PV contains no metadata circular buffer areas
       (`pvcreate` with '--metadatacopies=0'), restoring the VG
       does not change anything in the PV metadata.
     - To obtain the offset from the beginning of the PV to the
       latest metadata entry in the circular buffer, add the
       '0x1010+8 -> 0x1010+f' value to the
       '0x1020+8 -> 0x1020+f' value.
     - To locate the latest metadata entry in a raw on-disk
       metadata copy, take a dump of the metadata circular
       buffer (mostly starting from sector 9; if not, the exact
       offsets can be obtained from the PV label sector), split
       it into sectors (512-byte blocks), and look for sectors
       beginning like

   vg_mp3 {
   id = "YWvzHx-M1X5-TWtl-vCD1-w2zn-y0da-qui6PK"
   seqno = 50
   status = ["RESIZEABLE", "READ", "WRITE"]

       This is the beginning of a valid metadata entry. Then
       choose the one with the latest 'seqno' field. As
       explained above, simply searching for the word 'seqno'
       may match garbage data after the end of a valid metadata
       entry. However, because we look for the entry with the
       latest 'seqno', we will select the correct one anyway.
       Also note that the word 'seqno' may appear as garbage
       only in the first few bytes of a sector.

     }}}
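     The search described above could be sketched like this. The
     regular expressions are my assumption about how an entry
     start looks ("<vg_name> {" at the very beginning of a
     sector), the 'seqno = 49' entry and the garbage sector are
     made up for the test, and `find_latest_entry` is a
     hypothetical name.

```python
import re

SECTOR_SIZE = 512
# Assumption: an entry starts at a sector boundary with "<vg_name> {".
ENTRY_START = re.compile(rb"\A[A-Za-z0-9_.+-]+ \{\n")
SEQNO = re.compile(rb"^[ \t]*seqno = (\d+)[ \t]*$", re.MULTILINE)


def find_latest_entry(buffer: bytes):
    """Return (sector_index, seqno) of the entry with the highest seqno."""
    best = None
    for start in range(0, len(buffer), SECTOR_SIZE):
        sector = buffer[start:start + SECTOR_SIZE]
        if not ENTRY_START.match(sector):
            continue                      # not the beginning of an entry
        found = SEQNO.search(sector)      # seqno is near the entry start
        if found is None:
            continue
        seqno = int(found.group(1))
        if best is None or seqno > best[1]:
            best = (start // SECTOR_SIZE, seqno)
    return best


def pad(text: bytes) -> bytes:
    """Pad a piece of text with zero bytes up to one sector."""
    return text.ljust(SECTOR_SIZE, b"\0")


buffer = b"".join([
    pad(b""),                                    # buffer header placeholder
    pad(b'vg_mp3 {\nid = "YWvzHx-M1X5-TWtl-vCD1-w2zn-y0da-qui6PK"\n'
        b'seqno = 49\n'),
    pad(b'vg_mp3 {\nid = "YWvzHx-M1X5-TWtl-vCD1-w2zn-y0da-qui6PK"\n'
        b'seqno = 50\n'),
    pad(b'tatus = ["RESIZEABLE"]\nseqno = 99\n'),  # garbage, not an entry
])
latest = find_latest_entry(buffer)
print(latest)
```

     The sector index is relative to the start of the dumped
     buffer, so the buffer offset from the label sector still
     has to be added to get a position on the PV.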

  }}}


LVM extent and stripe:

(copy of an email :-)
<...>
By stripes I meant an LVM logical volume with stripe mapping, i.e., for
example, a command like this

`lvcreate -vvv -i3 -I32 -l32000 -n striped_lv_3x32k test_vg`

And I could not understand why LVM uses two kinds of blocks, extent and stripe,
and why all kinds of mapping, both linear and stripe mapping, could not be
implemented using only one kind of block. But I took another look at the
`lvcreate -vvv` logs and, in the part related to activating the logical volume,
I think I found the answer. Though I'm not sure it is completely correct :-)
Here, for example:

--- Volume group ---
VG Name vg_4k
PE Size 4.00 KB
VG UUID M7TWGx-Rggp-Okru-nWGY-c7Mh-MaLi-LhYxFm

Now, if we create a logical volume (32000 extents in size) with stripe
mapping, striped across 3 physical volumes and with a stripe size of 4 KB,
we get:

`lvcreate -vvv -i3 -I4 -l32000 -n s_lv_3x16k vg_4k`

<..>
Creating vg_4k-s_lv_3x16k
dm create vg_4k-s_lv_3x16k
LVM-M7TWGxRggpOkrunWGYc7MhMaLiLhYxFm0YvE0mFyjZS7Q103PbM4efKUb2lUXnR8
NF [16384]
Loading vg_4k-s_lv_3x16k table
Adding target: 0 256008 striped 3 8 7:1 384 7:2 384 7:0 384
dm table (253:0) OF [16384]
dm reload (253:0) NF [16384]
Resuming vg_4k-s_lv_3x16k (253:0)
dm resume (253:0) NF [16384]
<..>

Of course, this was known even without the log, but still. I.e. all LVM
functionality is implemented through device-mapper, but device-mapper knows
nothing about extents and does not use them. In its tables (`dmsetup table`)
it uses disk blocks (512 bytes) for all sizes and offsets. For striped tables
it also uses the stripe: the block of data that is written to one physical
device (i.e. the standard definition of a stripe). So it turns out that
device-mapper uses just one kind of block: only the stripe. And it turns out
that the extent is a block used only for convenience in managing LVM volumes,
and it is not used while LVM is working (I/O). I.e. it is a block used only by
the LVM toolset. That makes this remark from the description of the '-s'
option in `man vgcreate` clear:

-s, --physicalextentsize PhysicalExtentSize[kKmMgGtT]

<..>
If the volume group metadata uses lvm2 format those restrictions
do not apply, but having a large number of extents will slow
down the tools but have no impact on I/O performance to the
logical volume. The smallest PE is 1KB.
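
The `Adding target:` line from the log above can be decoded to check this. As
far as I know, the fields of a striped table line are: start sector, length in
sectors, target name, number of stripes, chunk (stripe) size in sectors, and
then one device/offset pair per stripe. The sketch below relies on that
reading; `StripedTarget` and `parse_striped` are made-up names.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512


@dataclass
class StripedTarget:
    start: int          # first sector of the mapped range
    length: int         # length of the mapped range, in sectors
    stripes: int        # number of underlying devices
    chunk_sectors: int  # stripe (chunk) size, in sectors
    devices: list       # (device, start offset in sectors) pairs


def parse_striped(line: str) -> StripedTarget:
    """Parse a device-mapper table line for a striped target."""
    fields = line.split()
    if len(fields) < 5 or fields[2] != "striped":
        raise ValueError("not a striped target line")
    start, length = int(fields[0]), int(fields[1])
    stripes, chunk = int(fields[3]), int(fields[4])
    rest = fields[5:]
    if len(rest) != 2 * stripes:
        raise ValueError("device list does not match the stripe count")
    devices = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
    return StripedTarget(start, length, stripes, chunk, devices)


target = parse_striped("0 256008 striped 3 8 7:1 384 7:2 384 7:0 384")
print(target.chunk_sectors * SECTOR_SIZE)   # stripe size in bytes
print(target.length * SECTOR_SIZE // 4096)  # LV length in 4 KB extents
print(target.devices)
```

Everything in the table is in 512-byte sectors: the stripe is 8 sectors (4 KB),
and the length of 256008 sectors corresponds to 32001 extents of 4 KB, one more
than the 32000 requested (presumably rounded up to a multiple of the 3 stripes;
that is my guess). The extent size itself appears nowhere in the table.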

The limits on stripe and extent sizes are apparently there so that they all
fit into one another: 512 bytes, the disk block, is a power of two, so stripe
and extent probably also have to be powers of two in order to contain a whole
number of disk blocks. Besides, since both stripe and extent are powers of
two, an extent always contains a whole number of stripes (as long as the
stripe is not larger than the extent). However, it is not quite clear why
device-mapper does not allow a stripe size smaller than 4 KB.
<...>
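
The nesting argument above can be checked mechanically. This is just a sketch
of the arithmetic, not of anything the LVM tools do; `is_power_of_two` and
`nests` are made-up names, and the sizes are the 4 KB extent and the 4 KB and
32 KB stripes from the examples above.

```python
SECTOR_SIZE = 512


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def nests(extent_bytes: int, stripe_bytes: int) -> bool:
    """True if sectors fit evenly into a stripe and stripes into an extent."""
    return (stripe_bytes % SECTOR_SIZE == 0
            and extent_bytes % stripe_bytes == 0)


KB = 1024
print(is_power_of_two(SECTOR_SIZE), is_power_of_two(4 * KB))
print(nests(extent_bytes=4 * KB, stripe_bytes=4 * KB))    # vg_4k example
print(nests(extent_bytes=4 * KB, stripe_bytes=32 * KB))   # stripe > extent
```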