| ---------------------------------------------------------------------- |
| 1. INTRODUCTION |
| |
| Modern filesystems feature checksumming of data and metadata to |
| protect against data corruption. However, the detection of the |
| corruption is done at read time which could potentially be months |
| after the data was written. At that point the original data that the |
| application tried to write is most likely lost. |
| |
| The solution is to ensure that the disk is actually storing what the |
| application meant it to. Recent additions to both the SCSI family |
| protocols (SBC Data Integrity Field, SCC protection proposal) as well |
| as SATA/T13 (External Path Protection) try to remedy this by adding |
| support for appending integrity metadata to an I/O. The integrity |
| metadata (or protection information in SCSI terminology) includes a |
| checksum for each sector as well as an incrementing counter that |
| ensures the individual sectors are written in the right order. And |
| for some protection schemes also that the I/O is written to the right |
| place on disk. |
| |
| Current storage controllers and devices implement various protective |
| measures, for instance checksumming and scrubbing. But these |
| technologies are working in their own isolated domains or at best |
| between adjacent nodes in the I/O path. The interesting thing about |
| DIF and the other integrity extensions is that the protection format |
| is well defined and every node in the I/O path can verify the |
| integrity of the I/O and reject it if corruption is detected. This |
| allows not only corruption prevention but also isolation of the point |
| of failure. |
| |
| ---------------------------------------------------------------------- |
| 2. THE DATA INTEGRITY EXTENSIONS |
| |
| As written, the protocol extensions only protect the path between |
| controller and storage device. However, many controllers actually |
| allow the operating system to interact with the integrity metadata |
| (IMD). We have been working with several FC/SAS HBA vendors to enable |
| the protection information to be transferred to and from their |
| controllers. |
| |
| The SCSI Data Integrity Field works by appending 8 bytes of protection |
| information to each sector. The data + integrity metadata is stored |
| in 520 byte sectors on disk. Data + IMD are interleaved when |
| transferred between the controller and target. The T13 proposal is |
| similar. |
| |
| Because it is highly inconvenient for operating systems to deal with |
| 520 (and 4104) byte sectors, we approached several HBA vendors and |
| encouraged them to allow separation of the data and integrity metadata |
| scatter-gather lists. |
| |
| The controller will interleave the buffers on write and split them on |
| read. This means that Linux can DMA the data buffers to and from |
| host memory without changes to the page cache. |
| |
| Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs |
| is somewhat heavy to compute in software. Benchmarks found that |
| calculating this checksum had a significant impact on system |
| performance for a number of workloads. Some controllers allow a |
| lighter-weight checksum to be used when interfacing with the operating |
| system. Emulex, for instance, supports the TCP/IP checksum instead. |
| The IP checksum received from the OS is converted to the 16-bit CRC |
| when writing and vice versa. This allows the integrity metadata to be |
| generated by Linux or the application at very low cost (comparable to |
| software RAID5). |
| |
| The IP checksum is weaker than the CRC in terms of detecting bit |
| errors. However, the strength is really in the separation of the data |
| buffers and the integrity metadata. These two distinct buffers must |
| match up for an I/O to complete. |
| |
| The separation of the data and integrity metadata buffers as well as |
| the choice in checksums is referred to as the Data Integrity |
| Extensions. As these extensions are outside the scope of the protocol |
| bodies (T10, T13), Oracle and its partners are trying to standardize |
| them within the Storage Networking Industry Association. |
| |
| ---------------------------------------------------------------------- |
| 3. KERNEL CHANGES |
| |
| The data integrity framework in Linux enables protection information |
| to be pinned to I/Os and sent to/received from controllers that |
| support it. |
| |
| The advantage to the integrity extensions in SCSI and SATA is that |
| they enable us to protect the entire path from application to storage |
| device. However, at the same time this is also the biggest |
| disadvantage. It means that the protection information must be in a |
| format that can be understood by the disk. |
| |
| Generally Linux/POSIX applications are agnostic to the intricacies of |
| the storage devices they are accessing. The virtual filesystem switch |
| and the block layer make things like hardware sector size and |
| transport protocols completely transparent to the application. |
| |
| However, this level of detail is required when preparing the |
| protection information to send to a disk. Consequently, the very |
| concept of an end-to-end protection scheme is a layering violation. |
| It is completely unreasonable for an application to be aware whether |
| it is accessing a SCSI or SATA disk. |
| |
| The data integrity support implemented in Linux attempts to hide this |
| from the application. As far as the application (and to some extent |
| the kernel) is concerned, the integrity metadata is opaque information |
| that's attached to the I/O. |
| |
| The current implementation allows the block layer to automatically |
| generate the protection information for any I/O. Eventually the |
| intent is to move the integrity metadata calculation to userspace for |
| user data. Metadata and other I/O that originates within the kernel |
| will still use the automatic generation interface. |
| |
| Some storage devices allow each hardware sector to be tagged with a |
| 16-bit value. The owner of this tag space is the owner of the block |
| device. I.e. the filesystem in most cases. The filesystem can use |
| this extra space to tag sectors as they see fit. Because the tag |
| space is limited, the block interface allows tagging bigger chunks by |
| way of interleaving. This way, 8*16 bits of information can be |
| attached to a typical 4KB filesystem block. |
| |
| This also means that applications such as fsck and mkfs will need |
| access to manipulate the tags from user space. A passthrough |
| interface for this is being worked on. |
| |
| |
| ---------------------------------------------------------------------- |
| 4. BLOCK LAYER IMPLEMENTATION DETAILS |
| |
| 4.1 BIO |
| |
| The data integrity patches add a new field to struct bio when |
| CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a |
| pointer to a struct bip which contains the bio integrity payload. |
| Essentially a bip is a trimmed down struct bio which holds a bio_vec |
| containing the integrity metadata and the required housekeeping |
| information (bvec pool, vector count, etc.) |
| |
| A kernel subsystem can enable data integrity protection on a bio by |
| calling bio_integrity_alloc(bio). This will allocate and attach the |
| bip to the bio. |
| |
| Individual pages containing integrity metadata can subsequently be |
| attached using bio_integrity_add_page(). |
| |
| bio_free() will automatically free the bip. |
| |
| |
| 4.2 BLOCK DEVICE |
| |
| Because the format of the protection data is tied to the physical |
| disk, each block device has been extended with a block integrity |
| profile (struct blk_integrity). This optional profile is registered |
| with the block layer using blk_integrity_register(). |
| |
| The profile contains callback functions for generating and verifying |
| the protection data, as well as getting and setting application tags. |
| The profile also contains a few constants to aid in completing, |
| merging and splitting the integrity metadata. |
| |
| Layered block devices will need to pick a profile that's appropriate |
| for all subdevices. blk_integrity_compare() can help with that. DM |
| and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6 |
| will require extra work due to the application tag. |
| |
| |
| ---------------------------------------------------------------------- |
| 5.0 BLOCK LAYER INTEGRITY API |
| |
| 5.1 NORMAL FILESYSTEM |
| |
| The normal filesystem is unaware that the underlying block device |
| is capable of sending/receiving integrity metadata. The IMD will |
| be automatically generated by the block layer at submit_bio() time |
| in case of a WRITE. A READ request will cause the I/O integrity |
| to be verified upon completion. |
| |
| IMD generation and verification can be toggled using the |
| |
| /sys/block/<bdev>/integrity/write_generate |
| |
| and |
| |
| /sys/block/<bdev>/integrity/read_verify |
| |
| flags. |
| |
| |
| 5.2 INTEGRITY-AWARE FILESYSTEM |
| |
| A filesystem that is integrity-aware can prepare I/Os with IMD |
| attached. It can also use the application tag space if this is |
| supported by the block device. |
| |
| |
| bool bio_integrity_prep(bio); |
| |
| To generate IMD for WRITE and to set up buffers for READ, the |
| filesystem must call bio_integrity_prep(bio). |
| |
| Prior to calling this function, the bio data direction and start |
| sector must be set, and the bio should have all data pages |
| added. It is up to the caller to ensure that the bio does not |
| change while I/O is in progress. |
| Complete bio with error if prepare failed for some reson. |
| |
| |
| 5.3 PASSING EXISTING INTEGRITY METADATA |
| |
| Filesystems that either generate their own integrity metadata or |
| are capable of transferring IMD from user space can use the |
| following calls: |
| |
| |
| struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages); |
| |
| Allocates the bio integrity payload and hangs it off of the bio. |
| nr_pages indicate how many pages of protection data need to be |
| stored in the integrity bio_vec list (similar to bio_alloc()). |
| |
| The integrity payload will be freed at bio_free() time. |
| |
| |
| int bio_integrity_add_page(bio, page, len, offset); |
| |
| Attaches a page containing integrity metadata to an existing |
| bio. The bio must have an existing bip, |
| i.e. bio_integrity_alloc() must have been called. For a WRITE, |
| the integrity metadata in the pages must be in a format |
| understood by the target device with the notable exception that |
| the sector numbers will be remapped as the request traverses the |
| I/O stack. This implies that the pages added using this call |
| will be modified during I/O! The first reference tag in the |
| integrity metadata must have a value of bip->bip_sector. |
| |
| Pages can be added using bio_integrity_add_page() as long as |
| there is room in the bip bio_vec array (nr_pages). |
| |
| Upon completion of a READ operation, the attached pages will |
| contain the integrity metadata received from the storage device. |
| It is up to the receiver to process them and verify data |
| integrity upon completion. |
| |
| |
| 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY |
| METADATA |
| |
| To enable integrity exchange on a block device the gendisk must be |
| registered as capable: |
| |
| int blk_integrity_register(gendisk, blk_integrity); |
| |
| The blk_integrity struct is a template and should contain the |
| following: |
| |
| static struct blk_integrity my_profile = { |
| .name = "STANDARDSBODY-TYPE-VARIANT-CSUM", |
| .generate_fn = my_generate_fn, |
| .verify_fn = my_verify_fn, |
| .tuple_size = sizeof(struct my_tuple_size), |
| .tag_size = <tag bytes per hw sector>, |
| }; |
| |
| 'name' is a text string which will be visible in sysfs. This is |
| part of the userland API so chose it carefully and never change |
| it. The format is standards body-type-variant. |
| E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC. |
| |
| 'generate_fn' generates appropriate integrity metadata (for WRITE). |
| |
| 'verify_fn' verifies that the data buffer matches the integrity |
| metadata. |
| |
| 'tuple_size' must be set to match the size of the integrity |
| metadata per sector. I.e. 8 for DIF and EPP. |
| |
| 'tag_size' must be set to identify how many bytes of tag space |
| are available per hardware sector. For DIF this is either 2 or |
| 0 depending on the value of the Control Mode Page ATO bit. |
| |
| ---------------------------------------------------------------------- |
| 2007-12-24 Martin K. Petersen <martin.petersen@oracle.com> |