|  | Lockless Ring Buffer Design | 
|  | =========================== | 
|  |  | 
|  | Copyright 2009 Red Hat Inc. | 
|  | Author:   Steven Rostedt <srostedt@redhat.com> | 
|  | License:   The GNU Free Documentation License, Version 1.2 | 
|  | (dual licensed under the GPL v2) | 
|  | Reviewers:   Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, | 
|  | and Frederic Weisbecker. | 
|  |  | 
|  |  | 
|  | Written for: 2.6.31 | 
|  |  | 
|  | Terminology used in this Document | 
|  | --------------------------------- | 
|  |  | 
|  | tail - where new writes happen in the ring buffer. | 
|  |  | 
|  | head - where new reads happen in the ring buffer. | 
|  |  | 
|  | producer - the task that writes into the ring buffer (same as writer) | 
|  |  | 
|  | writer - same as producer | 
|  |  | 
|  | consumer - the task that reads from the buffer (same as reader) | 
|  |  | 
|  | reader - same as consumer. | 
|  |  | 
|  | reader_page - A page outside the ring buffer used solely (for the most part) | 
|  | by the reader. | 
|  |  | 
|  | head_page - a pointer to the page that the reader will use next | 
|  |  | 
|  | tail_page - a pointer to the page that will be written to next | 
|  |  | 
|  | commit_page - a pointer to the page with the last finished non-nested write. | 
|  |  | 
|  | cmpxchg - hardware-assisted atomic transaction that performs the following: | 
|  |  | 
|  | A = B iff previous A == C | 
|  |  | 
|  | R = cmpxchg(A, C, B) is saying that we replace A with B if and only if | 
|  | current A is equal to C, and we put the old (current) A into R | 
|  |  | 
|  | R gets the previous A regardless if A is updated with B or not. | 
|  |  | 
|  | To see if the update was successful a compare of R == C may be used. | 
|  |  | 
|  | The Generic Ring Buffer | 
|  | ----------------------- | 
|  |  | 
|  | The ring buffer can be used in either an overwrite mode or in | 
|  | producer/consumer mode. | 
|  |  | 
|  | Producer/consumer mode is where if the producer were to fill up the | 
|  | buffer before the consumer could free up anything, the producer | 
|  | will stop writing to the buffer. This will lose most recent events. | 
|  |  | 
|  | Overwrite mode is where if the producer were to fill up the buffer | 
|  | before the consumer could free up anything, the producer will | 
|  | overwrite the older data. This will lose the oldest events. | 
|  |  | 
|  | No two writers can write at the same time (on the same per-cpu buffer), | 
|  | but a writer may interrupt another writer, but it must finish writing | 
|  | before the previous writer may continue. This is very important to the | 
|  | algorithm. The writers act like a "stack". The way interrupts works | 
|  | enforces this behavior. | 
|  |  | 
|  |  | 
|  | writer1 start | 
|  | <preempted> writer2 start | 
|  | <preempted> writer3 start | 
|  | writer3 finishes | 
|  | writer2 finishes | 
|  | writer1 finishes | 
|  |  | 
|  | This is very much like a writer being preempted by an interrupt and | 
|  | the interrupt doing a write as well. | 
|  |  | 
|  | Readers can happen at any time. But no two readers may run at the | 
|  | same time, nor can a reader preempt/interrupt another reader. A reader | 
|  | cannot preempt/interrupt a writer, but it may read/consume from the | 
|  | buffer at the same time as a writer is writing, but the reader must be | 
|  | on another processor to do so. A reader may read on its own processor | 
|  | and can be preempted by a writer. | 
|  |  | 
|  | A writer can preempt a reader, but a reader cannot preempt a writer. | 
|  | But a reader can read the buffer at the same time (on another processor) | 
|  | as a writer. | 
|  |  | 
|  | The ring buffer is made up of a list of pages held together by a linked list. | 
|  |  | 
|  | At initialization a reader page is allocated for the reader that is not | 
|  | part of the ring buffer. | 
|  |  | 
|  | The head_page, tail_page and commit_page are all initialized to point | 
|  | to the same page. | 
|  |  | 
|  | The reader page is initialized to have its next pointer pointing to | 
|  | the head page, and its previous pointer pointing to a page before | 
|  | the head page. | 
|  |  | 
|  | The reader has its own page to use. At start up time, this page is | 
|  | allocated but is not attached to the list. When the reader wants | 
|  | to read from the buffer, if its page is empty (like it is on start-up), | 
|  | it will swap its page with the head_page. The old reader page will | 
|  | become part of the ring buffer and the head_page will be removed. | 
|  | The page after the inserted page (old reader_page) will become the | 
|  | new head page. | 
|  |  | 
|  | Once the new page is given to the reader, the reader could do what | 
|  | it wants with it, as long as a writer has left that page. | 
|  |  | 
|  | A sample of how the reader page is swapped: Note this does not | 
|  | show the head page in the buffer, it is for demonstrating a swap | 
|  | only. | 
|  |  | 
|  | +------+ | 
|  | |reader|          RING BUFFER | 
|  | |page  | | 
|  | +------+ | 
|  | +---+   +---+   +---+ | 
|  | |   |-->|   |-->|   | | 
|  | |   |<--|   |<--|   | | 
|  | +---+   +---+   +---+ | 
|  | ^ |             ^ | | 
|  | | +-------------+ | | 
|  | +-----------------+ | 
|  |  | 
|  |  | 
|  | +------+ | 
|  | |reader|          RING BUFFER | 
|  | |page  |-------------------+ | 
|  | +------+                   v | 
|  | |             +---+   +---+   +---+ | 
|  | |             |   |-->|   |-->|   | | 
|  | |             |   |<--|   |<--|   |<-+ | 
|  | |             +---+   +---+   +---+  | | 
|  | |              ^ |             ^ |   | | 
|  | |              | +-------------+ |   | | 
|  | |              +-----------------+   | | 
|  | +------------------------------------+ | 
|  |  | 
|  | +------+ | 
|  | |reader|          RING BUFFER | 
|  | |page  |-------------------+ | 
|  | +------+ <---------------+ v | 
|  | |  ^          +---+   +---+   +---+ | 
|  | |  |          |   |-->|   |-->|   | | 
|  | |  |          |   |   |   |<--|   |<-+ | 
|  | |  |          +---+   +---+   +---+  | | 
|  | |  |             |             ^ |   | | 
|  | |  |             +-------------+ |   | | 
|  | |  +-----------------------------+   | | 
|  | +------------------------------------+ | 
|  |  | 
|  | +------+ | 
|  | |buffer|          RING BUFFER | 
|  | |page  |-------------------+ | 
|  | +------+ <---------------+ v | 
|  | |  ^          +---+   +---+   +---+ | 
|  | |  |          |   |   |   |-->|   | | 
|  | |  |  New     |   |   |   |<--|   |<-+ | 
|  | |  | Reader   +---+   +---+   +---+  | | 
|  | |  |  page ----^                 |   | | 
|  | |  |                             |   | | 
|  | |  +-----------------------------+   | | 
|  | +------------------------------------+ | 
|  |  | 
|  |  | 
|  |  | 
|  | It is possible that the page swapped is the commit page and the tail page, | 
|  | if what is in the ring buffer is less than what is held in a buffer page. | 
|  |  | 
|  |  | 
|  | reader page    commit page   tail page | 
|  | |              |             | | 
|  | v              |             | | 
|  | +---+           |             | | 
|  | |   |<----------+             | | 
|  | |   |<------------------------+ | 
|  | |   |------+ | 
|  | +---+      | | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | This case is still valid for this algorithm. | 
|  | When the writer leaves the page, it simply goes into the ring buffer | 
|  | since the reader page still points to the next location in the ring | 
|  | buffer. | 
|  |  | 
|  |  | 
|  | The main pointers: | 
|  |  | 
|  | reader page - The page used solely by the reader and is not part | 
|  | of the ring buffer (may be swapped in) | 
|  |  | 
|  | head page - the next page in the ring buffer that will be swapped | 
|  | with the reader page. | 
|  |  | 
|  | tail page - the page where the next write will take place. | 
|  |  | 
|  | commit page - the page that last finished a write. | 
|  |  | 
|  | The commit page only is updated by the outermost writer in the | 
|  | writer stack. A writer that preempts another writer will not move the | 
|  | commit page. | 
|  |  | 
|  | When data is written into the ring buffer, a position is reserved | 
|  | in the ring buffer and passed back to the writer. When the writer | 
|  | is finished writing data into that position, it commits the write. | 
|  |  | 
|  | Another write (or a read) may take place at anytime during this | 
|  | transaction. If another write happens it must finish before continuing | 
|  | with the previous write. | 
|  |  | 
|  |  | 
|  | Write reserve: | 
|  |  | 
|  | Buffer page | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+  <--- given back to writer (current commit) | 
|  | |reserved | | 
|  | +---------+ <--- tail pointer | 
|  | | empty   | | 
|  | +---------+ | 
|  |  | 
|  | Write commit: | 
|  |  | 
|  | Buffer page | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+  <--- next position for write (current commit) | 
|  | | empty   | | 
|  | +---------+ | 
|  |  | 
|  |  | 
|  | If a write happens after the first reserve: | 
|  |  | 
|  | Buffer page | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+  <-- current commit | 
|  | |reserved | | 
|  | +---------+  <--- given back to second writer | 
|  | |reserved | | 
|  | +---------+ <--- tail pointer | 
|  |  | 
|  | After second writer commits: | 
|  |  | 
|  |  | 
|  | Buffer page | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+  <--(last full commit) | 
|  | |reserved | | 
|  | +---------+ | 
|  | |pending  | | 
|  | |commit   | | 
|  | +---------+ <--- tail pointer | 
|  |  | 
|  | When the first writer commits: | 
|  |  | 
|  | Buffer page | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+ | 
|  | |written  | | 
|  | +---------+  <--(last full commit and tail pointer) | 
|  |  | 
|  |  | 
|  | The commit pointer points to the last write location that was | 
|  | committed without preempting another write. When a write that | 
|  | preempted another write is committed, it only becomes a pending commit | 
|  | and will not be a full commit until all writes have been committed. | 
|  |  | 
|  | The commit page points to the page that has the last full commit. | 
|  | The tail page points to the page with the last write (before | 
|  | committing). | 
|  |  | 
|  | The tail page is always equal to or after the commit page. It may | 
|  | be several pages ahead. If the tail page catches up to the commit | 
|  | page then no more writes may take place (regardless of the mode | 
|  | of the ring buffer: overwrite and produce/consumer). | 
|  |  | 
|  | The order of pages is: | 
|  |  | 
|  | head page | 
|  | commit page | 
|  | tail page | 
|  |  | 
|  | Possible scenario: | 
|  | tail page | 
|  | head page         commit page  | | 
|  | |                 |        | | 
|  | v                 v        v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | There is a special case that the head page is after either the commit page | 
|  | and possibly the tail page. That is when the commit (and tail) page has been | 
|  | swapped with the reader page. This is because the head page is always | 
|  | part of the ring buffer, but the reader page is not. Whenever there | 
|  | has been less than a full page that has been committed inside the ring buffer, | 
|  | and a reader swaps out a page, it will be swapping out the commit page. | 
|  |  | 
|  |  | 
|  | reader page    commit page   tail page | 
|  | |              |             | | 
|  | v              |             | | 
|  | +---+           |             | | 
|  | |   |<----------+             | | 
|  | |   |<------------------------+ | 
|  | |   |------+ | 
|  | +---+      | | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  | ^ | 
|  | | | 
|  | head page | 
|  |  | 
|  |  | 
|  | In this case, the head page will not move when the tail and commit | 
|  | move back into the ring buffer. | 
|  |  | 
|  | The reader cannot swap a page into the ring buffer if the commit page | 
|  | is still on that page. If the read meets the last commit (real commit | 
|  | not pending or reserved), then there is nothing more to read. | 
|  | The buffer is considered empty until another full commit finishes. | 
|  |  | 
|  | When the tail meets the head page, if the buffer is in overwrite mode, | 
|  | the head page will be pushed ahead one. If the buffer is in producer/consumer | 
|  | mode, the write will fail. | 
|  |  | 
|  | Overwrite mode: | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  | ^ | 
|  | | | 
|  | head page | 
|  |  | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  | ^ | 
|  | | | 
|  | head page | 
|  |  | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  | ^ | 
|  | | | 
|  | head page | 
|  |  | 
|  | Note, the reader page will still point to the previous head page. | 
|  | But when a swap takes place, it will use the most recent head page. | 
|  |  | 
|  |  | 
|  | Making the Ring Buffer Lockless: | 
|  | -------------------------------- | 
|  |  | 
|  | The main idea behind the lockless algorithm is to combine the moving | 
|  | of the head_page pointer with the swapping of pages with the reader. | 
|  | State flags are placed inside the pointer to the page. To do this, | 
|  | each page must be aligned in memory by 4 bytes. This will allow the 2 | 
|  | least significant bits of the address to be used as flags, since | 
|  | they will always be zero for the address. To get the address, | 
|  | simply mask out the flags. | 
|  |  | 
|  | MASK = ~3 | 
|  |  | 
|  | address & MASK | 
|  |  | 
|  | Two flags will be kept by these two bits: | 
|  |  | 
|  | HEADER - the page being pointed to is a head page | 
|  |  | 
|  | UPDATE - the page being pointed to is being updated by a writer | 
|  | and was or is about to be a head page. | 
|  |  | 
|  |  | 
|  | reader page | 
|  | | | 
|  | v | 
|  | +---+ | 
|  | |   |------+ | 
|  | +---+      | | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-H->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  |  | 
|  | The above pointer "-H->" would have the HEADER flag set. That is | 
|  | the next page is the next page to be swapped out by the reader. | 
|  | This pointer means the next page is the head page. | 
|  |  | 
|  | When the tail page meets the head pointer, it will use cmpxchg to | 
|  | change the pointer to the UPDATE state: | 
|  |  | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-H->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | "-U->" represents a pointer in the UPDATE state. | 
|  |  | 
|  | Any access to the reader will need to take some sort of lock to serialize | 
|  | the readers. But the writers will never take a lock to write to the | 
|  | ring buffer. This means we only need to worry about a single reader, | 
|  | and writes only preempt in "stack" formation. | 
|  |  | 
|  | When the reader tries to swap the page with the ring buffer, it | 
|  | will also use cmpxchg. If the flag bit in the pointer to the | 
|  | head page does not have the HEADER flag set, the compare will fail | 
|  | and the reader will need to look for the new head page and try again. | 
|  | Note, the flags UPDATE and HEADER are never set at the same time. | 
|  |  | 
|  | The reader swaps the reader page as follows: | 
|  |  | 
|  | +------+ | 
|  | |reader|          RING BUFFER | 
|  | |page  | | 
|  | +------+ | 
|  | +---+    +---+    +---+ | 
|  | |   |--->|   |--->|   | | 
|  | |   |<---|   |<---|   | | 
|  | +---+    +---+    +---+ | 
|  | ^ |               ^ | | 
|  | | +---------------+ | | 
|  | +-----H-------------+ | 
|  |  | 
|  | The reader sets the reader page next pointer as HEADER to the page after | 
|  | the head page. | 
|  |  | 
|  |  | 
|  | +------+ | 
|  | |reader|          RING BUFFER | 
|  | |page  |-------H-----------+ | 
|  | +------+                   v | 
|  | |             +---+    +---+    +---+ | 
|  | |             |   |--->|   |--->|   | | 
|  | |             |   |<---|   |<---|   |<-+ | 
|  | |             +---+    +---+    +---+  | | 
|  | |              ^ |               ^ |   | | 
|  | |              | +---------------+ |   | | 
|  | |              +-----H-------------+   | | 
|  | +--------------------------------------+ | 
|  |  | 
|  | It does a cmpxchg with the pointer to the previous head page to make it | 
|  | point to the reader page. Note that the new pointer does not have the HEADER | 
|  | flag set.  This action atomically moves the head page forward. | 
|  |  | 
|  | +------+ | 
|  | |reader|          RING BUFFER | 
|  | |page  |-------H-----------+ | 
|  | +------+                   v | 
|  | |  ^          +---+   +---+   +---+ | 
|  | |  |          |   |-->|   |-->|   | | 
|  | |  |          |   |<--|   |<--|   |<-+ | 
|  | |  |          +---+   +---+   +---+  | | 
|  | |  |             |             ^ |   | | 
|  | |  |             +-------------+ |   | | 
|  | |  +-----------------------------+   | | 
|  | +------------------------------------+ | 
|  |  | 
|  | After the new head page is set, the previous pointer of the head page is | 
|  | updated to the reader page. | 
|  |  | 
|  | +------+ | 
|  | |reader|          RING BUFFER | 
|  | |page  |-------H-----------+ | 
|  | +------+ <---------------+ v | 
|  | |  ^          +---+   +---+   +---+ | 
|  | |  |          |   |-->|   |-->|   | | 
|  | |  |          |   |   |   |<--|   |<-+ | 
|  | |  |          +---+   +---+   +---+  | | 
|  | |  |             |             ^ |   | | 
|  | |  |             +-------------+ |   | | 
|  | |  +-----------------------------+   | | 
|  | +------------------------------------+ | 
|  |  | 
|  | +------+ | 
|  | |buffer|          RING BUFFER | 
|  | |page  |-------H-----------+  <--- New head page | 
|  | +------+ <---------------+ v | 
|  | |  ^          +---+   +---+   +---+ | 
|  | |  |          |   |   |   |-->|   | | 
|  | |  |  New     |   |   |   |<--|   |<-+ | 
|  | |  | Reader   +---+   +---+   +---+  | | 
|  | |  |  page ----^                 |   | | 
|  | |  |                             |   | | 
|  | |  +-----------------------------+   | | 
|  | +------------------------------------+ | 
|  |  | 
|  | Another important point: The page that the reader page points back to | 
|  | by its previous pointer (the one that now points to the new head page) | 
|  | never points back to the reader page. That is because the reader page is | 
|  | not part of the ring buffer. Traversing the ring buffer via the next pointers | 
|  | will always stay in the ring buffer. Traversing the ring buffer via the | 
|  | prev pointers may not. | 
|  |  | 
|  | Note, the way to determine a reader page is simply by examining the previous | 
|  | pointer of the page. If the next pointer of the previous page does not | 
|  | point back to the original page, then the original page is a reader page: | 
|  |  | 
|  |  | 
|  | +--------+ | 
|  | | reader |  next   +----+ | 
|  | |  page  |-------->|    |<====== (buffer page) | 
|  | +--------+         +----+ | 
|  | |                | ^ | 
|  | |                v | next | 
|  | prev |              +----+ | 
|  | +------------->|    | | 
|  | +----+ | 
|  |  | 
|  | The way the head page moves forward: | 
|  |  | 
|  | When the tail page meets the head page and the buffer is in overwrite mode | 
|  | and more writes take place, the head page must be moved forward before the | 
|  | writer may move the tail page. The way this is done is that the writer | 
|  | performs a cmpxchg to convert the pointer to the head page from the HEADER | 
|  | flag to have the UPDATE flag set. Once this is done, the reader will | 
|  | not be able to swap the head page from the buffer, nor will it be able to | 
|  | move the head page, until the writer is finished with the move. | 
|  |  | 
|  | This eliminates any races that the reader can have on the writer. The reader | 
|  | must spin, and this is why the reader cannot preempt the writer. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-H->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | The following page will be made into the new head page. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | After the new head page has been set, we can set the old head page | 
|  | pointer back to NORMAL. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | After the head page has been moved, the tail page may now move forward. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  |  | 
|  | The above are the trivial updates. Now for the more complex scenarios. | 
|  |  | 
|  |  | 
|  | As stated before, if enough writes preempt the first write, the | 
|  | tail page may make it all the way around the buffer and meet the commit | 
|  | page. At this time, we must start dropping writes (usually with some kind | 
|  | of warning to the user). But what happens if the commit was still on the | 
|  | reader page? The commit page is not part of the ring buffer. The tail page | 
|  | must account for this. | 
|  |  | 
|  |  | 
|  | reader page    commit page | 
|  | |              | | 
|  | v              | | 
|  | +---+           | | 
|  | |   |<----------+ | 
|  | |   | | 
|  | |   |------+ | 
|  | +---+      | | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-H->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  | ^ | 
|  | | | 
|  | tail page | 
|  |  | 
|  | If the tail page were to simply push the head page forward, the commit when | 
|  | leaving the reader page would not be pointing to the correct page. | 
|  |  | 
|  | The solution to this is to test if the commit page is on the reader page | 
|  | before pushing the head page. If it is, then it can be assumed that the | 
|  | tail page wrapped the buffer, and we must drop new writes. | 
|  |  | 
|  | This is not a race condition, because the commit page can only be moved | 
|  | by the outermost writer (the writer that was preempted). | 
|  | This means that the commit will not move while a writer is moving the | 
|  | tail page. The reader cannot swap the reader page if it is also being | 
|  | used as the commit page. The reader can simply check that the commit | 
|  | is off the reader page. Once the commit page leaves the reader page | 
|  | it will never go back on it unless a reader does another swap with the | 
|  | buffer page that is also the commit page. | 
|  |  | 
|  |  | 
|  | Nested writes | 
|  | ------------- | 
|  |  | 
|  | In the pushing forward of the tail page we must first push forward | 
|  | the head page if the head page is the next page. If the head page | 
|  | is not the next page, the tail page is simply updated with a cmpxchg. | 
|  |  | 
|  | Only writers move the tail page. This must be done atomically to protect | 
|  | against nested writers. | 
|  |  | 
|  | temp_page = tail_page | 
|  | next_page = temp_page->next | 
|  | cmpxchg(tail_page, temp_page, next_page) | 
|  |  | 
|  | The above will update the tail page if it is still pointing to the expected | 
|  | page. If this fails, a nested write pushed it forward, the current write | 
|  | does not need to push it. | 
|  |  | 
|  |  | 
|  | temp page | 
|  | | | 
|  | v | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | Nested write comes in and moves the tail page forward: | 
|  |  | 
|  | tail page (moved by nested writer) | 
|  | temp page   | | 
|  | |        | | 
|  | v        v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | The above would fail the cmpxchg, but since the tail page has already | 
|  | been moved forward, the writer will just try again to reserve storage | 
|  | on the new tail page. | 
|  |  | 
|  | But the moving of the head page is a bit more complex. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-H->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | The write converts the head page pointer to UPDATE. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | But if a nested writer preempts here, it will see that the next | 
|  | page is a head page, but it is also nested. It will detect that | 
|  | it is nested and will save that information. The detection is the | 
|  | fact that it sees the UPDATE flag instead of a HEADER or NORMAL | 
|  | pointer. | 
|  |  | 
|  | The nested writer will set the new head page pointer. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | But it will not reset the update back to normal. Only the writer | 
|  | that converted a pointer from HEAD to UPDATE will convert it back | 
|  | to NORMAL. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | After the nested writer finishes, the outermost writer will convert | 
|  | the UPDATE pointer to NORMAL. | 
|  |  | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  |  | 
|  | It can be even more complex if several nested writes came in and moved | 
|  | the tail page ahead several pages: | 
|  |  | 
|  |  | 
|  | (first writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-H->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | The write converts the head page pointer to UPDATE. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | Next writer comes in, and sees the update and sets up the new | 
|  | head page. | 
|  |  | 
|  | (second writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | The nested writer moves the tail page forward. But does not set the old | 
|  | update page to NORMAL because it is not the outermost writer. | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-H->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | Another writer preempts and sees the page after the tail page is a head page. | 
|  | It changes it from HEAD to UPDATE. | 
|  |  | 
|  | (third writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-U->|   |---> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | The writer will move the head page forward: | 
|  |  | 
|  |  | 
|  | (third writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-U->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | But now that the third writer did change the HEAD flag to UPDATE it | 
|  | will convert it to normal: | 
|  |  | 
|  |  | 
|  | (third writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  |  | 
|  | Then it will move the tail page, and return back to the second writer. | 
|  |  | 
|  |  | 
|  | (second writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  |  | 
|  | The second writer will fail to move the tail page because it was already | 
|  | moved, so it will try again and add its data to the new tail page. | 
|  | It will return to the first writer. | 
|  |  | 
|  |  | 
|  | (first writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | The first writer cannot know atomically if the tail page moved | 
|  | while it updates the HEAD page. It will then update the head page to | 
|  | what it thinks is the new head page. | 
|  |  | 
|  |  | 
|  | (first writer) | 
|  |  | 
|  | tail page | 
|  | | | 
|  | v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-H->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | Since the cmpxchg returns the old value of the pointer the first writer | 
|  | will see it succeeded in updating the pointer from NORMAL to HEAD. | 
|  | But as we can see, this is not good enough. It must also check to see | 
|  | if the tail page is either where it use to be or on the next page: | 
|  |  | 
|  |  | 
|  | (first writer) | 
|  |  | 
|  | A        B    tail page | 
|  | |        |        | | 
|  | v        v        v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |-H->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | If tail page != A and tail page != B, then it must reset the pointer | 
|  | back to NORMAL. The fact that it only needs to worry about nested | 
|  | writers means that it only needs to check this after setting the HEAD page. | 
|  |  | 
|  |  | 
|  | (first writer) | 
|  |  | 
|  | A        B    tail page | 
|  | |        |        | | 
|  | v        v        v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |-U->|   |--->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  | 
|  | Now the writer can update the head page. This is also why the head page must | 
|  | remain in UPDATE and only reset by the outermost writer. This prevents | 
|  | the reader from seeing the incorrect head page. | 
|  |  | 
|  |  | 
|  | (first writer) | 
|  |  | 
|  | A        B    tail page | 
|  | |        |        | | 
|  | v        v        v | 
|  | +---+    +---+    +---+    +---+ | 
|  | <---|   |--->|   |--->|   |--->|   |-H-> | 
|  | --->|   |<---|   |<---|   |<---|   |<--- | 
|  | +---+    +---+    +---+    +---+ | 
|  |  |