| Anticipatory IO scheduler |
| ------------------------- |
| Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003 |
| |
| Attention! Database servers, especially those using "TCQ" disks should |
| investigate performance with the 'deadline' IO scheduler. Any system with high |
| disk performance requirements should do so, in fact. |
| |
| If you see unusual performance characteristics of your disk systems, or you |
| see big performance regressions versus the deadline scheduler, please email |
| me. Database users don't bother unless you're willing to test a lot of patches |
| from me ;) its a known issue. |
| |
| Also, users with hardware RAID controllers, doing striping, may find |
| highly variable performance results with using the as-iosched. The |
| as-iosched anticipatory implementation is based on the notion that a disk |
| device has only one physical seeking head. A striped RAID controller |
| actually has a head for each physical device in the logical RAID device. |
| |
| However, setting the antic_expire (see tunable parameters below) produces |
| very similar behavior to the deadline IO scheduler. |
| |
| |
| Selecting IO schedulers |
| ----------------------- |
| To choose IO schedulers at boot time, use the argument 'elevator=deadline'. |
| 'noop' and 'as' (the default) are also available. IO schedulers are assigned |
| globally at boot time only presently. |
| |
| |
| Anticipatory IO scheduler Policies |
| ---------------------------------- |
| The as-iosched implementation implements several layers of policies |
| to determine when an IO request is dispatched to the disk controller. |
| Here are the policies outlined, in order of application. |
| |
| 1. one-way Elevator algorithm. |
| |
| The elevator algorithm is similar to that used in deadline scheduler, with |
| the addition that it allows limited backward movement of the elevator |
| (i.e. seeks backwards). A seek backwards can occur when choosing between |
| two IO requests where one is behind the elevator's current position, and |
| the other is in front of the elevator's position. If the seek distance to |
| the request in back of the elevator is less than half the seek distance to |
| the request in front of the elevator, then the request in back can be chosen. |
| Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors. |
| This favors forward movement of the elevator, while allowing opportunistic |
| "short" backward seeks. |
| |
| 2. FIFO expiration times for reads and for writes. |
| |
| This is again very similar to the deadline IO scheduler. The expiration |
| times for requests on these lists is tunable using the parameters read_expire |
| and write_expire discussed below. When a read or a write expires in this way, |
| the IO scheduler will interrupt its current elevator sweep or read anticipation |
| to service the expired request. |
| |
| 3. Read and write request batching |
| |
| A batch is a collection of read requests or a collection of write |
| requests. The as scheduler alternates dispatching read and write batches |
| to the driver. In the case a read batch, the scheduler submits read |
| requests to the driver as long as there are read requests to submit, and |
| the read batch time limit has not been exceeded (read_batch_expire). |
| The read batch time limit begins counting down only when there are |
| competing write requests pending. |
| |
| In the case of a write batch, the scheduler submits write requests to |
| the driver as long as there are write requests available, and the |
| write batch time limit has not been exceeded (write_batch_expire). |
| However, the length of write batches will be gradually shortened |
| when read batches frequently exceed their time limit. |
| |
| When changing between batch types, the scheduler waits for all requests |
| from the previous batch to complete before scheduling requests for the |
| next batch. |
| |
| The read and write fifo expiration times described in policy 2 above |
| are checked only when in scheduling IO of a batch for the corresponding |
| (read/write) type. So for example, the read FIFO timeout values are |
| tested only during read batches. Likewise, the write FIFO timeout |
| values are tested only during write batches. For this reason, |
| it is generally not recommended for the read batch time |
| to be longer than the write expiration time, nor for the write batch |
| time to exceed the read expiration time (see tunable parameters below). |
| |
| When the IO scheduler changes from a read to a write batch, |
| it begins the elevator from the request that is on the head of the |
| write expiration FIFO. Likewise, when changing from a write batch to |
| a read batch, scheduler begins the elevator from the first entry |
| on the read expiration FIFO. |
| |
| 4. Read anticipation. |
| |
| Read anticipation occurs only when scheduling a read batch. |
| This implementation of read anticipation allows only one read request |
| to be dispatched to the disk controller at a time. In |
| contrast, many write requests may be dispatched to the disk controller |
| at a time during a write batch. It is this characteristic that can make |
| the anticipatory scheduler perform anomalously with controllers supporting |
| TCQ, or with hardware striped RAID devices. Setting the antic_expire |
| queue parameter (see below) to zero disables this behavior, and the |
| anticipatory scheduler behaves essentially like the deadline scheduler. |
| |
| When read anticipation is enabled (antic_expire is not zero), reads |
| are dispatched to the disk controller one at a time. |
| At the end of each read request, the IO scheduler examines its next |
| candidate read request from its sorted read list. If that next request |
| is from the same process as the request that just completed, |
| or if the next request in the queue is "very close" to the |
| just completed request, it is dispatched immediately. Otherwise, |
| statistics (average think time, average seek distance) on the process |
| that submitted the just completed request are examined. If it seems |
| likely that that process will submit another request soon, and that |
| request is likely to be near the just completed request, then the IO |
| scheduler will stop dispatching more read requests for up time (antic_expire) |
| milliseconds, hoping that process will submit a new request near the one |
| that just completed. If such a request is made, then it is dispatched |
| immediately. If the antic_expire wait time expires, then the IO scheduler |
| will dispatch the next read request from the sorted read queue. |
| |
| To decide whether an anticipatory wait is worthwhile, the scheduler |
| maintains statistics for each process that can be used to compute |
| mean "think time" (the time between read requests), and mean seek |
| distance for that process. One observation is that these statistics |
| are associated with each process, but those statistics are not associated |
| with a specific IO device. So for example, if a process is doing IO |
| on several file systems on separate devices, the statistics will be |
| a combination of IO behavior from all those devices. |
| |
| |
| Tuning the anticipatory IO scheduler |
| ------------------------------------ |
| When using 'as', the anticipatory IO scheduler there are 5 parameters under |
| /sys/block/*/queue/iosched/. All are units of milliseconds. |
| |
| The parameters are: |
| * read_expire |
| Controls how long until a read request becomes "expired". It also controls the |
| interval between which expired requests are served, so set to 50, a request |
| might take anywhere < 100ms to be serviced _if_ it is the next on the |
| expired list. Obviously request expiration strategies won't make the disk |
| go faster. The result basically equates to the timeslice a single reader |
| gets in the presence of other IO. 100*((seek time / read_expire) + 1) is |
| very roughly the % streaming read efficiency your disk should get with |
| multiple readers. |
| |
| * read_batch_expire |
| Controls how much time a batch of reads is given before pending writes are |
| served. A higher value is more efficient. This might be set below read_expire |
| if writes are to be given higher priority than reads, but reads are to be |
| as efficient as possible when there are no writes. Generally though, it |
| should be some multiple of read_expire. |
| |
| * write_expire, and |
| * write_batch_expire are equivalent to the above, for writes. |
| |
| * antic_expire |
| Controls the maximum amount of time we can anticipate a good read (one |
| with a short seek distance from the most recently completed request) before |
| giving up. Many other factors may cause anticipation to be stopped early, |
| or some processes will not be "anticipated" at all. Should be a bit higher |
| for big seek time devices though not a linear correspondence - most |
| processes have only a few ms thinktime. |
| |