author    | Mike Holmes <mike.holmes@linaro.org> | 2016-01-11 19:39:55 -0500
committer | Maxim Uvarov <maxim.uvarov@linaro.org> | 2016-03-04 13:24:49 +0300
commit    | 8e7549b4052544578971381b6c6c68d57ca0eb3c (patch)
tree      | 658a777a192b45f33034614e5dde47ea6a25455e /doc
parent    | ecb0ab65b6b4e8772405d7bc2c095ef229c18c26 (diff)
doc: users: migrate TM from API to users doc
Signed-off-by: Mike Holmes <mike.holmes@linaro.org>
Signed-off-by: Maxim Uvarov <maxim.uvarov@linaro.org>
Diffstat (limited to 'doc')
-rw-r--r-- | doc/users-guide/users-guide-tm.adoc | 264
-rw-r--r-- | doc/users-guide/users-guide.adoc    | 1
2 files changed, 265 insertions, 0 deletions
diff --git a/doc/users-guide/users-guide-tm.adoc b/doc/users-guide/users-guide-tm.adoc
new file mode 100644
index 000000000..dc0d0030f
--- /dev/null
+++ b/doc/users-guide/users-guide-tm.adoc
@@ -0,0 +1,264 @@

== Traffic Manager \(TM)

The TM subsystem is a general packet scheduling system that accepts packets
from input queues and applies strict priority scheduling, weighted fair
queueing and/or bandwidth controls to decide which input packet should be
chosen as the next output packet and when this output packet can be sent
onwards.

A given platform supporting this TM API could support one or more pure
hardware based packet scheduling systems, one or more pure software based
systems, or one or more hybrid systems where, because of hardware constraints,
some of the packet scheduling is done in hardware and some in software. In
addition, there may be further APIs beyond those described here for:

- controlling advanced capabilities supported by specific hardware, software
  or hybrid subsystems
- dealing with constraints and limitations of specific implementations

The intention here is to be the simplest API that covers the vast majority of
packet scheduling requirements.

Often a TM subsystem's outputs will be directly connected to a device's
physical (or virtual) output interfaces/links, in which case such a system is
sometimes called an Egress Packet Scheduler or an Output Link Shaper. While
the TM subsystems configured by this API can be used in that way, the API
equally well supports connecting a TM subsystem's outputs to other TM
subsystem input queues, to general software queues, or to some combination of
these three cases.

=== TM Algorithms

The packet scheduling/dropping techniques that can be applied to input
traffic include any mixture of the following:

- Strict Priority scheduling
- Weighted Fair Queueing (WFQ)
- Bandwidth Shaping
- Weighted Random Early Discard (WRED)

Note that Bandwidth Shaping is the only feature that can cause packets to be
"delayed", and WRED is the only feature (other than input queues becoming
full) that can cause packets to be dropped.

==== Strict Priority Scheduling

Strict Priority Scheduling (or just priority for short) is a technique where
input queues, and the packets from them, are assigned a priority value in the
range 0 .. ODP_TM_MAX_PRIORITIES - 1. At all times packets with the smallest
priority value will be chosen ahead of packets with a numerically larger
priority value. This is called strict priority scheduling because the
algorithm strictly enforces the scheduling of higher priority packets over
lower priority packets.
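To make the strict priority rule concrete, here is a minimal sketch of
choosing the next input to service when there is one input queue per priority
level. It is illustrative only; the names and types are hypothetical, not
part of the ODP TM API.

[source,c]
----
/* Illustrative sketch only - not part of the ODP TM API. Picks the next
 * queue to service by strict priority: the smallest numbered non-empty
 * priority level always wins. */
#define NUM_PRIORITIES 8

typedef struct {
	int head_pkt_len;  /* <= 0 means this queue is currently empty */
} input_queue_t;

/* One input queue per priority level, for simplicity */
int next_queue_strict_priority(input_queue_t queues[NUM_PRIORITIES])
{
	int prio;

	for (prio = 0; prio < NUM_PRIORITIES; prio++)
		if (queues[prio].head_pkt_len > 0)
			return prio;  /* first priority level with a packet */

	return -1;  /* all queues empty */
}
----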
==== Bandwidth Shaping

Bandwidth Shaping (or often just Shaping) is the term used here for the idea
of controlling packet rates using single rate and/or dual rate token bucket
algorithms. For single rate shaping, a rate (the commit rate) and a "burst
size" (the maximum commit count) are configured. Then an internal signed
integer counter called the _commitCnt_ is maintained, such that packets are
eligible to be sent only while the _commitCnt_ is positive. When such a packet
is actually sent, the _commitCnt_ is decremented (usually by the packet's
length, though one could decrement by 1 per packet instead). The _commitCnt_
is then incremented periodically based upon the configured rate, so that this
technique limits the traffic to the commit rate over the long term, while
allowing the rate to be exceeded for a very short time (based on the burst
size) in order to catch up if the traffic input temporarily drops below the
commit rate.

Dual Rate Shaping is designed to allow certain traffic flows to fairly send
more than their assigned commit rate when the scheduler has excess capacity.
The idea is that it may be better to let some types of traffic send more than
their committed bandwidth than to let the TM outputs sit idle. Configuring
Dual Rate Shaping additionally requires a peak rate and a peak burst size.
The peak rate must be greater than the related commit rate, but the burst
sizes have no similar constraint. Also, for every input priority that has
Dual Rate Shaping enabled, an additional equal or lower priority (equal or
higher numeric priority value) needs to be assigned. Then if the traffic
exceeds its commit rate but not its peak rate, the "excess" traffic is sent
at the lower priority level - which, by the strict priority algorithm, should
cause no degradation of the higher priority traffic, while leaving the
outputs idle less often.
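As an illustration of the single rate case, the sketch below maintains a
_commitCnt_ that is topped up from the configured rate and drained by packet
lengths. All names are hypothetical; this is a sketch of the token bucket
idea, not the ODP TM API.

[source,c]
----
#include <stdint.h>

/* Illustrative single rate shaper sketch - not the ODP TM API.
 * commit_rate is in bytes per second; commit_burst (the maximum
 * commit count) bounds how far the counter may accumulate. */
typedef struct {
	int64_t commit_cnt;    /* signed: may go negative after a send */
	int64_t commit_rate;   /* bytes per second */
	int64_t commit_burst;  /* maximum value of commit_cnt, in bytes */
} shaper_t;

/* Called periodically; elapsed_ns is the time since the last call */
void shaper_update(shaper_t *s, int64_t elapsed_ns)
{
	s->commit_cnt += (s->commit_rate * elapsed_ns) / 1000000000;
	if (s->commit_cnt > s->commit_burst)
		s->commit_cnt = s->commit_burst;  /* cap at the burst size */
}

/* A packet is eligible to be sent while the counter is positive */
int shaper_pkt_eligible(const shaper_t *s)
{
	return s->commit_cnt > 0;
}

/* Charge a sent packet against the bucket, usually by its length */
void shaper_pkt_sent(shaper_t *s, int64_t pkt_len)
{
	s->commit_cnt -= pkt_len;
}
----

Dual rate shaping would add a second counter of the same form for the peak
rate, with "excess" traffic (committed but not peak-exceeding) demoted to the
extra, lower priority level described above.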
==== Weighted Fair Queuing

Weighted Fair Queuing (WFQ) is used to arbitrate amongst multiple input
packets with the same priority. Each input can be assigned a weight in the
range MIN_WFQ_WEIGHT..MAX_WFQ_WEIGHT (nominally 1..255) that affects the way
the algorithm chooses the next packet. If all of the weights are equal AND
all of the input packets are the same length, then the algorithm is
equivalent to round robin scheduling. If all of the weights are equal but the
packets have different lengths, then the WFQ algorithm will attempt to choose
packets such that the inputs each get a fair share of the bandwidth - in
other words, it implements a weighted round robin algorithm where the
weighting is based on frame length.

When the input weights are not all equal and the input packet lengths vary,
the WFQ algorithm schedules packets such that the packet with the lowest
"Virtual Finish Time" is chosen first. An input packet's Virtual Finish Time
is roughly calculated from the WFQ object's base Virtual Finish Time when the
packet becomes the first packet in its queue, plus its frame length divided
by its weight:

----
virtualFinishTime = wfqVirtualTimeBase + (pktLength / wfqWeight)
----

In a system running at full capacity with no bandwidth limits, over the long
term, each input fan-in's average transmit rate will be the same fraction of
the output bandwidth as its weight is of the sum of all of the WFQ fan-in
weights. Hence larger WFQ weights result in better "service" for a given
fan-in:

[source,c]
----
totalWfqWeight = 0;
for (i = 0; i < numFanIn; i++)
	totalWfqWeight += fanIn[i]->wfqWeight;

for (i = 0; i < numFanIn; i++)
	fanIn[i]->avgTransmitRate =
		avgOutputRate * fanIn[i]->wfqWeight / totalWfqWeight;
----

==== Weighted Random Early Discard

The Weighted Random Early Discard (WRED) algorithm deals with the situation
where an input packet rate exceeds some output rate (including the case where
Bandwidth Shaping limits some output rates). Without WRED enabled and
configured, the TM system just implements a tail dropping scheme, whereby
whichever packet is unlucky enough to arrive when a TM input queue is full
is discarded regardless of priority or any other consideration. WRED allows
one to configure the system to use a better/fairer algorithm than simple
tail dropping. It works by measuring the "fullness" of various packet queues
and converting this percentage into a probability of random packet dropping
with the help of some configurable parameters. A random number is then picked
and, together with the drop probability, a decision is made to accept or drop
the packet. A basic parameterization of WRED requires three parameters:

- the maximum queue level, which could be either a maximum number of
  packets or a maximum amount of memory (i.e. bytes/buffers) used,
- a starting threshold - a number in the range 0..100 representing the
  percentage of the maximum queue level at which the drop probability
  becomes non-zero,
- a drop probability - a number in the range 0..100 representing a
  probability (0 means no drop and 100 means certain drop), used when the
  queue is near 100% full.

Note that all packet drops for a TM system occur only when a new packet
arrives at a given TM system input queue. At that time either the WRED
algorithm, if enabled for this input queue, or the "input queue full" tail
drop algorithm makes a drop/no drop decision. After this point, any packet
not dropped will at some point be sent out through a TM output - assuming
that the topology is fully connected and enabled.
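A sketch of the basic WRED decision under the three-parameter model just
described follows; the names are hypothetical and this is not the ODP TM API.
It ramps the drop probability linearly from zero at the starting threshold up
to the configured drop probability as the queue approaches full.

[source,c]
----
#include <stdint.h>
#include <stdlib.h>

/* Illustrative WRED sketch - not the ODP TM API. Returns 1 if the
 * arriving packet should be dropped. Assumes start_pct < 100. */
int wred_should_drop(uint32_t cur_level,   /* current queue fullness */
		     uint32_t max_level,   /* maximum queue level */
		     uint32_t start_pct,   /* starting threshold, 0..100 */
		     uint32_t drop_pct)    /* drop prob near full, 0..100 */
{
	uint32_t fullness_pct, drop_prob_pct;

	if (cur_level >= max_level)
		return 1;  /* queue full: unconditional (tail) drop */

	fullness_pct = (100 * cur_level) / max_level;
	if (fullness_pct < start_pct)
		return 0;  /* below the starting threshold: never drop */

	/* Ramp linearly from 0 at start_pct to drop_pct at 100% full */
	drop_prob_pct = drop_pct * (fullness_pct - start_pct) /
			(100 - start_pct);

	/* Pick a random number and compare with the drop probability */
	return (uint32_t)(rand() % 100) < drop_prob_pct;
}
----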
=== Hierarchical Scheduling and tm_nodes

This API supports the ability to do Hierarchical Scheduling, whereby the
final scheduling decision is controlled by equal priority schedulers, strict
priority multiplexers and bandwidth shapers - at multiple levels - all
forming a tree rooted at a single egress object. In other words, all
tm_queues and tm_nodes have the property that their logical "output" feeds
into one fan-in of a subsequent tm_node or egress object - forming a proper
tree.

.Hierarchical Scheduling
image::../images/tm_hierarchy.svg[align="center"]

Multi-level/hierarchical scheduling adds both great control and significant
complexity. Logically, despite the implication of the tm_node tree diagrams,
there are no queues between the levels of hierarchy. Instead all packets are
held in their input queue until the totality of all of the tm_nodes in the
single path from input queue to output object agrees that this packet should
be the next to leave the TM system through the output object "portal". Hence
what flows from level to level is the "local choice" of which packet/tm_queue
should be serviced next.

==== tm_nodes

Tm_nodes are the main "entity"/object that a TM system is composed of. Each
tm_node is a mini TM subsystem of its own, but the interconnection and
interplay of a multi-level "tree" of tm_nodes allows the user to specify some
very sophisticated behaviours. Each tm_node can contain a set of schedulers
(one per strict priority level), a strict priority multiplexer, a bandwidth
shaper and a WRED component - or a subset of these.

.Traffic Manager Node
image::../images/tm_node.svg[align="center"]

In its full generality a tm_node consists of a set of "fan-in" connections
from preceding tm_queues or tm_nodes. The fan-in for a single tm_node can
range from one to many thousands. This fan-in is divided first into one WFQ
scheduler per priority level, so if 4 priority levels are implemented by this
tm_node, there would be 4 WFQ schedulers - each with its own unique fan-in.
After the WFQ schedulers comes a priority chooser, which always chooses the
highest priority WFQ output available. The output of the priority chooser
then feeds a bandwidth shaper function, which finally uses the shaper's
propagation table to determine its output packet and its priority. This
output could then be remapped via a priority map profile, and then becomes
one of the input fan-ins of perhaps another level of tm_nodes, and so on.

During this process it is important to remember that the bandwidth shaping
function never causes packets to be dropped. Instead, all packet drops occur
because of tm_queue fullness, or by running the WRED algorithm, at the time
a new packet attempts to be appended to the end of some input queue.

The WRED profile associated with a tm_node considers the entire set of
tm_queues feeding directly or indirectly into it as its measure of queue
fullness.

==== tm_queues

Tm_queues are the second major type of "entity"/object that a TM system is
composed of. All packets MUST first enter the TM system via some tm_queue.
Then, logically, the head packets of all of the tm_queues are examined
simultaneously by the entire TM system, and ONE tm_queue is chosen to send
its head packet out of the TM system's egress. Abstractly, packets stay in
their tm_queue until they are chosen, at which time they are instantly
transferred from the tm_queue to/through the corresponding TM egress. It is
also important to note that packets in the same tm_queue MUST always stay in
order. In other words, the second packet in a tm_queue must never leave the
TM system through a TM egress spigot before the first packet has left the
system; tm_queue packet order must always be maintained.

==== TM egress

Note that TM egress objects are NOT referred to as queues, because in
many/most cases they have no multi-packet structure but are instead viewed
as a port/spigot through which the TM system schedules and finally transfers
input packets.
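Putting these three entity types together, the sketch below shows one way the
tree topology could be represented. It is a hypothetical data structure
sketch only - the real ODP TM API creates these objects through its own calls
and profiles - illustrating that every tm_queue and tm_node feeds exactly one
fan-in of its parent, forming a proper tree rooted at a single egress.

[source,c]
----
#include <stddef.h>

/* Illustrative topology sketch - not the ODP TM API. Scheduling and
 * shaper state is omitted; only the tree shape is shown. */
typedef struct tm_node_s tm_node_t;

struct tm_node_s {
	tm_node_t *parent;   /* NULL means this node feeds the egress */
	int        priority; /* strict priority of this fan-in */
	int        weight;   /* WFQ weight within that priority level */
};

typedef struct {
	tm_node_t *parent;   /* the tm_node this queue fans into */
	int        priority;
	int        weight;
} tm_queue_t;

/* Two mid-level nodes feeding one root node, which feeds the egress */
tm_node_t root  = { NULL,  0, 1 };
tm_node_t left  = { &root, 0, 8 };  /* larger WFQ share at the root */
tm_node_t right = { &root, 0, 1 };

/* Input queues fanning into the mid-level nodes */
tm_queue_t voice = { &left,  0, 1 };  /* priority 0: always served first */
tm_queue_t video = { &left,  1, 4 };
tm_queue_t bulk  = { &right, 1, 1 };
----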
=== Ideal versus Actual Behavior

It is important to recognize the difference between the "abstract"
mathematical model of the prescribed behavior and real implementations. The
model describes the Ideal, theoretically desired behavior, but such an Ideal
is generally not practical to implement. Instead, virtually all real TM
systems attempt to approximate the Ideal behavior given by the TM
configuration as closely as they can while still attaining high packet
processing performance. The idea is that instead of trying too hard to be
"perfect" at the granularity of, say, microseconds, it may be better to match
the long term Ideal behavior over a more reasonable period of time like a
millisecond. It is generally better to have a stable implementation that,
when averaged over a period of several milliseconds, matches the Ideal
behavior very closely, than to have an implementation that is perhaps more
accurate over a period of microseconds but whose millisecond averaged
behavior drifts away from the Ideal case.

=== Other TM Concepts

==== Profiles

This specification often packages related TM system parameters into
records/objects called profiles. These profiles can then be associated with
various entities like tm_nodes and tm_queues. This reduces the amount of
storage needed to hold related parameter settings. In addition, it is common
to re-use the same parameter set over and over again, and a profile makes it
possible to change the parameter set once and have the change affect all of
the entities with which it is associated.

==== Absolute Limits versus +odp_tm_capability_t+

This header file defines some constants representing the absolute maximum
settings for any TM system, though in most cases a TM system can (and should)
be created/instantiated with smaller values, since lower values will often
result in faster operation and/or less memory used.

diff --git a/doc/users-guide/users-guide.adoc b/doc/users-guide/users-guide.adoc
index 50ecab18c..d476225c9 100644
--- a/doc/users-guide/users-guide.adoc
+++ b/doc/users-guide/users-guide.adoc
@@ -1044,6 +1044,7 @@ in-place, when the output packet is the same as the input
 packet or the output packet can be a new packet provided by the application or
 allocated by the implementation from the session output pool.
 
+include::users-guide-tm.adoc[]
 include::users-guide-cls.adoc[]