| Open vSwitch datapath developer documentation |
| ============================================= |
| |
| The Open vSwitch kernel module allows flexible userspace control over |
| flow-level packet processing on selected network devices. It can be |
| used to implement a plain Ethernet switch, network device bonding, |
| VLAN processing, network access control, flow-based network control, |
| and so on. |
| |
| The kernel module implements multiple "datapaths" (analogous to |
| bridges), each of which can have multiple "vports" (analogous to ports |
| within a bridge). Each datapath also has associated with it a "flow |
| table" that userspace populates with "flows" that map from keys based |
| on packet headers and metadata to sets of actions. The most common |
| action forwards the packet to another vport; other actions are also |
| implemented. |
| |
| When a packet arrives on a vport, the kernel module processes it by |
| extracting its flow key and looking it up in the flow table. If there |
| is a matching flow, it executes the associated actions. If there is |
| no match, it queues the packet to userspace for processing (as part of |
| its processing, userspace will likely set up a flow to handle further |
| packets of the same type entirely in-kernel). |
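| |
| The lookup-then-upcall cycle described above can be sketched in a few |
| lines of Python. This is a toy model, not the kernel implementation: |
| the ToyDatapath class, the toy_extract() helper, and the tuple-based |
| key and action encodings are all hypothetical, and the sketch assumes |
| a plain exact-match table. |

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encodings: a flow key is a tuple of (name, value) pairs and
# an action is a (name, argument) pair. Neither mirrors the Netlink format.
FlowKey = Tuple[Tuple[str, int], ...]
Action = Tuple[str, int]


def toy_extract(in_port: int, packet: Dict[str, int]) -> FlowKey:
    """Build a key from metadata (in_port) and one header field."""
    return (("in_port", in_port), ("eth_type", packet["eth_type"]))


class ToyDatapath:
    def __init__(self) -> None:
        self.flow_table: Dict[FlowKey, List[Action]] = {}
        self.upcalls: List[Tuple[FlowKey, Dict[str, int]]] = []

    def flow_put(self, key: FlowKey, actions: List[Action]) -> None:
        """What userspace does to install a flow."""
        self.flow_table[key] = list(actions)

    def receive(self, in_port: int,
                packet: Dict[str, int]) -> Optional[List[Action]]:
        key = toy_extract(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # No match: queue the packet, with its key, for userspace.
            self.upcalls.append((key, packet))
            return None
        # Match: these are the actions that would be executed in-kernel.
        return actions


dp = ToyDatapath()
pkt = {"eth_type": 0x0800}
first = dp.receive(1, pkt)             # miss, queued to userspace
missed_key, _ = dp.upcalls[0]
dp.flow_put(missed_key, [("output", 2)])
second = dp.receive(1, pkt)            # hit, handled by the flow table
```

| After the first miss, userspace installs a flow under the key that the |
| datapath reported, so the second packet is handled without an upcall. |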
| |
| |
| Flow key compatibility |
| ---------------------- |
| |
| Network protocols evolve over time. New protocols become important |
| and existing protocols lose their prominence. For the Open vSwitch |
| kernel module to remain relevant, it must be possible for newer |
| versions to parse additional protocols as part of the flow key. It |
| might even be desirable, someday, to drop support for parsing |
| protocols that have become obsolete. Therefore, the Netlink interface |
| to Open vSwitch is designed to allow carefully written userspace |
| applications to work with any version of the flow key, past or future. |
| |
| To support this forward and backward compatibility, whenever the |
| kernel module passes a packet to userspace, it also passes along the |
| flow key that it parsed from the packet. Userspace then extracts its |
| own notion of a flow key from the packet and compares it against the |
| kernel-provided version: |
| |
| - If userspace's notion of the flow key for the packet matches the |
| kernel's, then nothing special is necessary. |
| |
| - If the kernel's flow key includes more fields than the userspace |
| version of the flow key, for example if the kernel decoded IPv6 |
| headers but userspace stopped at the Ethernet type (because it |
| does not understand IPv6), then again nothing special is |
| necessary. Userspace can still set up a flow in the usual way, |
| as long as it uses the kernel-provided flow key to do it. |
| |
| - If the userspace flow key includes more fields than the |
| kernel's, for example if userspace decoded an IPv6 header but |
| the kernel stopped at the Ethernet type, then userspace can |
| forward the packet manually, without setting up a flow in the |
| kernel. This case is bad for performance because every packet |
| that the kernel considers part of the flow must go to userspace, |
| but the forwarding behavior is correct. (If userspace can |
| determine that the values of the extra fields would not affect |
| forwarding behavior, then it could set up a flow anyway.) |
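| |
| The three cases above reduce to one question: did userspace parse any |
| attribute that the kernel did not report? A minimal sketch, assuming |
| flow keys are modeled as dictionaries keyed by attribute name (the |
| decide() function and its return strings are hypothetical, and the |
| sketch does not attempt the optimization in the final parenthesis): |

```python
from typing import Dict

INSTALL_FLOW = "install flow using the kernel-provided key"
FORWARD_MANUALLY = "forward in userspace, install no flow"


def decide(kernel_key: Dict[str, object], user_key: Dict[str, object]) -> str:
    # Attributes that userspace parsed but the kernel did not report.
    extra_in_userspace = set(user_key) - set(kernel_key)
    if extra_in_userspace:
        # Third case: the kernel cannot distinguish these packets.
        return FORWARD_MANUALLY
    # First and second cases: identical keys, or the kernel parsed more.
    return INSTALL_FLOW


same = decide({"in_port": 1, "eth_type": 0x86DD},
              {"in_port": 1, "eth_type": 0x86DD})
kernel_more = decide({"in_port": 1, "eth_type": 0x86DD, "ipv6": "..."},
                     {"in_port": 1, "eth_type": 0x86DD})
user_more = decide({"in_port": 1, "eth_type": 0x86DD},
                   {"in_port": 1, "eth_type": 0x86DD, "ipv6": "..."})
```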
| |
| How flow keys evolve over time is important to making this work, so |
| the following sections go into detail. |
| |
| |
| Flow key format |
| --------------- |
| |
| A flow key is passed over a Netlink socket as a sequence of Netlink |
| attributes. Some attributes represent packet metadata, defined as any |
| information about a packet that cannot be extracted from the packet |
| itself, e.g. the vport on which the packet was received. Most |
| attributes, however, are extracted from headers within the packet, |
| e.g. source and destination addresses from Ethernet, IP, or TCP |
| headers. |
| |
| The <linux/openvswitch.h> header file defines the exact format of the |
| flow key attributes. For informal explanatory purposes here, we write |
| them as comma-separated strings, with parentheses indicating arguments |
| and nesting. For example, the following could represent a flow key |
| corresponding to a TCP packet that arrived on vport 1: |
| |
| in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), |
| eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0, |
| frag=no), tcp(src=49163, dst=80) |
| |
| Often we ellipsize arguments not important to the discussion, e.g.: |
| |
| in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) |
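| |
| To make the informal notation concrete, the following sketch renders |
| a flow key in it. The representation is an assumption made for this |
| illustration only: a key is a list of (name, value) pairs, where a |
| dict value holds named arguments and a list value holds nested |
| attributes. format_key() and format_attr() are hypothetical helpers. |

```python
from typing import List, Tuple, Union

Value = Union[int, str, dict, list]
Attr = Tuple[str, Value]


def format_attr(name: str, value: Value) -> str:
    if isinstance(value, dict):
        # Named arguments, e.g. tcp(src=49163, dst=80).
        inner = ", ".join(f"{k}={v}" for k, v in value.items())
    elif isinstance(value, list):
        # Nested attributes, e.g. encap(eth_type(0x0800)).
        inner = ", ".join(format_attr(n, v) for n, v in value)
    else:
        inner = str(value)
    return f"{name}({inner})"


def format_key(attrs: List[Attr]) -> str:
    return ", ".join(format_attr(n, v) for n, v in attrs)


flat_text = format_key([
    ("in_port", 1),
    ("eth_type", "0x0800"),
    ("tcp", {"src": 49163, "dst": 80}),
])
nested_text = format_key([("encap", [("eth_type", "0x0800")])])
```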
| |
| |
| Basic rule for evolving flow keys |
| --------------------------------- |
| |
| Some care is needed to really maintain forward and backward |
| compatibility for applications that follow the rules listed under |
| "Flow key compatibility" above. |
| |
| The basic rule is obvious: |
| |
| ------------------------------------------------------------------ |
| New network protocol support must only supplement existing flow |
| key attributes. It must not change the meaning of already defined |
| flow key attributes. |
| ------------------------------------------------------------------ |
| |
| This rule does have less-obvious consequences, so it is worth working |
| through a few examples. Suppose, for example, that the kernel module |
| did not already implement VLAN parsing. Instead, it just interpreted |
| the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the |
| packet. The flow key for any packet with an 802.1Q header would look |
| essentially like this, ignoring metadata: |
| |
| eth(...), eth_type(0x8100) |
| |
| Naively, to add VLAN support, it makes sense to add a new "vlan" flow |
| key attribute to contain the VLAN tag, then continue to decode the |
| encapsulated headers beyond the VLAN tag using the existing field |
| definitions. With this change, a TCP packet in VLAN 10 would have a |
| flow key much like this: |
| |
| eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) |
| |
| But this change would negatively affect a userspace application that |
| has not been updated to understand the new "vlan" flow key attribute. |
| The application could, following the flow compatibility rules above, |
| ignore the "vlan" attribute that it does not understand and therefore |
| assume that the flow contained IP packets. This is a bad assumption |
| (the flow only contains IP packets if one parses and skips over the |
| 802.1Q header) and it could cause the application's behavior to change |
| across kernel versions even though it follows the compatibility rules. |
| |
| The solution is to use a set of nested attributes. This is, for |
| example, why 802.1Q support uses nested attributes. A TCP packet in |
| VLAN 10 is actually expressed as: |
| |
| eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), |
| ip(proto=6, ...), tcp(...)) |
| |
| Notice how the "eth_type", "ip", and "tcp" flow key attributes are |
| nested inside the "encap" attribute. Thus, an application that does |
| not understand the "vlan" key will not see any of those attributes |
| and therefore will not misinterpret them. (Also, the outer eth_type |
| is still 0x8100, not changed to 0x0800.) |
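| |
| The difference between the two encodings can be checked with a small |
| sketch. It assumes a hypothetical older application that skips any |
| top-level attribute it does not recognize; the KNOWN set, the |
| visible() helper, and the list-of-pairs key model are illustrative |
| only. |

```python
from typing import Dict, List, Set, Tuple

Attr = Tuple[str, object]

# Attributes understood by the hypothetical older application.
KNOWN: Set[str] = {"eth", "eth_type", "ip", "tcp"}


def visible(attrs: List[Attr], known: Set[str]) -> Dict[str, object]:
    """Keep only the top-level attributes the application understands."""
    return {name: value for name, value in attrs if name in known}


# Naive encoding: "vlan" inserted between existing attributes.
flat = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
        ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Actual encoding: inner headers hidden inside "encap".
nested = [("eth", "..."), ("eth_type", 0x8100),
          ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}),
                     ("tcp", "...")])]

flat_view = visible(flat, KNOWN)
nested_view = visible(nested, KNOWN)
```

| With the naive encoding the older application sees what looks like an |
| untagged IP flow. With the nested encoding it sees only eth and an |
| eth_type of 0x8100, which matches what it would have parsed itself. |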
| |
| |
| Handling malformed packets |
| -------------------------- |
| |
| Don't drop packets in the kernel for malformed protocol headers, bad |
| checksums, etc. This would prevent userspace from implementing a |
| simple Ethernet switch that forwards every packet. |
| |
| Instead, in such a case, include an attribute with "empty" content. |
| It doesn't matter if the empty content could be valid protocol values, |
| as long as those values are rarely seen in practice, because userspace |
| can always have all packets with those values sent to userspace and |
| handle them individually. |
| |
| For example, consider a packet that contains an IP header that |
| indicates protocol 6 for TCP, but which is truncated just after the IP |
| header, so that the TCP header is missing. The flow key for this |
| packet would include a tcp attribute with all-zero src and dst, like |
| this: |
| |
| eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) |
| |
| As another example, consider a packet with an Ethernet type of 0x8100, |
| indicating that a VLAN TCI should follow, but which is truncated just |
| after the Ethernet type. The flow key for this packet would include |
| an all-zero-bits vlan and an empty encap attribute, like this: |
| |
| eth(...), eth_type(0x8100), vlan(0), encap() |
| |
| Unlike a TCP packet with source and destination ports 0, an |
| all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka |
| VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan |
| attribute expressly to allow this situation to be distinguished. |
| Thus, the flow key in this second example unambiguously indicates a |
| missing or malformed VLAN TCI. |
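| |
| Both examples can be sketched as follows. This is a simplified |
| illustration, not the kernel's parser: tcp_attr() and vlan_attr() are |
| hypothetical helpers, the TCP check only tests for the minimum header |
| length, and the CFI bit is assumed to be 0x1000 within the 16-bit TCI. |

```python
import struct
from typing import Dict

TCP_MIN_HEADER_LEN = 20   # bytes in a TCP header without options
VLAN_CFI = 0x1000         # CFI bit within the 16-bit TCI (assumed value)


def tcp_attr(l4: bytes) -> Dict[str, int]:
    """Return the tcp attribute for the bytes that follow the IP header."""
    if len(l4) < TCP_MIN_HEADER_LEN:
        # Truncated header: report "empty" content instead of dropping.
        return {"src": 0, "dst": 0}
    src, dst = struct.unpack("!HH", l4[:4])
    return {"src": src, "dst": dst}


def vlan_attr(after_tpid: bytes) -> int:
    """Return the vlan attribute value for the bytes that follow 0x8100."""
    if len(after_tpid) < 2:
        return 0                # all-zero bits: the TCI is missing
    (tci,) = struct.unpack("!H", after_tpid[:2])
    return tci | VLAN_CFI       # CFI set marks a TCI that was present


truncated_tcp = tcp_attr(b"")
full_tcp = tcp_attr(struct.pack("!HH", 49163, 80) + bytes(16))
missing_tci = vlan_attr(b"")
zero_tci = vlan_attr(b"\x00\x00")
vlan_10 = vlan_attr(struct.pack("!H", 10))
```

| Because the CFI bit is set whenever a TCI was actually read, a real |
| all-zero TCI (zero_tci) remains distinguishable from a missing one |
| (missing_tci). |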
| |
| |
| Other rules |
| ----------- |
| |
| The other rules for flow keys are much less subtle: |
| |
| - Duplicate attributes are not allowed at a given nesting level. |
| |
| - Ordering of attributes is not significant. |
| |
| - When the kernel sends a given flow key to userspace, it always |
| composes it the same way. This allows userspace to hash and |
| compare entire flow keys that it may not be able to fully |
| interpret. |
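| |
| Because the kernel always composes a given flow key the same way, |
| userspace can treat the raw attribute bytes as an opaque lookup key. |
| A minimal sketch, assuming the raw key is available as a byte string |
| (remember() and lookup() are hypothetical helpers, and the byte values |
| below are placeholders, not a real Netlink encoding): |

```python
from typing import Dict, List, Optional

# Maps the raw, uninterpreted flow key bytes to the actions chosen for it.
flows_by_raw_key: Dict[bytes, List[str]] = {}


def remember(raw_key: bytes, actions: List[str]) -> None:
    # bytes() makes an immutable, hashable copy of the buffer.
    flows_by_raw_key[bytes(raw_key)] = list(actions)


def lookup(raw_key: bytes) -> Optional[List[str]]:
    return flows_by_raw_key.get(bytes(raw_key))


remember(b"\x01\x02\x03", ["output:2"])
hit = lookup(bytearray(b"\x01\x02\x03"))   # same bytes, different buffer
miss = lookup(b"\x03\x02\x01")             # different bytes
```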