Age | Commit message | Author |
|
Reported-by: Alexey I. Froloff <raorn@altlinux.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
|
|
net/udp.h is currently included indirectly via linux/ipv6.h which is
in turn included indirectly via linux/ip.h. However, this breaks down
if CONFIG_IPV6 is not set, leading to a number of build errors.
Signed-off-by: Simon Horman <horms@verge.net.au>
[Jesse: shortened commit message]
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Simon Horman <horms@verge.net.au>
[Jesse: Added missing pr_fmt in vport-gre.c and dp_sysfs_dp.c]
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
In the earliest kernels that we support this family of macros
wasn't defined at all. Later they were defined but did not include
the module name. Finally, pr_warn was made a synonym for pr_warning.
This harmonizes the behavior across all kernels.
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Some of the flow actions that modify skbuff data did not check that the
skbuff was long enough before doing so. This commit fixes that problem.
Previously, the strategy for avoiding this was to only indicate the layer-3
nw_proto field in the flow if the corresponding layer-4 header was fully
present, so that if, for example, nw_proto was IPPROTO_TCP, this meant
that a TCP header was present. The original motivation for this patch was
to add corresponding code to only indicate a layer-2 dl_type if the
corresponding layer-3 header was fully present. But I'm now convinced that
this approach is conceptually wrong, because the meaning of a layer-N
header should not be affected by the meaning of a layer-(N+1) header.
This commit switches to a new approach. Now, when a header is missing, its
fields in the flow are simply zeroed and have no effect on the "type" field
for the outer header. Responsibility for ensuring that a header is fully
present is now shifted to the actions that wish to modify that header.
Signed-off-by: Ben Pfaff <blp@nicira.com>
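The approach described above, where each modifying action verifies that its own header is present, can be sketched in userspace C. This is only an illustration under stated assumptions: `struct pkt`, `set_tp_dst()` and `demo_set_tp_dst()` are hypothetical names standing in for the skbuff and the datapath action code, and checksum updates are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an skbuff: a linear buffer plus the offset
 * at which the transport header would start. */
struct pkt {
    uint8_t *data;
    size_t len;        /* bytes actually present */
    size_t l4_offset;  /* where the TCP/UDP header begins */
};

#define TP_PORTS_LEN 4  /* source port + destination port */

/* Rewrite the destination port (bytes 2..3 of a TCP or UDP header),
 * but only after confirming that those bytes exist. */
int set_tp_dst(struct pkt *p, uint16_t port)
{
    if (p->l4_offset > p->len || p->len - p->l4_offset < TP_PORTS_LEN)
        return -EINVAL;   /* header truncated: leave the packet alone */

    p->data[p->l4_offset + 2] = (uint8_t)(port >> 8);
    p->data[p->l4_offset + 3] = (uint8_t)(port & 0xff);
    return 0;
}

/* Usage: the same action applied to a complete and a truncated packet.
 * Returns 1 when both behave as expected. */
int demo_set_tp_dst(void)
{
    uint8_t buf[24] = {0};
    struct pkt full = { buf, sizeof buf, 20 };   /* 4 bytes of L4 present */
    struct pkt cut  = { buf, 22, 20 };           /* only 2 bytes of L4 */

    int ok = set_tp_dst(&full, 8080);            /* 8080 == 0x1f90 */
    int bad = set_tp_dst(&cut, 53);

    return ok == 0 && bad == -EINVAL && buf[22] == 0x1f && buf[23] == 0x90;
}
```

The length test is written as a subtraction on the side that cannot underflow, so a bogus `l4_offset` beyond the end of the buffer is rejected rather than wrapping around.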
|
|
This commit started out as simply better documenting flow_extract(),
but then I realized that nothing cares about transport_header in the
non-IP case, so don't bother with it at all.
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
These calls to pskb_may_pull() can be reduced to checks on skb->len because
in these contexts those headers will already have been pulled into the
skb linear area if they are present at all.
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
Until now flow_extract() has simply returned a bogus flow when memory
allocation errors occurred. This fixes the problem by propagating the
error to the caller.
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
"ARP spoofing" is when a host claims an incorrect association between an
IP address and a MAC address for deceptive purposes. OpenFlow by itself
can prevent a host from sending out ARP replies from an incorrect MAC
address in the Ethernet L2 header, but it cannot control the MAC addresses
inside the ARP L3 packet. This commit adds a new action that can be used
to drop these spoofed packets.
CC: Paul Ingram <paul@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
flow_extract() can fail due to memory allocation errors in pskb_may_pull().
Currently it doesn't return those properly, instead just reporting a bogus
flow to the caller. But its return value is currently in use for reporting
whether the packet was an IPv4 fragment. This commit switches to reporting
that in the skb itself so that the return value can be reused to report
errors.
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
The callers ensure that this is already the case.
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
'bool' is better modern kernel style.
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
Signed-off-by: Ben Pfaff <blp@nicira.com>
|
|
kfree_skb() will ignore a NULL pointer.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Add support for the transport portion of the CAPWAP protocol as
an alternative to GRE for L2 over L3 tunneling. This is not
full support for the CAPWAP protocol. CAPWAP covers management
of wireless access points and describes a control protocol for
setting those devices up. It also describes a data plane protocol
that allows packets to be tunneled to a controller for inspection.
This data plane protocol is the only component covered by this
commit.
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Up until now it was assumed that encapsulated packets larger than
the MTU would be fragmented by the IP stack. However, some
tunneling protocols provide their own fragmentation mechanism. This
adds the necessary support to the generic tunnel code to support
fragmentation.
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Much of the code in the GRE implementation is not specific to the
GRE protocol but is actually common to all types of tunnels. In
order to support future types of tunnels, move this code into a
common library.
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Between 2.6.35 and 2.6.36-rc1 the owner element of struct brport_attribute
was removed.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
This adds compatibility with a series of kernel changesets that
introduce 64-bit statistics, the final changeset (to date) being
"net: Document that dev_get_stats() returns the given pointer".
The relevant changesets were added between 2.6.35 and 2.6.36-rc1.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
This adds compatibility with kernel changeset
"bridge: use rx_handler_data pointer to store net_bridge_port pointer"
which was added between 2.6.35 and 2.6.36-rc1.
With this change it is now safe to (attempt to) insert both bridge and
datapath with newer (>=2.6.36) kernels, although whichever is inserted
second will fail to initialise on the call to netdev_rx_handler_register().
Signed-off-by: Simon Horman <horms@verge.net.au>
[Jesse: fixed merge conflicts in vport-netdev.c and netdevice.h]
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Although not strictly necessary, this will make this
function more consistent when compatibility for 2.6.36 is added.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
This brings the code in sync with the kernel as
of changeset "net-next: remove useless union keyword",
which was added between 2.6.35 and 2.6.36-rc1.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
For kernels that have netdev_rx_handler_register() (>=2.6.36),
duplicate netdevs are detected by netdev_rx_handler_register().
So by adding duplicate detection to the netdev_rx_handler_register()
compatibility code the explicit check in netdev_create() can be removed.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
This adds compatibility with kernel changeset
"net: add rx_handler data pointer"
and thus "net: replace hooks in __netif_receive_skb V5",
which were added between 2.6.35 and 2.6.36-rc1.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
We enforce mutual exclusion when updating statistics by disabling
bottom halves and only writing to per-CPU state. However, reading
requires looking at the statistics for foreign CPUs, which could be
in the process of updating them since there isn't a lock. This means
we could get garbage values for 64-bit values on 32-bit machines or
byte counts that don't correspond to packet counts, etc.
This commit introduces a sequence lock for statistics values to avoid
this problem. Getting a write lock is very cheap - it only requires
incrementing a counter plus a memory barrier (which is compiled away
on x86) to acquire or release the lock and will never block. On
read we spin until the sequence number hasn't changed in the middle
of the operation, indicating that we have a consistent set of
values.
Signed-off-by: Jesse Gross <jesse@nicira.com>
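The sequence-lock scheme described above can be sketched in portable C11. This is a userspace illustration, not the datapath code: `struct pcpu_stats`, `stats_update()`, `stats_read()` and `demo_stats()` are hypothetical names, and the counters are accessed as relaxed atomics only to keep the example free of data races under the C11 memory model (kernel code would use plain fields with its own barrier primitives). On some 32-bit targets the 64-bit atomics may need libatomic.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-CPU statistics block guarded by a sequence counter. */
struct pcpu_stats {
    atomic_uint seq;            /* odd while an update is in progress */
    _Atomic uint64_t packets;
    _Atomic uint64_t bytes;
};

/* Writer side: never blocks. Assumes one writer per block (the owning CPU). */
void stats_update(struct pcpu_stats *s, uint64_t len)
{
    unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    uint64_t p = atomic_load_explicit(&s->packets, memory_order_relaxed);
    uint64_t b = atomic_load_explicit(&s->bytes, memory_order_relaxed);

    /* "Acquire" the write side: make the sequence number odd. */
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&s->packets, p + 1, memory_order_relaxed);
    atomic_store_explicit(&s->bytes, b + len, memory_order_relaxed);

    /* "Release": make the sequence number even again. */
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side: retry until the sequence number is even and unchanged. */
void stats_read(struct pcpu_stats *s, uint64_t *packets, uint64_t *bytes)
{
    unsigned int before, after;

    do {
        before = atomic_load_explicit(&s->seq, memory_order_acquire);
        *packets = atomic_load_explicit(&s->packets, memory_order_relaxed);
        *bytes = atomic_load_explicit(&s->bytes, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((before & 1) || before != after);
}

/* Usage: two updates followed by a read. Returns 1 on the expected totals. */
int demo_stats(void)
{
    struct pcpu_stats s;
    uint64_t packets, bytes;

    atomic_init(&s.seq, 0);
    atomic_init(&s.packets, 0);
    atomic_init(&s.bytes, 0);

    stats_update(&s, 100);
    stats_update(&s, 50);
    stats_read(&s, &packets, &bytes);
    return packets == 2 && bytes == 150;
}
```

Because the reader only retries and never takes anything the writer waits on, the packet path is not slowed down by a concurrent statistics query.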
|
|
The current meaning of the GRE checksum option is to include
checksums on transmit and require packets to have them on receive.
In addition, incoming packets with checksums are always validated
regardless of this option. Requiring checksums on receive creates
surprising behavior and interoperability issues. This disables the
requirement on receive. The new behavior is that the sender decides
whether to checksum packets and the receiver will validate packets
with checksums (similar to UDP).
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
The kernel and user datapaths have code that assumes that 802.1Q headers
are used only inside Ethernet II frames, not inside SNAP-encapsulated
frames. But the kernel and user flow_extract() implementations would
interpret 802.1Q headers inside SNAP headers as being valid VLANs. This
would cause packet corruption if any VLAN-related actions were to be taken,
so change the two flow_extract() implementations only to accept 802.1Q as
an Ethernet II frame type, not as a SNAP-encoded frame type.
802.1Q-2005 says that this is correct anyhow:
Where the ISS instance used to transmit and receive tagged frames is
provided by a media access control method that can support Ethernet
Type encoding directly (e.g., is an IEEE 802.3 or IEEE 802.11 MAC) or
is media access method independent (e.g., 6.6), the TPID is Ethernet
Type encoded, i.e., is two octets in length and comprises solely the
assigned Ethernet Type value.
Where the ISS instance is provided by a media access method that
cannot directly support Ethernet Type encoding (e.g., is an IEEE
802.5 or FDDI MAC), the TPID is encoded according to the rule for
a Subnetwork Access Protocol (Clause 10 of IEEE Std 802) that
encapsulates Ethernet frames over LLC, and comprises the SNAP
header (AA-AA-03) followed by the SNAP PID (00-00-00) followed by
the two octets of the assigned Ethernet Type value.
All of the media that OVS handles support Ethernet Type fields, so to me
that means that we don't have to handle 802.1Q-inside-SNAP.
On the other hand, we *do* have to handle SNAP-inside-802.1Q, because this
is actually allowed by the standards. So this commit also adds that
support.
I verified that, with this change, both SNAP and Ethernet packets are
properly recognized both with and without 802.1Q encapsulation.
I was a bit surprised to find out that Linux does not accept
SNAP-encapsulated IP frames on Ethernet.
Here's a summary of how frames are handled before and after this commit:
Common cases
------------
       Ethernet
    +------------+
 1. |dst|src|TYPE|
    +------------+
       Ethernet       LLC         SNAP
    +------------+ +--------+ +-----------+
 2. |dst|src| len| |aa|aa|03| |000000|TYPE|
    +------------+ +--------+ +-----------+
       Ethernet      802.1Q
    +------------+ +---------+
 3. |dst|src|8100| |VLAN|TYPE|
    +------------+ +---------+
       Ethernet      802.1Q       LLC         SNAP
    +------------+ +---------+ +--------+ +-----------+
 4. |dst|src|8100| |VLAN| LEN| |aa|aa|03| |000000|TYPE|
    +------------+ +---------+ +--------+ +-----------+
Unusual cases
-------------
       Ethernet       LLC         SNAP        802.1Q
    +------------+ +--------+ +-----------+ +---------+
 5. |dst|src| len| |aa|aa|03| |000000|8100| |VLAN|TYPE|
    +------------+ +--------+ +-----------+ +---------+
       Ethernet       LLC
    +------------+ +--------+
 6. |dst|src| len| |xx|xx|xx|
    +------------+ +--------+
       Ethernet       LLC         SNAP
    +------------+ +--------+ +-----------+
 7. |dst|src| len| |aa|aa|03| |xxxxxx|xxxx|
    +------------+ +--------+ +-----------+
       Ethernet      802.1Q       LLC
    +------------+ +---------+ +--------+
 8. |dst|src|8100| |VLAN| LEN| |xx|xx|xx|
    +------------+ +---------+ +--------+
       Ethernet      802.1Q       LLC         SNAP
    +------------+ +---------+ +--------+ +-----------+
 9. |dst|src|8100| |VLAN| LEN| |aa|aa|03| |xxxxxx|xxxx|
    +------------+ +---------+ +--------+ +-----------+
Behavior
--------
    ---------------  ---------------  -------------------------------------
        Before            After
      this commit      this commit
    dl_type dl_vlan  dl_type dl_vlan  Notes
    ------- -------  ------- -------  -------------------------------------
 1.  TYPE    ffff     TYPE    ffff    no change
 2.  TYPE    ffff     TYPE    ffff    no change
 3.  TYPE    VLAN     TYPE    VLAN    no change
 4.  LEN     VLAN     TYPE    VLAN    proposal fixes behavior
 5.  TYPE    VLAN     8100    ffff    802.1Q says this is invalid framing
 6.  05ff    ffff     05ff    ffff    no change
 7.  05ff    ffff     05ff    ffff    no change
 8.  LEN     VLAN     05ff    VLAN    proposal fixes behavior
 9.  LEN     VLAN     05ff    VLAN    proposal fixes behavior
Signed-off-by: Ben Pfaff <blp@nicira.com>
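The "after this commit" column of the table above can be reproduced with a short userspace parser. This is a sketch under assumptions, not the flow_extract() code: `parse_l2()`, `demo_parse_l2()` and the constant names are hypothetical, 05ff is taken from the table as the value for "no Ethernet type", and truncated frames are simply rejected.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN       14
#define ETH_TYPE_VLAN  0x8100
#define ETH_TYPE_MIN   0x0600   /* smaller values are 802.3 lengths */
#define DL_TYPE_NONE   0x05ff   /* "05ff" in the table */
#define DL_VLAN_NONE   0xffff

static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 and fills dl_type/dl_vlan, or -1 if the Ethernet header or
 * an 802.1Q tag is truncated. */
int parse_l2(const uint8_t *frame, size_t len,
             uint16_t *dl_type, uint16_t *dl_vlan)
{
    static const uint8_t snap[6] = { 0xaa, 0xaa, 0x03, 0x00, 0x00, 0x00 };
    size_t off = ETH_HLEN;
    uint16_t type;

    if (len < ETH_HLEN)
        return -1;
    type = get_be16(frame + 12);
    *dl_vlan = DL_VLAN_NONE;

    /* 802.1Q is honoured only as an Ethernet II type (cases 3, 4, 8, 9). */
    if (type == ETH_TYPE_VLAN) {
        if (len < off + 4)
            return -1;
        *dl_vlan = get_be16(frame + off) & 0x0fff;
        type = get_be16(frame + off + 2);
        off += 4;
    }

    if (type >= ETH_TYPE_MIN) {
        *dl_type = type;
        return 0;
    }

    /* 802.3 length field: look for LLC + SNAP with a zero OUI. */
    if (len >= off + 8 && !memcmp(frame + off, snap, sizeof snap))
        *dl_type = get_be16(frame + off + 6);  /* may be 8100: case 5 */
    else
        *dl_type = DL_TYPE_NONE;               /* cases 6 to 9 */
    return 0;
}

/* Usage: cases 4 and 5 from the table. Returns 1 if both match. */
int demo_parse_l2(void)
{
    /* Case 4: Ethernet, 802.1Q (VLAN 10), LLC, SNAP carrying type 0800. */
    const uint8_t case4[] = {
        0, 0, 0, 0, 0, 1,  0, 0, 0, 0, 0, 2,  0x81, 0x00,  0x00, 0x0a,
        0x00, 0x08,  0xaa, 0xaa, 0x03,  0x00, 0x00, 0x00,  0x08, 0x00 };
    /* Case 5: Ethernet, LLC, SNAP carrying 8100, then a VLAN tag. */
    const uint8_t case5[] = {
        0, 0, 0, 0, 0, 1,  0, 0, 0, 0, 0, 2,  0x00, 0x0c,
        0xaa, 0xaa, 0x03,  0x00, 0x00, 0x00,  0x81, 0x00,
        0x00, 0x0a,  0x08, 0x00 };
    uint16_t type, vlan;
    int ok = 1;

    ok &= !parse_l2(case4, sizeof case4, &type, &vlan)
          && type == 0x0800 && vlan == 10;
    ok &= !parse_l2(case5, sizeof case5, &type, &vlan)
          && type == 0x8100 && vlan == DL_VLAN_NONE;
    return ok;
}
```

Checking for the VLAN tag before the length test, and never after the SNAP header, is what yields SNAP-inside-802.1Q support while leaving 802.1Q-inside-SNAP unparsed.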
|
|
In-kernel loops need to be suppressed; otherwise, they cause high CPU
consumption, even to the point that the machine becomes unusable. Ideally
these flows should never be added to the Open vSwitch flow table, but it
is fairly easy for a buggy controller to create them given the menagerie
of tunnels, patches, etc. that OVS makes available.
Commit ecbb6953b "datapath: Add loop checking" did the initial work
toward suppressing loops, by dropping packets that recursed more than 5
times. This at least prevented the kernel stack from overflowing and
thereby OOPSing the machine. But even with this commit, it is still
possible to waste a lot of CPU time due to loops. The problem is not
limited to 5 recursive calls per packet: any packet can be sent to
multiple destinations, which in turn can themselves be sent to multiple
destinations, and so on. We have actually seen in practice a case where
each packet was, apparently, sent to at least 2 destinations per hop, so
that each packet actually consumed CPU time for 2**5 == 32 packets,
possibly more.
This commit takes loop suppression a step further, by clearing the actions
of flows that are implicated in loops. Thus, after the first packet in
such a flow, later packets for either the "root" flow or for flows that
it ends up looping through are simply discarded, saving a huge amount of
CPU time.
This version of the commit just clears the actions from the flows that are
part of the loop. Probably, there should be some additional action to tell
ovs-vswitchd that a loop has been detected, so that it can in turn inform
the controller one way or another.
My test case was this:
ovs-controller -H --max-idle=permanent punix:/tmp/controller
ovs-vsctl -- \
set-controller br0 unix:/tmp/controller -- \
add-port br0 patch00 -- \
add-port br0 patch01 -- \
add-port br0 patch10 -- \
add-port br0 patch11 -- \
add-port br0 patch20 -- \
add-port br0 patch21 -- \
add-port br0 patch30 -- \
add-port br0 patch31 -- \
set Interface patch00 type=patch options:peer=patch01 -- \
set Interface patch01 type=patch options:peer=patch00 -- \
set Interface patch10 type=patch options:peer=patch11 -- \
set Interface patch11 type=patch options:peer=patch10 -- \
set Interface patch20 type=patch options:peer=patch21 -- \
set Interface patch21 type=patch options:peer=patch20 -- \
set Interface patch30 type=patch options:peer=patch31 -- \
set Interface patch31 type=patch options:peer=patch30
followed by sending a single "ping" packet from an attached Ethernet
port into the bridge. After this, without this commit the vswitch
userspace and kernel consume 50-75% of the machine's CPU (in my KVM
test setup on a single physical host); with this commit, some CPU is
consumed initially but it converges on 0% quickly.
A more challenging test sends a series of packets in multiple flows;
I used "hping3" with its default options. Without this commit, the
vswitch consumes 100% of the machine's CPU, most of which is in the
kernel. With this commit, the vswitch consumes "only" 33-50% CPU,
most of which is in userspace, so the machine is more responsive.
A refinement on this commit would be to pass the loop counter down to
userspace as part of the odp_msg struct and then back up as part of
the ODP_EXECUTE command arguments. This would, presumably, reduce
the CPU requirements, since it would allow loop detection to happen
earlier, during initial setup of flows, instead of just on the second
and subsequent packets of flows.
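The suppression idea can be illustrated with a small userspace model. It is a sketch under assumptions rather than the datapath implementation: `struct flow`, `process_packet()` and `demo_patch_pair()` are hypothetical names, each flow forwards to at most one other flow, and the limit of 5 is taken from the description of the earlier loop-checking commit.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOOPS 5

/* Hypothetical flow: forwards to another flow (for example through a
 * patch port), or nowhere when output_to is NULL. */
struct flow {
    struct flow *output_to;
    bool has_actions;
};

static unsigned int loop_count;  /* current recursion depth */
static bool looping;             /* set once the depth limit is exceeded */
static unsigned int executed;    /* how many times actions actually ran */

void process_packet(struct flow *flow)
{
    if (++loop_count > MAX_LOOPS)
        looping = true;

    if (looping) {
        flow->has_actions = false;      /* suppress this flow and drop */
    } else if (flow->has_actions) {
        executed++;
        if (flow->output_to)
            process_packet(flow->output_to);
        if (looping)                    /* a nested call hit the limit */
            flow->has_actions = false;
    }

    if (--loop_count == 0)
        looping = false;                /* back at the top level */
}

/* Usage: two flows that forward to each other, like a patch port pair.
 * Returns how many times actions ran across two packets. */
unsigned int demo_patch_pair(void)
{
    struct flow a = { NULL, true }, b = { &a, true };

    a.output_to = &b;
    executed = 0;
    process_packet(&a);   /* first packet recurses until the limit trips */
    process_packet(&a);   /* second packet finds no actions and is dropped */
    return executed;
}
```

In this model the first packet still costs five action executions, which matches the observation above that some CPU is consumed initially; every later packet costs none.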
|
|
This function is both trivial and on the packet processing fast path, so
expand it inline.
|
|
Originally, the datapath didn't care about IP TOS at all. Then, to support
NetFlow, we made it keep track of the last-seen IP TOS value on a per-flow
basis. Then, to support OpenFlow 1.0, we added a nw_tos field to
odp_flow_key. We don't need both methods, so this commit drops the
NetFlow-specific tracking.
This introduces a small kernel ABI break: upgrading the kernel module
without upgrading the OVS userspace will mean that NetFlow records will
all show an IP TOS value of 0. I don't consider that to be a serious
problem.
|
|
We don't actually use this function anymore so there isn't a
point in having a configure test for it.
|
|
Signed-off-by: Alexey I. Froloff <raorn@altlinux.org>
|
|
The previous commit still had some issues with the
"set_normalized_timespec" symbol being undefined. Here we just replace
it. We can search for a more elegant solution later if necessary.
|
|
The commit "datapath: Don't query time for every packet." (6bfafa55)
introduced the use of "set_normalized_timespec". Unfortunately, older
kernels don't export the symbol. This implements the function on those
older kernels.
|
|
'struct net_device' is refcounted and can stick around for quite a
while if someone is still holding a reference to it. However, we
free the vport that it is attached to in the next RCU grace period
after detach. This assigns the vport to NULL on detach and adds
appropriate checks.
|
|
When we detached a vport we would assign NULL to dp_port->vport
before calling synchronize_rcu(). However, since vports have a
longer lifetime than dp_ports there were no checks before
dereferencing dp_port->vport. This changes the behavior to
match the assumption by not assigning NULL during detach. This
avoids a potential NULL pointer dereference in do_output() among
other places.
|
|
Several blocks of code were either no longer being called or had
been "#if 0"'d out for a long time. This removes them.
|
|
On vport ingress we already check for shared SKBs but then later
warn in several other places. In a similar vein, we check every
packet to see if it is LRO but only certain vports can produce
these packets. Remove and consolidate checks to the places where
they are needed.
|
|
A few functions were missed in the change to move the return type
onto the same line as the arguments.
|
|
Rather than actually query the time every time a packet comes through,
just store the current jiffies and convert it to actual time when
requested. GRE is the primary beneficiary of this because the traffic
travels through the datapath twice. This change reduces CPU utilization
3-4% with GRE.
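The idea of recording only a tick counter on the fast path and converting on demand might look like the following userspace sketch. The names `struct flow_stats`, `flow_used()`, `flow_used_time()` and `demo_flow_time()` are hypothetical, and `HZ` is given an arbitrary value of 250 for the example.

```c
#include <time.h>

#define HZ 250                      /* hypothetical ticks per second */
#define NSEC_PER_SEC 1000000000L

/* Fast path state: just the tick counter at the time of last use. */
struct flow_stats {
    unsigned long used;
};

/* Fast path: no clock query, only a store. */
void flow_used(struct flow_stats *st, unsigned long now_ticks)
{
    st->used = now_ticks;
}

/* Slow path: turn the age in ticks into a wall-clock time by
 * subtracting it from the current time. */
struct timespec flow_used_time(const struct flow_stats *st,
                               unsigned long now_ticks, struct timespec now)
{
    unsigned long age = now_ticks - st->used;  /* unsigned, so wrap-safe */
    long age_nsec = (long)(age % HZ) * (NSEC_PER_SEC / HZ);
    struct timespec ts;

    ts.tv_sec = now.tv_sec - (time_t)(age / HZ);
    ts.tv_nsec = now.tv_nsec - age_nsec;
    if (ts.tv_nsec < 0) {                      /* borrow one second */
        ts.tv_nsec += NSEC_PER_SEC;
        ts.tv_sec--;
    }
    return ts;
}

/* Usage: a flow last used 625 ticks (2.5 s) before time 100.2 s.
 * Returns 1 if the converted time is 97.7 s. */
int demo_flow_time(void)
{
    struct flow_stats st;
    struct timespec now = { .tv_sec = 100, .tv_nsec = 200000000L };
    struct timespec ts;

    flow_used(&st, 1000);
    ts = flow_used_time(&st, 1625, now);
    return ts.tv_sec == 97 && ts.tv_nsec == 700000000L;
}
```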
|
|
Currently the flow key is updated to match an action that is applied
to a packet but these fields are never looked at again. Not only is
this a waste of time, it also makes optimizations involving caching
the flow key more difficult.
|
|
We don't need a function to set a variable. In practice it will
almost certainly get inlined but this makes it easier to read.
|
|
GRE is a somewhat annoying protocol because the header is variable
length. However, it does have a few fields that are always present
so we can make the parsing seem less magical by using a struct for
those fields instead of building it up field by field.
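As an illustration of parsing the fixed part with a struct, the sketch below reads the two fields that every GRE header carries and derives the total header length from the flag bits. The flag values and the 4-byte optional fields follow RFC 2784 and RFC 2890; the names `struct gre_base_hdr`, `gre_parse_base()`, `gre_hdr_len()` and `demo_gre()` are chosen for this example and are not taken from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits in the first 16 bits of the header, in host byte order. */
#define GRE_CSUM 0x8000   /* checksum + reserved field present */
#define GRE_KEY  0x2000   /* key field present */
#define GRE_SEQ  0x1000   /* sequence number field present */

/* The part of the GRE header that is always present. */
struct gre_base_hdr {
    uint16_t flags;      /* converted to host order by gre_parse_base() */
    uint16_t protocol;   /* EtherType of the payload, host order */
};

static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 on success, -1 if fewer than 4 bytes are available. */
int gre_parse_base(const uint8_t *p, size_t len, struct gre_base_hdr *h)
{
    if (len < 4)
        return -1;
    h->flags = be16(p);
    h->protocol = be16(p + 2);
    return 0;
}

/* Total header length: 4 fixed bytes plus 4 per optional field. */
size_t gre_hdr_len(const struct gre_base_hdr *h)
{
    size_t n = 4;

    if (h->flags & GRE_CSUM)
        n += 4;
    if (h->flags & GRE_KEY)
        n += 4;
    if (h->flags & GRE_SEQ)
        n += 4;
    return n;
}

/* Usage: checksum and key present, payload type 0x6558 (Transparent
 * Ethernet Bridging). Returns 1 if the header parses as 12 bytes. */
int demo_gre(void)
{
    const uint8_t wire[4] = { 0xa0, 0x00, 0x65, 0x58 };
    struct gre_base_hdr h;

    return gre_parse_base(wire, sizeof wire, &h) == 0
           && h.protocol == 0x6558 && gre_hdr_len(&h) == 12;
}
```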
|
|
We currently remove ports from the GRE hash table and then immediately
free the ports. Since received packets could be using that port this
can lead to a crash (the port has already been detached from the
datapath so this can't happen for transmitted packets). As a result
we need to wait for an RCU grace period to elapse before actually
freeing the port.
In an ideal world we would actually remove the port from the hash
table in a hypothetical gre_detach() function since this is one of
the purposes of detaching. However, we also use the hash table to
look for collisions in the lookup criteria and don't want to allow
two identical ports to exist. It doesn't matter though because we
aren't blocking on the freeing of resources.
|
|
DEFINE_PER_CPU is simpler and faster than alloc_percpu() so use it
for the loop counter, which is already statically defined.
|
|
In some places we would put the return type on the same line as
the rest of the function definition and other places we wouldn't.
Reformat everything to match kernel style.
|
|
We currently use EEXIST both for a device that is already attached
and for a GRE device that is the same as another one.
Instead use EBUSY for already attached devices to disambiguate the
two situations.
|
|
Checksum offsets should never be negative, so make that explicit by
using unsigned ints. This helps bug checks that test whether the
offsets are greater than their upper limits.
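A small example of why the unsigned type helps an upper-limit check: a negative value stored in a signed int passes a "not greater than the limit" test, while the same bit pattern seen as unsigned is very large and fails it. The function names below are hypothetical, and the 2-byte checksum field size is an assumption made for the sketch.

```c
#include <stdbool.h>

/* Signed version: a negative start is "not greater than len" and so
 * slips past the upper-limit test. */
bool csum_start_ok_signed(int csum_start, int len)
{
    return csum_start <= len;
}

/* Unsigned version: a single upper-limit test per offset is enough,
 * because there is no negative range to miss. */
bool csum_offsets_ok(unsigned int csum_start, unsigned int csum_offset,
                     unsigned int len)
{
    /* Need room for a 2-byte checksum field after csum_start. */
    if (csum_start > len || len - csum_start < 2)
        return false;
    return csum_offset <= len - csum_start - 2;
}

/* Usage: returns 1 if the checks behave as described. */
int demo_csum_offsets(void)
{
    bool signed_misses = csum_start_ok_signed(-4, 60);          /* true */
    bool valid = csum_offsets_ok(34, 16, 60);                   /* true */
    bool caught = !csum_offsets_ok((unsigned int)-4, 16, 60);   /* true */

    return signed_misses && valid && caught;
}
```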
|