aboutsummaryrefslogtreecommitdiff
path: root/net/core/dev.c
AgeCommit message (Collapse)Author
2014-03-28net: Account for all vlan headers in skb_mac_gso_segmentVlad Yasevich
skb_network_protocol() already accounts for multiple vlan headers that may be present in the skb. However, skb_mac_gso_segment() doesn't know anything about it and assumes that skb->mac_len is set correctly to skip all mac headers. That may not always be the case. If we are simply forwarding the packet (via bridge or macvtap), all vlan headers may not be accounted for. A simple solution is to allow skb_network_protocol to return the vlan depth it has calculated. This way skb_mac_gso_segment will correctly skip all mac headers. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13net: core: introduce netif_skb_dev_featuresFlorian Westphal
Will be used by upcoming ipv4 forward path change that needs to determine feature mask using skb->dst->dev instead of skb->dev. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09net: Mark functions as static in core/dev.cRashika Kheria
Mark functions as static in core/dev.c because they are not used outside this file. This eliminates the following warning in core/dev.c: net/core/dev.c:2806:5: warning: no previous prototype for ‘__dev_queue_xmit’ [-Wmissing-prototypes] net/core/dev.c:4640:5: warning: no previous prototype for ‘netdev_adjacent_sysfs_add’ [-Wmissing-prototypes] net/core/dev.c:4650:6: warning: no previous prototype for ‘netdev_adjacent_sysfs_del’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) BPF debugger and asm tool by Daniel Borkmann. 2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann. 3) Correct reciprocal_divide and update users, from Hannes Frederic Sowa and Daniel Borkmann. 4) Currently we only have a "set" operation for the hw timestamp socket ioctl, add a "get" operation to match. From Ben Hutchings. 5) Add better trace events for debugging driver datapath problems, also from Ben Hutchings. 6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data. 7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko. 8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel Borkmann. 9) Share IP header compression code between Bluetooth and IEEE802154 layers, from Jukka Rissanen. 10) Fix ipv6 router reachability probing, from Jiri Benc. 11) Allow packets to be captured on macvtap devices, from Vlad Yasevich. 12) Support tunneling in GRO layer, from Jerry Chu. 13) Allow bonding to be configured fully using netlink, from Scott Feldman. 14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can already get the TCI. From Atzm Watanabe. 15) New "Heavy Hitter" qdisc, from Terry Lam. 16) Significantly improve the IPSEC support in pktgen, from Fan Du. 17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom Herbert. 18) Add Proportional Integral Enhanced packet scheduler, from Vijay Subramanian. 19) Allow openvswitch to mmap'd netlink, from Thomas Graf. 20) Key TCP metrics blobs also by source address, not just destination address. From Christoph Paasch. 21) Support 10G in generic phylib. From Andy Fleming. 22) Try to short-circuit GRO flow compares using device provided RX hash, if provided. From Tom Herbert. The wireless and netfilter folks have been busy little bees too. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits) net/cxgb4: Fix referencing freed adapter ipv6: reallocate addrconf router for ipv6 address when lo device up fib_frontend: fix possible NULL pointer dereference rtnetlink: remove IFLA_BOND_SLAVE definition rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info qlcnic: update version to 5.3.55 qlcnic: Enhance logic to calculate msix vectors. qlcnic: Refactor interrupt coalescing code for all adapters. qlcnic: Update poll controller code path qlcnic: Interrupt code cleanup qlcnic: Enhance Tx timeout debugging. qlcnic: Use bool for rx_mac_learn. bonding: fix u64 division rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC sfc: Use the correct maximum TX DMA ring size for SFC9100 Add Shradha Shah as the sfc driver maintainer. net/vxlan: Share RX skb de-marking and checksum checks with ovs tulip: cleanup by using ARRAY_SIZE() ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called net/cxgb4: Don't retrieve stats during recovery ...
2014-01-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual rocket science stuff from trivial.git" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) neighbour.h: fix comment sched: Fix warning on make htmldocs caused by wait.h slab: struct kmem_cache is protected by slab_mutex doc: Fix typo in USB Gadget Documentation of/Kconfig: Spelling s/one/once/ mkregtable: Fix sscanf handling lp5523, lp8501: comment improvements thermal: rcar: comment spelling treewide: fix comments and printk msgs IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart() Documentation: update /proc/uptime field description Documentation: Fix size parameter for snprintf arm: fix comment header and macro name asm-generic: uaccess: Spelling s/a ny/any/ mtd: onenand: fix comment header doc: driver-model/platform.txt: fix a typo drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text doc: Fix typo (acces_process_vm -> access_process_vm) treewide: Fix typos in printk drivers/gpu/drm/qxl/Kconfig: reformat the help text ...
2014-01-21net: Export gro_find_by_type helpersOr Gerlitz
Export the gro_find_receive/complete_by_type helpers to they can be invoked by the gro callbacks of encapsulation protocols such as vxlan. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21net: Add GRO support for UDP encapsulating protocolsOr Gerlitz
Add GRO handlers for protocols that do UDP encapsulation, with the intent of being able to coalesce packets which encapsulate packets belonging to the same TCP session. For GRO purposes, the destination UDP port takes the role of the ether type field in the ethernet header or the next protocol in the IP header. The UDP GRO handler will only attempt to coalesce packets whose destination port is registered to have gro handler. Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive code twice on a packet. This solves the problem of udp encapsulated packets whose inner VM packet is udp and happen to carry a port which has registered offloads. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21net: remove unnecessary initializations in net_dev_initSabrina Dubroca
softnet_data is already set to 0, no need to use memset or initialize specific fields to 0 or NULL afterwards. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19net: document accel_priv parameter for __dev_queue_xmit()Jason Wang
To silent "make htmldocs" warning. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16net-sysfs: add support for device-specific rx queue sysfs attributesMichael Dalton
Extend existing support for netdevice receive queue sysfs attributes to permit a device-specific attribute group. Initial use case for this support will be to allow the virtio-net device to export per-receive queue mergeable receive buffer size. Signed-off-by: Michael Dalton <mwdalton@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16net: add NETDEV_PRECHANGEMTU to notify before mtu change happensVeaceslav Falico
Currently, if a device changes its mtu, first the change happens (invloving all the side effects), and after that the NETDEV_CHANGEMTU is sent so that other devices can catch up with the new mtu. However, if they return NOTIFY_BAD, then the change is reverted and error returned. This is a really long and costy operation (sometimes). To fix this, add NETDEV_PRECHANGEMTU notification which is called prior to any change actually happening, and if any callee returns NOTIFY_BAD - the change is aborted. This way we're skipping all the playing with apply/revert the mtu. CC: "David S. Miller" <davem@davemloft.net> CC: Jiri Pirko <jiri@resnulli.us> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16net: Check skb->rxhash in gro_receiveTom Herbert
When initializing a gro_list for a packet, first check the rxhash of the incoming skb against that of the skb's in the list. This should be a very strong inidicator of whether the flow is going to be matched, and potentially allows a lot of other checks to be short circuited. Use skb_hash_raw so that we don't force the hash to be calculated. Tested by running netperf 200 TCP_STREAMs between two machines with GRO, HW rxhash, and 1G. Saw no performance degration, slight reduction of time in dev_gro_receive. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-15net: rename sysfs symlinks on device name changeVeaceslav Falico
Currently, we don't rename the upper/lower_ifc symlinks in /sys/class/net/*/ , which might result stale/duplicate links/names. Fix this by adding netdev_adjacent_rename_links(dev, oldname) which renames all the upper/lower interface's links to dev from the upper/lower_oldname to the new name. We don't need a rollback because only we control these symlinks and if we fail to rename them - sysfs will anyway complain. Reported-by: Ding Tianhong <dingtianhong@huawei.com> CC: Ding Tianhong <dingtianhong@huawei.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-15net: add sysfs helpers for netdev_adjacent logicVeaceslav Falico
They clean up the code a bit and can be used further. CC: Ding Tianhong <dingtianhong@huawei.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14net: Add trace events for all receive entry points, exposing more skb fieldsBen Hutchings
The existing net/netif_rx and net/netif_receive_skb trace events provide little information about the skb, nor do they indicate how it entered the stack. Add trace events at entry of each of the exported functions, including most fields that are likely to be interesting for debugging driver datapath behaviour. Split netif_rx() and netif_receive_skb() so that internal calls are not traced. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14net: Add net_dev_start_xmit trace event, exposing more skb fieldsBen Hutchings
The existing net/net_dev_xmit trace event provides little information about the skb that has been passed to the driver, and it is not simple to add more since the skb may already have been freed at the point the event is emitted. Add a separate trace event before the skb is passed to the driver, including most fields that are likely to be interesting for debugging driver datapath behaviour. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14net: Fix indentation in dev_hard_start_xmit()Ben Hutchings
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2014-01-13net: make dev_set_mtu() honor notification return codeVeaceslav Falico
Currently, after changing the MTU for a device, dev_set_mtu() calls NETDEV_CHANGEMTU notification, however doesn't verify it's return code - which can be NOTIFY_BAD - i.e. some of the net notifier blocks refused this change, and continues nevertheless. To fix this, verify the return code, and if it's an error - then revert the MTU to the original one, notify again and pass the error code. CC: Jiri Pirko <jiri@resnulli.us> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13net: gro: change GRO overflow strategyEric Dumazet
GRO layer has a limit of 8 flows being held in GRO list, for performance reason. When a packet comes for a flow not yet in the list, and list is full, we immediately give it to upper stacks, lowering aggregation performance. With TSO auto sizing and FQ packet scheduler, this situation happens more often. This patch changes strategy to simply evict the oldest flow of the list. This works better because of the nature of packet trains for which GRO is efficient. This also has the effect of lowering the GRO latency if many flows are competing. Tested : Used a 40Gbps NIC, with 4 RX queues, and 200 concurrent TCP_STREAM netperf. Before patch, aggregate rate is 11Gbps (while a single flow can reach 30Gbps) After patch, line rate is reached. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10net: core: explicitly select a txq before doing l2 forwardingJason Wang
Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The will cause several issues: - NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan instead of lower device which misses the necessary txq synchronization for lower device such as txq stopping or frozen required by dev watchdog or control path. - dev_hard_start_xmit() was called with NULL txq which bypasses the net device watchdog. - dev_hard_start_xmit() does not check txq everywhere which will lead a crash when tso is disabled for lower device. Fix this by explicitly introducing a new param for .ndo_select_queue() for just selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also extended to accept this parameter and dev_queue_xmit_accel() was used to do l2 forwarding transmission. With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of dev_queue_xmit() to do the transmission. In the future, it was also required for macvtap l2 forwarding support since it provides a necessary synchronization method. Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: e1000-devel@lists.sourceforge.net Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net-gre-gro: Add GRE support to the GRO stackJerry Chu
This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO stack. It also serves as an example for supporting other encapsulation protocols in the GRO stack in the future. The patch supports version 0 and all the flags (key, csum, seq#) but will flush any pkt with the S (seq#) flag. This is because the S flag is not support by GSO, and a GRO pkt may end up in the forwarding path, thus requiring GSO support to break it up correctly. Currently the "packet_offload" structure only contains L3 (ETH_P_IP/ ETH_P_IPV6) GRO offload support so the encapped pkts are limited to IP pkts (i.e., w/o L2 hdr). But support for other protocol type can be easily added, so is the support for GRE variations like NVGRE. The patch also support csum offload. Specifically if the csum flag is on and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE), the code will take advantage of the csum computed by the h/w when validating the GRE csum. Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre: add GRO capability" already introduces GRO capability to IPv4 GRE tunnels, using the gro_cells infrastructure. But GRO is done after GRE hdr has been removed (i.e., decapped). The following patch applies GRO when pkts first come in (before hitting the GRE tunnel code). There is some performance advantage for applying GRO as early as possible. Also this approach is transparent to other subsystem like Open vSwitch where GRE decap is handled outside of the IP stack hence making it harder for the gro_cells stuff to apply. On the other hand, some NICs are still not capable of hashing on the inner hdr of a GRE pkt (RSS). In that case the GRO processing of pkts from the same remote host will all happen on the same CPU and the performance may be suboptimal. I'm including some rough preliminary performance numbers below. Note that the performance will be highly dependent on traffic load, mix as usual. Moreover it also depends on NIC offload features hence the following is by no means a comprehesive study. Local testing and tuning will be needed to decide the best setting. All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs. (super_netperf 50 -H 192.168.1.18 -l 30) An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1 mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123) is configured. The GRO support for pkts AFTER decap are controlled through the device feature of the GRE device (e.g., ethtool -K gre1 gro on/off). 1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off thruput: 9.16Gbps CPU utilization: 19% 1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off thruput: 5.9Gbps CPU utilization: 15% 1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 12-13% 1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 10% The following tests were performed on a different NIC that is capable of csum offload. I.e., the h/w is capable of computing IP payload csum (CHECKSUM_COMPLETE). 2.1 ethtool -K gre1 gro on (hence will use gro_cells) 2.1.1 ethtool -K eth0 gro off; csum offload disabled thruput: 8.53Gbps CPU utilization: 9% 2.1.2 ethtool -K eth0 gro off; csum offload enabled thruput: 8.97Gbps CPU utilization: 7-8% 2.1.3 ethtool -K eth0 gro on; csum offload disabled thruput: 8.83Gbps CPU utilization: 5-6% 2.1.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.98Gbps CPU utilization: 5% 2.2 ethtool -K gre1 gro off 2.2.1 ethtool -K eth0 gro off; csum offload disabled thruput: 5.93Gbps CPU utilization: 9% 2.2.2 ethtool -K eth0 gro off; csum offload enabled thruput: 5.62Gbps CPU utilization: 8% 2.2.3 ethtool -K eth0 gro on; csum offload disabled thruput: 7.69Gbps CPU utilization: 8% 2.2.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.96Gbps CPU utilization: 5-6% Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net: Do not enable tx-nocache-copy by defaultBenjamin Poirier
There are many cases where this feature does not improve performance or even reduces it. For example, here are the results from tests that I've run using 3.12.6 on one Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are from the Xeon, but they're similar on the i7. All numbers report the mean±stddev over 10 runs of 10s. 1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache copy from user on transmit" There is no statistically significant difference between tx-nocache-copy on/off. nic irqs spread out (one queue per cpu) 200x netperf -r 1400,1 tx-nocache-copy off 692000±1000 tps 50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3 tx-nocache-copy on 693000±1000 tps 50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7 200x netperf -r 14000,14000 tx-nocache-copy off 86450±80 tps 50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40 tx-nocache-copy on 86110±60 tps 50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20 2) single stream throughput tests tx-nocache-copy leads to higher service demand throughput cpu0 cpu1 demand (Gb/s) (Gcycle) (Gcycle) (cycle/B) nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send) tx-nocache-copy off 9402±5 9.4±0.2 0.80±0.01 tx-nocache-copy on 9403±3 9.85±0.04 0.838±0.004 nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send) tx-nocache-copy off 9401±5 5.83±0.03 5.0±0.1 0.923±0.007 tx-nocache-copy on 9404±2 5.74±0.03 5.523±0.009 0.958±0.002 As a second example, here are some results from Eric Dumazet with latest net-next. tx-nocache-copy also leads to higher service demand (cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz) lpq83:~# ./ethtool -K eth0 tx-nocache-copy on lpq83:~# perf stat ./netperf -H lpq84 -c MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB 87380 16384 16384 10.00 9407.44 2.50 -1.00 0.522 -1.000 Performance counter stats for './netperf -H lpq84 -c': 4282.648396 task-clock # 0.423 CPUs utilized 9,348 context-switches # 0.002 M/sec 88 CPU-migrations # 0.021 K/sec 355 page-faults # 0.083 K/sec 11,812,797,651 cycles # 2.758 GHz [82.79%] 9,020,522,817 stalled-cycles-frontend # 76.36% frontend cycles idle [82.54%] 4,579,889,681 stalled-cycles-backend # 38.77% backend cycles idle [67.33%] 6,053,172,792 instructions # 0.51 insns per cycle # 1.49 stalled cycles per insn [83.64%] 597,275,583 branches # 139.464 M/sec [83.70%] 8,960,541 branch-misses # 1.50% of all branches [83.65%] 10.128990264 seconds time elapsed lpq83:~# ./ethtool -K eth0 tx-nocache-copy off lpq83:~# perf stat ./netperf -H lpq84 -c MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB 87380 16384 16384 10.00 9412.45 2.15 -1.00 0.449 -1.000 Performance counter stats for './netperf -H lpq84 -c': 2847.375441 task-clock # 0.281 CPUs utilized 11,632 context-switches # 0.004 M/sec 49 CPU-migrations # 0.017 K/sec 354 page-faults # 0.124 K/sec 7,646,889,749 cycles # 2.686 GHz [83.34%] 6,115,050,032 stalled-cycles-frontend # 79.97% frontend cycles idle [83.31%] 1,726,460,071 stalled-cycles-backend # 22.58% backend cycles idle [66.55%] 2,079,702,453 instructions # 0.27 insns per cycle # 2.94 stalled cycles per insn [83.22%] 363,773,213 branches # 127.757 M/sec [83.29%] 4,242,732 branch-misses # 1.17% of all branches [83.51%] 10.128449949 seconds time elapsed CC: Tom Herbert <therbert@google.com> Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c net/ipv6/ip6_tunnel.c net/ipv6/ip6_vti.c ipv6 tunnel statistic bug fixes conflicting with consolidation into generic sw per-cpu net stats. qlogic conflict between queue counting bug fix and the addition of multiple MAC address support. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: * Add full port randomization support. Some crazy researchers found a way to reconstruct the secure ephemeral ports that are allocated in random mode by sending off-path bursts of UDP packets to overrun the socket buffer of the DNS resolver to trigger retransmissions, then if the timing for the DNS resolution done by a client is larger than usual, then they conclude that the port that received the burst of UDP packets is the one that was opened. It seems a bit aggressive method to me but it seems to work for them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a new NAT mode to fully randomize ports using prandom. * Add a new classifier to x_tables based on the socket net_cls set via cgroups. These includes two patches to prepare the field as requested by Zefan Li. Also from Daniel Borkmann. * Use prandom instead of get_random_bytes in several locations of the netfilter code, from Florian Westphal. * Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack mark, also from Florian Westphal. * Fix compilation warning due to unused variable in IPVS, from Geert Uytterhoeven. * Add support for UID/GID via nfnetlink_queue, from Valentina Giusti. * Add IPComp extension to x_tables, from Fan Du. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03net: netprio: rename config to be more consistent with cgroup configsDaniel Borkmann
While we're at it and introduced CGROUP_NET_CLASSID, lets also make NETPRIO_CGROUP more consistent with the rest of cgroups and rename it into CONFIG_CGROUP_NET_PRIO so that for networking, we now have CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG option consistent among networking cgroups, but also among cgroups CONFIG conventions in general as the vast majority has a prefix of CONFIG_CGROUP_<SUBSYS>. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Zefan Li <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-01net: core functions cleanupstephen hemminger
The following functions are not used outside of net/core/dev.c and should be declared static. call_netdevice_notifiers_info __dev_remove_offload netdev_has_any_upper_dev __netdev_adjacent_dev_remove __netdev_adjacent_dev_link_lists __netdev_adjacent_dev_unlink_lists __netdev_adjacent_dev_unlink __netdev_adjacent_dev_link_neighbour __netdev_adjacent_dev_unlink_neighbour And the following are never used and should be deleted netdev_lower_dev_get_private_rcu __netdev_find_adj_rcu Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31net, rps: fix the comment of net_rps_action_and_irq_enable()Zhi Yong Wu
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-21net: use kfree_skb_list() helperEric Dumazet
We can use kfree_skb_list() instead of open coding it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19Merge branch 'master' into for-nextJiri Kosina
Sync with Linus' tree to be able to apply fixes on top of newer things in tree (efi-stub). Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-12-17net: allow netdev_all_upper_get_next_dev_rcu with rtnl lock heldJohn Fastabend
It is useful to be able to walk all upper devices when bringing a device online where the RTNL lock is held. In this case it is safe to walk the all_adj_list because the RTNL lock is used to protect the write side as well. This patch adds a check to see if the rtnl lock is held before throwing a warning in netdev_all_upper_get_next_dev_rcu(). Also because we now have a call site for lockdep_rtnl_is_held() outside COFIG_LOCK_PROVING an inline definition returning 1 is needed. Similar to the rcu_read_lock_is_held(). Fixes: 2a47fa45d4df ("ixgbe: enable l2 forwarding acceleration for macvlans") CC: Veaceslav Falico <vfalico@redhat.com> Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-12-17net: Change skb_get_rxhash to skb_get_hashTom Herbert
Changing name of function as part of making the hash in skbuff to be generic property, not just for receive path. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-14bonding: create bond_first_slave_rcu()dingtianhong
The bond_first_slave_rcu() will be used to instead of bond_first_slave() in rcu_read_lock(). According to the Jay Vosburgh's suggestion, the struct netdev_adjacent should hide from users who wanted to use it directly. so I package a new function to get the first slave of the bond. Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com> Suggested-by: Jay Vosburgh <fubar@us.ibm.com> Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12net-gro: Prepare GRO stack for the upcoming tunneling supportJerry Chu
This patch modifies the GRO stack to avoid the use of "network_header" and associated macros like ip_hdr() and ipv6_hdr() in order to allow an arbitary number of IP hdrs (v4 or v6) to be used in the encapsulation chain. This lays the foundation for various IP tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later. With this patch, the GRO stack traversing now is mostly based on skb_gro_offset rather than special hdr offsets saved in skb (e.g., skb->network_header). As a result all but the top layer (i.e., the the transport layer) must have hdrs of the same length in order for a pkt to be considered for aggregation. Therefore when adding a new encap layer (e.g., for tunneling), one must check and skip flows (e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a different hdr length. Note that unlike the network header, the transport header can and will continue to be set by the GRO code since there will be at most one "transport layer" in the encap chain. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09net: dev: move inline skb_needs_linearize helper to headerDaniel Borkmann
As we need it elsewhere, move the inline helper function of skb_needs_linearize() over to skbuff.h include file. While at it, also convert the return to 'bool' instead of 'int' and add a proper kernel doc. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06net: introduce dev_consume_skb_any()Eric Dumazet
Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq() helpers to free skbs, both for dropped packets and TX completed ones. We need to separate the two causes to get better diagnostics given by dropwatch or "perf record -e skb:kfree_skb" This patch provides two new helpers, dev_consume_skb_any() and dev_consume_skb_irq() to be used for consumed skbs. __dev_kfree_skb_irq() is slightly optimized to remove one atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06gro: small napi_get_frags() optimEric Dumazet
Remove one useless conditional branch : napi->skb is NULL, so nothing bad can happen. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-02Fix comment typo for alloc_netdev_mqs()Li Zhong
it seems subquue should be subqueue here Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-11-20net: core: Always propagate flag changes to interfacesVlad Yasevich
The following commit: b6c40d68ff6498b7f63ddf97cf0aa818d748dee7 net: only invoke dev->change_rx_flags when device is UP tried to fix a problem with VLAN devices and promiscuouse flag setting. The issue was that VLAN device was setting a flag on an interface that was down, thus resulting in bad promiscuity count. This commit blocked flag propagation to any device that is currently down. A later commit: deede2fabe24e00bd7e246eb81cd5767dc6fcfc7 vlan: Don't propagate flag changes on down interfaces fixed VLAN code to only propagate flags when the VLAN interface is up, thus fixing the same issue as above, only localized to VLAN. The problem we have now is that if we have create a complex stack involving multiple software devices like bridges, bonds, and vlans, then it is possible that the flags would not propagate properly to the physical devices. A simple examle of the scenario is the following: eth0----> bond0 ----> bridge0 ---> vlan50 If bond0 or eth0 happen to be down at the time bond0 is added to the bridge, then eth0 will never have promisc mode set which is currently required for operation as part of the bridge. As a result, packets with vlan50 will be dropped by the interface. The only 2 devices that implement the special flag handling are VLAN and DSA and they both have required code to prevent incorrect flag propagation. As a result we can remove the generic solution introduced in b6c40d68ff6498b7f63ddf97cf0aa818d748dee7 and leave it to the individual devices to decide whether they will block flag propagation or not. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-15macvlan: disable LRO on lower device instead of macvlanMichal Kubeček
A macvlan device has always LRO disabled so that calling dev_disable_lro() on it does nothing. If we need to disable LRO e.g. because - the macvlan device is inserted into a bridge - IPv6 forwarding is enabled for it - it is in a different namespace than lowerdev and IPv4 forwarding is enabled in it we need to disable LRO on its underlying device instead (as we do for 802.1q VLAN devices). v2: use newly introduced netif_is_macvlan() Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14core/dev: do not ignore dmac in dev_forward_skb()Alexei Starovoitov
commit 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") and refactoring 64261f230a91 ("dev: move skb_scrub_packet() after eth_type_trans()") are forcing pkt_type to be PACKET_HOST when skb traverses veth. which means that ip forwarding will kick in inside netns even if skb->eth->h_dest != dev->dev_addr Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb() and in ip_tunnel_rcv() Fixes: 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") CC: Isaku Yamahata <yamahatanetdev@gmail.com> CC: Maciej Zenczykowski <zenczykowski@gmail.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07net: Add layer 2 hardware acceleration operations for macvlan devicesJohn Fastabend
Add a operations structure that allows a network interface to export the fact that it supports package forwarding in hardware between physical interfaces and other mac layer devices assigned to it (such as macvlans). This operaions structure can be used by virtual mac devices to bypass software switching so that forwarding can be done in hardware more efficiently. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-03net: extend net_device allocation to vmalloc()Eric Dumazet
Joby Poriyath provided a xen-netback patch to reduce the size of xenvif structure as some netdev allocation could fail under memory pressure/fragmentation. This patch is handling the problem at the core level, allowing any netdev structures to use vmalloc() if kmalloc() failed. As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Joby Poriyath <joby.poriyath@citrix.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25net: fix rtnl notification in atomic contextAlexei Starovoitov
commit 991fb3f74c "dev: always advertise rx_flags changes via netlink" introduced rtnl notification from __dev_set_promiscuity(), which can be called in atomic context. Steps to reproduce: ip tuntap add dev tap1 mode tap ifconfig tap1 up tcpdump -nei tap1 & ip tuntap del dev tap1 mode tap [ 271.627994] device tap1 left promiscuous mode [ 271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940 [ 271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip [ 271.677525] INFO: lockdep is turned off. [ 271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G W 3.12.0-rc3+ #73 [ 271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012 [ 271.731254] ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428 [ 271.760261] ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8 [ 271.790683] 0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8 [ 271.822332] Call Trace: [ 271.838234] [<ffffffff817544e5>] dump_stack+0x55/0x76 [ 271.854446] [<ffffffff8108bad1>] __might_sleep+0x181/0x240 [ 271.870836] [<ffffffff81110ff8>] ? rcu_irq_exit+0x68/0xb0 [ 271.887076] [<ffffffff811a80be>] kmem_cache_alloc_node+0x4e/0x2a0 [ 271.903368] [<ffffffff810b4ddc>] ? vprintk_emit+0x1dc/0x5a0 [ 271.919716] [<ffffffff81614d67>] ? __alloc_skb+0x57/0x2a0 [ 271.936088] [<ffffffff810b4de0>] ? vprintk_emit+0x1e0/0x5a0 [ 271.952504] [<ffffffff81614d67>] __alloc_skb+0x57/0x2a0 [ 271.968902] [<ffffffff8163a0b2>] rtmsg_ifinfo+0x52/0x100 [ 271.985302] [<ffffffff8162ac6d>] __dev_notify_flags+0xad/0xc0 [ 272.001642] [<ffffffff8162ad0c>] __dev_set_promiscuity+0x8c/0x1c0 [ 272.017917] [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380 [ 272.033961] [<ffffffff8162b109>] dev_set_promiscuity+0x29/0x50 [ 272.049855] [<ffffffff8172e937>] packet_dev_mc+0x87/0xc0 [ 272.065494] [<ffffffff81732052>] packet_notifier+0x1b2/0x380 [ 272.080915] [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380 [ 272.096009] [<ffffffff81761c66>] notifier_call_chain+0x66/0x150 [ 272.110803] [<ffffffff8108503e>] __raw_notifier_call_chain+0xe/0x10 [ 272.125468] [<ffffffff81085056>] raw_notifier_call_chain+0x16/0x20 [ 272.139984] [<ffffffff81620190>] call_netdevice_notifiers_info+0x40/0x70 [ 272.154523] [<ffffffff816201d6>] call_netdevice_notifiers+0x16/0x20 [ 272.168552] [<ffffffff816224c5>] rollback_registered_many+0x145/0x240 [ 272.182263] [<ffffffff81622641>] rollback_registered+0x31/0x40 [ 272.195369] [<ffffffff816229c8>] unregister_netdevice_queue+0x58/0x90 [ 272.208230] [<ffffffff81547ca0>] __tun_detach+0x140/0x340 [ 272.220686] [<ffffffff81547ed6>] tun_chr_close+0x36/0x60 Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25net: add missing dev_put() in __netdev_adjacent_dev_insertNikolay Aleksandrov
I think that a dev_put() is needed in the error path to preserve the proper dev refcount. CC: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-19ipv4: gso: make inet_gso_segment() stackableEric Dumazet
In order to support GSO on IPIP, we need to make inet_gso_segment() stackable. It should not assume network header starts right after mac header. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: include/linux/netdevice.h net/core/sock.c Trivial merge issues. Removal of "extern" for functions declaration in netdevice.h at the same time "const" was added to an argument. Two parallel line additions in net/core/sock.c Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07net: Separate the close_list and the unreg_list v2Eric W. Biederman
Separate the unreg_list and the close_list in dev_close_many preventing dev_close_many from permuting the unreg_list. The permutations of the unreg_list have resulted in cases where the loopback device is accessed it has been freed in code such as dst_ifdown. Resulting in subtle memory corruption. This is the second bug from sharing the storage between the close_list and the unreg_list. The issues that crop up with sharing are apparently too subtle to show up in normal testing or usage, so let's forget about being clever and use two separate lists. v2: Make all callers pass in a close_list to dev_close_many Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07netif_set_xps_queue: make cpu mask constMichael S. Tsirkin
virtio wants to pass in cpumask_of(cpu), make parameter const to avoid build warnings. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/usb/qmi_wwan.c drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h include/net/netfilter/nf_conntrack_synproxy.h include/net/secure_seq.h The conflicts are of two varieties: 1) Conflicts with Joe Perches's 'extern' removal from header file function declarations. Usually it's an argument signature change or a function being added/removed. The resolutions are trivial. 2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds a new value, another changes an existing value. That sort of thing. Signed-off-by: David S. Miller <davem@davemloft.net>