path: root/include/net
Age  Commit message  Author
2015-07-10  sctp: fix ASCONF list handling  Marcelo Ricardo Leitner
[ Upstream commit 2d45a02d0166caf2627fe91897c6ffc3b19514c4 ] ->auto_asconf_splist is per namespace and mangled by functions like sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization. Also, the call to inet_sk_copy_descendant() was backing up ->auto_asconf_list through the copy but was not honoring ->do_auto_asconf, which could lead to list corruption if it was different between both sockets. This commit thus fixes the list handling by using ->addr_wq_lock spinlock to protect the list. A special handling is done upon socket creation and destruction for that. Error handling on sctp_init_sock() will never return an error after having initialized asconf, so sctp_destroy_sock() can be called without addr_wq_lock. The lock now will be taken on sctp_close_sock(), before locking the socket, so we don't do it in inverse order compared to sctp_addr_wq_timeout_handler(). Instead of taking the lock on sctp_sock_migrate() for copying and restoring the list values, it's preferred to avoid rewriting it by implementing sctp_copy_descendant(). Issue was found with a test application that kept flipping sysctl default_auto_asconf on and off, but one could trigger it by issuing simultaneous setsockopt() calls on multiple sockets or by creating/destroying sockets fast enough. This is only triggerable locally. Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).") Reported-by: Ji Jianwen <jiji@redhat.com> Suggested-by: Neil Horman <nhorman@tuxdriver.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-31  tcp: fix child sockets to use system default congestion control if not set  Neal Cardwell
Linux 3.17 and earlier are explicitly engineered so that if the app doesn't specifically request a CC module on a listener before the SYN arrives, then the child gets the system default CC when the connection is established. See tcp_init_congestion_control() in 3.17 or earlier, which says "if no choice made yet assign the current value set as default". The change ("net: tcp: assign tcp cong_ops when tcp sk is created") altered these semantics, so that children got their parent listener's congestion control even if the system default had changed after the listener was created. This commit returns to those original semantics from 3.17 and earlier, since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]: Congestion control initialization."), and some Linux congestion control workflows depend on that. In summary, if a listener socket specifically sets TCP_CONGESTION to "x", or the route locks the CC module to "x", then the child gets "x". Otherwise the child gets current system default from net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and earlier, and this commit restores that. Fixes: 55d8694fa82c ("net: tcp: assign tcp cong_ops when tcp sk is created") Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
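The selection rule summarized in this commit message can be sketched in plain C. This is only an illustration: `struct demo_listener_cc`, `demo_pick_child_cc()` and the parameter names are hypothetical and are not the kernel's data structures, and checking the route lock before the listener's own choice is an assumption, since the message does not say which wins when both are present.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a listener's congestion control state. */
struct demo_listener_cc {
    const char *name;   /* CC module currently attached to the listener */
    bool set_by_app;    /* true if the app set TCP_CONGESTION on the listener */
};

/*
 * Pick the CC module name for a newly established child socket.
 * route_locked_cc is the module the route locks the connection to, or NULL.
 * system_default stands for the current net.ipv4.tcp_congestion_control.
 * Assumption: a route lock takes precedence over the listener's choice.
 */
const char *demo_pick_child_cc(const struct demo_listener_cc *listener,
                               const char *route_locked_cc,
                               const char *system_default)
{
    if (route_locked_cc != NULL)
        return route_locked_cc;
    if (listener->set_by_app)
        return listener->name;
    /* No explicit choice: follow the system default at establishment time. */
    return system_default;
}
```

With a listener that never called setsockopt(TCP_CONGESTION), the sketch returns whatever default is passed in at the time of the call, which is the behaviour the commit restores.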
2015-05-30  Merge tag 'mac80211-for-davem-2015-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211  David S. Miller
Johannes Berg says: ==================== This just has a single docbook build fix. In my confusion I'd already sent the same fix for -next, but Ben Hutchings noted it's necessary in 4.1. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-28  mac80211: Fix mac80211.h docbook comments  Jonathan Corbet
A couple of enums in mac80211.h became structures recently, but the comments didn't follow suit, leading to errors like:
  Error(.//include/net/mac80211.h:367): Cannot parse enum!
  Documentation/DocBook/Makefile:93: recipe for target 'Documentation/DocBook/80211.xml' failed
  make[1]: *** [Documentation/DocBook/80211.xml] Error 1
  Makefile:1361: recipe for target 'mandocs' failed
  make: *** [mandocs] Error 2
Fix the comments accordingly. Added a couple of other small comment fixes while I was there to silence other recently-added docbook warnings. Reported-by: Jim Davis <jim.epost@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-27  sctp: Fix mangled IPv4 addresses on a IPv6 listening socket  Jason Gunthorpe
sctp_v4_map_v6 was subtly writing and reading from members of a union in a way that clobbered data it needed to read before it read it. Zeroing the v6 flowinfo overwrites the v4 sin_addr with 0, meaning that every place that calls sctp_v4_map_v6 gets ::ffff:0.0.0.0 as the result. Reorder things to guarantee correct behaviour no matter what the union layout is. This impacts user space clients that open an IPv6 SCTP socket and receive IPv4 connections. Prior to 299ee user space would see a sockaddr with AF_INET and a correct address, after 299ee the sockaddr is AF_INET6, but the address is wrong. Fixes: 299ee123e198 (sctp: Fixup v4mapped behaviour to comply with Sock API) Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
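The ordering problem can be reproduced in user space with the standard socket address structures, where `sin_addr` and `sin6_flowinfo` typically share the same bytes of a union. The sketch below is not the kernel function: `union demo_addr` and `demo_v4_map_v6()` are hypothetical names, and the approach shown (copy everything needed out of the v4 view before touching the v6 view) is one way to be independent of the union layout.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the kernel's address union: the v4 and v6
 * views share the same storage. */
union demo_addr {
    struct sockaddr_in v4;
    struct sockaddr_in6 v6;
};

/* Rewrite an AF_INET address in place as a v4-mapped AF_INET6 address. */
void demo_v4_map_v6(union demo_addr *addr)
{
    /* Copy everything still needed out of the v4 view first. */
    in_port_t port = addr->v4.sin_port;
    struct in_addr v4addr = addr->v4.sin_addr;

    /* Only now overwrite the storage through the v6 view. */
    memset(&addr->v6, 0, sizeof(addr->v6));
    addr->v6.sin6_family = AF_INET6;
    addr->v6.sin6_port = port;
    addr->v6.sin6_addr.s6_addr[10] = 0xff;
    addr->v6.sin6_addr.s6_addr[11] = 0xff;
    memcpy(&addr->v6.sin6_addr.s6_addr[12], &v4addr, sizeof(v4addr));
}
```

Because the port and address live in locals before any v6 field is written, zeroing the flow information can no longer destroy the IPv4 address, whatever offsets the two structures happen to use.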
2015-05-19  inet: properly align icsk_ca_priv  Eric Dumazet
tcp_illinois and upcoming tcp_cdg require 64bit alignment of icsk_ca_priv. x86 does not care, but other architectures might. Fixes: 05cbc0db03e82 ("ipv4: Create probe timer for tcp PMTU as per RFC4821") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Fan Du <fan.du@intel.com> Acked-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
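One common way to get this kind of guarantee is to declare the private area as an array of 64-bit words instead of 32-bit words, so the compiler has to place it at an offset suitable for a 64-bit type. The sketch below only illustrates that idea with hypothetical structures (`demo_sock_u32`, `demo_sock_u64`); it does not reproduce the layout of the real socket structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical private area declared as 32-bit words: only the alignment
 * of uint32_t is guaranteed for the array. */
struct demo_sock_u32 {
    uint32_t flags;
    uint32_t ca_priv[16];
};

/* The same 64 bytes declared as 64-bit words: the compiler must place
 * the array at an offset suitable for uint64_t, adding padding if needed. */
struct demo_sock_u64 {
    uint32_t flags;
    uint64_t ca_priv[64 / sizeof(uint64_t)];
};

/* Returns 1 if a member at this offset is suitably aligned for uint64_t. */
int demo_offset_is_u64_aligned(size_t offset)
{
    return offset % _Alignof(uint64_t) == 0;
}
```

Passing `offsetof(struct demo_sock_u64, ca_priv)` to the helper always succeeds, while the `uint32_t` variant may or may not, depending on the ABI and on the members that precede the array.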
2015-05-04  Merge tag 'mac80211-for-davem-2015-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211  David S. Miller
Johannes Berg says: ==================== We have only a few fixes right now:
 * a fix for an issue with hash collision handling in the rhashtable conversion
 * a merge issue - rhashtable removed default shrinking just before mac80211 was converted, so enable it now
 * remove an invalid WARN that can trigger with legitimate userspace behaviour
 * add a struct member missing from kernel-doc that caused a lot of warnings
==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04  Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next  David S. Miller
Johan Hedberg says: ==================== pull request: bluetooth-next 2015-05-04 Here's the first bluetooth-next pull request for 4.2:
 - Various fixes for at86rf230 driver
 - ieee802154: trace events support for rdev->ops
 - HCI UART driver refactoring
 - New Realtek IDs added to btusb driver
 - Off-by-one fix for rtl8723b in btusb driver
 - Refactoring of btbcm driver for both UART & USB use
Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04  mac80211: fix 90 kernel-doc warnings  Randy Dunlap
Eliminate 90 of these warnings: Warning(..//include/net/mac80211.h:1682): No description found for parameter 'drv_priv[0]' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-03  codel: fix maxpacket/mtu confusion  Eric Dumazet
Under presence of TSO/GSO/GRO packets, codel at low rates can be quite useless. In the following example, not a single packet was ever dropped, while average delay in codel queue is ~100 ms!
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
 Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 13626b 3p requeues 0
  count 0 lastcount 0 ldelay 96.9ms drop_next 0us
  maxpacket 9084 ecn_mark 0 drop_overlimit 0
This comes from a confusion of what should be the minimal backlog. It is pretty clear it is not 64KB or whatever max GSO packet ever reached the qdisc. codel intent was to use MTU of the device. After the fix, we finally drop some packets, and rtt/cwnd of my single TCP flow are meeting our expectations.
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
 Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
 backlog 6056b 3p requeues 0
  count 1 lastcount 1 ldelay 36.3ms drop_next 0us
  maxpacket 10598 ecn_mark 0 drop_overlimit 0
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kathleen Nichols <nichols@pollere.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Van Jacobson <vanj@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
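The effect of the "minimal backlog" threshold can be illustrated with a reduced model of the check that decides whether the queue is in a good state. Everything in this sketch is hypothetical (`struct demo_codel_state`, `demo_codel_queue_is_ok()`, and the numbers used with it); it models only the one comparison the commit message talks about, not the full codel state machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced queue state; the real codel code tracks more. */
struct demo_codel_state {
    uint32_t backlog_bytes;  /* bytes currently queued */
    uint32_t sojourn_us;     /* queueing delay of the packet at the head */
    uint32_t target_us;      /* acceptable standing delay, e.g. 5000 */
};

/*
 * Returns true when the queue counts as "good", so no drop is scheduled.
 * min_backlog_bytes is the threshold below which the queue is treated as
 * nearly empty: the largest packet seen so far before the fix, the device
 * MTU after it.
 */
bool demo_codel_queue_is_ok(const struct demo_codel_state *q,
                            uint32_t min_backlog_bytes)
{
    return q->sojourn_us < q->target_us ||
           q->backlog_bytes <= min_backlog_bytes;
}
```

With an assumed backlog of 40000 bytes and a delay well above target, a threshold of 65536 (a large GSO packet) reports the queue as fine, while a threshold of 1500 (an assumed Ethernet MTU) lets the drop logic engage.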
2015-04-30  cfg802154: pass name_assign_type to rdev_add_virtual_intf()  Varka Bhadram
This code is based on commit 6bab2e19c5ffd ("cfg80211: pass name_assign_type to rdev_add_virtual_intf()"). This will expose in sysfs whether the ifname of an IEEE-802.15.4 device is set by userspace or generated by the kernel. We are using two types of name_assign_types:
 o NET_NAME_ENUM: Default interface name provided by kernel
 o NET_NAME_USER: Interface name provided by user.
Signed-off-by: Varka Bhadram <varkab@cdac.in> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30  mac802154: add description to mac802154 APIs  Varka Bhadram
This patch adds the proper description to the mac802154 core APIs. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-29  tcp: prepare CC get_info() access from getsockopt()  Eric Dumazet
We would like the optional info that Congestion Control modules provide over netlink to also be readable using getsockopt(). This patch changes get_info() to put this information in a buffer, instead of skb, like tcp_get_info(), so that following patch can reuse this common infrastructure. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29  tcp: add tcpi_bytes_acked to tcp_info  Eric Dumazet
This patch tracks total number of bytes acked for a TCP socket. This is the sum of all changes done to tp->snd_una, and allows for precise tracking of delivered data. RFC4898 named this: tcpEStatsAppHCThruOctetsAcked. This is a 64bit field, and can be fetched either from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow). Note that tp->bytes_acked was placed near tp->snd_una for best data locality and minimal performance impact. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Eric Salo <salo@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
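Summing the changes to a 32-bit sequence number into a 64-bit counter can be sketched as follows. The structure and function names (`struct demo_tcp`, `demo_snd_una_update()`) are hypothetical; the point of the sketch is that an unsigned 32-bit subtraction yields the right delta even when the sequence space wraps.

```c
#include <stdint.h>

/* Hypothetical subset of per-socket state. */
struct demo_tcp {
    uint32_t snd_una;      /* first unacknowledged sequence number */
    uint64_t bytes_acked;  /* running total of acknowledged bytes */
};

/* Advance snd_una to ack and account for the newly acknowledged bytes.
 * The caller is assumed to have checked that ack does not move backwards. */
void demo_snd_una_update(struct demo_tcp *tp, uint32_t ack)
{
    uint32_t delta = ack - tp->snd_una;  /* modulo 2^32, survives wrap */

    tp->bytes_acked += delta;
    tp->snd_una = ack;
}
```

Keeping the counter next to `snd_una` in the structure mirrors the locality argument made in the commit message, since both fields are written in the same step.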
2015-04-26  net/bonding: Make DRV macros private  Matan Barak
The bonding module currently defines four macros with general names that pollute the global namespace: DRV_VERSION, DRV_RELDATE, DRV_NAME and DRV_DESCRIPTION. Fix that by defining a private bonding_priv.h header file which includes those defines. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24  inet: fix possible panic in reqsk_queue_unlink()  Eric Dumazet
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
There is a race when reqsk_timer_handler() and tcp_check_req() call inet_csk_reqsk_queue_unlink() on the same req at the same time. Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer"), listener spinlock was held and race could not happen. To solve this bug, we change reqsk_queue_unlink() to not assume req must be found, and we return a status, to conditionally release a refcount on the request sock. This also means tcp_check_req() in non fastopen case might or might not consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have to properly handle this. (Same remark for dccp_check_req() and its callers.) inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is called 4 times in tcp and 3 times in dccp. Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
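The "return a status and release the refcount conditionally" pattern can be shown on a plain singly linked list. This is a single-threaded sketch with hypothetical names (`struct demo_req`, `demo_queue_unlink()`, `demo_queue_drop()`); the kernel code performs the unlink under a lock and uses its own refcount helpers, neither of which is modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request object on a singly linked queue. */
struct demo_req {
    struct demo_req *next;
    int refcnt;
};

/* Unlink req from the list at *head. Returns true only if it was found,
 * i.e. only if this caller is the one that removed it. */
bool demo_queue_unlink(struct demo_req **head, struct demo_req *req)
{
    struct demo_req **pp;

    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == req) {
            *pp = req->next;
            req->next = NULL;
            return true;
        }
    }
    return false;  /* someone else already unlinked it */
}

/* Drop the queue's reference only when this call really unlinked req. */
void demo_queue_drop(struct demo_req **head, struct demo_req *req)
{
    if (demo_queue_unlink(head, req))
        req->refcnt--;
}
```

Calling `demo_queue_drop()` twice on the same request releases the queue's reference once, which is the property the two racing callers in the commit message need.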
2015-04-17  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  Linus Torvalds
Pull networking fixes from David Miller: 1) Fix verifier memory corruption and other bugs in BPF layer, from Alexei Starovoitov. 2) Add a conservative fix for doing BPF properly in the BPF classifier of the packet scheduler on ingress. Also from Alexei. 3) The SKB scrubber should not clear out the packet MARK and security label, from Herbert Xu. 4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue. 5) Pause handling is not correct in the stmmac driver because it doesn't take into consideration the RX and TX fifo sizes. From Vince Bridgers. 6) Failure path missing unlock in FOU driver, from Wang Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits) net: dsa: use DEVICE_ATTR_RW to declare temp1_max netns: remove BUG_ONs from net_generic() IB/ipoib: Fix ndo_get_iflink sfc: Fix memcpy() with const destination compiler warning. altera tse: Fix network-delays and -retransmissions after high throughput. net: remove unused 'dev' argument from netif_needs_gso() act_mirred: Fix bogus header when redirecting from VLAN inet_diag: fix access to tcp cc information tcp: tcp_get_info() should fetch socket fields once net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state() skbuff: Do not scrub skb mark within the same name space Revert "net: Reset secmark when scrubbing packet" bpf: fix two bugs in verification logic when accessing 'ctx' pointer bpf: fix bpf helpers to use skb->mac_header relative offsets stmmac: Configure Flow Control to work correctly based on rxfifo size stmmac: Enable unicast pause frame detect in GMAC Register 6 stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree stmmac: Add defines and documentation for enabling flow control stmmac: Add properties for transmit and receive fifo sizes stmmac: fix oops on rmmod after assigning ip addr ...
2015-04-17  netns: remove BUG_ONs from net_generic()  Denys Vlasenko
This inline has ~500 callsites.
On 04/14/2015 08:37 PM, David Miller wrote:
> That BUG_ON() was added 7 years ago, and I don't remember it ever
> triggering or helping us diagnose something, so just remove it and
> keep the function inlined.
On x86 allyesconfig build:
    text     data      bss       dec     hex filename
82447071 22255384 20627456 125329911 77861f7 vmlinux4
82441375 22255384 20627456 125324215 7784bb7 vmlinux5prime
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Eric W. Biederman <ebiederm@xmission.com> CC: David S. Miller <davem@davemloft.net> CC: Jan Engelhardt <jengelh@medozas.de> CC: Jiri Pirko <jpirko@redhat.com> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17  inet_diag: fix access to tcp cc information  Eric Dumazet
Two different problems are fixed here:
1) inet_sk_diag_fill() might be called without socket lock held. icsk->icsk_ca_ops can change under us and module be unloaded. -> Access to freed memory. Fix this using rcu_read_lock() to prevent module unload.
2) Some TCP Congestion Control modules provide information but again this is not safe against icsk->icsk_ca_ops change and nla_put() errors were ignored. Some sockets could not get the additional info if skb was almost full. Fix this by returning a status from get_info() handlers and using rcu protection as well.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-15  Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  Linus Torvalds
Pull second vfs update from Al Viro: "Now that net-next went in... Here's the next big chunk - killing ->aio_read() and ->aio_write(). There'll be one more pile today (direct_IO changes and generic_write_checks() cleanups/fixes), but I'd prefer to keep that one separate" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) ->aio_read and ->aio_write removed pcm: another weird API abuse infinibad: weird APIs switched to ->write_iter() kill do_sync_read/do_sync_write fuse: use iov_iter_get_pages() for non-splice path fuse: switch to ->read_iter/->write_iter switch drivers/char/mem.c to ->read_iter/->write_iter make new_sync_{read,write}() static coredump: accept any write method switch /dev/loop to vfs_iter_write() serial2002: switch to __vfs_read/__vfs_write ashmem: use __vfs_read() export __vfs_read() autofs: switch to __vfs_write() new helper: __vfs_write() switch hugetlbfs to ->read_iter() coda: switch to ->read_iter/->write_iter ncpfs: switch to ->read_iter/->write_iter net/9p: remove (now-)unused helpers p9_client_attach(): set fid->uid correctly ...
2015-04-14  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next  David S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next A final pull request, I know it's very late but this time I think it's worth a bit of rush. The following patchset contains Netfilter/nf_tables updates for net-next, more specifically concatenation support and dynamic stateful expression instantiation. This also comes with a couple of small patches. One to fix the ebtables.h userspace header and another to get rid of an obsolete example file in tree that describes a nf_tables expression. This time, I decided to paste the original descriptions. This will result in a rather large commit description, but I think these bytes to keep. Patrick McHardy says: ==================== netfilter: nf_tables: concatenation support The following patches add support for concatenations, which allow multi dimensional exact matches in O(1). The basic idea is to split the data registers, currently consisting of 4 registers of 16 bytes each, into smaller units, 16 registers of 4 bytes each, and making sure each register store always leaves the full 32 bit in a well defined state, meaning smaller stores will zero the remaining bits. Based on that, we can load multiple adjacent registers with different values, thereby building a concatenated bigger value, and use that value for set lookups. Sets are changed to use variable sized extensions for their key and data values, removing the fixed limit of 16 bytes while saving memory if less space is needed. As a side effect, these patches will allow some nice optimizations in the future, like using jhash2 in nft_hash, removing the masking in nft_cmp_fast, optimized data comparison using 32 bit word size etc. These are not done so far however. 
The patches are split up as follows:
 * the first five patches add length validation to register loads and stores to make sure we stay within bounds and prepare the validation functions for the new addressing mode
 * the next patches prepare for changing to 32 bit addressing by introducing a struct nft_regs, which holds the verdict register as well as the data registers. The verdict members are moved to a new struct nft_verdict to allow to pull struct nft_data out of the stack.
 * the next patches contain preparatory conversions of expressions and sets to use 32 bit addressing
 * the next patch introduces so far unused register conversion helpers for parsing and dumping register numbers over netlink
 * following is the real conversion to 32 bit addressing, consisting of replacing struct nft_data in struct nft_regs by an array of u32s and actually translating and validating the new register numbers.
 * the final two patches add support for variable sized data items and variable sized keys / data in set elements
The patches have been verified to work correctly with nft binaries using both old and new addressing. ==================== Patrick McHardy says: ==================== netfilter: nf_tables: dynamic stateful expression instantiation The following patches are the grand finale of my nf_tables set work, using all the building blocks put in place by the previous patches to support something like iptables hashlimit, but a lot more powerful. Sets are extended to allow attaching expressions to set elements. The dynset expression dynamically instantiates these expressions based on a template when creating new set elements and evaluates them for all new or updated set members. In combination with concatenations this effectively creates state tables for arbitrary combinations of keys, using the existing expression types to maintain that state. Regular set GC takes care of purging expired states. We currently support two different stateful expressions, counter and limit. 
Using limit as a template we can express the functionality of hashlimit, but completely unrestricted in the combination of keys. Using counter we can perform accounting for arbitrary flows. The following examples from patch 5/5 show some possibilities. Userspace syntax is still WIP, especially the listing of state tables will most likely be separated from normal set listings and use a more structured format:
1. Limit the rate of new SSH connections per host, similar to iptables hashlimit:
     flow ip saddr timeout 60s \
          limit 10/second \
          accept
2. Account network traffic between each set of /24 networks:
     flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
          counter
3. Account traffic to each host per user:
     flow skuid . ip daddr \
          counter
4. Account traffic for each combination of source address and TCP flags:
     flow ip saddr . tcp flags \
          counter
The resulting set content after a Xmas-scan looks like this:
     { 192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
       192.168.122.1 . ack : counter packets 74 bytes 3848,
       192.168.122.1 . psh | ack : counter packets 35 bytes 3144 }
In the future the "expressions attached to elements" will be extended to also support user created non-stateful expressions to allow to efficiently select between a set of parameter sets, f.i. a set of log statements with different prefixes based on the interface, which currently require one rule each. This will most likely have to wait until the next kernel version though. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13  Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  David S. Miller
Al Viro says: ==================== netdev-related stuff in vfs.git There are several commits sitting in vfs.git that probably ought to go in via net-next.git. First of all, there's merge with vfs.git#iocb - that's Christoph's aio rework, which has triggered conflicts with the ->sendmsg() and ->recvmsg() patches a while ago. It's not so much Christoph's stuff that ought to be in net-next, as (pretty simple) conflict resolution on merge. The next chunk is switch to {compat_,}import_iovec/import_single_range - new safer primitives for initializing iov_iter. The primitives themselves come from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue), conversion of net/socket.c syscalls belongs in net-next, IMO. Next there's afs and rxrpc stuff from dhowells. And then there's sanitizing kernel_sendmsg et.al. + missing inlined helper for "how much data is left in msg->msg_iter" - this stuff is used in e.g. cifs stuff, but it belongs in net-next. That pile is pullable from git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem I'll post the individual patches in there in followups; could you take a look and tell if everything in there is OK with you? ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13  tcp/dccp: get rid of central timewait timer  Eric Dumazet
Using a timer wheel for timewait sockets was nice ~15 years ago when memory was expensive and machines had a single processor. This does not scale, code is ugly and source of huge latencies (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.) We can afford to use an extra 64 bytes per timewait sock and spread timewait load to all cpus to have better behavior. Tested: On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1 on the target (lpaa24) Before patch : lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 419594 lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 437171 While test is running, we can observe 25 or even 33 ms latencies. lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2 lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2 After patch : About 90% increase of throughput : lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 810442 lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 800992 And latencies are kept to minimal values during this load, even if network utilization is 90% higher : lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13  netfilter: nf_tables: mark stateful expressions  Patrick McHardy
Add a flag to mark stateful expressions. This is used for dynamic expression instantiation to limit the usable expressions. Strictly speaking only the dynset expression can not be used in order to avoid recursion, but since dynamically instantiating non-stateful expressions will simply create an identical copy, which behaves no differently than the original, this limits to expressions where it actually makes sense to dynamically instantiate them. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: prepare for expressions associated to set elements  Patrick McHardy
Preparation to attach expressions to set elements: add a set extension type to hold an expression and dump the expression information with the set element. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: add helper functions for expression handling  Patrick McHardy
Add helper functions for initializing, cloning, dumping and destroying a single expression that is not part of a rule. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: variable sized set element keys / data  Patrick McHardy
This patch changes sets to support variable sized set element keys / data up to 64 bytes each by using variable sized set extensions. This allows using concatenations with bigger data items such as IPv6 addresses. As a side effect, small keys/data now don't require the full 16 bytes of struct nft_data anymore but just the space they need. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: support variable sized data in nft_data_init()  Patrick McHardy
Add a size argument to nft_data_init() and pass in the available space. This will be used by the following patches to support variable sized set element data. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: switch registers to 32 bit addressing  Patrick McHardy
Switch the nf_tables registers from 128 bit addressing to 32 bit addressing to support so called concatenations, where multiple values can be concatenated over multiple registers for O(1) exact matches of multiple dimensions using sets. The old register values are mapped to areas of 128 bits for compatibility. When dumping register numbers, values are expressed using the old values if they refer to the beginning of a 128 bit area for compatibility. To support concatenations, register loads of less than a full 32 bit value need to be padded. This mainly affects the payload and exthdr expressions, which both unconditionally zero the last word before copying the data. Userspace fully passes the testsuite using both old and new register addressing. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
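The two mechanisms described here, mapping an old 128 bit register number onto the first 32 bit word of its area and zero-padding short stores, can be sketched as below. The constants and names (`DEMO_REG_SIZE`, `demo_legacy_to_word()`, `demo_store()`) are hypothetical, and the layout assumed (a verdict slot followed by four legacy data registers, 20 words in total) is an assumption made for the sketch rather than a statement about the kernel headers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical constants: four legacy 16-byte data registers follow a
 * verdict register that is assumed to occupy one 16-byte slot as well. */
#define DEMO_REG_SIZE    16   /* bytes per legacy register */
#define DEMO_REG32_SIZE   4   /* bytes per 32-bit register */
#define DEMO_REG32_COUNT ((1 + 4) * DEMO_REG_SIZE / DEMO_REG32_SIZE)

/* Map a legacy register number (0 = verdict, 1..4 = data) to the index of
 * the first 32-bit word of its 128-bit area. */
unsigned int demo_legacy_to_word(unsigned int legacy_reg)
{
    return legacy_reg * DEMO_REG_SIZE / DEMO_REG32_SIZE;
}

/* Store len bytes starting at word index dst, zeroing the last word first
 * so every touched 32-bit register ends up fully defined. Bounds are
 * assumed to have been validated by the caller. */
void demo_store(uint32_t *regs, unsigned int dst, const void *src, size_t len)
{
    size_t words = (len + DEMO_REG32_SIZE - 1) / DEMO_REG32_SIZE;

    if (words == 0)
        return;
    regs[dst + words - 1] = 0;  /* clear the last, possibly partial word */
    memcpy(&regs[dst], src, len);
}
```

Because a two-byte value such as a port leaves the rest of its word zeroed, two adjacent words can be compared or hashed together as one concatenated key without stale bytes leaking in.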
2015-04-13  netfilter: nf_tables: add register parsing/dumping helpers  Patrick McHardy
Add helper functions to parse and dump register values in netlink attributes. These helpers will later be changed to take care of translation between the old 128 bit and the new 32 bit register numbers. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: convert sets to u32 data pointers  Patrick McHardy
Simple conversion to use u32 pointers to the beginning of the data area to keep follow up patches smaller. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: kill nft_data_cmp()  Patrick McHardy
Only needlessly complicates things due to requiring specific argument types. Use memcmp directly. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: use struct nft_verdict within struct nft_data  Patrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: get rid of NFT_REG_VERDICT usage  Patrick McHardy
Replace the array of registers passed to expressions by a struct nft_regs, containing the verdict as a separate member, which aliases to the NFT_REG_VERDICT register. This is needed to separate the verdict from the data registers completely, so their size can be changed. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: introduce nft_validate_register_load()  Patrick McHardy
Change nft_validate_input_register() to not only validate the input register number, but also the length of the load, and rename it to nft_validate_register_load() to reflect that change. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: kill nft_validate_output_register()  Patrick McHardy
All users of nft_validate_register_store() first invoke nft_validate_output_register(). There is in fact no use for using it on its own, so simplify the code by folding the functionality into nft_validate_register_store() and kill it. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13  netfilter: nf_tables: rename nft_validate_data_load()  Patrick McHardy
The existing name is ambiguous, data is loaded as well when we read from a register. Rename to nft_validate_register_store() for clarity and consistency with the upcoming patch to introduce its counterpart, nft_validate_register_load(). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13netfilter: nf_tables: validate len in nft_validate_data_load()Patrick McHardy
For values spanning multiple registers, we need to validate that enough space is available from the destination register onwards. Add a len argument to nft_validate_data_load() and consolidate the existing length validations in preparation of that. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
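The bounds check described in this commit message can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: the register count, the 16-byte register size, and the names `REG_COUNT`, `REG_SIZE` and `validate_register_span()` are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical layout: a small bank of fixed-size data registers. */
#define REG_COUNT 4u
#define REG_SIZE  16u	/* bytes per register (assumed for this sketch) */

/*
 * Check that a value of `len` bytes, stored starting at register `reg`,
 * fits into the bank from that register onwards.
 * Returns 0 on success or a negative errno-style code.
 */
static int validate_register_span(unsigned int reg, size_t len)
{
	if (reg >= REG_COUNT)
		return -EINVAL;		/* no such register */
	if (len == 0)
		return -EINVAL;		/* empty stores are rejected */
	if (len > (size_t)(REG_COUNT - reg) * REG_SIZE)
		return -ERANGE;		/* would run past the last register */
	return 0;
}
```

With these assumed sizes, a 48-byte value fits when it starts at register 1, while a 17-byte value starting at the last register is rejected.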
2015-04-12Merge tag 'mac80211-next-for-davem-2015-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-nextDavid S. Miller
Johannes Berg says: ==================== There isn't much left, but we have * new mac80211 internal software queue to allow drivers to have shorter hardware queues and pull on-demand * use rhashtable for mac80211 station table * minstrel rate control debug improvements and some refactoring * fix noisy message about TX power reduction * fix continuous message printing and activity if CRDA doesn't respond * fix VHT-related capabilities with "iw connect" or "iwconfig ..." * fix Kconfig for cfg80211 wireless extensions compatibility ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-119p: switch p9_client_read() to passing struct iov_iter *Al Viro
... and make it loop Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-119p: switch p9_client_write() to passing it struct iov_iter *Al Viro
... and make it loop until it's done Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
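The "loop until it's done" pattern mentioned here can be sketched generically in userspace C. This is not the 9p client code: the callback type `chunk_write_fn`, the helper `write_all()`, and the demo transport `capped_write()` are hypothetical names introduced for the illustration, and the choice to report partial progress on error is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical transport callback: writes up to `len` bytes at `offset`
 * and returns the number of bytes written (possibly fewer than asked),
 * or a negative error code.
 */
typedef ssize_t (*chunk_write_fn)(void *ctx, const char *buf,
				  size_t len, size_t offset);

/* Keep issuing writes until everything is sent or no progress is made. */
static ssize_t write_all(chunk_write_fn write_chunk, void *ctx,
			 const char *buf, size_t len, size_t offset)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = write_chunk(ctx, buf + done, len - done,
					offset + done);

		if (n < 0)	/* error: report progress made so far, if any */
			return done ? (ssize_t)done : n;
		if (n == 0)	/* no progress: stop instead of spinning */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Demo transport: copies at most 3 bytes per call into a memory buffer. */
static ssize_t capped_write(void *ctx, const char *buf,
			    size_t len, size_t offset)
{
	size_t n = len < 3 ? len : 3;

	memcpy((char *)ctx + offset, buf, n);
	return (ssize_t)n;
}
```

Calling `write_all(capped_write, out, "hello world", 11, 0)` on a sufficiently large buffer `out` takes four short writes to complete, and the caller sees a single full-length result.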
2015-04-11net/9p: switch the guts of p9_client_{read,write}() to iov_iterAl Viro
... and have get_user_pages_fast() mapping fewer pages than requested to generate a short read/write. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-10rtnetlink: Mark name argument of rtnl_create_link() constThomas Graf
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-nextDavid S. Miller
Johan Hedberg says: ==================== pull request: bluetooth-next 2015-04-09 We've had enough new patches during the past week (especially from Marcel) that it'd be good to still get these queued for 4.1. The majority of the changes are from Marcel with lots of cleanup & refactoring patches for the HCI UART driver. Marcel also split out some Broadcom & Intel vendor specific functionality into two new btintel & btbcm modules. In addition to the HCI driver changes there's the completion of our local OOB data interface for pairing, added support for requesting remote LE features when connecting, as well as a couple of minor fixes for mac802154. Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. They are: * nf_tables set timeout infrastructure from Patrick McHardy. 1) Add set timeout support. 2) Add support for set element timeouts using the new set extension infrastructure. 3) Add garbage collection helper functions to get rid of stale elements. Elements are accumulated in a batch that is asynchronously released via RCU when the batch is full. 4) Add garbage collection synchronization helpers. This introduces a new element busy bit to address concurrent access from the netlink API and the garbage collector. 5) Add timeout support for the nft_hash set implementation. The garbage collector periodically checks for stale elements from the workqueue. * iptables/nftables cgroup fixes: 6) Ignore non full-socket objects from the input path, otherwise cgroup match may crash, from Daniel Borkmann. 7) Fix cgroup in nf_tables. 8) Save some cycles from xt_socket by skipping packet header parsing when skb->sk is already set because of early demux. Also from Daniel. * br_netfilter updates from Florian Westphal. 9) Save frag_max_size and restore it from the forward path too. 10) Use a per-cpu area to restore the original source MAC address when traffic is DNAT'ed. 11) Add helper functions to access physical devices. 12) Use these new physdev helper functions from xt_physdev. 13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter state information. 14) Annotate original layer 2 protocol number in nf_bridge info, instead of using kludgy flags. 15) Also annotate the pkttype mangling when the packet travels back and forth from the IP to the bridge layer, instead of using a flag. * More nf_tables set enhancements from Patrick: 16) Fix possible usage of set variant that doesn't support timeouts.
17) Avoid spurious "set is full" errors from Netlink API when there are pending stale elements scheduled to be released. 18) Restrict loop checks to set maps. 19) Add support for dynamic set updates from the packet path. 20) Add support to store optional user data (e.g. comments) per set element. BTW, I have also pulled net-next into nf-next to anticipate the conflict resolution between your okfn() signature changes and Florian's br_netfilter updates. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09mac802154: fix transmission power datatypeVarka Bhadram
The netlink attribute for the power is s8, but for driver-level operations we are collecting the power level value into an integer. It has to be changed from int to s8. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-09mac802154: fix typo for deviceVarka Bhadram
Signed-off-by: Varka Bhadram <varkab@cdac.in> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-09Bluetooth: Read LE remote features during connection establishmentMarcel Holtmann
When establishing a Bluetooth LE connection, read the remote used features mask to determine which features are supported. This was not really needed with Bluetooth 4.0, but since Bluetooth 4.1 and also 4.2 have introduced new optional features, this becomes more important. This works the same as with BR/EDR where the connection enters the BT_CONFIG stage and hci_connect_cfm call is delayed until the remote features have been retrieved. Only after successfully receiving the remote features, the connection enters the BT_CONNECTED state. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-04-09net: switch importing msghdr from userland to {compat_,}import_iovec()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09Merge branch 'iocb' into for-davemAl Viro
trivial conflict in net/socket.c and non-trivial one in crypto - that one had evaded aio_complete() removal. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>