aboutsummaryrefslogtreecommitdiff
path: root/net/packet
AgeCommit message (Collapse)Author
2015-12-09packet: race condition in packet_bindFrancesco Ruggeri
[ Upstream commit 30f7ea1c2b5f5fb7462c5ae44fe2e40cb2d6a474 ] There is a race conditions between packet_notifier and packet_bind{_spkt}. It happens if packet_notifier(NETDEV_UNREGISTER) executes between the time packet_bind{_spkt} takes a reference on the new netdevice and the time packet_do_bind sets po->ifindex. In this case the notification can be missed. If this happens during a dev_change_net_namespace this can result in the netdevice to be moved to the new namespace while the packet_sock in the old namespace still holds a reference on it. When the netdevice is later deleted in the new namespace the deletion hangs since the packet_sock is not found in the new namespace' &net->packet.sklist. It can be reproduced with the script below. This patch makes packet_do_bind check again for the presence of the netdevice in the packet_sock's namespace after the synchronize_net in unregister_prot_hook. More in general it also uses the rcu lock for the duration of the bind to stop dev_change_net_namespace/rollback_registered_many from going past the synchronize_net following unlist_netdevice, so that no NETDEV_UNREGISTER notifications can happen on the new netdevice while the bind is executing. In order to do this some code from packet_bind{_spkt} is consolidated into packet_do_dev. import socket, os, time, sys proto=7 realDev='em1' vlanId=400 if len(sys.argv) > 1: vlanId=int(sys.argv[1]) dev='vlan%d' % vlanId os.system('taskset -p 0x10 %d' % os.getpid()) s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto) os.system('ip link add link %s name %s type vlan id %d' % (realDev, dev, vlanId)) os.system('ip netns add dummy') pid=os.fork() if pid == 0: # dev should be moved while packet_do_bind is in synchronize net os.system('taskset -p 0x20000 %d' % os.getpid()) os.system('ip link set %s netns dummy' % dev) os.system('ip netns exec dummy ip link del %s' % dev) s.close() sys.exit(0) time.sleep(.004) try: s.bind(('%s' % dev, proto+1)) except: print 'Could not bind socket' s.close() os.system('ip netns del dummy') sys.exit(0) os.waitpid(pid, 0) s.close() os.system('ip netns del dummy') sys.exit(0) Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29packet: tpacket_snd(): fix signed/unsigned comparisonAlexander Drozdov
[ Upstream commit dbd46ab412b8fb395f2b0ff6f6a7eec9df311550 ] tpacket_fill_skb() can return a negative value (-errno) which is stored in tp_len variable. In that case the following condition will be (but shouldn't be) true: tp_len > dev->mtu + dev->hard_header_len as dev->mtu and dev->hard_header_len are both unsigned. That may lead to just returning an incorrect EMSGSIZE errno to the user. Fixes: 52f1454f629fa ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case") Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29packet: missing dev_put() in packet_do_bind()Lars Westerhoff
[ Upstream commit 158cd4af8dedbda0d612d448c724c715d0dda649 ] When binding a PF_PACKET socket, the use count of the bound interface is always increased with dev_hold in dev_get_by_{index,name}. However, when rebound with the same protocol and device as in the previous bind the use count of the interface was not decreased. Ultimately, this caused the deletion of the interface to fail with the following message: unregister_netdevice: waiting for dummy0 to become free. Usage count = 1 This patch moves the dev_put out of the conditional part that was only executed when either the protocol or device changed on a bind. Fixes: 902fefb82ef7 ('packet: improve socket create/bind latency in some cases') Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10packet: avoid out of bounds read in round robin fanoutWillem de Bruijn
[ Upstream commit 468479e6043c84f5a65299cc07cb08a22a28c2b1 ] PACKET_FANOUT_LB computes f->rr_cur such that it is modulo f->num_members. It returns the old value unconditionally, but f->num_members may have changed since the last store. Ensure that the return value is always < num. When modifying the logic, simplify it further by replacing the loop with an unconditional atomic increment. Fixes: dc99f600698d ("packet: Add fanout support.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10packet: read num_members once in packet_rcv_fanout()Eric Dumazet
[ Upstream commit f98f4514d07871da7a113dd9e3e330743fd70ae4 ] We need to tell compiler it must not read f->num_members multiple times. Otherwise testing if num is not zero is flaky, and we could attempt an invalid divide by 0 in fanout_demux_cpu() Note bug was present in packet_rcv_fanout_hash() and packet_rcv_fanout_lb() but final 3.1 had a simple location after commit 95ec3eb417115fb ("packet: Add 'cpu' fanout policy.") Fixes: dc99f600698dc ("packet: Add fanout support.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-10af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).Kretschmer, Mathias
This patch fixes an issue where the send(MSG_DONTWAIT) call on a TX_RING is not fully non-blocking in cases where the device's sndBuf is full. We pass nonblock=true to sock_alloc_send_skb() and return any possibly occuring error code (most likely EGAIN) to the caller. As the fast-path stays as it is, we keep the unlikely() around skb == NULL. Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23af_packet: pass checksum validation status to the userAlexander Drozdov
Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the af_packet user that at least the transport header checksum has been already validated. For now, the flag may be set for incoming packets only. Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23af_packet: make tpacket_rcv to not set status value before run_filterAlexander Drozdov
It is just an optimization. We don't need the value of status variable if the packet is filtered. Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12net: Introduce possible_net_tEric W. Biederman
Having to say > #ifdef CONFIG_NET_NS > struct net *net; > #endif in structures is a little bit wordy and a little bit error prone. Instead it is possible to say: > typedef struct { > #ifdef CONFIG_NET_NS > struct net *net; > #endif > } possible_net_t; And then in a header say: > possible_net_t net; Which is cleaner and easier to use and easier to test, as the possible_net_t is always there no matter what the compile options. Further this allows read_pnet and write_pnet to be functions in all cases which is better at catching typos. This change adds possible_net_t, updates the definitions of read_pnet and write_pnet, updates optional struct net * variables that write_pnet uses on to have the type possible_net_t, and finally fixes up the b0rked users of read_pnet and write_pnet. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/cadence/macb.c Overlapping changes in macb driver, mostly fixes and cleanups in 'net' overlapping with the integration of at91_ether into macb in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09net: delete stale packet_mclist entriesFrancesco Ruggeri
When an interface is deleted from a net namespace the ifindex in the corresponding entries in PF_PACKET sockets' mclists becomes stale. This can create inconsistencies if later an interface with the same ifindex is moved from a different namespace (not that unlikely since ifindexes are per-namespace). In particular we saw problems with dev->promiscuity, resulting in "promiscuity touches roof, set promiscuity failed. promiscuity feature of device might be broken" warnings and EOVERFLOW failures of setsockopt(PACKET_ADD_MEMBERSHIP). This patch deletes the mclist entries for interfaces that are deleted. Since this now causes setsockopt(PACKET_DROP_MEMBERSHIP) to fail with EADDRNOTAVAIL if called after the interface is deleted, also make packet_mc_drop not fail. Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/rocker/rocker.c The rocker commit was two overlapping changes, one to rename the ->vport member to ->pport, and another making the bitmask expression use '1ULL' instead of plain '1'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: Remove iocb argument from sendmsg and recvmsgYing Xue
After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: add common accessor for setting dropcount on packetsEyal Birger
As part of an effort to move skb->dropcount to skb->cb[], use a common function in order to set dropcount in struct sk_buff. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: use common macro for assering skb->cb[] available size in protocol familiesEyal Birger
As part of an effort to move skb->dropcount to skb->cb[] use a common macro in protocol families using skb->cb[] for ancillary data to validate available room in skb->cb[]. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: packet: use sockaddr_ll fields as storage for skb original length in ↵Eyal Birger
recvmsg path As part of an effort to move skb->dropcount to skb->cb[], 4 bytes of additional room are needed in skb->cb[] in packet sockets. Store the skb original length in the first two fields of sockaddr_ll (sll_family and sll_protocol) as they can be derived from the skb when needed. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-24af_packet: don't pass empty blocks for PACKET_V3Alexander Drozdov
Before da413eec729d ("packet: Fixed TPACKET V3 to signal poll when block is closed rather than every packet") poll listening for an af_packet socket was not signaled if there was no packets to process. After the patch poll is signaled evety time when block retire timer expires. That happens because af_packet closes the current block on timeout even if the block is empty. Passing empty blocks to the user not only wastes CPU but also wastes ring buffer space increasing probability of packets dropping on small timeouts. Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Cc: Dan Collins <dan@dcollins.co.nz> Cc: Willem de Bruijn <willemb@google.com> Cc: Guy Harris <guy@alum.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-21af_packet: allow packets defragmentation not only for hash fanout typeAlexander Drozdov
Packets defragmentation was introduced for PACKET_FANOUT_HASH only, see 7736d33f4262 ("packet: Add pre-defragmentation support for ipv4 fanouts") It may be useful to have defragmentation enabled regardless of fanout type. Without that, the AF_PACKET user may have to: 1. Collect fragments from different rings 2. Defragment by itself Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18netlink: make nlmsg_end() and genlmsg_end() voidJohannes Berg
Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/xen-netfront.c Minor overlapping changes in xen-netfront.c, mostly to do with some buffer management changes alongside the split of stats into TX and RX. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13net: rename vlan_tx_* helpers since "tx" is misleading thereJiri Pirko
The same macros are used for rx as well. So rename it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12packet: make packet too small warning match conditionWillem de Bruijn
The expression in ll_header_truncated() tests less than or equal, but the warning prints less than. Update the warning. Reported-by: Jouni Malinen <jkmalinen@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-11packet: bail out of packet_snd() if L2 header creation failsChristoph Jaeger
Due to a misplaced parenthesis, the expression (unlikely(offset) < 0), which expands to (__builtin_expect(!!(offset), 0) < 0), never evaluates to true. Therefore, when sending packets with PF_PACKET/SOCK_DGRAM, packet_snd() does not abort as intended if the creation of the layer 2 header fails. Spotted by Coverity - CID 1259975 ("Operands don't affect result"). Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header") Signed-off-by: Christoph Jaeger <cj@linux.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-22packet: Fixed TPACKET V3 to signal poll when block is closed rather than ↵Dan Collins
every packet Make TPACKET_V3 signal poll when block is closed rather than for every packet. Side effect is that poll will be signaled when block retire timer expires which didn't previously happen. Issue was visible when sending packets at a very low frequency such that all blocks are retired before packets are received by TPACKET_V3. This caused avoidable packet loss. The fix ensures that the signal is sent when blocks are closed which covers the normal path where the block is filled as well as the path where the timer expires. The case where a block is filled without moving to the next block (ie. all blocks are full) will still cause poll to be signaled. Signed-off-by: Dan Collins <dan@dcollins.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) New offloading infrastructure and example 'rocker' driver for offloading of switching and routing to hardware. This work was done by a large group of dedicated individuals, not limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend, Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu 2) Start making the networking operate on IOV iterators instead of modifying iov objects in-situ during transfers. Thanks to Al Viro and Herbert Xu. 3) A set of new netlink interfaces for the TIPC stack, from Richard Alpe. 4) Remove unnecessary looping during ipv6 routing lookups, from Martin KaFai Lau. 5) Add PAUSE frame generation support to gianfar driver, from Matei Pavaluca. 6) Allow for larger reordering levels in TCP, which are easily achievable in the real world right now, from Eric Dumazet. 7) Add a variable of napi_schedule that doesn't need to disable cpu interrupts, from Eric Dumazet. 8) Use a doubly linked list to optimize neigh_parms_release(), from Nicolas Dichtel. 9) Various enhancements to the kernel BPF verifier, and allow eBPF programs to actually be attached to sockets. From Alexei Starovoitov. 10) Support TSO/LSO in sunvnet driver, from David L Stevens. 11) Allow controlling ECN usage via routing metrics, from Florian Westphal. 12) Remote checksum offload, from Tom Herbert. 13) Add split-header receive, BQL, and xmit_more support to amd-xgbe driver, from Thomas Lendacky. 14) Add MPLS support to openvswitch, from Simon Horman. 15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen Klassert. 16) Do gro flushes on a per-device basis using a timer, from Eric Dumazet. This tries to resolve the conflicting goals between the desired handling of bulk vs. RPC-like traffic. 17) Allow userspace to ask for the CPU upon what a packet was received/steered, via SO_INCOMING_CPU. From Eric Dumazet. 18) Limit GSO packets to half the current congestion window, from Eric Dumazet. 19) Add a generic helper so that all drivers set their RSS keys in a consistent way, from Eric Dumazet. 20) Add xmit_more support to enic driver, from Govindarajulu Varadarajan. 21) Add VLAN packet scheduler action, from Jiri Pirko. 22) Support configurable RSS hash functions via ethtool, from Eyal Perry. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits) Fix race condition between vxlan_sock_add and vxlan_sock_release net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header net/mlx4: Add support for A0 steering net/mlx4: Refactor QUERY_PORT net/mlx4_core: Add explicit error message when rule doesn't meet configuration net/mlx4: Add A0 hybrid steering net/mlx4: Add mlx4_bitmap zone allocator net/mlx4: Add a check if there are too many reserved QPs net/mlx4: Change QP allocation scheme net/mlx4_core: Use tasklet for user-space CQ completion events net/mlx4_core: Mask out host side virtualization features for guests net/mlx4_en: Set csum level for encapsulated packets be2net: Export tunnel offloads only when a VxLAN tunnel is created gianfar: Fix dma check map error when DMA_API_DEBUG is enabled cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call net: fec: only enable mdio interrupt before phy device link up net: fec: clear all interrupt events to support i.MX6SX net: fec: reset fep link status in suspend function net: sock: fix access via invalid file descriptor net: introduce helper macro for_each_cmsghdr ...
2014-12-09put iov_iter into msghdrAl Viro
Note that the code _using_ ->msg_iter at that point will be very unhappy with anything other than unshifted iovec-backed iov_iter. We still need to convert users to proper primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09af_packet: virtio 1.0 stubsMichael S. Tsirkin
This merely fixes sparse warnings, without actually adding support for the new APIs. Still working out the best way to enable the new functionality. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-11-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2014-11-24af_packet: fix sparse warningMichael S. Tsirkin
af_packet produces lots of these: net/packet/af_packet.c:384:39: warning: incorrect type in return expression (different modifiers) net/packet/af_packet.c:384:39: expected struct page [pure] * net/packet/af_packet.c:384:39: got struct page * this seems to be because sparse does not realize that _pure refers to function, not the returned pointer. Tweak code slightly to avoid the warning. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-24switch AF_PACKET and AF_UNIX to skb_copy_datagram_from_iter()Al Viro
... and kill skb_copy_datagram_iovec() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24new helper: memcpy_to_msg()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24new helper: memcpy_from_msg()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-21packet: make packet_snd fail on len smaller than l2 headerWillem de Bruijn
When sending packets out with PF_PACKET, SOCK_RAW, ensure that the packet is at least as long as the device's expected link layer header. This check already exists in tpacket_snd, but not in packet_snd. Also rate limit the warning in tpacket_snd. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05net: Add and use skb_copy_datagram_msg() helper.David S. Miller
This encapsulates all of the skb_copy_datagram_iovec() callers with call argument signature "skb, offset, msghdr->msg_iov, length". When we move to iov_iters in the networking, the iov_iter object will sit in the msghdr. Having a helper like this means there will be less places to touch during that transformation. Based upon descriptions and patch from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01net: Pass a "more" indication down into netdev_start_xmit() code paths.David S. Miller
For now it will always be false. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01net: Do txq_trans_update() in netdev_start_xmit()David S. Miller
That way we don't have to audit every call site to make sure it is doing this properly. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-29net: add skb_get_tx_queue() helperDaniel Borkmann
Replace occurences of skb_get_queue_mapping() and follow-up netdev_get_tx_queue() with an actual helper function. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24net: Add ops->ndo_xmit_flush()David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-21packet: handle too big packets for PACKET_V3Eric Dumazet
af_packet can currently overwrite kernel memory by out of bound accesses, because it assumed a [new] block can always hold one frame. This is not generally the case, even if most existing tools do it right. This patch clamps too long frames as API permits, and issue a one time error on syslog. [ 394.357639] tpacket_rcv: packet too big, clamped from 5042 to 3966. macoff=82 In this example, packet header tp_snaplen was set to 3966, and tp_len was set to 5042 (skb->len) Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Acked-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29packet: remove deprecated syststamp timestampWillem de Bruijn
No device driver will ever return an skb_shared_info structure with syststamp non-zero, so remove the branch that tests for this and optionally marks the packet timestamp as TP_STATUS_TS_SYS_HARDWARE. Do not remove the definition TP_STATUS_TS_SYS_HARDWARE, as processes may refer to it. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15packet: remove unnecessary break after returnFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24net: Use netlink_ns_capable to verify the permisions of netlink messagesEric W. Biederman
It is possible by passing a netlink socket to a more privileged executable and then to fool that executable into writing to the socket data that happens to be valid netlink message to do something that privileged executable did not intend to do. To keep this from happening replace bare capable and ns_capable calls with netlink_capable, netlink_net_calls and netlink_ns_capable calls. Which act the same as the previous calls except they verify that the opener of the socket had the desired permissions as well. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24net: Move the permission check in sock_diag_put_filterinfo to packet_diag_dumpEric W. Biederman
The permission check in sock_diag_put_filterinfo is wrong, and it is so removed from it's sources it is not clear why it is wrong. Move the computation into packet_diag_dump and pass a bool of the result into sock_diag_filterinfo. This does not yet correct the capability check but instead simply moves it to make it clear what is going on. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-22net: Fix ns_capable check in sock_diag_put_filterinfoAndrew Lutomirski
The caller needs capabilities on the namespace being queried, not on their own namespace. This is a security bug, although it likely has only a minor impact. Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11net: Fix use after free by removing length arg from sk_data_ready callbacks.David S. Miller
Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-03packet: fix packet_direct_xmit for BQL enabled driversDaniel Borkmann
Currently, in packet_direct_xmit() we test the assigned netdevice queue for netif_xmit_frozen_or_stopped() before doing an ndo_start_xmit(). This can have the side-effect that BQL enabled drivers which make use of netdev_tx_sent_queue() internally, set __QUEUE_STATE_STACK_XOFF from within the stack and would not fully fill the device's TX ring from packet sockets with PACKET_QDISC_BYPASS enabled. Instead, use a test without BQL bit so that bursts can be absorbed into the NICs TX ring. Fix and code suggested by Eric Dumazet, thanks! Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-03packet: report tx_dropped in packet_direct_xmitDaniel Borkmann
Since commit 015f0688f57c ("net: net: add a core netdev->tx_dropped counter"), we can now account for TX drops from within the core stack instead of drivers. Therefore, fix packet_direct_xmit() and increase drop count when we encounter a problem before driver's xmit function was called (we do not want to doubly account for it). Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28packet: respect devices with LLTX flag in direct xmitDaniel Borkmann
Quite often it can be useful to test with dummy or similar devices as a blackhole sink for skbs. Such devices are only equipped with a single txq, but marked as NETIF_F_LLTX as they do not require locking their internal queues on xmit (or implement locking themselves). Therefore, rather use HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX will be respected. trafgen mmap/TX_RING example against dummy device with config foo: { fill(0xff, 64) } results in the following performance improvements for such scenarios on an ordinary Core i7/2.80GHz: Before: Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs): 160,975,944,159 instructions:k # 0.55 insns per cycle ( +- 0.09% ) 293,319,390,278 cycles:k # 0.000 GHz ( +- 0.35% ) 192,501,104 branch-misses:k ( +- 1.63% ) 831 context-switches:k ( +- 9.18% ) 7 cpu-migrations:k ( +- 7.40% ) 69,382 cache-misses:k # 0.010 % of all cache refs ( +- 2.18% ) 671,552,021 cache-references:k ( +- 1.29% ) 22.856401569 seconds time elapsed ( +- 0.33% ) After: Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs): 133,788,739,692 instructions:k # 0.92 insns per cycle ( +- 0.06% ) 145,853,213,256 cycles:k # 0.000 GHz ( +- 0.17% ) 59,867,100 branch-misses:k ( +- 4.72% ) 384 context-switches:k ( +- 3.76% ) 6 cpu-migrations:k ( +- 6.28% ) 70,304 cache-misses:k # 0.077 % of all cache refs ( +- 1.73% ) 90,879,408 cache-references:k ( +- 1.35% ) 11.719372413 seconds time elapsed ( +- 0.24% ) Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-26net: Rename skb->rxhash to skb->hashTom Herbert
The packet hash can be considered a property of the packet, not just on RX path. This patch changes name of rxhash and l4_rxhash skbuff fields to be hash and l4_hash respectively. This includes changing uses of the field in the code which don't call the access functions. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-28packet: allow to transmit +4 byte in TX_RING slot for VLAN caseDaniel Borkmann
Commit 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.") added the possibility for non-mmaped frames to send extra 4 byte for VLAN header so the MTU increases from 1500 to 1504 byte, for example. Commit cbd89acb9eb2 ("af_packet: fix for sending VLAN frames via packet_mmap") attempted to fix that for the mmap part but was reverted as it caused regressions while using eth_type_trans() on output path. Lets just act analogous to 57f89bfa2140 and add a similar logic to TX_RING. We presume size_max as overcharged with +4 bytes and later on after skb has been built by tpacket_fill_skb() check for ETH_P_8021Q header on packets larger than normal MTU. Can be easily reproduced with a slightly modified trafgen in mmap(2) mode, test cases: { fill(0xff, 12) const16(0x8100) fill(0xff, <1504|1505>) } { fill(0xff, 12) const16(0x0806) fill(0xff, <1500|1501>) } Note that we need to do the test right after tpacket_fill_skb() as sockets can have PACKET_LOSS set where we would not fail but instead just continue to traverse the ring. Reported-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Ben Greear <greearb@candelatech.com> Cc: Phil Sutter <phil@nwl.cc> Tested-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>