aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-03-04io-wq: provide an io_wq_put_and_exit() helperJens Axboe
If we put the io-wq from io_uring, we really want it to exit. Provide a helper that does that for us. Couple that with not having the manager hold a reference to the 'wq' and the normal SQPOLL exit will tear down the io-wq context appropriate. On the io-wq side, our wq context is per task, so only the task itself is manipulating ->manager and hence it's safe to check and clear without any extra locking. We just need to ensure that the manager task stays around, in case it exits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io_uring: don't use complete_all() on SQPOLL thread exitJens Axboe
We want to reuse this completion, and a single complete should do just fine. Ensure that we park ourselves first if requested, as that is what lead to the initial deadlock in this area. If we've got someone attempting to park us, then we can't proceed without having them finish first. Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io_uring: run fallback on cancellationPavel Begunkov
io_uring_try_cancel_requests() matches not only current's requests, but also of other exiting tasks, so we need to actively cancel them and not just wait, especially since the function can be called on flush during do_exit() -> exit_files(). Even if it's not a problem for now, it's much nicer to know that the function tries to cancel everything it can. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io_uring: SQPOLL stop error handling fixesJens Axboe
If we fail to fork an SQPOLL worker, we can hit cancel, and hence attempted thread stop, with the thread already being stopped. Ensure we check for that. Also guard thread stop fully by the sqd mutex, just like we do for park. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: fix double put of 'wq' in error pathJens Axboe
We are already freeing the wq struct in both spots, so don't put it and get it freed twice. Reported-by: syzbot+7bf785eedca35ca05501@syzkaller.appspotmail.com Fixes: 4fb6ac326204 ("io-wq: improve manager/worker handling over exec") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: wait for manager exit on wq destroyJens Axboe
The manager waits for the workers, hence the manager is always valid if workers are running. Now also have wq destroy wait for the manager on exit, so we now everything is gone. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: rename wq->done completion to wq->startedJens Axboe
This is a leftover from a different use cases, it's used to wait for the manager to startup. Rename it as such. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: don't ask for a new worker if we're exitingJens Axboe
If we're in the process of shutting down the async context, then don't create new workers if we already have at least the fixed one. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: have manager wait for all workers to exitJens Axboe
Instead of having to wait separately on workers and manager, just have the manager wait on the workers. We use an atomic_t for the reference here, as we need to start at 0 and allow increment from that. Since the number of workers is naturally capped by the allowed nr of processes, and that uses an int, there is no risk of overflow. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-01io-wq: wait for worker startup when forking a new oneJens Axboe
We need to have our worker count updated before continuing, to avoid cases where we repeatedly think we need a new worker, but a fork is already in progress. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-28Merge tag 'xfs-5.12-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull more xfs updates from Darrick Wong: "The most notable fix here prevents premature reuse of freed metadata blocks, and adding the ability to detect accidental nested transactions, which are not allowed here. - Restore a disused sysctl control knob that was inadvertently dropped during the merge window to avoid fstests regressions. - Don't speculatively release freed blocks from the busy list until we're actually allocating them, which fixes a rare log recovery regression. - Don't nest transactions when scanning for free space. - Add an idiot^Wmaintainer light to detect nested transactions. ;)" * tag 'xfs-5.12-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: use current->journal_info for detecting transaction recursion xfs: don't nest transactions when scanning for eofblocks xfs: don't reuse busy extents on extent trim xfs: restore speculative_cow_prealloc_lifetime sysctl
2021-02-28Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "A few stragglers (and one due to me missing it originally), and fixes for changes in this merge window mostly. In particular: - blktrace cleanups (Chaitanya, Greg) - Kill dead blk_pm_* functions (Bart) - Fixes for the bio alloc changes (Christoph) - Fix for the partition changes (Christoph, Ming) - Fix for turning off iopoll with polled IO inflight (Jeffle) - nbd disconnect fix (Josef) - loop fsync error fix (Mauricio) - kyber update depth fix (Yang) - max_sectors alignment fix (Mikulas) - Add bio_max_segs helper (Matthew)" * tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits) block: Add bio_max_segs blktrace: fix documentation for blk_fill_rw() block: memory allocations in bounce_clone_bio must not fail block: remove the gfp_mask argument to bounce_clone_bio block: fix bounce_clone_bio for passthrough bios block-crypto-fallback: use a bio_set for splitting bios block: fix logging on capacity change blk-settings: align max_sectors on "logical_block_size" boundary block: reopen the device in blkdev_reread_part block: don't skip empty device in in disk_uevent blktrace: remove debugfs file dentries from struct blk_trace nbd: handle device refs for DESTROY_ON_DISCONNECT properly kyber: introduce kyber_depth_updated() loop: fix I/O error on fsync() in detached loop devices block: fix potential IO hang when turning off io_poll block: get rid of the trace rq insert wrapper blktrace: fix blk_rq_merge documentation blktrace: fix blk_rq_issue documentation blktrace: add blk_fill_rwbs documentation comment block: remove superfluous param in blk_fill_rwbs() ...
2021-02-27Merge tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring thread rewrite from Jens Axboe: "This converts the io-wq workers to be forked off the tasks in question instead of being kernel threads that assume various bits of the original task identity. This kills > 400 lines of code from io_uring/io-wq, and it's the worst part of the code. We've had several bugs in this area, and the worry is always that we could be missing some pieces for file types doing unusual things (recent /dev/tty example comes to mind, userfaultfd reads installing file descriptors is another fun one... - both of which need special handling, and I bet it's not the last weird oddity we'll find). With these identical workers, we can have full confidence that we're never missing anything. That, in itself, is a huge win. Outside of that, it's also more efficient since we're not wasting space and code on tracking state, or switching between different states. I'm sure we're going to find little things to patch up after this series, but testing has been pretty thorough, from the usual regression suite to production. Any issue that may crop up should be manageable. There's also a nice series of further reductions we can do on top of this, but I wanted to get the meat of it out sooner rather than later. The general worry here isn't that it's fundamentally broken. Most of the little issues we've found over the last week have been related to just changes in how thread startup/exit is done, since that's the main difference between using kthreads and these kinds of threads. In fact, if all goes according to plan, I want to get this into the 5.10 and 5.11 stable branches as well. That said, the changes outside of io_uring/io-wq are: - arch setup, simple one-liner to each arch copy_thread() implementation. - Removal of net and proc restrictions for io_uring, they are no longer needed or useful" * tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits) io-wq: remove now unused IO_WQ_BIT_ERROR io_uring: fix SQPOLL thread handling over exec io-wq: improve manager/worker handling over exec io_uring: ensure SQPOLL startup is triggered before error shutdown io-wq: make buffered file write hashed work map per-ctx io-wq: fix race around io_worker grabbing io-wq: fix races around manager/worker creation and task exit io_uring: ensure io-wq context is always destroyed for tasks arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread() io_uring: cleanup ->user usage io-wq: remove nr_process accounting io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS net: remove cmsg restriction from io_uring based send/recvmsg calls Revert "proc: don't allow async path resolution of /proc/self components" Revert "proc: don't allow async path resolution of /proc/thread-self components" io_uring: move SQPOLL thread io-wq forked worker io-wq: make io_wq_fork_thread() available to other users io-wq: only remove worker from free_list, if it was there io_uring: remove io_identity io_uring: remove any grabbing of context ...
2021-02-27Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff pile - no common topic here" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: whack-a-mole: don't open-code iminor/imajor 9p: fix misuse of sscanf() in v9fs_stat2inode() audit_alloc_mark(): don't open-code ERR_CAST() fs/inode.c: make inode_init_always() initialize i_ino to 0 vfs: don't unnecessarily clone write access for writable fds
2021-02-26block: Add bio_max_segsMatthew Wilcox (Oracle)
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to be unsigned to make it easier for the users. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-26Merge tag '5.12-smb3-part1' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs updates from Steve French: - improvements to mode bit conversion, chmod and chown when using cifsacl mount option - two new mount options for controlling attribute caching - improvements to crediting and reconnect, improved debugging - reconnect fix - add SMB3.1.1 dialect to default dialects for vers=3 * tag '5.12-smb3-part1' of git://git.samba.org/sfrench/cifs-2.6: (27 commits) cifs: update internal version number cifs: use discard iterator to discard unneeded network data more efficiently cifs: introduce helper for finding referral server to improve DFS target resolution cifs: check all path components in resolved dfs target cifs: fix DFS failover cifs: fix nodfs mount option cifs: fix handling of escaped ',' in the password mount argument cifs: Add new parameter "acregmax" for distinct file and directory metadata timeout cifs: convert revalidate of directories to using directory metadata cache timeout cifs: Add new mount parameter "acdirmax" to allow caching directory metadata cifs: If a corrupted DACL is returned by the server, bail out. cifs: minor simplification to smb2_is_network_name_deleted TCON Reconnect during STATUS_NETWORK_NAME_DELETED cifs: cleanup a few le16 vs. le32 uses in cifsacl.c cifs: Change SIDs in ACEs while transferring file ownership. cifs: Retain old ACEs when converting between mode bits and ACL. cifs: Fix cifsacl ACE mask for group and others. cifs: clarify hostname vs ip address in /proc/fs/cifs/DebugData cifs: change confusing field serverName (to ip_addr) cifs: Fix inconsistent IS_ERR and PTR_ERR ...
2021-02-26Merge tag 'for-5.12/io_uring-2021-02-25' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more io_uring updates from Jens Axboe: "A collection of later fixes that we should get into this release: - Series of submission cleanups (Pavel) - A few fixes for issues from earlier this merge window (Pavel, me) - IOPOLL resubmission fix - task_work locking fix (Hao)" * tag 'for-5.12/io_uring-2021-02-25' of git://git.kernel.dk/linux-block: (25 commits) Revert "io_uring: wait potential ->release() on resurrect" io_uring: fix locked_free_list caches_free() io_uring: don't attempt IO reissue from the ring exit path io_uring: clear request count when freeing caches io_uring: run task_work on io_uring_register() io_uring: fix leaving invalid req->flags io_uring: wait potential ->release() on resurrect io_uring: keep generic rsrc infra generic io_uring: zero ref_node after killing it io_uring: make the !CONFIG_NET helpers a bit more robust io_uring: don't hold uring_lock when calling io_run_task_work* io_uring: fail io-wq submission from a task_work io_uring: don't take uring_lock during iowq cancel io_uring: fail links more in io_submit_sqe() io_uring: don't do async setup for links' heads io_uring: do io_*_prep() early in io_submit_sqe() io_uring: split sqe-prep and async setup io_uring: don't submit link on error io_uring: move req link into submit_state io_uring: move io_init_req() into io_submit_sqe() ...
2021-02-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "118 patches: - The rest of MM. Includes kfence - another runtime memory validator. Not as thorough as KASAN, but it has unmeasurable overhead and is intended to be usable in production builds. - Everything else Subsystems affected by this patch series: alpha, procfs, sysctl, misc, core-kernel, MAINTAINERS, lib, bitops, checkpatch, init, coredump, seq_file, gdb, ubsan, initramfs, and mm (thp, cma, vmstat, memory-hotplug, mlock, rmap, zswap, zsmalloc, cleanups, kfence, kasan2, and pagemap2)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits) MIPS: make userspace mapping young by default initramfs: panic with memory information ubsan: remove overflow checks kgdb: fix to kill breakpoints on initmem after boot scripts/gdb: fix list_for_each x86: fix seq_file iteration for pat/memtype.c seq_file: document how per-entry resources are managed. fs/coredump: use kmap_local_page() init/Kconfig: fix a typo in CC_VERSION_TEXT help text init: clean up early_param_on_off() macro init/version.c: remove Version_<LINUX_VERSION_CODE> symbol checkpatch: do not apply "initialise globals to 0" check to BPF progs checkpatch: don't warn about colon termination in linker scripts checkpatch: add kmalloc_array_node to unnecessary OOM message check checkpatch: add warning for avoiding .L prefix symbols in assembly files checkpatch: improve TYPECAST_INT_CONSTANT test message checkpatch: prefer ftrace over function entry/exit printks checkpatch: trivial style fixes checkpatch: ignore warning designated initializers using NR_CPUS checkpatch: improve blank line after declaration test ...
2021-02-26fs/coredump: use kmap_local_page()Ira Weiny
In dump_user_range() there is no reason for the mapping to be global. Use kmap_local_page() rather than kmap. Link: https://lkml.kernel.org/r/20210203223328.558945-1-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26proc: use kvzalloc for our kernel bufferJosef Bacik
Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blindly kzalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, so use kvzalloc for these buffers. [willy@infradead.org: changelog tweaks] Link: https://lkml.kernel.org/r/6345270a2c1160b89dd5e6715461f388176899d1.1612972413.git.josef@toxicpanda.com Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> CC: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26proc/wchan: use printk format instead of lookup_symbol_name()Helge Deller
To resolve the symbol fuction name for wchan, use the printk format specifier %ps instead of manually looking up the symbol function name via lookup_symbol_name(). Link: https://lkml.kernel.org/r/20201217165413.GA1959@ls3530.fritz.box Signed-off-by: Helge Deller <deller@gmx.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26iomap: use mapping_seek_hole_dataMatthew Wilcox (Oracle)
Enhance mapping_seek_hole_data() to handle partially uptodate pages and convert the iomap seek code to call it. Link: https://lkml.kernel.org/r/20201112212641.27837-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS Client Updates from Anna Schumaker: "New Features: - Support for eager writes, and the write=eager and write=wait mount options - Other Bugfixes and Cleanups: - Fix typos in some comments - Fix up fall-through warnings for Clang - Cleanups to the NFS readpage codepath - Remove FMR support in rpcrdma_convert_iovs() - Various other cleanups to xprtrdma - Fix xprtrdma pad optimization for servers that don't support RFC 8797 - Improvements to rpcrdma tracepoints - Fix up nfs4_bitmask_adjust() - Optimize sparse writes past the end of files" * tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits) NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo NFS: Set the stable writes flag when initialising the super block NFS: Add mount options supporting eager writes NFS: Add support for eager writes NFS: 'flags' field should be unsigned in struct nfs_server NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache NFS: Always clear an invalid mapping when attempting a buffered write NFS: Optimise sparse writes past the end of file NFS: Fix documenting comment for nfs_revalidate_file_size() NFSv4: Fixes for nfs4_bitmask_adjust() xprtrdma: Clean up rpcrdma_prepare_readch() rpcrdma: Capture bytes received in Receive completion tracepoints xprtrdma: Pad optimization, revisited rpcrdma: Fix comments about reverse-direction operation xprtrdma: Refactor invocations of offset_in_page() xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map() xprtrdma: Remove FMR support in rpcrdma_convert_iovs() NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async() NFS: Call readpage_async_filler() from nfs_readpage_async() NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc ...
2021-02-25cifs: update internal version numberSteve French
To 2.31 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: use discard iterator to discard unneeded network data more efficientlyDavid Howells
The iterator, ITER_DISCARD, that can only be used in READ mode and just discards any data copied to it, was added to allow a network filesystem to discard any unwanted data sent by a server. Convert cifs_discard_from_socket() to use this. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: introduce helper for finding referral server to improve DFS target ↵Paulo Alcantara
resolution Some servers seem to mistakenly report different values for capabilities and share flags, so we can't always rely on those values to decide whether the resolved target can handle any new DFS referrals. Add a new helper is_referral_server() to check if all resolved targets can handle new DFS referrals by directly looking at the GET_DFS_REFERRAL.ReferralHeaderFlags value as specified in MS-DFSC 2.2.4 RESP_GET_DFS_REFERRAL in addition to is_tcon_dfs(). Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: check all path components in resolved dfs targetPaulo Alcantara
Handle the case where a resolved target share is like //server/users/dir, and the user "foo" has no read permission to access the parent folder "users" but has access to the final path component "dir". is_path_remote() already implements that, so call it directly. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: fix DFS failoverPaulo Alcantara
In do_dfs_failover(), the mount_get_conns() function requires the full fs context in order to get new connection to server, so clone the original context and change it accordingly when retrying the DFS targets in the referral. If failover was successful, then update original context with the new UNC, prefix path and ip address. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: fix nodfs mount optionPaulo Alcantara
Skip DFS resolving when mounting with 'nodfs' even if CONFIG_CIFS_DFS_UPCALL is enabled. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: fix handling of escaped ',' in the password mount argumentRonnie Sahlberg
Passwords can contain ',' which are also used as the separator between mount options. Mount.cifs will escape all ',' characters as the string ",,". Update parsing of the mount options to detect ",," and treat it as a single 'c' character. Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Cc: stable@vger.kernel.org # 5.11 Reported-by: Simon Taylor <simon@simon-taylor.me.uk> Tested-by: Simon Taylor <simon@simon-taylor.me.uk> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Miscellaneous ext4 cleanups and bug fixes. Pretty boring this cycle..." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: add .kunitconfig fragment to enable ext4-specific tests ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it ext4: reset retry counter when ext4_alloc_file_blocks() makes progress ext4: fix potential htree index checksum corruption ext4: factor out htree rep invariant check ext4: Change list_for_each* to list_for_each_entry* ext4: don't try to processed freed blocks until mballoc is initialized ext4: use DEFINE_MUTEX() for mutex lock
2021-02-25cifs: Add new parameter "acregmax" for distinct file and directory metadata ↵Steve French
timeout The new optional mount parameter "acregmax" allows a different timeout for file metadata ("acdirmax" now allows controlling timeout for directory metadata). Setting "actimeo" still works as before, and changes timeout for both files and directories, but specifying "acregmax" or "acdirmax" allows overriding the default more granularly which can be a big performance benefit on some workloads. "acregmax" is already used by NFS as a mount parameter (albeit with a larger default and thus looser caching). Suggested-by: Tom Talpey <tom@talpey.com> Reviewed-By: Tom Talpey <tom@talpey.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: convert revalidate of directories to using directory metadata cache ↵Steve French
timeout The new optional mount parm, "acdirmax" allows caching the metadata for a directory longer than file metadata, which can be very helpful for performance. Convert cifs_inode_needs_reval to check acdirmax for revalidating directory metadata. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-By: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: Add new mount parameter "acdirmax" to allow caching directory metadataSteve French
nfs and cifs on Linux currently have a mount parameter "actimeo" to control metadata (attribute) caching but cifs does not have additional mount parameters to allow distinguishing between caching directory metadata (e.g. needed to revalidate paths) and that for files. Add new mount parameter "acdirmax" to allow caching metadata for directories more loosely than file data. NFS adjusts metadata caching from acdirmin to acdirmax (and another two mount parms for files) but to reduce complexity, it is safer to just introduce the one mount parm to allow caching directories longer. The defaults for acdirmax and actimeo (for cifs.ko) are conservative, 1 second (NFS defaults acdirmax to 60 seconds). For many workloads, setting acdirmax to a higher value is safe and will improve performance. This patch leaves unchanged the default values for caching metadata for files and directories but gives the user more flexibility in adjusting them safely for their workload via the new mount parm. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-By: Tom Talpey <tom@talpey.com>
2021-02-25io-wq: remove now unused IO_WQ_BIT_ERRORJens Axboe
This flag is now dead, remove it. Fixes: 1cbd9c2bcf02 ("io-wq: don't create any IO workers upfront") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io_uring: fix SQPOLL thread handling over execJens Axboe
Just like the changes for io-wq, ensure that we re-fork the SQPOLL thread if the owner execs. Mark the ctx sq thread as sqo_exec if it dies, and the ring as needing a wakeup which will force the task to enter the kernel. When it does, setup the new thread and proceed as usual. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io-wq: improve manager/worker handling over execJens Axboe
exec will cancel any threads, including the ones that io-wq is using. This isn't a problem, in fact we'd prefer it to be that way since it means we know that any async work cancels naturally without having to handle it proactively. But it does mean that we need to setup a new manager, as the manager and workers are gone. Handle this at queue time, and cancel work if we fail. Since the manager can go away without us noticing, ensure that the manager itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to io_wq_put() to reflect that. In the future we can now simplify exec cancelation handling, for now just leave it the same. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io_uring: ensure SQPOLL startup is triggered before error shutdownJens Axboe
syzbot reports the following hang: INFO: task syz-executor.0:12538 can't die for more than 143 seconds. task:syz-executor.0 state:D stack:28352 pid:12538 ppid: 8423 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4324 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5075 schedule+0xcf/0x270 kernel/sched/core.c:5154 schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x168/0x270 kernel/sched/completion.c:138 io_sq_thread_finish+0x96/0x580 fs/io_uring.c:7152 io_sq_offload_create fs/io_uring.c:7929 [inline] io_uring_create fs/io_uring.c:9465 [inline] io_uring_setup+0x1fb2/0x2c20 fs/io_uring.c:9550 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae which is due to exiting after the SQPOLL thread has been created, but hasn't been started yet. Ensure that we always complete the startup side when waiting for it to exit. Reported-by: syzbot+c927c937cba8ef66dd4a@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io-wq: make buffered file write hashed work map per-ctxJens Axboe
Before the io-wq thread change, we maintained a hash work map and lock per-node per-ring. That wasn't ideal, as we really wanted it to be per ring. But now that we have per-task workers, the hash map ends up being just per-task. That'll work just fine for the normal case of having one task use a ring, but if you share the ring between tasks, then it's considerably worse than it was before. Make the hash map per ctx instead, which provides full per-ctx buffered write serialization on hashed writes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25xfs: use current->journal_info for detecting transaction recursionDave Chinner
Because the iomap code using PF_MEMALLOC_NOFS to detect transaction recursion in XFS is just wrong. Remove it from the iomap code and replace it with XFS specific internal checks using current->journal_info instead. [djwong: This change also realigns the lifetime of NOFS flag changes to match the incore transaction, instead of the inconsistent scheme we have now.] Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25xfs: don't nest transactions when scanning for eofblocksDarrick J. Wong
Brian Foster reported a lockdep warning on xfs/167: ============================================ WARNING: possible recursive locking detected 5.11.0-rc4 #35 Tainted: G W I -------------------------------------------- fsstress/17733 is trying to acquire lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs] but task is already holding lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs] stack backtrace: CPU: 38 PID: 17733 Comm: fsstress Tainted: G W I 5.11.0-rc4 #35 Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018 Call Trace: dump_stack+0x8b/0xb0 __lock_acquire.cold+0x159/0x2ab lock_acquire+0x116/0x370 xfs_trans_alloc+0x1ad/0x310 [xfs] xfs_free_eofblocks+0x104/0x1d0 [xfs] xfs_blockgc_scan_inode+0x24/0x60 [xfs] xfs_inode_walk_ag+0x202/0x4b0 [xfs] xfs_inode_walk+0x66/0xc0 [xfs] xfs_trans_alloc+0x160/0x310 [xfs] xfs_trans_alloc_inode+0x5f/0x160 [xfs] xfs_alloc_file_space+0x105/0x300 [xfs] xfs_file_fallocate+0x270/0x460 [xfs] vfs_fallocate+0x14d/0x3d0 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The cause of this is the new code that spurs a scan to garbage collect speculative preallocations if we fail to reserve enough blocks while allocating a transaction. While the warning itself is a fairly benign lockdep complaint, it does expose a potential livelock if the rwsem behavior ever changes with regards to nesting read locks when someone's waiting for a write lock. Fix this by freeing the transaction and jumping back to xfs_trans_alloc like this patch in the V4 submission[1]. [1] https://lore.kernel.org/linux-xfs/161142798066.2171939.9311024588681972086.stgit@magnolia/ Fixes: a1a7d05a0576 ("xfs: flush speculative space allocations when we run out of space") Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25xfs: don't reuse busy extents on extent trimBrian Foster
Freed extents are marked busy from the point the freeing transaction commits until the associated CIL context is checkpointed to the log. This prevents reuse and overwrite of recently freed blocks before the changes are committed to disk, which can lead to corruption after a crash. The exception to this rule is that metadata allocation is allowed to reuse busy extents because metadata changes are also logged. As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS has allowed modification or complete invalidation of outstanding busy extents for metadata allocations. This implementation assumes that use of the associated extent is imminent, which is not always the case. For example, the trimmed extent might not satisfy the minimum length of the allocation request, or the allocation algorithm might be involved in a search for the optimal result based on locality. generic/019 reproduces a corruption caused by this scenario. First, a metadata block (usually a bmbt or symlink block) is freed from an inode. A subsequent bmbt split on an unrelated inode attempts a near mode allocation request that invalidates the busy block during the search, but does not ultimately allocate it. Due to the busy state invalidation, the block is no longer considered busy to subsequent allocation. A direct I/O write request immediately allocates the block and writes to it. Finally, the filesystem crashes while in a state where the initial metadata block free had not committed to the on-disk log. After recovery, the original metadata block is in its original location as expected, but has been corrupted by the aforementioned dio. This demonstrates that it is fundamentally unsafe to modify busy extent state for extents that are not guaranteed to be allocated. This applies to pretty much all of the code paths that currently trim busy extents for one reason or another. Therefore to address this problem, drop the reuse mechanism from the busy extent trim path. This code already knows how to return partial non-busy ranges of the targeted free extent and higher level code tracks the busy state of the allocation attempt. If a block allocation fails where one or more candidate extents is busy, we force the log and retry the allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25Revert "io_uring: wait potential ->release() on resurrect"Jens Axboe
This reverts commit 88f171ab7798a1ed0b9e39867ee16f307466e870. I ran into a case where the ref resurrect now spins, so revert this change for now until we can further investigate why it's broken. The bug seems to indicate spinning on the lock itself, likely there's some ABBA deadlock involved: [<0>] __percpu_ref_switch_mode+0x45/0x180 [<0>] percpu_ref_resurrect+0x46/0x70 [<0>] io_refs_resurrect+0x25/0xa0 [<0>] __io_uring_register+0x135/0x10c0 [<0>] __x64_sys_io_uring_register+0xc2/0x1a0 [<0>] do_syscall_64+0x42/0x110 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "A few small subsystems and some of MM. 172 patches. Subsystems affected by this patch series: hexagon, scripts, ntfs, ocfs2, vfs, and mm (slab-generic, slab, slub, debug, pagecache, swap, memcg, pagemap, mprotect, mremap, page-reporting, vmalloc, kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction, mempolicy, oom-kill, hugetlbfs, and migration)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (172 commits) mm/migrate: remove unneeded semicolons hugetlbfs: remove unneeded return value of hugetlb_vmtruncate() hugetlbfs: fix some comment typos hugetlbfs: correct some obsolete comments about inode i_mutex hugetlbfs: make hugepage size conversion more readable hugetlbfs: remove meaningless variable avoid_reserve hugetlbfs: correct obsolete function name in hugetlbfs_read_iter() hugetlbfs: use helper macro default_hstate in init_hugetlbfs_fs hugetlbfs: remove useless BUG_ON(!inode) in hugetlbfs_setattr() hugetlbfs: remove special hugetlbfs_set_page_dirty() mm/hugetlb: change hugetlb_reserve_pages() to type bool mm, oom: fix a comment in dump_task() mm/mempolicy: use helper range_in_vma() in queue_pages_test_walk() numa balancing: migrate on fault among multiple bound nodes mm, compaction: make fast_isolate_freepages() stay within zone mm/compaction: fix misbehaviors of fast_find_migrateblock() mm/compaction: correct deferral logic for proactive compaction mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked mm/compaction: remove rcu_read_lock during page compaction z3fold: simplify the zhdr initialization code in init_z3fold_page() ...
2021-02-24hugetlbfs: remove unneeded return value of hugetlb_vmtruncate()Miaohe Lin
The function hugetlb_vmtruncate() is guaranteed to always success since commit 7aa91e104028 ("hugetlb: allow extending ftruncate on hugetlbfs"). So we should remove the unneeded return value which is always 0. Link: https://lkml.kernel.org/r/20210208084637.47789-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24hugetlbfs: fix some comment typosMiaohe Lin
Fix typos reserv to reserve, minimim to minimum. No functional change intended. Link: https://lkml.kernel.org/r/20210130092351.28072-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24hugetlbfs: correct some obsolete comments about inode i_mutexMiaohe Lin
Since commit 9902af79c01a ("parallel lookups: actual switch to rwsem"), i_mutex of inode is converted to i_rwsem. So replace i_mutex with i_rwsem to make comments up to date. Link: https://lkml.kernel.org/r/20210127093111.36672-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24hugetlbfs: make hugepage size conversion more readableMiaohe Lin
The calculation 1U << (h->order + PAGE_SHIFT - 10) is actually equal to (PAGE_SHIFT << (h->order)) >> 10. So we can make it more readable by replace it with huge_page_size(h) >> 10. Link: https://lkml.kernel.org/r/20210122083141.24548-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24hugetlbfs: remove meaningless variable avoid_reserveMiaohe Lin
The variable avoid_reserve is meaningless because we never changed its value and just passed it to alloc_huge_page(). So remove it to make code more clear that in hugetlbfs_fallocate, we never avoid reserve when alloc hugepage yet. Also add a comment offered by Mike Kravetz to explain this. Link: https://lkml.kernel.org/r/20210120071508.9078-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24hugetlbfs: correct obsolete function name in hugetlbfs_read_iter()Miaohe Lin
Since commit 36e789144267 ("kill do_generic_mapping_read"), the function do_generic_mapping_read() is renamed to do_generic_file_read(). And then commit 47c27bc46946 ("fs: pass iocb to do_generic_file_read") renamed it to generic_file_buffered_read(). So replace do_generic_mapping_read() with generic_file_buffered_read() to keep comment uptodate. Link: https://lkml.kernel.org/r/20210118063210.47118-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>