aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2022-12-14Merge tag 'hardening-v6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook) - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook) - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking - Improve code generation for strscpy() and update str*() kern-doc - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests - Always use a non-NULL argument for prepare_kernel_cred() - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell) - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li) - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu) - Fix um vs FORTIFY warnings for always-NULL arguments * tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits) ksmbd: replace one-element arrays with flexible-array members hpet: Replace one-element array with flexible-array member um: virt-pci: Avoid GCC non-NULL warning signal: Initialize the info in ksignal lib: fortify_kunit: build without structleak plugin panic: Expose "warn_count" to sysfs panic: Introduce warn_limit panic: Consolidate open-coded panic_on_warn checks exit: Allow oops_limit to be disabled exit: Expose "oops_count" to sysfs exit: Put an upper limit on how often we can oops panic: Separate sysctl logic from CONFIG_SMP mm/pgtable: Fix multiple -Wstringop-overflow warnings mm: Make ksize() a reporting-only function kunit/fortify: Validate __alloc_size attribute results drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid() drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid() driver core: Add __alloc_size hint to devm allocators overflow: Introduce overflows_type() and castable_to_type() coredump: Proactively round up to kmalloc bucket size ...
2022-12-12Merge tag 'for-6.2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This round there are a lot of cleanups and moved code so the diffstat looks huge, otherwise there are some nice performance improvements and an update to raid56 reliability. User visible features: - raid56 reliability vs performance trade off: - fix destructive RMW for raid5 data (raid6 still needs work): do full checksum verification for all data during RMW cycle, this should prevent rewriting potentially corrupted data without notice - stripes are cached in memory which should reduce the performance impact but still can hurt some workloads - checksums are verified after repair again - this is the last option without introducing additional features (write intent bitmap, journal, another tree), the extra checksum read/verification was supposed to be avoided by the original implementation exactly for performance reasons but that caused all the reliability problems - discard=async by default for devices that support it - implement emergency flush reserve to avoid almost all unnecessary transaction aborts due to ENOSPC in cases where there are too many delayed refs or delayed allocation - skip block group synchronization if there's no change in used bytes, can reduce transaction commit count for some workloads Performance improvements: - fiemap and lseek: - overall speedup due to skipping unnecessary or duplicate searches (-40% run time) - cache some data structures and sharedness of extents (-30% run time) - send: - faster backref resolution when finding clones - cached leaf to root mapping for faster backref walking - improved clone/sharing detection - overall run time improvements (-70%) Core: - module initialization converted to a table of function pointers run in a sequence - preparation for fscrypt, extend passing file names across calls, dir item can store encryption status - raid56 updates: - more accurate error tracking of sectors within stripe - simplify recovery path and remove dedicated endio worker kthread - simplify scrub call paths - refactoring to support the extra data checksum verification during RMW cycle - tree block parentness checks consolidated and done at metadata read time - improved error handling - cleanups: - move a lot of code for better synchronization between kernel and user space sources, split big files - enum cleanups - GFP flag cleanups - header file cleanups, prototypes, dependencies - redundant parameter cleanups - inline extent handling simplifications - inode parameter conversion - data structure cleanups, reductions, renames, merges" * tag 'for-6.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (249 commits) btrfs: print transaction aborted messages with an error level btrfs: sync some cleanups from progs into uapi/btrfs.h btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a range btrfs: fix extent map use-after-free when handling missing device in read_one_chunk btrfs: remove outdated logic from overwrite_item() and add assertion btrfs: unify overwrite_item() and do_overwrite_item() btrfs: replace strncpy() with strscpy() btrfs: fix uninitialized variable in find_first_clear_extent_bit btrfs: fix uninitialized parent in insert_state btrfs: add might_sleep() annotations btrfs: add stack helpers for a few btrfs items btrfs: add nr_global_roots to the super block definition btrfs: remove BTRFS_LEAF_DATA_OFFSET btrfs: add helpers for manipulating leaf items and data btrfs: add eb to btrfs_node_key_ptr_offset btrfs: pass the extent buffer for the btrfs_item_nr helpers btrfs: move the csum helpers into ctree.h btrfs: move eb offset helpers into extent_io.h btrfs: move file_extent_item helpers into file-item.h btrfs: move leaf_data_end into ctree.c ...
2022-12-12Merge tag 'fs.acl.rework.v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull VFS acl updates from Christian Brauner: "This contains the work that builds a dedicated vfs posix acl api. The origins of this work trace back to v5.19 but it took quite a while to understand the various filesystem specific implementations in sufficient detail and also come up with an acceptable solution. As we discussed and seen multiple times the current state of how posix acls are handled isn't nice and comes with a lot of problems: The current way of handling posix acls via the generic xattr api is error prone, hard to maintain, and type unsafe for the vfs until we call into the filesystem's dedicated get and set inode operations. It is already the case that posix acls are special-cased to death all the way through the vfs. There are an uncounted number of hacks that operate on the uapi posix acl struct instead of the dedicated vfs struct posix_acl. And the vfs must be involved in order to interpret and fixup posix acls before storing them to the backing store, caching them, reporting them to userspace, or for permission checking. Currently a range of hacks and duct tape exist to make this work. As with most things this is really no ones fault it's just something that happened over time. But the code is hard to understand and difficult to maintain and one is constantly at risk of introducing bugs and regressions when having to touch it. Instead of continuing to hack posix acls through the xattr handlers this series builds a dedicated posix acl api solely around the get and set inode operations. Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() helpers must be used in order to interact with posix acls. They operate directly on the vfs internal struct posix_acl instead of abusing the uapi posix acl struct as we currently do. In the end this removes all of the hackiness, makes the codepaths easier to maintain, and gets us type safety. This series passes the LTP and xfstests suites without any regressions. For xfstests the following combinations were tested: - xfs - ext4 - btrfs - overlayfs - overlayfs on top of idmapped mounts - orangefs - (limited) cifs There's more simplifications for posix acls that we can make in the future if the basic api has made it. A few implementation details: - The series makes sure to retain exactly the same security and integrity module permission checks. Especially for the integrity modules this api is a win because right now they convert the uapi posix acl struct passed to them via a void pointer into the vfs struct posix_acl format to perform permission checking on the mode. There's a new dedicated security hook for setting posix acls which passes the vfs struct posix_acl not a void pointer. Basing checking on the posix acl stored in the uapi format is really unreliable. The vfs currently hacks around directly in the uapi struct storing values that frankly the security and integrity modules can't correctly interpret as evidenced by bugs we reported and fixed in this area. It's not necessarily even their fault it's just that the format we provide to them is sub optimal. - Some filesystems like 9p and cifs need access to the dentry in order to get and set posix acls which is why they either only partially or not even at all implement get and set inode operations. For example, cifs allows setxattr() and getxattr() operations but doesn't allow permission checking based on posix acls because it can't implement a get acl inode operation. Thus, this patch series updates the set acl inode operation to take a dentry instead of an inode argument. However, for the get acl inode operation we can't do this as the old get acl method is called in e.g., generic_permission() and inode_permission(). These helpers in turn are called in various filesystem's permission inode operation. So passing a dentry argument to the old get acl inode operation would amount to passing a dentry to the permission inode operation which we shouldn't and probably can't do. So instead of extending the existing inode operation Christoph suggested to add a new one. He also requested to ensure that the get and set acl inode operation taking a dentry are consistently named. So for this version the old get acl operation is renamed to ->get_inode_acl() and a new ->get_acl() inode operation taking a dentry is added. With this we can give both 9p and cifs get and set acl inode operations and in turn remove their complex custom posix xattr handlers. In the future I hope to get rid of the inode method duplication but it isn't like we have never had this situation. Readdir is just one example. And frankly, the overall gain in type safety and the more pleasant api wise are simply too big of a benefit to not accept this duplication for a while. - We've done a full audit of every codepaths using variant of the current generic xattr api to get and set posix acls and surprisingly it isn't that many places. There's of course always a chance that we might have missed some and if so I'm sure we'll find them soon enough. The crucial codepaths to be converted are obviously stacking filesystems such as ecryptfs and overlayfs. For a list of all callers currently using generic xattr api helpers see [2] including comments whether they support posix acls or not. - The old vfs generic posix acl infrastructure doesn't obey the create and replace semantics promised on the setxattr(2) manpage. This patch series doesn't address this. It really is something we should revisit later though. The patches are roughly organized as follows: (1) Change existing set acl inode operation to take a dentry argument (Intended to be a non-functional change) (2) Rename existing get acl method (Intended to be a non-functional change) (3) Implement get and set acl inode operations for filesystems that couldn't implement one before because of the missing dentry. That's mostly 9p and cifs (Intended to be a non-functional change) (4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() including security and integrity hooks (Intended to be a non-functional change) (5) Implement get and set acl inode operations for stacking filesystems (Intended to be a non-functional change) (6) Switch posix acl handling in stacking filesystems to new posix acl api now that all filesystems it can stack upon support it. (7) Switch vfs to new posix acl api (semantical change) (8) Remove all now unused helpers (9) Additional regression fixes reported after we merged this into linux-next Thanks to Seth for a lot of good discussion around this and encouragement and input from Christoph" * tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits) posix_acl: Fix the type of sentinel in get_acl orangefs: fix mode handling ovl: call posix_acl_release() after error checking evm: remove dead code in evm_inode_set_acl() cifs: check whether acl is valid early acl: make vfs_posix_acl_to_xattr() static acl: remove a slew of now unused helpers 9p: use stub posix acl handlers cifs: use stub posix acl handlers ovl: use stub posix acl handlers ecryptfs: use stub posix acl handlers evm: remove evm_xattr_acl_change() xattr: use posix acl api ovl: use posix acl api ovl: implement set acl method ovl: implement get acl method ecryptfs: implement set acl method ecryptfs: implement get acl method ksmbd: use vfs_remove_acl() acl: add vfs_remove_acl() ...
2022-12-12Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs updates from Al Viro: "misc pile" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: sysv: Fix sysv_nblocks() returns wrong value get rid of INT_LIMIT, use type_max() instead btrfs: replace INT_LIMIT(loff_t) with OFFSET_MAX fs: simplify vfs_get_super fs: drop useless condition from inode_needs_update_time
2022-12-12Merge tag 'pull-iov_iter' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter work; most of that is about getting rid of direction misannotations and (hopefully) preventing more of the same for the future" * tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: use less confusing names for iov_iter direction initializers iov_iter: saner checks for attempt to copy to/from iterator [xen] fix "direction" argument of iov_iter_kvec() [vhost] fix 'direction' argument of iov_iter_{init,bvec}() [target] fix iov_iter_bvec() "direction" argument [s390] memcpy_real(): WRITE is "data source", not destination... [s390] zcore: WRITE is "data source", not destination... [infiniband] READ is "data destination", not source... [fsi] WRITE is "data source", not destination... [s390] copy_oldmem_kernel() - WRITE is "data source", not destination csum_and_copy_to_iter(): handle ITER_DISCARD get rid of unlikely() on page_copy_sane() calls
2022-12-05btrfs: print transaction aborted messages with an error levelFilipe Manana
Currently we print the transaction aborted message with a debug level, but a transaction abort is an exceptional event that indicates something went wrong and it's useful to have it printed with an error level as it helps analysing problems in a production environment, where debug level messages are typically not logged. For example reports from syzbot never include the transaction aborted message, since the log level on the test machines is above the debug level. So change the log level from debug to error. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a rangeFilipe Manana
If we get -ENOMEM while dropping file extent items in a given range, at btrfs_drop_extents(), due to failure to allocate memory when attempting to increment the reference count for an extent or drop the reference count, we handle it with a BUG_ON(). This is excessive, instead we can simply abort the transaction and return the error to the caller. In fact most callers of btrfs_drop_extents(), directly or indirectly, already abort the transaction if btrfs_drop_extents() returns any error. Also, we already have error paths at btrfs_drop_extents() that may return -ENOMEM and in those cases we abort the transaction, like for example anything that changes the b+tree may return -ENOMEM due to a failure to allocate a new extent buffer when COWing an existing extent buffer, such as a call to btrfs_duplicate_item() for example. So replace the BUG_ON() calls with proper logic to abort the transaction and return the error. Reported-by: syzbot+0b1fb6b0108c27419f9f@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000089773e05ee4b9cb4@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: fix extent map use-after-free when handling missing device in ↵void0red
read_one_chunk Store the error code before freeing the extent_map. Though it's reference counted structure, in that function it's the first and last allocation so this would lead to a potential use-after-free. The error can happen eg. when chunk is stored on a missing device and the degraded mount option is missing. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216721 Reported-by: eriri <1527030098@qq.com> Fixes: adfb69af7d8c ("btrfs: add_missing_dev() should return the actual error") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: void0red <void0red@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove outdated logic from overwrite_item() and add assertionFilipe Manana
As of commit 193df6245704 ("btrfs: search for last logged dir index if it's not cached in the inode"), the overwrite_item() function is always called for a root that is from a fs/subvolume tree. In other words, now it's only used during log replay to modify a fs/subvolume tree. Therefore we can remove the logic that checks if we are dealing with a log tree at overwrite_item(). So remove that logic, replacing it with an assertion and document that if we ever need to support a log root there, we will need to clone the leaf from the fs/subvolume tree and then release it before modifying the log tree, which is needed to avoid a potential deadlock, similar to the one recently fixed by a patch with the subject: "btrfs: do not modify log tree while holding a leaf from fs tree locked" Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: unify overwrite_item() and do_overwrite_item()Filipe Manana
After commit 193df6245704 ("btrfs: search for last logged dir index if it's not cached in the inode"), there are no more callers of do_overwrite_item(), except overwrite_item(). Originally both used to be the same function, but were split in commit 086dcbfa50d3 ("btrfs: insert items in batches when logging a directory when possible"), as there was the need to execute all logic of overwrite_item() but skip the tree search, since in the context of directory logging we already had a path with a leaf to copy data from. So unify them again as there is no more need to have them split. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: replace strncpy() with strscpy()Artem Chernyshev
Using strncpy() on NUL-terminated strings are deprecated. To avoid possible forming of non-terminated string strscpy() should be used. Found by Linux Verification Center (linuxtesting.org) with SVACE. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: fix uninitialized variable in find_first_clear_extent_bitJosef Bacik
This was caught when syncing extent-io-tree.c into btrfs-progs. This however isn't really a problem, the only way next would be uninitialized is if we found the range we were looking for, and in this case we don't care about next. However it's a compile error, so fix it up. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: fix uninitialized parent in insert_stateJosef Bacik
I don't know how this isn't caught when we build this in the kernel, but while syncing extent-io-tree.c into btrfs-progs I got an error because parent could potentially be uninitialized when we link in a new node, specifically when the extent_io_tree is empty. This means we could have garbage in the parent color. I don't know what the ramifications are of that, but it's probably not great, so fix this by initializing parent to NULL. I spot checked all of our other usages in btrfs and we appear to be doing the correct thing everywhere else. Fixes: c7e118cf98c7 ("btrfs: open code rbtree search in insert_state") CC: stable@vger.kernel.org # 6.0+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add might_sleep() annotationsChenXiaoSong
Add annotations to functions that might sleep due to allocations or IO and could be called from various contexts. In case of btrfs_search_slot it's not obvious why it would sleep: btrfs_search_slot setup_nodes_for_search reada_for_balance btrfs_readahead_node_child btrfs_readahead_tree_block btrfs_find_create_tree_block alloc_extent_buffer kmem_cache_zalloc /* allocate memory non-atomically, might sleep */ kmem_cache_alloc(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO) read_extent_buffer_pages submit_extent_page /* disk IO, might sleep */ submit_one_bio Other examples where the sleeping could happen is in 3 places might sleep in update_qgroup_limit_item(), as shown below: update_qgroup_limit_item btrfs_alloc_path /* allocate memory non-atomically, might sleep */ kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS) Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add stack helpers for a few btrfs itemsJosef Bacik
We don't have these defined in the kernel because we don't have any users of these helpers. However we do use them in btrfs-progs, so define them to make keeping accessors.h in sync between progs and the kernel easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add nr_global_roots to the super block definitionJosef Bacik
We already have this defined in btrfs-progs, add it to the kernel to make it easier to sync these files into btrfs-progs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove BTRFS_LEAF_DATA_OFFSETJosef Bacik
This is simply the same thing as btrfs_item_nr_offset(leaf, 0), so remove this helper and replace it's usage with the above statement. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add helpers for manipulating leaf items and dataJosef Bacik
We have some gnarly memmove and copy_extent_buffer calls for leaf manipulation. This is because our item offsets aren't absolute, they're based on 0 being where the items start in the leaf, which is after the btrfs_header. This means any manipulation of the data requires adding sizeof(struct btrfs_header) to the offsets we pull from the items. Moving the items themselves is easier as the helpers are absolute offsets, however we of course have to call the helpers to get the offsets for the item numbers. This makes for copy_extent_buffer/memmove_extent_buffer calls that are kind of hard to reason about what's happening. Fix this by pushing this logic into helpers. For data we'll only use the item provided offsets, and the helpers will use the BTRFS_LEAF_DATA_OFFSET addition for the offsets. Additionally for the item manipulation simply pass in the item numbers, and then the helpers will call the offset helper to get the actual offset into the leaf. The diffstat makes this look like more code, but that's simply because I added comments for the helpers, it's net negative for the amount of code, and is easier to reason. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add eb to btrfs_node_key_ptr_offsetJosef Bacik
This is a change needed for extent tree v2, as we will be growing the header size. This exists in btrfs-progs currently, and not having it makes syncing accessors.[ch] more problematic. So make this change to set us up for extent tree v2 and match what btrfs-progs does to make syncing easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass the extent buffer for the btrfs_item_nr helpersJosef Bacik
This is actually a change for extent tree v2, but it exists in btrfs-progs but not in the kernel. This makes it annoying to sync accessors.h with btrfs-progs, and since this is the way I need it for extent-tree v2 simply update these helpers to take the extent buffer in order to make syncing possible now, and make the extent tree v2 stuff easier moving forward. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move the csum helpers into ctree.hJosef Bacik
These got moved because of copy+paste, but this code exists in ctree.c, so move the declarations back into ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move eb offset helpers into extent_io.hJosef Bacik
These are very specific to how the extent buffer is defined, so this differs between btrfs-progs and the kernel. Make things easier by moving these helpers into extent_io.h so we don't have to worry about this when syncing ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file_extent_item helpers into file-item.hJosef Bacik
These helpers use functions that are in multiple places, which makes it tricky to sync them into btrfs-progs. Move them to file-item.h and then include file-item.h in places that use these helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move leaf_data_end into ctree.cJosef Bacik
This is only used in ctree.c, with the exception of zero'ing out extent buffers we're getting ready to write out. In theory we shouldn't have an extent buffer with 0 items that we're writing out, however I'd rather be safe than sorry so open code it in extent_io.c, and then copy the helper into ctree.c. This will make it easier to sync accessors.[ch] into btrfs-progs, as this requires a helper that isn't defined in accessors.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move root helpers back into ctree.hJosef Bacik
These accidentally got brought into accessors.h, but belong with the btrfs_root definitions which are currently in ctree.h. Move these to make it easier to sync accessors.[ch] into btrfs-progs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move repair_io_failure to bio.cChristoph Hellwig
repair_io_failure ties directly into all the glory low-level details of mapping a bio with a logic address to the actual physical location. Move it right below btrfs_submit_bio to keep all the related logic together. Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that repair_io_failure is available in a header. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: split the bio submission path into a separate fileChristoph Hellwig
The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move struct btrfs_tree_parent_check out of disk-io.hChristoph Hellwig
Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h an various .c files don't have to include disk-io.h just for it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ use tree-checker.h for the structure ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: raid56: do data csum verification during RMW cycleQu Wenruo
[BUG] For the following small script, btrfs will be unable to recover the content of file1: mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3 mount $dev1 $mnt xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1 md5sum $mnt/file1 umount $mnt # Corrupt the above 64K data stripe. xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3 mount $dev1 $mnt # Write a new 64K, which should be in the other data stripe # And this is a sub-stripe write, which will cause RMW xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2 md5sum $mnt/file1 umount $mnt Above md5sum would fail. [CAUSE] There is a long existing problem for raid56 (not limited to btrfs raid56) that, if we already have some corrupted on-disk data, and then trigger a sub-stripe write (which needs RMW cycle), it can cause further damage into P/Q stripe. Disk 1: data 1 |0x000000000000| <- Corrupted Disk 2: data 2 |0x000000000000| Disk 2: parity |0xffffffffffff| In above case, data 1 is already corrupted, the original data should be 64KiB of 0xff. At this stage, if we read data 1, and it has data checksum, we can still recovery going via the regular RAID56 recovery path. But if now we decide to write some data into data 2, then we need to go RMW. Let's say we want to write 64KiB of '0x00' into data 2, then we read the on-disk data of data 1, calculate the new parity, resulting the following layout: Disk 1: data 1 |0x000000000000| <- Corrupted Disk 2: data 2 |0x000000000000| <- New '0x00' writes Disk 2: parity |0x000000000000| <- New Parity. But the new parity is calculated using the *corrupted* data 1, we can no longer recover the correct data of data1. Thus the corruption is forever there. [FIX] To solve above problem, this patch will do a full stripe data checksum verification at RMW time. This involves the following changes: - Always read the full stripe (including data/P/Q) when doing RMW Before we only read the missing data sectors, but since we may do a data csum verification and recovery, we need to read everything out. Please note that, if we have a cached rbio, we don't need to read anything, and can treat it the same as full stripe write. As only stripe with all its csum matches can be cached. - Verify the data csum during read. The goal is only the rbio stripe sectors, and only if the rbio already has csum_buf/csum_bitmap filled. And sectors which cannot pass csum verification will have their bit set in error_bitmap. - Always call recovery_sectors() after we read out all the sectors Since error_bitmap will be updated during read, recover_sectors() can easily find out all the bad sectors and try to recover (if still under tolerance). And since recovery_sectors() is already migrated to use error_bitmap, it can skip vertical stripes which don't have any error. - Verify the repaired sectors against its csum in recover_vertical() - Rename rmw_read_and_wait() to rmw_read_wait_recover() Since we will always recover the sectors, the old name is no longer accurate. Furthermore since recovery is already done in rmw_read_wait_recover(), we no longer need to call recovery_sectors() inside rmw_rbio(). Obviously this will have a performance impact, as we are doing more work during RMW cycle: - Fetch the data checksums - Do checksum verification for all data stripes - Do checksum verification again after repair But for full stripe write or cached rbio we won't have the overhead all, thus for fully optimized RAID56 workload (always full stripe write), there should be no extra overhead. To me, the extra overhead looks reasonable, as data consistency is way more important than performance. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: raid56: prepare data checksums for later RMW verificationQu Wenruo
This is for later data checksum verification at RMW time. This patch will try to allocate the needed memory for a locked rbio if the rbio is for data exclusively (we don't want to handle mixed bg yet). The memory will be released when the rbio is finished. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: introduce a bitmap based csum range search functionQu Wenruo
Although we have an existing function, btrfs_lookup_csums_range(), to find all data checksums for a range, it's based on a btrfs_ordered_sum list. For the incoming RAID56 data checksum verification at RMW time, we don't want to waste time by allocating temporary memory. So this patch will introduce a new helper, btrfs_lookup_csums_bitmap(). It will use bitmap based result, which will be a perfect fit for later RAID56 usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: refactor checksum calculations in btrfs_lookup_csums_range()Qu Wenruo
The refactoring involves the following parts: - Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers As we have quite some open-coded calculations, some of them are even split into two assignments just to fit 80 chars limit. - Remove the @csum_size parameter from max_ordered_sum_bytes() Csum size can be fetched from @fs_info. And we will use the csum_size_to_bytes() helper anyway. - Add a comment explaining how we handle the first search result - Use newly introduced helpers to cleanup btrfs_lookup_csums_range() - Move variables declaration to the minimal scope - Never mix number of sectors with bytes There are several locations doing things like: size = min_t(size_t, csum_end - start, max_ordered_sum_bytes(fs_info)); ... size >>= fs_info->sectorsize_bits Or offset = (start - key.offset) >> fs_info->sectorsize_bits; offset *= csum_size; Make sure these variables can only represent BYTES inside the function, by using the above bytes_to_csum_size() helpers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: allocate btrfs_io_context without GFP_NOFAILLi zeming
The __GFP_NOFAIL flag could loop indefinitely when allocation memory in alloc_btrfs_io_context. The callers starting from __btrfs_map_block already handle errors so it's safe to drop the flag. Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use btrfs_dev_name() helper to handle missing devices betterQu Wenruo
[BUG] If dev-replace failed to re-construct its data/metadata, the kernel message would be incorrect for the missing device: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault) Note the above "dev (efault)" of the second line. While the first line is properly reporting "<missing disk>". [CAUSE] Although dev-replace is using btrfs_dev_name(), the heavy lifting work is still done by scrub (scrub is reused by both dev-replace and regular scrub). Unfortunately scrub code never uses btrfs_dev_name() helper, as it's only declared locally inside dev-replace.c. [FIX] Fix the output by: - Move the btrfs_dev_name() helper to volumes.h - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls Only zoned code is not touched, as I'm not familiar with degraded zoned code. - Constify return value and parameter Now the output looks pretty sane: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use cached state when looking for delalloc ranges with lseekFilipe Manana
During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent, we will look for delalloc in that range, and one of the things we do for that is to find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using calls to count_range_bits(). Typically there's a single, or few, searches in the io_tree for delalloc per lseek call. However it's common for applications to keep calling lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in a file, read the extents and skip holes in order to avoid unnecessary IO and save disk space by preserving holes. One popular user is the cp utility from coreutils. Starting with coreutils 9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a file. Before 9.0, it used fiemap to figure out where holes and extents are in the source file. Another popular user is the tar utility when used with the --sparse / -S option to detect and preserve holes. Given that the pattern is to keep calling lseek with a start offset that matches the returned offset from the previous lseek call, we can benefit from caching the last extent state visited in count_range_bits() and use it for the next count_range_bits() from the next lseek call. Example, the following strace excerpt from running tar: $ strace tar cJSvf foo.tar.xz qemu_disk_file.raw (...) lseek(5, 125019574272, SEEK_HOLE) = 125024989184 lseek(5, 125024989184, SEEK_DATA) = 125024993280 lseek(5, 125024993280, SEEK_HOLE) = 125025239040 lseek(5, 125025239040, SEEK_DATA) = 125025255424 lseek(5, 125025255424, SEEK_HOLE) = 125025353728 lseek(5, 125025353728, SEEK_DATA) = 125025357824 lseek(5, 125025357824, SEEK_HOLE) = 125026766848 lseek(5, 125026766848, SEEK_DATA) = 125026770944 lseek(5, 125026770944, SEEK_HOLE) = 125027053568 (...) Shows that pattern, which is the same as with cp from coreutils 9.0+. So start using a cached state for the delalloc searches in lseek, and store it in struct file's private data so that it can be reused across lseek calls. This change is part of a patchset that is comprised of the following patches: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek The following test was run before and after applying the whole patchset: $ cat test-cp.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh # coreutils 8.32, cp uses fiemap to detect holes and extents #CP_PROG=/usr/bin/cp # coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT FILE_SIZE=$((1024 * 1024 * 1024)) echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M" # Create a very sparse file, where each extent has a length of 4K and # is preceded by a 4K hole and followed by another 4K hole. start=$(date +%s%N) echo -n > $MNT/foobar for ((off = 0; off < $FILE_SIZE; off += 8192)); do xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null echo -ne "\r$off / $FILE_SIZE ..." done end=$(date +%s%N) echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)" start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds with data/metadata cached and delalloc" # Flush all delalloc. sync start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds with data/metadata cached and no delalloc" # Unmount and mount again to test the case without any metadata # loaded in memory. umount $MNT mount $DEV $MNT start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds without data/metadata cached and no delalloc" umount $MNT The results, running on a box with a non-debug kernel (Debian's default kernel config), were the following: 128M file, before patchset: cp took 16574 milliseconds with data/metadata cached and delalloc cp took 122 milliseconds with data/metadata cached and no delalloc cp took 20144 milliseconds without data/metadata cached and no delalloc 128M file, after patchset: cp took 6277 milliseconds with data/metadata cached and delalloc cp took 109 milliseconds with data/metadata cached and no delalloc cp took 210 milliseconds without data/metadata cached and no delalloc 512M file, before patchset: cp took 14369 milliseconds with data/metadata cached and delalloc cp took 429 milliseconds with data/metadata cached and no delalloc cp took 88034 milliseconds without data/metadata cached and no delalloc 512M file, after patchset: cp took 12106 milliseconds with data/metadata cached and delalloc cp took 427 milliseconds with data/metadata cached and no delalloc cp took 824 milliseconds without data/metadata cached and no delalloc 1G file, before patchset: cp took 10074 milliseconds with data/metadata cached and delalloc cp took 886 milliseconds with data/metadata cached and no delalloc cp took 181261 milliseconds without data/metadata cached and no delalloc 1G file, after patchset: cp took 3320 milliseconds with data/metadata cached and delalloc cp took 880 milliseconds with data/metadata cached and no delalloc cp took 1801 milliseconds without data/metadata cached and no delalloc Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use cached state when looking for delalloc ranges with fiemapFilipe Manana
During fiemap, whenever we find a hole or prealloc extent, we will look for delalloc in that range, and one of the things we do for that is to find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using calls to count_range_bits(). Since we process file extents from left to right, if we have a file with several holes or prealloc extents, we benefit from keeping a cached extent state record for calls to count_range_bits(). Most of the time the last extent state record we visited in one call to count_range_bits() matches the first extent state record we will use in the next call to count_range_bits(), so there's a benefit here. So use an extent state record to cache results from count_range_bits() calls during fiemap. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update stale comment for count_range_bits()Filipe Manana
The comment for count_range_bits() mentions that the search is fast if we are asking for a range with the EXTENT_DIRTY bit set. However that is no longer true since we don't use that bit and the optimization for that was removed in: commit 71528e9e16c7 ("btrfs: get rid of extent_io_tree::dirty_bytes") So remove that part of the comment mentioning the no longer existing optimized case, and, while at it, add proper documentation describing the purpose, arguments and return value of the function. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: allow passing a cached state record to count_range_bits()Filipe Manana
An inode's io_tree can be quite large and there are cases where due to delalloc it can have thousands of extent state records, which makes the red black tree have a depth of 10 or more, making the operation of count_range_bits() slow if we repeatedly call it for a range that starts where, or after, the previous one we called it for. Such use cases are when searching for delalloc in a file range that corresponds to a hole or a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap. So introduce a cached state parameter to count_range_bits() which we use to store the last extent state record we visited, and then allow the caller to pass it again on its next call to count_range_bits(). The next patches in the series will make fiemap and lseek use the new parameter. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove no longer used btrfs_next_extent_map()Filipe Manana
There are no more users of btrfs_next_extent_map(), the previous patch in the series ("btrfs: search for delalloc more efficiently during lseek/fiemap") removed the last usage of the function, so delete it. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: search for delalloc more efficiently during lseek/fiemapFilipe Manana
During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range that corresponds to a hole or a prealloc extent, we have to check if there's any delalloc in the range. We do it by searching for delalloc ranges in the inode's io_tree (for unflushed delalloc) and in the inode's extent map tree (for delalloc that is flushing). We avoid searching the extent map tree if the number of outstanding extents is 0, as in that case we can't have extent maps for our search range in the tree that correspond to delalloc that is flushing. However if we have any unflushed delalloc, due to buffered writes or mmap writes, then the outstanding extents counter is not 0 and we'll search the extent map tree. The tree may be large because it can have lots of extent maps that were loaded by reads or created by previous writes, therefore taking a significant time to search the tree, specially if have a file with a lot of holes and/or prealloc extents. We can improve on this by instead of searching the extent map tree, searching the ordered extents tree of the inode, since when delalloc is flushing we create an ordered extent along with the new extent map, while holding the respective file range locked in the inode's io_tree. The ordered extents tree is typically much smaller, since ordered extents have a short life and get removed from the tree once they are completed, while extent maps can stay for a very long time in the extent map tree, either created by previous writes or loaded by read operations. So use the ordered extents tree instead of the extent maps tree. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: skip unnecessary delalloc searches during lseek/fiemapFilipe Manana
During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range that corresponds to a hole or a prealloc extent, if we find that there is no delalloc marked in the inode's io_tree but there is delalloc due to an extent map in the io tree, then on the next iteration that calls find_delalloc_subrange() we can skip searching the io tree again, since on the first call we had no delalloc in the io tree for the whole range. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add an early exit when searching for delalloc range for lseek/fiemapFilipe Manana
During fiemap and lseek (SEEK_HOLE/DATA), when looking for delalloc in a range corresponding to a hole or a prealloc extent, if we found the whole range marked as delalloc in the inode's io_tree, then we can terminate immediately and avoid searching the extent map tree. If not, and if the found delalloc starts at the same offset of our search start but ends before our search range's end, then we can adjust the search range for the search in the extent map tree. So implement those changes. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_treeFilipe Manana
We don't need to set the EXTENT_UPDATE bit in an inode's io_tree to mark a range as uptodate, we rely on the pages themselves being uptodate - page reading is not triggered for already uptodate pages. Recently we removed most use of the EXTENT_UPTODATE for buffered IO with commit 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"), but there were a few leftovers, namely when reading from holes and successfully finishing read repair. These leftovers are unnecessarily making an inode's tree larger and deeper, slowing down searches on it. So remove all the leftovers. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move tree block parentness check into validate_extent_buffer()Qu Wenruo
[BACKGROUND] Although both btrfs metadata and data has their read time verification done at endio time (btrfs_validate_metadata_buffer() and btrfs_verify_data_csum()), metadata has extra verification, mostly parentness check including first key/transid/owner_root/level, done at read_tree_block() and btrfs_read_extent_buffer(). On the other hand, all the data verification is done at endio context. [ENHANCEMENT] This patch will make a new union in btrfs_bio, taking the space of the old data checksums, thus it will not increase the memory usage. With that extra btrfs_tree_parent_check inside btrfs_bio, we can just pass the check parameter into read_extent_buffer_pages(), and before submitting the bio, we can copy the check structure into btrfs_bio. And finally at endio time, we can grab btrfs_bio::parent_check and pass it to validate_extent_buffer(), to move the remaining checks into it. This brings the following benefits: - Much simpler btrfs_read_extent_buffer() Now it only needs to iterate through all mirrors. - Simpler read-time transid check Previously we go verify_parent_transid() after reading out the extent buffer. Now the transid check is done inside the endio function, no other code can modify the content. Thus no need to use the extent lock anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: concentrate all tree block parentness check parameters into one structureQu Wenruo
There are several different tree block parentness check parameters used across several helpers: - level Mandatory - transid Under most cases it's mandatory, but there are several backref cases which skips this check. - owner_root - first_key Utilized by most top-down tree search routine. Otherwise can be skipped. Those four members are not always mandatory checks, and some of them are the same u64, which means if some arguments got swapped compiler will not catch it. Furthermore if we're going to further expand the parentness check, we need to modify quite some helpers just to add one more parameter. This patch will concentrate all these members into a structure called btrfs_tree_parent_check, and pass that structure for the following helpers: - btrfs_read_extent_buffer() - read_tree_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move device->name RCU allocation and assign to btrfs_alloc_device()Anand Jain
There is a repeating code section in the parent function after calling btrfs_alloc_device(), as below: name = rcu_string_strdup(path, GFP_...); if (!name) { btrfs_free_device(device); return ERR_PTR(-ENOMEM); } rcu_assign_pointer(device->name, name); Except in add_missing_dev() for obvious reasons. This patch consolidates that repeating code into the btrfs_alloc_device() itself so that the parent function doesn't have to duplicate code. This consolidation also helps to review issues regarding RCU lock violation with device->name. Parent function device_list_add() and add_missing_dev() use GFP_NOFS for the allocation, whereas the rest of the parent functions use GFP_KERNEL, so bring the NOFS allocation context using memalloc_nofs_save() in the function device_list_add() and add_missing_dev() is already doing it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: constify input buffer parameter in compression codeDavid Sterba
The input buffers passed down to compression must never be changed, switch type to u8 as it's a raw byte buffer and use const. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: raid56: remove the old error tracking systemQu Wenruo
Since all the recovery paths have been migrated to the new error bitmap based system, we can remove the old stripe number based system. This cleanup involves one behavior change: - Rebuild rbio can no longer be merged Previously a rebuild rbio (caused by retry after data csum mismatch) can be merged, if the error happens in the same stripe. But with the new error bitmap based solution, it's much harder to compare error bitmaps. So here we just don't merge rebuild rbio at all. This may introduce some performance impact at extreme corner cases, but we're willing to take it. Other than that, this patch will cleanup the following members: - rbio::faila - rbio::failb They will be replaced by per-vertical stripe check, which is more accurate. - rbio::error It will be replace by per-vertical stripe error bitmap check. - Allow get_rbio_vertical_errors() to accept NULL pointers for @faila and @failb Some call sites only want to check if we have errors beyond the tolerance. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmapQu Wenruo
Since we have rbio::error_bitmap to indicate exactly where the errors are (including read error and csum mismatch error), we can make recovery path more accurate. For example: 0 32K 64K Data 1 |XXXXXXXX| | Data 2 | |XXXXXXXXX| Parity | | | 1) Get csum mismatch when reading data 1 [0, 32K) 2) Mark corresponding range error The old code will mark the whole data 1 stripe as error. While the new code will only mark data 1 [0, 32K) as error. 3) Recovery path The old code will recover data 1 [0, 64K), all using Data 2 and parity. This means, Data 1 [32K, 64K) will be corrupted data, as data 2 [32K, 64K) is already corrupted. While the new code will only recover data 1 [0, 32K), as only that range has error so far. This new behavior can avoid populating rbio cache with incorrect data. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: raid56: introduce btrfs_raid_bio::error_bitmapQu Wenruo
Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to indicate which stripe(s) had IO errors. But that has some problems: - If one sector failed csum check, the whole stripe where the corruption is will be marked error. This can reduce the chance we do recover, like this: 0 4K 8K Data 1 |XX| | Data 2 | |XX| Parity | | | In above case, 0~4K in data 1 should be recovered using data 2 and parity, while 4K~8K in data 2 should be recovered using data 1 and parity. Currently if we trigger read on 0~4K of data 1, we will also recover 4K~8K of data 1 using corrupted data 2 and parity, causing wrong result in rbio cache. - Harder to expand for future M-N scheme As we're limited to just faila/b, two corruptions. - Harder to expand to handle extra csum errors This can be problematic if we start to do csum verification. This patch will introduce an extra @error_bitmap, where one bit represents error that happened for that sector. The choice to introduce a new error bitmap other than reusing sector_ptr, is to avoid extra search between rbio::stripe_sectors[] and rbio::bio_sectors[]. Since we can submit bio using sectors from both sectors, doing proper search on both array will more complex. Although the new bitmap will take extra memory, later we can remove things like @error and faila/b to save some memory. Currently the new error bitmap and failab mechanism coexists, the error bitmap is only updated at endio time and recover entrance. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>