aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-11-16UBUNTU: SAUCE: seccomp_filter: add process state reportingWill Drewry
Adds seccomp and seccomp_filter status reporting to proc. /proc/<pid>/seccomp_filter provides the current seccomp mode and the list of allowed or dynamically filtered system calls. v9: rebase on to bccaeafd7c117acee36e90d37c7e05c19be9e7bf v8: - v7: emit seccomp mode directly v6: - v5: fix typos when mailing the wrong patch series v4: move from rcu guard to mutex guard v3: changed to using filters directly. v2: removed status entry, added seccomp file. (requested by kosaki.motohiro@jp.fujitsu.com) allowed S_IRUGO reading of entries (requested by viro@zeniv.linux.org.uk) added flags got rid of the seccomp_t type dropped seccomp file Signed-off-by: Will Drewry <wad@chromium.org> BUG=chromium-os:14496 TEST=see others in series Change-Id: I86515e56d2da3b3c22ac90856a6ffa71a045c253 Reviewed-on: http://gerrit.chromium.org/gerrit/3242 Reviewed-by: Sonny Rao <sonnyrao@chromium.org> Tested-by: Will Drewry <wad@chromium.org> Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
2011-11-16fs: limit filesystem stacking depthMiklos Szeredi
Add a simple read-only counter to super_block that indicates deep this is in the stack of filesystems. Previously ecryptfs was the only stackable filesystem and it explicitly disallowed multiple layers of itself. Overlayfs, however, can be stacked recursively and also may be stacked on top of ecryptfs or vice versa. To limit the kernel stack usage we must limit the depth of the filesystem stack. Initially the limit is set to 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-11-16overlayfs: implement show_optionsErez Zadok
This is useful because of the stacking nature of overlayfs. Users like to find out (via /proc/mounts) which lower/upper directory were used at mount time. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-11-16overlayfs: add statfs supportAndy Whitcroft
Add support for statfs to the overlayfs filesystem. As the upper layer is the target of all write operations assume that the space in that filesystem is the space in the overlayfs. There will be some inaccuracy as overwriting a file will copy it up and consume space we were not expecting, but it is better than nothing. Use the upper layer dentry and mount from the overlayfs root inode, passing the statfs call to that filesystem. Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-11-16overlay filesystemMiklos Szeredi
Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer. This type of mechanism is most often used for live CDs but there's a wide variety of other uses. The implementation differs from other "union filesystem" implementations in that after a file is opened all operations go directly to the underlying, lower or upper, filesystems. This simplifies the implementation and allows native performance in these cases. The dentry tree is duplicated from the underlying filesystems, this enables fast cached lookups without adding special support into the VFS. This uses slightly more memory than union mounts, but dentries are relatively small. Currently inodes are duplicated as well, but it is a possible optimization to share inodes for non-directories. Opening non directories results in the open forwarded to the underlying filesystem. This makes the behavior very similar to union mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file descriptors). Usage: mount -t overlay -olowerdir=/lower,upperdir=/upper overlay /mnt Supported: - all operations Missing: - ensure that filesystems part of the overlay are not modified outside the overlay The following cotributions have been folded into this patch: Neil Brown <neilb@suse.de>: - minimal remount support - use correct seek function for directories - initialise is_real before use - rename ovl_fill_cache to ovl_dir_read Felix Fietkau <nbd@openwrt.org>: - fix a deadlock in ovl_dir_read_merged - fix a deadlock in ovl_remove_whiteouts Erez Zadok <ezk@fsl.cs.sunysb.edu> - fix cleanup after WARN_ON Also thanks to the following people for testing and reporting bugs: Jordi Pujol <jordipujolp@gmail.com> Andy Whitcroft <apw@canonical.com> Michal Suchanek <hramrach@centrum.cz> Felix Fietkau <nbd@openwrt.org> Erez Zadok <ezk@fsl.cs.sunysb.edu> Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-11-16Revert "UBUNTU: ubuntu: overlayfs -- overlay filesystem"Leann Ogasawara
This reverts commit eabb8d62f6ec4897974548368aa1280e5e726187.
2011-11-16Revert "UBUNTU: ubuntu: overlayfs -- overlayfs: add statfs support"Leann Ogasawara
This reverts commit 70dc86677da31d4ecaf70c5f86cb3881c58594ac.
2011-11-16Revert "UBUNTU: ubuntu: overlayfs -- overlayfs: implement show_options"Leann Ogasawara
This reverts commit 09bcd22c8c6f8a67a38702cf44febd52554ca543.
2011-11-16Revert "UBUNTU: ubuntu: overlayfs -- ovl: fix overlayfs over overlayfs"Leann Ogasawara
This reverts commit 2679353c90924972d3bbe1328042d7e88a794d62.
2011-11-16Revert "UBUNTU: ubuntu: overlayfs -- ovl: improve stack use of lookup and ↵Leann Ogasawara
readdir" This reverts commit beb0e0d9ca7db7d3b4e416089b1ef35efabb752d.
2011-11-16Revert "UBUNTU: ubuntu: overlayfs -- fs: limit filesystem stacking depth"Leann Ogasawara
This reverts commit d72ccf8cb287df877221c874efb38f343f671e86.
2011-11-16Revert "UBUNTU: ubuntu: overlayfs -- ovl: make lower mount read-only"Leann Ogasawara
This reverts commit 9fa5e35fb1ef0df154866d299b351c5cb25ccf5d.
2011-11-16UBUNTU: SAUCE: (no-up) vfs: Add a trace point in the mark_inode_dirty functionArjan van de Ven
[apw@canonical.com: This has no upstream traction but is used by powertop, so its worth carrying.] PowerTOP would like to be able to show who is keeping the disk busy by dirtying data. The most logical spot for this is in the vfs in the mark_inode_dirty() function. Doing this on the block level is not possible because by the time the IO hits the block layer the guilty party can no longer be found ("kjournald" and "pdflush" are not useful answers to "who caused this file to be dirty). The trace point follows the same logic/style as the block_dump code and pretty much dumps the same data, just not to dmesg (and thus to /var/log/messages) but via the trace events streams. Note: This patch was posted to lkml and might potentially go into 2.6.33 but I have not seen which maintainer will take it. Signed-of-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Amit Kucheria <amit.kucheria@canonical.com> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- ovl: make lower mount read-onlyMiklos Szeredi
If a file only existing on the lower fs is operned for O_RDONLY and fchmod/fchown/etc is performed on the open file then this will modify the lower fs, which is not what we want. Copying up at this point is not possible. The best solution is to return an error for this corner case and hope applications are not relying on it. Reported-by: "J. R. Okajima" <hooanon05@yahoo.co.jp> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- fs: limit filesystem stacking depthMiklos Szeredi
Add a simple read-only counter to super_block that indicates deep this is in the stack of filesystems. Previously ecryptfs was the only stackable filesystem and it explicitly disallowed multiple layers of itself. Overlayfs, however, can be stacked recursively and also may be stacked on top of ecryptfs or vice versa. To limit the kernel stack usage we must limit the depth of the filesystem stack. Initially the limit is set to 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- ovl: improve stack use of lookup and readdirMiklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- ovl: fix overlayfs over overlayfsMiklos Szeredi
Overlayfs expects ->permission() to not check for readonliness (which is normally checked by the VFS) and so not return with -EROFS. This is not true of some filesystems, notably overlayfs itself. The following patch should fix this by making sure that if the upper layer is read-only (such as squashfs) then it will mark overlayfs read-only too and by making ovl_permission() only return EROFS in the excpetional case where the upper filesystem became r/o after the overlay was constructed. Reported-by: Jordi Pujol <jordipujolp@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- overlayfs: implement show_optionsErez Zadok
This is useful because of the stacking nature of overlayfs. Users like to find out (via /proc/mounts) which lower/upper directory were used at mount time. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- overlayfs: add statfs supportAndy Whitcroft
Add support for statfs to the overlayfs filesystem. As the upper layer is the target of all write operations assume that the space in that filesystem is the space in the overlayfs. There will be some inaccuracy as overwriting a file will copy it up and consume space we were not expecting, but it is better than nothing. Use the upper layer dentry and mount from the overlayfs root inode, passing the statfs call to that filesystem. Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-11-16UBUNTU: ubuntu: overlayfs -- overlay filesystemMiklos Szeredi
Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer. This type of mechanism is most often used for live CDs but there's a wide variety of other uses. The implementation differs from other "union filesystem" implementations in that after a file is opened all operations go directly to the underlying, lower or upper, filesystems. This simplifies the implementation and allows native performance in these cases. The dentry tree is duplicated from the underlying filesystems, this enables fast cached lookups without adding special support into the VFS. This uses slightly more memory than union mounts, but dentries are relatively small. Currently inodes are duplicated as well, but it is a possible optimization to share inodes for non-directories. Opening non directories results in the open forwarded to the underlying filesystem. This makes the behavior very similar to union mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file descriptors). Usage: mount -t overlay -olowerdir=/lower,upperdir=/upper overlay /mnt Supported: - all operations Missing: - ensure that filesystems part of the overlay are not modified outside the overlay The following cotributions have been folded into this patch: Neil Brown <neilb@suse.de>: - minimal remount support - use correct seek function for directories - initialise is_real before use - rename ovl_fill_cache to ovl_dir_read Felix Fietkau <nbd@openwrt.org>: - fix a deadlock in ovl_dir_read_merged - fix a deadlock in ovl_remove_whiteouts Erez Zadok <ezk@fsl.cs.sunysb.edu> - fix cleanup after WARN_ON Also thanks to the following people for testing and reporting bugs: Jordi Pujol <jordipujolp@gmail.com> Andy Whitcroft <apw@canonical.com> Michal Suchanek <hramrach@centrum.cz> Felix Fietkau <nbd@openwrt.org> Erez Zadok <ezk@fsl.cs.sunysb.edu> Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- vfs: introduce clone_private_mount()Miklos Szeredi
Overlayfs needs a private clone of the mount, so create a function for this and export to modules. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- vfs: export do_splice_direct() to modulesMiklos Szeredi
Export do_splice_direct() to modules. Needed by overlay filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: overlayfs -- vfs: add i_op->open()Miklos Szeredi
Add a new inode operation i_op->open(). This is for stacked filesystems that want to return a struct file from a different filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: nx-emu - i386: NX emulationIngo Molnar
This is old code with some cruft, all originally by Ingo Molnar with much later rebasing by Fedora folks and at least one arcane fix by Roland McGrath a few years ago. No longer uses exec-shield sysctl, merged with disable_nx. Kees Cook fixed boottime NX reporting for various corner cases. Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-11-16UBUNTU: ubuntu: AUFS -- aufs2-standalone.patch aufs2.1-39Andy Whitcroft
Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: ubuntu: AUFS -- aufs2-base.patch aufs2.1-39Andy Whitcroft
Signed-off-by: Andy Whitcroft <apw@canonical.com>
2011-11-16UBUNTU: SAUCE: (no-up) trace: add trace events for open(), exec() and uselib()Scott James Remnant
BugLink: http://bugs.launchpad.net/bugs/462111 This patch uses TRACE_EVENT to add tracepoints for the open(), exec() and uselib() syscalls so that ureadahead can cheaply trace the boot sequence to determine what to read to speed up the next. It's not upstream because it will need to be rebased onto the syscall trace events whenever that gets merged, and is a stop-gap. Signed-off-by: Scott James Remnant <scott@ubuntu.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Andy Whitcroft <andy.whitcroft@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2011-11-16UBUNTU: SAUCE: (no-up) version: Implement version_signature proc file.Andy Whitcroft
Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Tim Gardener <tim.gardner@canonical.com>
2011-11-11VFS: we need to set LOOKUP_JUMPED on mountpoint crossingAl Viro
commit a3fbbde70a0cec017f2431e8f8de208708c76acc upstream. Mountpoint crossing is similar to following procfs symlinks - we do not get ->d_revalidate() called for dentry we have arrived at, with unpleasant consequences for NFS4. Simple way to reproduce the problem in mainline: cat >/tmp/a.c <<'EOF' #include <unistd.h> #include <fcntl.h> #include <stdio.h> main() { struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1}; if (fcntl(0, F_SETLK, &fl)) perror("setlk"); } EOF cc /tmp/a.c -o /tmp/test then on nfs4: mount --bind file1 file2 /tmp/test < file1 # ok /tmp/test < file2 # spews "setlk: No locks available"... What happens is the missing call of ->d_revalidate() after mountpoint crossing and that's where NFS4 would issue OPEN request to server. The fix is simple - treat mountpoint crossing the same way we deal with following procfs-style symlinks. I.e. set LOOKUP_JUMPED... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11VFS: fix statfs() automounter semantics regressionDan McGee
commit 5c8a0fbba543d9428a486f0d1282bbcf3cf1d95a upstream. No one in their right mind would expect statfs() to not work on a automounter managed mount point. Fix it. [ I'm not sure about the "no one in their right mind" part. It's not mounted, and you didn't ask for it to be mounted. But nobody will really care, and this probably makes it match previous semantics, so.. - Linus ] This mirrors the fix made to the quota code in 815d405ceff0d69646. Signed-off-by: Dan McGee <dpmcgee@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11block: make gendisk hold a reference to its queueTejun Heo
commit f992ae801a7dec34a4ed99a6598bbbbfb82af4fb upstream. The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: fix race in xattr block allocation pathEric Sandeen
commit 6d6a435190bdf2e04c9465cde5bdc3ac68cf11a4 upstream. Ceph users reported that when using Ceph on ext4, the filesystem would often become corrupted, containing inodes with incorrect i_blocks counters. I managed to reproduce this with a very hacked-up "streamtest" binary from the Ceph tree. Ceph is doing a lot of xattr writes, to out-of-inode blocks. There is also another thread which does sync_file_range and close, of the same files. The problem appears to happen due to this race: sync/flush thread xattr-set thread ----------------- ---------------- do_writepages ext4_xattr_set ext4_da_writepages ext4_xattr_set_handle mpage_da_map_blocks ext4_xattr_block_set set DELALLOC_RESERVE ext4_new_meta_blocks ext4_mb_new_blocks if (!i_delalloc_reserved_flag) vfs_dq_alloc_block ext4_get_blocks down_write(i_data_sem) set i_delalloc_reserved_flag ... up_write(i_data_sem) if (i_delalloc_reserved_flag) vfs_dq_alloc_block_nofail In other words, the sync/flush thread pops in and sets i_delalloc_reserved_flag on the inode, which makes the xattr thread think that it's in a delalloc path in ext4_new_meta_blocks(), and add the block for a second time, after already having added it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks The real problem is that we shouldn't be using the DELALLOC_RESERVED state flag, and instead we should be passing EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of using an inode state flag. We'll fix this for now with using i_data_sem to prevent this race, but this is really not the right way to fix things. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: let ext4_page_mkwrite stop started handle in failureYongqiang Yang
commit fcbb5515825f1bb20b7a6f75ec48bee61416f879 upstream. The started journal handle should be stopped in failure case. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: call ext4_handle_dirty_metadata with correct inode in ext4_dx_add_entryTheodore Ts'o
commit 5930ea643805feb50a2f8383ae12eb6f10935e49 upstream. ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads that point to directory blocks assigned to the directory inode. However, the function calls ext4_handle_dirty_metadata with the inode of the file that's being added to the directory, not the directory inode itself. Therefore, correct the code to dirty the directory buffers with the directory inode, not the file inode. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: ext4_mkdir should dirty dir_block with newly created directory inodeDarrick J. Wong
commit f9287c1f2d329f4d78a3bbc9cf0db0ebae6f146a upstream. ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir". Unfortunately, dir_block belongs to the newly created directory (which is "inode"), not the parent directory (which is "dir"). Fix the incorrect association. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: ext4_rename should dirty dir_bh with the correct directoryDarrick J. Wong
commit bcaa992975041e40449be8c010c26192b8c8b409 upstream. When ext4_rename performs a directory rename (move), dir_bh is a buffer that is modified to update the '..' link in the directory being moved (old_inode). However, ext4_handle_dirty_metadata is called with the old parent directory inode (old_dir) and dir_bh, which is incorrect because dir_bh does not belong to the parent inode. Fix this error. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodesTheodore Ts'o
commit 1cd9f0976aa4606db8d6e3dc3edd0aca8019372a upstream. This doesn't make much sense, and it exposes a bug in the kernel where attempts to create a new file in an append-only directory using O_CREAT will fail (but still leave a zero-length file). This was discovered when xfstests #79 was generalized so it could run on all file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11binfmt_elf: fix PIE execution with randomization disabledJiri Kosina
commit a3defbe5c337dbc6da911f8cc49ae3cc3b49b453 upstream. The case of address space randomization being disabled in runtime through randomize_va_space sysctl is not treated properly in load_elf_binary(), resulting in SIGKILL coming at exec() time for certain PIE-linked binaries in case the randomization has been disabled at runtime prior to calling exec(). Handle the randomize_va_space == 0 case the same way as if we were not supporting .text randomization at all. Based on original patch by H.J. Lu and Josh Boyer. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: H.J. Lu <hongjiu.lu@intel.com> Cc: <stable@kernel.org> Tested-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11readlinkat: ensure we return ENOENT for the empty pathname for normal lookupsAndy Whitcroft
commit 1fa1e7f615f4d3ae436fa319af6e4eebdd4026a8 upstream. Since the commit below which added O_PATH support to the *at() calls, the error return for readlink/readlinkat for the empty pathname has switched from ENOENT to EINVAL: commit 65cfc6722361570bfe255698d9cd4dccaf47570d Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 13 15:56:26 2011 -0400 readlinkat(), fchownat() and fstatat() with empty relative pathnames This is both unexpected for userspace and makes readlink/readlinkat inconsistant with all other interfaces; and inconsistant with our stated return for these pathnames. As the readlinkat call does not have a flags parameter we cannot use the AT_EMPTY_PATH approach used in the other calls. Therefore expose whether the original path is infact entry via a new user_path_at_empty() path lookup function. Use this to determine whether to default to EINVAL or ENOENT for failures. Addresses http://bugs.launchpad.net/bugs/817187 [akpm@linux-foundation.org: remove unused getname_flags()] Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11/proc/self/numa_maps: restore "huge" tag for hugetlb vmasAndrew Morton
commit fc360bd9cdcf875639a77f07fafec26699c546f3 upstream. The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm: use walk_page_range() instead of custom page table walking code"). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Tested-by: Stephen Hemminger <shemminger@vyatta.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11vfs: add "device" tag to /proc/self/mountstatsBryan Schumaker
commit a877ee03ac010ded434b77f7831f43cbb1fcc60f upstream. nfsiostat was failing to find mounted filesystems on kernels after 2.6.38 because of changes to show_vfsstat() by commit c7f404b40a3665d9f4e9a927cc5c1ee0479ed8f9. This patch adds back the "device" tag before the nfs server entry so scripts can parse the mountstats file correctly. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: ignore WANT bits in open downgradeJ. Bruce Fields
commit c30e92df30d7d5fe65262fbce5d1b7de675fe34e upstream. We don't use WANT bits yet--and sending them can probably trigger a BUG() further down. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: fix open downgrade, againJ. Bruce Fields
commit 3d02fa29dec920c597dd7b7db608a4bc71f088ce upstream. Yet another open-management regression: - nfs4_file_downgrade() doesn't remove the BOTH access bit on downgrade, so the server's idea of the stateid's access gets out of sync with the client's. If we want to keep an O_RDWR open in this case, we should do that in the file_put_access logic rather than here. - We forgot to convert v4 access to an open mode here. This logic has proven too hard to get right. In the future we may consider: - reexamining the lock/openowner relationship (locks probably don't really need to take their own references here). - adding open upgrade/downgrade support to the vfs. - removing the atomic operations. They're redundant as long as this is all under some other lock. Also, maybe some kind of additional static checking would help catch O_/NFS4_SHARE_ACCESS confusion. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: permit read opens of executable-only filesJ. Bruce Fields
commit a043226bc140a2c1dde162246d68a67e5043e6b2 upstream. A client that wants to execute a file must be able to read it. Read opens over nfs are therefore implicitly allowed for executable files even when those files are not readable. NFSv2/v3 get this right by using a passed-in NFSD_MAY_OWNER_OVERRIDE on read requests, but NFSv4 has gotten this wrong ever since dc730e173785e29b297aa605786c94adaffe2544 "nfsd4: fix owner-override on open", when we realized that the file owner shouldn't override permissions on non-reclaim NFSv4 opens. So we can't use NFSD_MAY_OWNER_OVERRIDE to tell nfsd_permission to allow reads of executable files. So, do the same thing we do whenever we encounter another weird NFS permission nit: define yet another NFSD_MAY_* flag. The industry's future standardization on 128-bit processors will be motivated primarily by the need for integers with enough bits for all the NFSD_MAY_* flags. Reported-by: Leonardo Borda <leonardoborda@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: fix seqid_mutating_errorJ. Bruce Fields
commit 576163005de286bbd418fcb99cfd0971523a0c6d upstream. The set of errors here does *not* agree with the set of errors specified in the rfc! While we're there, turn this macros into a function, for the usual reasons, and move it to the one place where it's actually used. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: stop using nfserr_resource for transitory errorsJ. Bruce Fields
commit 3e77246393c0a433247631a1f0e9ec98d3d78a1c upstream. The server is returning nfserr_resource for both permanent errors and for errors (like allocation failures) that might be resolved by retrying later. Save nfserr_resource for the former and use delay/jukebox for the latter. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: Remove check for a 32-bit cookie in nfsd4_readdir()Bernd Schubert
commit 832023bffb4b493f230be901f681020caf3ed1f8 upstream. Fan Yong <yong.fan@whamcloud.com> noticed setting FMODE_32bithash wouldn't work with nfsd v4, as nfsd4_readdir() checks for 32 bit cookies. However, according to RFC 3530 cookies have a 64 bit type and cookies are also defined as u64 in 'struct nfsd4_readdir'. So remove the test for >32-bit values. Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfs: don't try to migrate pages with active requestsJeff Layton
commit 2da956523526e440ef4f4dd174e26f5ac06fe011 upstream. nfs_find_and_lock_request will take a reference to the nfs_page and will then put it if the req is already locked. It's possible though that the reference will be the last one. That put then can kick off a whole series of reference puts: nfs_page nfs_open_context dentry inode If the inode ends up being deleted, then the VFS will call truncate_inode_pages. That function will try to take the page lock, but it was already locked when migrate_page was called. The code deadlocks. Fix this by simply refusing the migration request if PagePrivate is already set, indicating that the page is already associated with an active read or write request. We've had a customer test a backported version of this patch and the preliminary results seem good. Cc: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Harshula Jayasuriya <harshula@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfs: don't redirty inode when ncommit == 0 in nfs_commit_unstable_pagesJeff Layton
commit 3236c3e1adc0c7ec83eaff1de2d06746b7c5bb28 upstream. commit 420e3646 allowed the kernel to reduce the number of unnecessary commit calls by skipping the commit when there are a large number of outstanding pages. However, the current test in nfs_commit_unstable_pages does not handle the edge condition properly. When ncommit == 0, then that means that the kernel doesn't need to do anything more for the inode. The current test though in the WB_SYNC_NONE case will return true, and the inode will end up being marked dirty. Once that happens the inode will never be clean until there's a WB_SYNC_ALL flush. Fix this by immediately returning from nfs_commit_unstable_pages when ncommit == 0. Mike noticed this problem initially in RHEL5 (2.6.18-based kernel) which has a backported version of 420e3646. The inode cache there was growing very large. The inode cache was unable to be shrunk since the inodes were all marked dirty. Calling sync() would essentially "fix" the problem -- the WB_SYNC_ALL flush would result in the inodes all being marked clean. What I'm not clear on is how big a problem this is in mainline kernels as the writeback code there is very different. Either way, it seems incorrect to re-mark the inode dirty in this case. Reported-by: Mike McLean <mikem@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11SUNRPC/NFS: make rpc pipe upcall genericPeng Tao
commit c1225158a8dad9e9d5eee8a17dbbd9c7cda05ab9 upstream. The same function is used by idmap, gss and blocklayout code. Make it generic. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>