arm64: word-at-a-time: improve byte count calculations for LEword-at-a-time

Do the same optimization as x86-64: do __ffs() on the intermediate value that found whether there is a zero byte, before we've actually computed the final byte mask. The logic is: has_zero(): Check if the word has a zero byte in it, which indicates the end of the loop, and prepare a value to be used for the rest of the sequence. The standard LE implementation just creates a word that has the high bit set in each byte of the word that was zero. Example: 0xaa00bbccdd00eeff -> 0x0080000000800000 prep_zero_mask(): Possibly do more prep to then clean up the initial fast result from has_zero, so that it can be combined with another zero mask with a simple logical "or" to create a final mask. This is only used on big-endian machines that use a different algorithm, and is a no-op here. create_zero_mask(): This is "step 1" of creating the count and the mask, and is meant for any common operations between the two. In the old implementation, this actually created the zero mask, that was then used for masking and for counting the number of bits in the mask. In the new implementation, this is a no-op. count_zero(): This takes the mask bits, and counts the number of bytes before the first zero byte. In the old implementation, it counted the number of bits in the final byte mask (which was the same as the C standard "find last set bit" that uses the silly "starts at one" counting) and shifted the value down by three. In the new implementation, we know the intermediate mask isn't zero, and it just does "find first set" with the sane semantics without any off-by-one issues, and again shifts by three (which also masks off the bit offset in the zero byte itself). Example: 0x0080000000800000 -> 2 zero_bytemask(): This takes the mask bits, and turns it into an actual byte mask of the bytes preceding the first zero byte. In the old implementation, this was a no-op, because the work had already been done by create_zero_mask(). In the new implementation, this does what create_zero_mask() used to do. Example: 0x0080000000800000 -> 0x000000000000ffff The difference between the old and the new implementation is that "count_zero()" ends up scheduling better because it is being done on a value that is available earlier (before the final mask). But more importantly, it can be implemented without the insane semantics of the standard bit finding helpers that have the off-by-one issue and have to special-case the zero mask situation. On arm64, the new "count_zero()" ends up just "rbit + clz" plus the shift right that then ends up being subsumed by the "add to final length". Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Linus Torvalds <torvalds@linux-foundation.org> 2024-06-18 18:14:48 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2024-06-19 12:35:19 -0700
commit: f915a3e5b0182dd7376f11337e231500a157e1f4 (patch)
tree: bcb39ed202365d2b2edf433005337975919b9cf5
parent: 4b8fa1173cdc33a1cb53e4cac869b1b7aac917a9 (diff)
1 files changed, 3 insertions, 8 deletions
diff --git a/arch/arm64/include/asm/word-at-a-time.h b/arch/arm64/include/asm/word-at-a-time.h
index 14251abee23c..824ca6987a51 100644
--- a/arch/arm64/include/asm/word-at-a-time.h
+++ b/arch/arm64/include/asm/word-at-a-time.h
@@ -27,20 +27,15 @@ static inline unsigned long has_zero(unsigned long a, unsigned long *bits,
 }
 
 #define prep_zero_mask(a, bits, c) (bits)
+#define create_zero_mask(bits) (bits)
+#define find_zero(bits) (__ffs(bits) >> 3)
 
-static inline unsigned long create_zero_mask(unsigned long bits)
+static inline unsigned long zero_bytemask(unsigned long bits)
 {
 	bits = (bits - 1) & ~bits;
 	return bits >> 7;
 }
 
-static inline unsigned long find_zero(unsigned long mask)
-{
-	return fls64(mask) >> 3;
-}
-
-#define zero_bytemask(mask) (mask)
-
 #else	/* __AARCH64EB__ */
 #include <asm-generic/word-at-a-time.h>
 #endif
author	Linus Torvalds <torvalds@linux-foundation.org>	2024-06-18 18:14:48 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2024-06-19 12:35:19 -0700
commit	f915a3e5b0182dd7376f11337e231500a157e1f4 (patch)
tree	bcb39ed202365d2b2edf433005337975919b9cf5
parent	4b8fa1173cdc33a1cb53e4cac869b1b7aac917a9 (diff)