Names generated by previous hashing function cause issues in [Yakuza 5](https://github.com/ValveSoftware/Proton/issues/4625) i.e. incorrect audio files are played because of hash collision (incorrect files are returned by [find_file_in_dir](https://gitlab.winehq.org/wine/wine/-/blob/master/dlls/ntdll/unix/file.c?ref...)). For example, both `auth_a1020.hca` and `auth_d4010.hca` have the same hashed name i.e. `AUTH~SDT.HCA`. This MR replaces hashing function with [MurmurOAAT hash](https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02...) and extends `hash_chars` to include more legal DOS filename characters. Based on my testing it removes all hash collisions in the newest (as 2023-11-01) available version of Yakuza 5 when comparing files that are in the same directory.
From: Komoszek komoszek10@gmail.com
--- dlls/ntdll/unix/file.c | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-)
diff --git a/dlls/ntdll/unix/file.c b/dlls/ntdll/unix/file.c index c6533710365..83a6e1f8883 100644 --- a/dlls/ntdll/unix/file.c +++ b/dlls/ntdll/unix/file.c @@ -1320,27 +1320,32 @@ static BOOL is_hidden_file( const char *name ) */ static ULONG hash_short_file_name( const WCHAR *name, int length, LPWSTR buffer ) { - static const char hash_chars[32] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"; + static const char hash_chars[64] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!#$&'()-@^_`{}~ABCDEFGHIJKLM";
LPCWSTR p, ext, end = name + length; LPWSTR dst; - unsigned short hash; + unsigned int hash; int i;
- /* Compute the hash code of the file name */ - /* If you know something about hash functions, feel free to */ - /* insert a better algorithm here... */ + /* Compute the hash code of the file name using MurmurOAAT Hash */ + /* https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02... */ if (!is_case_sensitive) { - for (p = name, hash = 0xbeef; p < end - 1; p++) - hash = (hash<<3) ^ (hash>>5) ^ towlower(*p) ^ (towlower(p[1]) << 8); - hash = (hash<<3) ^ (hash>>5) ^ towlower(*p); /* Last character */ + for (p = name, hash = 0xdeadbeef; p < end; p++) + { + hash ^= towlower(*p); + hash *= 0x5bd1e995; + hash ^= hash >> 15; + } } else { - for (p = name, hash = 0xbeef; p < end - 1; p++) - hash = (hash << 3) ^ (hash >> 5) ^ *p ^ (p[1] << 8); - hash = (hash << 3) ^ (hash >> 5) ^ *p; /* Last character */ + for (p = name, hash = 0xdeadbeef; p < end; p++) + { + hash ^= *p; + hash *= 0x5bd1e995; + hash ^= hash >> 15; + } }
/* Find last dot for start of the extension */ @@ -1356,9 +1361,9 @@ static ULONG hash_short_file_name( const WCHAR *name, int length, LPWSTR buffer while (i-- >= 0) *dst++ = '~';
/* Insert hash code converted to 3 ASCII chars */ - *dst++ = hash_chars[(hash >> 10) & 0x1f]; - *dst++ = hash_chars[(hash >> 5) & 0x1f]; - *dst++ = hash_chars[hash & 0x1f]; + *dst++ = hash_chars[(hash >> 12) & 0x3f]; + *dst++ = hash_chars[(hash >> 6) & 0x3f]; + *dst++ = hash_chars[hash & 0x3f];
/* Copy the first 3 chars of the extension (if any) */ if (ext)
Is it possible to use something closely compatible with how Windows does it? Maybe Samba has a clue about that. We'll need something that's patent free obviously.
On 11/1/23 09:54, Nikolay Sivov (@nsivov) wrote:
Is it possible to use something closely compatible with how Windows does it? Maybe Samba has a clue about that. We'll need something that's patent free obviously.
I am afraid Windows, besides short name generation algorithm (which, looking at the names, is not exactly a hash but probably some translation of the long name), does it quite different in principle. Probably storing the short names somewhere and resolving the conflicts. Doing the same name translation without resolving conflicts will make things worse as the conflicts will become quite likely.
This contains illegal characters that cannot be used for 8.3 filenames:
https://gitlab.winehq.org/wine/wine/-/blob/0170cd3a4c67bd99291234dd8e0d638a8...
On Wed Nov 1 16:09:09 2023 +0000, Jinoh Kang wrote:
This contains illegal characters that cannot be used for 8.3 filenames: https://gitlab.winehq.org/wine/wine/-/blob/0170cd3a4c67bd99291234dd8e0d638a8...
I'm sorry but which of these characters are illegal? I don't see any.
On Wed Nov 1 16:09:09 2023 +0000, Łukasz Komoszyński wrote:
I'm sorry but which of these characters are illegal? I don't see any.
Sorry, must have mistaken.
The problem with changing the hash is that it will break existing prefixes, and make it impossible to bisect regressions across the change since it won't be backwards compatible.
We would want to take into account the following factors:
1. Arbitrarily changing SFNs will break existing wineprefixes that save SFNs in configuration files and registry. At minimum, we should have a soft-rolling mechanism that keeps the old hash algorithm for existing prefixes.
2. Changing the discriminator length merely reduces the probability of collision, but does not fundamentally eliminate it. Consider the following table (generated with [sfn_birthday_paradox.py](/uploads/c2d0766f51ec86b96652c0b063122efb/sfn_birthday_paradox.py)):
| # of possible hashes | maximum # of files until collision of ~50% probability | | -------------------: | -----------------------------------------------------: | | 2<sup>15</sup> | 214 | | 2<sup>16</sup> | 302 | | 2<sup>17</sup> | 427 | | 2<sup>18</sup> | 604 | | 2<sup>19</sup> | 853 |
2<sup>18</sup> is undeniably an improvement, sure, but it does not rule out the possibility that someone else will encounter a similar problem in the future. What would we do in that case?
3. Using arbitrary charset not used by Windows may lead to unforeseen compatibility problems.
On Wed Nov 1 16:55:24 2023 +0000, Nikolay Sivov wrote:
Is it possible to use something closely compatible with how Windows does it? Maybe Samba has a clue about that. We'll need something that's patent free obviously.
From what I understand [Samba generates short filenames similary](https://github.com/samba-team/samba/blob/de20ee1adad4a6b26e07a6cf1ac89819cea...) to how wine does i.e. by creating hash (albeit with [different hashing function](https://github.com/samba-team/samba/blob/de20ee1adad4a6b26e07a6cf1ac89819cea...)) and only based on file name and not other files in the directory, which is different to [Windows where SFNs are influenced by other files in the directory](https://en.wikipedia.org/wiki/8.3_filename).