On May 31, 2014, at 10:28 AM, Sebastian Lackner wrote:
Explanation: The basic idea of this patch is that we have a global counter 'fd_cache_epoch' which is incremented whenever the file descriptor cache is updated. get_cached_fd() first reads the counter, then the fd entry, and afterwards the counter again - if the counter has changed, we know that a second thread may have modified the content, and we have to read it again. The memory barrier is there to ensure that compilers don't optimize away the two reads of 'fd_cache_epoch'. Please note that:
- The order of the write instructions when updating the content is important - this way we ensure that threads only see either (0, *) = not set, or a valid entry.
- I assume that reading/writing up to the pointer size can be done as an atomic operation (threads either see the old or the new value, but no partial updates). This assumption is also used in many other parts of Wine, for example here: http://source.winehq.org/source/dlls/ntdll/sync.c#L1295 [ It would probably be safer to introduce special interlocked_get/interlocked_set macros/functions for that, but I'm not sure what other people think about that... ]
Thanks for working on this. Interesting stuff.
I think there are still some issues.
In order to ensure a consistent view of the ordering of memory operations, memory barriers have to be used in both the writer and the reader. The writer issues a write barrier after it has written and the reader issues a read barrier before it reads.
The appropriate barriers are implicit in interlocked_xchg_add() for incrementing fd_cache_epoch, but the first read of it in get_cached_fd() has no read barrier before it. I think this is benign, though, because I think this can only result in the code thinking the epoch changed when it hasn't. It may retry unnecessarily.
More problematic is the writing of the fd and its attributes. There are barriers implicit in the interlocked_xchg() which clears the fd, but there's no write barrier after the final write to it. There's no read barrier before the first read of them.
You mention the order of writing, attributes first then fd, as being important. But there's no barrier in between the two operations, so either the compiler or the CPU could reorder either the writes or the reads, thwarting your design. The compiler can also reorder the reading of the fd_cache_epoch with respect to the reading of the fd or attributes. I think you can get this as the effective sequence:
thread1 / get_cached_fd / read fd, get old fd
thread2 / add_fd_to_cache / clear fd
thread2 / add_fd_to_cache / increment epoch
thread2 / add_fd_to_cache / set fd
thread1 / get_cached_fd / read epoch
thread1 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / get_cached_fd / return old fd
I think it's sufficient to have a write barrier after writing the fd and then a read barrier just before reading it. I _think_ that this obviates the need for the barrier at the bottom of the do-while loop. That is, you can just move that barrier up a line. I'm not 100% confident of my analysis of that. (Well, frankly, I'm never 100% certain of my analysis of this sort of code.)
In the email thread regarding Daniel Horn's patch, I suggested an approach using interlocked_cmpxchg() to check if the fd had changed after reading the attributes. Was there a reason you rejected that approach? You mentioned an approach you experimented with using interlocked_cmpxchg64(). What was that about? Were you treating the fd and attributes combined as a single 64-bit value and then checking them both? Was that safer than my suggestion for some reason?
If the only advantage is a slight performance improvement (especially in light of the huge difference that either approach has to the status quo, judging by your chart), then I'd argue for the simpler approach that's more easily analyzed for correctness.
If the above issue is addressed, that leaves one remaining problem: get_cached_fd() can encounter a false negative:
thread1 / add_fd_to_cache / set fd to -1
thread1 / add_fd_to_cache / increment epoch
thread2 / get_cached_fd / read epoch
thread2 / get_cached_fd / read attributes
thread2 / get_cached_fd / read fd, gets -1
thread2 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / add_fd_to_cache / set attributes
thread1 / add_fd_to_cache / set fd
thread2 / get_cached_fd / return -1
thread2 / server_get_unix_fd / see get_cached_fd as having failed since it returns -1
thread2 / server_get_unix_fd / do get_handle_fd server request
thread2 / server_get_unix_fd / call add_fd_to_cache to cache the new fd
I think you want to try get_cached_fd() again, inside the critical section, if the first call fails.
-Ken
Hi Ken,
thanks for taking a look at it and for the detailed answer.
[...]
Thanks for working on this. Interesting stuff.
I think there are still some issues.
In order to ensure a consistent view of the ordering of memory operations, memory barriers have to be used in both the writer and the reader. The writer issues a write barrier after it has written and the reader issues a read barrier before it reads.
The appropriate barriers are implicit in interlocked_xchg_add() for incrementing fd_cache_epoch, but the first read of it in get_cached_fd() has no read barrier before it. I think this is benign, though, because I think this can only result in the code thinking the epoch changed when it hasn't. It may retry unnecessarily.
get_cached_fd() will only have to run multiple iterations if a second thread updated the cache in the meantime. As a cache hit is much more likely, I would assume it's okay if get_cached_fd() occasionally retries unnecessarily - but of course this depends on the exact program you're running.
More problematic is the writing of the fd and its attributes. There are barriers implicit in the interlocked_xchg() which clears the fd, but there's no write barrier after the final write to it. There's no read barrier before the first read of them.
You mention the order of writing, attributes first then fd, as being important. But there's no barrier in between the two operations, so either the compiler or the CPU could reorder either the writes or the reads, thwarting your design. The compiler can also reorder the reading of the fd_cache_epoch with respect to the reading of the fd or attributes. I think you can get this as the effective sequence:
thread1 / get_cached_fd / read fd, get old fd
thread2 / add_fd_to_cache / clear fd
thread2 / add_fd_to_cache / increment epoch
thread2 / add_fd_to_cache / set fd
thread1 / get_cached_fd / read epoch
thread1 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / get_cached_fd / return old fd
I think it's sufficient to have a write barrier after writing the fd and then a read barrier just before reading it. I _think_ that this obviates the need for the barrier at the bottom of the do-while loop. That is, you can just move that barrier up a line. I'm not 100% confident of my analysis of that. (Well, frankly, I'm never 100% certain of my analysis of this sort of code.)
I have to agree that this is a bit tricky - but I don't think that moving the line one up would completely solve it. The compiler could still reorder (read epoch) and (read attributes), which could result in using wrong attributes. Moreover, if there is no barrier at the original place, it could still reorder (read epoch 2) and (read fd). To be completely sure it would be necessary to add compiler memory barriers everywhere in between... which would make the code look very ugly... xD
Nevertheless, the example above couldn't happen that way: As add_fd_to_cache() is only used to set the file descriptor, not to replace it, the old fd would be (-1) in this case. ;)
In the email thread regarding Daniel Horn's patch, I suggested an approach using interlocked_cmpxchg() to check if the fd had changed after reading the attributes. Was there a reason you rejected that approach? You mentioned an approach you experimented with using interlocked_cmpxchg64(). What was that about? Were you treating the fd and attributes combined as a single 64-bit value and then checking them both? Was that safer than my suggestion for some reason?
If the only advantage is a slight performance improvement (especially in light of the huge difference that either approach has to the status quo, judging by your chart), then I'd argue for the simpler approach that's more easily analyzed for correctness.
My approach indeed treats both the fd and the attributes as a single 64-bit number. This way we can read/write both fd and attributes at the same time, which definitely solves all the multithreading issues. I still think this is probably the best way to do it - with the only disadvantage that it doesn't work on PowerPC. :/ I have attached a patch showing this approach to this mail.
Using interlocked_cmpxchg() to verify only the fd doesn't seem to be safe to me:
thread1 / get_cached_fd / read fd and attributes
thread2 / release old handle
thread2 / allocate new object with the same handle (unlikely, but ...)
thread1 / get_cached_fd / compare fd one more time -> looks like everything is okay
This basically means that applications could end up with wrong attributes associated with a specific file handle. I know that this is very unlikely (especially because handles contain the generation in the upper 16 bits), but it is still possible... Or is there any reason why we can be sure this doesn't happen?
If the above issue is addressed, that leaves one remaining problem: get_cached_fd() can encounter a false negative:
thread1 / add_fd_to_cache / set fd to -1
thread1 / add_fd_to_cache / increment epoch
thread2 / get_cached_fd / read epoch
thread2 / get_cached_fd / read attributes
thread2 / get_cached_fd / read fd, gets -1
thread2 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / add_fd_to_cache / set attributes
thread1 / add_fd_to_cache / set fd
thread2 / get_cached_fd / return -1
thread2 / server_get_unix_fd / see get_cached_fd as having failed since it returns -1
thread2 / server_get_unix_fd / do get_handle_fd server request
thread2 / server_get_unix_fd / call add_fd_to_cache to cache the new fd
I think you want to try get_cached_fd() again, inside the critical section, if the first call fails.
This should be fixed in try 2.
- Replace interlocked_xchg() with an assert(). The original code makes it look as if it were valid for the fd to be set at this point, but in fact this would mean releasing a file descriptor which is still in use.
I'll leave it to others to comment on that.
Replacing with an assert() also saves a couple of CPU cycles.
I understand that this is an attempt to improve performance, but I don't think it's necessary to squeeze every last CPU cycle out of this code.
This was not done to improve performance, but to detect errors more easily. If the fd is already set, it always means that something is going wrong. The current code hides this kind of error by just closing the file handle and replacing it - that was also the reason why I assumed one call to get_cached_fd() is sufficient: it looks like the code already handles the case where the fd is set, but in fact it closes a handle which is still in use...
-Ken
What do you think about the alternative (slightly slower, but much easier to verify) solution attached to this mail?
Regards, Sebastian
Hi,
On May 31, 2014, at 2:13 PM, Sebastian Lackner wrote:
What do you think about the alternative (slightly slower, but much easier to verify) solution attached to this mail?
It looks good except for one issue. The assert() macro should never contain meaningful work. If NDEBUG is defined, the whole macro becomes (void)0. The expression is not evaluated.
As expected, the thread-safety of this approach is much more easily analyzed and looks good to me.
The ugliness of the PowerPC support is unfortunate, but probably still less ugly than the other approach with all the necessary barriers inserted.
-Ken