On May 31, 2014, at 10:28 AM, Sebastian Lackner wrote:
Explanation: The basic idea of this patch is that we have a global counter 'fd_cache_epoch' which is incremented whenever the file descriptor cache is updated. get_cached_fd() first reads the counter, then the fd entry, and afterwards the counter again - if the counter has changed, we know that another thread may have modified the content, and we have to read it again. The memory barrier is there to ensure that compilers don't optimize away the two reads of 'fd_cache_epoch'. Please note that:
- the order of the write instructions when updating the content is important - this way we ensure that threads only see either (0, *) = not set, or a valid entry.
- I assume that reading/writing up to the pointer size can be done as an atomic operation (threads either see the old or the new value, but no partial updates). This assumption is also used in a lot of other parts of Wine, for example here: http://source.winehq.org/source/dlls/ntdll/sync.c#L1295 [ It would probably be safer to introduce special interlocked_get/interlocked_set macros/functions for that, but I'm not sure what other people think about that... ]
Thanks for working on this. Interesting stuff.
I think there are still some issues.
In order to ensure a consistent view of the ordering of memory operations, memory barriers have to be used in both the writer and the reader. The writer issues a write barrier after it has written and the reader issues a read barrier before it reads.
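In C11 terms, just to make the pairing explicit (illustration only; the names here are made up for the example, and the Wine code would of course use its own primitives):

    #include <stdatomic.h>

    static int data;          /* the value being published */
    static atomic_int ready;  /* the flag the reader checks */

    static void writer( int value )
    {
        data = value;
        atomic_thread_fence( memory_order_release );              /* write barrier after writing the data */
        atomic_store_explicit( &ready, 1, memory_order_relaxed ); /* publish */
    }

    static int reader( void )
    {
        if (atomic_load_explicit( &ready, memory_order_relaxed ))
        {
            atomic_thread_fence( memory_order_acquire );          /* read barrier before reading the data */
            return data;
        }
        return -1;  /* not published yet */
    }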
The appropriate barriers are implicit in interlocked_xchg_add() for incrementing fd_cache_epoch, but the first read of it in get_cached_fd() has no read barrier before it. I think this is benign, though, because I think this can only result in the code thinking the epoch changed when it hasn't. It may retry unnecessarily.
More problematic is the writing of the fd and its attributes. There are barriers implicit in the interlocked_xchg() which clears the fd, but there's no write barrier after the final write to it. There's no read barrier before the first read of them.
You mention the order of writing, attributes first then fd, as being important. But there's no barrier in between the two operations, so either the compiler or the CPU could reorder either the writes or the reads, thwarting your design. The compiler can also reorder the reading of the fd_cache_epoch with respect to the reading of the fd or attributes. I think you can get this as the effective sequence:
thread1 / get_cached_fd / read fd, get old fd
thread2 / add_fd_to_cache / clear fd
thread2 / add_fd_to_cache / increment epoch
thread2 / add_fd_to_cache / set fd
thread1 / get_cached_fd / read epoch
thread1 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / get_cached_fd / return old fd
I think it's sufficient to have a write barrier after writing the fd and then a read barrier just before reading it. I _think_ that this obviates the need for the barrier at the bottom of the do-while loop. That is, you can just move that barrier up a line. I'm not 100% confident of my analysis of that. (Well, frankly, I'm never 100% certain of my analysis of this sort of code.)
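Concretely, here is roughly where I mean the barriers to go. This is an untested sketch from memory of your patch; 'entry', write_barrier() and read_barrier() are placeholders for whatever the patch actually uses:

    /* add_fd_to_cache(), simplified: */
    interlocked_xchg( &entry->fd, -1 );          /* clear the fd (mark not set); barriers implicit */
    interlocked_xchg_add( &fd_cache_epoch, 1 );  /* bump the epoch; barriers implicit */
    entry->attributes = attr;
    entry->fd = fd;
    write_barrier();                             /* write barrier after the final write to the fd */

    /* get_cached_fd(), simplified: */
    do
    {
        epoch = fd_cache_epoch;
        read_barrier();                          /* the existing barrier moved up: just before reading the fd */
        fd   = entry->fd;
        attr = entry->attributes;
    } while (epoch != fd_cache_epoch);

Again, I'm not sure the barrier at the bottom of the loop can really be dropped; the sketch just shows the placement I have in mind.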
In the email thread regarding Daniel Horn's patch, I suggested an approach using interlocked_cmpxchg() to check if the fd had changed after reading the attributes. Was there a reason you rejected that approach? You mentioned an approach you experimented with using interlocked_cmpxchg64(). What was that about? Were you treating the fd and attributes combined as a single 64-bit value and then checking them both? Was that safer than my suggestion for some reason?
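For reference, what I had suggested with interlocked_cmpxchg() was along these lines (paraphrased from memory, untested; the entry layout is again a placeholder):

    /* get_cached_fd(), alternative sketch: re-check the fd with a
     * compare-exchange of the same value after reading the attributes,
     * relying on the barrier semantics of the interlocked operation. */
    for (;;)
    {
        fd   = entry->fd;
        attr = entry->attributes;
        if (interlocked_cmpxchg( &entry->fd, fd, fd ) == fd) break;  /* fd unchanged since we read it */
        /* the fd changed underneath us; reload and retry */
    }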
If the only advantage is a slight performance improvement (especially in light of the huge difference that either approach makes relative to the status quo, judging by your chart), then I'd argue for the simpler approach that's more easily analyzed for correctness.
If the above issue is addressed, that leaves one remaining problem: get_cached_fd() can encounter a false negative:
thread1 / add_fd_to_cache / set fd to -1
thread1 / add_fd_to_cache / increment epoch
thread2 / get_cached_fd / read epoch
thread2 / get_cached_fd / read attributes
thread2 / get_cached_fd / read fd, gets -1
thread2 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / add_fd_to_cache / set attributes
thread1 / add_fd_to_cache / set fd
thread2 / get_cached_fd / return -1
thread2 / server_get_unix_fd / see get_cached_fd as having failed since it returns -1
thread2 / server_get_unix_fd / do get_handle_fd server request
thread2 / server_get_unix_fd / call add_fd_to_cache to cache the new fd
I think you want to try get_cached_fd() again, inside the critical section, if the first call fails.
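That is, something like this in server_get_unix_fd() (structural sketch only; the locking helpers and argument lists here are from memory and may not match the patch exactly):

    fd = get_cached_fd( handle, &type, &access, &options );
    if (fd == -1)
    {
        server_enter_uninterrupted_section( &fd_cache_section, &sigset );
        /* retry under the lock: another thread may have been in the middle
         * of add_fd_to_cache() when the lock-free lookup above ran */
        fd = get_cached_fd( handle, &type, &access, &options );
        if (fd == -1)
        {
            /* only now fall back to the get_handle_fd server request
             * and add_fd_to_cache(), as before */
        }
        server_leave_uninterrupted_section( &fd_cache_section, &sigset );
    }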
-Ken