On May 31, 2014, at 10:28 AM, Sebastian Lackner wrote:
Explanation: The basic idea of this patch is that we have a global counter 'fd_cache_epoch' which is incremented whenever the file descriptor cache is updated. get_cached_fd() first reads the counter, then the fd entry, and afterwards the counter again - if the counter has changed, we know that a second thread may have modified the content, and we have to read it again. The memory barrier is there to ensure that compilers don't optimize away the two reads of 'fd_cache_epoch'. Please note that:
- The order of the write instructions when updating the content is important - this way we ensure that threads only see either (0, *) = not set, or a valid entry.
- I assume that reading/writing up to the pointer size can be done as an atomic operation (threads either see the old or the new value, but no partial updates). This assumption is also used in many other parts of Wine, for example here: http://source.winehq.org/source/dlls/ntdll/sync.c#L1295 [ It would probably be safer to introduce special interlocked_get/interlocked_set macros/functions for that, but I'm not sure what other people think about that... ]
Thanks for working on this. Interesting stuff.
I think there are still some issues.
In order to ensure a consistent view of the ordering of memory operations, memory barriers have to be used in both the writer and the reader. The writer issues a write barrier after it has written and the reader issues a read barrier before it reads.
The appropriate barriers are implicit in interlocked_xchg_add() for incrementing fd_cache_epoch, but the first read of it in get_cached_fd() has no read barrier before it. I think this is benign, though, because I think this can only result in the code thinking the epoch changed when it hasn't. It may retry unnecessarily.
More problematic is the writing of the fd and its attributes. There are barriers implicit in the interlocked_xchg() which clears the fd, but there's no write barrier after the final write to it. There's no read barrier before the first read of them.
You mention the order of writing, attributes first then fd, as being important. But there's no barrier in between the two operations, so either the compiler or the CPU could reorder either the writes or the reads, thwarting your design. The compiler can also reorder the reading of the fd_cache_epoch with respect to the reading of the fd or attributes. I think you can get this as the effective sequence:
thread1 / get_cached_fd / read fd, get old fd
thread2 / add_fd_to_cache / clear fd
thread2 / add_fd_to_cache / increment epoch
thread2 / add_fd_to_cache / set fd
thread1 / get_cached_fd / read epoch
thread1 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / get_cached_fd / return old fd
I think it's sufficient to have a write barrier after writing the fd and then a read barrier just before reading it. I _think_ that this obviates the need for the barrier at the bottom of the do-while loop. That is, you can just move that barrier up a line. I'm not 100% confident of my analysis of that. (Well, frankly, I'm never 100% certain of my analysis of this sort of code.)
In the email thread regarding Daniel Horn's patch, I suggested an approach using interlocked_cmpxchg() to check if the fd had changed after reading the attributes. Was there a reason you rejected that approach? You mentioned an approach you experimented with using interlocked_cmpxchg64(). What was that about? Were you treating the fd and attributes combined as a single 64-bit value and then checking them both? Was that safer than my suggestion for some reason?
If the only advantage is a slight performance improvement (especially in light of the huge difference that either approach has to the status quo, judging by your chart), then I'd argue for the simpler approach that's more easily analyzed for correctness.
If the above issue is addressed, that leaves one remaining problem: get_cached_fd() can encounter a false negative:
thread1 / add_fd_to_cache / set fd to -1
thread1 / add_fd_to_cache / increment epoch
thread2 / get_cached_fd / read epoch
thread2 / get_cached_fd / read attributes
thread2 / get_cached_fd / read fd, gets -1
thread2 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / add_fd_to_cache / set attributes
thread1 / add_fd_to_cache / set fd
thread2 / get_cached_fd / return -1
thread2 / server_get_unix_fd / see get_cached_fd as having failed since it returns -1
thread2 / server_get_unix_fd / do get_handle_fd server request
thread2 / server_get_unix_fd / call add_fd_to_cache to cache the new fd
I think you want to try get_cached_fd() again, inside the critical section, if the first call fails.
-Ken
Hi Ken,
thanks for taking a look at it and for the detailed answer.
[...]
Thanks for working on this. Interesting stuff.
I think there are still some issues.
In order to ensure a consistent view of the ordering of memory operations, memory barriers have to be used in both the writer and the reader. The writer issues a write barrier after it has written and the reader issues a read barrier before it reads.
The appropriate barriers are implicit in interlocked_xchg_add() for incrementing fd_cache_epoch, but the first read of it in get_cached_fd() has no read barrier before it. I think this is benign, though, because I think this can only result in the code thinking the epoch changed when it hasn't. It may retry unnecessarily.
get_cached_fd() will only have to run multiple iterations if a second thread updated the cache in the meantime. As a cache hit is much more likely, I would assume it's okay if get_cached_fd() occasionally retries unnecessarily - but of course this depends on the exact program you're running.
More problematic is the writing of the fd and its attributes. There are barriers implicit in the interlocked_xchg() which clears the fd, but there's no write barrier after the final write to it. There's no read barrier before the first read of them.
You mention the order of writing, attributes first then fd, as being important. But there's no barrier in between the two operations, so either the compiler or the CPU could reorder either the writes or the reads, thwarting your design. The compiler can also reorder the reading of the fd_cache_epoch with respect to the reading of the fd or attributes. I think you can get this as the effective sequence:
thread1 / get_cached_fd / read fd, get old fd
thread2 / add_fd_to_cache / clear fd
thread2 / add_fd_to_cache / increment epoch
thread2 / add_fd_to_cache / set fd
thread1 / get_cached_fd / read epoch
thread1 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / get_cached_fd / return old fd
I think it's sufficient to have a write barrier after writing the fd and then a read barrier just before reading it. I _think_ that this obviates the need for the barrier at the bottom of the do-while loop. That is, you can just move that barrier up a line. I'm not 100% confident of my analysis of that. (Well, frankly, I'm never 100% certain of my analysis of this sort of code.)
I have to agree that this is a bit tricky - but I don't think that moving the line one up would completely solve it. The compiler could still reorder (read epoch) and (read attributes), which could result in using wrong attributes. Moreover, if there is no barrier at the original place, it could still reorder (read epoch 2) and (read fd). To be completely sure it would be necessary to add compiler memory barriers everywhere in between... which would make the code look very ugly... xD
Nevertheless, the example above couldn't happen that way: As add_fd_to_cache() is only used to set the file descriptor, not to replace it, the old fd would be (-1) in this case. ;)
In the email thread regarding Daniel Horn's patch, I suggested an approach using interlocked_cmpxchg() to check if the fd had changed after reading the attributes. Was there a reason you rejected that approach? You mentioned an approach you experimented with using interlocked_cmpxchg64(). What was that about? Were you treating the fd and attributes combined as a single 64-bit value and then checking them both? Was that safer than my suggestion for some reason?
If the only advantage is a slight performance improvement (especially in light of the huge difference that either approach has to the status quo, judging by your chart), then I'd argue for the simpler approach that's more easily analyzed for correctness.
My approach indeed treats both the fd and the attributes as a single 64-bit number. This way we can read/write both fd and attributes at the same time, which definitely solves all the multithreading issues. I still think this is probably the best way to do it - with the only disadvantage that it doesn't work on PowerPC. :/ I have attached a patch showing this approach to this mail.
Using interlocked_cmpxchg() to verify only the fd doesn't seem to be safe to me:
thread1 / get_cached_fd / read fd and attributes
thread2 / release old handle
thread2 / allocate new object with the same handle (unlikely, but ...)
thread1 / get_cached_fd / compare fd one more time -> looks like everything is okay
This basically means that applications could end up with wrong attributes associated with a specific file handle. I know that this is very unlikely (especially because handles contain the generation in the upper 16 bits), but it is still possible... Or is there any reason why we can be sure this doesn't happen?
If the above issue is addressed, that leaves one remaining problem: get_cached_fd() can encounter a false negative:
thread1 / add_fd_to_cache / set fd to -1
thread1 / add_fd_to_cache / increment epoch
thread2 / get_cached_fd / read epoch
thread2 / get_cached_fd / read attributes
thread2 / get_cached_fd / read fd, gets -1
thread2 / get_cached_fd / check if epoch has changed, conclude it hasn't
thread1 / add_fd_to_cache / set attributes
thread1 / add_fd_to_cache / set fd
thread2 / get_cached_fd / return -1
thread2 / server_get_unix_fd / see get_cached_fd as having failed since it returns -1
thread2 / server_get_unix_fd / do get_handle_fd server request
thread2 / server_get_unix_fd / call add_fd_to_cache to cache the new fd
I think you want to try get_cached_fd() again, inside the critical section, if the first call fails.
This should be fixed in try 2.
- Replace interlocked_xchg() with an assert(). The original code makes it look as if it were valid for the fd to be set at this point, but in fact this would mean releasing a file descriptor which is still in use.
I'll leave it to others to comment on that.
Replacing with an assert() also saves a couple of CPU cycles.
I understand that this is an attempt to improve performance, but I don't think it's necessary to squeeze every last CPU cycle out of this code.
This was not done to improve performance, but to detect errors more easily. If the fd is already set, it always means that something is going wrong. The current code hides this kind of error by just closing the file handle and replacing it - that was also the reason why I assumed one call to get_cached_fd() is sufficient: it looks like the code already handles the case where the fd is set, but in fact it closes a handle which is still in use...
-Ken
What do you think about the alternative (slightly slower, but much easier to verify) solution attached to this mail?
Regards, Sebastian
Hi,
On May 31, 2014, at 2:13 PM, Sebastian Lackner wrote:
What do you think about the alternative (slightly slower, but much easier to verify) solution attached to this mail?
It looks good except for one issue. The assert() macro should never contain meaningful work. If NDEBUG is defined, the whole macro becomes (void)0. The expression is not evaluated.
As expected, the thread-safety of this approach is much more easily analyzed and looks good to me.
The ugliness of the PowerPC support is unfortunate, but probably still less ugly than the other approach with all the necessary barriers inserted.
-Ken