[PATCH v2 0/1] MR10091: server: optimize mem allocation on NUMA platforms
This commit fixes the server ignoring data locality (the distance to the nearest processor cache) on multi-CPU platforms. By default it selects a mask covering all CPU sockets and allocates memory locally on first touch.

References:
- https://www.kernel.org/doc/html/v5.6/vm/numa.html
- https://hpc-wiki.info/hpc/NUMA
- https://man7.org/linux/man-pages/man3/numa.3.html
- https://en.wikipedia.org/wiki/Non-uniform_memory_access
- https://stackoverflow.com/questions/8154162/numa-aware-cache-aligned-memory-...

The Raspberry Pi gets better memory performance when fake NUMA nodes are enabled through Linux kernel parameters.
- https://www.phoronix.com/news/ARM64-NUMA-Emulation-RPi5
- https://www.jeffgeerling.com/blog/2024/numa-emulation-speeds-pi-5-and-other-...

The PlayStation 3 used NUMA but was a commercial failure. As large monolithic dies become increasingly complex to develop, multi-die chiplet systems will become more cost-effective, so this is becoming more and more relevant, although most NUMA devices are still server platforms (Intel Xeon, AMD Epyc, Ampere Altra).
- https://www.cs.york.ac.uk/rts/docs/ESWEEK-2007-tutorials/PS3_cell_tutorial.p...

My topology: [image]

--
v2: server: optimize mem allocation on NUMA platforms

https://gitlab.winehq.org/wine/wine/-/merge_requests/10091
From: Herman Semenoff <GermanAizek@yandex.ru>

This commit fixes the server ignoring data locality (the distance to the nearest processor cache) on multi-CPU platforms. By default it selects a mask covering all CPU sockets (interleave) and allocates memory locally on first touch.

References:
https://www.kernel.org/doc/html/v5.6/vm/numa.html
https://hpc-wiki.info/hpc/NUMA
https://man7.org/linux/man-pages/man3/numa.3.html
https://en.wikipedia.org/wiki/Non-uniform_memory_access
https://stackoverflow.com/questions/8154162/numa-aware-cache-aligned-memory-...
---
 server/file.c    | 28 ++++++++++++++++++---
 server/main.c    | 11 +++++++++
 server/object.c  | 21 +++++++++++++---
 server/request.c | 26 +++++++++++++++++---
 server/thread.c  | 13 ++++++++++
 server/unicode.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 6 files changed, 141 insertions(+), 22 deletions(-)

diff --git a/server/file.c b/server/file.c
index cc5acc2aadc..7539fc5b8d3 100644
--- a/server/file.c
+++ b/server/file.c
@@ -36,6 +36,9 @@
 #include <utime.h>
 #endif
 #include <poll.h>
+#ifdef HAVE_NUMA_H
+#include <numa.h>
+#endif
 
 #include "ntstatus.h"
 #define WIN32_NO_STATUS
@@ -280,7 +283,12 @@ static struct object *create_file( struct fd *root, const char *nameptr, data_si
     release_object( fd );
 
 done:
-    free( name );
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+        numa_free(name, len + 1);
+    else
+#endif
+        free( name );
     return obj;
 }
 
@@ -411,7 +419,12 @@ static struct security_descriptor *file_get_sd( struct object *obj )
 
     file->mode = st.st_mode;
     file->uid = st.st_uid;
-    free( obj->sd );
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+        numa_free(obj->sd, sizeof(struct security_descriptor));
+    else
+#endif
+        free( obj->sd );
     obj->sd = sd;
     return sd;
 }
@@ -561,15 +574,22 @@ static struct object *file_open_file( struct object *obj, unsigned int access,
     struct object *new_file = NULL;
     struct unicode_str nt_name;
     char *unix_name;
+    size_t unix_name_len;
 
     assert( obj->ops == &file_ops );
 
     if ((unix_name = dup_fd_name( file->fd, "" )))
     {
+        unix_name_len = strlen(unix_name);
         get_nt_name( file->fd, &nt_name );
-        new_file = create_file( NULL, unix_name, strlen(unix_name), nt_name, access,
+        new_file = create_file( NULL, unix_name, unix_name_len, nt_name, access,
                                 sharing, FILE_OPEN, options, 0, NULL );
-        free( unix_name );
+#ifdef HAVE_NUMA_H
+        if (numa_available() != -1)
+            numa_free(unix_name, unix_name_len + 1);
+        else
+#endif
+            free( unix_name );
     }
     else set_error( STATUS_OBJECT_TYPE_MISMATCH );
     return new_file;
diff --git a/server/main.c b/server/main.c
index 46419a09cb4..96314cd9a37 100644
--- a/server/main.c
+++ b/server/main.c
@@ -34,6 +34,9 @@
 #ifdef HAVE_SYS_SYSCTL_H
 # include <sys/sysctl.h>
 #endif
+#ifdef HAVE_NUMA_H
+# include <numa.h>
+#endif
 
 #include "object.h"
 #include "file.h"
@@ -256,6 +259,14 @@ int main( int argc, char *argv[] )
     signal( SIGABRT, sigterm_handler );
 
     init_limits();
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+    {
+        if (debug_level) fprintf( stderr, "wineserver: NUMA is available\n" );
+        numa_set_interleave_mask(numa_all_nodes_ptr);
+    }
+#endif
+
     sock_init();
     open_master_socket();
 
diff --git a/server/object.c b/server/object.c
index 694835a6a51..6b479363abc 100644
--- a/server/object.c
+++ b/server/object.c
@@ -28,6 +28,9 @@
 #include <unistd.h>
 #include <stdarg.h>
 #include <sys/types.h>
+#ifdef HAVE_NUMA_H
+#include <numa.h>
+#endif
 #ifdef HAVE_VALGRIND_MEMCHECK_H
 #include <valgrind/memcheck.h>
 #endif
@@ -221,7 +224,13 @@ void mark_block_uninitialized( void *ptr, size_t size )
 /* malloc replacement */
 void *mem_alloc( size_t size )
 {
-    void *ptr = malloc( size );
+    void *ptr;
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+        ptr = numa_alloc_onnode(size, numa_node_of_cpu(sched_getcpu()));
+    else
+#endif
+        ptr = malloc(size);
     if (ptr) mark_block_uninitialized( ptr, size );
     else set_error( STATUS_NO_MEMORY );
     return ptr;
@@ -230,9 +239,8 @@ void *mem_alloc( size_t size )
 /* duplicate a block of memory */
 void *memdup( const void *data, size_t len )
 {
-    void *ptr = malloc( len );
+    void *ptr = mem_alloc( len );
     if (ptr) memcpy( ptr, data, len );
-    else set_error( STATUS_NO_MEMORY );
     return ptr;
 }
 
@@ -331,7 +339,12 @@ static void free_object( struct object *obj )
     list_remove( &obj->obj_list );
     memset( obj, 0xaa, obj->ops->size );
 #endif
-    free( obj );
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+        numa_free(obj, obj->ops->size);
+    else
+#endif
+        free( obj );
 }
 
 /* find an object by name starting from the specified root */
diff --git a/server/request.c b/server/request.c
index 432a5918892..6be9e8fa733 100644
--- a/server/request.c
+++ b/server/request.c
@@ -44,6 +44,9 @@
 #endif
 #include <unistd.h>
 #include <poll.h>
+#ifdef HAVE_NUMA_H
+#include <numa.h>
+#endif
 #ifdef __APPLE__
 # include <mach/mach_time.h>
 #endif
@@ -232,7 +235,12 @@ void write_reply( struct thread *thread )
     {
         if (!(thread->reply_towrite -= ret))
         {
-            free( thread->reply_data );
+#ifdef HAVE_NUMA_H
+            if (numa_available() != -1)
+                numa_free(thread->reply_data, thread->reply_size);
+            else
+#endif
+                free( thread->reply_data );
             thread->reply_data = NULL;
             /* sent everything, can go back to waiting for requests */
             set_fd_events( thread->request_fd, POLLIN );
@@ -275,7 +283,12 @@ static void send_reply( union generic_reply *reply )
             return;
         }
     }
-    free( current->reply_data );
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+        numa_free(current->reply_data, current->reply_size);
+    else
+#endif
+        free( current->reply_data );
     current->reply_data = NULL;
     return;
 
@@ -339,7 +352,7 @@ void read_request( struct thread *thread )
             call_req_handler( thread );
             return;
         }
-        if (!(thread->req_data = malloc( thread->req_toread )))
+        if (!(thread->req_data = mem_alloc( thread->req_toread )))
         {
             fatal_protocol_error( thread, "no memory for %u bytes request %d\n",
                                   thread->req_toread, thread->req.request_header.req );
@@ -358,7 +371,12 @@ void read_request( struct thread *thread )
         if (!(thread->req_toread -= ret))
         {
             call_req_handler( thread );
-            free( thread->req_data );
+#ifdef HAVE_NUMA_H
+            if (numa_available() != -1)
+                numa_free(thread->req_data, thread->req.request_header.request_size);
+            else
+#endif
+                free( thread->req_data );
             thread->req_data = NULL;
             return;
         }
diff --git a/server/thread.c b/server/thread.c
index 3aed496450a..182c7f4d31f 100644
--- a/server/thread.c
+++ b/server/thread.c
@@ -40,6 +40,9 @@
 #ifdef HAVE_SYS_RESOURCE_H
 #include <sys/resource.h>
 #endif
+#ifdef HAVE_NUMA_H
+#include <numa.h>
+#endif
 #ifdef __APPLE__
 #include <mach/mach_init.h>
 #include <mach/mach_time.h>
@@ -545,6 +548,16 @@ struct thread *create_thread( int fd, struct process *process, const struct secu
     thread->disable_boost = process->disable_boost;
     if (!current) current = thread;
 
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+    {
+        static int next_node = 0;
+        int max_nodes = numa_max_node();
+        if (next_node > max_nodes) next_node = 0;
+        numa_run_on_node(next_node++);
+    }
+#endif
+
     list_add_tail( &thread_list, &thread->entry );
 
     if (sd && !set_sd_defaults_from_token( &thread->obj, sd,
diff --git a/server/unicode.c b/server/unicode.c
index bb39b55e50c..c3ce90e8b53 100644
--- a/server/unicode.c
+++ b/server/unicode.c
@@ -29,6 +29,9 @@
 #ifdef HAVE_SYS_SYSCTL_H
 # include <sys/sysctl.h>
 #endif
+#ifdef HAVE_NUMA_H
+#include <numa.h>
+#endif
 #ifdef __APPLE__
 # include <mach-o/dyld.h>
 #endif
@@ -244,6 +247,7 @@ static char *build_relative_path( const char *base, const char *from, const char
     const char *start;
     char *ret;
     unsigned int dotdots = 0;
+    size_t ret_len;
 
     for (;;)
     {
@@ -265,7 +269,8 @@ static char *build_relative_path( const char *base, const char *from, const char
         break;
     }
 
-    ret = malloc( strlen(base) + 3 * dotdots + strlen(start) + 2 );
+    ret_len = strlen(base) + 3 * dotdots + strlen(start) + 2;
+    ret = mem_alloc( ret_len );
     strcpy( ret, base );
     while (dotdots--) strcat( ret, "/.." );
 
@@ -278,39 +283,63 @@ static char *build_relative_path( const char *base, const char *from, const char
 static char *get_nls_dir(void)
 {
     char *p, *dir, *ret;
+#ifdef HAVE_NUMA_H
+    size_t dir_len;
+#endif
 
 #if defined(__linux__) || defined(__FreeBSD_kernel__) || defined(__NetBSD__)
     dir = realpath( "/proc/self/exe", NULL );
 #elif defined (__FreeBSD__) || defined(__DragonFly__)
     static int pathname[] = { CTL_KERN, KERN_PROC, KERN_PROC_PATHNAME, -1 };
     size_t dir_size = PATH_MAX;
-    dir = malloc( dir_size );
+    dir = mem_alloc( dir_size );
     if (dir)
     {
         if (sysctl( pathname, ARRAY_SIZE( pathname ), dir, &dir_size, NULL, 0 ))
         {
-            free( dir );
+#ifdef HAVE_NUMA_H
+            if (numa_available() != -1)
+                numa_free(dir, dir_size);
+            else
+#endif
+                free( dir );
             dir = NULL;
         }
     }
 #elif defined(__APPLE__)
     uint32_t dir_size = PATH_MAX;
-    dir = malloc( dir_size );
+    dir = mem_alloc( dir_size );
     if (dir)
     {
         if (_NSGetExecutablePath( dir, &dir_size ))
         {
-            free( dir );
+#ifdef HAVE_NUMA_H
+            if (numa_available() != -1)
+                numa_free(dir, dir_size);
+            else
+#endif
+                free( dir );
             dir = NULL;
         }
     }
 #else
     dir = realpath( server_argv0, NULL );
 #endif
+
     if (!dir) return NULL;
+
+#ifdef HAVE_NUMA_H
+    dir_len = strlen(dir);
+#endif
+
     if (!(p = strrchr( dir, '/' )))
     {
-        free( dir );
+#ifdef HAVE_NUMA_H
+        if (numa_available() != -1)
+            numa_free(dir, dir_len + 1);
+        else
+#endif
+            free( dir );
         return NULL;
     }
     *(++p) = 0;
@@ -320,7 +349,12 @@ static char *get_nls_dir(void)
         return dir;
     }
     ret = build_relative_path( dir, BINDIR, DATADIR "/wine/nls" );
-    free( dir );
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+        numa_free(dir, dir_len + 1);
+    else
+#endif
+        free( dir );
     return ret;
 }
 
@@ -346,7 +380,12 @@ struct fd *load_intl_file(void)
         if ((fd = open_fd( NULL, path, nt_name, O_RDONLY, &mode, FILE_READ_DATA,
                            FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                            FILE_NON_DIRECTORY_FILE | FILE_SYNCHRONOUS_IO_NONALERT ))) break;
-        free( path );
+#ifdef HAVE_NUMA_H
+        if (numa_available() != -1)
+            numa_free(path, strlen(path) + 1);
+        else
+#endif
+            free( path );
     }
     if (!fd) fatal_error( "failed to load l_intl.nls\n" );
     unix_fd = get_unix_fd( fd );
@@ -361,9 +400,14 @@ struct fd *load_intl_file(void)
     offset++;
     size = data - 1;
     /* read lowercase table */
-    if (!(casemap = malloc( size * 2 ))) goto failed;
+    if (!(casemap = mem_alloc( size * 2 ))) goto failed;
     if (pread( unix_fd, casemap, size * 2, offset * 2 ) != size * 2) goto failed;
-    free( path );
+#ifdef HAVE_NUMA_H
+    if (numa_available() != -1)
+        numa_free(path, strlen(path) + 1);
+    else
+#endif
+        free( path );
     return fd;
 
 failed:
--
GitLab

https://gitlab.winehq.org/wine/wine/-/merge_requests/10091
Are there any measurements confirming that such a change gives a performance gain? If there are, that would point to a kernel issue or a host setup issue, e.g. some unusual default NUMA configuration.

Single-threaded programs (and wineserver is single-threaded) should never need to care about NUMA locality. The kernel takes care of allocating memory on the node where the thread is currently running, and it takes memory locality into account when making CPU scheduling decisions for the process or when migrating allocated memory (not bound to NUMA nodes) between NUMA nodes.

Explicit NUMA node management only makes sense in multithreaded apps (and the rule of thumb for simple cases is just to allocate memory on the thread that is going to primarily use it). But that is very specific to what the app is doing and usually only makes sense in combination with CPU pinning. I.e., it is mostly not applicable to Wine in general, because most of the time Wine can't control a thread's affinity and allocations; those are ultimately dictated by the app (apart from supporting the corresponding bits in the Nt memory allocation functions, which are currently not supported).

--
https://gitlab.winehq.org/wine/wine/-/merge_requests/10091#note_129547
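To illustrate the rule of thumb mentioned above (allocate and first touch memory on the thread that will use it), here is a minimal sketch that is not part of the merge request. It assumes the Linux default memory policy (local allocation, where a page is placed on the node of the CPU that first faults it in). The helper name `alloc_and_touch` is hypothetical and only for illustration; no libnuma calls are involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate a buffer with plain malloc() and write to
 * every page from the calling thread. Under the default local-allocation
 * policy the kernel then places each page on the node this thread is
 * running on at the time of the first write. */
static void *alloc_and_touch( size_t size )
{
    long page = sysconf( _SC_PAGESIZE );
    unsigned char *buf;
    size_t i;

    assert( page > 0 );
    if (!(buf = malloc( size ))) return NULL;

    /* one write per page-sized stride is enough to fault each page in */
    for (i = 0; i < size; i += (size_t)page) buf[i] = 0;
    /* the buffer may not start on a page boundary, so touch the tail too */
    if (size) buf[size - 1] = 0;
    return buf;
}
```

A single-threaded caller would simply use `alloc_and_touch( n )` in place of `malloc( n )` and release the result with `free()`; no size needs to be tracked for freeing.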
Only test-linux-32 failed:
```
ddraw1.c:3877:9.624 Test failed: RestoreDisplayMode failed, hr 0x8876086a.
ddraw1.c:3880:9.624 Test failed: EnumDisplaySettingsA failed, error 0.
ddraw1.c:3883:9.624 Test failed: EnumDisplaySettingsA failed, error 0.
```
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/10091#note_129552
On Thu Feb 12 21:25:02 2026 +0000, Paul Gofman wrote:
> Are there any measurements confirming that such a change gives a performance gain? If there are, that would point to a kernel issue or a host setup issue, e.g. some unusual default NUMA configuration.
>
> Single-threaded programs (and wineserver is single-threaded) should never need to care about NUMA locality. The kernel takes care of allocating memory on the node where the thread is currently running, and it takes memory locality into account when making CPU scheduling decisions for the process or when migrating allocated memory (not bound to NUMA nodes) between NUMA nodes.
>
> Explicit NUMA node management only makes sense in multithreaded apps (and the rule of thumb for simple cases is just to allocate memory on the thread that is going to primarily use it). But that is very specific to what the app is doing and usually only makes sense in combination with CPU pinning. I.e., it is mostly not applicable to Wine in general, because most of the time Wine can't control a thread's affinity and allocations; those are ultimately dictated by the app (apart from supporting the corresponding bits in the Nt memory allocation functions, which are currently not supported).

@gofman, numa_alloc() not only allocates memory on the node nearest to where the current thread is running, it also aligns memory blocks, because malloc() performs very poorly on NUMA systems by default.

About numa_alloc(): https://linux.die.net/man/3/numa_alloc

Detailed benchmarks of malloc() vs numa_alloc() are here: https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/issues/236

--
https://gitlab.winehq.org/wine/wine/-/merge_requests/10091#note_129553
> Detailed benchmarks here malloc() vs numa_alloc(): https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/issues/236

Memory allocation and usage patterns in that cosmological simulation software are very different from wineserver's, so those results and that discussion do not really apply here. Switching all the mallocs in existence to numa_alloc() is unlikely to be a good idea for the reasons above. Also, aligning malloc'ed memory is not universally beneficial, and when it is needed there are other ways to do it without explicitly dealing with NUMA.

--
https://gitlab.winehq.org/wine/wine/-/merge_requests/10091#note_129555
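As one example of getting aligned memory without touching NUMA APIs, the sketch below uses the standard C11 `aligned_alloc()`. It is only an illustration and not code from the merge request: the helper names `alloc_cache_aligned` and `is_aligned` are hypothetical, and the 64-byte cache line size is an assumption that holds on many, but not all, CPUs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64  /* assumption: common cache line size, not universal */

/* Hypothetical helper: return a block aligned to CACHE_LINE_SIZE.
 * C11 aligned_alloc() expects the size to be a multiple of the alignment,
 * so the request is rounded up first. Release the block with free(). */
static void *alloc_cache_aligned( size_t size )
{
    size_t rounded = (size + CACHE_LINE_SIZE - 1) & ~(size_t)(CACHE_LINE_SIZE - 1);

    if (rounded < size) return NULL;           /* size overflowed while rounding */
    if (!rounded) rounded = CACHE_LINE_SIZE;   /* avoid a zero-sized request */
    return aligned_alloc( CACHE_LINE_SIZE, rounded );
}

/* Check that ptr is aligned to a power-of-two alignment. */
static int is_aligned( const void *ptr, size_t alignment )
{
    assert( alignment && !(alignment & (alignment - 1)) );
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

Unlike `numa_free()`, which needs the original size, a block obtained this way is released with plain `free()`, so callers do not have to track allocation sizes.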
Instead of making invasive changes all throughout the wine source, I'd imagine using something like likwid-pin to pin the main thread to a certain NUMA domain would yield better results. Have you tried that? -- https://gitlab.winehq.org/wine/wine/-/merge_requests/10091#note_129688
participants (4)
- Herman Semenoff
- Herman Semenov (@GermanAizek)
- Paul Gofman (@gofman)
- Sven Baars (@sbaars)