Fellow developers,
I'm thinking of starting a VM project to allow running x86 Windows apps on ARM Android. This will obviously involve binary translation. I've read about QEMU's Tiny Code Generator and think that, for a usable experience, the intermediate micro-op representation will have to be abandoned in favor of a more efficient, though less portable, x86-to-ARM translator. I also saw a Google Summer of Code project that tried to incorporate LLVM into QEMU, but with a disastrous slowdown when done naively. I still think it's worth doing, but a lot of care will be needed to optimize only the code that needs it, as Sun's HotSpot Java compiler does.
Questions:
1. How useful would this be, and how much interest is there?
Obviously, this will be a huge project, and I just want to gauge the interest before I jump in. Microsoft will be releasing Windows for ARM soon, so there will be no need to worry about running Office, Matlab, Visual C++, etc on ARM, leaving only legacy applications and games to benefit from binary translation. I'm mostly interested in seeing some 3D games run on my Xoom.
2. What's the best design: whole system VM (qemu) or process VM (qemu & wine)?
Process VM:
+ easier to incorporate 3D acceleration at API level
+ uses less memory
+ better performance (e.g. no need for MMU translation when accessing memory)
+ much better integration with host OS
- needs to maintain custom Windows API implementation (Wine)
Whole system VM:
+ simpler, more unified to implement
+ much better support for apps that are dependent on new, proprietary, obscure Windows libraries, interfaces (moot because Office, Matlab, etc will soon be available for ARM)
Given the aims of only running legacy applications and games, it seems a foregone conclusion that Wine's process VM approach is best. Comments?
3. Will Wine ever incorporate binary translation?
My proposed design will obviously use Wine's implementation of the Windows API, which is huge. I'm not sure how disruptive a change binary translation will be to Wine.
If Wine does incorporate binary translation, maybe they can change the name to Wine Is Now an Emulator.
If you're interested in this project, please reply.
On 4/1/11 6:19 PM, Yale Zhang wrote:
If you're interested in this project, please reply.
I'm not competent enough to mentor you, but, being a Mac user, I have a story to tell you. It's about a version of Wine that did incorporate binary translation: Darwine. Darwine was separate from mainline Wine; it incorporated QEMU directly and used it to translate x86 to PowerPC code. In a faraway time when Macs used PowerPC CPUs, Darwine allowed x86 Windows apps to run on Mac OS X. Now that Macs aren't built with PowerPC chips anymore, Darwine has been abandoned. You might consider using it as a starting point, though.
I switched over to the Mac platform (from Windows) in 2007, well after the PPC->x86 transition began, so I never needed Darwine. I did hear, however, that Darwine had major problems getting good performance (just as you surmised). You might have better luck rolling your own x86->ARM translator after all.
That said, I'm also an LLVM guy, so I hear things like people proposing to modify LLVM's JITter to be adaptive (i.e. to heavily optimize heavy-traffic areas only), just as you propose. If that goes through, you might have better luck integrating LLVM into QEMU (again) and then basing your project on Darwine.
Chip
Hi, I may be telling you nonsense here, I hope Alexandre hits me with a cluestick next Wineconf if I do, but this is my 2c on such a project.
On Saturday 02 April 2011 02:19:33 Yale Zhang wrote:
- What's the best design: whole system VM (qemu) or process VM (qemu & wine)?
A process VM already exists. It is called qemu. I don't know if it works with Wine, but it works with basic Linux apps. I could run a statically linked ARM ls on an x86 PC (i.e., the reverse of what you want). For libraries, the normal Linux multilib scheme would apply, or a chroot. The advantage of this scheme is that the Linux syscall interface is comparatively small, but you need an x86 build of every Linux library, and everything gets translated.
If you want to run x86 Windows apps on ARM, I suggest forgetting about Wine. Write an app that runs x86 Windows apps on ARM Windows. This wrapper app would be an ARM Windows app. If done properly, it will automagically run on ARM Wine. This way you don't have to worry about Alexandre, your app naturally lives as a separate project, etc. Plus, you only translate the app itself and not the libraries.
The big drawback of that approach is that it is a *lot* of work. You'll have to translate between x86 and ARM on every library call that leaves the app's own code. You'll have to write a wrapper version of every function, COM method, and callback of every DLL in C:\windows\system32. This is probably comparable to the amount of work needed for Wine itself. What you have going for you, though, is that you can start small; you don't have to implement everything at once. Start with a console hello world app and implement some kernel32 thunks. Go on with Windows, add user32 and gdi32, and so on.
A technical challenge is that the Windows DLLs call each other. But that would stay inside the host domain (i.e., ARM), and the app doesn't see this anyway. But you'll have to translate pointers, handles, etc. consistently. And if there's a callback from the library back into the app (e.g. ddraw.dll:DirectDrawEnumerate), you'll have to translate it back.
If you make it flexible enough to deal with potential alignment differences (ARM doesn't need that, but PPC does) as well as separate address spaces, then you could achieve more goals at once: PPC, MIPS, etc. support, and running x86 apps on x86_64 without Linux x86 libs (although it will be much slower than the classic multilib approach).
On 02.04.2011 at 03:14, Charles Davis wrote:
On 4/1/11 6:19 PM, Yale Zhang wrote:
Fellow developers,
I'm thinking of starting a VM project to allow running x86 Windows apps on ARM Android. This will obviously involve binary translation. I've read about QEMU's Tiny Code Generator and think that, for a usable experience, the intermediate micro-op representation will have to be abandoned in favor of a more efficient, though less portable, x86-to-ARM translator. I also saw a Google Summer of Code project that tried to incorporate LLVM into QEMU, but with a disastrous slowdown when done naively. I still think it's worth doing, but a lot of care will be needed to optimize only the code that needs it, as Sun's HotSpot Java compiler does.
Maybe first to ARM Linux and then to ARM Android?
Questions:
How useful would this be and how much interest?
Obviously, this will be a huge project, and I just want to gauge the interest before I jump in. Microsoft will be releasing Windows for ARM soon, so there will be no need to worry about running Office, Matlab, Visual C++, etc on ARM, leaving only legacy applications and games to benefit from binary translation. I'm mostly interested in seeing some 3D games run on my Xoom.
see:
http://wiki.winehq.org/ARM
http://lists.terrasoftsolutions.com/pipermail/yellowdog-general/2004-June/01...
http://www.oesf.org/forum/index.php?showtopic=14829
http://wiki.winehq.org/MacOSX
http://wiki.winehq.org/MacOSX/QemuWork
If you're interested in this project, please reply.
I'm not competent enough to mentor you, but, being a Mac user, I have a story to tell you. It's about a version of Wine that did incorporate binary translation: Darwine. Darwine was separate from mainline Wine; it incorporated QEMU directly and used it to translate x86 to PowerPC code. In a faraway time when Macs used PowerPC CPUs, Darwine allowed x86 Windows apps to run on Mac OS X. Now that Macs aren't built with PowerPC chips anymore, Darwine has been abandoned. You might consider using it as a starting point, though.
I know about that and was told it was never implemented because of problems with the endianness (PPC big, x86 little), so I wonder if that ever worked or was just planned to work. Any Darwine users around?
Yale Zhang yzhang1985@gmail.com writes:
- What's the best design: whole system VM (qemu) or process VM (qemu & wine)?
Process VM:
+ easier to incorporate 3D acceleration at API level
+ uses less memory
+ better performance (e.g. no need for MMU translation when accessing memory)
+ much better integration with host OS
- needs to maintain custom Windows API implementation (Wine)
Whole system VM:
+ simpler, more unified to implement
+ much better support for apps that are dependent on new, proprietary, obscure Windows libraries, interfaces (moot because Office, Matlab, etc will soon be available for ARM)
Given the aims of only running legacy applications and games, it seems a foregone conclusion that Wine's process VM approach is best. Comments?
I think you underestimate the complexity of doing the emulation at the API level. You should first make it work by running the whole process under the emulator; once you get this right, then you can start thinking about running some parts natively.
On Sat, Apr 2, 2011 at 2:19 AM, Yale Zhang yzhang1985@gmail.com wrote:
Fellow developers,
I'm thinking of starting a VM project to allow running x86 Windows apps on ARM Android. This will obviously involve binary translation. I've read about QEMU's Tiny Code Generator and think that, for a usable experience, the intermediate micro-op representation will have to be abandoned in favor of a more efficient, though less portable, x86-to-ARM translator. I also saw a Google Summer of Code project that tried to incorporate LLVM into QEMU, but with a disastrous slowdown when done naively. I still think it's worth doing, but a lot of care will be needed to optimize only the code that needs it, as Sun's HotSpot Java compiler does.
Questions:
- How useful would this be and how much interest?
Obviously, this will be a huge project, and I just want to gauge the interest before I jump in. Microsoft will be releasing Windows for ARM soon, so there will be no need to worry about running Office, Matlab, Visual C++, etc on ARM, leaving only legacy applications and games to benefit from binary translation. I'm mostly interested in seeing some 3D games run on my
I would love such a project and am willing to help. Good x86 on ARM emulation is essential, and not just for Wine: Flash doesn't work on ARM, Java (in the form of OpenJDK) doesn't support ARM yet, there's the MPlayer win32codecs, etc.
Complete and correct x86 emulation is mighty difficult. The total number of all 16/32/64/MMX/SSE instructions (as seen by the udis86 disassembler) is 710(!!). This excludes instruction prefixes, which change what an instruction does (e.g. 16- vs. 32-bit memory access). When I last checked, qemu didn't support all of those instructions.
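To make the prefix point concrete, here is a minimal Python sketch of how the 0x66 operand-size prefix changes the meaning of the same opcode byte in 32-bit mode. It is not how qemu or udis86 decode instructions; the helper name decode_mov_imm is made up for illustration, and it only understands the MOV reg, imm family (opcodes 0xB8 to 0xBF).

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def decode_mov_imm(code: bytes):
    """Decode `MOV reg, imm` (opcodes 0xB8..0xBF) as seen in 32-bit mode.

    Returns (mnemonic, register, immediate, length_in_bytes).
    Hypothetical helper: it handles this one opcode family only.
    """
    pos = 0
    operand_size = 4              # default operand size in 32-bit mode
    if code[pos] == 0x66:         # operand-size override prefix
        operand_size = 2          # same opcode now means a 16-bit move
        pos += 1
    opcode = code[pos]
    if not 0xB8 <= opcode <= 0xBF:
        raise ValueError(f"unsupported opcode {opcode:#x}")
    pos += 1
    imm_bytes = code[pos:pos + operand_size]
    if len(imm_bytes) != operand_size:
        raise ValueError("truncated immediate")
    imm = int.from_bytes(imm_bytes, "little")   # x86 immediates are little-endian
    regs = REG32 if operand_size == 4 else REG16
    return ("mov", regs[opcode - 0xB8], imm, pos + operand_size)
```

The same opcode byte 0xB8 decodes as a 5-byte `mov eax, imm32` without the prefix and as a 4-byte `mov ax, imm16` with it, so a translator cannot treat the instruction count alone as a measure of the work involved.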
Xoom.
2. What's the best design: whole system VM (qemu) or process VM (qemu & wine)?
Process VM:
+ easier to incorporate 3D acceleration at API level
+ uses less memory
+ better performance (e.g. no need for MMU translation when accessing memory)
+ much better integration with host OS
- needs to maintain custom Windows API implementation (Wine)
* To get 3D acceleration, user-space x86 X/OpenGL drivers would have to be able to talk to the ARM kernel driver for that graphics card, or you'd need x86 to ARM wrappers for X and OpenGL libraries, or you'd need to use x86 kernel driver and do x86 emulation in the kernel too (very hard), or do whole system VM and the kind of 3D acceleration passthrough that VirtualBox does at the moment (which works poorly, in my limited experience). NVidia's ioctls are undocumented IIRC, so even if they provide an ARM port, translating those between x86 and ARM might be difficult.
Whole system VM:
+ simpler, more unified to implement
+ much better support for apps that are dependent on new, proprietary, obscure Windows libraries, interfaces (moot because Office, Matlab, etc will soon be available for ARM)
* poor integration with native desktop/filesystem
* more to emulate -> slower
Given the aims of only running legacy applications and games, it seems a foregone conclusion that Wine's process VM approach is best. Comments?
Agree, but it doesn't have to be done as part of Wine. What Darwine did (IIRC, try to make the Wine DLLs PowerPC-based and only the application x86) seems like a bad idea: the application/Windows API split is badly defined and many things (e.g. COM) are difficult or impossible to do correctly. I prefer qemu's approach: all user-space is x86, only the kernel is ARM.
qemu-i386 doesn't even run 32-bit Wine on amd64 long mode at the moment (segfault on startup); I'll have to investigate at some stage.
- Will Wine ever incorporate binary translation?
My proposed design will obviously use Wine's implementation of the Windows API, which is huge. I'm not sure how disruptive of a change binary translation will be to Wine.
If Wine does incorporate binary translation, maybe they can change the name to Wine Is Now an Emulator.
If you're interested in this project, please reply.
Replying.
The best way to go here would probably be improving qemu. If it turns out not to be good enough, rewriting the CPU emulation but keeping the system call translation is probably easier than a whole new project written from scratch.
Damjan
Thanks, everyone, for your comments. I took some time to reread the FX!32 and Transmeta Crusoe publications (I first read them 3 years ago while I was at Georgia Tech) to see what the challenges are.
The simplest approach is what Stefan proposed: run the Windows app inside x86 Wine, inside QEMU (target = x86, host = ARM). Pretty clever, but the 2 layers of translation, instead of 1, might cause problems. Also, as I said earlier, I don't think QEMU's code generator produces fast enough code, so that will need to be improved (no change to Wine). I will try it and see what happens.
The 2nd approach, which is almost identical to FX!32 (runs x86 Windows programs on Alpha Windows), will be to do what Stefan proposed 2nd: create a standalone process VM to run x86 Windows apps on ARM Windows, using wrappers to translate x86 Windows functions to ARM Windows functions. I think those wrappers/jackets can be generated automatically by scanning header files.
I still don't like this approach because it does the API translation and the instruction set translation in 2 separate programs. Ideally, I would take the Darwine approach of doing both API translation and binary translation in Wine.
To me, the API translation is less interesting than doing x86 to ARM translation efficiently. As I said earlier, QEMU's approach of translating target instructions => micro ops => host instructions is inefficient because it generates redundant operations.
1. Transmeta Code Morphing Software: no emulation of x86 instructions; always translates to native instructions (though not always with optimizations). Hot code is retranslated with optimizations.
2. FX!32: first emulates x86 instructions, then picks candidates for translation to Alpha instructions.
I'm tempted to do a quick and dirty x86 to ARM translation for cold code that isn't a candidate for optimization. But since any non-trivial code transformation/optimization is best done on a *simple* intermediate representation, I will have to use an intermediate representation for hot code that needs to be optimized.
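The cold/hot split described above can be sketched roughly as follows. This is only an illustration in Python, not code from QEMU, FX!32, or Transmeta: the class name TieredTranslator, the threshold value, and the two translator callbacks are all assumptions standing in for a real quick translator and a real optimizing one.

```python
from collections import defaultdict

HOT_THRESHOLD = 50  # assumed value; a real system would tune this by measurement


class TieredTranslator:
    """Translate every block cheaply first, re-translate hot blocks with optimization."""

    def __init__(self, quick_translate, optimizing_translate, threshold=HOT_THRESHOLD):
        self.quick_translate = quick_translate            # fast, unoptimized path
        self.optimizing_translate = optimizing_translate  # slow, IR-based path
        self.threshold = threshold
        self.exec_counts = defaultdict(int)  # guest pc -> number of lookups
        self.code_cache = {}                 # guest pc -> (tier, host code)

    def lookup(self, guest_pc):
        """Return (tier, host_code) for the block starting at guest_pc."""
        self.exec_counts[guest_pc] += 1
        entry = self.code_cache.get(guest_pc)
        if entry is None:
            # First execution: quick and dirty translation, no interpretation.
            entry = ("cold", self.quick_translate(guest_pc))
            self.code_cache[guest_pc] = entry
        elif entry[0] == "cold" and self.exec_counts[guest_pc] >= self.threshold:
            # The block proved hot: pay for the optimizing translator once.
            entry = ("hot", self.optimizing_translate(guest_pc))
            self.code_cache[guest_pc] = entry
        return entry
```

Counting lookups per block is the simplest possible hotness metric; it says nothing about how long a block runs or whether chained blocks bypass the lookup, which a real design would have to account for.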
But writing a direct x86 to ARM translator will be a lot of work and not portable to other targets (a resurgent MIPS?).
Therefore, another approach would be to use QEMU as is, but apply LLVM optimizations to hot candidates, as was done earlier in a Google Summer of Code project. This will be very slow on the first run of the program, but a persistent translation cache, like the ones FX!32 and .NET assemblies use, will make subsequent executions much faster. The static persistent translation won't be complete, however, due to unknown indirect branches, so it will keep growing. I think the main reason FX!32 uses a persistent translation cache is that it starts with emulation, which would otherwise be intolerable if done on every application launch.
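A persistent translation cache of the kind mentioned above might look roughly like this. It is a sketch under stated assumptions, not FX!32's or .NET's actual on-disk format: the class name, the JSON file layout, and keying the cache file by a SHA-256 hash of the executable image are all choices made here for illustration.

```python
import hashlib
import json
import os


class PersistentTranslationCache:
    """Keep translated blocks on disk so later launches can skip translation."""

    def __init__(self, image: bytes, cache_dir: str):
        # Key the cache by a hash of the executable image, so a changed
        # binary never picks up stale translations (assumed policy).
        digest = hashlib.sha256(image).hexdigest()
        self.path = os.path.join(cache_dir, digest + ".json")
        self.blocks = {}  # guest address -> translated host code (bytes)
        if os.path.exists(self.path):
            with open(self.path) as f:
                stored = json.load(f)
            self.blocks = {int(pc, 16): bytes.fromhex(code)
                           for pc, code in stored.items()}

    def get(self, guest_pc, translate):
        """Return cached host code, translating only on a miss."""
        code = self.blocks.get(guest_pc)
        if code is None:
            # Miss: e.g. the target of an indirect branch first seen in this run.
            code = translate(guest_pc)
            self.blocks[guest_pc] = code
        return code

    def save(self):
        """Write the (possibly grown) cache back to disk."""
        stored = {f"{pc:x}": code.hex() for pc, code in self.blocks.items()}
        with open(self.path, "w") as f:
            json.dump(stored, f)
```

Because misses are filled in at run time and written back by save(), the cache grows across runs as new indirect branch targets are discovered, which matches the "it will keep growing" behavior described above.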
Other issues:
x86 condition flag evaluation - I want to do this lazily, but how do I know the liveness of those values (given an instruction that uses a condition flag, how do I find the instruction that generates the condition flag)?
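One standard way to answer the liveness question is a backward scan over each translated block. The Python sketch below assumes a simplified model in which every instruction is a tuple (text, flags_written, flags_read); the function name needed_flag_writes and the one-letter flag names are made up for illustration, and the flag sets in the usage example are written by hand rather than derived from a real decoder.

```python
ALL_FLAGS = frozenset("CPAZSO")  # carry, parity, adjust, zero, sign, overflow


def needed_flag_writes(block, live_out=ALL_FLAGS):
    """For each instruction, return the set of flags it must really compute.

    block is a list of (text, flags_written, flags_read) tuples, where the
    flag fields are strings of one-letter flag names. live_out is the set of
    flags assumed live at the end of the block; using ALL_FLAGS is the
    conservative choice when the successor is unknown (e.g. after an
    indirect branch).
    """
    live = set(live_out)
    needed = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):     # walk the block backwards
        _text, written, read = block[i]
        needed[i] = set(written) & live         # only live flags need computing
        live -= set(written)                    # this write kills earlier values
        live |= set(read)                       # flags read here are live above
    return needed
```

For example, with the block `add eax, ebx` / `dec ecx` / `jc target` and nothing live at the exit, the scan reports that only the `add` has to produce the carry flag, because `dec` leaves CF untouched and `jc` reads only CF:

```python
block = [
    ("add eax, ebx", "CPAZSO", ""),
    ("dec ecx",      "PAZSO",  ""),   # DEC does not modify the carry flag
    ("jc target",    "",       "C"),
]
print(needed_flag_writes(block, live_out=frozenset()))
```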
Stefan, ---------------------------------------------------------------------------------------------
"ARM doesn't need [dealing with unaligned loads/stores], but PPC does"
OK, good to know. Earlier, I thought ARM didn't allow unaligned loads/stores at all, but apparently ARMv6 and later do.
Andre, -----------------------------------------------------------------------------------------------
"Maybe first to ARM Linux and then to ARM Android?"
Yes, if I can figure out how to install Ubuntu onto my Xoom. I saw someone do it here http://www.youtube.com/watch?v=xDB0PMrGdN0, but he's just running the userspace part of Ubuntu on top of Android, so I'm not sure if that will be as compatible as running a native Linux kernel.
"I know about that and was told it was never implemented because of problems with the endianness"
Right, if the endianness differs and byte swapping is required on every load/store, that will kill the performance. Luckily, ARM can operate in both little- and big-endian modes.
Damjan, -------------------------------------------------------------------------------------------------
In theory, binary translation will allow Flash Player and the Java JVM to run on ARM, but there might be complications because those programs generate and execute x86 code.
Also, I agree that improving QEMU's binary translation would be the simplest approach, but as I said earlier, I get the feeling that doing API translation and instruction set translation in 2 separate programs might cause problems.
Yale
On 08.04.2011 at 11:16, Yale Zhang wrote:
The 2nd approach, which is almost identical to FX!32 (runs x86 Windows programs on Alpha Windows), will be to do what Stefan proposed 2nd: create a standalone process VM to run x86 Windows apps on ARM Windows, using wrappers to translate x86 Windows functions to ARM Windows functions. I think those wrappers/jackets can be generated automatically by scanning header files.
I don't think you can autogenerate the wrappers. They require a very deep understanding of the data passed around, e.g. if a parameter is a pointer to a structure, you have to convert the members, which may also be pointers, and so on. You could even have things like doubly linked lists where you have to know when to abort. Very often you don't even know the size of memory blobs pointed to by pointers without an in-depth look at other parameters, e.g. glTexImage2D.
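As an illustration of why such a wrapper needs knowledge that a header file alone does not carry, here is a Python sketch that converts a linked list living in guest memory. Everything about it is hypothetical: the node layout (a 32-bit value, a 32-bit pointer to a NUL-terminated name, a 32-bit next pointer, all little-endian), the idea that guest pointers are plain offsets into a byte buffer, and the function names.

```python
import struct

# Hypothetical guest layout: uint32 value, uint32 name_ptr, uint32 next_ptr.
NODE = struct.Struct("<III")


def read_cstring(mem, addr):
    """Read a NUL-terminated ASCII string from guest memory."""
    end = mem.index(b"\0", addr)
    return mem[addr:end].decode("ascii")


def convert_list(mem, head):
    """Walk a guest linked list and build host-side records.

    The wrapper has to know three things a header does not say: that
    name_ptr points to a NUL-terminated string, that next_ptr == 0 ends
    the list, and that the list may be circular.
    """
    nodes, seen = [], set()
    addr = head
    while addr != 0:
        if addr in seen:
            break                      # circular list: know when to abort
        seen.add(addr)
        value, name_ptr, next_ptr = NODE.unpack_from(mem, addr)
        name = read_cstring(mem, name_ptr) if name_ptr else None
        nodes.append({"value": value, "name": name})
        addr = next_ptr
    return nodes
```

A generator that only sees three 32-bit fields cannot tell which of them are pointers into guest memory, how far to follow them, or where the data ends; that information has to come from someone who understands the API.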
Sometimes you might take some shortcuts in the conversion, especially with endianness or alignment. E.g. the D3D API already gives you 16-byte alignment for resources. Instead of swapping the byte order of the image data passed to glTexImage2D (or d3dsurface::lock), you could just change the pixel format (the format and type parameters, or the D3D pixel format).
As I said, writing such an API wrapper is probably comparable to the effort needed to implement Wine.
I still don't like this approach because it does the API translation and the instruction set translation in 2 separate programs. Ideally, I would take the Darwine approach of doing both API translation and binary translation in Wine.
Separation of concerns?
You do, of course, need a bit of integration. An API wrapper needs a facility in the CPU emulator to enter and leave the emulated environment, somewhat similar to switching between ring3 and ring0 on x86 CPUs.
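The enter/leave mechanism can be sketched with a toy interpreter. This is deliberately not x86 and not how qemu does it: the three-instruction guest machine, the THUNK_BASE address range, and the convention of passing one argument and one return value through an accumulator are all assumptions chosen to keep the example short.

```python
THUNK_BASE = 0xF0000000  # assumed reserved range that never holds guest code


class ToyCpu:
    """Toy guest CPU: just enough to show leaving and re-entering emulation."""

    def __init__(self, program, thunks):
        self.program = program    # guest pc -> (op, arg)
        self.thunks = thunks      # thunk address -> native host function
        self.stack = []           # guest return addresses
        self.pc = 0
        self.acc = 0              # accumulator; carries argument and result

    def run(self):
        while True:
            op, arg = self.program[self.pc]
            if op == "load":
                self.acc = arg
                self.pc += 1
            elif op == "call":
                if arg >= THUNK_BASE:
                    # Leave the emulated environment: run the host wrapper
                    # natively, then re-enter the guest after the call.
                    self.acc = self.thunks[arg](self.acc)
                    self.pc += 1
                else:
                    # Ordinary guest-to-guest call stays inside the emulator.
                    self.stack.append(self.pc + 1)
                    self.pc = arg
            elif op == "ret":
                self.pc = self.stack.pop()
            elif op == "halt":
                return self.acc
            else:
                raise ValueError(f"unknown op {op!r}")
```

The point of the design is that the guest never sees host code: calling an address in the reserved range is the only gate, much like a system call instruction is the only gate from ring3 to ring0. Callbacks from host code back into the guest would need the reverse gate, which this sketch leaves out.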
To me, the API translation is less interesting than doing x86 to ARM translation efficiently. As I said earlier, QEMU's approach of translating target instructions => micro ops => host instructions is inefficient because it generates redundant operations.
I'm afraid you're looking at the wrong project, then. You'd probably be better off looking at something less complex than the Windows API and focusing on something that stresses the CPU side more. My impression is that running x86 code on <insert other platform here> is a more or less solved problem. It's probably not solved at maximum efficiency, but there are working solutions (bochs, qemu, probably some JIT compilers).