If this function is called often enough that its performance matters this much, it may be worth caching some of these transform matrices (perhaps world<->device and world<->gdidevice). Most calls probably have exactly the same inputs and produce the same result, and we know exactly when those inputs change, so the cache would only need to be invalidated at those points.
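
To make the suggestion concrete, here is a rough sketch of the kind of cache I have in mind. Everything in it is hypothetical: `Matrix`, `TransformCache`, the two setter names, and the row-vector multiplication order are stand-ins for illustration, not the types or conventions this code actually uses. The idea is only that each setter marks the cached product stale and the getter recomputes it lazily.

```cpp
// Hypothetical 2D affine matrix (row-vector convention: p' = p * M).
// Stands in for whatever matrix type the real code uses.
struct Matrix {
    double a, b, c, d, tx, ty;

    static Matrix Identity() { return {1, 0, 0, 1, 0, 0}; }

    // Returns the matrix that applies *this first, then `next`.
    Matrix Then(const Matrix& next) const {
        return {a * next.a + b * next.c,
                a * next.b + b * next.d,
                c * next.a + d * next.c,
                c * next.b + d * next.d,
                tx * next.a + ty * next.c + next.tx,
                tx * next.b + ty * next.d + next.ty};
    }
};

// Hypothetical cache: the combined world->device matrix is recomputed
// only after one of its inputs has changed.
class TransformCache {
public:
    // Every place that changes an input must go through a setter,
    // so the cached product is never silently stale.
    void SetWorldTransform(const Matrix& m) { world_ = m; valid_ = false; }
    void SetDeviceTransform(const Matrix& m) { device_ = m; valid_ = false; }

    // Cheap in the common case: returns the stored product unless dirty.
    const Matrix& WorldToDevice() {
        if (!valid_) {
            worldToDevice_ = world_.Then(device_);
            valid_ = true;
            ++recomputeCount_;
        }
        return worldToDevice_;
    }

    // Exposed only to make the caching behaviour observable in this sketch.
    int RecomputeCount() const { return recomputeCount_; }

private:
    Matrix world_ = Matrix::Identity();
    Matrix device_ = Matrix::Identity();
    Matrix worldToDevice_ = Matrix::Identity();
    bool valid_ = false;
    int recomputeCount_ = 0;
};
```

The inverse (device->world) and the gdidevice variants could be cached the same way, each behind the same dirty flag or its own. Whether this is worth the extra state depends on how hot the call really is, so it would be good to measure first.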