Jinoh Kang (@iamahuman) commented about dlls/windows.media.speech/recognizer.c:
+
+    if (FAILED(hr = IMMDevice_Activate(mm_device, &IID_IAudioClient, CLSCTX_INPROC_SERVER, NULL, (void**)&session->audio_client)))
+        goto cleanup;
+
+    if (SUCCEEDED(hr = IMMDevice_GetId(mm_device, &str)))
+    {
+        TRACE("selected capture device ID: %s\n", debugstr_w(str));
+        CoTaskMemFree(str);
+    }
+
+    if (FAILED(hr = IAudioClient_GetMixFormat(session->audio_client, (WAVEFORMATEX **)&wfx)))
+        goto cleanup;
+
+    wfx->wFormatTag = WAVE_FORMAT_PCM;
+    wfx->nChannels = 1;
+    wfx->nSamplesPerSec = 16000;

Magic constant. You should replace this with a `#define` shared with the Unix side that interfaces with vosk. `#define WINE_VOSK_SAMPLE_RATE 16000` will do.
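A minimal sketch of the idea, not the actual patch: the constant is defined once and used wherever the capture format is filled in. `struct wave_format` and `init_capture_format` are hypothetical stand-ins (so the snippet compiles without the Windows headers), and the 16-bit sample size is an assumption, since the quoted hunk ends before those fields are set. In the real tree the `#define` would presumably live in a header included by both the PE side and the Unix side.

```c
#include <stdint.h>

/* Suggested shared constant; the name comes from the review comment above. */
#define WINE_VOSK_SAMPLE_RATE 16000

/* Hypothetical stand-in for WAVEFORMATEX, for illustration only. */
struct wave_format
{
    uint16_t format_tag;
    uint16_t channels;
    uint32_t samples_per_sec;
    uint32_t avg_bytes_per_sec;
    uint16_t block_align;
    uint16_t bits_per_sample;
};

/* Hypothetical helper: fill in a mono PCM format at the shared sample rate. */
static void init_capture_format(struct wave_format *wfx)
{
    wfx->format_tag = 1;                          /* value of WAVE_FORMAT_PCM */
    wfx->channels = 1;
    wfx->samples_per_sec = WINE_VOSK_SAMPLE_RATE; /* no bare 16000 here */
    wfx->bits_per_sample = 16;                    /* assumption, not in the quoted hunk */
    wfx->block_align = wfx->channels * wfx->bits_per_sample / 8;
    wfx->avg_bytes_per_sec = wfx->samples_per_sec * wfx->block_align;
}
```

The Unix side could then pass the same `WINE_VOSK_SAMPLE_RATE` when it creates the recognizer, so the two sides cannot drift apart.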
(I'm aware that most vosk models are trained with 16 kHz PCM streams.)

-- 
https://gitlab.winehq.org/wine/wine/-/merge_requests/1948#note_21044