[PATCH 0/3] MR10998: gdi32/uniscribe: Reworking mark_invalid_combinations()

May 26, 2026

Hi, while working on #59680 I noticed some discrepancies between the behaviour of mark_invalid_combinations() in Wine and the corresponding behaviour on Windows.

In Wine, a combination is only marked invalid when two adjacent characters share the same context type, so, for example, a lone Arabic fatha `َ` is not displayed on a dotted circle. Windows does not behave this way. This matters in the game Path of Exile, which renders a Latin + Thai mark combination: the itemizer splits the Latin and Thai text into two different runs, so the Thai mark becomes an isolated character with no base to attach to and ends up invisible, whereas the game expects the mark to appear with a base character. The difference can also be seen with this program:

```c
#include <windows.h>
#include <usp10.h>
#include <stdio.h>

#define MAKE_OPENTYPE_TAG(a,b,c,d) ((OPENTYPE_TAG)(a) | ((OPENTYPE_TAG)(b) << 8) | ((OPENTYPE_TAG)(c) << 16) | ((OPENTYPE_TAG)(d) << 24))

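/* Brute-force reverse lookup: returns the first BMP code point that maps to
 * target_gid in the font selected into hdc, or 0 if none is found. */
int dump_unicode_for_glyph(HDC hdc, WORD target_gid) {
  for (DWORD i = 0x0020; i <= 0xFFFF; i++) {
    WCHAR ch = (WCHAR)i;
    WORD gid = 0xFFFF;

    if (GetGlyphIndicesW(hdc, &ch, 1, &gid, GGI_MARK_NONEXISTING_GLYPHS) != GDI_ERROR) {
      if (gid == target_gid)
        return i;
    }
  }
  return 0; /* no matching code point; avoids falling off the end of a non-void function */
}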

void dump_shape_results(HDC hdc, const WCHAR *pwcChars, int cChars, const WORD *pwOutGlyphs, int cGlyphs, const WORD *pwLogClust, const SCRIPT_CHARPROP *pCharProps, const SCRIPT_GLYPHPROP *pGlyphProps) {
  printf("cChars=%d cGlyphs=%d\n\n", cChars, cGlyphs);

  printf("Input characters:\n");
  for (int i = 0; i < cChars; i++)
    printf("    pwcChars[%d] = U+%04X LogClust=%d\n", i, pwcChars[i], pwLogClust[i]);

  printf("\nOutput glyphs:\n");
  for (int i = 0; i < cGlyphs; i++) {
    printf("    glyph[%d] = %d (U+%04X) fDiacritic=%d fZeroWidth=%d\n", i, pwOutGlyphs[i], dump_unicode_for_glyph(hdc, pwOutGlyphs[i]), pGlyphProps[i].sva.fDiacritic, pGlyphProps[i].sva.fZeroWidth);

  }
  printf("\n\n");
}

void test_shape(HDC hdc, const WCHAR *pwcChars, OPENTYPE_TAG tagScript) {
  /* ScriptItemize also writes a terminal item, so the buffer needs cMaxItems + 1 entries. */
  SCRIPT_ITEM items[3];
  int num_items = 0;
  ScriptItemize(pwcChars, 1, 2, NULL, NULL, items, &num_items);

  SCRIPT_CACHE sc = NULL;
  SCRIPT_ANALYSIS sa = items[0].a;
  WORD glyphs[8], log_clust[8];
  SCRIPT_CHARPROP char_props[8];
  SCRIPT_GLYPHPROP glyph_props[8];
  int glyph_count = 0;

  ScriptShapeOpenType(hdc, &sc, &sa, tagScript, 0, NULL, NULL, 0, pwcChars, 1, 8, log_clust, char_props, glyphs, glyph_props, &glyph_count);
  dump_shape_results(hdc, pwcChars, 1, glyphs, glyph_count, log_clust, char_props, glyph_props);

  ScriptFreeCache(&sc);
}

int main(void) {
  WCHAR arabic_lone_mark[1] = { 0x064E };

  OPENTYPE_TAG arabic_tag = MAKE_OPENTYPE_TAG('a', 'r', 'a', 'b');

  HDC hdc = CreateCompatibleDC(NULL);
  HFONT arabic_font = CreateFontW(32, 0, 0, 0, FW_NORMAL, FALSE, FALSE, FALSE, DEFAULT_CHARSET, OUT_DEFAULT_PRECIS, CLIP_DEFAULT_PRECIS, DEFAULT_QUALITY, DEFAULT_PITCH, L"Noto Naskh Arabic");
  HFONT old_font = SelectObject(hdc, arabic_font);
  printf("Shaping Arabic\n");
  test_shape(hdc, arabic_lone_mark, arabic_tag);

  WCHAR thai_lone_mark[1] = { 0x0E4E };

  OPENTYPE_TAG thai_tag = MAKE_OPENTYPE_TAG('t', 'h', 'a', 'i');
  HFONT thai_font = CreateFontW(32, 0, 0, 0, FW_NORMAL, FALSE, FALSE, FALSE, DEFAULT_CHARSET, OUT_DEFAULT_PRECIS, CLIP_DEFAULT_PRECIS, DEFAULT_QUALITY, DEFAULT_PITCH, L"Kanit");

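  SelectObject(hdc, thai_font);
  printf("Shaping Thai\n");
  test_shape(hdc, thai_lone_mark, thai_tag);

  /* Restore the original font so neither test font is still selected when deleted. */
  SelectObject(hdc, old_font);
  DeleteObject(arabic_font);
  DeleteObject(thai_font);

  DeleteDC(hdc);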
  return 0;
}
```

When run under Windows, the output is:

```
PS C:\Users\ss141309\Desktop\Shared> .\up32.exe
Shaping Arabic
cChars=1 cGlyphs=2

Input characters:
    pwcChars[0] = U+064E LogClust=1

Output glyphs:
    glyph[0] = 379 (U+064E) fDiacritic=1 fZeroWidth=1
    glyph[1] = 235 (U+25CC) fDiacritic=0 fZeroWidth=0

Shaping Thai
cChars=1 cGlyphs=2

Input characters:
    pwcChars[0] = U+0E4E LogClust=0

Output glyphs:
    glyph[0] = 593 (U+0020) fDiacritic=0 fZeroWidth=1
    glyph[1] = 737 (U+0E4E) fDiacritic=1 fZeroWidth=1
```

Even the Microsoft documentation does not say that combining marks are only invalid when they are followed by a mark of the same class. [See:](https://learn.microsoft.com/en-us/typography/script-development/arabic#handl...)
...
## Handling Invalid Combining Marks
Combining marks and signs that appear in text not in conjunction with a valid consonant base are considered _invalid._ Uniscribe displays these marks using the fallback rendering mechanism defined in the Unicode Standard (section 5.12, 'Rendering Non-Spacing Marks' of the Unicode Standard 3.1), i.e. positioned on a dotted circle.
Please note that to render a sign standalone (in apparent isolation from any base) one should apply it on a space (see section 2.5 'Combining Marks' of Unicode Standard 3.1). Uniscribe requires a ZWJ to be placed between the space and a mark for them to combine into a standalone sign.
For the fallback mechanism to work properly, an Arabic OTL font should contain a glyph for the dotted circle (U+25CC). In case this glyph is missing form the font, the invalid signs will be displayed on the missing glyph shape (white box).
In addition to the 'dotted circle,' other Unicode code points that are recommended for inclusion in any Arabic font are: ZWNJ (zero width non-joiner; U+200C), ZWJ (zero width joiner U+200D), LTR (left to right mark; U+200E), and RTL (right to left mark; U+200F). The ZWNJ can be used between two letters to prevent them from forming a cursive connection.
![Illustration that shows suggested glyphs for the five Unicode code points.](https://learn.microsoft.com/en-us/typography/script-development/images/arabi...)
If an invalid combination is found, like two fathas on the same base character, the diacritic that causes the invalid state is placed on a dotted circle to indicate to the user the invalid combination. The shaping engine for non-OpenType fonts will cause invalid mark combinations to overstrike. This is the problem that inserting the dotted circle for the invalid base solves. It should also be noted that the dotted circle is not inserted into the application's backing store. This is a run-time insertion into the glyph array that is returned from the **ScriptShape** function.
The invalid diacritic logic for Arabic is based on the classes listed below. There is a check to make sure more than one mark of a class is not placed on the same base. Additionally, DIAC1 and DIAC2 classes should not be applied on the same base character.

So was there a historical reason for Wine's current behaviour? If I get a go-ahead for this approach, I will fix the tests next.

-- 
https://gitlab.winehq.org/wine/wine/-/merge_requests/10998

समीरसिंह Sameer Singh (＠ss141309)