I ran into an interesting problem on Windows 8. I wanted to test whether I can represent Unicode characters outside the BMP using wchar_t* strings. The following test code gave me unexpected results:
const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A";
int i1 = sizeof(wchar_t);
int i2 = sizeof(s1);
int i3 = sizeof(s2);
U+2008A is a Han character that lies outside the Basic Multilingual Plane, so it must be represented by a surrogate pair in UTF-16. If I understand this correctly, that means it should take two wchar_t code units. I therefore expected sizeof(s2) to be 6 (4 for the two wchar_t-s of the surrogate pair and 2 for the terminating \0).
So why is sizeof(s2) == 4? I verified that the s2 string was built correctly, because I rendered it using DirectWrite and the Han character was displayed correctly.
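To make the surrogate-pair arithmetic concrete, here is a minimal sketch of the standard UTF-16 encoding step for a code point above U+FFFF. The helper name `to_surrogates` is hypothetical and only for illustration; it is not part of any Windows or standard library API.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 (high, low) surrogate pair.
inline std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    // Only supplementary-plane code points need a surrogate pair.
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20-bit offset into the supplementary planes
    const auto high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    const auto low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return {high, low};
}
```

For U+2008A this yields the two code units 0xD840 and 0xDC8A, which is why the string needs two 16-bit elements plus the terminator.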
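UPDATE: As Naven pointed out, I was determining the size of the arrays incorrectly. sizeof(s1) and sizeof(s2) return the size of the pointer (4 bytes in a 32-bit build), not the length of the string it points to. The following code gives the correct result: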
#include <string>

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A";
int i1 = sizeof(wchar_t);   // 2 on Windows
std::wstring str1(s1);
std::wstring str2(s2);
int i2 = str1.size();       // 1 code unit
int i3 = str2.size();       // 2 code units (the surrogate pair)
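The difference between the size of a pointer, the size of the literal's array, and the string length can be sketched as follows. This example uses `char16_t` as a portable stand-in for the 16-bit `wchar_t` on Windows (an assumption, since `wchar_t` is 4 bytes on many other platforms), and the helper names `literal_bytes` and `code_units` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by a string literal's array,
// including the terminating null. Accepts only real arrays, not pointers.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_bytes(const CharT (&)[N]) {
    return N * sizeof(CharT);
}

// Hypothetical helper: number of code units before the terminating null.
template <typename CharT>
std::size_t code_units(const CharT* s) {
    return std::char_traits<CharT>::length(s);
}

// Small demonstration of the three different "sizes".
inline void demo() {
    const char16_t* p = u"\U0002008A";
    assert(sizeof(p) == sizeof(void*));          // pointer size, unrelated to the string
    assert(literal_bytes(u"\U0002008A") == 6);   // 2 surrogates + terminator, 2 bytes each
    assert(code_units(p) == 2);                  // the surrogate pair only
}
```

In other words, the 6 bytes expected in the question are the size of the array behind the literal, which is lost as soon as the literal decays to a `const wchar_t*`.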