The wchar_t* size for a surrogate pair (Unicode character outside the BMP) on Windows

I ran into an interesting problem on Windows 8. I have verified that I can represent Unicode characters outside the BMP using wchar_t* strings. However, the following test code gave me unexpected results:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
int i3 = sizeof(s2); // i3 == 4, why?

U+2008A is a Han character that lies outside the Basic Multilingual Plane, so it has to be represented by a surrogate pair in UTF-16. If I understand this correctly, that means it takes two wchar_t characters. I therefore expected sizeof(s2) to be 6 (4 for the two wchar_t-s of the surrogate pair and 2 for the terminating \0).
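The surrogate-pair arithmetic behind that expectation can be sketched as follows. This is a minimal illustration of the standard UTF-16 encoding rule, not code from the question; the names SurrogatePair and to_surrogates are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the two UTF-16 code units of one surrogate pair.
struct SurrogatePair {
    std::uint16_t high;
    std::uint16_t low;
};

// Splits a code point above U+FFFF into its UTF-16 surrogate pair.
SurrogatePair to_surrogates(std::uint32_t code_point) {
    assert(code_point >= 0x10000 && code_point <= 0x10FFFF);
    const std::uint32_t offset = code_point - 0x10000;  // 20 significant bits
    return SurrogatePair{
        static_cast<std::uint16_t>(0xD800 + (offset >> 10)),   // top 10 bits
        static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF))  // bottom 10 bits
    };
}
```

For U+2008A this yields the units 0xD840 and 0xDC8A, which is why the string needs two 16-bit wchar_t elements plus the terminator.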

So why is sizeof(s2) == 4? I verified that the s2 string is constructed correctly: I rendered it using DirectWrite and the Han character was displayed correctly.

UPDATE: As Naven pointed out, I was determining the size of the arrays incorrectly. The following code gives the correct result:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

std::wstring str1 (s1);
std::wstring str2 (s2);

int i2 = str1.size(); // i2 == 1.
int i3 = str2.size(); // i3 == 2, because two wchar_t characters are needed for the surrogate pair.
3 answers

sizeof(s2) returns the number of bytes needed to store the pointer s2 itself, which, like any other pointer, takes 4 bytes on your system. This has nothing to do with the character(s) stored in the memory that s2 points to.
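The difference between the size of a pointer and the size of the array it points into can be sketched like this. The helper names literal_bytes and pointer_bytes are hypothetical and only serve the illustration; the byte counts depend on the platform's sizeof(wchar_t) and pointer width.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: bytes occupied by a wide string literal's array,
// including the terminating L'\0'. Taking the array by reference keeps
// its length N in the type, so it does not decay to a pointer.
template <std::size_t N>
std::size_t literal_bytes(const wchar_t (&text)[N]) {
    assert(std::wcslen(text) + 1 == N);  // holds when there is no embedded L'\0'
    return sizeof(text);                 // N * sizeof(wchar_t)
}

// Hypothetical helper: bytes occupied by the pointer variable itself,
// regardless of what it points to.
std::size_t pointer_bytes(const wchar_t* text) {
    return sizeof(text);  // the pointer width, e.g. 4 in a 32-bit build
}
```

With the 2-byte wchar_t used on Windows, literal_bytes(L"\U0002008A") would give 6, the value the question expected, while pointer_bytes gives the same result for every string.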


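sizeof(wchar_t*) is the same as sizeof(void*), in other words, the size of the pointer itself. That is 4 on a 32-bit system and 8 on a 64-bit system. You need to use wcslen() or lstrlenW() instead of sizeof():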

const wchar_t* s1 = L"a"; 
const wchar_t* s2 = L"\U0002008A"; // The "Han" character 

int i1 = sizeof(wchar_t); // i1 == 2
int i2 = wcslen(s1); // i2 == 1
int i3 = wcslen(s2); // i3 == 2

Adding to the other answers.
RE: sorting out the different units used in the question's update for i1, i2, and i3.

i1: the value 2 is a size in bytes.
i2: the value 1 is a length in wchar_t units, IOW 2 bytes (given that sizeof(wchar_t) is 2 here), not counting the terminating \0.
i3: the value 2 is a length in wchar_t units, IOW 4 bytes, again not counting the terminating \0.
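To keep the units apart, here is a minimal sketch that reports the same string in bytes, in UTF-16 code units, and in code points. It assumes 2-byte code units as on Windows and uses char16_t with std::u16string as a stand-in for the 16-bit wchar_t, so the numbers do not depend on the platform; the function names are hypothetical, and code_points assumes well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in UTF-16 code units, not counting the terminator.
std::size_t code_units(const std::u16string& text) {
    return text.size();
}

// The same length expressed in bytes.
std::size_t byte_length(const std::u16string& text) {
    return text.size() * sizeof(char16_t);
}

// Number of code points, assuming well-formed UTF-16: every unit except
// a low (trailing) surrogate starts a new code point.
std::size_t code_points(const std::u16string& text) {
    std::size_t count = 0;
    for (char16_t unit : text) {
        const bool is_low_surrogate = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!is_low_surrogate) {
            ++count;
        }
    }
    assert(count <= text.size());  // never more code points than code units
    return count;
}
```

For u"\U0002008A" this gives 1 code point, 2 code units, and 4 bytes; for u"a" it gives 1, 1, and 2.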
