UTF8CharLength

UTF8CharLength (str, startByte)

The UTF8CharLength function returns the number of bytes in the UTF-8 character ("code point") that begins at the startByte offset in the string expression str.

When str[startByte] is a normal ASCII character (byte value in the range of 0 to 127), UTF8CharLength returns 1.

In Unicode, a grapheme cluster is a sequence of one or more code points that together represent a single visual character, as seen by the user. Code points (or "UTF-8 characters") are encoded in UTF-8 as one, two, three, or four bytes.
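In UTF-8, the first byte of a code point indicates how many bytes the code point occupies. The following is a minimal sketch of that lead-byte rule, written as a hypothetical user function named LeadByteLength. It is for illustration only: it is not necessarily how UTF8CharLength is implemented, it does not validate continuation bytes, and the NaN return for an invalid lead byte is an assumption of this sketch rather than documented UTF8CharLength behavior.

```igor
// Hypothetical helper, not a built-in: infers the code point length from its first byte.
Function LeadByteLength(String str, Variable startByte)
	if (startByte < 0 || startByte >= strlen(str))
		return 0					// offset is outside the string
	endif

	Variable firstByte = char2num(str[startByte]) & 0xFF
	if (firstByte < 0x80)
		return 1					// 0xxxxxxx: ASCII
	elseif ((firstByte & 0xE0) == 0xC0)
		return 2					// 110xxxxx: lead byte of a two-byte sequence
	elseif ((firstByte & 0xF0) == 0xE0)
		return 3					// 1110xxxx: lead byte of a three-byte sequence
	elseif ((firstByte & 0xF8) == 0xF0)
		return 4					// 11110xxx: lead byte of a four-byte sequence
	endif

	return NaN						// continuation or invalid byte (assumption of this sketch)
End
```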

For example, the grapheme á (the letter 'a' with an acute accent) can be represented two ways in Unicode:

as a single Unicode code point, U+00E1 (a precomposed á), encoded in UTF-8 as the two-byte sequence 0xC3 0xA1, or
as a grapheme composed of two Unicode code points: the one-byte UTF-8 character 0x61, which encodes the lower-case 'a' (also a normal ASCII character code), followed by the two-byte UTF-8 sequence 0xCC 0x81, which encodes the code point U+0301 COMBINING ACUTE ACCENT.

With startByte set to 0, UTF8CharLength returns 2 in the first case and 1 in the second, because it measures only the code point that begins at startByte, not the entire grapheme.
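The two representations can be compared directly. The function below is a short sketch; the name DemoAcuteForms and its local variable names are hypothetical, and the values in the comments are taken from the description above and from the example output later in this topic.

```igor
// Hypothetical demo function comparing the two encodings of á.
Function DemoAcuteForms()
	String precomposed = "\xC3\xA1"			// U+00E1 as a single code point
	String decomposed = "a\xCC\x81"			// 'a' followed by U+0301 COMBINING ACUTE ACCENT

	Print UTF8CharLength(precomposed, 0)	// 2: the precomposed code point
	Print UTF8CharLength(decomposed, 0)		// 1: the ASCII 'a'
	Print UTF8CharLength(decomposed, 1)		// 2: the combining accent
End
```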

UTF8CharLength returns NaN if str is NULL or if startByte is NaN or ±Inf.

You can use numtype to test if the result is NaN, or use the stringIsNull function on str.

UTF8CharLength returns 0 if startByte is < 0 or startByte >= strlen(str).
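The special return values can be handled as in the following sketch. The wrapper name CheckedCharLength and its messages are hypothetical; the sketch relies only on the return values described above and on numtype returning 2 for NaN.

```igor
// Hypothetical wrapper that reports why no character length is available.
Function CheckedCharLength(String str, Variable startByte)
	Variable numBytes = UTF8CharLength(str, startByte)

	if (numtype(numBytes) == 2)				// NaN
		Print "str is NULL or startByte is NaN or ±Inf"
	elseif (numBytes == 0)
		Print "startByte is outside the string"
	endif

	return numBytes
End
```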

The UTF8CharLength function was added in Igor Pro 10.00. It is the equivalent of the NumBytesInUTF8Character user function described in Character-by-Character Operations.

Example

// print the string and break it down to the graphemes and UTF-8 code points
Function ExamineUTF8Text(String str)
	
	Variable totalBytes = strlen(str)
	Printf "Entire string = \"%s\", total bytes = %d\r",str,totalBytes

	Print "UTF8 code points and graphemes individually"
	Variable startByte = 0
	do
		Variable graphemeLen = graphemeLength(str,startByte)
		String grapheme = str[startByte,startByte+graphemeLen-1]
		Printf "startByte=%d, %d bytes in grapheme, grapheme=\"%s\"\r", startByte, graphemeLen, grapheme

		// enumerate the code points.
		Variable codePointStart = startByte
		Variable codePointBytes = 0
		Variable codePointNum = 1 // first code point

		// The first code point's length is known to be <= graphemeLen; the two are equal
		// when the grapheme consists of a single code point.
		do
			Variable codepointLen = UTF8CharLength(str,codePointStart)
			String character = str[codePointStart,codePointStart+codepointLen-1]

			// print each byte in the code point/character
			String chars=""
			Variable c
			for (c=0; c<codepointLen; c+=1)
				Variable byteVal = char2num(character[c]) & 0x00FF
				sprintf chars, "%s 0x%X", chars, byteVal
			endfor 
			Printf "\t\tcodepoint=%d, startByte=%d, UTF8CharLength = %d bytes = %s\r", codePointNum, codePointStart, codepointLen, chars	
			
			codePointNum += 1
			codePointStart += codepointLen // advance
			codePointBytes += codepointLen
		while (codePointBytes < graphemeLen )
		
		startByte += graphemeLen // advance to start of next grapheme
	while (startByte < totalBytes)
	Print ""
End

•ExamineUTF8Text("a\xCC\x81 ♣ Δ") // á is the letter "a" combined with an acute accent U+0301
  Entire string = "á ♣ Δ", total bytes = 10
  UTF8 code points and graphemes individually
  startByte=0, 3 bytes in grapheme, grapheme="á"
               codepoint=1, startByte=0, UTF8CharLength = 1 bytes =  0x61
               codepoint=2, startByte=1, UTF8CharLength = 2 bytes =  0xCC 0x81
  startByte=3, 1 bytes in grapheme, grapheme=" "
               codepoint=1, startByte=3, UTF8CharLength = 1 bytes =  0x20
  startByte=4, 3 bytes in grapheme, grapheme="♣"
               codepoint=1, startByte=4, UTF8CharLength = 3 bytes =  0xE2 0x99 0xA3
  startByte=7, 1 bytes in grapheme, grapheme=" "
               codepoint=1, startByte=7, UTF8CharLength = 1 bytes =  0x20
  startByte=8, 2 bytes in grapheme, grapheme="Δ"
               codepoint=1, startByte=8, UTF8CharLength = 2 bytes =  0xCE 0x94

References

https://www.utf8-chartable.de/unicode-utf8-table.pl
