@zipit said in How to handle C4D Unicode in Python scripting?:
well, that example defines the encoding, which is kind of the point of that example.
It is a unicode literal that contains a non-ASCII character, so it's the same as u"äöü" in my eyes...
About your other code - the following code:
a = "Bär"
print "str literal:", a, len(a), type(a)
prints the following when run from a default Python 2.7.8 interpreter:
str literal: Bär 4 <type 'str'>
That is weird. If that is the standard implementation, it makes no sense to me... the str is interpreted as three letters when written to the output, but if used otherwise (be it with a for loop, an index, a slice, or len()) it treats the content like a sequence of single bytes. That means that all functions that rely on cutting up the string or extracting something by index will potentially slash a multi-byte encoded character in half. Seems like a contradiction in handling.
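The mismatch between bytes and characters can be reproduced outside Cinema 4D. A minimal sketch, assuming the name is stored as UTF-8; the byte string spells "Bär" with escapes so the result does not depend on the source file encoding, and the snippet is written to behave the same under Python 2.7 and Python 3:

```python
# "Bär" as UTF-8 bytes: the 'ä' takes two bytes (0xC3 0xA4).
raw = b"B\xc3\xa4r"
text = raw.decode("utf-8")  # unicode string with three characters

print("bytes: %d, characters: %d" % (len(raw), len(text)))

# Slicing the byte string after two bytes cuts the 'ä' in half,
# so the fragment is no longer valid UTF-8.
fragment = raw[:2]
try:
    fragment.decode("utf-8")
    fragment_ok = True
except UnicodeDecodeError:
    fragment_ok = False
print("fragment decodes cleanly: %s" % fragment_ok)

# Slicing the decoded string keeps every character intact.
print("slice is 'B' + 'ä': %s" % (text[:2] == u"B\xe4"))
```

The byte string has length 4 and the decoded string length 3, which matches the len() result above.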
I don't even want to stick with literals. Getting a string directly from the API, as with GetName(), yields the same encoded content with the same problems. If I name an object "Bär" and get the name string with GetName(), then len() is still 4. That is simply not what I expect.
If a string contains encoded content, I would assume that all functions that handle this string keep the integrity of the single characters (not bytes), so len("Bär") should be 3. (It gets difficult enough when Unicode uses separate (and potentially multiple) diacritical marks, modifiers, directional codes, or other stuff that makes it difficult to tell the characters apart...)
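The point about separate diacritical marks can be made concrete. Even on a proper unicode string, len() counts code points and not what a reader perceives as characters. A small sketch using the standard unicodedata module; the two spellings of "Bär" are my own illustration:

```python
import unicodedata

composed = u"B\u00e4r"     # 'ä' as one precomposed code point
decomposed = u"Ba\u0308r"  # 'a' followed by COMBINING DIAERESIS

print(len(composed))    # one code point per visible letter
print(len(decomposed))  # the combining mark counts separately

# NFC normalization composes the pair into a single code point
# where a precomposed form exists.
normalized = unicodedata.normalize("NFC", decomposed)
print(len(normalized))
print(normalized == composed)
```

Normalizing to NFC before measuring or slicing helps with cases like this one, but it is not a general fix: sequences without a precomposed form still occupy several code points.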
What I have found working with Unicode is the following (samples):
print "String:", myUstr
print "Length:", len(myUstr), "Type:", type(myUstr)
for c in myUstr: print c, "(" + str(ord(c)) + ") -",
print "---------- Start ----------"
a = "äöü".decode('utf-8')
b = u"\u0189\u018B\u01F7"
c = "ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆".decode('utf-8')
d = op.GetName()
d = d.decode('utf-8')
op.SetName(d + (" a∏ß".decode('utf-8')))
A literal containing Unicode characters needs to be decoded (see variable a).
If the literal is supposed to contain escaped Unicode characters, then it must be an explicit unicode literal (see variable b). This cannot be decoded, and it must contain only 7-bit ASCII characters other than the escaped ones. Inserting Unicode characters directly results in multibyte codes being inserted as multiple characters (not as the intended encoded character).
As variable c shows, decoding works even for multibyte characters that have been copied from some text editor.
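The relation between the escaped unicode literal (variable b) and the decoded byte strings (variables a and c) can be checked with a round trip. A rough sketch, assuming UTF-8 as the byte encoding; it uses only escapes, so it runs the same under Python 2.7 and Python 3:

```python
# Same three characters as variable b, written with escapes only.
escaped = u"\u0189\u018B\u01F7"

# Encoding to UTF-8 gives the byte form that a plain str literal would hold.
encoded = escaped.encode("utf-8")

print(len(escaped))  # counted in characters
print(len(encoded))  # counted in bytes, two per character here

# Decoding the bytes yields the original unicode string again.
print(encoded.decode("utf-8") == escaped)
```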
Names from the API, as in d, need to be decoded too to yield a unicode string. After that, len, index and slice work fine. You can write that string back as the name directly, as SetName() accepts a unicode string.
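Decoding at the API boundary can be wrapped in a small helper so that the rest of a script only ever sees unicode. This is a sketch, not part of the c4d API: to_unicode is a hypothetical name, and it assumes that GetName() returns a UTF-8 encoded byte string as observed above. The byte literal stands in for the return value of op.GetName().

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode string, decoding only if it is a byte string."""
    # Hypothetical helper; under Python 2.7, bytes is an alias for str.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

name = to_unicode(b"B\xc3\xa4r")  # stands in for op.GetName()

print(len(name))                 # counted in characters now
print(name[1] == u"\xe4")        # indexing returns the whole 'ä'
print(to_unicode(name) is name)  # already-decoded input passes through
```

Because the helper leaves unicode input untouched, it is safe to call on values that may or may not have been decoded already.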
And now I close shop for today...