Hello, I seem to have stumbled over some interesting quirks in C4D Python that make it impossible to work with non-ASCII strings.
Before I start: I am aware of the u prefix for unicode strings, the unicode string type, and the unicode character escape notations, so this is not really a Python question but an implementation question. Maybe I'm missing something essential...
Here's a test script:
import c4d
import maxon
from maxon import String

def main():
    print "---------- Start ----------"
    a = "äöü"
    print "Umlaut:", a, len(a), a[0], a[1:3]
    b = u"äöü"
    print "Unicode umlaut:", b, len(b), b[0], b[1:3]
    c = "\u0189\u018B\u01F7"
    print "u escape:", c, len(c), c[0], c[1:3]
    d = u"\u0189\u018B\u01F7"
    print "Unicode u escape:", d, len(d), d[0], d[1:3]
    e = "ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
    print "Multibyte characters:", e, len(e), e[0], e[1:3]
    f = u"ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
    print "Unicode multibyte characters:", f, len(f), f[0], f[1:3]
    x = op.GetName() # object is called äöü
    print "Object name:", x, len(x), x[0], x[1:3]
    y = String("Bär")
    print "Explicit string:", y, "no len() attribute" #len(y), y[0], y[1:3]
    z = str("Bär")
    print "Python string:", z, len(z), z[0], z[1:3]
    print type(a), type(b)
    print type(c), type(d)
    print type(e), type(f)
    print type(x), type(y), type(z)
    #p = unicode(op.GetName())
    #print "Object name cast to unicode:", p, len(p), p[0], p[1:3]
    q = str(op.GetName())
    print "Object name cast to str:", q, len(q), q[0], q[1:3]

if __name__=='__main__':
    main()
(I do see the unicode characters in the post preview, so this should be visible to everyone. Yes, I meant to write these.)
And here are the results from the console (I did not test everything with a MessageDialog):
---------- Start ----------
Umlaut: äöü 6
Unicode umlaut: äöü 6 à ¤Ã
u escape: \u0189\u018B\u01F7 18 \ u0
Unicode u escape: ƉƋǷ 3 Ɖ ƋǷ
Multibyte characters: ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆ 33
Unicode multibyte characters: ÎΩΨ ЩÐÐ â¡â¦â´ âââ 33 Î Î
Object name: äöü 6
Explicit string: Bär no len() attribute
Python string: Bär 4 B ä
<type 'str'> <type 'unicode'>
<type 'str'> <type 'unicode'>
<type 'str'> <type 'unicode'>
<type 'str'> <class 'maxon.reference.String'> <type 'str'>
Object name cast to str: äöü 6
At the beginning, I am testing a simple sequence of German umlauts in a and b. These are located in my national extended ASCII codepage (so they can be written with one byte). One might expect that these at least should work, but no.
With the str literal, the output is correct, but len() returns 6, which looks like the byte length of a two-byte-per-character encoding rather than the character count of 3. Neither the single index nor the slice works.
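To show where the 6 could come from, here is a minimal sketch in plain Python. It assumes the literal in a ends up as UTF-8 bytes, which is my guess and not something I have confirmed for the Script Manager. The variable names are only for illustration, and the snippet is written so that it runs outside of C4D as well.

```python
# -*- coding: utf-8 -*-
# Assumption: the Python 2 str literal "äöü" holds UTF-8 bytes.
text = u"\u00e4\u00f6\u00fc"    # äöü as three real characters
raw = text.encode("utf-8")      # the bytes a Python 2 str would hold

char_count = len(text)          # counts characters
byte_count = len(raw)           # counts bytes, two per umlaut
first_byte = raw[0:1]           # only the first half of "ä", not printable on its own

print(char_count)
print(byte_count)
```

If the assumption holds, a[0] and a[1:3] cut through the middle of characters, which would explain why nothing usable is printed for them.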
With the unicode literal, the output shows up in some encoded form (I didn't bother to find out which encoding). The single index and slice work fine, sort of, if what you wanted was to slice the encoded string.
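One combination that would reproduce this pattern is UTF-8 bytes being decoded one byte per character, as Latin-1 does. The following is only a sketch of that guess in plain Python; I have not verified that this is what C4D does with the source text.

```python
# -*- coding: utf-8 -*-
# Assumption: the UTF-8 source bytes are decoded as Latin-1 (one byte = one character).
utf8_bytes = u"\u00e4\u00f6\u00fc".encode("utf-8")   # 6 bytes for äöü
garbled = utf8_bytes.decode("latin-1")               # every byte becomes its own character

garbled_len = len(garbled)    # 6, like len(b) in the console output
first = garbled[0]            # U+00C3
middle = garbled[1:3]         # U+00A4 followed by U+00C3

# The damage is reversible as long as no byte was lost on the way.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The values of first and middle correspond to what the console prints for b[0] and b[1:3].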
Okay, so perhaps using actual characters beyond plain 7-bit ASCII doesn't work. Next, I try encoding Unicode characters by their escape sequences in c and d.
With the str literal, the escape sequences are not interpreted. That is fine, as this is standard Python behavior.
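The same effect can be sketched outside of C4D. The raw literal below is my stand-in for the Python 2 str in c, chosen so the snippet behaves identically on Python 3, and the unicode_escape codec is the standard Python way to interpret such sequences after the fact.

```python
# Raw literal: the backslashes stay ordinary characters, as in the Python 2 str c.
literal = r"\u0189\u018B\u01F7"

literal_len = len(literal)      # three sequences of six characters each
first = literal[0]              # the backslash
middle = literal[1:3]           # "u0"

# Interpreting the escape sequences explicitly yields the three characters of d.
decoded = literal.encode("ascii").decode("unicode_escape")
```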
With the unicode literal, the output and the single index and the slice are all fine! Yay, this seems the way to go. (But read on...)
For the fun of it, I took a text editor and created a few Unicode characters beyond the 1-byte range: Greek, Cyrillic, Currency, Math (all from left-to-right scripts, so we won't run into issues there). The Script Manager doesn't seem to mind. (If I save the script, these characters appear encoded in the source, but upon reloading, they are restored to their Unicode glory.) That's samples e and f.
Here, the same thing happens as with the umlauts. With the str literal, the output is fine but length, index and slice give me the wrong results. With the unicode literal, I see only the encoded characters. Note that this sequence contains many non-printable characters; otherwise you could see that the len, index and slice results are actually correct for the encoded sequence.
What's funny here is that I can write these multibyte characters into a plain str. With äöü, I can argue that these can be represented by a single byte and therefore are fine to use in a str. With Greek and Cyrillic, I can't - but the str in e is the one that gives me the correct output in the console at least. Huh? There must be a good deal of transformation going on behind the scenes...
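The length of 33 is at least consistent with a UTF-8 byte count. A small sketch in plain Python, assuming UTF-8 is the encoding involved (again my assumption), that splits the sample into its four groups and compares character and byte widths:

```python
# -*- coding: utf-8 -*-
sample = u"ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"

# (group, number of characters, number of UTF-8 bytes) for each group
widths = [(word, len(word), len(word.encode("utf-8")))
          for word in sample.split(u" ")]

total_chars = len(sample)                    # includes the three spaces
total_bytes = len(sample.encode("utf-8"))    # the value len(e) and len(f) report

for word, chars, nbytes in widths:
    print("%d characters, %d bytes" % (chars, nbytes))
```

Greek and Cyrillic take two bytes per character here, the currency and math symbols three, which adds up to the reported length once the single-byte spaces are included.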
But at least I have found the way to handle C4D strings, right? Not quite...
Next, I try to read a name from an object in the Object Manager into x. This is named äöü, and upon checking that, I get the same results as from the str literal äöü - correct output, wrong length, index, and slice.
Hmm. Maybe the result of GetName() must be used as a maxon.String instead. I create a variable y that is a String. This works with a literal containing an umlaut and gives me the correct output. But it supports neither len() nor a GetLength() method, and I cannot index into a String either.
The documentation is also very sparse on String in Python. Apparently, we are supposed to use the Python internal classes str and unicode, which are either wrappers around maxon.String or are converted when used in the API.
Lastly, I again create an explicit str object from a literal that contains an umlaut. This z behaves like the literal without the cast, unsurprisingly. Note that index and slice indicate that the encoding uses varying byte lengths: the B comes out fine, and the ä is okay if you think of it as a 2-byte code.
Printing the types holds no surprises. The explicit literals all give us str or unicode as expected. The object name results in a str (not a unicode, although C4D strings are supposed to support Unicode!). The explicit constructors do what they should.
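The varying byte lengths can be made visible with a small helper. This is a sketch under the assumption that the bytes are UTF-8; utf8_spans is a hypothetical name of mine, not part of any API.

```python
# -*- coding: utf-8 -*-
def utf8_spans(text):
    """Return (character, start, end) byte offsets for each character in UTF-8."""
    spans = []
    offset = 0
    for ch in text:
        width = len(ch.encode("utf-8"))   # 1 byte for ASCII, more for everything else
        spans.append((ch, offset, offset + width))
        offset += width
    return spans

spans = utf8_spans(u"B\u00e4r")   # Bär
total_bytes = spans[-1][2]        # end offset of the last character
```

For Bär this puts B at bytes 0 to 1 and ä at bytes 1 to 3, so z[0] and z[1:3] happen to fall exactly on character boundaries, unlike the slices of äöü.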
Well, now here's the riddle: If an API function like GetName() gives us a str type, but the str type does not work properly with len, index, and slice, then how do we work in Python with C4D names that happen to be Unicode?
I try to cast str to unicode in p (after all, for the unicode literal d the functionality is there), but this attempt crashes with a fat error:
Traceback (most recent call last):
  File "D:\3D\Cinema4D\HomeDir\Cinema4D V21_6F07B783\library\scripts\Test_StringLiteralsAndUnicode.py", line 38, in <module>
    main()
  File "D:\3D\Cinema4D\HomeDir\Cinema4D V21_6F07B783\library\scripts\Test_StringLiteralsAndUnicode.py", line 32, in main
    #p = unicode(op.GetName())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Therefore I have commented it out here. Casting to str in q is (as expected) pointless, as the type is already str.
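The error itself can be reproduced without C4D. In Python 2, unicode(some_str) without a second argument decodes with the default codec, which is normally ASCII. The sketch below uses a byte string as a stand-in for the return value of op.GetName() and assumes those bytes are UTF-8; the 0xc3 in the traceback fits that, but I have not confirmed it.

```python
# Stand-in for op.GetName(): the UTF-8 bytes of "äöü" (assumption, see above).
name_bytes = b"\xc3\xa4\xc3\xb6\xc3\xbc"

try:
    name_bytes.decode("ascii")            # what the default conversion attempts
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True                   # byte 0xc3 is outside the ASCII range

# Naming the codec explicitly avoids the ASCII default.
name_text = name_bytes.decode("utf-8")
name_len = len(name_text)                 # character count, not byte count
```

In Python 2 the equivalent spelling with the built-in would be unicode(name_bytes, "utf-8").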
I do not mind writing unicode escape codes to get the string functionality working - but how do I get a unicode type from the C4D API to work with it, in the first place?
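What I would hope for is something along these lines: decode once when a name comes out of the API, work on the unicode value, and encode again before handing it back. This is only a sketch of that idea, not a confirmed answer. to_text and to_bytes are hypothetical helpers of mine, and UTF-8 as the encoding used by the C4D API is an assumption.

```python
def to_text(value, encoding="utf-8"):
    """Return unicode text; decode only if we were handed a byte string."""
    if isinstance(value, bytes):          # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return value

def to_bytes(value, encoding="utf-8"):
    """Return a byte string; encode only if we were handed unicode text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

raw_name = b"\xc3\xa4\xc3\xb6\xc3\xbc"    # stand-in for op.GetName(), assumed UTF-8
name = to_text(raw_name)                  # three characters
shortened = name[1:3]                     # slicing now works per character
back = to_bytes(shortened)                # bytes again, e.g. for passing back to the API
```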