How to handle C4D Unicode in Python scripting?



  • @zipit said in How to handle C4D Unicode in Python scripting?:

    1. It seems to me that you are mixing up Python 2 and 3 features regarding unicode strings. You cannot enforce unicode representation for characters in unicode (generated by the OS, your keyboard, etc.) in Python 2. So print (u"ä ö ü") will not work in Python 2, as it will cause Python to try to interpret these characters as hex ids. But it will work in Python 3. print ("ä ö ü") will work in both Python 2 and 3, given your system locale supports it.

    Thanks, I think I have found the core issue now (and I would say it's a C4D bug). See next post... I'll try to keep this short this time ;-)

    I was not trying to mix Python 3 features into this; actually, Python 3 doesn't even have the unicode class... all strings are now unicode. But as C4D is still stuck with Python 2.7 or something, the current implementation is buggy in this respect.

    2. Your unicode symbols are weird - at least for the German diacritics. The 16-bit unicode symbols for "äöü" should be u"\u00E4 \u00F6 \u00FC" and work fine for me here.

    Heh, you are right of course. The variables c and d were not supposed to represent äöü though, but some random Unicode characters.
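
    As a quick check of the escapes quoted above, here is a minimal sketch that builds the umlauts once from escaped code points and once from their UTF-8 bytes. The variable names are illustrative and not from the thread; the snippet is written to run under both Python 2.7 and Python 3.

```python
# Build the same three umlauts in two ways and compare them.
escaped = u"\u00e4\u00f6\u00fc"             # escaped code points, pure ASCII source
utf8_bytes = b"\xc3\xa4\xc3\xb6\xc3\xbc"    # the UTF-8 encoding of the same text
decoded = utf8_bytes.decode("utf-8")        # bytes -> unicode string

same_text = (escaped == decoded)
code_points = [ord(ch) for ch in escaped]   # 0xE4, 0xF6, 0xFC as integers

print(same_text)
print(code_points)
```

    Both routes should end in the same three-character unicode string, which is why the escaped form is a safe fallback when the source file encoding is in doubt.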



  • I was not trying to mix Python 3 features into this, actually Python 3 doesn't even have the unicode class... all strings are now unicode.

    Well, that was sort of my point. I do not understand the purpose of u"äöü" in your code then, since Python 2 is expecting an escaped string there [1]. Or am I overlooking something?

    [1] Python 2.7 Unicode HOWTO. url: https://docs.python.org/2/howto/unicode.html
    Cheers
    zipit



  • Okay, I checked some more, and I think there is a C4D bug at the core of the matter.

    I tried using # -*- coding: latin-1 -*- to get at least the German umlauts corrected, also with the coding utf-8. That didn't help though (and anyway, it would only solve the literal issue which is not the core problem).

    After digging through the internals of Python's unicode class, I am now convinced that str is wrongly implemented in C4D's Python. str in Python 2.7 should be a one-byte representation that allows characters up to codepoint 255. What we get returned from BaseObject.GetName() and also from the literal construction of a str is actually a UTF-8 encoded string (which is what Python 3 would do, as this version does not have a unicode class any more, and all str objects are UTF-8 unicode).

    That doesn't matter for pure 7-bit ASCII as the representation is the same, but it goes haywire for all characters >127 (as far as one-byte representation would be possible), and especially for all Unicode characters beyond the one-byte codepage.

    str is actually currently (R21) built as unicode with this encoding. But it does not support the proper len, index, and slice functions - these still treat the characters as if they were one-byte codes. This extracts partial codes from the multi-byte encodings, which by the nature of UTF-8 will be >127.

    I found that the decode function, invoked on such a str, actually reinterprets the string as already UTF-8-encoded, and returns a proper unicode string, which allows me to use len, index and slice in the intended way:

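    import c4d
    
    def main():
        print "---------- Start ----------"
    
        # Byte string literal: len() and iteration work on single bytes
        a = "Bär"
        print "str literal:", a, len(a), type(a)
        for c in a: print c, " ",
        print
        # Decode the UTF-8 bytes into a unicode string: len() and iteration
        # now work on whole characters
        u = a.decode('utf-8')
        print "decoded as unicode:", u, len(u), type(u)
        for c in u: print c, " ",
        print
        
    if __name__=='__main__':
        main()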
    

    Result:

    ---------- Start ----------
    str literal: Bär 4 <type 'str'>
    B         r
    decoded as unicode: Bär 3 <type 'unicode'>
    B   ä   r
    

    The same works actually on the str returned by GetName().
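
    The byte count versus character count can also be reproduced outside of C4D. The following is only a sketch: the name is built from an escaped literal and encoded by hand to stand in for what GetName() returned in the experiment above, and the variable names are made up for illustration. It runs under Python 2.7 and Python 3.

```python
# "Bär" as three code points, then as the UTF-8 bytes a Python 2 str would hold.
text = u"B\u00e4r"
raw = text.encode("utf-8")              # 'B', 0xC3, 0xA4, 'r'

byte_count = len(raw)                   # counts bytes
char_count = len(raw.decode("utf-8"))   # counts code points

print(byte_count)
print(char_count)
```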

    (I'm continuing the experiments)



  • @zipit said in How to handle C4D Unicode in Python scripting?:

    Well, that was sort of my point. I do not understand the purpose of u"äöü" in your code then, since Python 2 is expecting an escaped string there [1]. Or am I overlooking something?

    [1] Python 2.7 Unicode HOWTO. url: https://docs.python.org/2/howto/unicode.html

    If you scroll down on that documentation, you will find the sample code:

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-
    
    u = u'abcdé'
    print ord(u[-1])
    

    which is supposed to work. So, at least if I specify the encoding in this comment notation, I should be able to use these umlauts in a unicode literal.
    If the encoding notation is not supported and there is no encoding default either, then using unsupported characters in a literal should raise an error.
    Instead, the notation creates a unicode string in which the UTF-8 symbols are contained as characters. A double encoding, so to speak.
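
    The "double encoding" effect can be imitated in a few lines. This sketch assumes the UTF-8 bytes of the literal end up being read one byte per character (as Latin-1 decoding does); whether C4D's interpreter does exactly that internally is not confirmed in this thread. Names are illustrative, and the code runs under Python 2.7 and Python 3.

```python
original = u"\u00e4"                      # "ä", one code point
utf8_bytes = original.encode("utf-8")     # two bytes: 0xC3 0xA4
doubled = utf8_bytes.decode("latin-1")    # each byte becomes its own character

doubled_points = [hex(ord(ch)) for ch in doubled]
print(len(original))
print(len(doubled))
print(doubled_points)

# Reversing the wrong step recovers the intended character.
repaired = doubled.encode("latin-1").decode("utf-8")
print(repaired == original)
```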



  • Hi,

    well, that example defines the encoding, which is kind of the point of that example. About your other code - the following code:

    a = "Bär"
    print "str literal:", a, len(a), type(a)
    

    will return, run from a default Python 2.7.8 interpreter:

    str literal: Bär 4 <type 'str'>
    

    Now that you mention it, I remember reading about that weird behavior of len and unicode literals in Python 2.7 before. Iterating through that string will cause an exception in Python 2.7.8 because of that.

    I do not know what Cinema's Python does behind the curtain, but to me it looks more like a feature than a bug.

    Cheers
    zipit



  • @zipit said in How to handle C4D Unicode in Python scripting?:

    well, that example defines the encoding, which is kind of the point of that example.

    It is a unicode literal that contains a non-ASCII character, so it's the same as u"äöü" in my eyes...

    About your other code - the following code:

    a = "Bär"
    print "str literal:", a, len(a), type(a)
    

    will return, run from a default Python 2.7.8 interpreter:

    str literal: Bär 4 <type 'str'>
    

    That is weird. If that is the standard implementation, it makes no sense to me... the str is interpreted as three letters when written to the output, but if used otherwise (be it with a for loop, an index, a slice, or len()) it treats the content like a sequence of single bytes. That means that all functions that rely on cutting up the string or extracting something by index will potentially slash a multi-byte encoded character in half. Seems like a contradiction in handling.
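
    To make the "slashed in half" case concrete, here is a small sketch (illustrative names, runs under Python 2.7 and Python 3) that slices the UTF-8 bytes of "Bär" after two bytes and then tries to decode the result.

```python
raw = u"B\u00e4r".encode("utf-8")   # 4 bytes: 'B', 0xC3, 0xA4, 'r'

head = raw[:2]                      # 'B' plus only the first byte of "ä"
try:
    head.decode("utf-8")
    cut_in_half = False
except UnicodeDecodeError:
    # The slice ended in the middle of a two-byte sequence.
    cut_in_half = True

safe_head = raw.decode("utf-8")[:2]  # decode first, then slice by character

print(cut_in_half)
print(len(safe_head))
```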

    I don't even want to stick with literals. Getting a string directly from the API as with GetName() causes the same encoded content with the same problems. If I name an object "Bär" and get the name string with GetName() then the len() is still 4. That is simply not what I expect.

    If a string contains encoded content, I would assume that all functions that handle this string keep the integrity of the single characters (not bytes), so len("Bär") should be 3. (It gets difficult enough when Unicode uses separate (and potentially multiple) diacritical marks, modifiers, directional codes, or other stuff that makes it difficult to tell the characters apart...)
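
    The remark about separate diacritical marks can be illustrated with the standard unicodedata module. This is a sketch with made-up variable names: the same visible text "Bär" has a different length depending on whether the "ä" is stored as one code point or as "a" plus a combining mark. It runs under Python 2.7 and Python 3.

```python
import unicodedata

composed = unicodedata.normalize("NFC", u"B\u00e4r")     # "ä" as one code point
decomposed = unicodedata.normalize("NFD", u"B\u00e4r")   # "a" + combining diaeresis

lengths = (len(composed), len(decomposed))
raw_equal = (composed == decomposed)
# Normalizing both sides to the same form makes the comparison reliable.
normalized_equal = (unicodedata.normalize("NFC", decomposed) == composed)

print(lengths)
print(raw_equal)
print(normalized_equal)
```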

    What I have found working with Unicode is the following (samples):

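    import c4d
    
    def outputUstr(myUstr):
        # Print the string, its length and type, then each character with its code point
        print "String:", myUstr
        print "Length:", len(myUstr), "Type:", type(myUstr)
        for c in myUstr: print c, "(" + str(ord(c)) + ") -",
        print
    
    def main():
        print "---------- Start ----------"
    
        # a: str literal with non-ASCII characters, decoded to unicode
        a = "äöü".decode('utf-8')
        outputUstr(a)
        # b: explicit unicode literal with escaped code points
        b = u"\u0189\u018B\u01F7"
        outputUstr(b)
        # c: multibyte characters pasted from a text editor, decoded to unicode
        c = "ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆".decode('utf-8')
        outputUstr(c)
        # d: object name from the API (op: the active object in the Script Manager)
        d = op.GetName()
        d = d.decode('utf-8')
        outputUstr(d)
        # SetName() accepts a unicode string
        op.SetName(d + (" a∏ß".decode('utf-8')))
        c4d.EventAdd()
    
    if __name__=='__main__':
        main()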
    

    A literal containing Unicode characters needs to be decoded (see variable a).
    If the literal is supposed to contain escaped Unicode characters, then it must be an explicit unicode literal (see variable b). This cannot be decoded, and it must contain only 7-bit ASCII characters other than the escaped ones. Inserting Unicode characters directly results in multibyte codes being inserted as multiple characters (not as the intended encoded character).
    As variable c shows, decoding works even for multibyte characters that have been copied from some text editor.
    Names from the API, as in d, need to be decoded too to yield a unicode string. After that, len, index and slice work fine. You can write that string back as name directly, as SetName() accepts a unicode parameter.
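
    The "decode once at the boundary" rule from these notes could be wrapped in a small helper. This is only a sketch: to_text is a hypothetical name, not part of the c4d module, and the byte string below stands in for a value returned by GetName(). It runs under Python 2.7 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):     # in Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value                     # already unicode, leave untouched

# Stand-in for a UTF-8 encoded name coming back from the API (assumption).
name_from_api = u"B\u00e4r".encode("utf-8")

name = to_text(name_from_api)
renamed = name + u" a\u220f\u00df"   # append " a∏ß" as a unicode string

print(len(name))
print(len(renamed))
```

    Calling to_text on a value that is already unicode returns it unchanged, so it is safe to apply to every string that enters the script.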

    And now I close shop for today...



  • Hi,

    I am kind of confused about what you are trying to do. You can hard-code your Unicode symbols or just set the encoding of the file.

    # -*- coding: utf-8 -*-
    
    string_literals = [
        u"äöüß",
        u"âêôû",
        u"ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
    ]
    
    for literal in string_literals:
        print literal, len(literal)
    

    In both the default Python 2.7.8 interpreter and c4d's interpreter, this will print (the c4d console struggles with some characters):

    äöüß 4
    âêôû 4
    ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆ 15
    

    You also have the option of loading strings from a resource file.

    Cheers
    zipit



  • @zipit said in How to handle C4D Unicode in Python scripting?:

    # -*- coding: utf-8 -*-
    
    string_literals = [
        u"äöüß",
        u"âêôû",
        u"ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
    ]
    
    for literal in string_literals:
        print literal, len(literal)
    

    Well, now I am flabbergasted. I tried the encoding comment before, and it did not work at all. A coding of latin-1 still doesn't, btw. Apparently I must have made a typo then, because all of a sudden it works and gives me the correct strings. (I still have questions about why the non-unicode strings worked before... I guess there are some implementation details I don't see...)

    That seems to solve the issues of literals for now. The issue of names returned from the API remains - these need to be decoded before use, since str isn't working as expected.

    As for what I mean to do - actually nothing. I started writing a course on Python-in-Cinema4D over on Patreon (https://www.patreon.com/cairyn if you bother to look), and as I came to the string chapter, I wanted to check out all the unicode possibilities, as my readers appreciate a thorough overview. Normally I don't write object names in Cyrillic 😁

    So I started out with string literals, string literals in German, string literals in Python unicode, and how all of these are represented in the .py file and in memory. That was when I noticed the weird str behavior, and the same with names I get from the API.

    Even with the literal issue solved, I do wonder why I cannot find anything on the issue on the web. If there is a Russian or Greek programmer who found out that his characters aren't resolving without first decoding the str to unicode, I'm sure they would post something somewhere? Perhaps on Russian or Greek forums I am not privy to ... sigh



  • Hi,

    well, that API object names thing is a flaw of Python 2. So working as expected or not as expected is a bit of a question of point of view. If you got the string passed from any other source the problem would be the same.

    On a more productive note: I think that focusing on Unicode strings isn't really that important for Python stuff in c4d, since object names should be something you largely ignore, as they are an unreliable source of identification and are only rarely important in other contexts.

    PS: I have already seen your python patreon thingy on c4dcafe ;)
    PPS: If you google "python unicode len()" you will find a lot of confused python programmers on StackOverflow ;)

    Cheers
    zipit



  • @zipit said in How to handle C4D Unicode in Python scripting?:

    well, that API object names thing is a flaw of Python 2. So working as expected or not as expected is a bit of a question of point of view. If you got the string passed from any other source the problem would be the same.

    Right. The main thing is to understand the issue, and then to write the chapter in a way that explains what to watch out for. (I do wonder how third-party modules would do with a name string passed to them from a script that reads them from the API... well, another bridge to cross another day.)

    Python 3 clearly is superior in that respect, as there is no unicode class and all str objects are unicode (what they appear to be already in C4D, but with matching len, index, and slice capabilities).

    On a more productive note: I think that focusing on Unicode strings isn't really that important for Python stuff in c4d, since object names should be something you largely ignore, as they are an unreliable source of identification and are only rarely important in other contexts.

    Hmm, I am not sure whether I would agree with that. Good naming is essential to find your way through complex scenes, and a good naming schema can be built in a way that is friendly to string search and comparison criteria, esp. if you can build your own scripts to perform the search and selection. I just point at the _L _R naming schema for joints that is common in C4D's docs.

    Of course, if your objects are all named Cube, Cube.1, Cube.2, Cube.3, then name-based identification may be unhelpful 😋
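
    A name-based selection along the lines of the _L / _R schema might look like the following sketch. The object names are invented, the byte strings stand in for names read from the API, and no c4d calls are used, so this only shows the string handling. It runs under Python 2.7 and Python 3.

```python
# Stand-ins for UTF-8 encoded names as they would come back from the API.
names_from_api = [n.encode("utf-8")
                  for n in (u"Arm_L", u"Arm_R", u"B\u00e4r_L", u"Cube.1")]

# Decode once, then do all matching and slicing on unicode strings.
names = [n.decode("utf-8") for n in names_from_api]

left_side = [n for n in names if n.endswith(u"_L")]
mirrored = [n[:-2] + u"_R" for n in left_side]   # swap the suffix by character

print(len(left_side))
print(len(mirrored))
```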

    Anyway, I am not the person to judge that, as I am only teaching Python to interested users. What they do with it is their own decision; I just have to point out the crucial points so they can apply the code to their own concepts.