Solved escape unicode characters in filepath

Hi there,

I'm currently having issues converting unicode characters on Windows...

I'm utilizing the AcceptDragObject in a c4d.gui.TreeViewFunctions.

I want to act on a given folder that is dragged into my treeview.
When dragtype == c4d.DRAGTYPE_FILENAME_OTHER, dragobject gives me the filepath directly.

For example I'm having a folder called test_with_äüö on my Desktop.

My main issue is that I need to work with the incoming variable, dragobject, and I can't convert its value the way I can convert a literal!

Here's a simple test-script to play around with:

import c4d
import os

def _direct_conversion():
    return r"C:\Users\lasse\Desktop\test_with_äüö".decode('utf-8')

def _from_variable():
    dragobject = "C:\Users\lasse\Desktop\test_with_äüö"
    s = r"%s" % (dragobject)
    s = s.decode('utf-8')
    return s

def main():
    s = _direct_conversion()
    s = _from_variable()
    
    print "os.path.isdir:", os.path.isdir(s)

if __name__=='__main__':
    main()

The bad thing is that the _direct_conversion() gives me the correct result while _from_variable() does not.

I'm probably totally overthinking this, but any ideas how to solve that "simple" problem are welcome!!! :)

Cheers,
Lasse

I'd say you need to find out what format dragobject is actually in when you get it from the function, and whether os.path.isdir accepts that format.

In your example, first you are using a literal that contains a \t symbol which inserts a tab into your string.
Then in s = r"%s" % (dragobject) you are using a raw string as format source, but that doesn't influence dragobject at all: r only affects the string before formatting, the actual evaluation of % happens in the next step and leaves dragobject unaffected (including the tab).
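To make the point about literals concrete, here is a minimal sketch that runs under both Python 2.7 and Python 3. The path and the has_tab helper are hypothetical and ASCII-only, so the escape handling stays separate from the encoding question.

```python
# Escape sequences are resolved when a *literal* is parsed, not when a
# string value is formatted or passed around.
cooked = "K:\tklm"    # "\t" becomes a single tab character
raw = r"K:\tklm"      # backslash and "t" stay two separate characters

def has_tab(path):
    """Return True if the path contains a tab character."""
    return "\t" in path

# Formatting through a raw format string does not change the value:
# the r prefix only applies to the "%s" literal itself.
reformatted = r"%s" % cooked

print(len(cooked))            # 6
print(len(raw))               # 7
print(has_tab(reformatted))   # True
print(reformatted == cooked)  # True
```

A string that arrives from a function call never goes through literal parsing, so its backslashes are already plain characters.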

In the utf-8 decoding, you get the original string back in unicode: C4D stores its script files in utf-8 already, so any literal containing unicode characters will be used as-is.

On my system, the only issue isdir has is with the tab character:

import c4d, os
from c4d import gui

def check(s):
    print s, len(s), type(s)
    s = r"%s" % (s) # no effect
    print s, len(s), type(s)
    s = s.decode('utf-8')
    print s, len(s), type(s)
    print "os.path.isdir:", os.path.isdir(s)

def main():
    dragobject = "K:\klm"
    print "Path: klm:"
    check(dragobject)

    dragobject = "K:\klmäöü"
    print "Path: klmäöü:"
    check(dragobject)

    dragobject = "K:\tklmäöü"
    print "Path: tklmäöü:"
    check(dragobject)

    dragobject = r"K:\tklmäöü"
    print "Path: raw tklmäöü:"
    check(dragobject)

# Execute main()
if __name__=='__main__':
    main()

results in

Path: klm:
K:\klm 6 <type 'str'>
K:\klm 6 <type 'str'>
K:\klm 6 <type 'unicode'>
os.path.isdir: True
Path: klmäöü:
K:\klmäöü 12 <type 'str'>
K:\klmäöü 12 <type 'str'>
K:\klmäöü 9 <type 'unicode'>
os.path.isdir: True
Path: tklmäöü:
K:	klmäöü 12 <type 'str'>
K:	klmäöü 12 <type 'str'>
K:	klmäöü 9 <type 'unicode'>
os.path.isdir: False
Path: raw tklmäöü:
K:\tklmäöü 13 <type 'str'>
K:\tklmäöü 13 <type 'str'>
K:\tklmäöü 10 <type 'unicode'>
os.path.isdir: True

(all directories used here actually exist)

However, I am aware that this is not the root problem, as you do not construct dragobject through a literal when getting it from a function. I would suggest you check type and bytewise encoding of the value you receive, and adapt the decoding accordingly.
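One way to do that check is sketched below. inspect_value is a hypothetical helper, not part of the c4d API, and the sample value only stands in for what AcceptDragObject might hand over (here assumed to be UTF-8 bytes). It runs under Python 2.7 and Python 3; note that the reported type name differs between the two (a byte string is 'str' in 2.7 and 'bytes' in 3).

```python
import binascii

def inspect_value(value):
    """Return (type name, length, hex dump) for a received path value.

    The hex dump is only produced for byte strings (Python 2 str,
    Python 3 bytes); for text strings it is None.
    """
    type_name = type(value).__name__
    if isinstance(value, bytes):
        hex_dump = binascii.hexlify(value).decode("ascii")
    else:
        hex_dump = None
    return type_name, len(value), hex_dump

# Hypothetical stand-in for the received value: the UTF-8 bytes of the
# three umlauts a-umlaut, u-umlaut, o-umlaut.
sample = u"\xe4\xfc\xf6".encode("utf-8")
print(inspect_value(sample)[2])  # c3a4c3bcc3b6
```

If the dump shows two bytes per umlaut starting with c3, the value is UTF-8 encoded and .decode('utf-8') is the matching conversion.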

Yeah, the problem lies in the escaping characters...

Sadly the function only returns simple backward slashes C:\Users\lasse\Desktop\test_with_äüö so \t will become a tab character.

There wouldn't be a problem if the returned path would be with two backward slashes \\ or even forward slashes / ... That might be worth a bug report!?

Thankfully someone on stackoverflow had the same issue and came up with a function to convert this "bad" path...

backslash_map = { '\a': r'\a', '\b': r'\b', '\f': r'\f',
                  '\n': r'\n', '\r': r'\r', '\t': r'\t', '\v': r'\v' }
def reconstruct_broken_string(s):
    for key, value in backslash_map.items():
        s = s.replace(key, value)
    return s

So in my example I can do:

    dragobject = "C:\Users\lasse\Desktop\test_with_äüö"
    s = reconstruct_broken_string(dragobject).decode('utf-8')
    print os.path.isdir(s) # returns True

That is overly complicated and somewhat convoluted, but I haven't found any other way.

Cheers,
Lasse

@lasselauch said in escape unicode characters in filepath:

Yeah, the problem lies in the escaping characters...

Sadly the function only returns simple backward slashes C:\Users\lasse\Desktop\test_with_äüö so \t will become a tab character.

There wouldn't be a problem if the returned path would be with two backward slashes \\ or even forward slashes / ... That might be worth a bug report!?

I'm actually not sure what the problem is, then. If the function returns the string "as is" with backslashes and no conversion, then this is practically the raw string. Converting it to Unicode should not interpret the backslashes - it's factually the same as in your original function _direct_conversion() which works fine with isdir()?

The problem in your sample code only happens because you use a literal to create the string. But that is not what you do with the treeview, you are receiving the string from AcceptDragObject, and you say above that what you get is the raw string.

So, I'm a bit at a loss what's actually not working here. I suppose you may need some conversion to display the string, but you have not mentioned that yet.

The path from AcceptDragObject's dragobject is NOT working with os.path.isdir(). That is on Windows when using unicode characters e.g. äöü in your filepath... That's my main problem here...

Okay, it seems I just need to use dragobject = dragobject.decode('utf-8') then.
Sorry for all the confusion and thanks for the help @Cairyn !

Cheers,
Lasse

Hi @lasselauch, as a rule of thumb with Python 2.7, always store data as a Unicode string.

Control your I/O: for each input, make sure you know the encoding, and always call .decode('utf-8') so that you are sure to store a Unicode value and get an error at load time if something went wrong.
Then output the content according to your needs: some consumers need ASCII, some can work with Unicode. Do the conversion on the fly and don't touch your stored Unicode data.
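A minimal sketch of that rule, assuming the incoming value is UTF-8 encoded. The names to_text and to_bytes are hypothetical helpers, and the sample bytes only imitate a dragged path. The code runs under Python 2.7 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings at the input boundary; pass text through."""
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError if the bytes are not valid for `encoding`,
        # so a wrong assumption shows up at load time.
        return value.decode(encoding)
    return value

def to_bytes(text, encoding="utf-8"):
    """Encode on the way out; the stored text itself is not modified."""
    return text.encode(encoding)

# Hypothetical input: UTF-8 bytes of a folder name with three umlauts.
incoming = u"test_with_\xe4\xfc\xf6".encode("utf-8")
stored = to_text(incoming)

print(len(incoming))                  # 16
print(len(stored))                    # 13
print(to_bytes(stored) == incoming)   # True
```

Checking the type before decoding also avoids decoding a value twice, which in Python 2.7 would trigger an implicit ASCII encode and fail on non-ASCII characters.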

For francophone readers, there is this fabulous article about Unicode encoding in Python. For English, I guess the best I have found on this topic is A Guide to Unicode, UTF-8, and Strings in Python.

Cheers,
Maxime.