[python] How to make Django slugify work properly with Unicode strings?


Answers

The Mozilla website team has been working on an implementation: https://github.com/mozilla/unicode-slugify. Sample code is at http://davedash.com/2011/03/24/how-we-slug-at-mozilla/

Question

What can I do to prevent slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2)

cnprog.com has Chinese characters in question URLs, so I looked at their code. They are not using slugify in templates; instead, they call this method on the Question model to get permalinks:

def get_absolute_url(self):
    return '%s%s' % (reverse('question', args=[self.id]), self.title)

Are they slugifying the URLs or not?




I'm afraid Django's definition of a slug means ASCII, though the Django docs don't explicitly state this. This is the source of slugify in defaultfilters. You can see that the value is converted to ASCII, with the 'ignore' option silently dropping any characters that can't be encoded:

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
return mark_safe(re.sub('[-\s]+', '-', value))

Based on that, I'd guess that cnprog.com is not using the official slugify function. You may wish to adapt the Django snippet above if you want different behaviour.
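As a rough idea of what such an adaptation could look like, here is a minimal sketch. It assumes Python 3 (where \w matches Unicode word characters for str patterns), and the name unicode_slugify is made up for illustration; it is not part of Django. The only real change from the snippet above is that the encode('ascii', 'ignore') step is gone.

```python
import re
import unicodedata


def unicode_slugify(value):
    """Hypothetical helper (not a Django API): slugify but keep non-ASCII letters."""
    # NFKC composes characters instead of decomposing them, so an accented
    # letter stays a single code point that \w can match.
    value = unicodedata.normalize('NFKC', value)
    # Drop everything that is not a word character, whitespace, or a hyphen.
    # For str patterns in Python 3, \w is Unicode-aware by default.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)
```

Called on a title containing Chinese characters, this should keep them and only replace the whitespace with hyphens. Note that underscores survive too, since \w includes them.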

Having said that, the RFC for URLs (RFC 1738) does state that non-US-ASCII characters (or, more specifically, anything other than the alphanumerics and $-_.+!*'(),) should be encoded using the %hex notation. If you look at the actual raw GET request that your browser sends (say, using Firebug), you'll see that the Chinese characters are in fact encoded before being sent; the browser just makes them look pretty in the address bar. I suspect this is why slugify insists on ASCII only, fwiw.
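To see the %hex encoding without a browser, something like the following works. It is a small sketch using Python 3's standard urllib.parse; the path is an invented example, not an actual cnprog.com URL.

```python
from urllib.parse import quote, unquote

# Invented example path containing two Chinese characters.
path = '/question/1/\u4f60\u597d'

# quote() encodes the string as UTF-8 and writes each non-safe byte as %XX.
# The slash is left alone because it is in the default `safe` set.
encoded = quote(path)
print(encoded)

# unquote() reverses it, which is roughly what the browser does for display.
decoded = unquote(encoded)
```

Each Chinese character becomes three %XX groups here, because each one takes three bytes in UTF-8.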






