Sunday, April 3, 2011

Replace html entities with the corresponding utf-8 characters in Python 2.6

I have a html text like this:

<xml ... >

and I want to convert it to something readable:

<xml ...>

Any easy (and fast) way to do it in Python?

From stackoverflow
  • http://docs.python.org/library/htmlparser.html

    >>> import HTMLParser
    >>> pars = HTMLParser.HTMLParser()
    >>> pars.unescape('&copy; &euro;')
    u'\xa9 \u20ac'
    >>> print _
    © €
    
    webjunkie : -1 because: "Deprecated since version 2.6"
    vartec : webjunkie: corrected the link.
    Alexandru : unescape is just an internal function of HTMLParser (and it's not documented in your link). however I could use the implementation. 10x a lot
    vartec : @brtzsnr: true, that it's undocumented. Don't think that it's internal though, after all name is unescape not _unescape or __unescape.
  • There is a function here that does it, as linked from the post Fred pointed out. Copied here to make things easier.

    Credit to Fred Larson for linking to the other question on SO. Credit to dF for posting the link.

0 comments:

Post a Comment