Coder Perfect

Is it possible to decode HTML entities in a Python string?


I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities that Beautiful Soup 3 doesn't decode automatically:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get “£682m” instead of “&pound;682m”?

Asked by jkp

Solution #1

Python 3.4+: use html.unescape():

import html
print(html.unescape('&pound;682m'))

Note that html.parser.HTMLParser().unescape() is deprecated. It was scheduled for removal in Python 3.5 but was left in by mistake, and it was finally removed in Python 3.9.
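
As a minimal sketch of what html.unescape() covers, the snippet below decodes a named entity plus decimal and hexadecimal numeric references, and shows html.escape() as the inverse for markup-significant characters. The sample strings are illustrative and are not taken from the original answer.

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
named = html.unescape("&pound;682m")
decimal = html.unescape("&#163;682m")
hexadecimal = html.unescape("&#xA3;682m")
print(named, decimal, hexadecimal)

# html.escape() goes the other way, but only for &, <, > and (by default) quotes.
escaped = html.escape("<p>Fish & chips</p>")
print(escaped)
print(html.unescape(escaped))
```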

Python 2.6-3.3: HTMLParser.unescape() is available in the standard library:

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3 (unescape() was removed in 3.9)
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
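
If the same code has to run on both old and new interpreters, one option is to hide the version differences behind a single function. The sketch below is an assumption about how that could be arranged, not part of the original answer; the name unescape_entities is hypothetical.

```python
try:
    # Python 3.4+: the module-level function is the supported API.
    from html import unescape as _unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    _unescape = HTMLParser().unescape


def unescape_entities(text):
    """Decode HTML entities in text (hypothetical helper name)."""
    return _unescape(text)


print(unescape_entities("&pound;682m"))
```

Trying html.unescape first means the deprecated method is only used on interpreters that have nothing better.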

Answered by luc

Solution #2

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the ‘Entity Conversion’ section of the archived docs). In Beautiful Soup 4, entities are decoded automatically.

Beautiful Soup 3:

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>",
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4 (the parser is named explicitly here to avoid the "no parser specified" warning):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", "html.parser")
<p>£682m</p>

Answered by Ben James

Solution #3

You can use replace_entities from the w3lib.html library.

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

Answered by Corvax

Solution #4

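Beautiful Soup 4 lets you pass a formatter when producing output. With formatter=None, Beautiful Soup does not modify strings at all on output, so entities that were decoded during parsing are written out as plain characters. (This can produce invalid HTML, since characters such as < are no longer escaped.)

from bs4 import BeautifulSoup

# The lxml parser is assumed here; it adds the <html>/<body> wrapper shown below.
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "lxml")
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="">A link</a>', "lxml")
print(link_soup.a.decode(formatter=None))
# <a href="">A link</a>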

Answered by LoicUV

Solution #5

I had a similar encoding issue and solved it with unicodedata.normalize(). I was getting a Unicode error when using the pandas .to_html() method to export my data frame to an .html file in another directory. I ended up doing this, and it worked:

    import unicodedata 

You can name the DataFrame object whatever you like; here it is called table:

    table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
    table.index+= 1

Encode the table data so that it can be exported to an .html file in the templates folder (this can be any location you like):

    # this is where the magic happens

Export the normalized string to an HTML file:

    file = open("templates/home.html","w") 
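
The normalization step described above could be sketched as follows for Python 3. This is an assumption about the approach, not the author's exact code: the helper name to_ascii_html and the sample markup are hypothetical, and the function accepts any HTML string, such as the result of table.to_html().

```python
import unicodedata


def to_ascii_html(html_text):
    """Return html_text with non-ASCII characters decomposed or dropped."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", html_text)
    # Dropping everything outside ASCII removes the combining marks, along
    # with any character that has no ASCII decomposition (for example "£").
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "<td>Jos\u00e9 Mu\u00f1oz</td>"
print(to_ascii_html(sample))  # <td>Jose Munoz</td>
```

This conversion is lossy. Where the original characters matter, opening the output file with an explicit encoding, for example open("templates/home.html", "w", encoding="utf-8"), avoids the Unicode error without discarding anything.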



Reference: unicodedata documentation

Answered by Alex
