Coder Perfect

Decode HTML entities in Python string?

Problem

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities that Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get “£682m” instead of “&pound;682m”?

Asked by jkp

Solution #1

On Python 3.4+, use html.unescape():

import html
print(html.unescape('&pound;682m'))
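
As a small sketch of what html.unescape() handles, the snippet below feeds it the same character written as a named, a decimal, and a hexadecimal reference. The sample strings are made up for illustration; only html.unescape() itself comes from the answer.

```python
import html

# The pound sign written three ways: named, decimal, and hexadecimal references
samples = ["&pound;682m", "&#163;682m", "&#xa3;682m"]
decoded = [html.unescape(s) for s in samples]
print(decoded)  # all three decode to the same string

# Decoding is a single pass: a double-escaped entity is unescaped only once
once = html.unescape("&amp;pound;682m")
print(once)  # &pound;682m
```

If the input may have been escaped twice, calling html.unescape() a second time on the result decodes the remaining entity.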

FYI: html.parser.HTMLParser().unescape() is deprecated. It was scheduled for removal in Python 3.5 but was left in by mistake, and it was finally removed in Python 3.9.

On Python 2.6 through 3.3, you can use HTMLParser.unescape() from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3 (HTMLParser.unescape() was removed in 3.9)
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
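
The options in this answer can be combined into one import that picks whichever function the running interpreter provides. This is only a sketch of one way to arrange the fallbacks, not something from the original answer; the module-level name unescape is chosen here for illustration.

```python
try:
    # Python 3.4+: the supported function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # On these older versions the method still exists on the parser object
    unescape = HTMLParser().unescape

print(unescape("&pound;682m"))
```

Trying html.unescape first means the deprecated method is only touched on interpreters that have nothing else.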

Answered by luc

Solution #2

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the ‘Entity Conversion’ section of the old docs). In Beautiful Soup 4, entities are decoded automatically.

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>
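
If Beautiful Soup is not installed, a similar automatic decoding can be sketched with the standard library's html.parser.HTMLParser: with convert_charrefs=True (the default since Python 3.5), character references in text are decoded before handle_data() is called. The TextCollector class below is a hypothetical helper written for this illustration, and it collects all text rather than only the contents of the p tag.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found between tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # data arrives with entities already decoded
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<p>&pound;682m</p>")
collector.close()  # flush any text still buffered by the parser

text = "".join(collector.chunks)
print(text)
```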

Answered by Ben James

Solution #3

You can use replace_entities from the w3lib.html package.

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

Answered by Corvax

Solution #4

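Beautiful Soup 4 lets you pass a formatter when generating output. With formatter=None, strings are written out as-is instead of being re-escaped as entities.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>")
print(soup.prettify(formatter=None))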
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

Answered by LoicUV

Solution #5

I had a similar encoding problem and solved it with unicodedata.normalize(). I was getting a Unicode error when using the pandas DataFrame.to_html() method to export my data frame to an .html file in another directory. This is what I ended up doing, and it worked…

    import unicodedata
    import pandas as pd

You can call the dataframe object whatever you like, but let’s call it table…

    table = pd.DataFrame(data, columns=['Name', 'Team', 'OVR / POT'])
    table.index += 1

Encode the table data so that it can be exported to an .html file in the templates folder (this can be anywhere you want):

    # this is where the magic happens
    html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')

Export the normalized string to an HTML file:

    # html_data is a byte string after .encode(), so open the file in binary mode
    with open("templates/home.html", "wb") as f:
        f.write(html_data)
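
Note that normalizing to NFKD and encoding with 'ignore' is lossy. The sketch below, using a made-up sample string, shows what gets dropped, and compares it with the 'xmlcharrefreplace' error handler, which is not part of the original answer but keeps non-ASCII characters as numeric HTML entities.

```python
import unicodedata

sample = "Sacr\u00e9 bleu \u00a3682m"  # "Sacré bleu £682m"

# NFKD splits "é" into "e" plus a combining accent; encoding to ASCII with
# errors="ignore" then drops the accent and any character with no ASCII form.
lossy = unicodedata.normalize("NFKD", sample).encode("ascii", "ignore")

# xmlcharrefreplace swaps each non-ASCII character for a numeric entity instead
lossless = sample.encode("ascii", "xmlcharrefreplace")

print(lossy)     # b'Sacre bleu 682m'
print(lossless)  # b'Sacr&#233; bleu &#163;682m'
```

When the output is an HTML file, the second form renders the original characters in a browser, while the first silently loses the pound sign.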

Reference: unicodedata documentation

Answered by Alex

Post is based on https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string