Problem
I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities that Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m
How can I decode the HTML entities in text to get “£682m” instead of “&pound;682m”?
Asked by jkp
Solution #1
On Python 3.4+, use html.unescape():
import html
print(html.unescape('&pound;682m'))
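As a quick check, the sketch below runs html.unescape() over a few entity forms (named, decimal and hexadecimal). The sample strings are illustrative and not part of the original answer.

```python
import html

# Illustrative inputs: a named reference, a decimal reference,
# a hexadecimal reference, and an escaped ampersand.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m", "AT&amp;T"]

decoded = [html.unescape(s) for s in samples]
print(decoded)
```

All three spellings of the pound sign decode to the same character, so the call works regardless of how the entity was written in the source HTML.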
Note that html.parser.HTMLParser.unescape is deprecated. It was scheduled for removal in Python 3.5 but was left in by mistake, and it was finally removed in Python 3.9.
On Python 2.6 to 3.3, you can use HTMLParser.unescape() from the standard library:
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
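The version-specific options above can be folded into one helper. This is only a sketch: the name unescape_entities is hypothetical, and it simply prefers html.unescape() when that function exists, falling back to HTMLParser.unescape() on older interpreters.

```python
def unescape_entities(text):
    """Decode HTML entities using whichever stdlib API is available.

    The function name is hypothetical; it is not part of any library.
    """
    try:
        from html import unescape  # Python 3.4+
    except ImportError:
        try:
            from html.parser import HTMLParser  # Python 3.0-3.3
        except ImportError:
            from HTMLParser import HTMLParser  # Python 2.6-2.7
        unescape = HTMLParser().unescape
    return unescape(text)


print(unescape_entities("&pound;682m"))
```

Trying html.unescape() first matters because HTMLParser.unescape() no longer exists on recent Python 3 releases.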
Answered by luc
Solution #2
Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the ‘Entity Conversion’ section of the old docs). In Beautiful Soup 4, entities are decoded automatically.
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>",
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>
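Applied to the snippet from the question, the Beautiful Soup 4 behaviour can be sketched as follows. It assumes the third-party beautifulsoup4 package is installed, and it names the built-in "html.parser" backend explicitly, which the original answer did not do.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Naming the parser is an assumption made here to keep the result
# independent of whether lxml happens to be installed.
soup = BeautifulSoup("<p>&pound;682m</p>", "html.parser")

# .string already holds the decoded text; no extra unescape step is needed.
text = soup.find("p").string
print(text)
```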
Answered by Ben James
Solution #3
You can use replace_entities from the w3lib.html package.
In [202]: from w3lib.html import replace_entities
In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'
In [204]: print replace_entities("&pound;682m")
£682m
Answered by Corvax
Solution #4
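You can apply a formatter to your output in Beautiful Soup 4. With formatter=None, Beautiful Soup does not re-escape strings when writing them out.
from bs4 import BeautifulSoup
# sample input chosen to match the output shown below
soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>")
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>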
Answered by LoicUV
Solution #5
I had a similar encoding problem and solved it with unicodedata.normalize(). I was getting a Unicode error when using the pandas .to_html() method to export my data frame to an .html file in another directory. I ended up doing this and it worked…
import unicodedata
import pandas as pd
You can call the dataframe object whatever you like, but let’s call it table…
table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
table.index+= 1
Encode the table data so that it can be exported to an .html file in the templates folder (this can be any location you like):
#this is where the magic happens
html_data=unicodedata.normalize('NFKD',table.to_html()).encode('ascii','ignore')
Write the normalized string to the HTML file:
# html_data is a byte string, so open the file in binary mode
with open("templates/home.html", "wb") as f:
    f.write(html_data)
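To see what the normalize-then-encode step does to the text, here is a small sketch on an illustrative string (not data from the original answer). Note that the approach is lossy: accented letters are reduced to their base letter, and characters with no ASCII decomposition, such as the pound sign, are dropped entirely.

```python
import unicodedata

# Illustrative input containing an accented letter and a pound sign.
sample = "Sacr\u00e9 bleu \u00a3682m"

# NFKD splits the accented e into "e" plus a combining accent;
# encoding to ASCII with errors="ignore" then discards every non-ASCII character.
ascii_bytes = unicodedata.normalize("NFKD", sample).encode("ascii", "ignore")
print(ascii_bytes)
```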
Reference: unicodedata documentation
Answered by Alex
Post is based on https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string