I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities that Beautiful Soup 3 does not automatically decode for me:
    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup("<p>&pound;682m</p>")
    >>> text = soup.find("p").string
    >>> print text
    &pound;682m
How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?
Asked by jkp
On Python 3.4+, use html.unescape():
    import html
    print(html.unescape('&pound;682m'))
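As a quick illustration of what html.unescape() covers, the sketch below decodes a named, a decimal, and a hexadecimal reference for the same character. The sample strings and variable names are made up for this example; only html.unescape() itself comes from the answer.

```python
import html

# Illustrative inputs (not from the original question): three ways
# of writing the pound sign as an HTML character reference.
samples = {
    "named": "&pound;682m",
    "decimal": "&#163;682m",
    "hex": "&#xA3;682m",
}

decoded = {kind: html.unescape(text) for kind, text in samples.items()}

for kind, text in decoded.items():
    print(kind, "->", text)

# Text without entities passes through unchanged.
plain = html.unescape("no entities here")
```

All three forms should come out as the same string, which makes this a convenient check when the input mixes entity styles.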
Note that html.parser.HTMLParser().unescape() is deprecated. It was scheduled for removal in Python 3.5 but was left in by mistake, and it was eventually removed in Python 3.9.
On older versions (Python 2.6 to 3.3), you can use HTMLParser.unescape() from the standard library:
    >>> try:
    ...     # Python 2.6-2.7
    ...     from HTMLParser import HTMLParser
    ... except ImportError:
    ...     # Python 3
    ...     from html.parser import HTMLParser
    ...
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m
You can also use the six compatibility library to simplify the import:
    >>> from six.moves.html_parser import HTMLParser
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m
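If the same code has to run on both old and new interpreters, the two approaches can be combined. This is only a sketch: decode_entities is a hypothetical helper name, and the fallback order (html.unescape first, then the parser method) is an assumption about what is most convenient, not something the answer prescribes.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    # Older interpreters: fall back to the parser method
    try:
        from HTMLParser import HTMLParser  # Python 2.6-2.7
    except ImportError:
        from html.parser import HTMLParser  # Python 3.0-3.3
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Return text with HTML entities replaced by their characters.

    Hypothetical wrapper so callers do not depend on which
    implementation was imported above.
    """
    return unescape(text)


result = decode_entities("&pound;682m &amp; rising")
print(result)
```

Trying html.unescape first means the deprecated parser method is only touched on interpreters that have nothing else.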
Answered by luc
Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the old docs). In Beautiful Soup 4, entities are decoded automatically.
For Beautiful Soup 3:
    >>> from BeautifulSoup import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>",
    ...               convertEntities=BeautifulSoup.HTML_ENTITIES)
    <p>£682m</p>
For Beautiful Soup 4:
    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>")
    <html><body><p>£682m</p></body></html>
Answered by Ben James
You can use replace_entities from the w3lib.html library:
    In [1]: from w3lib.html import replace_entities
    In [2]: replace_entities("&pound;682m")
    Out[2]: u'\xa3682m'
    In [3]: print replace_entities("&pound;682m")
    £682m
Answered by Corvax
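Beautiful Soup 4 lets you set a formatter for your output. With formatter=None, Beautiful Soup does not modify strings at all on output, so decoded characters are written as-is (which can produce invalid HTML, as in these examples):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>")
    print(soup.prettify(formatter=None))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit <<Sacré bleu!>>
    #   </p>
    #  </body>
    # </html>
    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
    print(link_soup.a.encode(formatter=None))
    # <a href="http://example.com/?foo=val1&bar=val2">A link</a>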
Answered by LoicUV
I had a similar encoding issue and solved it with unicodedata.normalize(). I was getting a Unicode error when using the pandas to_html() method to export my data frame to an .html file in another directory. This is what I ended up doing, and it worked:
Create the data frame object. You can name it whatever you like; here it is called table:
    import pandas as pd
    # data is assumed to be defined earlier (rows of name, team, rating)
    table = pd.DataFrame(data, columns=['Name', 'Team', 'OVR / POT'])
    table.index += 1
Encode the table data so that it can be exported to an .html file in the templates folder (this can be any location you like):
    import unicodedata
    # this is where the magic happens
    html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')
Write the normalized string to the HTML file:
    # html_data is a byte string, so open the file in binary mode
    with open("templates/home.html", "wb") as f:
        f.write(html_data)
Reference: unicodedata documentation
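To see what the normalize-then-encode step actually does to the text, here is a small standard-library sketch without pandas. The sample string is an assumption chosen for illustration; the two calls are the same ones used in the answer.

```python
import unicodedata

# Illustrative input: "Sacré bleu! £682m"
text = "Sacr\u00e9 bleu! \u00a3682m"

# NFKD splits "é" into "e" plus a combining accent (U+0301).
decomposed = unicodedata.normalize("NFKD", text)

# Encoding to ASCII with errors="ignore" drops every non-ASCII code point:
# the combining accent disappears, and so does the pound sign.
ascii_bytes = decomposed.encode("ascii", "ignore")

print(ascii_bytes)
```

Accented letters fall back to their base letter, but characters with no ASCII decomposition, such as the pound sign, are silently removed. If those characters need to survive, encoding as UTF-8 instead is lossless.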
Answered by Alex
Post is based on https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string