Problem
I’m currently parsing an HTML file with Beautiful Soup and calling get_text(), but I’m getting a lot of \xa0 Unicode characters that represent spaces. Is there a quick way to remove all of them and replace them with spaces in Python 2.7? I suppose the more general question is: is it possible to remove Unicode formatting?
As suggested in another thread, I tried line = line.replace(u'\xa0', ' '), but that turned the \xa0’s into u’s, so now I have “u”s everywhere instead. ):
EDIT: str.replace(u'\xa0', ' ').encode('utf-8') appears to have solved the problem, but calling .encode('utf-8') without replace() causes it to spit out even stranger characters, such as \xc2. Can anyone explain this?
Asked by zhuyxn
Solution #1
\xa0 is a non-breaking space in Latin1 (ISO 8859-1), also written chr(160). You should replace it with a space.
string = string.replace(u'\xa0', u' ')
When you call .encode('utf-8'), the Unicode string is encoded to UTF-8, in which each Unicode character is represented by 1 to 4 bytes. In this case, \xa0 is represented by the two bytes \xc2\xa0.
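As a rough illustration of the byte counts described above, here is a small sketch. It assumes Python 3 (where every str is already Unicode), and the helper names utf8_bytes and replace_nbsp are made up for this example:

```python
# Sketch for Python 3, where every str is already Unicode.
NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE, i.e. chr(160)


def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a bytes object."""
    return text.encode("utf-8")


def replace_nbsp(text):
    """Swap every non-breaking space for an ordinary space."""
    return text.replace(NBSP, " ")


sample = "Dear Parent,\xa0Thanks"

print(utf8_bytes(NBSP))                  # b'\xc2\xa0' (two bytes)
print(utf8_bytes(sample))                # b'Dear Parent,\xc2\xa0Thanks'
print(utf8_bytes(replace_nbsp(sample)))  # b'Dear Parent, Thanks'
```

The \xc2 seen in the question is simply the first of the two UTF-8 bytes for the non-breaking space, which is why it disappears once the character is replaced before encoding.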
Check out http://docs.python.org/howto/unicode.html for further information.
Please note that this answer is from 2012; Python has moved on since then, and you should be able to use unicodedata.normalize now.
Answered by samwize
Solution #2
Python’s unicodedata module has a lot of useful features. The .normalize() function is one of them.
Try:
import unicodedata
new_str = unicodedata.normalize("NFKD", unicode_str)
If NFKD doesn’t give you the results you want, try one of the other normalization forms (NFC, NFKC, NFD) described in the unicodedata documentation.
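To see how the normalization forms treat the non-breaking space, the sketch below runs all four on one sample string. It assumes Python 3, and normalize_all_forms is a hypothetical helper written only for this comparison:

```python
import unicodedata

text = "Dear Parent,\xa0Thanks"


def normalize_all_forms(s):
    """Return a dict mapping each normalization form to the normalized string."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, s) for form in forms}


results = normalize_all_forms(text)
for form, value in results.items():
    print(form, repr(value))
# NFC 'Dear Parent,\xa0Thanks'
# NFD 'Dear Parent,\xa0Thanks'
# NFKC 'Dear Parent, Thanks'
# NFKD 'Dear Parent, Thanks'
```

Only the compatibility forms (NFKC and NFKD) map U+00A0 to an ordinary space; the canonical forms (NFC and NFD) leave it untouched. Note that the compatibility forms also rewrite other characters, such as ligatures, so they change more than just spaces.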
Answered by Jamie
Solution #3
To summarize, I did it this way after trying various approaches. Two methods for avoiding/removing \xa0 characters from a parsed HTML string are listed below.
Assume we have the following raw HTML, in which each space after a comma or full stop is a non-breaking space (&nbsp;):
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
So, let’s see if we can tidy up this HTML string:
from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
# Print the repr so the non-breaking spaces are visible
print repr(text_string)
# u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'
The above code leaves \xa0 characters in the string. There are two ways to remove them properly.
Method #1 (recommended): The first is BeautifulSoup’s get_text method with the strip argument set to True. Our code becomes:
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks
Method #2: The other option is to use Python’s unicodedata library.
import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
# NFKD maps each non-breaking space to an ordinary space
clean_text = unicodedata.normalize("NFKD", text_string)
print clean_text
# Dear Parent, This is a test message, kindly ignore it. Thanks
These approaches are also described in depth on this blog, which you may find useful.
Answered by Ali Raza Bhayani
Solution #4
Try calling .strip() at the end of your line. line.strip() worked well for me.
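One caveat is worth illustrating: strip() only trims the ends of a string. The sketch below assumes Python 3 (in Python 2.7 this applies to unicode strings, and a byte string may behave differently), and strip_edges is a hypothetical wrapper used only for this example:

```python
def strip_edges(line):
    """Remove leading and trailing whitespace, which includes U+00A0."""
    return line.strip()


padded = "\xa0Dear Parent,\xa0Thanks\xa0\n"

# The non-breaking spaces at both ends are removed, the interior one stays.
print(repr(strip_edges(padded)))  # 'Dear Parent,\xa0Thanks'
```

So strip() is enough when the \xa0 characters sit at the start or end of a line, but a replace() or a normalization step is still needed for any that appear in the middle.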
Answered by user3590113
Solution #5
Try this:
# Replaces the literal text \xa0 (a backslash followed by xa0), not the U+00A0 character
string = string.replace('\\xa0', ' ')
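The doubled backslash matters here. As a hedged sketch (Python 3 assumed, with replace_literal as a made-up helper name), the following compares a string that contains the four literal characters \xa0 with one that contains a real non-breaking space:

```python
def replace_literal(s):
    """Replace the literal text backslash-x-a-0 with a space."""
    return s.replace("\\xa0", " ")


# Four separate characters: backslash, 'x', 'a', '0'
escaped_text = "Dear Parent,\\xa0Thanks"
# One character: U+00A0 NO-BREAK SPACE
real_nbsp = "Dear Parent,\xa0Thanks"

print(repr(replace_literal(escaped_text)))  # 'Dear Parent, Thanks'
print(repr(replace_literal(real_nbsp)))     # 'Dear Parent,\xa0Thanks' (unchanged)
```

This form therefore helps only when the text itself contains the escape sequence written out, for example after a repr() has been stored as a string. For an actual non-breaking space, use a single backslash as in Solution #1.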
Answered by user278064
Post is based on https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python