Coder Perfect

In Python, how can I remove \xa0 from a string?

Problem

I'm currently parsing an HTML file with Beautiful Soup and calling get_text(), but it appears that I'm getting a lot of \xa0 Unicode characters representing spaces. Is there a quick way to remove all of them and replace them with ordinary spaces in Python 2.7? I suppose a more general question would be: is it possible to remove Unicode formatting?

As suggested in another thread, I tried line = line.replace(u'\xa0', ' '), but that turned the \xa0's into u's, so now I have "u"s everywhere instead.

EDIT: The problem appears to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but calling .encode('utf-8') without replace() causes it to spit out even stranger characters, such as \xc2. Can anyone explain this?

Asked by zhuyxn

Solution #1

\xa0 is a non-breaking space in Latin1 (ISO 8859-1), also written chr(160). You should replace it with an ordinary space.

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), the Unicode string is encoded to UTF-8, in which each Unicode code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by two bytes, \xc2\xa0.
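The replacement and the two-byte encoding can be checked directly. This is a small sketch written for Python 3 (an assumption; under Python 2.7 the string literals would need a u prefix), and the sample text is made up for illustration.

```python
# Sample text containing a non-breaking space (U+00A0); made up for illustration.
text = "Dear Parent,\xa0hello"

# U+00A0 is code point 160, the same character as chr(160).
assert ord("\xa0") == 160 and chr(160) == "\xa0"

# Replace every non-breaking space with an ordinary space.
cleaned = text.replace("\xa0", " ")
print(cleaned)  # Dear Parent, hello

# Encoding to UTF-8 turns the single character U+00A0 into two bytes.
encoded_nbsp = "\xa0".encode("utf-8")
print(encoded_nbsp)  # b'\xc2\xa0'
```

The stray \xc2 in the question is therefore just the first byte of the UTF-8 encoding of the non-breaking space.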

Check out http://docs.python.org/howto/unicode.html for further information.

Please note that this answer is from 2012; Python has moved on since then, and you should be able to use unicodedata.normalize now.

Answered by samwize

Solution #2

Python's unicodedata module has a lot of useful features. The .normalize() function is one of them.

Try:

import unicodedata
new_str = unicodedata.normalize("NFKD", unicode_str)

If you don't get the results you want with NFKD, try one of the other normalization forms (NFC, NFD, NFKC) described in the unicodedata documentation.
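To see how the forms differ for this particular character, the sketch below runs all four over a made-up sample string (Python 3 assumed). Only the compatibility forms, NFKC and NFKD, turn U+00A0 into a plain space. They also rewrite other compatibility characters, so they change more than just the non-breaking space.

```python
import unicodedata

sample = "price:\xa0100"  # hypothetical sample string with a non-breaking space

# Apply each normalization form and keep the results for comparison.
results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, value in results.items():
    print(form, repr(value))
# NFC 'price:\xa0100'
# NFD 'price:\xa0100'
# NFKC 'price: 100'
# NFKD 'price: 100'

# The compatibility forms also rewrite other characters, e.g. the "fi" ligature.
ligature = unicodedata.normalize("NFKD", "\ufb01le")
print(ligature)  # file
```

If only the non-breaking space should change, a plain replace() is the narrower tool.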

Answered by Jamie

Solution #3

To summarize, I did it this way after trying various approaches. Two methods for avoiding or removing \xa0 characters from a parsed HTML string are listed below.

Assume we have the following raw HTML:

raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So, let’s see if we can tidy up this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
# Print the repr so the non-breaking spaces are visible as \xa0
print repr(text_string)
# u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code leaves \xa0 characters in the string. We have two options for removing them properly.

Method #1 (Suggested): The first is BeautifulSoup's get_text method with the strip argument set to True. Note that this removes the \xa0 characters rather than replacing them with spaces. Our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method #2: The other option is to use Python's unicodedata library.

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print clean_text
# NFKD turns each \xa0 into an ordinary space:
# Dear Parent, This is a test message, kindly ignore it. Thanks
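The two methods do not produce the same text: stripping drops the non-breaking spaces entirely, while NFKD normalization turns them into ordinary spaces. The sketch below imitates both on hand-written text nodes rather than calling Beautiful Soup, so it is only an approximation of what get_text(strip=True) does (Python 3 assumed).

```python
import unicodedata

# Text nodes as they would come out of the sample HTML (hand-written here,
# not produced by Beautiful Soup).
text_nodes = [
    "Dear Parent,\xa0",
    "This is a test message,\xa0",
    "kindly ignore it.\xa0",
    "Thanks",
]

# Method #1 analogue: strip each node, then join with no separator.
method_1 = "".join(node.strip() for node in text_nodes)

# Method #2 analogue: join first, then normalize U+00A0 to a plain space.
method_2 = unicodedata.normalize("NFKD", "".join(text_nodes))

print(method_1)  # Dear Parent,This is a test message,kindly ignore it.Thanks
print(method_2)  # Dear Parent, This is a test message, kindly ignore it. Thanks
```

Which result is preferable depends on whether the spacing between the original fragments should survive.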

These approaches are also described in depth on this blog, which you may find useful.

Answered by Ali Raza Bhayani

Solution #4

Try using .strip() at the end of your line. line.strip() worked well for me.
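One caveat worth checking: strip() only touches the two ends of a string. A minimal sketch on a made-up line (Python 3 assumed) shows that interior non-breaking spaces are left in place and still need replace().

```python
line = "\xa0Dear Parent,\xa0thanks\xa0"  # made-up sample line

# strip() removes leading and trailing whitespace, including U+00A0.
stripped = line.strip()
print(repr(stripped))  # 'Dear Parent,\xa0thanks'

# Interior non-breaking spaces still need replace().
fully_cleaned = stripped.replace("\xa0", " ")
print(repr(fully_cleaned))  # 'Dear Parent, thanks'
```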

Answered by user3590113

Solution #5

Try this:

string.replace('\\xa0', ' ')
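Note that '\\xa0' is the four-character text backslash, x, a, 0, not the non-breaking space itself, so this call only helps when the string contains the escape sequence as literal text. The sketch below, using made-up sample strings in Python 3, shows the difference.

```python
escaped = r"Dear Parent,\xa0thanks"  # contains four literal characters: \ x a 0
real = "Dear Parent,\xa0thanks"      # contains one character: U+00A0

print(len(r"\xa0"), len("\xa0"))  # 4 1

# '\\xa0' matches only the literal escape text.
fixed_escaped = escaped.replace("\\xa0", " ")
untouched_real = real.replace("\\xa0", " ")

# '\xa0' matches the actual non-breaking space.
fixed_real = real.replace("\xa0", " ")

print(fixed_escaped)           # Dear Parent, thanks
print(untouched_real == real)  # True
print(fixed_real)              # Dear Parent, thanks
```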

Answered by user278064

Post is based on https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python