python - Removing newlines (\n) with BeautifulSoup
I'm parsing an HTML page with bs4:

    import re
    import codecs
    import MySQLdb
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("sprt.htm"), from_encoding='utf-8')
    sprt = [[0 for x in range(3)] for x in range(300)]
    i = 0
    for para in soup.find_all('p'):
        if para.strong is not None:
            sprt[i][0] = para.strong.get_text()
            sprt[i][1] = para.get_text()
            sprt[i][1] = re.sub(re.escape(sprt[i][0]), "", sprt[i][1], re.UNICODE)
            sprt[i][2] = sprt[i][1]
            sprt[i][2] = re.sub(r".+[\.\?][\s\s\n]", "", sprt[i][1], re.S)
            sprt[i][2] = re.sub(r".+panel", "panel", sprt[i][2], re.S)
            sprt[i][1] = re.sub(re.escape(sprt[i][2]), "", sprt[i][1])
            i += 1
    x = 0

The page I'm parsing is filled with paragraphs like these 3:

    <p><strong>name name. </strong>the visual politics of play: on signifying practices of digital games. panel proposal (2p)</p>
    <p><strong>name name , name name. </strong>pain, art , communication. panel proposal (2p)</p>
    <p><strong>name name, name name , name name. </strong>waves of technology: hidden ideologies of cognitive neuroscience , future production of iconic. panel proposal (2p)</p>

The parsing works until the last paragraph:

    <p><strong>name name, name name , name name. </strong>waves of technology: hidden ideologies of cognitive neuroscience , future production of iconic. panel proposal (2p)</p>

What I find in the last slot of the array is this:

    [u'name name, name name\xa0and name name.\xa0', u'waves\n of technology: hidden ideologies of cognitive neuroscience , \nfuture production of iconic.\xa0panel proposal (2p)', u'waves\n of technology: hidden ideologies of cognitive neuroscience , \nfuture production of iconic.\xa0panel proposal (2p)']

There are 2 newlines (\n) that appear in weird places (after "waves" and before "future"). They always appear in the same position, not randomly. I thought it was due to the length of the paragraph, but there are longer paragraphs where no \n appears.
I tried to remove them with:

    sprt[i][2] = re.sub("\n", "", sprt[i][1], re.U, re.S)

but it didn't work.

Are the newlines there because I made a mistake somewhere? Is there a way to remove them?
I suspect the newline appears in your source HTML file. I tried to reproduce the error using your paragraphs, and didn't get a \n until I inserted a new line in the source file. That would explain why it doesn't happen with the other, longer paragraphs: they don't have an actual newline in the HTML source file.
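
To make that reproduction concrete, here is a minimal sketch. It assumes Python 3 and swaps in the standard library's html.parser for bs4 so that it runs without third-party packages; the TextCollector class and extract_text helper are hypothetical names made up for this illustration, not part of bs4. The point it shows is only that a literal line break inside a <p> in the source survives into the extracted text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes, roughly like get_text()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Text between tags arrives here unchanged, newlines included.
        self.chunks.append(data)


def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Same paragraph twice: once on a single line, once wrapped in the source.
one_line = "<p><strong>name. </strong>waves of technology</p>"
wrapped = "<p><strong>name. </strong>waves\n of technology</p>"

text_one_line = extract_text(one_line)
text_wrapped = extract_text(wrapped)

print(repr(text_one_line))
print(repr(text_wrapped))
```

Only the wrapped variant yields a \n in the extracted text, which matches what I saw with bs4 when I inserted a line break into the source file.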
Having said that, if I add your re.sub line, the newline character is removed. (I looked in sprt[i][2] though, not sprt[i][1] of course - is it possible you are looking in the wrong place there?)
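
For reference, a minimal sketch of three ways to strip the newlines, assuming Python 3 and using the string copied from the last slot shown in the question. The variable names are made up for illustration. One thing worth knowing: the fourth positional parameter of re.sub is count, not flags, so it is safer to pass flags by keyword.

```python
import re

# Text copied from the last slot of the array in the question.
text = (u"waves\n of technology: hidden ideologies of cognitive "
        u"neuroscience , \nfuture production of iconic.\xa0panel proposal (2p)")

# Option 1: plain string replacement, no regex needed.
no_newlines = text.replace("\n", "")

# Option 2: re.sub with flags passed by keyword, so that they are not
# mistaken for the positional count argument.
no_newlines_re = re.sub(r"\n", "", text, flags=re.UNICODE)

# Option 3: collapse every run of whitespace (newlines and the
# non-breaking space \xa0 included) into a single ordinary space.
collapsed = " ".join(text.split())

print(repr(no_newlines))
print(repr(collapsed))
```

Option 3 is probably the most forgiving if the source file wraps lines in unpredictable places, but note that it also replaces the \xa0 characters, which may or may not be what you want.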