python - Removing newlines (\n) with BeautifulSoup -

April 15, 2012

I'm parsing an HTML page with bs4:

import re
import codecs
import MySQLdb
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("sprt.htm"), from_encoding='utf-8')
sprt = [[0 for x in range(3)] for x in range(300)]
i = 0

for para in soup.find_all('p'):
    if para.strong is not None:
        sprt[i][0] = para.strong.get_text()
        sprt[i][1] = para.get_text()
        sprt[i][1] = re.sub(re.escape(sprt[i][0]), "", sprt[i][1], re.UNICODE)
        sprt[i][2] = sprt[i][1]
        sprt[i][2] = re.sub(r".+[\.\?][\s\S\n]", "", sprt[i][1], re.S)
        sprt[i][2] = re.sub(r".+panel", "panel", sprt[i][2], re.S)
        sprt[i][1] = re.sub(re.escape(sprt[i][2]), "", sprt[i][1])
        i += 1
x = 0

The page I'm parsing is filled with paragraphs like these 3:

<p><strong>name name. </strong>the visual politics of play: on signifying practices of digital games. panel proposal (2p)</p>
<p><strong>name name , name name. </strong>pain, art , communication. panel proposal (2p)</p>
<p><strong>name name, name name , name name. </strong>waves of technology: hidden ideologies of cognitive neuroscience , future production of iconic. panel proposal (2p)</p>

The parsing works fine until the last paragraph:

<p><strong>name name, name name , name name. </strong>waves of technology: hidden ideologies of cognitive neuroscience , future production of iconic. panel proposal (2p)</p>

What I find in the last slot of the array is this:

[u'name name, name name\xa0and name name.\xa0', u'waves\n of technology: hidden ideologies of cognitive neuroscience , \nfuture production of iconic.\xa0panel proposal (2p)', u'waves\n of technology: hidden ideologies of cognitive neuroscience , \nfuture production of iconic.\xa0panel proposal (2p)']

There are 2 newlines (\n) that appear in weird places (after "waves" and before "future"). They always appear in the same position, not randomly. I thought it was due to the length of the paragraph, but there are longer paragraphs where no \n appears.

I tried to remove them with:

sprt[i][2] = re.sub("\n", "", sprt[i][1], re.U, re.S)

But it didn't work.

Are the newlines there because I made a mistake somewhere? Is there a way to remove them?

I suspect the newline appears in the source HTML file. I tried to reproduce your error using your paragraphs, and didn't get the \n until I inserted a new line in the source file. That would also explain why it doesn't happen with the other, longer paragraphs: they don't have an actual newline in the HTML source file.
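A minimal sketch of that reproduction, assuming bs4 is installed and using the standard-library html.parser backend. The markup is a shortened, hypothetical version of the paragraphs in the question, with a literal line break typed into the second one only:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup: only the second paragraph contains a
# literal line break inside its text.
html = (
    "<p><strong>name name. </strong>pain, art, communication.</p>"
    "<p><strong>name name. </strong>waves\n of technology.</p>"
)

soup = BeautifulSoup(html, "html.parser")
texts = [p.get_text() for p in soup.find_all("p")]

# repr() makes any embedded \n visible.
for text in texts:
    print(repr(text))
```

If the suspicion is right, get_text() should hand back the line break exactly where it sits in the source, and only for the paragraph that contains one.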

Having said that, if I add your re.sub line, the newline character is removed. (I looked in sprt[i][2] though, not sprt[i][1] of course. Is it possible you are looking in the wrong place there?)
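Two ways to strip the newlines, sketched on the string reported in the question (the variable names here are only for illustration). One detail worth knowing: the fourth positional argument of re.sub is count, so flags passed positionally are read as a replacement limit rather than as flags. Passing them by keyword avoids that.

```python
import re

# Text as reported in the question (\xa0 is a non-breaking space).
text = ("waves\n of technology: hidden ideologies of cognitive neuroscience , "
        "\nfuture production of iconic.\xa0panel proposal (2p)")

# Option 1: remove newlines only, with flags passed by keyword.
no_newlines = re.sub(r"\n", "", text, flags=re.UNICODE)

# Option 2: collapse every run of whitespace (including \n and \xa0)
# into a single space.
normalized = " ".join(text.split())

print(repr(no_newlines))
print(repr(normalized))
```

The second option may be the more convenient one when the text comes from hand-edited HTML, since it also tidies non-breaking spaces and doubled spaces in one pass.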
