Python: How can I extract the text between comment tags with Beautiful Soup?
I have the following HTML code:
<!doctype html public "-//w3c//dtd html 4.01 transitional//en" "http://www.w3.org/tr/html4/loose.dtd">
<html><!-- instancebegin template="/templates/banddetails.dwt" codeoutsidehtmlislocked="false" -->
<head>
<!-- instancebegineditable name="doctitle" -->
<title><blr></title>
<!-- instanceendeditable -->
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<!-- instancebegineditable name="head" --><!-- instanceendeditable -->
</head>
<body>
<div align="center">
<table width="0" border="0" cellpadding="0" cellspacing="0" id="maintable">
<tr>
<td colspan="2" id="navbar"><!--#include file="menu.htm" --></td>
</tr>
<tr>
<td id="maincontent"><table width="0" border="0" cellpadding="0" cellspacing="0" id="contentinner">
<tr>
<td class="bodytext">
<p></p><!-- instancebegineditable name="bigpicture-378wide" --><img src="images/blrlarge.jpg" alt="blr" width="378" height="324" class="picturefloatright"><!-- instanceendeditable -->
<!-- instancebegineditable name="daydatemonthyear" -->
<p>thursday 11th march 2010 </p>
<!-- instanceendeditable -->
How can I extract the text contained within the comment tags using Beautiful Soup? For example, I want to return:
<blr>
thursday 11th march 2010
Thanks.
You might find this program helpful:
from bs4 import BeautifulSoup
from bs4.element import Comment

html_doc = 'x.html'
with open(html_doc) as f:
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(f, 'html.parser')

# Identify the start comment. The comparison is case-insensitive, so both
# "InstanceBeginEditable" and the lowercased form in the sample above match.
def is_instance_begin_editable(text):
    return (isinstance(text, Comment) and
            text.strip().lower().startswith("instancebegineditable"))

# Identify the end comment
def is_instance_end_editable(text):
    return (isinstance(text, Comment) and
            text.strip().lower().startswith("instanceendeditable"))

# For each start comment (older bs4 releases spell this argument text=)
for begin_comment in soup.find_all(string=is_instance_begin_editable):
    # We found a start comment; look at all text and comments that follow it
    for text in begin_comment.find_all_next(string=True):
        if is_instance_end_editable(text):
            # We found the end comment, so this region is finished
            break
        if isinstance(text, Comment):
            # Some other comment: ignore it
            continue
        if not text.strip():
            # Blank text: ignore it
            continue
        # Whatever is left is text inside the editable region
        print(text)
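If you would rather get the text back as data than print it, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: `extract_editable_regions` and `SAMPLE_HTML` are names made up for this example, the markers are matched case-insensitively, the region name is read from the `name="..."` attribute in the start comment, and the sample uses a plain `BLR` title instead of `<blr>`, because how a tag-like title is parsed depends on the parser.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical, cut-down version of the page in the question
SAMPLE_HTML = """
<html><head>
<!-- InstanceBeginEditable name="doctitle" -->
<title>BLR</title>
<!-- InstanceEndEditable -->
</head><body>
<!-- InstanceBeginEditable name="daydatemonthyear" -->
<p>Thursday 11th March 2010 </p>
<!-- InstanceEndEditable -->
</body></html>
"""


def extract_editable_regions(html):
    """Return a dict mapping each editable region name to its list of texts."""
    soup = BeautifulSoup(html, "html.parser")
    regions = {}
    # Visit every comment in the document
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        marker = comment.strip()
        if not marker.lower().startswith("instancebegineditable"):
            continue
        # Pull the region name out of the start marker, if it has one
        match = re.search(r'name="([^"]*)"', marker)
        name = match.group(1) if match else ""
        texts = []
        # Walk forward through all strings until the end marker
        for node in comment.find_all_next(string=True):
            if isinstance(node, Comment):
                if node.strip().lower().startswith("instanceendeditable"):
                    break
                continue  # some other comment: skip it
            if node.strip():
                texts.append(node.strip())
        regions[name] = texts
    return regions


if __name__ == "__main__":
    for name, texts in extract_editable_regions(SAMPLE_HTML).items():
        print(name, texts)
```

Regions that contain no text, such as one wrapping only an image, come back with an empty list, so you can filter them out afterwards if you only want the non-empty ones.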