python - How can I extract the text between comment tags with Beautiful Soup?

February 15, 2011

I have the following HTML code:

<!doctype html public "-//w3c//dtd html 4.01 transitional//en" "http://www.w3.org/tr/html4/loose.dtd">
<html><!-- instancebegin template="/templates/banddetails.dwt" codeoutsidehtmlislocked="false" -->
<head>
<!-- instancebegineditable name="doctitle" -->
<title>&lt;blr&gt;</title>
<!-- instanceendeditable -->
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<!-- instancebegineditable name="head" --><!-- instanceendeditable -->
</head>

<body>
<div align="center">
  <table width="0" border="0" cellpadding="0" cellspacing="0" id="maintable">
    <tr>
      <td colspan="2" id="navbar"><!--#include file="menu.htm" --></td>
    </tr>
    <tr>
      <td id="maincontent"><table width="0" border="0" cellpadding="0" cellspacing="0" id="contentinner">
        <tr>
          <td class="bodytext">
            <p></p><!-- instancebegineditable name="bigpicture-378wide" --><img src="images/blrlarge.jpg" alt="blr" width="378" height="324" class="picturefloatright"><!-- instanceendeditable -->
            <!-- instancebegineditable name="daydatemonthyear" -->
            <p>thursday 11th march 2010 </p>
            <!-- instanceendeditable -->

How can I extract the text contained within the comment tags using Beautiful Soup? For example, I want to return:

<blr>

thursday 11th march 2010

Thanks

You might find this program helpful:

from bs4 import BeautifulSoup
from bs4.element import Comment

html_doc = 'x.html'
# Open in binary mode so Beautiful Soup can detect the encoding itself
with open(html_doc, 'rb') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Identify a start comment
def isInstanceBeginEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("instancebegineditable"))

# Identify an end comment
def isInstanceEndEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("instanceendeditable"))

# For each start comment
for instanceBeginEditable in soup.find_all(string=isInstanceBeginEditable):
    # Found a start comment; look at all the text and comments after it
    for text in instanceBeginEditable.find_all_next(string=True):
        # Found a text node or comment; examine it closely
        if isInstanceEndEditable(text):
            # Found the end comment; stop scanning this region
            break
        if isinstance(text, Comment):
            # Found some other comment; ignore it
            continue
        if not text.strip():
            # Found blank text; ignore it
            continue
        # Whatever is left must be the text we want
        print(text)
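If you would rather get the results back as data than print them, the same idea can be sketched as a single pass over `soup.descendants`, grouping the text by the `name` attribute in each start comment. This is only a sketch, not the method from the answer above: the function name `editable_regions`, the constants `BEGIN`, `END` and `NAME_RE`, and the trimmed-down `sample` markup are all made up for illustration. Matching is done case-insensitively in case the real markup capitalizes the markers differently.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical helper names, chosen for this sketch only.
BEGIN = "instancebegineditable"
END = "instanceendeditable"
NAME_RE = re.compile(r'name="([^"]*)"', re.IGNORECASE)


def editable_regions(html):
    """Map each editable region's name to the non-blank text found inside it."""
    soup = BeautifulSoup(html, "html.parser")
    regions = {}
    current = None  # name of the region we are currently inside, or None
    for node in soup.descendants:  # every tag, string and comment, in document order
        if isinstance(node, Comment):
            marker = node.strip().lower()
            if marker.startswith(BEGIN):
                match = NAME_RE.search(node)
                current = match.group(1) if match else "(unnamed)"
                regions.setdefault(current, [])
            elif marker.startswith(END):
                current = None
            continue  # comments themselves are never collected
        if current is not None and isinstance(node, NavigableString):
            text = node.strip()
            if text:  # skip whitespace-only strings between tags
                regions[current].append(text)
    return regions


# A shortened version of the markup from the question (assumed, for demonstration).
sample = """<html><head>
<!-- instancebegineditable name="doctitle" -->
<title>&lt;blr&gt;</title>
<!-- instanceendeditable -->
</head><body>
<!-- instancebegineditable name="bigpicture-378wide" --><img src="images/blrlarge.jpg" alt="blr"><!-- instanceendeditable -->
<!-- instancebegineditable name="daydatemonthyear" -->
<p>thursday 11th march 2010 </p>
<!-- instanceendeditable -->
</body></html>"""

regions = editable_regions(sample)
for name, texts in regions.items():
    print(name, texts)
```

A region that contains only an image, such as `bigpicture-378wide`, ends up with an empty list, so you can filter those out if you only care about regions that hold text.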
