python - Why is this returning a NoneType?
I'm trying to scrape info off of Wikipedia using the function below, but I'm running into an AttributeError because a function call is returning None. Can someone please try and explain why it is returning None?
    import wikipedia as wp
    import string

    def add_section_info(search):
        html = wp.page(search).html().encode("utf-8")  # gets the HTML source from Wikipedia

        with open("temp.xml", 'w') as t:  # write the HTML in XML format
            t.write(html)

        table_of_contents = []
        dict_of_section_info = {}

        # this extracts the info in the table of contents
        with open("temp.xml", 'r') as r:
            for line in r:
                if "toclevel" in line:
                    new_string = line.partition("#")[2]
                    content_title = new_string.partition("\"")[0]
                    tbl = string.maketrans("_", " ")
                    content_title = content_title.translate(tbl)
                    table_of_contents.append(content_title)

        print wp.page(search).section("Aortic rupture")  # this is None, but it shouldn't be

        for item in table_of_contents:
            section = wp.page(search).section(item).encode("utf-8")
            print section
            if section == "":
                continue
            else:
                dict_of_section_info[item] = section

        with open("section_info.txt", 'a') as sect:
            sect.write(search)
            sect.write("------------------------------------------\n")
            for item in dict_of_section_info:
                sect.write(item)
                sect.write("\n\n")
                sect.write(dict_of_section_info[item])
                sect.write("####################################\n\n")

    add_section_info("Abdominal aortic aneurysm")

What I don't understand is that if I run add_section_info("HIV"), for example, it works perfectly.
The source code for the imported wikipedia module is here.
My output from the above code is this:
    Abdominal aortic aneurysm
    Signs and symptoms
    Traceback (most recent call last):
      File "/home/pharoslabsllc/documents/wikitest.py", line 79, in <module>
        add_section_info(line)
      File "/home/pharoslabsllc/documents/wikitest.py", line 30, in add_section_info
        section = wp.page(search).section(item).encode("utf-8")
    AttributeError: 'NoneType' object has no attribute 'encode'
The page method never returns None (you can check this in the source code), but the section method does return None if the title cannot be found. See the documentation:
    section(section_title)
        Get the plain text content of a section from self.sections. Returns None if section_title isn't found, otherwise returns a whitespace stripped string.
So the answer is that the Wikipedia page you are referring to has no section titled Aortic rupture, as far as the library is concerned.
Looking at Wikipedia, it seems that the page Abdominal aortic aneurysm does have such a section.
Note that if you try to check the value of wp.page(search).sections you get: []. I.e. it seems that the library isn't parsing the sections properly.
From the source code of the library, found here, you can see this test:
    section = u"== {} ==".format(section_title)
    try:
        index = self.content.index(section) + len(section)
    except ValueError:
        return None

However:
    In [14]: p.content.find('Aortic')
    Out[14]: 3223

    In [15]: p.content[3220:3220+50]
    Out[15]: '== Aortic ruptureEdit ===\n\nThe signs and symptoms '

    In [16]: p.section('Aortic ruptureEdit')
    Out[16]: "The signs and symptoms of a ruptured AAA may includes severe pain in the lower back, flank, abdomen or groin. A mass that pulses with the heart beat may be felt. The bleeding can leads to hypovolemic shock with low blood pressure and a fast heart rate. This may lead to brief passing out.\nThe mortality of AAA rupture is 90%. 65–75% of patients die before they arrive at hospital and 90% die before they reach the operating room. The bleeding can be retroperitoneal or into the abdominal cavity. Rupture can create a connection between the aorta and intestine or inferior vena cava. Flank ecchymosis (appearance of a bruise) is a sign of retroperitoneal bleeding, and is called Grey Turner's sign.\nAortic aneurysm rupture may be mistaken for the pain of kidney stones, muscle related pain."

Note the Edit ==. In other words, the library has a bug: it doesn't take into account the Edit link that ends up appended to the heading text.
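The mismatch can be reproduced without touching the network. The following is only a minimal sketch: find_heading_end is a hypothetical helper that mimics the lookup quoted from the library source, and sample_content is made-up text shaped like the p.content output shown above, not the real page.

```python
def find_heading_end(content, section_title):
    """Mimic the quoted lookup: index just past the heading, or None."""
    heading = u"== {} ==".format(section_title)
    try:
        return content.index(heading) + len(heading)
    except ValueError:
        # Same outcome as the library: the heading string was not found
        return None


# Hypothetical page text; note "Edit" glued onto the heading
sample_content = u"=== Aortic ruptureEdit ===\n\nThe signs and symptoms of a ruptured AAA"

exact = find_heading_end(sample_content, u"Aortic rupture")          # None
with_suffix = find_heading_end(sample_content, u"Aortic ruptureEdit")  # an index
```

The exact title fails because the text searched for is "== Aortic rupture ==", while the content has "Edit" between "rupture" and the closing " ==".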
The same code works for the HIV page because on that page the headings don't have an Edit link right next to them. I have no idea why this is so; anyway, it looks like either a bug or a shortcoming of the library, so you should open a ticket on its issue tracker.
In the meanwhile you can use a simple fix like:
    def find_section(page, title):
        res = page.section(title)
        if res is None:
            res = page.section(title + 'Edit')
        return res

and use this function instead of using the .section method directly. However, this can only be a temporary fix.
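Note that find_section can still return None when neither variant of the title exists, so a call to .encode on its result could fail in the same way as in the question. A slightly more defensive variant might look like the sketch below. Here section_or_default and its suffixes/default parameters are hypothetical names, and FakePage is only a stand-in for a real page object so that the example runs offline; the only assumption about the real library is the documented behaviour that section returns None for a missing title.

```python
class FakePage(object):
    """Hypothetical stand-in for a wikipedia page object (demo only)."""

    def __init__(self, sections):
        self._sections = sections

    def section(self, title):
        # Like the documented behaviour: None when the title is not found
        return self._sections.get(title)


def section_or_default(page, title, suffixes=(u"", u"Edit"), default=u""):
    """Try the title as given, then with each suffix; never return None."""
    for suffix in suffixes:
        text = page.section(title + suffix)
        if text is not None:
            return text
    return default


page = FakePage({u"Aortic ruptureEdit": u"The signs and symptoms ..."})
found = section_or_default(page, u"Aortic rupture")
missing = section_or_default(page, u"No such section")
```

Because a missing section comes back as an empty string, the loop in the question can keep its existing check for an empty section and simply skip such entries.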