Web scraping - Avoiding overlapping of responses in Python Scrapy
I am trying to scrape the family information of Indian Rajya Sabha members, found here: http://164.100.47.5/newmembers/memberlist.aspx
Being a newbie in Scrapy, I followed this and this; the example code I came up with produces the following.
    def parse(self, response):
        print "inside parse"
        requests = []
        target_base_prefix = 'ctl00$contentplaceholder1$gridview2$ctl'
        target_base_suffix = '$lkb'
        for i in range(2, 5):
            if i < 10:
                target_id = "0" + str(i)
            else:
                target_id = str(i)
            evtarget = target_base_prefix + target_id + target_base_suffix
            form_data = {'__EVENTTARGET': evtarget, '__EVENTARGUMENT': ''}
            requests.append(scrapy.http.FormRequest.from_response(
                response, formdata=form_data, dont_filter=True,
                method='POST', callback=self.parse_politician))
        for r in requests:
            print "before yield " + str(r)
            yield r

    def parse_pol_bio(self, response):
        print "[parse_pol_bio] - response url - " + response.url
        name_xp = '//span[@id="ctl00_contentplaceholder1_gridview1_ctl02_label3"]/font/text()'
        base_xp_prefix = '//*[@id="ctl00_contentplaceholder1_tabcontainer1_tabpanel2_ctl00_detailsview2_label'
        base_xp_suffix = '"]/text()'
        father_id = '12'
        mother_id = '13'
        married_id = '1'
        spouse_id = '3'
        name = response.xpath(name_xp).extract()[0].strip()
        name = re.sub(' +', ' ', name)
        father = response.xpath(base_xp_prefix + father_id + base_xp_suffix).extract()[0].strip()
        mother = response.xpath(base_xp_prefix + mother_id + base_xp_suffix).extract()[0].strip()
        married = response.xpath(base_xp_prefix + married_id + base_xp_suffix).extract()[0].strip().split(' ')[0]
        if married == "married":
            spouse = response.xpath(base_xp_prefix + spouse_id + base_xp_suffix).extract()[0].strip()
        else:
            spouse = ''
        print 'name marital_stat father_name mother_name spouse'
        print name, married, father, mother, spouse
        item = RSItem()
        item['name'] = name
        item['spouse'] = spouse
        item['mother'] = mother
        item['father'] = father
        return item

    def parse_politician(self, response):
        evtarget = 'ctl00$contentplaceholder1$tabcontainer1'
        evarg = 'activetabchanged:1'
        formdata = {'__EVENTTARGET': evtarget, '__EVENTARGUMENT': evarg}
        print "[parse_politician] - response url - " + response.url
        return scrapy.FormRequest.from_response(
            response, formdata=formdata, method='POST',
            callback=self.parse_pol_bio)
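As an aside, the manual if/else zero-padding of the control index in parse can be replaced with str.zfill. A minimal sketch; make_event_target is a hypothetical helper, not part of the original spider:

```python
def make_event_target(i,
                      prefix='ctl00$contentplaceholder1$gridview2$ctl',
                      suffix='$lkb'):
    # zfill pads the index to two digits, e.g. 2 -> "02", 12 -> "12",
    # which is what the if/else in parse() does by hand
    return prefix + str(i).zfill(2) + suffix

print(make_event_target(2))  # ctl00$contentplaceholder1$gridview2$ctl02$lkb
```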
Explanation

The parse method loops over the target ids of the different politicians and sends one request per politician.

parse_politician is there for the tab-changing purpose.

parse_pol_bio does the actual scraping of the dependents' names.
Problem

This causes duplicate responses in parse_pol_bio, i.e. the same person's information comes through multiple times. The duplicates are quite random: on each run, a different politician's data may be duplicated. I have checked whether any request is being yielded multiple times; none are. I also tried putting a sleep after each yield to see if that helps. I suspect the Scrapy request scheduler here. Is there any other problem in the code? What can be done to avoid this?
Edit

To clarify: I know what dont_filter=True does, and I have deliberately kept it. The problem is that response data is getting replaced. For example, when I generate 3 requests with target_id = 1, 2 and 3 separately, the response for target_id = 1 gets replaced by the response for target_id = 2 (so I end up with one response for target_id = 3 and two for target_id = 2).
Expected output (CSV)

politician name, spouse name, father name, mother name
pol1, spouse1, father1, mother1
pol2, spouse2, father2, mother2
pol3, spouse3, father3, mother3

Output given (CSV)

politician name, spouse name, father name, mother name
pol1, spouse1, father1, mother1
pol1, spouse1, father1, mother1
pol3, spouse3, father3, mother3
Finally fixed it (phew!). By default, Scrapy sends 16 requests at a time (concurrent requests). Putting

    CONCURRENT_REQUESTS = 1

in the settings.py file made the requests sequential, which solved the issue.
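For reference, a settings.py fragment with that override (CONCURRENT_REQUESTS is a standard Scrapy setting; the default is 16):

```python
# settings.py
# Process one request at a time so responses to the near-identical
# ASP.NET postbacks cannot overlap. This serializes the whole crawl,
# so it trades speed for correctness.
CONCURRENT_REQUESTS = 1
```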
The requests I was sending were all very similar (see above), and the response data got overlapped with one another, which is why the duplicates were always of one type only. I have no idea how exactly this happens, but the solution of making the requests sequential confirms it. Can anyone offer a better explanation?
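One hedged guess at the cause: all the postbacks share a single cookie session, so concurrent requests may overwrite each other's server-side ASP.NET state. If so, an alternative that keeps concurrency is Scrapy's cookiejar meta key, which gives each request chain its own cookie session. A sketch with a hypothetical helper (request_kwargs is not part of the original spider, and this is untested against the site itself):

```python
def request_kwargs(i):
    # Build the FormRequest keyword arguments for the politician at
    # grid control index i (two-digit, zero-padded, as on the page).
    # meta={'cookiejar': i} asks Scrapy's cookies middleware to keep a
    # separate cookie session per request chain, so concurrent ASP.NET
    # postbacks should not stomp on each other's server-side state.
    evtarget = 'ctl00$contentplaceholder1$gridview2$ctl%02d$lkb' % i
    return {
        'formdata': {'__EVENTTARGET': evtarget, '__EVENTARGUMENT': ''},
        'method': 'POST',
        'dont_filter': True,
        'meta': {'cookiejar': i},
    }

# Usage inside parse() would then look like:
#     for i in range(2, 5):
#         yield scrapy.FormRequest.from_response(
#             response, callback=self.parse_politician, **request_kwargs(i))
```

Note that follow-up requests (such as the tab-change postback in parse_politician) must pass the same cookiejar value along in their own meta for the separate sessions to persist.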