web scraping - Avoiding overlapping of responses in Python Scrapy


I am trying to scrape the family information of Indian Rajya Sabha members, found here: http://164.100.47.5/newmembers/memberlist.aspx
Being a newbie in Scrapy, I followed this and this example code to produce the following.

    import re

    import scrapy

    # The three callbacks below are methods of the spider class.

    def parse(self, response):
        print("inside parse")
        requests = []
        target_base_prefix = 'ctl00$contentplaceholder1$gridview2$ctl'
        target_base_suffix = '$lkb'

        # One postback per politician row in the member list.
        for i in range(2, 5):
            if i < 10:
                target_id = "0" + str(i)
            else:
                target_id = str(i)

            evtarget = target_base_prefix + target_id + target_base_suffix
            form_data = {'__EVENTTARGET': evtarget, '__EVENTARGUMENT': ''}

            requests.append(scrapy.http.FormRequest.from_response(
                response, formdata=form_data, dont_filter=True,
                method='POST', callback=self.parse_politician))

        for r in requests:
            print("before yield" + str(r))
            yield r

    def parse_pol_bio(self, response):
        print("[parse_pol_bio]- response url - " + response.url)

        name_xp = '//span[@id="ctl00_contentplaceholder1_gridview1_ctl02_label3"]/font/text()'
        base_xp_prefix = '//*[@id="ctl00_contentplaceholder1_tabcontainer1_tabpanel2_ctl00_detailsview2_label'
        base_xp_suffix = '"]/text()'
        father_id = '12'
        mother_id = '13'
        married_id = '1'
        spouse_id = '3'

        name = response.xpath(name_xp).extract()[0].strip()
        name = re.sub(' +', ' ', name)  # collapse runs of spaces

        father = response.xpath(base_xp_prefix + father_id + base_xp_suffix).extract()[0].strip()
        mother = response.xpath(base_xp_prefix + mother_id + base_xp_suffix).extract()[0].strip()
        married = response.xpath(base_xp_prefix + married_id + base_xp_suffix).extract()[0].strip().split(' ')[0]

        # The spouse label is only read for married members.
        if married == "married":
            spouse = response.xpath(base_xp_prefix + spouse_id + base_xp_suffix).extract()[0].strip()
        else:
            spouse = ''

        print('name     marital_stat    father_name     mother_name     spouse')
        print(name, married, father, mother, spouse)

        item = rsitem()
        item['name'] = name
        item['spouse'] = spouse
        item['mother'] = mother
        item['father'] = father

        return item

    def parse_politician(self, response):
        evtarget = 'ctl00$contentplaceholder1$tabcontainer1'
        evarg = 'activetabchanged:1'
        formdata = {'__EVENTTARGET': evtarget, '__EVENTARGUMENT': evarg}

        print("[parse_politician]-response url - " + response.url)

        # formdata must be passed by keyword: the second positional
        # parameter of from_response is formname, not formdata.
        return scrapy.FormRequest.from_response(
            response, formdata=formdata, method='POST',
            callback=self.parse_pol_bio)

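Explanation
The parse method loops over the target IDs of the different politicians and sends the requests.
parse_politician: sends the postback that switches to the tab holding the family details.
parse_pol_bio: scrapes the names of the family members (father, mother, spouse).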
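
As a rough illustration of the loop that parse runs, the __EVENTTARGET values can also be built with zero-padded formatting instead of the if/else. This is only a sketch: build_event_targets is a hypothetical helper that is not part of the spider, and the prefix and suffix are the strings used in the question.

```python
def build_event_targets(start, stop,
                        prefix="ctl00$contentplaceholder1$gridview2$ctl",
                        suffix="$lkb"):
    """Return one postback target per row index in range(start, stop)."""
    # "{:02d}" pads single-digit indexes with a leading zero,
    # which is what the if/else in parse does by hand.
    return [f"{prefix}{i:02d}{suffix}" for i in range(start, stop)]


if __name__ == "__main__":
    for target in build_event_targets(2, 5):
        print(target)
```

Each returned string would then be used as the '__EVENTTARGET' value of one form request.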

Problem
The problem is that this causes duplicate responses in parse_pol_bio, i.e. the information for the same person comes back multiple times.
The nature of the duplicate responses is quite random on each run, i.e. different politicians' data may be duplicated each time.
I have checked whether any request is being yielded multiple times, but none are.
I tried putting a sleep after each yielded request to see if that helps.
I suspect the Scrapy request scheduler here.

Is there any other problem in the code? Can anything be done to avoid this?

Edit
Just to clarify: I know what dont_filter=True does, and I have deliberately kept it.

The problem is that the response data is getting replaced. For example, when I generate 3 requests with target_id = 1, 2, 3 separately, the response for target_id = 1 is getting replaced by the response for target_id = 2.
[So that leaves me with 1 response for target_id = 3 and 2 for target_id = 2.]

Expected output (CSV)

    politician name, spouse name, father name, mother name
    pol1, spouse1, father1, mother1
    pol2, spouse2, father2, mother2
    pol3, spouse3, father3, mother3

Output given (CSV)

    politician name, spouse name, father name, mother name
    pol1, spouse1, father1, mother1
    pol1, spouse1, father1, mother1
    pol3, spouse3, father3, mother3
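
One way to confirm the overlap from run to run is to count how often each politician name appears in the exported rows. A minimal sketch, assuming the rows are available as lists of strings in the column order shown above; find_duplicate_names is a hypothetical helper, not part of the spider.

```python
from collections import Counter


def find_duplicate_names(rows):
    """Return {name: count} for names that occur more than once."""
    counts = Counter(row[0].strip() for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Rows taken from the "output given" sample (header row omitted).
rows = [
    ["pol1", "spouse1", "father1", "mother1"],
    ["pol1", "spouse1", "father1", "mother1"],
    ["pol3", "spouse3", "father3", "mother3"],
]
print(find_duplicate_names(rows))
```

An empty result means every scraped name was unique in that run.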

Finally fixed it (phew!).
By default Scrapy sends 16 requests at a time (concurrent requests).
Putting CONCURRENT_REQUESTS = 1 in the settings.py file made the requests sequential and solved the issue.
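
For reference, the change described above would look roughly like this in the project's settings.py. This is a sketch of the single setting involved; the rest of the file is project-specific and not shown.

```python
# settings.py (fragment)

# Allow only one request in flight at a time instead of the default 16.
CONCURRENT_REQUESTS = 1
```

The same key can also be set for a single spider through its custom_settings class attribute, which leaves other spiders in the project at the default concurrency.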

The requests I gave were very similar (see above), and the response data overlapped with one another, giving duplicate responses of one type only.

I have no idea how this is happening, though the fact that making the requests sequential fixes it confirms this.
Any better explanations?

