R - Scrape table, exclude rows with certain class, and assign attribute value to variable per row


I have a page with the following HTML.

<table id="batting_gamelogs">
  <tbody>
    <tr class id="batting_gamelogs.153">
      <td></td>
      <td></td>
      <td>
        <span id="pha192504150-simmoal01">
      </td>
    </tr>
    <tr class id="batting_gamelogs.154">
      <td></td>
      <td></td>
      <td>
        <span id="pha192504160-simmoal01">
      </td>
    </tr>
    <tr class="thead">
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>

I am using the following code to scrape the table.

library(XML)    # htmlParse, xpathSApply, readHTMLTable
library(dplyr)  # bind_rows

data = NULL
batlist = NULL

battingurls <- paste("http://www.baseball-reference.com", yplist[, c("hrefs")], sep = "")

for (thisbattingurl in battingurls) {
  batting <- htmlParse(thisbattingurl)

  # pull the player id and the year out of the URL by position
  fstampid <- regexpr("&", thisbattingurl, fixed = TRUE) - 1
  fstampyr <- regexpr("year=", thisbattingurl, fixed = TRUE) + 5
  id <- substr(thisbattingurl, 53, fstampid)
  year <- substr(thisbattingurl, fstampyr, 75)

  # skip pages that have no game log table
  if (length(xpathSApply(batting, '//*[@id = "batting_gamelogs"]', xmlValue)) == 0) next

  tablenode <- xpathSApply(batting, '//*[@id="batting_gamelogs"]')[[1]]
  data <- readHTMLTable(tablenode, stringsAsFactors = FALSE)
  data  # select first table
  total <- cbind(id, year, data)

  batlist <- bind_rows(batlist, total)
}

I would like to leave out the rows with class "thead". I don't know whether it is easier to scrape the whole table and delete the unwanted rows later, or not to grab them in the first place. I would also like to assign the span id to a variable called gameid for each row of the table I scrape.
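One way to skip the header rows at scrape time, rather than deleting them afterwards, is to filter on the class attribute in the XPath itself. The sketch below is only an illustration, assuming the XML package: it parses a trimmed copy of the HTML above from a string instead of fetching the real page, and the `html` variable and the explicitly closed span tags are my own additions.

```r
library(XML)

# Trimmed copy of the sample table (hypothetical inline string, not the live page)
html <- '<table id="batting_gamelogs"><tbody>
  <tr class="" id="batting_gamelogs.153"><td></td><td></td><td><span id="pha192504150-simmoal01"></span></td></tr>
  <tr class="" id="batting_gamelogs.154"><td></td><td></td><td><span id="pha192504160-simmoal01"></span></td></tr>
  <tr class="thead"><td></td><td></td><td></td></tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)

# Keep every row of the table except those whose class is exactly "thead"
rows <- getNodeSet(doc, '//table[@id="batting_gamelogs"]//tr[not(@class="thead")]')

# Read the id attribute of each remaining row
row_ids <- sapply(rows, xmlGetAttr, "id")
print(row_ids)
```

Note that readHTMLTable on the whole table node would still return the header rows, so this filter only helps if the rows are processed individually.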

The code I am using to scrape the table grabs the whole table at once, I think, but I'm not sure because I'm new to R. I've tried searching here, but I can't make heads or tails of what I have found.

The code I'm using to set gameid works when I test it on one URL and choose a specific tr id, but it doesn't when I use contains. I'm not sure whether that is because I'm running it in a loop and scraping the whole table at once.

gameid <- xpathSApply(batting, '//*[@id="batting_gamelogs.153"]/td[10]/span/@id')

This returns "pha192504150-simmoal01", which is different/unique for every row of the table.

When I run it in the loop, I'm trying the following code:

gameid <- xpathSApply(batting, '//*[contains(., "batting_gamelogs."]/td[10]/span/@id')

From there I'll cbind gameid with the other variables at the end of the code. I don't have that in there yet because it's not working.
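A note on the cbind step: an XPath ending in span/@id only returns a value for rows that actually contain a span, so the result can be shorter than the table that readHTMLTable returns. A minimal sketch of one way around this, assuming the XML package and a trimmed copy of the sample HTML (so the span sits in td[3] here rather than td[10] as on the real page), is to walk the rows and record NA where no span exists. The `html` string and the closed span tags are my own additions.

```r
library(XML)

# Trimmed copy of the sample table (hypothetical inline string, not the live page)
html <- '<table id="batting_gamelogs"><tbody>
  <tr class="" id="batting_gamelogs.153"><td></td><td></td><td><span id="pha192504150-simmoal01"></span></td></tr>
  <tr class="" id="batting_gamelogs.154"><td></td><td></td><td><span id="pha192504160-simmoal01"></span></td></tr>
  <tr class="thead"><td></td><td></td><td></td></tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, '//table[@id="batting_gamelogs"]//tr')

# One value per row: the span id if present, otherwise NA
gameid <- sapply(rows, function(row) {
  id <- xpathSApply(row, "./td[3]/span/@id")   # relative to this row only
  if (length(id) == 0) NA_character_ else as.character(id)[1]
})
print(gameid)
```

Because gameid then has one entry per tr, it lines up row for row with a data.frame built from the same rows, and the NA entries mark the rows to drop.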

Thanks for the help, it is appreciated!

I was able to pull gameid from the table by correcting the code as @har07 suggested and switching . to @id in contains.

gameid <- xpathSApply(batting, '//*[contains(., "batting_gamelogs.")]/td[10]/span/@id')

It now looks like this:

gameid <- xpathSApply(batting, '//*[contains(@id, "batting_gamelogs.")]/td[10]/span/@id')
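For anyone wondering why the switch matters: inside contains(), . refers to the text content of the node, while @id refers to its id attribute, and the string "batting_gamelogs." only appears in the attribute. The sketch below compares the two on a trimmed copy of the sample HTML, assuming the XML package. The `html` string and closed span tags are my own additions, and the span is in td[3] here instead of td[10].

```r
library(XML)

# Trimmed copy of the sample table (hypothetical inline string, not the live page)
html <- '<table id="batting_gamelogs"><tbody>
  <tr class="" id="batting_gamelogs.153"><td></td><td></td><td><span id="pha192504150-simmoal01"></span></td></tr>
  <tr class="" id="batting_gamelogs.154"><td></td><td></td><td><span id="pha192504160-simmoal01"></span></td></tr>
  <tr class="thead"><td></td><td></td><td></td></tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)

# Tests the text content of each element: nothing matches
by_text <- xpathSApply(doc, '//*[contains(., "batting_gamelogs.")]/td[3]/span/@id')

# Tests the id attribute of each element: the two game rows match
by_attr <- xpathSApply(doc, '//*[contains(@id, "batting_gamelogs.")]/td[3]/span/@id')

print(length(by_text))
print(as.character(by_attr))
```

The trailing dot in "batting_gamelogs." also keeps the table element itself (id "batting_gamelogs") out of the match.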

I was also able to exclude the rows by subsetting the current data.frame. I added a line of code to create a new data.frame without the useless headers and misc rows.

newdata <- subset(data, Rk!="April" & Rk!="May" & Rk!="June" & Rk!="July" & Rk!="August" & Rk!="September" & Rk!="October" & Rk!="November" & Opp!="<NA>")
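The same filter can probably be written more compactly with %in%. This is a sketch on made-up rows, assuming the columns are named Rk and Opp and that missing opponents are stored as real NA values rather than the literal string "<NA>". month.name is a constant built into base R holding the twelve English month names.

```r
# Made-up stand-in for the scraped table; values are placeholders only
data <- data.frame(
  Rk  = c("1", "2", "April", "3", "May"),
  Opp = c("WSH", "WSH", NA, "NYY", NA),
  stringsAsFactors = FALSE
)

# Drop month-header rows and rows with no opponent
newdata <- subset(data, !(Rk %in% month.name) & !is.na(Opp))
print(newdata)
```

Using month.name covers all twelve months, which is slightly broader than the eight listed in the original call, but a rank column should never hold a month name on a real game row.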
