R - Scrape table, exclude rows with certain class, and assign attribute value to variable per row
I have a page with the following HTML.
<table id="batting_gamelogs">
  <tbody>
    <tr class id="batting_gamelogs.153">
      <td></td>
      <td></td>
      <td> <span id="pha192504150-simmoal01"></span> </td>
    </tr>
    <tr class id="batting_gamelogs.154">
      <td></td>
      <td></td>
      <td> <span id="pha192504160-simmoal01"></span> </td>
    </tr>
    <tr class="thead">
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>
I am using the following code to scrape the table.
library(XML)    # htmlParse, xpathSApply, readHTMLTable
library(dplyr)  # bind_rows

data = NULL
batlist = NULL
battingurls <- paste("http://www.baseball-reference.com", yplist[, c("hrefs")], sep = "")
for (thisbattingurl in battingurls) {
  batting <- htmlParse(thisbattingurl)
  # pull the player id and the year out of the URL
  fstampid <- regexpr("&", thisbattingurl, fixed = TRUE) - 1
  fstampyr <- regexpr("year=", thisbattingurl, fixed = TRUE) + 5
  id <- substr(thisbattingurl, 53, fstampid)
  year <- substr(thisbattingurl, fstampyr, 75)
  # skip pages that have no game log table
  if (length(xpathSApply(batting, '//*[@id = "batting_gamelogs"]', xmlValue)) == 0) next
  tablenode <- xpathSApply(batting, '//*[@id="batting_gamelogs"]')[[1]]
  data <- readHTMLTable(tablenode, stringsAsFactors = FALSE)
  data # select first table
  total <- cbind(id, year, data)
  batlist <- bind_rows(batlist, total)
}
I would like to leave out the rows with class "thead". I don't know whether it is easier to scrape the whole table and delete the unwanted rows later, or not to grab them in the first place. I would also like to assign the span id to a variable called gameid for each row in the table I scrape.
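For the "not grab them in the first place" option, here is a minimal sketch, assuming the XML package is used as in the loop above. The HTML string and its cell values are made up for illustration; only the table id and the "thead" class come from the page. It drops the unwanted rows from the parsed document with removeNodes() before the table is read.

```r
library(XML)

# Hypothetical, trimmed-down stand-in for the real page
html <- '<table id="batting_gamelogs"><tbody>
<tr id="batting_gamelogs.153"><td>1</td><td>Apr 15</td></tr>
<tr id="batting_gamelogs.154"><td>2</td><td>Apr 16</td></tr>
<tr class="thead"><td>Rk</td><td>Date</td></tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)

# Remove every row with class "thead" from the parsed tree
removeNodes(getNodeSet(doc, '//table[@id="batting_gamelogs"]//tr[@class="thead"]'))

# Read what is left; header = FALSE because this toy table has no header row
tablenode <- getNodeSet(doc, '//table[@id="batting_gamelogs"]')[[1]]
tbl <- readHTMLTable(tablenode, header = FALSE, stringsAsFactors = FALSE)
```

Whether this is easier than filtering the data.frame afterwards probably depends on how many kinds of junk rows the real table has.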
The code I am using to scrape the table grabs the whole table at once, I think, but I'm not sure since I'm new to R. I've tried searching here, but I can't make heads or tails of what I have found.
The code I'm using to set gameid works when I test one URL and choose a specific tr id, but it doesn't when I use contains. I'm not sure whether that is because I'm running it in a loop and scraping the whole table at once.
gameid <- xpathSApply(batting, '//*[@id="batting_gamelogs.153"]/td[10]/span/@id')
This returns "pha192504150-simmoal01", and the value is different/unique for every row of the table.
When I run it in the loop, I'm trying the following code:
gameid <- xpathSApply(batting, '//*[contains(., "batting_gamelogs."]/td[10]/span/@id')
From there I'll cbind gameid with the other variables at the end of the code. I don't have that in there yet because it's not working.
Thanks for any help, it is appreciated!!
I was able to pull gameid from the table by correcting the code as @har07 suggested (adding the missing closing parenthesis) and switching . to @id in contains.
gameid <- xpathSApply(batting, '//*[contains(., "batting_gamelogs.")]/td[10]/span/@id')
Now it looks like
gameid <- xpathSApply(batting, '//*[contains(@id, "batting_gamelogs.")]/td[10]/span/@id')
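To illustrate why @id matters here: contains(., ...) tests an element's text content, while contains(@id, ...) tests its id attribute. Below is a small self-contained sketch on a cut-down copy of the HTML from the question. The sample has only three cells per row, so it uses td[3] where the real page needs td[10]. The extra not(@class="thead") test is my own addition, not something from the accepted fix.

```r
library(XML)

# Trimmed-down copy of the table from the question (3 cells per row)
html <- '<table id="batting_gamelogs"><tbody>
<tr class="" id="batting_gamelogs.153"><td></td><td></td><td><span id="pha192504150-simmoal01"></span></td></tr>
<tr class="" id="batting_gamelogs.154"><td></td><td></td><td><span id="pha192504160-simmoal01"></span></td></tr>
<tr class="thead"><td></td><td></td><td></td></tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)

# Match rows by their id attribute and skip rows with class "thead".
# td[3] fits this sample only; the real page would use td[10].
gameid <- xpathSApply(
  doc,
  '//tr[contains(@id, "batting_gamelogs.") and not(@class="thead")]/td[3]/span/@id'
)

# xpathSApply returns a named vector for attributes; drop the names
gameid <- as.character(unname(gameid))
```

On this sample, gameid should hold one id per data row, which is the shape needed to cbind it with the other columns.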
I was able to exclude the rows by subsetting the current data.frame. I added a line of code to create a new data.frame without the useless headers and misc rows.
newdata <- subset(data, Rk!="April" & Rk!="May" & Rk!="June" & Rk!="July" & Rk!="August" & Rk!="September" & Rk!="October" & Rk!="November" & Opp!="<NA>")
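The long chain of != tests could probably be shortened with %in%. This is only a sketch on a made-up data.frame: the column names Rk and Opp follow the subset call above, while the toy values, the repeated "Rk" header row, and the use of is.na() for missing opponents are assumptions. month.name is a built-in R constant with all twelve English month names, so it covers a superset of the months listed above.

```r
# Hypothetical toy data standing in for the scraped game log
data <- data.frame(
  Rk  = c("1", "2", "May", "3", "Rk"),
  Opp = c("BOS", "NYY", NA, "WSH", "Opp"),
  stringsAsFactors = FALSE
)

# Values of Rk that mark separator or repeated header rows (assumption)
junk <- c(month.name, "Rk")

# Keep only real game rows: Rk is not a junk label and Opp is present
newdata <- subset(data, !(Rk %in% junk) & !is.na(Opp))
```

With the toy data this keeps the three rows whose Rk is "1", "2" and "3".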