screen scraping - How to save the body content of New York Times links using jsoup

August 15, 2014

I have been screen scraping different news websites, such as the Washington Post, the NY Times, and Yahoo message boards. I used jsoup, and it works fine for some of the websites, such as the Washington Post. However, when it comes to the NY Times, every approach I've tried has failed. Using a piece of code such as the following gives me the "Log In - New York Times" content.

    // Fetch the page, then keep its title and body text.
    String html = Jsoup.connect(urlString)
            .maxBodySize(Integer.MAX_VALUE)
            .timeout(600000)
            .get()
            .html();
    Document doc = Jsoup.parse(html);
    String result = doc.title() + "\n";
    result += doc.body().text();

I also tried collecting cookies and passing them along with the requests, but that didn't work either.

    // Request the login page to obtain its cookies.
    Connection.Response loginForm = Jsoup.connect("https://myaccount.nytimes.com/auth/login")
            .method(Connection.Method.GET)
            .execute();
    // Post the credentials together with those cookies.
    Document doc = Jsoup.connect("https://myaccount.nytimes.com/auth/login")
            .data("userid", myEmail)
            .data("password", password)
            .cookies(loginForm.cookies())
            .post();
    // Reuse the cookies when requesting the article.
    Map<String, String> loginCookies = loginForm.cookies();
    Document doc1 = Jsoup.connect(urlString)
            .maxBodySize(Integer.MAX_VALUE)
            .timeout(600000)
            .cookies(loginCookies)
            .get();

Can anyone give me an approach to save the body content of NY Times URLs?

If you look at the actual data sent during a 'normal' login, you'll see that besides the cookies, user name, and password, the browser sends fields such as 'token', 'expires', and so on, which it gets from the first request. Open the developer tools in your browser and you'll see them.
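The hidden fields can also be listed from Java rather than read off the developer tools. What follows is only a minimal sketch, assuming the login page is an ordinary HTML form with hidden inputs. The class name `HiddenFieldLister`, the helper `hiddenFields`, and the sample markup are made up for illustration and are not the actual NY Times markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class HiddenFieldLister {

    // Collects the name/value pairs of all named hidden inputs, in document order.
    static Map<String, String> hiddenFields(Document page) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Element input : page.select("input[type=hidden]")) {
            String name = input.attr("name");
            if (!name.isEmpty()) {
                fields.put(name, input.attr("value"));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the real login page.
        String sample = "<form><div class=\"control hidden\">"
                + "<input type=\"hidden\" name=\"token\" value=\"abc123\">"
                + "<input type=\"hidden\" name=\"expires\" value=\"1408147200\">"
                + "</div><input type=\"text\" name=\"userid\"></form>";
        Document page = Jsoup.parse(sample);
        System.out.println(hiddenFields(page));
    }
}
```

Against the live site, the parsed sample would be replaced by the document returned from the first GET request to the login URL; whatever fields show up there are the ones the POST presumably has to echo back.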

You can get these values easily. For the token, you can use this query: div[class="control hidden"] > input[name=token].
Consider changing the user agent of your request to match the browser you use on your PC. That way you'll get the same response from the site, with the same field names, etc.
See a similar question here: how-to-loign website
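Put together, the suggestions above might look roughly like the following sketch. It assumes the field names from the question (`userid`, `password`) and the `token` selector quoted above are what the site expects; the class name `NytLoginSketch`, the helper names, the user-agent string, and the sample markup in `main` are illustrative assumptions. Only `token` is forwarded here; other hidden fields such as `expires` would be added the same way. The sketch also merges the cookies returned by the login POST with those from the first request, which is an assumption about where the session cookies arrive.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class NytLoginSketch {

    static final String LOGIN_URL = "https://myaccount.nytimes.com/auth/login";
    // Assumption: replace with the User-Agent string your own browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";

    // Reads the token with the selector from the answer; returns "" if it is absent.
    static String extractToken(Document loginPage) {
        Element token = loginPage
                .select("div[class=\"control hidden\"] > input[name=token]")
                .first();
        return token == null ? "" : token.attr("value");
    }

    // Logs in, then fetches the article and returns its title and body text.
    static String fetchArticleText(String articleUrl, String email, String password)
            throws IOException {
        // 1. GET the login page to obtain its cookies and hidden fields.
        Connection.Response loginForm = Jsoup.connect(LOGIN_URL)
                .userAgent(USER_AGENT)
                .method(Connection.Method.GET)
                .execute();
        Document loginPage = loginForm.parse();

        // 2. POST the credentials together with the token and the cookies.
        Connection.Response login = Jsoup.connect(LOGIN_URL)
                .userAgent(USER_AGENT)
                .data("token", extractToken(loginPage))
                .data("userid", email)
                .data("password", password)
                .cookies(loginForm.cookies())
                .method(Connection.Method.POST)
                .execute();

        // 3. Merge cookies from both responses (assumed to carry the session).
        Map<String, String> cookies = new HashMap<>(loginForm.cookies());
        cookies.putAll(login.cookies());

        // 4. Request the article with the same user agent and the merged cookies.
        Document article = Jsoup.connect(articleUrl)
                .userAgent(USER_AGENT)
                .cookies(cookies)
                .maxBodySize(Integer.MAX_VALUE)
                .timeout(600000)
                .get();
        return article.title() + "\n" + article.body().text();
    }

    public static void main(String[] args) {
        // Offline check of the selector on hypothetical markup; no network access.
        Document sample = Jsoup.parse("<div class=\"control hidden\">"
                + "<input type=\"hidden\" name=\"token\" value=\"abc123\"></div>");
        System.out.println(extractToken(sample));
    }
}
```

A call such as `fetchArticleText(urlString, myEmail, password)` would then stand in for the two snippets in the question. Whether the site accepts the login still depends on the fields it actually expects, so comparing the request against the one shown in the browser's developer tools remains the way to verify it.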
