Screen scraping - How to save the body content of New York Times links using jsoup
I have been screen scraping different news websites: the Washington Post, the NY Times, and Yahoo message boards. I used jsoup, and it works fine for sites such as the Washington Post. However, when it comes to the NY Times, every approach I've tried has failed. Using the following piece of code gives me only the "Log In - New York Times" content.
String html = Jsoup.connect(urlString).maxBodySize(Integer.MAX_VALUE).timeout(600000).get().html();
doc = Jsoup.parse(html);
result = doc.title() + "\n";
result += doc.body().text();
I also tried collecting cookies and passing them along with the requests, but that didn't work either.
Connection.Response loginForm = Jsoup.connect("https://myaccount.nytimes.com/auth/login")
        .method(Connection.Method.GET).execute();
doc = Jsoup.connect("https://myaccount.nytimes.com/auth/login")
        .data("userid", myEmail).data("password", password)
        .cookies(loginForm.cookies())
        .post();
Map<String, String> loginCookies = loginForm.cookies();
Document doc1 = Jsoup.connect(urlString).maxBodySize(Integer.MAX_VALUE).timeout(600000)
        .cookies(loginCookies).get();
Can anyone give me an approach to save the body content of NY Times URLs?
If you look at the actual data being sent during a 'normal' login, you'll see that besides the cookies, user name, and password, the browser also sends fields such as 'token', 'expires', and so on, which it gets from the first request. Open the developer tools in your browser and you'll see them.
You can get these values easily. For the token, you can use this query: div[class="control hidden"] > input[name=token]
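The selector above can be tried offline before pointing it at the live site. The following is a minimal sketch, assuming jsoup is on the classpath; the class name HiddenFieldSketch, the helper tokenOf, and the sample HTML fragment are made up for illustration and only imitate the structure the selector expects, not the real NY Times login page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HiddenFieldSketch {
    // Hypothetical fragment shaped like the markup the selector targets.
    static final String SAMPLE_HTML =
            "<form><div class=\"control hidden\">"
            + "<input type=\"hidden\" name=\"token\" value=\"abc123\">"
            + "<input type=\"hidden\" name=\"expires\" value=\"1700000000\">"
            + "</div></form>";

    /** Returns the value of the hidden 'token' input, or null if it is absent. */
    static String tokenOf(Document loginPage) {
        Element input = loginPage.selectFirst(
                "div[class=\"control hidden\"] > input[name=token]");
        return input == null ? null : input.val();
    }

    public static void main(String[] args) {
        Document page = Jsoup.parse(SAMPLE_HTML);
        System.out.println(tokenOf(page));
    }
}
```

On the real site, the Document would come from parsing the response of the first GET request to the login page rather than from a string.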
Also consider changing the user agent of your request to match the browser you use on your PC. That way you'll get the same response from the site, with the same field names, etc.
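Putting the advice together, a login flow might look roughly like the sketch below. It is untested against the live site and rests on several assumptions: the login URL and the field names userid and password are taken from the question, the hidden inputs are assumed to be present in the served HTML (not injected by JavaScript), and the User-Agent string is only an example that you should replace with the one your own browser sends. The class name NytLoginSketch and the helpers open and fetchLoggedIn are hypothetical.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NytLoginSketch {
    // Example value only: copy the real string from your browser's developer tools.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";
    static final String LOGIN_URL = "https://myaccount.nytimes.com/auth/login";

    /** Every request in the session shares the same User-Agent and timeout. */
    static Connection open(String url) {
        return Jsoup.connect(url).userAgent(USER_AGENT).timeout(60_000);
    }

    static Document fetchLoggedIn(String articleUrl, String email, String password)
            throws IOException {
        // 1. GET the login page to obtain the initial cookies and hidden fields.
        Connection.Response loginPage =
                open(LOGIN_URL).method(Connection.Method.GET).execute();
        Map<String, String> cookies = new HashMap<>(loginPage.cookies());

        // 2. Copy every named hidden input (token, expires, ...) into the POST.
        Connection post = open(LOGIN_URL).cookies(cookies).method(Connection.Method.POST);
        for (Element input : loginPage.parse().select("form input[type=hidden][name]")) {
            post.data(input.attr("name"), input.val());
        }

        // 3. Submit the credentials and keep the cookies set by the login response.
        Connection.Response loggedIn =
                post.data("userid", email).data("password", password).execute();
        cookies.putAll(loggedIn.cookies());

        // 4. Fetch the article with the session cookies; maxBodySize(0) means no limit.
        return open(articleUrl).cookies(cookies).maxBodySize(0).get();
    }
}
```

Note that the cookies sent with the article request include those returned by the login POST, not just those from the initial GET; the code in the question reuses only the cookies from the first request.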
See a similar question here: how-to-loign website