html - Extract URLs from Google search result page


I'm trying to grab URLs off of a Google search page, and there are two ways I can think of to do it, but I don't have any idea how to do either of them.

First, scrape them from the .r tags and get the href attribute of each link. However, this gives me a long string that I have to parse through to get the URL. Here's an example of what would have to be parsed through:

https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=utf-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=u&ei=n8nvvdsvbmosyatszykocq&ved=0ceuqfjal&usg=afqjcngyd5njsqoncyleljt9c0hqvq7gya

The URL I want out of that would be:

https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/

So I would have to create a string from what lies between https and &sa, which I'm not 100% sure how to do, because each long string Google gives me is a different size, so using a slice and cutting off "x" amount of characters wouldn't work.

Second, underneath each link in a Google search there is the URL in green text. Right-clicking and inspecting that element gives: cite class="_rm" (between chevrons), which I don't know how to find with goquery, because looking for cite with my small function just gives me more long strings of characters.

Here is my small function. It does the first option without the parsing, and gives me a long string of text that just takes me to the search page:

    import (
        "fmt"

        "github.com/PuerkitoBio/goquery"
    )

    func getURLs(url string) {
        doc, err := goquery.NewDocument(url)
        if err != nil {
            panic(err)
        }

        doc.Find(".r").Each(func(i int, s *goquery.Selection) {
            doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
                link, _ := s.Attr("href")
                // Prepending the search page URL is what produces the long string above.
                link = url + link
                fmt.Printf("link [%s]\n", link)
            })
        })
    }

The standard library has support for parsing URLs. Check out the net/url package. Using this package, we can get the query parameters from URLs.
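As a minimal sketch of what net/url gives you, the snippet below parses a shortened version of the search URL from the question and lists its query parameters with their decoded values. The helper name parseParams and the shortened URL are only illustrative assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// parseParams is a hypothetical helper: it parses raw and returns
// its query parameters with percent-encoding already decoded.
func parseParams(raw string) (url.Values, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return u.Query(), nil
}

func main() {
	// Shortened form of the search URL from the question (assumption).
	raw := "https://www.google.com/search?ie=utf-8&q=mh4u%20items&oq=mh4u%20items"

	params, err := parseParams(raw)
	if err != nil {
		panic(err)
	}

	// Map iteration order is random, so sort the keys for stable output.
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		fmt.Printf("%s = %q\n", k, params.Get(k))
	}
	// Prints:
	// ie = "utf-8"
	// oq = "mh4u items"
	// q = "mh4u items"
}
```

Note that Query() decodes %20 to a space, so no manual unescaping is needed.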

Note that your original raw URL contains the URL you want to extract in the "aqs" parameter, in the form of

chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/ 

which is itself another URL.

Let's write a little helper function which gets a parameter from a raw URL text:

    // getParam parses raw as a URL and returns the value of the query parameter param.
    func getParam(raw, param string) (string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return "", err
        }

        q := u.Query()
        if q == nil {
            return "", fmt.Errorf("no query part")
        }

        v := q.Get(param)
        if v == "" {
            return "", fmt.Errorf("param not found")
        }
        return v, nil
    }

Using this we can get the "aqs" parameter from the original URL, and using it again on that value we can get the "q" parameter, which is exactly your desired URL:

    raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=utf-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=u&ei=n8nvvdsvbmosyatszykocq&ved=0ceuqfjal&usg=afqjcngyd5njsqoncyleljt9c0hqvq7gya"

    // First pass: the "aqs" value is itself a URL with its own query part.
    aqs, err := getParam(raw, "aqs")
    if err != nil {
        panic(err)
    }
    fmt.Println(aqs)

    // Second pass: the "q" parameter of that inner URL is the target.
    result, err := getParam(aqs, "q")
    if err != nil {
        panic(err)
    }
    fmt.Println(result)

Output (try it on the Go Playground):

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
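As an alternative sketch, the href could be parsed on its own, before it is concatenated with the search page URL. This assumes each result's href is a relative link of the form /url?q=<target>&sa=..., which is what the long string in the question suggests. The name targetFromHref is hypothetical and not part of any library.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// targetFromHref is a hypothetical helper. It takes the href of a result
// link (assumed to look like "/url?q=<target>&sa=...") and returns the
// value of its "q" parameter, which is the target URL.
func targetFromHref(href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	target := u.Query().Get("q")
	if target == "" {
		return "", errors.New("href has no q parameter")
	}
	return target, nil
}

func main() {
	// The tail of the long string from the question, i.e. the href
	// without the search page URL prepended (assumption).
	href := "/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=u&ei=n8nvvdsvbmosyatszykocq&ved=0ceuqfjal&usg=afqjcngyd5njsqoncyleljt9c0hqvq7gya"

	target, err := targetFromHref(href)
	if err != nil {
		panic(err)
	}
	fmt.Println(target)
}
```

With goquery, this helper would be called on the value returned by s.Attr("href") instead of building url + link, so only one parsing pass is needed.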
