Extract URLs from Google search result page
I'm trying to grab URLs off of a Google search page, and there are two ways I can think of to do it, but I don't have any idea how to do them.
First, I could scrape them from the .r tags and get the href attribute of each link. However, this gives me a long string that I have to parse through to get the URL. Here's an example of what I would have to parse through:
The URL I want out of that would be:
So I would have to create a string from whatever lies between https and &sa. I'm not 100% sure how to do that, because each long string Google gives me is a different size, so just using slice and cutting off "x" amount of characters wouldn't work.
Second, underneath each link in a Google search there is the URL in green text. Right-clicking it and inspecting the element gives: cite class="_rm" (between chevrons), which I don't know how to find with goquery, because looking for cite with my small function just gives me more long strings of characters.
Here is my small function. It does the first option without the parsing, and gives me a long string of text that just takes me to the search page:
    func geturls(url string) {
        doc, err := goquery.NewDocument(url)
        if err != nil {
            panic(err)
        }
        doc.Find(".r").Each(func(i int, s *goquery.Selection) {
            doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
                link, _ := s.Attr("href")
                link = url + link
                fmt.Printf("link [%s]\n", link)
            })
        })
    }
The standard library has support for parsing URLs. Check out the net/url package. Using this package, we can get the query parameters from URLs.
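As a minimal sketch of what net/url offers here: url.Parse splits a URL into its parts, and the Query method returns the decoded query parameters. The parseQuery helper and the example.com URL are made up for illustration and are not taken from the question.

```go
package main

import (
	"fmt"
	"net/url"
)

// parseQuery is a hypothetical helper: it parses raw as a URL and
// returns its decoded query parameters.
func parseQuery(raw string) (url.Values, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return u.Query(), nil
}

func main() {
	// example.com is a placeholder, not a URL from the question.
	q, err := parseQuery("https://example.com/search?q=mh4u%20items&page=2")
	if err != nil {
		panic(err)
	}
	fmt.Println(q.Get("q"))          // percent-encoding is decoded
	fmt.Println(q.Get("page"))       // value of the "page" parameter
	fmt.Println(q.Get("none") == "") // Get returns "" for a missing key
}
```

Because Get returns an empty string for a missing key, a caller that needs to tell "missing" from "empty" has to check for the key explicitly.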
Note that your original raw URL contains the URL you want to extract in its "aqs" parameter, in the form of
chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
which is itself another URL.
Let's write a little helper function which gets a parameter from such raw URL text:
    func getparam(raw, param string) (string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return "", err
        }
        q := u.Query()
        if q == nil {
            return "", fmt.Errorf("no query part")
        }
        v := q.Get(param)
        if v == "" {
            return "", fmt.Errorf("param not found")
        }
        return v, nil
    }
Using this we can get the "aqs" parameter from the original URL, and using it again on that value we can get the "q" parameter, which is exactly your desired URL:
    raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=utf-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=u&ei=n8nvvdsvbmosyatszykocq&ved=0ceuqfjal&usg=afqjcngyd5njsqoncyleljt9c0hqvq7gya"
    aqs, err := getparam(raw, "aqs")
    if err != nil {
        panic(err)
    }
    fmt.Println(aqs)
    result, err := getparam(aqs, "q")
    fmt.Println(result)
Output (try it on the Go Playground):
    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
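The same package may also work directly on the href values scraped from the result links. The sketch below assumes each href is a relative link of the form /url?q=<target>&sa=..., which is a guess based on the "/url?q=" and "&sa" fragments mentioned in the question and is not something the question confirms. The targetFromHref helper and the example.com href are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// targetFromHref is a hypothetical helper. It assumes href looks like
// "/url?q=<target>&sa=..." and returns the value of the "q" parameter.
func targetFromHref(href string) (string, error) {
	u, err := url.Parse(href) // relative references are accepted
	if err != nil {
		return "", err
	}
	if u.Path != "/url" {
		return "", errors.New("not a /url redirect link")
	}
	target := u.Query().Get("q")
	if target == "" {
		return "", errors.New("no q parameter")
	}
	return target, nil
}

func main() {
	// Placeholder href; its values are made up for illustration.
	href := "/url?q=https://example.com/some/page/&sa=U"
	target, err := targetFromHref(href)
	if err != nil {
		panic(err)
	}
	fmt.Println(target)
}
```

Checking the path first means that links which are not redirect links produce an error rather than a wrong URL, so the caller can skip them. This also avoids slicing the string at fixed offsets, since the parser finds the parameter boundaries regardless of length.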