I have previously written a post on scraping Google with Python. As I am starting to write more Golang, I thought I should write the same tutorial using Golang. Why not scrape Google search results using Google’s home-grown programming language?
Imports & Setup
    package main

    import (
        "fmt"
        "net/http"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )
This example uses only one external dependency. While it is possible to parse HTML with the lower-level golang.org/x/net/html package maintained by the Go team, this involves writing a lot of code. So instead we are going to use the very popular Golang library Goquery, which supports JQuery-style selection of HTML elements.
Defining What To Return
We can get a variety of different information from Google, but we typically want to return a result’s position, URL, title and description. In Golang it makes sense to create a struct representing the data we want to be gathered by our scraper.
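The struct itself does not appear in the text, so here is a minimal sketch of what it might look like. The field names are assumptions; only the four pieces of data (position, URL, title and description) and their order come from the article.

```go
package main

import "fmt"

// GoogleResult holds one organic search result.
// The field names are assumed for illustration; the article only
// states which four pieces of data are collected.
type GoogleResult struct {
	ResultRank  int
	ResultURL   string
	ResultTitle string
	ResultDesc  string
}

func main() {
	r := GoogleResult{
		ResultRank:  1,
		ResultURL:   "https://example.com",
		ResultTitle: "Example Domain",
		ResultDesc:  "A placeholder description.",
	}
	fmt.Printf("%d: %s (%s)\n", r.ResultRank, r.ResultTitle, r.ResultURL)
}
```

Keeping the rank as an int and the rest as strings means the struct can also be filled positionally, in the order rank, URL, title, description.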
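    // googleDomains maps a country code to that country's Google search base URL.
    var googleDomains = map[string]string{
        "com": "https://www.google.com/search?q=",
        "ru":  "https://www.google.ru/search?q=",
    }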
This allows us to pass a country code to our scraping function and scrape results from that particular version of Google. Using the different base domains in combination with a language code allows us to scrape results as they appear in the country in question.
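    func buildGoogleUrl(searchTerm string, countryCode string, languageCode string) string {
        // Remove leading and trailing spaces, then join the words with "+".
        searchTerm = strings.Trim(searchTerm, " ")
        searchTerm = strings.Replace(searchTerm, " ", "+", -1)
        if googleBase, found := googleDomains[countryCode]; found {
            return fmt.Sprintf("%s%s&num=100&hl=%s", googleBase, searchTerm, languageCode)
        }
        // Unknown country code: fall back to the default ".com" domain.
        return fmt.Sprintf("%s%s&num=100&hl=%s", googleDomains["com"], searchTerm, languageCode)
    }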
We then write a function that allows us to build a Google search URL. The function takes three arguments, all strings, and returns the URL, also as a string. We first trim the search term to remove any leading or trailing white-space. We then replace any remaining spaces with ‘+’; the -1 in this line of code means that we replace every remaining space with a plus, rather than only the first few.
We then look up the country code passed as an argument against the map we defined earlier. If the countryCode is found in our map, we use the respective URL from the map, otherwise we use the default ‘.com’ Google site. We then use the fmt package’s “Sprintf” function to format a string made up of our base URL, our search term and language code. We don’t check the validity of the language code, which is something we might want to do if we were writing a more fully featured scraper.
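As a rough sketch of the validation mentioned above, the language code could be checked against an allow-list before it is placed in the URL. The names supportedLanguages, normalizeLanguageCode and escapeSearchTerm are hypothetical and not part of the original scraper, and the list of languages is only an example. The sketch also shows url.QueryEscape from the standard library as one possible alternative to replacing spaces by hand, since it encodes characters such as ‘&’ as well.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// supportedLanguages is a hypothetical allow-list; extend it as needed.
var supportedLanguages = map[string]bool{
	"en": true,
	"ru": true,
	"fr": true,
}

// normalizeLanguageCode trims and lower-cases the code, falling back to
// "en" when the code is not in the allow-list.
func normalizeLanguageCode(code string) string {
	code = strings.ToLower(strings.TrimSpace(code))
	if supportedLanguages[code] {
		return code
	}
	return "en"
}

// escapeSearchTerm trims the term and encodes it for use in a query
// string; url.QueryEscape turns spaces into "+".
func escapeSearchTerm(term string) string {
	return url.QueryEscape(strings.TrimSpace(term))
}

func main() {
	fmt.Println(normalizeLanguageCode(" EN "))
	fmt.Println(normalizeLanguageCode("xx"))
	fmt.Println(escapeSearchTerm(" golang web scraping "))
}
```

Falling back to a default language mirrors how the URL builder falls back to the ‘.com’ domain; returning an error instead would be an equally reasonable choice.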
    func googleRequest(searchURL string) (*http.Response, error) {
        baseClient := &http.Client{}
        req, err := http.NewRequest("GET", searchURL, nil)
        if err != nil {
            // The URL could not be parsed into a valid request.
            return nil, err
        }
        // Present ourselves as a regular desktop browser.
        req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36")
        res, err := baseClient.Do(req)
        if err != nil {
            return nil, err
        }
        return res, nil
    }
We can now write a function to make a request. Go has a very easy to use and powerful “net/http” library, which makes it relatively easy to make HTTP requests. We first create a client to make our request with. We then build a new HTTP request, which will eventually be executed using our client. Building the request separately allows us to set custom headers to be sent with it. In this instance we are replicating the User-Agent header of a real browser.
We then execute this request, with the client’s Do method returning us a response and error. If something went wrong with the request we return a nil value and the error. Otherwise we simply return the response object and a nil value to show that we did not encounter an error.
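One way to check that the custom header really is sent, without touching Google at all, is to point the same kind of request at a local test server from the standard library’s “net/http/httptest” package. This is only a sketch: fetchWithUserAgent and echoUserAgent are hypothetical names, and the client timeout is an addition that the original function does not have.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

const userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"

// fetchWithUserAgent mirrors the request function described above.
// The timeout is an assumed addition so a stalled server cannot hang us.
func fetchWithUserAgent(targetURL string) (*http.Response, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	req, err := http.NewRequest("GET", targetURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return client.Do(req)
}

// echoUserAgent starts a local test server that replies with the
// User-Agent it received, fetches it, and returns the response body.
func echoUserAgent() (string, error) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("User-Agent"))
	}))
	defer server.Close()

	res, err := fetchWithUserAgent(server.URL)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	got, err := echoUserAgent()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("server saw User-Agent:", got)
}
```

Testing against a local server also avoids sending needless traffic to the real site while the scraper is still being developed.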
Parsing the Result
Now we move on to parsing the result of our request. Compared with Python, Go has fewer HTML parsing libraries to choose from, and none comes close to BeautifulSoup for ease of use. In this example, we are going to use the very popular Goquery package, which uses JQuery-style selectors to allow users to extract data from HTML documents.
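    func googleResultParser(response *http.Response) ([]GoogleResult, error) {
        doc, err := goquery.NewDocumentFromResponse(response)
        if err != nil {
            return nil, err
        }
        results := []GoogleResult{}
        // Each organic result sits in a div with the class "g".
        sel := doc.Find("div.g")
        rank := 1
        for i := range sel.Nodes {
            item := sel.Eq(i)
            linkTag := item.Find("a")
            link, _ := linkTag.Attr("href")
            titleTag := item.Find("h3") // assumed selector for the result title
            descTag := item.Find("span.st")
            desc := descTag.Text()
            title := titleTag.Text()
            link = strings.Trim(link, " ")
            // Skip empty links and in-page navigational references.
            if link != "" && link != "#" {
                result := GoogleResult{
                    rank,
                    link,
                    title,
                    desc,
                }
                results = append(results, result)
                rank += 1
            }
        }
        return results, nil
    }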
We generate a goquery document from our response, and if we encounter any errors we simply return a nil slice and the error. We then create an empty slice of Google results which we will eventually append results to. On a Google results page, each organic result can be found in a ‘div’ block with the class ‘g’. So we can simply use the JQuery selector “div.g” to pick out all of the organic links.
We then loop through each of these ‘div’ tags, finding the link and its href attribute, as well as extracting the title and meta description. Provided the link isn’t an empty string or a navigational reference, we create a GoogleResult struct holding our information. This is then appended to the slice of structs which we defined earlier. Finally, we increment the rank so we can tell the order in which the results appeared on the page.
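The filtering and ranking step can be looked at in isolation, without any HTML parsing. The sketch below uses only the standard library and a hypothetical rankedLink type in place of the full result struct. It shows that skipped links do not consume a rank, so the positions stay consecutive.

```go
package main

import (
	"fmt"
	"strings"
)

// rankedLink pairs a link with its position; a hypothetical stand-in
// for the result struct used by the parser.
type rankedLink struct {
	Rank int
	URL  string
}

// rankLinks drops empty and "#" links and numbers the rest from 1.
func rankLinks(hrefs []string) []rankedLink {
	ranked := []rankedLink{}
	rank := 1
	for _, href := range hrefs {
		link := strings.TrimSpace(href)
		if link == "" || link == "#" {
			continue // skipped entries do not consume a rank
		}
		ranked = append(ranked, rankedLink{Rank: rank, URL: link})
		rank++
	}
	return ranked
}

func main() {
	hrefs := []string{"https://a.example", "", "#", " https://b.example "}
	for _, r := range rankLinks(hrefs) {
		fmt.Printf("%d %s\n", r.Rank, r.URL)
	}
}
```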