Welcome to part 13 of the Go programming tutorial series, where we're going to continue building our news aggregator. Once we have grabbed the list of sitemaps, we need to parse those sitemaps. That said, we've also got some cleaning up of the XML parser that we can do. Leading up to here, our code is:
package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
)

type Sitemapindex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

func (l Location) String() string {
	return fmt.Sprintf(l.Loc)
}

func main() {
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)

	var s Sitemapindex
	xml.Unmarshal(bytes, &s)
	//fmt.Println(s.Locations)

	for _, Location := range s.Locations {
		fmt.Printf("%s\n", Location)
	}
}
While this code works, it sets us up for quite a mess as we expand things. I wanted to show it fully broken down, but since each of our types holds a single value, we can easily combine them. Our Sitemapindex just contains a slice of the Location type. The Location type, in turn, just wraps a single string, so we can instead make the Sitemapindex Locations value a slice of strings. If we do this, we no longer need the String method either. Our code can instead be:
package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
)

type Sitemapindex struct {
	Locations []string `xml:"sitemap>loc"`
}

func main() {
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)

	var s Sitemapindex
	xml.Unmarshal(bytes, &s)

	for _, Location := range s.Locations {
		fmt.Printf("%s\n", Location)
	}
}
Much cleaner! Note the use of the > in `xml:"sitemap>loc"`. This is used to specify nested tags: the loc tag is found inside the sitemap tag.
Okay, so now let's work on our next step, which is parsing each sitemap in this list. An example sitemap that we'll get linked to is http://www.washingtonpost.com/news-technology-sitemap.xml, which looks like:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:n="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.google.com/schemas/sitemap-news/0.9 http://www.google.com/schemas/sitemap-news/0.9/sitemap-news.xsd">
  <url>
    <loc>https://www.washingtonpost.com/business/technology/un-adds-32-items-to-list-of-prohibited-goods-for-north-korea/2017/10/23/5f112818-b812-11e7-9b93-b97043e57a22_story.html</loc>
    <changefreq>hourly</changefreq>
    <n:news>
      <n:publication>
        <n:name>Washington Post</n:name>
        <n:language>en</n:language>
      </n:publication>
      <n:publication_date>2017-10-23T22:12:20Z</n:publication_date>
      <n:title>UN adds 32 items to list of prohibited goods for North Korea</n:title>
      <n:keywords> UN-United Nations-North Korea-Sanctions,North Korea,East Asia,Asia,United Nations Security Council,United Nations,Business,General news,Sanctions and embargoes,Foreign policy,International relations,Government and politics,Government policy,Military technology,Technology</n:keywords>
    </n:news>
  </url>
  <url>
    <loc>https://www.washingtonpost.com/business/technology/cisco-systems-buying-broadsoft-for-19-billion-cash/2017/10/23/ae024774-b7f2-11e7-9b93-b97043e57a22_story.html</loc>
    <changefreq>hourly</changefreq>
    <n:news>
      <n:publication>
        <n:name>Washington Post</n:name>
        <n:language>en</n:language>
      </n:publication>
      <n:publication_date>2017-10-23T21:42:14Z</n:publication_date>
      <n:title>Cisco Systems buying BroadSoft for $1.9 billion cash</n:title>
      <n:keywords> US-Cisco-BroadSoft-Acquisition,Cisco Systems Inc,Business,Technology,Communication technology</n:keywords>
    </n:news>
  </url>
</urlset>
Remember, if these links change, you can still follow along by converting these samples to bytes, or you can use your favorite site's sitemap instead and adapt the code.
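If the live URL is unavailable, one way to follow along offline is to paste a sample into the program as a byte slice and unmarshal that instead of the HTTP response. The sketch below does this for the sitemap index and also shows the `sitemap>loc` path at work. The XML here is a made-up, trimmed sitemap index with placeholder example.com URLs, and parseIndex is just an illustrative helper name, not something from the original code:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Sitemapindex mirrors the struct used in this tutorial.
type Sitemapindex struct {
	Locations []string `xml:"sitemap>loc"`
}

// sampleIndex is a made-up, trimmed sitemap index that stands in for the
// live response. The URLs are placeholders (assumption, not real data).
var sampleIndex = []byte(`<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/news-politics-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/news-technology-sitemap.xml</loc></sitemap>
</sitemapindex>`)

// parseIndex (hypothetical helper) unmarshals raw sitemap-index bytes.
func parseIndex(data []byte) (Sitemapindex, error) {
	var s Sitemapindex
	err := xml.Unmarshal(data, &s)
	return s, err
}

func main() {
	s, err := parseIndex(sampleIndex)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Each <loc> nested inside a <sitemap> becomes one entry in the slice.
	for _, loc := range s.Locations {
		fmt.Println(loc)
	}
}
```

Since xml.Unmarshal only needs a []byte, the rest of the program does not care whether the bytes came from ioutil.ReadAll or from a literal like this one.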
In the technology sitemap, we're still interested in the loc tags, but we'd also like to grab the title tag and probably the keywords tag. Not every sitemap will provide keywords, but a news sitemap will almost always come with a title and a link. We'll grab the keywords tag anyway.
Let's start by making a struct that will accept the title, link, and keywords for entries:
type News struct {
	Titles    []string `xml:"url>news>title"`
	Keywords  []string `xml:"url>news>keywords"`
	Locations []string `xml:"url>loc"`
}
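The News struct is much like our Sitemapindex struct, just with more values and varying tag paths. Note that the paths use the plain element names (news, title, keywords); the n: namespace prefix from the sample is not part of the path.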
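To see what the News struct picks up, here is a small sketch that unmarshals a cut-down version of the sample sitemap from a byte slice. The titles are taken from the sample, but the keywords are trimmed, the loc values are example.com placeholders, and parseNews is an illustrative helper name, so treat the data as an assumption rather than the real feed:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// News mirrors the struct defined in this tutorial.
type News struct {
	Titles    []string `xml:"url>news>title"`
	Keywords  []string `xml:"url>news>keywords"`
	Locations []string `xml:"url>loc"`
}

// sampleNews is a trimmed stand-in for the technology sitemap. The loc
// values are placeholders and the keyword lists are shortened (assumption).
var sampleNews = []byte(`<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:n="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/un-story.html</loc>
    <n:news>
      <n:title>UN adds 32 items to list of prohibited goods for North Korea</n:title>
      <n:keywords>North Korea,United Nations</n:keywords>
    </n:news>
  </url>
  <url>
    <loc>https://example.com/cisco-story.html</loc>
    <n:news>
      <n:title>Cisco Systems buying BroadSoft for $1.9 billion cash</n:title>
      <n:keywords>Business,Technology</n:keywords>
    </n:news>
  </url>
</urlset>`)

// parseNews (hypothetical helper) unmarshals raw sitemap bytes into News.
func parseNews(data []byte) (News, error) {
	var n News
	err := xml.Unmarshal(data, &n)
	return n, err
}

func main() {
	n, err := parseNews(sampleNews)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The three slices are filled independently, one entry per matching tag.
	// Indexing them together assumes every <url> has all three tags.
	for i := range n.Locations {
		fmt.Printf("%s\n  %s\n  %s\n", n.Titles[i], n.Keywords[i], n.Locations[i])
	}
}
```

One thing to keep in mind with this layout: because Titles, Keywords, and Locations are separate slices, an entry that is missing its keywords tag would leave the slices with different lengths, and the indexes would no longer line up.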
Next, let's visit our main function, which will still start with:
func main() {
	var s Sitemapindex

	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)
	xml.Unmarshal(bytes, &s)

	for _, Location := range s.Locations {
		fmt.Printf("%s\n", Location)
	}
}
Now though, we want to actually visit each Location, which itself is a sitemap that links to actual content. Let's start by adding var n News before the for loop begins, and then, inside the loop, let's make our request, get the response, and unpack it into n, all using the same concepts we've covered up to now:
func main() {
	var s Sitemapindex
	var n News

	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)
	xml.Unmarshal(bytes, &s)

	for _, Location := range s.Locations {
		// Fetch each linked sitemap and unpack its entries into n.
		resp, _ := http.Get(Location)
		bytes, _ := ioutil.ReadAll(resp.Body)
		resp.Body.Close() // done with this response
		xml.Unmarshal(bytes, &n)
	}
}
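The snippets in this tutorial discard every error with _ to stay short, which means a failed request shows up as a crash on resp.Body rather than a readable message. A slightly more defensive fetch-and-parse step might look like the sketch below. fetchXML is a hypothetical helper name, not part of the tutorial's code, and the local test server with its one-line made-up response only stands in for the real sitemap URL so the example runs offline:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
)

// Sitemapindex mirrors the struct used in this tutorial.
type Sitemapindex struct {
	Locations []string `xml:"sitemap>loc"`
}

// fetchXML (hypothetical helper) downloads url and unmarshals the body
// into v, returning the first error it hits instead of discarding it.
func fetchXML(url string, v interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close() // release the connection when we are done

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: unexpected status %s", url, resp.Status)
	}
	data, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	return xml.Unmarshal(data, v)
}

func main() {
	// A local test server stands in for the real site (assumption), serving
	// a made-up one-entry sitemap index.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<sitemapindex><sitemap><loc>https://example.com/a.xml</loc></sitemap></sitemapindex>`)
	}))
	defer srv.Close()

	var s Sitemapindex
	if err := fetchXML(srv.URL, &s); err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(s.Locations)
}
```

In the aggregator itself, the same helper could be called once for the index and once per Location inside the loop, skipping any sitemap that returns an error.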
Now, our n variable should contain data from all of the sitemaps, since Unmarshal appends to the existing slices on each pass through the loop, but this isn't quite the format we want. Often, we want a "key" and "value" structure, known as a hash table. Python uses dictionaries for this. In Go, we use a map, which is what we're going to be covering in the next tutorial.
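As a quick check on the idea that a single n ends up holding entries from every sitemap, the sketch below unmarshals two tiny documents into the same News value, with no network involved. The two documents and the collect helper are made up for illustration:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// News mirrors the struct defined in this tutorial.
type News struct {
	Titles    []string `xml:"url>news>title"`
	Keywords  []string `xml:"url>news>keywords"`
	Locations []string `xml:"url>loc"`
}

// sitemaps holds two made-up, one-entry sitemaps (assumption, not real data).
var sitemaps = [][]byte{
	[]byte(`<urlset><url><loc>https://example.com/1</loc><news><title>First story</title><keywords>a,b</keywords></news></url></urlset>`),
	[]byte(`<urlset><url><loc>https://example.com/2</loc><news><title>Second story</title><keywords>c,d</keywords></news></url></urlset>`),
}

// collect (hypothetical helper) unmarshals every document into one News
// value, the same way the loop in main reuses n.
func collect(docs [][]byte) (News, error) {
	var n News
	for _, doc := range docs {
		// Unmarshal extends the existing slices rather than replacing them.
		if err := xml.Unmarshal(doc, &n); err != nil {
			return n, err
		}
	}
	return n, nil
}

func main() {
	n, err := collect(sitemaps)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(len(n.Titles), n.Titles)
}
```

Both titles survive because encoding/xml grows a slice field for each matching element it finds, so the second call adds to what the first call stored.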