Python Programming Tutorials

Continuing our Go Web application

Welcome to part 13 of the Go programming tutorial series, where we're going to continue building our news aggregator. Once we have grabbed the list of sitemaps, we need to parse those sitemaps. That said, we've also got some cleaning up of the XML parser that we can do. Leading up to here, our code is:

package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
)

type Sitemapindex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

func (l Location) String() string {
	// Loc is already a string; using it as a Sprintf format would mangle any % in the URL.
	return l.Loc
}

func main() {
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)

	var s Sitemapindex
	xml.Unmarshal(bytes, &s)
	//fmt.Println(s.Locations)

	for _, Location := range s.Locations {
		fmt.Printf("%s\n", Location)
	}
}

While this code works, it sets us up for quite a mess as we expand things. I wanted to show it fully broken down first, but since our types each hold a single value, we can easily combine them. Our Sitemapindex just contains a slice of the Location type, and Location just wraps a string, so we can instead make the Sitemapindex Locations field a slice of strings. If we do this, we no longer need the String method either. Our code can instead be:

package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
)

type Sitemapindex struct {
	Locations []string `xml:"sitemap>loc"`
}

func main() {
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)

	var s Sitemapindex
	xml.Unmarshal(bytes, &s)

	for _, Location := range s.Locations {
		fmt.Printf("%s\n", Location)
	}
}

Much cleaner! Note the use of the > in `xml:"sitemap>loc"`. This is used to specify nested tags: the loc tag is found inside the sitemap tag. Okay, so now let's work on our next step, which is parsing each sitemap in this list. An example sitemap that we'll get linked to is http://www.washingtonpost.com/news-technology-sitemap.xml, which looks like:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
	xmlns:n="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
       http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd
       http://www.google.com/schemas/sitemap-news/0.9
       http://www.google.com/schemas/sitemap-news/0.9/sitemap-news.xsd">
	<url>
			<loc>https://www.washingtonpost.com/business/technology/un-adds-32-items-to-list-of-prohibited-goods-for-north-korea/2017/10/23/5f112818-b812-11e7-9b93-b97043e57a22_story.html</loc>
			<changefreq>hourly</changefreq>
			<n:news>
				<n:publication>
					<n:name>Washington Post</n:name>
					<n:language>en</n:language>
				</n:publication>
				<n:publication_date>2017-10-23T22:12:20Z</n:publication_date>
				<n:title>UN adds 32 items to list of prohibited goods for North Korea</n:title>
				<n:keywords>
					UN-United Nations-North Korea-Sanctions,North Korea,East Asia,Asia,United Nations Security Council,United Nations,Business,General news,Sanctions and embargoes,Foreign policy,International relations,Government and politics,Government policy,Military technology,Technology</n:keywords>
			</n:news>
		</url>
	<url>
			<loc>https://www.washingtonpost.com/business/technology/cisco-systems-buying-broadsoft-for-19-billion-cash/2017/10/23/ae024774-b7f2-11e7-9b93-b97043e57a22_story.html</loc>
			<changefreq>hourly</changefreq>
			<n:news>
				<n:publication>
					<n:name>Washington Post</n:name>
					<n:language>en</n:language>
				</n:publication>
				<n:publication_date>2017-10-23T21:42:14Z</n:publication_date>
				<n:title>Cisco Systems buying BroadSoft for $1.9 billion cash</n:title>
				<n:keywords>
					US-Cisco-BroadSoft-Acquisition,Cisco Systems Inc,Business,Technology,Communication technology</n:keywords>
			</n:news>
		</url>
	</urlset>

Remember, if these links change or go away, you can still follow along by pasting these samples into your program as a string and converting it to bytes, or you can use your favorite site's sitemap instead and adapt the code.
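As a rough sketch of that offline approach, the program below skips the network entirely and unmarshals a sample held in a string. The sample index and its example.com URLs are made up for illustration, and parseIndex is just a hypothetical helper name, not something from earlier in the series:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Sitemapindex is the same struct used in the tutorial code above.
type Sitemapindex struct {
	Locations []string `xml:"sitemap>loc"`
}

// sampleIndex is a made-up sitemap index (hypothetical example.com URLs),
// standing in for the body that http.Get would normally give us.
const sampleIndex = `<sitemapindex>
	<sitemap><loc>https://example.com/news-politics-sitemap.xml</loc></sitemap>
	<sitemap><loc>https://example.com/news-technology-sitemap.xml</loc></sitemap>
</sitemapindex>`

// parseIndex (hypothetical helper) unmarshals raw sitemap-index XML.
func parseIndex(data []byte) (Sitemapindex, error) {
	var s Sitemapindex
	err := xml.Unmarshal(data, &s)
	return s, err
}

func main() {
	// []byte(sampleIndex) yields the same type that ioutil.ReadAll returns.
	s, err := parseIndex([]byte(sampleIndex))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, loc := range s.Locations {
		fmt.Println(loc)
	}
}
```

Swapping the string for a saved copy of a real sitemap should work the same way, as long as the tag paths in the struct match the document.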

Here, we're still interested in the loc tags, but we'd also like to grab the title and probably the keywords tag. Not every sitemap will provide keywords, but a news sitemap will almost always come with a title and a link. We'll grab the keywords tag anyway.

Let's start by making a struct that will accept the title, link, and keywords for entries:

type News struct {
	Titles    []string `xml:"url>news>title"`
	Keywords  []string `xml:"url>news>keywords"`
	Locations []string `xml:"url>loc"`
}

The News struct is much like our Sitemapindex struct, just with more values and varying tag paths.
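To see those tag paths in action without hitting the network, here is a minimal sketch that unmarshals a cut-down sitemap into News. The sample entries and their example.com URLs are invented, and parseNews and cleanKeywords are hypothetical helper names:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type News struct {
	Titles    []string `xml:"url>news>title"`
	Keywords  []string `xml:"url>news>keywords"`
	Locations []string `xml:"url>loc"`
}

// sampleSitemap is a trimmed, made-up stand-in for a real news sitemap.
// The n: prefix is kept to show that the tag paths match on the local name.
const sampleSitemap = `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
	xmlns:n="http://www.google.com/schemas/sitemap-news/0.9">
	<url>
		<loc>https://example.com/story-one.html</loc>
		<n:news>
			<n:title>Story one</n:title>
			<n:keywords>
				Business,Technology</n:keywords>
		</n:news>
	</url>
	<url>
		<loc>https://example.com/story-two.html</loc>
		<n:news>
			<n:title>Story two</n:title>
			<n:keywords>
				Politics</n:keywords>
		</n:news>
	</url>
</urlset>`

// parseNews (hypothetical helper) unmarshals raw sitemap XML into News.
func parseNews(data []byte) (News, error) {
	var n News
	err := xml.Unmarshal(data, &n)
	return n, err
}

// cleanKeywords strips the newline and indentation that surround the
// keywords text in the sample.
func cleanKeywords(s string) string {
	return strings.TrimSpace(s)
}

func main() {
	n, err := parseNews([]byte(sampleSitemap))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The three slices line up by index only because every <url> in this
	// sample has a title, keywords, and loc.
	for i, title := range n.Titles {
		fmt.Println(title, "|", n.Locations[i], "|", cleanKeywords(n.Keywords[i]))
	}
}
```

One thing to keep in mind with this layout: the three slices are filled independently, so if some entry is missing its keywords tag, the indexes stop lining up.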

Next, let's visit our main function, which will still start with:

func main() {
	var s Sitemapindex
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)
	xml.Unmarshal(bytes, &s)

	for _, Location := range s.Locations {
		fmt.Printf("%s\n", Location)
	}
}

Now though, we want to actually visit each Location, which is itself a sitemap that links to actual content. Let's start by adding var n News before the for loop begins, and then, inside the loop, let's make our request, read the response, and unmarshal it into n, all using the same concepts we've covered up to now:

func main() {
	var s Sitemapindex
	var n News
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)
	xml.Unmarshal(bytes, &s)

	for _, Location := range s.Locations {
		resp, _ := http.Get(Location)
		bytes, _ := ioutil.ReadAll(resp.Body)
		xml.Unmarshal(bytes, &n)
	}
}

Now, our n var should contain data from all of the sitemaps, since Unmarshal appends to the slices on each pass, but this isn't quite the format we want. Many times, we want a "key" and "value" lookup system, commonly implemented as a hash table. Python uses dictionaries for this. In Go, we use a map, which is what we're going to cover in the next tutorial.
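If you'd like to check that accumulation for yourself without depending on a live site, the sketch below runs the same index-then-sitemaps loop against a local test server from net/http/httptest. Everything served here is made up, and fetchXML, collectNews, entry, and newTestServer are hypothetical helper names. Unlike the tutorial code, it also returns errors and closes each response body:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
)

type Sitemapindex struct {
	Locations []string `xml:"sitemap>loc"`
}

type News struct {
	Titles    []string `xml:"url>news>title"`
	Keywords  []string `xml:"url>news>keywords"`
	Locations []string `xml:"url>loc"`
}

// fetchXML (hypothetical helper) downloads url and unmarshals the body
// into v, returning errors instead of discarding them.
func fetchXML(url string, v interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	data, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	return xml.Unmarshal(data, v)
}

// collectNews reads the index, then unmarshals every listed sitemap into
// the same News value; each Unmarshal call appends to its slices.
func collectNews(indexURL string) (News, error) {
	var s Sitemapindex
	var n News
	if err := fetchXML(indexURL, &s); err != nil {
		return n, err
	}
	for _, loc := range s.Locations {
		if err := fetchXML(loc, &n); err != nil {
			return n, err
		}
	}
	return n, nil
}

// entry builds one made-up <url> element for the fake sitemaps.
func entry(title string) string {
	return "<url><loc>https://example.com/" + title + "</loc>" +
		"<news><title>" + title + "</title><keywords>demo</keywords></news></url>"
}

// newTestServer serves a tiny invented index plus two sitemaps.
func newTestServer() *httptest.Server {
	mux := http.NewServeMux()
	srv := httptest.NewServer(mux)
	mux.HandleFunc("/index.xml", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "<sitemapindex><sitemap><loc>%s/a.xml</loc></sitemap>"+
			"<sitemap><loc>%s/b.xml</loc></sitemap></sitemapindex>", srv.URL, srv.URL)
	})
	mux.HandleFunc("/a.xml", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<urlset>"+entry("story-1")+"</urlset>")
	})
	mux.HandleFunc("/b.xml", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<urlset>"+entry("story-2")+entry("story-3")+"</urlset>")
	})
	return srv
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	n, err := collectNews(srv.URL + "/index.xml")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, title := range n.Titles {
		fmt.Println(title, "->", n.Locations[i])
	}
}
```

Pointing collectNews at the real index URL instead of the test server should behave the same way, assuming the site still serves sitemaps in this shape.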

The next tutorial: Mapping in Golang