Tags: golang / xml

i have a huge XML file (about 700MB) generated by masscan. initially i tried to parse it with masscan-web-ui, which was of course extremely slow (though i have to admit Offensive Security did a great job turning masscan output into a Shodan/ZoomEye-like search engine). so i decided to write my own tool for this.

i chose Go for the tool, mainly because it's way faster than the other languages i'm familiar with, and it's easier to write than C/C++ or Rust.

i found some good articles covering Go's encoding/xml lib, one of which also explains how to process huge XML files with Go. i adapted the author's code and it worked like a charm.

Go's XML lib

if you read Go's docs for encoding/xml, there are two entry points for parsing: func Unmarshal(data []byte, v interface{}) error and func NewDecoder(r io.Reader) *Decoder. Unmarshal needs the whole document in memory as a []byte, while a Decoder reads from an io.Reader token by token, so i use the latter for stream decoding, since the file is too big to read at once.

decoder := xml.NewDecoder(xmlFile)

for {
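    // Read tokens from the XML document in a stream.
    // Token returns io.EOF at the end of input; any other error
    // (e.g. malformed XML) should not be silently ignored.
    t, err := decoder.Token()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }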
    // Inspect the type of the token just read.
    switch se := t.(type) {
    case xml.StartElement:
        // If we just read a StartElement token
        // ...and its name is "page"
        if se.Name.Local == "page" {
            var p Page
            // decode a whole chunk of following XML into the
            // variable p, which is a Page struct
            if err := decoder.DecodeElement(&p, &se); err != nil {
                log.Fatal(err)
            }
            // Do some stuff with the page.
            p.Title = CanonicalizeTitle(p.Title)
            ...
        }
...
  • my code can be found here

build xml data structure

this is the hardest part: Go's docs don't say much about how to build structs for XML data (there are a few examples, though), so i had to figure it out by reading other people's code.

a very confusing thing about DecodeElement() is that the fields of your structs must be exported (start with an upper-case letter, like Addr), otherwise encoding/xml can't set them and silently leaves them empty. the same rule applies to the other functions in the lib, such as Unmarshal. the lowercase XML names go into the struct tags, and the struct type itself doesn't have to be exported.
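here's a minimal sketch of that rule, using the address element from the masscan output. encoding/xml works through reflection and can only set exported fields, so the struct with lowercase field names decodes without any error but stays empty. the names exportedAddr, unexportedAddr and decodeBoth are made up for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// exportedAddr has exported (upper-case) fields, so encoding/xml can fill them.
// The type name itself is lower-case on purpose: only the fields matter here.
type exportedAddr struct {
	Addr     string `xml:"addr,attr"`
	Addrtype string `xml:"addrtype,attr"`
}

// unexportedAddr carries the same tags, but its fields are unexported,
// so encoding/xml skips them and they keep their zero value.
type unexportedAddr struct {
	addr     string `xml:"addr,attr"`
	addrtype string `xml:"addrtype,attr"`
}

// sample is a single element copied from the masscan output.
const sample = `<address addr="23.209.193.80" addrtype="ipv4" />`

// decodeBoth (hypothetical helper) unmarshals the same document into both structs.
func decodeBoth(doc string) (exportedAddr, unexportedAddr, error) {
	var e exportedAddr
	var u unexportedAddr
	if err := xml.Unmarshal([]byte(doc), &e); err != nil {
		return e, u, err
	}
	if err := xml.Unmarshal([]byte(doc), &u); err != nil {
		return e, u, err
	}
	return e, u, nil
}

func main() {
	e, u, err := decodeBoth(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("exported fields:   %+v\n", e)
	fmt.Printf("unexported fields: %+v\n", u)
}
```

note that the second Unmarshal call returns no error at all, which is why this mistake is easy to miss. go vet should flag xml tags on unexported fields, so running it is a cheap way to catch this.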

  • example xml from masscan
<host endtime="1511300934">
   <address addr="23.209.193.80" addrtype="ipv4" />
   <ports>
      <port protocol="tcp" portid="443">
         <state state="open" reason="response" reason_ttl="38" />
         <service name="X509" banner="MIIINzCCBx+gAwIBAgISAyHPrX5WmY2NR7a7ZNMKQMTgMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQDExpMZXQncyBFbmNyeXB0IEF1dGhvcml0eSBYMzAeFw0xNzA5MjUxNjQwMDBaFw0xNzEyMjQxNjQwMDBaMBcxFTATBgNVBAMTDHd3dy5rbm9yci5jbjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBALIuF7kZD/Ruh9eZa4kWLFhuXvdtOZEo9+t7b34B5/e3hUzuRNes3cxIPuCzBzDWX2lFTpaYrUPcSQZIx3LcnQQ88kTIUmtGkE1ZV3YNZaLKpyr5d1jjesWRupJGG/JehYlU4w8z4UX5ARHixHxnVponL/g/Uo3WObUXn/Y61CAEs+C50TQz6fuviXSv7QbAgiOAubWCL3fe8zFl8USA9ls5TrmMdGyyTI4IaiGGysJBNP0Xtf0T0xXVZ2p4FfDRtubjrJVVVJD0u5LUhRnuzZzM8X09nWGFsZ5DTetiIAMzGvrJEELboP9peJa5WrayJCyJgaanWdjDVBL7dr6VLCcCAwEAAaOCBUgwggVEMA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggr" />
      </port>
   </ports>
</host>
  • corresponding Go struct
// Address : host>address
type Address struct {
    Addr     string `xml:"addr,attr"`
    Addrtype string `xml:"addrtype,attr"`
}

// State : host>ports>port>state
type State struct {
    State     string `xml:"state,attr"`
    Reason    string `xml:"reason,attr"`
    ReasonTTL string `xml:"reason_ttl,attr"`
}

// Service : host>ports>port>service
type Service struct {
    Name   string `xml:"name,attr"`
    Banner string `xml:"banner,attr"`
}

// Ports : host>ports>port (one entry per <port> element)
type Ports []struct {
    Protocol string `xml:"protocol,attr"`
    Portid   string `xml:"portid,attr"`

    State    State   `xml:"state"`
    Service  Service `xml:"service"`
}

// Host : host field in XML
type Host struct {
    XMLName xml.Name `xml:"host"`
    Endtime string   `xml:"endtime,attr"`

    Address Address `xml:"address"`
    Ports   Ports   `xml:"ports>port"`
}
  • you can also write Ports this way:
// Port : host>ports>port
type Port struct {
    Protocol string `xml:"protocol,attr"`
    Portid   string `xml:"portid,attr"`

    State    State   `xml:"state"`
    Service  Service `xml:"service"`
}

// Host : host field in XML
type Host struct {
    XMLName xml.Name `xml:"host"`
    Endtime string   `xml:"endtime,attr"`

    Address Address `xml:"address"`
    Ports   []Port  `xml:"ports>port"`
}

test the code

on my laptop (i5 5200u), this code takes about 1 min to parse the nearly 700MB XML file

