i have a huge XML file (about 700MB) generated by masscan. initially i tried to parse it with masscan-web-ui, which was of course extremely slow (i have to admit that Offensive Security did a great job turning masscan into a Shodan/ZoomEye-like search engine). so i decided to roll my own tool for this.
i chose Go for the tool, mainly because it is much faster than the other languages i'm familiar with, and it's easier to write than C/C++, Rust, etc.
i found some good articles covering Go's encoding/xml lib, one of which also talks about how to process huge XML files with Go. i adapted its code and it worked like a charm
Go's XML lib
if you read Go's doc for encoding/xml, there are two entry points for parsing: func Unmarshal(data []byte, v interface{}) error and func NewDecoder(r io.Reader) *Decoder. Unmarshal needs the whole document in memory as a []byte, while a Decoder reads from an io.Reader, so i use the latter for stream decoding, since the file is too big to read at once:
decoder := xml.NewDecoder(xmlFile)
for {
	// Read tokens from the XML document in a stream.
	// Token returns (nil, io.EOF) once the input is exhausted.
	t, err := decoder.Token()
	if err != nil || t == nil {
		break
	}
	// Inspect the type of the token just read.
	switch se := t.(type) {
	case xml.StartElement:
		// If we just read a StartElement token
		// ...and its name is "page"
		if se.Name.Local == "page" {
			var p Page
			// decode a whole chunk of following XML into the
			// variable p which is a Page (see above)
			decoder.DecodeElement(&p, &se)
			// Do some stuff with the page.
			p.Title = CanonicalizeTitle(p.Title)
			...
		}
		...
- my code can be found here
build xml data structure
this is the hardest part, as Go's doc doesn't say much about how to build structs for XML data (it has some examples though). i had to figure it out by reading other people's code
a very confusing thing with DecodeElement() is that the fields of your structs have to be exported, i.e. their names must start with an upper-case letter (like Addr), otherwise encoding/xml silently skips them instead of filling them in. the same rule applies to the other functions in the lib, such as Unmarshal(). the XML names themselves can stay lower case, since the struct tags map them to the exported fields
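a minimal sketch of the export rule, using two hypothetical types (goodAddr and badAddr, not part of my tool) that differ only in the case of their field names. encoding/xml fills the exported fields and skips the unexported ones without returning an error, which is what makes this easy to miss:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// goodAddr has exported fields, so encoding/xml can set them.
// (hypothetical type for illustration)
type goodAddr struct {
	Addr     string `xml:"addr,attr"`
	Addrtype string `xml:"addrtype,attr"`
}

// badAddr carries the same tags, but its fields are unexported,
// so encoding/xml ignores them. (hypothetical type for illustration)
type badAddr struct {
	addr     string `xml:"addr,attr"`
	addrtype string `xml:"addrtype,attr"`
}

// addressXML mirrors the <address> element from the masscan output.
const addressXML = `<address addr="23.209.193.80" addrtype="ipv4" />`

// decodeBoth decodes the same element into both struct types.
func decodeBoth() (goodAddr, badAddr, error) {
	var g goodAddr
	var b badAddr
	if err := xml.Unmarshal([]byte(addressXML), &g); err != nil {
		return g, b, err
	}
	if err := xml.Unmarshal([]byte(addressXML), &b); err != nil {
		return g, b, err
	}
	return g, b, nil
}

func main() {
	g, b, err := decodeBoth()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("exported:   %q %q\n", g.Addr, g.Addrtype)
	// both fields below stay empty, and no error was reported
	fmt.Printf("unexported: %q %q\n", b.addr, b.addrtype)
}
```

note that the type names here are lower case on purpose: what has to be exported is the fields, not the struct type itself.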
- example xml from masscan
<host endtime="1511300934">
<address addr="23.209.193.80" addrtype="ipv4" />
<ports>
<port protocol="tcp" portid="443">
<state state="open" reason="response" reason_ttl="38" />
<service name="X509" banner="MIIINzCCBx+gAwIBAgISAyHPrX5WmY2NR7a7ZNMKQMTgMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQDExpMZXQncyBFbmNyeXB0IEF1dGhvcml0eSBYMzAeFw0xNzA5MjUxNjQwMDBaFw0xNzEyMjQxNjQwMDBaMBcxFTATBgNVBAMTDHd3dy5rbm9yci5jbjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBALIuF7kZD/Ruh9eZa4kWLFhuXvdtOZEo9+t7b34B5/e3hUzuRNes3cxIPuCzBzDWX2lFTpaYrUPcSQZIx3LcnQQ88kTIUmtGkE1ZV3YNZaLKpyr5d1jjesWRupJGG/JehYlU4w8z4UX5ARHixHxnVponL/g/Uo3WObUXn/Y61CAEs+C50TQz6fuviXSv7QbAgiOAubWCL3fe8zFl8USA9ls5TrmMdGyyTI4IaiGGysJBNP0Xtf0T0xXVZ2p4FfDRtubjrJVVVJD0u5LUhRnuzZzM8X09nWGFsZ5DTetiIAMzGvrJEELboP9peJa5WrayJCyJgaanWdjDVBL7dr6VLCcCAwEAAaOCBUgwggVEMA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggr" />
</port>
</ports>
</host>
- corresponding Go struct
// Address : host>address
type Address struct {
	Addr     string `xml:"addr,attr"`
	Addrtype string `xml:"addrtype,attr"`
}

// State : host>ports>port>state
type State struct {
	State     string `xml:"state,attr"`
	Reason    string `xml:"reason,attr"`
	ReasonTTL string `xml:"reason_ttl,attr"`
}

// Service : host>ports>port>service
type Service struct {
	Name   string `xml:"name,attr"`
	Banner string `xml:"banner,attr"`
}

// Ports : host>ports
type Ports []struct {
	Protocol string  `xml:"protocol,attr"`
	Portid   string  `xml:"portid,attr"`
	State    State   `xml:"state"`
	Service  Service `xml:"service"`
}

// Host : host field in XML
type Host struct {
	XMLName xml.Name `xml:"host"`
	Endtime string   `xml:"endtime,attr"`
	Address Address  `xml:"address"`
	Ports   Ports    `xml:"ports>port"`
}
- you can also write Ports this way:
// Port : host>ports>port
type Port struct {
	Protocol string  `xml:"protocol,attr"`
	Portid   string  `xml:"portid,attr"`
	State    State   `xml:"state"`
	Service  Service `xml:"service"`
}

// Host : host field in XML
type Host struct {
	XMLName xml.Name `xml:"host"`
	Endtime string   `xml:"endtime,attr"`
	Address Address  `xml:"address"`
	Ports   []Port   `xml:"ports>port"`
}
test the code
on my laptop (i5 5200u), this code takes about 1 min to parse the nearly 700MB XML file