Crawling Large Sites
I’ve been working on bug bounties and the tools I use for crawling HackTheBox machines do not scale well for large, public sites. These are a few things I’ve learned, and my methodology will improve as time goes on.
GUI Tools Choke
My go-to intercepting proxy is ZAP. I won’t give an exhaustive explanation of that choice, but I’ll mention a few things. ZAP performs passive scanning as a site is browsed or crawled (aka spidered), which produces a lot of useful alerts. The requests and responses, including headers and bodies, can be searched for content. Both features would be advantageous when scanning a large site. Burp Suite has similar features.
The issue is that crawling a large site can approach tens or hundreds of thousands of requests. Excluding out-of-scope domains from the proxy helps, but it isn’t enough. At that volume of requests, operations slow to the point of being unusable. Opening and closing the ZAP session is slow. I’ve also had cases where ZAP or my VM crashes and corrupts the session file. That is a lot of work lost.
Command Line
My solution is to not use ZAP or Burp Suite for crawling; I reserve those tools for manual work. Instead, I use the command line tool katana to crawl the entire site. The results are stored as JSON in a text file, which command line tools can then digest.
Passive Crawling
Passive crawling queries sources like The Wayback Machine for historical URLs instead of crawling the site itself. The -passive argument enables this feature.
katana -u https://example.com -passive \
  -omit-raw -omit-body \
  -o katana-passive-example.com.json -jsonl \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0'
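Since -jsonl writes one record per line, a quick line count gives a rough idea of how many historical URLs the passive sources returned (the file name matches the -o flag above):

# One JSONL record per discovered URL, so the line count approximates the URL count
wc -l katana-passive-example.com.json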
Crawling
katana -u https://example.com -js-crawl -jsluice -known-files all \
  -field-scope fqdn -display-out-scope \
  -form-extraction -ignore-query-params -strategy breadth-first \
  -omit-raw -omit-body \
  -rate-limit 6 \
  -o katana-example.com.json -jsonl \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0' -retry 3
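With a rate limit of 6 requests per second, a large site takes hours to crawl. One way to keep an eye on progress, a simple sketch assuming a Linux shell with watch installed and the output file named by -o above, is to re-count the output lines from another terminal:

# Re-count the JSONL output every 30 seconds to watch the crawl grow
watch -n 30 'wc -l katana-example.com.json'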
Headless Crawling
Headless crawling uses the Chromium browser to crawl the site, which may yield more results on dynamic websites. The -headless argument enables this feature.
katana -u https://example.com -headless -js-crawl -jsluice -known-files all \
  -field-scope fqdn -display-out-scope \
  -form-extraction -ignore-query-params -strategy breadth-first \
  -omit-raw -omit-body \
  -rate-limit 6 \
  -depth 5 -retry 3 \
  -o katana-headless-example.com.json -jsonl
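To see what the headless crawl found that the plain crawl did not, the two output files can be compared. This is a sketch assuming bash and both output files from the commands above, using the .request.endpoint field described under Analyzing:

# URLs present only in the headless crawl's output
comm -13 \
  <(jq -r .request.endpoint katana-example.com.json | sort -u) \
  <(jq -r .request.endpoint katana-headless-example.com.json | sort -u)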
Argument Description
Read the katana docs to fully understand its options and behavior. These are the ones I’ve found useful so far.
Argument | Description |
---|---|
-js-crawl | Scans JavaScript for URLs. |
-jsluice | Uses the JSLuice library to extract more URLs. |
-known-files all | Look for robots.txt, sitemap.xml, etc. |
-field-scope fqdn | Don’t crawl outside of the fully qualified domain. |
-display-out-scope | Output links to out of scope URLs without accessing them. |
-form-extraction | Extract form data. |
-ignore-query-params | Ignore query params when determining if a URL has been visited. Keeps the scan from growing out of control. |
-strategy breadth-first | Most features of a site are identified by the first or second depth of the path. This option discovers these earlier. |
-omit-raw | Omit the raw request/response data, otherwise local files can grow large. |
-omit-body | Omit the raw request/response bodies, otherwise local files can grow large. |
-rate-limit 6 | Limit to 6 requests per second. Attempts to prevent being blocked or overwhelming the site. |
-o | The output file. |
-jsonl | JSON Lines output; each line is a JSON document for a request and response. |
-H | Specifies an HTTP header, in this case a custom user agent. |
Analyzing
The output is JSON Lines, one request/response per line, which is perfect for command line tools.
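For reference, each line looks roughly like this (an abbreviated, illustrative record; the exact fields vary with the katana version and flags):

{"timestamp":"2024-01-01T00:00:00Z","request":{"method":"GET","endpoint":"https://example.com/login"},"response":{"status_code":200}}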
jq is the primary tool I use to extract fields. For example, to get a list of all visited URLs, one per line:
cat katana-example.com.json | jq -r .request.endpoint
To extract parts of the URL, use TomNomNom’s unfurl:
cat katana-example.com.json | jq -r .request.endpoint | unfurl paths
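unfurl has other modes as well. For example, pulling the unique query parameter names out of the whole crawl is a useful starting point when looking for injectable parameters (a sketch using unfurl’s keys mode):

cat katana-example.com.json | jq -r .request.endpoint | unfurl keys | sort -u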
TomNomNom uses the command line a lot and has developed tools that help with this approach. Check out his GitHub. Here’s a video where I enjoyed watching his command line skills: NahamSec.
ZAP and Burp Suite both allow importing a text file of URLs. Once I have a set of URLs I want to investigate further, I import them into ZAP.
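Building that import file is one more jq one-liner (a sketch; the urls-to-import.txt name is just my choice):

# Deduplicated list of visited URLs, one per line, ready to import
cat katana-example.com.json | jq -r .request.endpoint | sort -u > urls-to-import.txt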
Conclusion
When working with sites at a large scale, I need to get creative instead of wasting time waiting for tools to run. ZAP has a command line mode and a Docker image that I plan to experiment with, to see if I can get it to perform passive scans at scale.
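As a starting point for that experiment, the ZAP Docker image ships packaged scan scripts; something like the following should run a passive-only baseline scan (a sketch based on my reading of the ZAP docs; the image name, tag, and options may differ by version):

# Spider the target and report passive scan findings only (no active scan by default)
docker run --rm -t ghcr.io/zaproxy/zaproxy:stable zap-baseline.py -t https://example.com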