You may often need to mirror all (or part) of a website for offline analysis. The ‘wget’ program has some easy-to-use features for quickly grabbing a local copy of a site and fixing common issues (like links pointing back to server locations). Set one of these running behind the scenes while you work on other things, then peruse the results at your leisure.
Update 2018-07-21: Or just use the script I wrote to simplify this for my customized Kali build, available here.
Mirror All of a Website
wget --mirror --execute robots=off --page-requisites --wait 5 --adjust-extension --user-agent="friendly-spiderman" --convert-links --directory-prefix=/{archive_path} {site_url}
Let’s break those down:
- The "--mirror" bit tells wget to, well, mirror a site. It sets several other options under the hood (recursion, infinite depth, timestamping) which we won't go into here.
- The "--execute robots=off" part tells wget to ignore any directives contained in the site's robots.txt file. Without that, wget behaves like a good robot and skips anything that file tells it to skip.
- The "--page-requisites" tells wget to download anything required to display the page. In short, if a third-party file or something outside the scope of what you told wget to pull down is needed to render the site, it is treated as a special exception and fetched anyway.
- The "--wait 5" tells wget to pause 5 seconds between requests, which keeps the crawl polite and less likely to trip rate limiting. (Add "--random-wait" if you want that delay to vary.)
- The "--adjust-extension" tells wget to append a ".html" extension to downloaded files that are really HTML but aren't named that way. For instance, there's no reason to keep a ".php" extension on a file when what you've actually downloaded is the HTML returned by the server after processing the PHP.
- The "--user-agent=…" tells wget to identify itself with a custom user agent string. You may need/want to do this to bypass certain security controls. If you need a list of user agents, you can find a comprehensive set here (mirrored here).
- The "--convert-links" tells wget to rewrite links so they work locally, e.g. turning an absolute link like http://www.site.com/link.html into a relative one like ../../link.html.
- The "--directory-prefix=/{archive_path}" tells wget to store the mirrored site under the archive path, in a folder named after the domain. So if the archive path is "/tmp/mirrors" and you are mirroring google.com, the mirror would be stored at "/tmp/mirrors/google.com/". (A filled-in example follows this list.)
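For example, assuming a hypothetical archive path of /tmp/mirrors and example.com as a stand-in target (swap in your own values), the filled-in command would look like this:

wget --mirror --execute robots=off --page-requisites --wait 5 --adjust-extension --user-agent="friendly-spiderman" --convert-links --directory-prefix=/tmp/mirrors https://example.com/

When it finishes, the copy lives under /tmp/mirrors/example.com/ and you can open its index.html straight from there in a browser.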
Mirror Part of a Website
wget --mirror --execute robots=off --page-requisites --wait 5 --adjust-extension --user-agent="friendly-spiderman" --no-parent --convert-links --directory-prefix=/{archive_path} {site_url/path}
Most of this is the same as the previous command; there are only two differences:
- The "--no-parent" tells wget not to follow any links that move further up the directory tree. For instance, if you tell it to mirror "www.catpics.com/funny" it will not pull anything from "www.catpics.com/sad" or even just "www.catpics.com". It stays at and below the specified directory (except for the --page-requisites items).
- The only other change is that a path is appended to the URL, telling wget which part of the site to mirror (as in the sketch below). Without that, it's just a plain ol' mirror.
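As a hypothetical sketch using the same cat-pics example, mirroring only the /funny section into /tmp/mirrors would look like this:

wget --mirror --execute robots=off --page-requisites --wait 5 --adjust-extension --user-agent="friendly-spiderman" --no-parent --convert-links --directory-prefix=/tmp/mirrors http://www.catpics.com/funny/

Note the trailing slash on the URL: wget has no real notion of directories over HTTP, so without the slash it treats "funny" as a file and --no-parent won't confine the crawl the way you'd expect.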
Good luck, and good hunting!