Web archiving for everyone

By Matthew Cowie  ·   October 15, 2019  ·  3 minute read

Creating an offline archive of a website used to be nearly impossible. Open-source, easy-to-use software, embraced by most web preservation organizations and digital archivists, is making it possible.

As the web grows older, more sites are lost to time. Preserving websites gets harder as they grow from simple markup and images into collections of video, dynamically loaded elements, and social media interactions. We were recently tasked with producing a working offline archive of a client’s current site, which was due to be replaced, so the job was time-sensitive. But we soon hit a wall when reaching for the familiar utilities.

Classic web scraping tools like wget and httrack still work as well as ever for capturing the raw HTML that’s served. However, this is often an incomplete view of the page. Capturing embedded images, audio, and video requires extra steps, and may not be possible without additional tools. And for the downloaded pages to point to those downloaded assets, as well as to link to each other, the pages must be rewritten. It’s a delicate process, and that’s before JavaScript gets involved. Even with a friendly GUI, these tools often require a fair amount of technical knowledge to use.
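To make the rewriting step concrete, here is a minimal Python sketch of what "pointing pages at downloaded assets" involves. It is not how wget or httrack do it; the function names, the mapping policy (same-host URLs become relative file paths, directories become index.html), and the regex-based attribute matching are all assumptions made for illustration. A real tool would use a proper HTML parser and also handle CSS, srcset, single-quoted attributes, and much more.

```python
import posixpath
import re
from typing import Optional
from urllib.parse import urljoin, urlparse

# Deliberately naive: only double-quoted href/src attributes are matched.
ATTR = re.compile(r'\b(?P<attr>href|src)="(?P<url>[^"]*)"')


def local_path(url: str, site_host: str) -> Optional[str]:
    """Map a URL on the archived host to a file path inside the archive.

    Hypothetical policy: URLs on other hosts return None, directory-style
    URLs map to index.html, and query strings and fragments are ignored.
    """
    parts = urlparse(url)
    if parts.netloc != site_host:
        return None
    path = parts.path.lstrip("/")
    if path == "" or path.endswith("/"):
        path += "index.html"
    return path


def rewrite_links(html: str, page_url: str, site_host: str) -> str:
    """Rewrite same-host links so they resolve between downloaded files."""
    page_file = local_path(page_url, site_host)
    if page_file is None:
        raise ValueError("page_url must be on site_host")
    page_dir = posixpath.dirname(page_file) or "."

    def swap(match: "re.Match[str]") -> str:
        # Resolve the link against the page's original URL first.
        absolute = urljoin(page_url, match.group("url"))
        target = local_path(absolute, site_host)
        if target is None:
            return match.group(0)  # external link: leave untouched
        # Express the target relative to the directory of the saved page.
        relative = posixpath.relpath(target, page_dir)
        return f'{match.group("attr")}="{relative}"'

    return ATTR.sub(swap, html)
```

Even this toy version has to resolve every link against the page's original URL, decide which links are in scope, and recompute paths relative to where each page is saved, which hints at why the process is so easy to get wrong on a real site.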

Enter WebRecorder. WebRecorder is a hosted application that can capture websites exactly as they’d load in a browser. WebRecorder can capture both the initial page load and anything that gets loaded after the fact, such as the content of an infinite scroll site or a huge table that is only loaded if the user clicks a button. It can capture rich embedded elements from third parties, like Tweets or YouTube videos. That means even offline, an embedded YouTube video can still be played.

Not only can WebRecorder capture sites just like a browser would, but using it is just like browsing the site as you usually would. Create a collection, start a capture session, and… surf. That’s it. The “auto-pilot” feature can scroll down a page for you and trigger actions such as playing videos, capturing dynamic content into the archive. Each page loaded is saved to the collection. Want to capture a newer version of the page? Start a new session and browse to it. WebRecorder will save both versions.

To support the advanced capturing it does, WebRecorder uses a special Web Archive format (WARC). Special formats and long-term archiving don’t usually go together. The more niche the format, the harder it is to find the software to use the archive. An archive without the tools to use it is not much of an archive. Thankfully, WARC is an open format embraced by most web preservation organizations, including The Internet Archive and The International Internet Preservation Consortium, the latter of which maintains a list of software being developed for WARC.
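Part of what makes an open format reassuring is that its records are simple enough to inspect by hand. As a rough illustration, the Python sketch below builds a single uncompressed WARC 1.0 "resource" record and reads it back. The function names are hypothetical, and this is not WebRecorder's code; it covers only the basic layout of a version line, named header fields, a blank line, and the content block. Real archives typically hold many records, are often gzip-compressed, and are best handled with an established WARC library.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Tuple

CRLF = b"\r\n"


def build_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC 1.0 'resource' record (illustrative, uncompressed)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    header = CRLF.join(line.encode("utf-8") for line in header_lines)
    # A blank line ends the header; two CRLFs end the record.
    return header + CRLF + CRLF + payload + CRLF + CRLF


def parse_record(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a single record into its header fields and content block."""
    head, _, rest = raw.partition(CRLF + CRLF)
    lines = head.decode("utf-8").split("\r\n")
    fields = {"version": lines[0]}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    # Content-Length says exactly how many bytes of the block to read.
    length = int(fields["Content-Length"])
    return fields, rest[:length]
```

Because the header is plain text and the block length is stated up front, even a few lines of code can recover the captured bytes, which is a useful property for an archive that is meant to outlive its original tools.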

WebRecorder also provides its own companion playback application, WebRecorder Player, for Windows, Mac, and Linux. The player can render all captured content, fully offline. Like the recording server, it is easy to use, free, and open source.

WebRecorder is an impressive tool. It empowers anyone, even nontechnical folks, to preserve any piece of the web they wish. It’s an easy recommendation for organizations wishing to archive their own work. And for those of us who produce websites and applications, it’s a reminder of all the additional costs that a complex website incurs, even after it’s gone.

Image by Catarina Carvalho on Unsplash