Earlier this week I learned about the Archive Team, a group of enthusiasts who feel it’s a shame we lose so much of our digital heritage in today's interconnected world. Online services come and go, and when they go they take all our data with them. The Archive Team feels it doesn't have to be this way and has decided to take action. They track websites and data in danger of getting lost and try to save as much as possible by archiving everything they can grab before the service shuts down.
Take twitpic, for example. Twitpic is an image hosting website where Twitter users upload(ed) photos to link from their tweets before Twitter added support for images. Noah Everett, twitpic's owner, announced that the service would shut down after a trademark dispute with Twitter. If the shutdown goes ahead, all uploaded photos and comments will be lost forever.
You could argue that losing the photos of Bob's late-night dinner and Alicia's selfie isn't that important in the first place. Still, they are a window on our time: how we live and what people think is worth sharing. Not all twitpic photos are personal memories. Some captured major events, like the one by twitpic user Jānis Krūms, who took one of the first photos of US Airways Flight 1549 after its emergency landing in the Hudson River.
The Archive Team built a set of tools to crawl websites, grab their contents and upload them to online archives for long-term storage. You can help too by running a warrior, a piece of software you run on your computer that grabs and packages data in danger of getting lost and uploads it to the Internet Archive. The more people running a warrior, the faster the website will be archived. Speed is important here, because the endangered service won’t hang around forever.
The Internet Archive helped develop the WebArchive (or WARC) file format, which combines multiple digital resources into a single aggregate archive together with related information. There are various open-source tools available to browse or manipulate WARC files. The Archive Team warrior tool uses this format to package the content it grabs before handing it over to the Internet Archive.
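Because a WARC file is a sequence of records that each start with plain-text headers, you can get a rough idea of what is inside one with standard tools. The sketch below assumes a gzip-compressed WARC file; the function name list_warc_uris is made up for illustration and is not part of any Archive Team tool.

```shell
# Sketch only: list_warc_uris is a hypothetical helper, not an existing tool.
# It prints the target URI recorded in each WARC record header.
list_warc_uris() {
    # $1: path to a gzip-compressed WARC file
    gzip -dc "$1" \
        | grep -a '^WARC-Target-URI:' \
        | sed 's/^WARC-Target-URI:[[:space:]]*//' \
        | tr -d '\r'
}
```

This is a quick peek rather than a real parser: a captured page that happens to contain a line starting with WARC-Target-URI: would show up too, so reach for a dedicated WARC tool when accuracy matters.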
Wget, a command-line utility used to download websites, can build WARC files out of the box through its --warc-file option. Creating a WebArchive of this blog, for example, takes a single command.
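A minimal sketch of what such an invocation could look like, wrapped in a small function so the URL and output name are easy to swap. The function name archive_site and the values in the usage note are placeholders of mine, not the command from the original post; the wget options themselves (--mirror, --page-requisites, --warc-file, --warc-cdx) are standard.

```shell
# Sketch only: archive_site and its arguments are illustrative placeholders.
archive_site() {
    # $1: start URL to crawl
    # $2: base name for the WARC output
    wget --mirror \
         --page-requisites \
         --warc-file="$2" \
         --warc-cdx \
         "$1"
}
```

Calling archive_site https://example.com/ example-blog should crawl the site recursively, fetch the images and stylesheets each page needs, and write the captured traffic to example-blog.warc.gz, with a CDX index next to it. Wget also keeps its usual local copy of the downloaded files.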