REALITY BIT(E)S

The Digital Dark Ages - the Spiders:

In the last article I mentioned spiders on the World Wide Web. For those who thought that was a bizarre joke, think again.

Spiders are a class of computer programs that literally crawl the web (going from website to website in a methodical manner), collecting a specific type of information. Probably the busiest spiders belong to the internet search engine companies like Google. It is these spiders that collect the data to create the indices that power their search engines.
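For the technically curious, the "methodical manner" can be sketched in a few lines of Python. This is only a toy under stated assumptions: the TOY_WEB dictionary and the crawl function are made up for illustration, the "web" is a handful of pretend pages held in memory, and nothing here touches the real internet or reflects how any particular search engine's spider is built.

```python
from collections import deque

# Hypothetical miniature "web": each page name maps to the pages it links to.
TOY_WEB = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post-1", "post-2", "home"],
    "post-1": ["blog"],
    "post-2": ["post-1"],
}


def crawl(start, links):
    """Visit every page reachable from start, nearest pages first, each page once."""
    visited = []            # pages in the order the spider reached them
    seen = {start}          # pages already queued, so none is visited twice
    queue = deque([start])  # pages waiting to be visited
    while queue:
        page = queue.popleft()
        visited.append(page)
        # Follow every link on this page that we have not come across before.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    print(crawl("home", TOY_WEB))
```

A real spider would fetch each page over the network and pull the links out of its HTML, but the bookkeeping (a queue of pages to visit and a record of pages already seen) is the part that makes the crawl methodical.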

Spiders can also be used to archive Internet sites. By regularly visiting a website and methodically capturing all of its pages, a spider can build an accessible record of the growth and changes the site goes through. That record is kept on a secondary site where, hopefully, it will be preserved for posterity even if the original site changes completely or ceases to exist.

The Internet Archive (http://www.archive.org/) is one of these sites. Founded in 1996, the archive is a non-profit organisation dedicated to building a library of the internet. It now possesses about 2 petabytes (2,000,000,000 megabytes) of digital material, including web pages. As well as collecting data from contributing sites, the Internet Archive has a feature called the ‘Wayback Machine’ (Yes, it is named after Mr. Peabody’s time machine from the cartoon. Why do you ask?), which allows users to dial up a date and access the web pages from a site for that period.
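The idea of dialling up a date can also be sketched in Python. This is a guess at the general principle, not a description of how the Internet Archive actually stores or serves its captures: the SnapshotArchive class and its capture and as_of methods are hypothetical names, and the rule used here (return the most recent capture made on or before the requested date) is an assumption chosen for simplicity.

```python
from bisect import bisect_right
from datetime import date


class SnapshotArchive:
    """Hypothetical store of dated captures of a single web page."""

    def __init__(self):
        self._dates = []  # capture dates, kept in sorted order
        self._pages = []  # page content captured on the matching date

    def capture(self, when, content):
        """Record what the page looked like on the given date."""
        i = bisect_right(self._dates, when)
        self._dates.insert(i, when)
        self._pages.insert(i, content)

    def as_of(self, when):
        """Return the latest capture made on or before the date, or None."""
        i = bisect_right(self._dates, when)
        return self._pages[i - 1] if i else None


if __name__ == "__main__":
    archive = SnapshotArchive()
    archive.capture(date(2007, 1, 1), "first version of the page")
    archive.capture(date(2008, 9, 1), "redesigned page")
    print(archive.as_of(date(2008, 1, 1)))
```

Asking for a date before the first capture returns nothing at all, which is the digital dark age in miniature: if nobody captured the page, there is nothing to dial back to.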

A little closer to home, the National Library of Australia (NLA) has its own web archiving project called ‘PANDORA’. PANDORA is an acronym for “Preserving and Accessing Networked Documentary Resources of Australia”, and one of its responsibilities is to record Australia’s digital cultural heritage. This puts the sort of writings that authors are likely to put up on their websites well within the ambit of PANDORA’s activities, and your website doesn’t even have to be on Australian soil to be visited by the NLA’s spiders. In fact, there are a number of authors’ blogs currently stored in the PANDORA archive, even some LiveJournal ones. Yours could be as well. All you have to do is apply for consideration at the following webpage:

http://pandora.nla.gov.au/registration_form.html

I personally have a couple of sites under review at the moment, including my own mirror of these ‘Reality Bit(e)s’ articles (http://web.mac.com/phillberrie/Reality_Bit%28e%29s/Index.html), so why not you?

This brings us to the end of this series of articles on digital preservation — who’s that cheering?

What I hope you have taken away from them is the realisation that digital is the most ephemeral form of storage for your intellectual property and that archiving your files in digital format and forgetting about them is probably the best way to ensure that they won’t survive in the long term.

Use them or lose them. Even better, publish them, even if it’s on the internet, so that other people can become involved in their maintenance and propagation onto the next new platform. After all, you can’t live forever, but your work might if it is kept in circulation.

References:

Information about Internet spiders, web crawlers and robots.

http://en.wikipedia.org/wiki/Web_spider

About the Internet Archive.

http://en.wikipedia.org/wiki/Internet_archive

Where to find “Pandora”, Australia’s web archive.

http://pandora.nla.gov.au/

N.B. Please note that although I use Wikipedia (and Wikimedia Commons) a lot for references, this is for expediency and the familiarity of my readers. Anyone interested in further study should make use of the references there where available, and understand that Wikipedia is a co-operative project that anyone can contribute to and must always be looked at in that light.

Phill Berrie, September, 2008.