REALITY BIT(E)S

The Digital Dark Age - Unicode:

A lot of the texts that survived the medieval dark ages were hand-transcribed copies of originals that weren’t so lucky. These imperfect copies are the subject of much study by scholars, who compare the differences (and similarities) between the various copies to try to determine what the lost original actually said. For example, no copy of Chaucer’s Canterbury Tales exists in the writer’s own hand. The tales as we know them have been put together from the eighty-three known copies of the work that have survived from the late medieval and early Renaissance periods.

It is my fear that the same situation will apply for today’s digital texts, but in decades rather than centuries.

Last week I touched briefly on how obsolescence of platforms, both hardware and software, was putting at risk the electronic files of as little as a decade ago. This danger is becoming more and more recognised and I noted in the Panorama supplement of the Canberra Times on the weekend an article on just this very subject. However, in my opinion, the problem is deeper than that because of obsolescence in the way we store the very characters our digital texts are made up of.

The first digital computers were designed and made in English-speaking countries: the USA, England and, believe it or not, Australia. For these pioneers, all that was necessary was a way of encoding the basic letters of the English language, and that could be covered by an encoding scheme (i.e. a bit representation) with 128 different possible codes, which is what 7 bits provide.

However, when you take into account the diacritical marks used in European languages, the twenty-eight letters of Arabic, and the manifold glyphs of the Asian languages, a one-byte encoding scheme (ASCII: originally 7 bits, later stretched to 8 in a variety of incompatible extensions) becomes woefully inadequate. This is especially the case when you want to combine glyphs from different alphabets in the same file. Hence the need for Unicode, a character encoding scheme whose goal is to provide a unique code for every language glyph (and all other symbols used in communication) in the world. Perceptive readers will appreciate that the adoption of Unicode in the non-English-speaking world is quite well advanced.
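For readers who like to see such things for themselves, here is a minimal sketch in Python of the "one unique code per glyph" idea. The sample string and the helper name describe are my own choices for illustration; the only library used is Python's standard unicodedata module.

```python
import unicodedata

def describe(text):
    """Return (character, code point, official Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# One string mixing four scripts: Latin, accented Latin, Arabic and Chinese.
# Escapes are used so the example does not depend on how this file is saved.
sample = "A\u00e9\u0628\u4e2d"

for ch, code, name in describe(sample):
    print(ch, code, name)
```

Each character gets its own number regardless of which alphabet it comes from, which is what lets them sit side by side in one file.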

However, what is the price we must pay for this sensible and much needed advancement?

It is most likely that the operating system on the computer you are using to read this article already recognises and uses Unicode at some level. And converting from straight ASCII to Unicode is child’s play, as the former is a subset of the latter. However, it’s round the edges where the problems lurk, because computer programmers and font designers are, and were, a crafty lot, and there are many existing files whose on-screen and printed appearance relies solely on the typeface of the font used to display them.
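The "child's play" claim can be checked with a few lines of Python. This is only a sketch: the function name ascii_to_unicode and the sample bytes are made up for the example, and UTF-8 is assumed as the Unicode storage format.

```python
def ascii_to_unicode(raw: bytes) -> str:
    """Decode plain 7-bit ASCII bytes into a Unicode string."""
    # Raises UnicodeDecodeError if any byte is outside the 0-127 ASCII range.
    return raw.decode("ascii")

# Stand-in for the contents of an old plain-ASCII file.
old_file = b"Whan that Aprille with his shoures soote"
text = ascii_to_unicode(old_file)

# Every ASCII character keeps the same number in Unicode...
print(all(ord(ch) < 128 for ch in text))

# ...and UTF-8 stores those characters as exactly the same bytes.
print(text.encode("utf-8") == old_file)
```

In other words, a pure ASCII file is already a valid UTF-8 file, so nothing needs to be converted at all.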

Use another font, lose the original font, or lose the ability to render that old font, and the file becomes meaningless gibberish, reverting at worst to the basic ASCII character set, or at best requiring someone to take the effort to convert the font’s characters into Unicode.
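To make the problem concrete, here is a sketch of what such a conversion involves. Everything in it is hypothetical: the mapping table stands in for an imaginary old font that drew Greek letters in place of certain Latin ones, and a real conversion would need the full table for the actual font, which is exactly the knowledge that tends to get lost.

```python
# Hypothetical table for an imaginary legacy font: the file stores ordinary
# ASCII letters, and the font draws each one as a Greek letter instead.
HYPOTHETICAL_FONT_MAP = {
    "a": "\u03b1",  # Greek small letter alpha
    "g": "\u03b3",  # Greek small letter gamma
    "l": "\u03bb",  # Greek small letter lamda
}

def font_text_to_unicode(stored, mapping):
    """Swap each stored character for the Unicode character the font displayed.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(mapping.get(ch, ch) for ch in stored)

stored = "gala"  # what is actually saved in the file
print(stored)    # what you see once the old font is gone
print(font_text_to_unicode(stored, HYPOTHETICAL_FONT_MAP))  # what the author saw
```

The code itself is trivial. The hard part is the table, because nothing in the file records which font it was written for or what that font displayed.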

And herein lies the problem, because when the content creator is no longer around to tell us how the file should be represented, this starts to become a non-trivial problem. It starts costing money, and this means decisions, possibly short-sighted ones, will be made as to which digital files to convert and which will be left to become unreadable as technology marches on.

This has been a woefully inadequate look at what is truly a difficult problem and to my readers who have any background in this field I apologise, but then I’m not really writing these articles for you. What I hope to engender in my less technically oriented readers is an awareness that backups of your digital files, while useful for disaster recovery, are not sufficient for the longer term.

Next week I will look at some of the ways you can hopefully futureproof some of that apparently not-so-deathless prose we all have lying in state on our hard drives.

References:

Canberra Times Article:

“A stitch in time ... might save you nothing” by Craig Gamble, p. 29, Panorama Supplement, Canberra Times, 2-8-2008.

About the Canterbury Tales:

http://en.wikipedia.org/wiki/The_Canterbury_Tales#Analysis

The Wikipedia on Unicode:

http://en.wikipedia.org/wiki/Unicode

N.B. Please note that although I use the Wikipedia (and WikiMedia Commons) a lot for references, this is for expediency and the familiarity of my readers. Anyone interested in further study should make use of the references where available, and understand that the Wikipedia is a co-operative project that anyone can contribute to and must always be looked at in that light.

Phill Berrie, August, 2008.