4c. Encoding/algorithmic issues
A history of writing is also a history of textual encoding. Every writing technology, from shapes traced into clay tablets to HTML tags used on web pages, conforms to established rules (which may be more or less explicit), without which recorded language would be incomprehensible. Humanists, as students of the history of human communication, have always been at the forefront of developing systems for describing textual artefacts and establishing common practices for their study. It is therefore not surprising that humanists have played a prominent role in the development and standardization of digital encoding practices, from early uses of punch cards for encoding (shortly after WWII) to the more recent development of markup languages such as SGML and XML. One of the most sustained efforts has been the Text Encoding Initiative (TEI), which has been striving since 1987 to establish flexible standards for encoding information about a practically unlimited range of textual objects. The diversity of extant texts explains why the TEI's efforts are ongoing, and why they are as necessary and relevant today as ever.
Digital humanists working on text encoding are confronted with two specific sets of challenges:
- how to digitize texts — that is, how to transcribe information about print, manuscript, and other texts into a digital format (with all of the risks inherent to any form of transcription); and
- how to encode information in ways that are useful to both the human reader and the computer.
The first set of challenges, regarding strategies of digitization, requires researchers to make (often difficult) decisions about what types of information to preserve when passing from print (or manuscript, stone, etc.) to a digital format. Manuscripts and printed texts have a conceptually limitless amount of information associated with them: the recorded linguistic signs, the physical traits of the object (size, materials used, state of preservation), and metatextual information (date of creation, publisher, sales). Digitization imposes compromise and sacrifice: it is not a question of whether information is lost, but how much. Few attempts have been made at producing electronic editions that encode a wide spectrum of information about print objects. The HCI-Book project will be able to contribute substantially to efforts to standardize practices for encoding textual objects as multidimensional and multimedia objects, and not just as sequences of linguistic signs. Moreover, we anticipate that the ability to encode a wider array of features found in textual objects will create new opportunities for representing and studying them.
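As a rough illustration of what encoding a textual object along more than one dimension might look like, the sketch below stores a transcription alongside physical and metatextual information in a single XML record. The element names (textualObject, transcription, physicalDescription, metadata) and the sample values are invented for this example; they are not TEI elements and do not describe any encoding scheme adopted by the project.

```python
import xml.etree.ElementTree as ET


def encode_object(text, physical, metadata):
    """Build one XML record for a textual object.

    All element and attribute names are hypothetical placeholders,
    chosen only to mirror the three kinds of information named above.
    """
    root = ET.Element("textualObject")

    # The recorded linguistic signs.
    ET.SubElement(root, "transcription").text = text

    # Physical traits of the object (size, materials, preservation).
    phys = ET.SubElement(root, "physicalDescription")
    for name, value in physical.items():
        ET.SubElement(phys, "trait", {"name": name}).text = value

    # Metatextual information (date of creation, publisher, sales).
    meta = ET.SubElement(root, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta, "field", {"name": name}).text = value

    return root


# Made-up sample values, for illustration only.
record = encode_object(
    "In principio erat verbum",
    {"size": "30 x 20 cm", "material": "parchment"},
    {"date": "c. 1200"},
)
xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```

Even this small record makes the compromise visible: anything not given a trait or field of its own is simply absent from the digital version.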
The question of designing encoding schemes that may be useful to both humans and computers has been raised throughout the development of digital encoding formats and markup languages. Since most markup is performed manually (graphical editors are more useful for markup languages with relatively small vocabularies, such as HTML), there has no doubt been a historical tendency to favour human readability of markup over computability. In recent years, however, that trend has somewhat reversed. For instance, eXtensible Markup Language (XML) is in many ways an attempt to create an encoding scheme even more constrained than SGML (Standard Generalized Markup Language). By restricting what counts as acceptable markup, XML saves processing, analysis, and rendering tools from having to anticipate and deal with a wide array of variations and exceptions. Moreover, a new generation of analysis tools designed and developed by the digital humanities community is prompting a rethinking of how text encoding can be better adapted to the demands of algorithmic processing. (Examples of such next-generation tools may be found on the TAPoR website.)
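One concrete instance of XML's stricter rules is its well-formedness requirement: every element must be explicitly closed, whereas SGML can permit certain tags to be omitted. The sketch below uses Python's standard xml.etree.ElementTree parser to show that a conforming tool may simply reject markup that breaks this rule rather than guess at the encoder's intent. The function name is_well_formed, the hi element, and the two sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # The parser refuses the input outright instead of trying
        # to repair or reinterpret it.
        return False


# Every element is explicitly opened and closed.
strict = "<p>A line with <hi>emphasis</hi>.</p>"

# The <hi> element is never closed, so the tags are mismatched.
loose = "<p>A line with <hi>emphasis.</p>"

print(is_well_formed(strict))
print(is_well_formed(loose))
```

Because malformed input is rejected at the parsing stage, downstream analysis and rendering tools can assume a predictable tree structure.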
The HCI-Book project will be well positioned to contribute to the collaborative process of adapting existing text encoding practices for a range of purposes that relate to creating rich electronic reading environments. These environments must be able to represent information relevant to traditional textual critics (interested in the material characteristics of textual artefacts), and to provide a variety of integrated analytic tools that can exploit the underlying encoding and facilitate interpretation.