Professor Robert Hegel contacted DLS with the idea of a project modeled on the Red Brush project. For Red Brush, Professor Beata Grant had co-written a book of English translations and interpretations of texts by Chinese women writers, and wanted to create a digital resource of the original Chinese texts on which those translations and interpretations were based. Having recently completed his book, True Crimes in Eighteenth-Century China: Twenty Case Histories, Professor Hegel was interested in creating a similar digital resource of the original Chinese texts behind his own interpretations and translations.
There were many challenges to the creation of this resource. One significant obstacle was that each character would have to be keyed in by hand. While there were several talented and dedicated students with the language skills to do this, managing a project based in a foreign language and character set unknown to the full-time staff was problematic. Another issue was the character and structure of the documents themselves. The major XML text-encoding standard, TEI (Text Encoding Initiative), expects the structure found in modern, published monographs, with elements such as title, table of contents, chapters and paragraphs. The documents in this collection do not conform at all to these expectations. Encoding them required someone with strong Chinese language skills, familiarity with the documents, and enough knowledge of the TEI to structure the XML so that it reflects the original documents as closely as possible.
Scott Paul McGinnis, then a Master’s student of Professor Hegel, originally came to DLS to work on the Red Brush project. When the True Crimes project began, Scott was familiar enough with the basics of the TEI to begin researching its encoding guidelines at a deeper level to meet this challenge. Along with DLS staff (especially Shannon Davis, then Library Assistant and now Digital Projects Librarian), Scott created a document model for the texts. We adopted the workflow from the Red Brush project, in which XML templates were created for each case based on the document model, and transcribers keyed the content of the documents into those templates.
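The template step can be illustrated with a minimal sketch. It assumes a bare TEI P5 skeleton (a header plus one division per case) and is not the project's actual document model, which is not reproduced here; the function name `make_case_template`, the `type="case"` value, and the sample identifier are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"            # TEI P5 namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"   # namespace of xml:id


def make_case_template(case_id: str, title: str) -> str:
    """Return an empty TEI P5 skeleton for one case (hypothetical layout)."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    def q(tag: str) -> str:
        # Qualify a tag name with the TEI namespace.
        return f"{{{TEI_NS}}}{tag}"

    root = ET.Element(q("TEI"))

    # Minimal header: fileDesc with its three required children.
    file_desc = ET.SubElement(ET.SubElement(root, q("teiHeader")), q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    ET.SubElement(ET.SubElement(file_desc, q("publicationStmt")), q("p"))
    ET.SubElement(ET.SubElement(file_desc, q("sourceDesc")), q("p"))

    # Body: one division per case, with an empty paragraph for the transcriber.
    body = ET.SubElement(ET.SubElement(root, q("text")), q("body"))
    div = ET.SubElement(body, q("div"), {"type": "case", f"{{{XML_NS}}}id": case_id})
    ET.SubElement(div, q("p"))

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(make_case_template("case01", "Sample case"))
```

A transcriber would then fill in the empty elements with Unicode text, leaving the surrounding structure untouched.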
While Scott was able, by and large, to represent the documents in TEI P5, with transcribers encoding characters in Unicode, there were two obstacles to indexing the documents and making them available online. First, as the documents can date to the 17th century, some of the characters are relatively obscure. Scott developed a workflow in which characters not found in the Microsoft IME (Input Method Editor) could be looked up in the Unihan Radical-Stroke Index.
Each of these issues created distinct problems. While TEI P5 was sophisticated enough to represent the documents in XML, DLXS, the XML indexer used for the project, not only shares the expectation of modern, published books but also defaults to simpler instances of such texts. We decided to make the content of the texts available sooner, at the expense of reflecting the sophisticated structure in which the documents were encoded. We hope to revisit this issue in the next 6-9 months and reindex the documents to better reflect this structure. To index the documents in this “beta-release,” we have also sacrificed reliable bibliographic representation, including appropriate credit for all the individual transcribers of the documents.
Additionally, while DLXS was able to index the majority of characters, a handful of them do not fall in “plane zero” of Unicode and cannot be indexed in DLXS. As a work-around, we have represented each of these characters only by the decimal number assigned to it in Unicode. So while the characters themselves cannot be searched, it is still possible to find them by searching for the decimal. In the texts, the decimal appears with a link to the codepoint in the Unihan Unicode database.
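The relationship between a decimal code, its U+ identifier, and its Unicode plane is simple arithmetic, sketched below. The plane is the code point divided by 65536 (hexadecimal 10000), so any value of 65536 or more lies outside plane zero. The function name `describe` is hypothetical, and the lookup URL pattern is an assumption about how a link to the Unihan database might be formed, not necessarily the link used in the texts.

```python
# Assumed URL pattern for a Unihan database lookup; not taken from the project.
UNIHAN_LOOKUP = "https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint="


def describe(decimal: int) -> dict:
    """Return the U+ identifier, plane, and a lookup link for a code point."""
    if not 0 <= decimal <= 0x10FFFF:
        raise ValueError(f"{decimal} is outside the Unicode code space")
    return {
        "decimal": decimal,
        # U+ notation: uppercase hexadecimal, padded to at least four digits.
        "identifier": f"U+{decimal:04X}",
        # Each plane holds 65536 (0x10000) code points; plane zero is the first.
        "plane": decimal >> 16,
        "link": f"{UNIHAN_LOOKUP}{decimal:X}",
    }


if __name__ == "__main__":
    for code in (14199, 131648):
        info = describe(code)
        print(info["decimal"], info["identifier"], "plane", info["plane"])
```

For example, `describe(131648)` reports the identifier U+20240 in plane 2, while `describe(14199)` reports U+3777 in plane 0.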
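The following is a list of the characters in these texts that could not be indexed and are represented by their decimal codes. (The first two, U+3777 and U+3A3F, lie within plane zero, in CJK Unified Ideographs Extension A; the rest lie in plane two.)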
| Decimal Code Number | Unihan Unicode Identifier |
|---|---|
| 14199 | U+3777 |
| 14911 | U+3A3F |
| 131648 | U+20240 |
| 132235 | U+2048B |
| 133877 | U+20AF5 |
| 136917 | U+216D5 |
| 137739 | U+21A0B |
| 137754 | U+21A1A |
| 138038 | U+21B36 |
| 139460 | U+220C4 |
| 139989 | U+222D5 |
| 140060 | U+2231C |
| 142627 | U+22D23 |
| 142771 | U+22DB3 |
| 143365 | U+23005 |
| 144665 | U+23519 |
| 145528 | U+23878 |
| 148426 | U+243CA |
| 149393 | U+24791 |
| 149946 | U+249BA |
| 151349 | U+24F35 |
| 153207 | U+25677 |
| 153440 | U+25760 |
| 154007 | U+25997 |
| 154501 | U+25B85 |
| 154698 | U+25C4A |
| 154748 | U+25C7C |
| 155827 | U+260B3 |
| 157107 | U+265B3 |
| 164976 | U+28470 |
| 165138 | U+28512 |
| 166235 | U+2895B |
| 167575 | U+28E97 |
| 167578 | U+28E9A |
| 168199 | U+29107 |
We will not be able to index these 35 characters. However, in addition to reindexing with XML that better reflects the original encoding, we hope to digitize microfilm of the original documents, where available, and provide links to the images synchronized with the text.
Project Personnel: Student assistants in Digital Library Services performed much of the work for this project under the supervision of DLS staff (especially Shannon Davis). Scott Paul McGinnis, a master’s degree candidate at Washington University (now in the doctoral program at the University of California, Berkeley), created XML templates for each case document using TEI P5 markup. Once the templates were complete, a number of other students, including Yanning Wang (now Professor of Chinese at Florida State University) and Chun-yu Lu (currently a doctoral student at Washington University), transcribed the cases and corrected any missing or problematic characters. The collection is indexed in DLXS and is available for full-text searching in Chinese, with XML elements searchable in English.