About the Project

Project Structure

Professor Robert Hegel contacted DLS with the idea of doing a project similar to the Red Brush project. In that project, Professor Beata Grant had co-written a book of English translations and interpretations of texts by Chinese women writers, and wanted to create a digital resource of the original Chinese texts on which the translations and interpretations were based. Having recently completed his book, True Crimes in Eighteenth-Century China: Twenty Case Histories, Professor Hegel was interested in similarly creating a digital resource of the original Chinese texts on which his interpretations and translations were based.

There were many challenges to the creation of this resource. One significant obstacle was that each character would have to be keyed in by hand. While there were several talented and dedicated students with the language skills to do this, managing a project based in a foreign language and character set unknown to the full-time staff was problematic. Another issue was the character and structure of the documents themselves. The major XML text-encoding standard, the TEI (Text Encoding Initiative), expects the structure found in modern, published monographs, with elements such as title, table of contents, chapters and paragraphs. The documents in this collection do not conform at all to these expectations. Encoding them required someone with strong Chinese language skills, familiarity with the documents and enough knowledge of the TEI to structure the XML so that it reflects the original documents as closely as possible.

Scott Paul McGinnis, then a Master’s student of Professor Hegel, originally came to DLS to work on the Red Brush project. When the True Crimes project began, Scott was familiar enough with the basics of the TEI to begin researching its encoding guidelines at a deeper level to meet this challenge. Along with DLS staff (especially Shannon Davis, then Library Assistant and now Digital Projects Librarian), Scott created a document model for the texts. We adopted the workflow from the Red Brush project, in which XML templates based on the document model were created for each case, and transcribers keyed the content of the documents into them.
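As a rough illustration of the template workflow described above, the sketch below generates an empty TEI P5 skeleton for one case using Python's standard library. It is not the project's actual document model, which is not reproduced on this page: the function name make_case_template, the case identifier and the section types are all hypothetical, and only the basic TEI element names and namespace are taken from the TEI P5 guidelines.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace. Everything else below is an illustrative assumption.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def _q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def make_case_template(case_id: str, section_types: list[str]) -> str:
    """Build an empty TEI skeleton with one <div> per section type.

    The section types are hypothetical placeholders; a real document
    model would define its own divisions.
    """
    root = ET.Element(_q("TEI"))

    # Minimal header: title, publication and source statements.
    header = ET.SubElement(root, _q("teiHeader"))
    file_desc = ET.SubElement(header, _q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, _q("titleStmt"))
    ET.SubElement(title_stmt, _q("title")).text = case_id
    publication = ET.SubElement(file_desc, _q("publicationStmt"))
    ET.SubElement(publication, _q("p"))
    source = ET.SubElement(file_desc, _q("sourceDesc"))
    ET.SubElement(source, _q("p"))

    # Body: one empty division per section, ready for a transcriber.
    text = ET.SubElement(root, _q("text"))
    body = ET.SubElement(text, _q("body"))
    for section_type in section_types:
        div = ET.SubElement(body, _q("div"), {"type": section_type})
        ET.SubElement(div, _q("p"))

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical case identifier and section names.
    print(make_case_template("case-01", ["report", "testimony"]))
```

A transcriber would then open the generated file and key the Chinese text into the empty paragraph elements, so the structure is fixed in advance and only the content varies from case to case.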

Beta Project

While Scott was able, by and large, to represent the documents in TEI P5, with transcribers encoding characters in Unicode, there were two obstacles to indexing the documents and making them available online. First, as the documents can date to the 17th century, some of the characters are relatively obscure. Scott developed a workflow in which characters not found in the Microsoft IME could be looked up in the Unihan Radical-Stroke Index.

Each of these issues created distinct problems. While the TEI P5 was sophisticated enough to represent the documents in XML, the XML indexer used for the project, DLXS, not only shares the expectation of a modern, published book, but also defaults to simpler instances of such texts. We decided to make the content of the texts available sooner, at the expense of reflecting the sophisticated structure in which the documents were encoded. We hope to revisit this issue in the next 6 to 9 months and reindex the documents to better reflect this structure. To index the documents in this “beta release,” we have also sacrificed reliable bibliographic representation, including appropriate credit for all the individual transcribers of the documents.

Additionally, while DLXS was able to index the majority of characters, a handful of them do not fall in “plane zero” of Unicode and cannot be indexed in DLXS. As a workaround, we have represented each of these characters only by the decimal number it is assigned in Unicode. So while the characters themselves cannot be searched, it is still possible to search for them by their decimal numbers. In the texts, the decimal appears with a link to the code point in the Unihan database.
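To make the relationship between a decimal number and a Unihan identifier concrete, here is a minimal Python sketch. It relies only on how Unicode code points are defined (the U+ notation is the code point in hexadecimal, and each plane holds 65,536 code points); the function names are hypothetical, and nothing here describes how DLXS itself handles these characters.

```python
def to_unihan_notation(decimal_code: int) -> str:
    """Format a decimal code point in U+ notation (hexadecimal, at least 4 digits)."""
    return f"U+{decimal_code:04X}"


def unicode_plane(decimal_code: int) -> int:
    """Return the Unicode plane number; plane 0 covers code points 0 to 65535."""
    return decimal_code // 0x10000


def from_unihan_notation(identifier: str) -> int:
    """Convert an identifier such as 'U+20240' back to its decimal number."""
    return int(identifier.removeprefix("U+"), 16)


if __name__ == "__main__":
    # Two of the decimal numbers that appear in the texts.
    for decimal_code in (131648, 168199):
        print(decimal_code, to_unihan_notation(decimal_code), unicode_plane(decimal_code))
    # 131648 U+20240 2
    # 168199 U+29107 2
```

A reader who finds a decimal number such as 131648 in a text can convert it this way to the identifier U+20240 and look the character up in the Unihan database.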

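The following is a list of the characters in these texts that could not be indexed. All but the first two lie outside plane zero of Unicode; U+3777 and U+3A3F belong to the CJK Unified Ideographs Extension A block, which is part of plane zero.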


Decimal Code    Unihan Unicode Identifier
14199           U+3777
14911           U+3A3F
131648          U+20240
132235          U+2048B
133877          U+20AF5
136917          U+216D5
137739          U+21A0B
137754          U+21A1A
138038          U+21B36
139460          U+220C4
139989          U+222D5
140060          U+2231C
142627          U+22D23
142771          U+22DB3
143365          U+23005
144665          U+23519
145528          U+23878
148426          U+243CA
149393          U+24791
149946          U+249BA
151349          U+24F35
153207          U+25677
153440          U+25760
154007          U+25997
154501          U+25B85
154698          U+25C4A
154748          U+25C7C
155827          U+260B3
157107          U+265B3
164976          U+28470
165138          U+28512
166235          U+2895B
167575          U+28E97
167578          U+28E9A
168199          U+29107

We will not be able to index these 35 characters. In addition to reindexing with XML that better reflects the original encoding, however, we hope to digitize microfilm of the original documents, where available, and provide links to the images synchronized with the text.

Project Personnel

Student assistants in Digital Library Services performed much of the work for this project under the supervision of DLS staff (especially Shannon Davis). Scott Paul McGinnis, a master’s degree candidate at Washington University (now in the doctoral program at the University of California, Berkeley), created XML templates for each case document using TEI P5 markup. Once the templates were complete, a number of other students, including Yanning Wang (now Professor of Chinese at Florida State University) and Chun-yu Lu (currently a doctoral student at Washington University), transcribed the cases and corrected any missing or problematic characters. The collection is indexed in DLXS and is available for full-text searching in Chinese, with XML elements searchable in English.