By Charles Muller
May 25-26, 2001
13:30-15:30 Session 1: Textual Analysis (Chair: Charles Muller)
Prof. Braarvig, a specialist in early Mahayana literature, has been doing detailed philological work for many years, focusing especially on the creation of parallel combined editions of scriptures, in which the variant renderings of a scripture in the classical Buddhist languages (along with a modern English translation when available) are collated by passage. While Prof. Braarvig's paper publications developed along these lines are already quite valuable, he demonstrated for us how the usefulness of such editions can be dramatically enhanced by their conversion into digital format. His project, entitled Thesaurus Literaturae Buddhicae (TLB), is aimed toward the production of a complete collection of collated versions of this kind, with eventual availability on the internet. Searching one version (e.g. the Sanskrit) will also give access to the parallel versions (e.g. Chinese, Tibetan, etc.), and thus the thesaurus can provide complete lexicographical access to the whole Buddhist multilingual canon with references. [paper]
"Classifying the Genealogies of Variant Editions in the Chinese Buddhist Corpus: N-gram Based System for Variant Document Comparison and Analysis (NGSV)" Ishii Kosei, Komazawa Junior College, Japan (kosei[at]ceres.dti.ne.jp)
Prof. Ishii's presentation, like the one above by Prof. Braarvig, was one of a series of examples of the variety of things that one may do with literary texts once they are in digital format. In this case, the focus was on the usage of the N-gram program, which is capable of performing various kinds of analysis on digitized textual corpora. The basic function of N-gram is to identify, count, and tabulate word strings of various lengths in a document, or set of documents. In so doing, the researcher can offer statistics that reveal precise characteristics of the writing of a particular author or tradition. Prof. Ishii went on to show how these statistical analyses could be used to provide hard evidence regarding the authenticity, lineage, provenance, or other problematic point regarding a given text. The N-gram program itself is still in a rather early stage of development by its authors, so even more useful functionality can be expected for the future.
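The basic counting operation described above can be illustrated with a short sketch. This is not the N-gram program shown at the meeting; it is a minimal Python illustration, assuming that n-grams are taken over characters (a common choice for unsegmented Chinese text). The function names and the sample strings are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping run of n characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_table(text, sizes=(2, 3)):
    """Tabulate n-gram counts for each requested length."""
    return {n: Counter(char_ngrams(text, n)) for n in sizes}


def shared_ngrams(text_a, text_b, n):
    """Count the n-grams that two texts have in common (minimum of the two counts)."""
    return Counter(char_ngrams(text_a, n)) & Counter(char_ngrams(text_b, n))


# Illustrative sample strings only, not taken from any edition discussed here.
sample_a = "如是我聞一時佛在"
sample_b = "如是我聞一時世尊"

print(ngram_table(sample_a)[2].most_common(3))
print(shared_ngrams(sample_a, sample_b, 3))
```

Comparing the shared n-gram counts of several documents is one simple way such statistics could feed into questions of authorship or lineage; a real study would of course need far larger samples and careful normalization.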
With a digitized version of the Yogācārabhūmi-śāstra, along with all of its related summaries, outlines, commentaries, etc., now readily on hand, it has become possible to create a specialized program that allows for in-depth computer-aided study of a particular text(-family). Ven. Huimin demonstrated the application for the YBh that he has developed in collaboration with a team of colleagues in Taiwan. This version of the YBh, which will be released on CD, started with the markup of the contextual features of the text, including all of the various books, articles, and so forth. This was followed by a structural markup, which treats the various translations of the YBh and the content of their outline books as an ordered hierarchy instead of as linear texts, structuring the original chapters to a depth of 20 levels. Ven. Huimin's team then identified and tagged non-structural features of the YBh and related documents, relying on dictionary entries, such as those in the Yogacam Dictionary, as an index. This preparation eventually allowed for the creation of an application that provides full cross-referencing, with all related passages appearing in various windows. This is intended as a prototype to demonstrate the kind of technology that can be applied to any text.
The SAT project, which is digitizing the Taishō shinshū daizōkyō in Japan, is aiming at the construction of a highly accurate new electronic Buddhist canon. In the work of digitizing the Taishō canon, SAT has already dealt with many of the basic problems, such as that of encoding and missing characters. However, the digitization of the Taishō often involves much more than dealing with plain text, as there are numerous instances of tabular materials, talismanic logographs, mixed text-and-art, and so forth. Thus, a way must be found to properly encode this spatial information in the digitized text. Mr. Moro provided us with a few examples of problematic pages, such as Euisang's Chart of the Dharma-realm of the Single Vehicle of the Huayan (Hwaeom ilseung beopgyedo), and documents that include scores and other complex shapes as found in volume 84 of the Taishō. He showed how these could be treated through the markup strategy called SVG (Scalable Vector Graphics).
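To give a sense of how spatial layout can be carried in SVG, here is a minimal Python sketch. It does not reproduce the markup Mr. Moro presented; it only assumes that each character of a chart has a known grid position and writes one SVG text element per character. The function name, the cell size, and the placement of the sample characters are all hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def chart_to_svg(cells, cell_size=40):
    """Build an SVG string from (row, column, character) tuples.

    Each character is centred horizontally in its grid cell, so the
    spatial arrangement survives alongside the text itself.
    """
    rows = max(row for row, _, _ in cells) + 1
    cols = max(col for _, col, _ in cells) + 1
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(cols * cell_size),
        "height": str(rows * cell_size),
    })
    for row, col, char in cells:
        text = ET.SubElement(svg, "text", {
            "x": str(col * cell_size + cell_size // 2),
            "y": str(row * cell_size + cell_size // 2),
            "text-anchor": "middle",
        })
        text.text = char
    return ET.tostring(svg, encoding="unicode")


# Hypothetical placement of three characters, for illustration only.
print(chart_to_svg([(0, 0, "法"), (0, 1, "性"), (1, 0, "圓")]))
```

Because the characters remain ordinary text inside the SVG, they stay searchable while their positions are preserved.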
16:00-18:00 Session 2: Canonical Collections (Chair: John Lehman)
Robert Chilton appeared once again at the EBTI as the representative of the Asian Classics Input Project. Started in 1987, the Asian Classics Input Project (ACIP) is an ongoing effort to preserve and disseminate important classical Tibetan literature in digital format. Approximately 45,000 pages have been input from woodblock prints of the Kangyur and Tengyur collections of classical Sanskrit literature in Tibetan translation. The works in these collections, dating from the period of 500 BCE up to 900 CE, set forth some of the most significant ideas of Asian thought and culture. ACIP has also input more than 75,000 pages of native Tibetan writings, based on these two collections and dating from 1000 CE to the present, which cover such topics as philosophy, ethics, logic, epistemology, psychology, hermeneutics and metaphysics. The Project has cooperated with local institutions to create comprehensive electronic catalogs of the extensive but relatively inaccessible Tibetan collections located in St. Petersburg, Russia and Ulaanbaatar, Mongolia. Over 70,000 separate titles have been cataloged to date, with work projected to continue for another decade or more. ACIP materials are distributed without charge on diskette, CD-ROM, and via the Internet.
The Research Institute of Tripiṭaka Koreana, one of the early members of the EBTI, became, in 1995, the first group to complete input of an East Asian Buddhist canon. Since that time, the RITK has been working to refine this data and to develop various methods of delivery. Dr. Hur demonstrated for us a beta version of their latest CD-ROM. In addition to offering the full Korean Tripitaka in Unicode text format, this CD includes advanced search functions, text comparison functions, as well as a standard Chinese-Korean dictionary and a Buddhist Chinese-Korean dictionary. It includes a self-installation program for Windows in both Korean and English. What we saw was still a beta version, but a fully functional version is expected to be available later this year, providing us with full access to the Korean Canon. The RITK will be continuing to develop new implementations of their KT system in the years to come.
Here we again heard from a project that completed input of a canon (in this case the Pali Canon) quite some time ago. As Mr. Chavan informed us, the VRI is continuing to develop its resources. They are continuing to do input, both of Pali literature, which they are endeavoring to make available from a single source, and of Sanskrit Buddhist texts, which they are now also beginning to input. The inclusion of lexicographical lookup tools is gradually becoming standard in all of the canon delivery projects, and the VRI version is no exception: they are now working to incorporate a comprehensive Pali-Sanskrit-Burmese-Hindi dictionary, which they expect to be available in the next version of their CD. Looking toward the future, they are, again like other canon projects, beginning to plan the development of an interlinked version of the tripitaka, which will be connected with the East Asian versions where appropriate. The latest version of their CD (ChaṭṭhaSaṅgāyana CD-ROM version 3) is available through their website at http://www.vri.dhamma.org.
Our final canon project presentation was made by Mr. Aming Tu of the Chinese Buddhist Electronic Text Association (CBETA, http://www.cbeta.org). Started only three years ago, the CBETA project has succeeded in delivering the full Chinese portion of the Taishō canon (Vol. 1-55 and 85). They are now working hard at developing various strategies to broaden the usability of this material. Based on a single set of XML-marked source files, they are making available on CD an HTMLHelp version and an HTML version, while other versions, such as plain-text versions in a variety of encodings (Shift-JIS, Big5, UTF-8 and GBK) and formats, as well as an MS-Word version, can be created from the XML source upon installation. As with the above-mentioned Korean and Pali canons, CBETA is gradually adding a variety of tools and other functions useful for those who want to conduct research on this material, including search capabilities and various dictionary lookup capabilities. CBETA will continue to add further enhancements to this project.
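The idea of deriving several plain-text versions from one XML source can be sketched as follows. This is not CBETA's own conversion tool; it is a minimal Python illustration using the standard library codecs, and the element names in the sample XML are hypothetical. Characters that a target encoding cannot represent are written as numeric character references, which is only one of several possible fallback strategies.

```python
import xml.etree.ElementTree as ET

# Encoding labels from the report mapped to Python codec names.
TARGETS = {
    "Shift-JIS": "shift_jis",
    "Big5": "big5",
    "UTF-8": "utf-8",
    "GBK": "gbk",
}


def export_plaintext(xml_source):
    """Strip the markup from an XML string and encode the text in each target encoding."""
    root = ET.fromstring(xml_source)
    text = "".join(root.itertext())
    return {
        label: text.encode(codec, errors="xmlcharrefreplace")
        for label, codec in TARGETS.items()
    }


# Hypothetical sample document, for illustration only.
versions = export_plaintext("<text><p>如是我聞</p></text>")
for label, data in versions.items():
    print(label, len(data), "bytes")
```

Keeping a single marked-up source and generating every other format from it means that a correction made once in the XML propagates to all derived versions.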
18:00-19:30 Reception (Main Room, Sangnok Hall)
On Friday night, we were treated to a delicious banquet, capped by an enthralling performance by the Shim Ka-Hee Kum dance group, a first-class troupe that specializes in traditional Korean forms of dance and percussion. Especially impressive was the final performance, in which three drummers in separate cubicles were led by a fourth drummer, with the first three swirling, dancing, and beating an array of drums in perfect synchronization with each other.
May 26, 2001 (Saturday)
10:00-12:00 Session 3: Online Reference Works (Chair: Christian Wittern)
"TBRC and Its Model for Linking Text Images with a Bio-Bibliographical Finding Database" Fred Coulson, Tibetan Buddhist Resource Center, USA (coulson[at]iname.com) [project founder and organizer: E. Gene Smith]
As the first presenter on the second day of the conference, Fred Coulson, representing the Tibetan Buddhist Resource Center (presenting at EBTI for the first time), explained the aims of this project, which was founded by E. Gene Smith for the purpose of providing access to the corpus of Indo-Tibetan Buddhist religious culture. During the year 2000, they developed "TBRCDat", a model for the presentation of a bio-bibliographical dictionary of Indo-Tibetan Buddhism linked to the scanned images of texts that they have begun to deliver. This model epitomizes the following principles: universal access (by avoiding proprietary standards), extensibility (by employing a flexible relational database architecture) and inter-operability (through the use and development of lightweight open-source programming code). In addition to providing a service to teachers, scholars, and students of Indo-Tibetan Buddhism, TBRCDat will also serve as an automated information clearing-house for other Web-based data providers in related fields. Currently, such providers (including TBRC) offer their resources in isolation. TBRCDat, however, is being developed with an eye to overcoming this isolation by collaborating with other projects to develop standard XML Document Type Definitions. By thus facilitating the interchange of data, TBRCDat can serve as a nucleus for a vast multi-media research tool of truly international scope. The TBRC is also working in close partnership with other Tibetan-studies data providers to develop a working standard for dynamic information interchange. The beginnings of this proposal may be viewed at http://tbrc.org/proto-howto.php3.
The Tibetan and Himalayan Digital Library (http://faculty.virginia.edu/tibet-initiative/library/), founded by Prof. David Germano of the University of Virginia, is a unique initiative aimed at building a digital repository of information on the Tibetan and Himalayan cultural region. Interactive by nature, the THDL is creating the mechanisms whereby scholars can both access and contribute information on the area's environment, culture, and geography. The Digital Library is thus compiling several multi-media databases. The environmental and cultural databases are organized thematically. However, one can also access the information through the geographical database by navigating through GIS-based maps of the plateau to a particular area and accessing video-clips, sound-files, and textual data about any aspect of that region. The THDL also includes Textual Collections, such as the Samantabhadra Collection, which focuses on The Collected Tantras of the Ancients (rnying ma rgyud bum). For such collections, the Library has developed a new SGML/XML mark-up system specifically suited to publishing Tibetan Buddhist literature over the internet. With this new system, which accounts for the peculiarities of Tibetan religious literature and provides powerful searching mechanisms, they are compiling a master catalogue of the six major editions of this important and scantly studied canon.
Charles Muller, who has made presentations on his combined web dictionaries at the EBTI on a few occasions in the past, gave an update report on advances made with the dictionaries during the past couple of years, primarily focusing on the Digital Dictionary of Buddhism (DDB). The major advances include: (1) an increase in the number of entries, since the last EBTI, from 4200 to 9500. This increase is mainly due to work done by graduate students, paid for by grant support. (2) A new, more complete form of direct access to the XML source files. This new XML version of the dictionaries was made possible due to the programming efforts of Dr. Michael Beddow, who, using Perl, brought about XPath and XLinking functionality. He also developed a search engine capable of handling mixed Chinese/Roman text in a UTF-8 encoding environment. (3) The integration (also thanks to Dr. Beddow) of a large, composite index of CJK lexicographical sources into the DDB search process. Thus, users can search this index while searching the DDB itself. Muller is continuing active pursuit of this project. [paper]
Michel Mohr's presentation was aimed at issues related to the construction of a database including Buddhist figures and their written legacy. One of the projects of the IRIZ "Zen Knowledge Base" (initiated by Urs App at Hanazono) was to establish a unique ID number for each Chan/Seon/Zen figure, thus laying the basis for creating links between authors and extant documents. Having brought to fruition some of the basic stages of this project, Dr. Mohr elaborated on some of its constituents and problematic issues. Of special interest to scholars will be the critical assessment of traditional lineage accounts, which are, in so many cases, nothing more than "tradition" and based only negligibly on concrete facts. Facts have been skewed since the outset of the construction of lineage charts due to causes as simple as calendrical calculation errors. Thus, the entire theory of how to develop links and use the ID numbers needs to be thought out carefully. Dr. Mohr offered some concrete examples, however, of how useful this kind of system could be, as he called up the names of a couple of Zen figures, along with their attributed works, lineage affiliations, and related scholarship.
13:30-16:00 Session 4: Characters and Encoding (Chair: Robert Chilton)
Christian Wittern, who is invariably working at the cutting edge of the newest digital technology, again introduced EBTI delegates to a relatively new approach to organizing and analyzing digitized textual data, the so-called Topic Map. The Topic Map is an SGML/XML document in which different element types are used to represent topics, occurrences of topics, and relationships (or 'associations') between topics, thus providing a model and architecture for the semantic structuring of information networks. In applying topic maps to texts of the Chinese Chan School, he is trying to use them as a means to encode information in a number of 10th to 13th century Chan chronicles. With the TEI markup, the basic structure of the documents and features such as names of persons and places, datable events and the like have been marked. Topic maps are now used to encode the following features in the text: (1) topics and occurrences of Chan masters and other persons, places, and datable events; (2) quotations and allusions occurring in the text; (3) interjections and comments of later Chan masters relating to a given anecdote; (4) information about the lineage of Chan masters; (5) other external information related to these topics; and (6) links to other resources relevant to topics occurring in the texts. The development of Topic Maps will allow researchers to formulate and explore questions concerning the material in a way that is much closer to their needs than is the case with current information retrieval technology.
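The three building blocks named above (topics, occurrences, and associations) can be modelled in a few lines. The following Python sketch is not the SGML/XML topic map format itself and not the encoding used in the project; it is a simplified in-memory model, and the class names, identifiers, and the occurrence location string are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    topic_id: str
    name: str
    occurrences: list = field(default_factory=list)  # locations in the texts


@dataclass
class Association:
    kind: str    # e.g. "lineage" or "quotation"
    roles: dict  # role name -> topic id


class TopicMap:
    def __init__(self):
        self.topics = {}
        self.associations = []

    def add_topic(self, topic_id, name):
        self.topics[topic_id] = Topic(topic_id, name)

    def add_occurrence(self, topic_id, location):
        self.topics[topic_id].occurrences.append(location)

    def associate(self, kind, **roles):
        self.associations.append(Association(kind, roles))

    def related(self, topic_id, kind=None):
        """Return (role, other topic id) pairs from associations involving topic_id."""
        found = []
        for assoc in self.associations:
            if kind is not None and assoc.kind != kind:
                continue
            if topic_id not in assoc.roles.values():
                continue
            for role, other in assoc.roles.items():
                if other != topic_id:
                    found.append((role, other))
        return found


tm = TopicMap()
tm.add_topic("mazu", "Mazu Daoyi")
tm.add_topic("baizhang", "Baizhang Huaihai")
tm.add_occurrence("baizhang", "chronicle-1#anecdote-12")  # hypothetical location
tm.associate("lineage", teacher="mazu", student="baizhang")
print(tm.related("baizhang", kind="lineage"))
```

Keeping the associations separate from the texts is what lets a question such as "who appears as the teacher of this master?" be asked directly of the map rather than of the raw text.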
Prof. Hong explained how several approaches have been attempted over the past several years in handling missing characters in the digitization of ancient documents written in Chinese characters, such as the Hanguk Pulgyo Chŏnsŏ [HPC]. Since 1998, the Electronic Buddhist Text Institute (EBTI) at Dongguk University has conducted a research project to digitize the HPC based on the Unicode standard, which can be accessed on the World Wide Web (www.ebti.dongguk.ac.kr). In building the database for storing web pages for the HPC, missing characters represented in the corresponding image files were also included in the HTML documents in the form of image tags, which are displayed alongside the Unicode characters by the web browser. Even though it was possible to retrieve parts of Chinese texts that include missing characters, until now it was not possible to retrieve documents by using keywords with missing characters. Thus, the retrieval system has been redesigned so that keywords including missing characters can also be entered into the keyword dictionary. That is, by storing keywords including the image tag for missing characters in the index table, one is now able to retrieve documents by using keywords with missing characters. Since these technical problems were resolved by redesigning the retrieval system, we can access the HPC on the WWW, regardless of missing characters.
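The principle of treating an image tag as one more character in the index can be sketched as follows. This is not the Dongguk retrieval system; it is a minimal Python illustration, assuming that each missing character appears in the HTML as a single img tag. The function names, file names, and sample documents are hypothetical.

```python
import re

# One token is either a whole <img ...> tag (a missing character) or one ordinary character.
TOKEN = re.compile(r"<img[^>]*>|.", re.S)


def tokenize(html_text):
    """Split text into character tokens, keeping each image tag as a single token."""
    return TOKEN.findall(html_text)


def contains(tokens, pattern):
    """True if the token sequence pattern occurs contiguously within tokens."""
    n = len(pattern)
    return n > 0 and any(
        tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1)
    )


def build_index(docs, keywords):
    """Map each keyword in the keyword dictionary to the ids of documents containing it."""
    index = {}
    for keyword in keywords:
        pattern = tokenize(keyword)
        index[keyword] = {
            doc_id for doc_id, text in docs.items()
            if contains(tokenize(text), pattern)
        }
    return index


# Hypothetical documents: d1 contains a missing character stored as an image tag.
docs = {
    "d1": '佛<img src="m001.gif">法',
    "d2": "佛法僧",
}
index = build_index(docs, ['佛<img src="m001.gif">', "佛法"])
print(index)
```

In this sketch a keyword matches only when its image tag is written exactly as in the document, so a real system would need to normalize the tags (for example to a stable character identifier) before indexing.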
Prof. John Lehman introduced a project that is based on the seven-year experience of the Academia Sinica Institute of Information Science Document Processing Laboratory in studying the missing character problem. It is intended to develop a packaging and processing system that is compatible with international standards for use in an Internet or Windows environment. The initial design constraints were to use XML markup as part of a system to allow the viewing and processing of documents containing missing characters by any personal computer capable of running Windows 98, and using standard software such as Microsoft Word. It requires no modification of the user's system or software, and provides facilities for sharing data between users. The project was funded by a grant from the Republic of China (Taiwan) Ministry of Education, and is expected to be further developed to run on a larger variety of systems.
The Institute for Medieval Japanese literature, headed by former Columbia professor Barbara Ruch, has been engaged in the task of digitizing the extensive records held in a number of nunneries located in the Kyoto area. This is a project that has not really gotten started yet, and so the presentation was aimed directly at introducing the primary issues that need to be dealt with, in the hope of obtaining some advice on how to go about it. The documents that will be digitized and archived will be scrolls written in scripts. Thus, the first problem is simply that of deciding on the best means of digitizing these as images, with the various methods including microfilm and photography, both analog and digital. The ensuing concern will be determining the most efficient means of marking up and categorizing the data. The members of this project heartily welcome suggestions from those who have expertise in this area.
This RITK project can be seen as the first of many possible derivative projects that will result from the technological advances made in the digitization of the Korean Tripitaka. Using the basic data and many of the same tools used in the digital Tripiṭaka Koreana, this project added XML markup to the basic textual data. Utilizing this markup, a prototype application has been developed in which the user who browses particular passages in the Lotus Sutra will have simultaneous access to its Taishō version and Sanskrit original, as well as to lexicons and other reference tools. This work is intended to serve as a model for the possibilities of what can be accomplished with every text in the canon, meaning, by extension, the eventual creation of a Korean-Chinese Unified Tripitaka (KCUT). The KCUT project, which is expected to begin around the end of 2001, will be carried out in cooperation with the Dongguk Yeoggyeongweon and the Dongguk Electronic Buddhist Text Institute.
16:30-17:30 EBTI Business Meeting (Chair: John Lehman)
This was an especially important business meeting, in which issues were discussed concerning the future course of the EBTI. In its short decade of existence, the EBTI has witnessed dramatic changes in the development of digital technology, along with a concomitant, across-the-board maturation of many of its member projects. Thus, it can certainly be said that the current scope of the EBTI lies far beyond that of its initial role of an "initiative." Indeed, the basic input of the important Buddhist canons (Tibetan, Pali, Korean, and Japanese) has been well accomplished. But these accomplishments have in turn led us to an equally challenging, and equally exciting, new phase of our task, for which the end is still far from visible.
While the EBTI will no doubt be lending advice and assistance to fledgling input projects for some time to come, we have also clearly reached the stage where the focus will come to bear on the wide range of possibilities of what can be done with the data now that it is input. Thus, the focus of each one of the canon projects at this meeting was no longer on issues of input, but on application. Now, however, we need to think not only about how applications can be developed, but about how they can be developed in such a way that they are interoperable. To implement these aims, a variety of proposals were made and accepted. These, along with other related matters, were as follows.
1. An effort will be made to develop a more centralized and tightly organized EBTI, offering a central web site and a secretariat, which can be the starting point for a unified interface for the EBTI projects. The Electronic Buddhist Text Institute of Dongguk University offered to host this web site and headquarters and to act as an ongoing central location. The staff members of this institute have secured the domain name of www.ebti.org for this purpose, and are beginning the process of applying to Dongguk University for formal recognition of this role, in order to obtain some measure of direct support from Dongguk.
2. The usefulness of the former arrangement of "regional representatives" was called into question. The suggestion was made that perhaps the central web site could simply offer contact addresses for representatives from different linguistic traditions, such as Korean, Chinese, Japanese, English, German, etc. This matter will be further explored by a working group, chaired by John Lehman.
3. A move was also made to establish a technical advisory committee, to which newer projects could turn for advice. Christian Wittern was nominated to head this committee.
4. Elections were held for new chairpersons. The new EBTI co-chairs are Ven. Han Bo-Kwang (Dongguk University) and Charles Muller (Toyo Gakuen University). Ven. Jongnim, former co-chair, was named as honorary co-chairman. We hope to continue to benefit from his valuable input.
5. The announcement of an invitation to the EBTI to participate in the September 2002 meeting of the Pacific Neighborhood Consortium (PNC) in Osaka was greeted with enthusiasm. Thus, the EBTI will fully participate in this meeting.
6. News will be forthcoming regarding the reports of working groups, and the efforts of the Dongguk EBTI to be formally accepted in their new status by the Dongguk administration.