Technical Overview

The Linguistic Discovery web site is designed by Barbara Knauff and programmed by Paul Merchant Jr., based on an original concept by Lenore Grenoble and Lindsay Whaley. Ann McHugo and Reinhart Sonnenburg provided organizational assistance. Site graphics were designed by Sarah Horton. Linguistic Discovery is the initial publication of the Dartmouth College Library Digital Publishing Program.

Introduction

This document is intended for those who are curious about the process behind producing the Linguistic Discovery electronic journal. We hope this overview will help you understand the variety of challenges facing Web-based publishing. If you have more specific questions about the design of this journal, please contact us through our feedback form.

Throughout the development of the journal we have had to make a number of choices between competing technologies. In many cases there was a pleasant variety of alternatives, any one of which would suffice. In other instances there was no solution that would completely solve the problem. As we made compromises in the design we endeavored to identify as many alternative designs as possible. Our site is constructed so that these alternative designs can be incorporated into the site in the future with minimal effort.

Project Goals

Linguistic Discovery has been designed with these goals in mind:

Freely accessible: Linguistic Discovery should be freely and easily accessible to anyone with access to the Internet.
Web-based content: The content of Linguistic Discovery is designed for the Web and should take advantage of the medium to add value to the journal. It is not intended to be a web representation of a print journal.
Subscriber Notifications: It must contain a feature for subscribers to receive notification of new issues and articles.
Multimedia content: The Internet opens up the possibility of delivering more than just a text-based journal. Linguistic Discovery should promote the use and sharing of sound and video among linguists.

Constraints

Linguistic Discovery has been implemented with a minimal budget on a very short time frame. Articles were collected and edited and the entire site designed and programmed in just over 5 months.

The Editorial Process

The production of an issue of Linguistic Discovery follows this process:

Authors submit articles in digital format over E-mail or on a disk.
Using E-mail, the editors send the documents to the members of the editorial board for review.
The editorial board returns comments and a recommendation for publication via E-mail. These are forwarded to the authors via E-mail.
If an article is accepted for publication, the author submits a revised copy of the article via E-mail. A printed copy is also requested so the editors can verify special symbols have been transferred properly.
The editors review and edit the papers for style and format, then generate the final published versions of the document.
The articles are added to the Linguistic Discovery database through a Web-based administrative interface.
After a complete issue has been loaded, the issue is marked as published and becomes immediately available to the public.
Using the Web site, the editors send an E-mail message to all readers who have requested notice of new issues.

The Tools We're Using

One goal of Linguistic Discovery was to have a production process that is simple enough that some of the steps could be carried out by minimally trained staff or students. Our choice of tools reflects our desire to use as much off-the-shelf software as possible, to use software that was familiar to the editorial staff, and to make the steps of the process as easy as possible.

Editing Articles

Microsoft Word is used to edit the articles and prepare them for publication. Word has such a large installed base that we expect most authors will already be using it to write their articles, which will simplify the editorial process (but see the discussion of transferring files between Macintosh and Windows). Word also runs on both Macintosh and Windows computers, which is important on the Dartmouth campus.

The markup process to prepare articles to be loaded into Linguistic Discovery is done by applying a Word stylesheet to the document. Styles are defined to approximate the appearance of the final HTML so that the editors have an early preview of the final document. We chose a stylesheet approach rather than a tagging approach because we felt that manually entering individual tags is too tedious and error-prone a process. Various tools are being developed which will simplify XML encoding of documents, but in the time frame of the Linguistic Discovery project we felt we needed to make use of existing, proven tools that were familiar to the editors.

Producing Web Content

To be a truly web-based journal, we felt that Linguistic Discovery articles should be represented primarily in HTML. Web technology has only barely arrived at a state where producing such complicated HTML markup is possible, so we also include the more traditional PDF document format for each article. While it is easier to control the appearance of a PDF document, PDF is closely tied to print format and is not as easy to read and manipulate online.

After the Word documents are marked up with the appropriate styles, two versions are saved. As the first step toward the HTML version of the document, an RTF version is saved directly from Word. A PostScript version is also created using the "print to file" feature of the Macintosh LaserWriter driver. The PostScript version is converted into a PDF document using Adobe Acrobat Distiller.

Using R2Net, a conversion tool by LogicTran, we convert the RTF document into an HTML document. R2Net is inexpensive, capable of producing a variety of document formats, including XML or XHTML, and runs on Windows, Macintosh and UNIX. The conversion process is controlled by several configuration files which can be edited with any text editor, allowing us to refine the conversion process as necessary. To run a conversion, the user merely selects the RTF file to be converted and chooses a name for the resulting HTML file.

Font Filter Process

After the HTML versions of the document are produced, text in the phonetic font must be replaced with GIF images. (See Phonetic Fonts on the Web, below.) This is a three-part process. First, through a Web form, the text is sent through a filter. The filter, written in Java, identifies all of the symbols in words that contain characters in the phonetic font. (Due to diacritics, symbols may span several characters and contain characters in normal font, phonetic font or both.) If a GIF image for the symbol already exists, the URL for the image is substituted. If it does not exist, the symbol is added to a list of needed symbols. After the entire document is filtered, the list of needed symbols is produced.
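The substitution step can be sketched roughly as follows. This is a minimal illustration, not the journal's actual filter: the class name, the `<span class="ipa">` marker for phonetic-font runs, and the image URLs are all assumptions, and each code point is treated as one symbol, which ignores the multi-character diacritic sequences mentioned above.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontFilterSketch {
    // Hypothetical marker for phonetic-font runs; the real input markup is not described.
    private static final Pattern PHONETIC_RUN =
            Pattern.compile("<span class=\"ipa\">(.*?)</span>");

    /** Result of one pass: the rewritten HTML plus the symbols that still need images. */
    static final class Result {
        final String html;
        final Set<String> needed;
        Result(String html, Set<String> needed) {
            this.html = html;
            this.needed = needed;
        }
    }

    static Result filter(String html, Map<String, String> imageUrls) {
        Set<String> needed = new LinkedHashSet<>();
        StringBuffer out = new StringBuffer();
        Matcher m = PHONETIC_RUN.matcher(html);
        while (m.find()) {
            StringBuilder replacement = new StringBuilder();
            // Simplification: one code point is one symbol.
            m.group(1).codePoints().forEach(cp -> {
                String symbol = new String(Character.toChars(cp));
                String url = imageUrls.get(symbol);
                if (url != null) {
                    // An image already exists: substitute its URL.
                    replacement.append("<img src=\"").append(url).append("\" alt=\"\">");
                } else {
                    // No image yet: record the symbol and leave it marked for a later pass.
                    needed.add(symbol);
                    replacement.append("<span class=\"ipa\">").append(symbol).append("</span>");
                }
            });
            m.appendReplacement(out, Matcher.quoteReplacement(replacement.toString()));
        }
        m.appendTail(out);
        return new Result(out.toString(), needed);
    }

    public static void main(String[] args) {
        Map<String, String> urls = new HashMap<>();
        urls.put("a", "/symbols/a.gif"); // hypothetical image location
        Result r = filter("The word <span class=\"ipa\">ab</span> is glossed here.", urls);
        System.out.println(r.html);
        System.out.println("Needed symbols: " + r.needed);
    }
}
```

Leaving unmatched symbols marked means the same document can be filtered again once the missing images have been generated and uploaded.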

The editor saves this list of needed images using the save command from the web browser, then runs an image generation program on a Macintosh to create Macintosh PICT files containing the needed images. The process requires the editor to select the symbol list file and a folder in which to save the PICT files. These PICT files are converted into GIF images using Adobe Photoshop, which is licensed to create GIFs.

The final GIFs are FTP'd to our web site using the Macintosh application Fetch. Fetch allows the user to select a folder full of files for uploading with a single command.

The Database Manager

From the beginning of the project we realized we needed some kind of database manager to support the subscriber list feature. Dartmouth has a site license to Oracle, which is used heavily in the administrative functions of the college, so it became the obvious choice in terms of support and availability. As we developed the web site and examined our searching needs, it became clear that the interMedia Text indexing system provided with Oracle 8 meshed well with our needs. InterMedia Text offers a stem function that matches multiple forms of a word. For example, if a user searches for "language", the stem function would match "language" and "languages". We felt that users would be expecting this kind of behavior due to the widespread popularity of web search engines like Google. InterMedia Text can also index HTML and PDF documents directly, making it the ideal choice for our indexing needs.
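As an illustration of how a stemmed search might be expressed, the sketch below builds a query expression in which each word is prefixed with `$`, the stem operator in Oracle's text query language. The table name `articles`, the columns `article_id` and `body`, and the class itself are hypothetical; the site's real schema and query code are not described here.

```java
class StemQuerySketch {
    // Hypothetical table and column names; the expression is bound as the parameter.
    static final String SEARCH_SQL =
            "SELECT article_id FROM articles WHERE CONTAINS(body, ?) > 0";

    /** Turns free-form user input into a stemmed query expression. */
    static String stemExpression(String userInput) {
        StringBuilder expr = new StringBuilder();
        for (String word : userInput.trim().split("\\s+")) {
            // Keep letters and digits only, so punctuation cannot act as a query operator.
            String cleaned = word.replaceAll("[^\\p{L}\\p{N}]", "");
            if (cleaned.isEmpty()) {
                continue;
            }
            if (expr.length() > 0) {
                expr.append(" AND ");
            }
            expr.append('$').append(cleaned); // '$' requests stem matching for this word
        }
        return expr.toString();
    }

    public static void main(String[] args) {
        System.out.println(SEARCH_SQL);
        System.out.println(stemExpression("language"));
    }
}
```

In a real application the resulting expression would be bound to the `?` placeholder through a JDBC `PreparedStatement` rather than concatenated into the SQL text.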

With each article record we include additional data such as an abstract, descriptors, author's name, and internal notes. This information is entered manually (though it can usually be copied from the original article), as we felt that the detailed markup needed for automatic extraction would be more work than copying and pasting the information into the web form.

WebObjects

We looked at several options for producing the dynamic HTML pages for our web site. Keeping in mind cost, availability and development time, we compared Perl, C/C++/Java, WebObjects (an Apple product) and PHP. WebObjects and PHP both appeared adequate for implementing the site. We chose WebObjects because of its project organization and underlying technology. WebObjects applications are written in Java and so may be hosted anywhere Java is available. The database connectivity is provided through the Java Database Connectivity (JDBC) interface, making it compatible with any database manager for which there is a JDBC driver. WebObjects code is stored separately from the HTML template files, allowing pages to be designed in parallel with the application code. Finally, WebObjects includes a number of tools that provide a graphical way of manipulating the web pages and the database structure. These tools run on both Macintosh and Windows.

The Problems We Encountered

Choice of Phonetic Fonts

A variety of commercial phonetic fonts exist and are in use. However, before we fully understood the problems of incorporating phonetic symbols into web pages, we chose the freely available SIL phonetic fonts so that readers would not need to pay for a font to read the journal. When we realized we needed to use an image solution, SIL gave us permission to create images using their font.

Choice of Document Formats

Many journals employ PDF to deliver their articles over the Internet. While it is clear from our experience with cross-platform font issues that PDF is much easier to produce, we wanted to make it easy for users to view our journal online. PDF documents reproduce a print version of a document, which we did not want for an online journal. In addition, PDF files are generally more difficult to view online, depending on the browser configuration. We chose to include PDF versions of articles in our journal for users who wish to print high quality copies of an article, but we also wanted to offer an HTML version that would be easier to read and view online.

XML is the latest trend in information exchange, and indeed is very well suited for that purpose. However, we intended from the beginning for Linguistic Discovery to be readable within a web browser, and the current support for XML within browsers is marginal. We're also not aware of any graphical XML editors that might be suitable for non-technical staff to use. If we were to produce XML documents, we'd still need a way to produce a browser-renderable format from the XML, which means introducing yet another intermediate document format to work with.

We finally settled on producing a pure HTML version of each article. Tools have existed for several years to produce HTML from RTF documents, which can be easily saved from Microsoft Word documents. Proven tools also exist to manipulate HTML, and by generating HTML directly, we gave the editors the ability to directly manipulate the final document instead of having to guess how a marked up text would be transformed into HTML. Since we do our markup on text before it is converted into HTML, we leave open the possibility of generating XML at a later time from the RTF document. We also have reserved the ability to generate XML from the article database using any of the Java libraries supplied with WebObjects or Oracle.

File Transfer Between Macintosh and Windows

Macintosh and Windows computers use different character sets to represent text. (A character set is a set of numerical values used for encoding text and other symbols.) Microsoft Word ordinarily translates a document created on one platform into the character set of the other when it is opened on the other platform. For normal text, this makes sense. However, fonts such as the SILDoulos IPA 93 font that contain graphics symbols use the same codes on both Windows and Macintosh. Translating the codes for characters in this font results in a garbled document when moving from one platform to the other. A variety of fonts besides the SIL fonts fall into this category of Symbol fonts, and Microsoft Word on the Macintosh maintains a list of these fonts. When it encounters text in a font in that list, it does not change the codes for that text. On the Windows side, the character set represented by the font is part of the font information.

The list of symbol fonts that Word maintains on the Macintosh is not updated during the SIL font installation process, and unless authors transfer a document between Windows and a Macintosh, they will not even be aware that Word will blindly translate their phonetic symbols. For the editors of Linguistic Discovery, who have no control over the system on which an article is created, this is a significant problem.

Manually adding the SIL font to the table of symbol fonts did not solve the problem because that table also affects the codes Word generates when characters are typed in the font. If we defined the SIL font on a Macintosh as a symbol font, we would be able to open documents created on Windows, but not documents created on a Macintosh that did not have the definition in place.
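The translation problem described in this section can be illustrated with a small sketch. It is not Word's actual logic: the class and method names are hypothetical, and the table holds only three of the many code pairs that differ between the Mac OS Roman and Windows-1252 character sets. For a text font, translating a code keeps the same letter; for a symbol font, the original code already selects the intended glyph on both platforms, so any translation selects a different glyph.

```java
import java.util.HashMap;
import java.util.Map;

class SymbolFontTranslationSketch {
    // A few Mac OS Roman -> Windows-1252 code pairs for the same letters.
    // The real character sets differ in many more positions.
    static final Map<Integer, Integer> MAC_TO_WINDOWS = new HashMap<>();
    static {
        MAC_TO_WINDOWS.put(0x8A, 0xE4); // a with diaeresis
        MAC_TO_WINDOWS.put(0x8E, 0xE9); // e with acute
        MAC_TO_WINDOWS.put(0x9F, 0xFC); // u with diaeresis
    }

    /**
     * Converts character codes from a Macintosh document for use on Windows.
     * Codes in a font on the symbol-font list are passed through unchanged.
     */
    static int[] translate(int[] macCodes, boolean fontIsOnSymbolList) {
        int[] out = new int[macCodes.length];
        for (int i = 0; i < macCodes.length; i++) {
            int code = macCodes[i];
            out[i] = fontIsOnSymbolList ? code : MAC_TO_WINDOWS.getOrDefault(code, code);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] codes = {0x41, 0x8A};
        // Text font: the second code is remapped so the same letter appears on Windows.
        System.out.println(Integer.toHexString(translate(codes, false)[1]));
        // Listed symbol font: the code is left alone, so the same glyph is selected.
        System.out.println(Integer.toHexString(translate(codes, true)[1]));
    }
}
```

A phonetic font that is missing from the list is handled by the first path, which is how its symbols come out garbled on the other platform.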

Phonetic Fonts on the Web

The character set translation issue also affects web applications. HTML documents are encoded in a character set that differs from both Windows and Macintosh. The character set can be different from one document to the next, but to be properly displayed the receiving browser must recognize the specified character set. Web browsers must translate characters from the document's character set into the user's computer's character set. This translation process is similar to the translation process when a Word document is transferred from one platform to another. As a result, we could not rely on font tags in the HTML document to accurately reproduce phonetic symbols in the articles. Instead we determined that GIFs were the most reliable way of delivering phonetic characters.

Role of Unicode

The Unicode project aims to assign a unique code to every printed symbol, and ultimately will provide a solution for delivering phonetic symbols on the Web. However, dependable support for Unicode is not yet widely available. Users will need Unicode-compatible browsers and fonts. Early support is found in current browsers, but the configuration is complex and the support is by no means complete. While we expect this situation to improve over the next few years, the current state of the art isn't compatible with our desire for free and easy access to the journal.
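For comparison, here is a minimal sketch of what Unicode-based delivery could look like once browser and font support is dependable. It is not something the journal currently does, and the class and method names are hypothetical. Each non-ASCII character is written as an HTML numeric character reference built from its Unicode code point, so the symbol no longer depends on a platform-specific font encoding.

```java
class UnicodeReferenceSketch {
    /** Replaces every non-ASCII character with an HTML numeric character reference. */
    static String toCharacterReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp); // plain ASCII passes through unchanged
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0283 is the IPA symbol esh and U+0259 is schwa.
        System.out.println(toCharacterReferences("\u0283\u0259"));
    }
}
```

The reader's browser would still need a font containing these characters, which is the availability problem described above.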

Stylesheet Support

A minor but important issue was the varied support for stylesheets. Different browsers and even different versions of the same browser vary in how exactly they follow the CSS standard. We were particularly concerned with how the browsers would draw and align inline GIF images. After much reading and experimentation we came to the familiar conclusion that we couldn't perfectly align the images on every browser that we expected readers to use. After looking at statistics from some of our other web sites, we concluded that our readers were most likely to be using Windows systems running Internet Explorer, so we focused our attention on producing a site that looked best on the latest versions of that browser. When alternative solutions to layout presented themselves we also considered which choice looked best on Netscape. However, we found that Netscape deviated from the formatting standard more often than Internet Explorer did and was thus more difficult to design for adequately.

Browser Performance

Still another compatibility problem is that of browser performance. Linguistic Discovery pushes and even passes the limits of Web technology. In the first issue, articles average around 200K of HTML code. In our compatibility testing we found that the latest versions of Netscape and Internet Explorer were quite capable of handling such large documents efficiently (aside from the 45-second download time on a 56k modem). Earlier versions, however, were much more problematic. In one extreme case, a version of Netscape running on a Macintosh appeared to freeze after receiving one of these lengthy documents. In reality, the application was simply in an inner loop formatting the document, but the poor performance of the application created the appearance of a system crash.

To avoid surprising users with this poor performance we've introduced a page between the table of contents and the text of each article that explains the implications of using older browsers. Users can then make an informed choice as to whether to wait for the lengthy formatting process (in addition to the download time) or to view the PDF document which is not ideal for online use.
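The download figure quoted above can be checked with a little arithmetic. The sketch below assumes that "200K" means 200 × 1024 bytes and that a 56k modem has a nominal rate of 56,000 bits per second; both are assumptions, as is the class name. Under them, the transfer takes roughly 29 seconds at the nominal rate, and a 45-second download corresponds to an effective throughput of roughly 36,000 bits per second, which is plausible for a real dial-up connection.

```java
class DownloadTimeSketch {
    /** Transfer time in seconds for a payload at a given line rate. */
    static double seconds(long bytes, double bitsPerSecond) {
        return bytes * 8.0 / bitsPerSecond;
    }

    /** Effective line rate implied by an observed transfer time. */
    static double impliedBitsPerSecond(long bytes, double seconds) {
        return bytes * 8.0 / seconds;
    }

    public static void main(String[] args) {
        long articleBytes = 200L * 1024; // assumed size of an average article
        System.out.printf("Nominal time at 56,000 bit/s: %.1f s%n",
                seconds(articleBytes, 56_000));
        System.out.printf("Rate implied by a 45 s download: %.0f bit/s%n",
                impliedBitsPerSecond(articleBytes, 45));
    }
}
```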

GIF Images

We are unable to produce GIF images within the font filter directly for two reasons. First, to support imaging the font characters in our Java filter would have required coding beyond the scope of this project. But more important, the GIF image format makes use of the patented LZW compression scheme and incorporating code to generate GIF images into the filter would be prohibitively expensive. At the time of development, Unisys required a $5000 license for sites incorporating GIF images generated by unlicensed software.

PNG is evolving as a possible replacement for the GIF format, but it is not yet widely supported in Web browsers. High quality JPEGs are too large to be practical given the number of images that are present in each article, and low quality JPEGs are too fuzzy for the symbols to be legible. We therefore resorted to a more complex procedure for generating the images, one that relies on available, fully licensed software.

Published by the Dartmouth College Library.
Copyright © 2002 Trustees of Dartmouth College.
For comments or feedback E-mail the site editor.
Page last updated Wednesday, February 27, 2002.
ISSN 1537-0852