The Linguistic Discovery web site is designed by Barbara Knauff
and programmed by Paul Merchant Jr., based on an original concept
by Lenore Grenoble and Lindsay Whaley. Ann McHugo and Reinhart Sonnenburg
provided organizational assistance. Site graphics were designed
by Sarah Horton. Linguistic Discovery is the initial publication
of the Dartmouth College Library Digital Publishing Program.
Introduction
This document is intended for those who are curious
about the process behind producing the Linguistic Discovery electronic
journal. We hope this overview will help you understand the variety
of challenges facing Web-based publishing. If you have more specific
questions about the design of this journal, please contact us through
our feedback form.
Throughout the development of the journal we have
had to make a number of choices between competing technologies.
In many cases there was a pleasant variety of alternatives, any
one of which would suffice. In other instances there was no solution
that would completely solve the problem. As we made compromises
in the design we endeavored to identify as many alternative designs
as possible. Our site is constructed so that these alternative designs
can be incorporated into the site in the future with minimal effort.
Project Goals
Linguistic Discovery has been designed with these goals in mind:
- Freely accessible: Linguistic Discovery should be freely
and easily accessible to anyone with access to the Internet.
- Web-based content: The content of Linguistic Discovery
is designed for the Web and should take advantage of the medium
to add value to the journal. It is not intended to be a Web representation
of a print journal.
- Subscriber Notifications: It must contain a feature for
subscribers to receive notification of new issues and articles.
- Multimedia content: The Internet opens up the possibility
of delivering more than just a text-based journal. Linguistic
Discovery should promote the use and sharing of sound and video
among linguists.
Constraints
Linguistic Discovery has been implemented with a minimal
budget on a very short time frame. Articles were collected and edited
and the entire site designed and programmed in just over 5 months.
The Editorial Process
The production of an issue of Linguistic Discovery follows this
process:
- Authors submit articles in digital format over E-mail or on
a disk.
- Using E-mail, the editors send the documents to the members
of the editorial board for review.
- The editorial board returns comments and a recommendation for
publication via E-mail. These are forwarded to the authors via
E-mail.
- If an article is accepted for publication, the author submits
a revised copy of the article via E-mail. A printed copy is also
requested so the editors can verify special symbols have been
transferred properly.
- The editors review and edit the papers for style and format,
then generate the final published versions of the document.
- The articles are added to the Linguistic Discovery database
through a Web-based administrative interface.
- After a complete issue has been loaded, the issue is marked
as published and becomes immediately available to the public.
- Using the Web site, the editors send an E-mail message to all
readers who have requested notice of new issues.
The Tools We're Using
One goal of Linguistic Discovery was to have a production process
that is simple enough that some of the steps could be carried out
by minimally trained staff or students. Our choice of tools reflects
our desire to use as much off-the-shelf software as possible, to
use software that was familiar to the editorial staff, and to make
the steps of the process as easy as possible.
Editing Articles
Microsoft Word is used to edit the articles and prepare them for
publication. Word has such a large installed base that we expect
most authors will already be using it to write their articles,
which will simplify the editorial process (but see the discussion
of transferring files between Macintosh and Windows). Word also
runs on both Macintosh and Windows computers, which is important
on the Dartmouth campus.
The markup process to prepare articles to be loaded into Linguistic
Discovery is done by applying a Word stylesheet to the document.
Styles are defined to approximate the appearance of the final HTML
so that the editors have an early preview of the final document.
We chose a stylesheet approach rather than a tagging approach because
we felt that manually entering individual tags is too tedious and
error-prone a process. Various tools are being developed that will
simplify XML encoding of documents, but in the time frame of the
Linguistic Discovery project we felt we needed to make use of existing,
proven tools that were familiar to the editors.
Producing Web Content
To be a truly Web-based journal, we felt that Linguistic Discovery
articles should be represented primarily in HTML. Web technology
has only barely arrived at a state where producing such complicated
HTML markup is possible, so we also include the more traditional
PDF document format for each article. While it is easier to control
the appearance of a PDF document, PDF is closely tied to print format
and is not as easy to read and manipulate online.
After the Word documents are marked up with the appropriate styles,
two versions are saved. To produce an HTML version of the document,
an RTF version is saved directly from Word. A PostScript version
is also created using the "print to file" feature of the
Macintosh LaserWriter driver. The PostScript version is converted
into a PDF document using Adobe Acrobat Distiller.
Using R2Net, a conversion tool by LogicTran, we convert the RTF
document into an HTML document. R2Net is inexpensive, capable of
producing a variety of document formats, including XML and XHTML,
and runs on Windows, Macintosh and UNIX. The conversion process
is controlled by several configuration files which can be edited
with any text editor, allowing us to refine the conversion
as necessary. The conversion process merely asks the user to select
the RTF file to be converted and to choose a name for the resulting
HTML file.
Font Filter Process
After the HTML versions of the document are produced, text in the
phonetic font must be replaced with GIF images. (See Phonetic Fonts
on the Web, below.) This is a three-part process. First, through
a Web form, the text is sent through a filter. The filter, written
in Java, identifies all of the symbols in words that contain characters
in the phonetic font. (Due to diacritics, symbols may span several
characters and contain characters in normal font, phonetic font
or both.) If a GIF image for the symbol already exists, the URL
for the image is substituted. If it does not exist, the symbol is
added to a list of needed symbols. After the entire document is
filtered, a list of needed symbols is produced.
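
The substitute-or-queue step described above can be sketched roughly
as follows. This is an illustration in Python, not the Java filter
itself. The function name, the symbol_images mapping, the example URL,
and the assumption that the text has already been segmented into
symbols are all hypothetical.

```python
def filter_symbols(symbols, symbol_images):
    """Swap known phonetic symbols for image tags and collect the rest.

    symbols       -- symbols already segmented from the text (the
                     segmentation of diacritics is not shown here)
    symbol_images -- hypothetical mapping from a symbol to its GIF URL
    """
    output = []
    needed = []
    for symbol in symbols:
        url = symbol_images.get(symbol)
        if url is not None:
            # An image already exists: substitute a reference to it.
            output.append('<img src="%s" alt="">' % url)
        else:
            # No image yet: keep the symbol and record it once.
            output.append(symbol)
            if symbol not in needed:
                needed.append(symbol)
    return output, needed


# Hypothetical data: one symbol has an image, the other does not.
existing = {"S": "/images/ipa/esh.gif"}
html, needed = filter_symbols(["S", "N", "S", "N"], existing)
```

Here needed ends up holding each missing symbol only once, which
corresponds to the list the editor saves for the image generation step.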
The editor saves this list of needed images using the save command
from the web browser, then runs an image generation program on a
Macintosh to create Macintosh PICT files containing the needed images.
The process requires the editor to select the symbol list file,
and select a folder in which to save the PICT files. These PICT
files are converted into GIF images using Adobe Photoshop, which
is licensed to create GIFs.
The final GIFs are FTP'd to our web site using the Macintosh application
Fetch. Fetch allows the user to select a folder full of files for
uploading with a single command.
The Database Manager
From the beginning of the project we realized we needed some kind
of database manager to support the subscriber list feature. Dartmouth
has a site license for Oracle, which is used heavily in the administrative
functions of the college, so it became the obvious choice in terms
of support and availability. As we developed the web site and examined
our searching needs, it became clear that the interMedia
Text indexing system provided with Oracle 8 meshed well with our
needs. InterMedia Text offers a stem function that matches
multiple forms of a word. For example, if a user searches for "language",
the stem function would match "language" and "languages".
We felt that users would expect this kind of behavior due
to the widespread popularity of web search engines like Google.
InterMedia Text can also index HTML and PDF documents
directly, making it the ideal choice for our indexing needs.
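
To illustrate the idea of stem matching, here is a deliberately naive
sketch in Python. It is not the algorithm interMedia Text uses; the
function names and the plural-stripping rule are assumptions made only
to show how a query for "language" can also match "languages".

```python
def naive_stem(word):
    """Crudely reduce a word to a stem by dropping a plural "s"."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem_match(query, text):
    """Return the words in text whose stem equals the stem of query."""
    target = naive_stem(query)
    return [w for w in text.split()
            if naive_stem(w.strip('.,;:"')) == target]


matches = stem_match("language", "Many languages share one language family.")
```

A production stemmer handles far more than plurals (verb forms,
irregular words, and so on), which is one reason to rely on the
database's built-in indexing rather than hand-written rules.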
With each article record we include additional data such as an
abstract, descriptors, author's name, and internal notes. This information
is entered manually through a web form, though it can usually be copied
from the original article. We felt that the detailed markup needed
to extract it automatically would be more work than copying and
pasting it into the form.
WebObjects
We looked at several options for producing the dynamic HTML pages
for our web site. Keeping in mind cost, availability and development
time, we compared Perl, C/C++/Java, WebObjects (an Apple product)
and PHP. WebObjects and PHP both appeared adequate for implementing
the site. We chose WebObjects because of its project organization
and underlying technology. WebObjects applications are written in
Java and so may be hosted anywhere Java is available. The database
connectivity is provided through the Java Database Connectivity
(JDBC) interface, making it compatible with any database manager
for which there is a JDBC driver. WebObjects code is stored separately
from the HTML template files, allowing pages to be designed in parallel
with the application code. Finally, WebObjects includes a number
of tools that provide a graphical way of manipulating the web pages
and the database structure. These tools run on both Macintosh and
Windows.
The Problems We Encountered
Choice of Phonetic Fonts
A variety of commercial phonetic fonts exist and are in use. However,
before we fully understood the problems of incorporating phonetic
symbols into web pages, we chose the freely available SIL phonetic
fonts so that readers would not need to pay for a font to read the
journal. When we realized we needed to use an image-based solution, SIL
gave us permission to create images using their font.
Choice of Document Formats
Many journals employ PDF to deliver their articles over the Internet.
While it is clear from our experience with cross-platform font issues
that PDF is much easier to produce, we wanted to make it easy for
users to view our journal online. PDF documents reproduce a print
version of a document, which we did not want for an online journal.
In addition, PDF files are generally more difficult to view online
depending on the browser configuration. We chose to include PDF
versions of articles in our journal for users who wish to print
high quality copies of an article, but we also wanted to offer an
HTML version that would be easier to read and view online.
XML is the latest trend in information exchange, and indeed is
very well suited for that purpose. However, from the beginning we
intended for Linguistic Discovery to be readable within a web browser,
and the current support for XML within browsers is marginal. We're
also not aware of any graphical XML editors that might be suitable
for non-technical staff to use. If we were to produce XML documents,
we'd still need a way to produce a browser-renderable format from
the XML, which means introducing yet another intermediate document
format to work with.
We finally settled on producing a pure HTML version of each article.
Tools have existed for several years to produce HTML from RTF documents,
which can be easily saved from Microsoft Word documents. Proven
tools also exist to manipulate HTML, and by generating HTML directly,
we gave the editors the ability to directly manipulate the final
document instead of having to guess how a marked up text would be
transformed into HTML. Since we do our markup on text before it
is converted into HTML, we leave open the possibility of generating
XML at a later time from the RTF document. We also have reserved
the ability to generate XML from the article database using any
of the Java libraries supplied with WebObjects or Oracle.
File Transfer Between Macintosh and Windows
Macintosh and Windows computers use different character sets to
represent text. (A character set is a set of numerical values used
for encoding text and other symbols.) Microsoft Word ordinarily
translates a document created on one platform into the character
set of the other when it is opened on the other platform. For normal
text, this makes sense. However, fonts such as the SILDoulos IPA
93 font that contain graphics symbols use the same codes on both
Windows and Macintosh. Translating the codes for characters in this
font results in a garbled document when moving from one platform
to the other. A variety of fonts besides the SIL fonts fall into
this category of Symbol fonts, and Microsoft Word on the Macintosh
maintains a list of these fonts. When it encounters text in a font
in that list, it does not change the codes for that text. On the
Windows side, the character set represented by the font is part
of the font information.
This list of symbol fonts is not updated during the SIL font installation
process, and unless authors transfer their documents between Windows and
a Macintosh they will not even be aware that Word will blindly translate
their phonetic symbols. For the editors of Linguistic Discovery, who
have no control over the system on which an article is created,
this is a significant problem.
Manually adding the SIL font to the table of symbol fonts did not
solve the problem because that table also affects the codes Word
generates when characters are typed in the font. If we defined the
SIL font on a Macintosh as a symbol font, we would be able to open
documents created on Windows, but not documents created on a Macintosh
that did not have the definition in place.
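
The translation problem can be seen in a few lines of Python. This
sketch assumes the two platform character sets are Mac Roman and
Windows-1252, which the discussion above does not name; the function
name is hypothetical and stands in for the translation a word processor
performs on ordinary text.

```python
def mac_to_windows(data):
    """Translate Mac Roman bytes into the equivalent Windows-1252 bytes."""
    return data.decode("mac_roman").encode("cp1252")


# For ordinary text the translation is what we want: the letter
# e-acute is code 0x8E on the Macintosh but code 0xE9 on Windows.
mac_bytes = "é".encode("mac_roman")
windows_bytes = mac_to_windows(mac_bytes)

# A symbol font puts the same glyph at the same code on both
# platforms, so a code stored as 0x8E must stay 0x8E. After the
# translation it has moved to 0xE9, which selects a different glyph.
symbol_code = b"\x8e"
garbled_code = mac_to_windows(symbol_code)
```

The same call that repairs ordinary text is therefore exactly what
damages text set in a symbol font.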
Phonetic Fonts on the Web
The character set translation issue also affects web applications.
HTML documents are encoded in a character set that differs from
both Windows and Macintosh. The character set can be different from
one document to the next, but to be properly displayed the receiving
browser must recognize the specified character set. Web browsers
must translate characters from the document's character set into
the user's computer's character set. This translation process is
similar to the translation process when a Word document is transferred
from one platform to another. As a result, we could not rely on
font tags in the HTML document to accurately reproduce phonetic
symbols in the articles. Instead we determined that GIFs were the
most reliable way of delivering phonetic characters.
Role of Unicode
The Unicode project aims to assign a unique code to every printed
symbol, and ultimately will provide a solution for delivering phonetic
symbols on the Web. However, widespread support for Unicode is not yet
available. Users will need Unicode-compatible browsers and fonts.
Early support is found in current browsers, but the configuration
is complex and the support is by no means complete. While we expect
this situation to improve over the next few years, the current state of the art
isn't compatible with our desire for free and easy access to the
journal.
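
As a small illustration of what Unicode offers, the Python sketch below
looks up the code points of three common phonetic symbols and builds
the numeric character references that could stand in for them in HTML.
The function name and the choice of symbols are ours, for illustration
only; this is not part of the journal's production process.

```python
import unicodedata


def char_reference(ch):
    """Return the HTML numeric character reference for one character."""
    return "&#%d;" % ord(ch)


# Each phonetic symbol has a single code that is the same on every platform.
for ch in ["ʃ", "ŋ", "ə"]:
    print("U+%04X  %s  %s" % (ord(ch), char_reference(ch), unicodedata.name(ch)))
```

With a Unicode-compatible browser and font, a reference such as the one
produced for the first symbol would display the character directly,
with no image required.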
Stylesheet Support
A minor but important issue was the varied support for stylesheets.
Different browsers and even different versions of the same browser
vary in how exactly they follow the CSS standard. We were particularly
concerned with how the browsers would draw and align inline GIF
images. After much reading and experimentation we came to the familiar
conclusion that we couldn't perfectly align the images on every
browser that we expected readers to use. After looking at statistics
from some of our other web sites, we concluded that our readers
were most likely to be using Windows systems running Internet Explorer,
so we focused our attention on producing a site that looked best
on the latest versions of that browser. When alternative solutions
to layout presented themselves we also considered which choice looked
best on Netscape; however, we found that Netscape deviated from the
formatting standard more often than Internet Explorer did and was
thus more difficult to design for adequately.
Browser Performance
Still another compatibility problem is that of browser performance.
Linguistic Discovery pushes and even passes the limits of Web technology.
In the first issue, articles average around 200K of HTML code. In
our compatibility testing we found that the latest versions of Netscape
and Internet Explorer were quite capable of handling such large
documents efficiently (aside from the 45-second download time on
a 56k modem); however, earlier versions were much more problematic.
In one extreme case, a version of Netscape running on a Macintosh
appeared to freeze after receiving one of these lengthy documents.
In reality, the application was simply in an inner loop formatting
the document, but the poor performance of the application created
the appearance of a system crash.
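
The download time mentioned above can be checked with simple
arithmetic. In this Python sketch the function name is ours, and the
effective throughput of 36,000 bits per second is an assumed figure
chosen for illustration; real modem connections usually run somewhat
below their nominal 56,000 bits per second.

```python
def download_seconds(size_kilobytes, bits_per_second):
    """Estimate the transfer time for a file of the given size."""
    return size_kilobytes * 1024 * 8 / bits_per_second


# A 200K article at the modem's nominal rate: roughly 29 seconds.
nominal = download_seconds(200, 56000)

# The same article at an assumed effective rate: roughly 46 seconds.
effective = download_seconds(200, 36000)
```

Under that assumption the estimate is in line with the 45 seconds we
observed, before any time the browser spends formatting the page.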
To avoid surprising users with this poor performance we've introduced
a page between the table of contents and the text of each article
that explains the implications of using older browsers. Users can
then make an informed choice as to whether to wait for the lengthy
formatting process (in addition to the download time) or to view
the PDF document, which is not ideal for online use.
GIF Images
We are unable to produce GIF images within the font filter directly
for two reasons. First, to support imaging the font characters
in our Java filter would have required coding beyond the scope of
this project. But more important, the GIF image format makes use
of the patented LZW compression scheme, and incorporating code to
generate GIF images into the filter would be prohibitively expensive.
At the time of development, Unisys required a $5000 license for
sites incorporating GIF images generated by unlicensed software.
PNG is evolving as a possible replacement for the GIF format, but
it is not yet widely supported in Web browsers. High quality JPEGs
are too large to be practical given the number of images that are
present in each article, and low quality JPEGs are too fuzzy for
the symbols to be legible. Thus we resorted to a more complex
procedure for generating the images, one that uses readily available,
fully licensed software.