Prepare Text For TTS

A Text Preprocessor For TTS Applications

Introduction

If you have ever sent "Thu Mar 23 14:04:45 est 2000" to a voice synthesizer, you were probably confused by the resulting stream of abbreviations and digits. And if you didn't know (in advance) that the output represented a date and time, the speech would be as comprehensible as ancient Greek. Clearly we need an intelligent text preprocessor that translates "computer English" into standard English. If the above had been run through such a preprocessor, you might hear "march twenty third two thousand at 2 o 4 PM eastern standard time." For brevity, we might omit the year and time zone, leaving "march twenty third at 2 o 4 PM." Finally, if the date coincides with the current date, the preprocessor might simply generate "today at 2 o 4 PM." After all, that's how a human would render the expression if he were reading it to you. "This mail message was sent today at 2 o 4 PM." We might also use the word yesterday, or even Friday, if the mail was sent last Friday. The text preprocessor described here contains all these features, and many more.

The Jupiter Speech System

This preprocessor was originally written for the Jupiter Speech System, which is a talking version of Linux for blind users. Therefore, some of its features are geared towards that particular application. However, the text preprocessor can be used in many other applications as well. I deliberately separated this capability from the rest of the Jupiter speech system, so it can be pasted into a process (such as an email reader), or another speech package for blind users. If you are a blind Linux user, I hope you will give the Jupiter speech system a try. If you are developing another speech package for the blind, I hope you will incorporate this preprocessor, along with the clicktty module (also part of the Jupiter speech system). Finally, if you are developing a voice application that renders general text (as opposed to canned messages), you might want to employ this preprocessor. Bear in mind, this software cannot be used for profit without the express written consent of the author (see below).

Copyright Notice

The Jupiter Speech Package, which includes this preprocessor, is copyright (C) Karl Dahlke, 2000. It may be freely distributed under the terms of the General Public License, as articulated by the Free Software Foundation.

Download The Package

Click here to download a compressed tar archive of the package. The preprocessor consists of one header file and two source files, and this documentation (which you are reading). The header file defines other variables and constants for the Jupiter speech system. If these do not interest you, you can remove them. Two regression test files are also included. If you just want to look at the regression tests, to see what the preprocessor actually does, download the following two ASCII files, bring them up in separate windows, and compare them line by line.

before --- after

Many Languages

Although this preprocessor recognizes and generates English text, its design anticipates most western European languages. For instance, we accept 8-bit characters, rather than 7-bit ASCII, and we pass these through to the synthesizer. If you want a Spanish version, the 8-bit code 0xf1 represents ñ, as in El Niño. This code, along with e-accent (French), o-umlaut (German), etc, is passed to the voice synthesizer, and (we hope) converted into the appropriate phoneme for that language. If your synthesizer uses ASCII escape codes for these meta-characters, you will need to put another translator into the pipeline. Fortunately this is a simple program, replacing each instance of a character with a fixed string.
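
A minimal sketch of such a translator follows. It is not part of the package. The function names and the bracketed escape strings are invented for illustration; a real synthesizer defines its own codes, so the table would have to be filled in from its manual.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical escape strings; substitute the codes your synthesizer expects. */
static const char *escape_for(unsigned char c)
{
    switch (c) {
    case 0xf1: return "[n~]";   /* n-tilde */
    case 0xe9: return "[e']";   /* e-acute */
    case 0xf6: return "[o:]";   /* o-umlaut */
    default:   return NULL;     /* no replacement, pass the byte through */
    }
}

/* Copy src to dst, replacing each mapped 8-bit character with its string.
 * Returns the output length, or (size_t)-1 if dst is too small. */
size_t translate_meta(const unsigned char *src, size_t len,
                      char *dst, size_t dstsize)
{
    size_t out = 0;

    if (dstsize == 0)
        return (size_t)-1;
    for (size_t i = 0; i < len; i++) {
        const char *esc = escape_for(src[i]);
        size_t n = esc ? strlen(esc) : 1;

        if (out + n >= dstsize)         /* keep room for the final null */
            return (size_t)-1;
        if (esc)
            memcpy(dst + out, esc, n);
        else
            dst[out] = (char)src[i];
        out += n;
    }
    dst[out] = '\0';
    return out;
}
```

The output of the preprocessor would be passed through translate_meta() just before it is handed to the synthesizer.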

The software that determines whether a word is "pronounceable", as opposed to an acronym (such as CPU), is, in large part, table driven. Transition tables are used to indicate typical pairs of letters in an English word -- at the beginning, in the middle, and at the end. These tables can be replaced with Spanish versions.

Finally, a two-part encode/decode design makes it easier to support other languages. Encoding software maps the string 09/09/1960 into a date, and decoding software then renders the date as "September nighnth nighnteen sixdy." All the words in these output messages are contained in static arrays. As a first approximation, you can replace the English words with their Spanish equivalents, to get a Spanish date. Of course some languages, such as German, say "three and twenty", rather than "twenty three", so the conversion is not actually as simple as swapping in a new batch of words. However, the encode/decode design should simplify the process somewhat. The language independent routines are isolated, and the language specific routines employ tables wherever feasible. This should facilitate a wide range of languages, with a modest development effort.
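
The two-part design can be sketched as follows. This is an illustration under stated assumptions, not the code in the package: the function names are invented, the input is assumed to be month/day/year, the encoded form follows the <0x80>D09091960<0x80> example given later in this document, and the decoder uses plain month names and digits rather than the respelled words the real decoder generates. Only the month_words array would need to change for another language.

```c
#include <stdio.h>
#include <string.h>

/* Language specific table; swap in another language here. */
static const char *month_words[12] = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December"
};

/* Encode "MM/DD/YYYY" as 0x80 'D' MMDDYYYY 0x80 (assumed layout).
 * out must hold at least 12 bytes.  Returns 1 on success, 0 otherwise. */
int encode_date(const char *text, unsigned char *out, size_t outsize)
{
    int m, d, y;
    char tail;

    if (outsize < 12)
        return 0;
    if (sscanf(text, "%2d/%2d/%4d%c", &m, &d, &y, &tail) != 3)
        return 0;
    if (m < 1 || m > 12 || d < 1 || d > 31 || y < 0)
        return 0;
    out[0] = 0x80;
    snprintf((char *)out + 1, outsize - 1, "D%02d%02d%04d", m, d, y);
    out[10] = 0x80;
    out[11] = '\0';
    return 1;
}

/* Decode the construct back into words.  Returns 1 on success. */
int decode_date(const unsigned char *code, char *out, size_t outsize)
{
    int m, d, y;

    if (strlen((const char *)code) != 11)
        return 0;
    if (code[0] != 0x80 || code[1] != 'D' || code[10] != 0x80)
        return 0;
    if (sscanf((const char *)code + 2, "%2d%2d%4d", &m, &d, &y) != 3)
        return 0;
    if (m < 1 || m > 12)
        return 0;
    snprintf(out, outsize, "%s %d %d", month_words[m - 1], d, y);
    return 1;
}
```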

Remove Binary Data, tc_ascify()

The preprocessor is actually a series of small consecutive functions, which may be thought of as one long pipeline. Each function makes certain assumptions about the incoming data (from the previous function), and produces data that is compatible with the next function. We shall describe each function in turn.

The first pass, implemented by tc_ascify(), removes blocks of binary data. Note that we do not remove isolated binary characters, as these may represent international symbols, such as ñ. However, if they represent meta characters in a European language, they should be relatively sparse. When half the bytes are binary, we have a problem. Each region of binary data is excised, and replaced with a formfeed character, ASCII 12, which is used throughout to separate paragraphs.

In practice, we don't have to worry about binary data. Binary files are usually sent as MIME attachments, or uu-encoded data, hence they become ASCII text. But if your worst enemy sends you a raw binary file, we don't want the preprocessor to blow up. The result may not be pretty, but at least it doesn't core dump.

This function makes cr/lf (DOS convention) and lf (Unix convention) equivalent. Each is replaced with cr (my convention). Hereafter, cr, ASCII 13, is used to denote physical line breaks, whether the file came from DOS or Unix.

If a character is followed by backspace, both characters are deleted. This effectively removes the overstrikes or underlines that are used to emphasize certain phrases in character based output. Run `man ls >ls.out', and look at the result. The titles and section headers contain overstrike characters, which would not be read properly if fed directly to a synthesizer.

Finally, control characters, delete, and 0x80 are culled from the input. The latter is a special code that we use internally, so we don't want to see it come in from the outside. This character is used to encode dates and times etc, as described earlier. Thus 09/09/1960 might become <0x80>D09091960<0x80>.
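
The line-ending, overstrike, and control-character rules might be sketched like this. The function name is invented, the removal of binary blocks is left out, and I am assuming that tab, cr, and formfeed survive the control-character cull, since later stages rely on them.

```c
#include <stddef.h>

/* Clean len bytes of src into dst (which must hold at least len bytes).
 * Returns the number of bytes written. */
size_t clean_text(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        /* A character followed by backspace: drop both (overstrike). */
        if (i + 1 < len && src[i + 1] == '\b') {
            i++;
            continue;
        }
        /* DOS cr/lf: skip the cr, the lf is converted below. */
        if (c == '\r' && i + 1 < len && src[i + 1] == '\n')
            continue;
        if (c == '\n')
            c = '\r';                   /* every line break becomes cr */
        if (c == 0x7f || c == 0x80)
            continue;                   /* delete, and the internal marker */
        if (c < ' ' && c != '\t' && c != '\r' && c != '\f')
            continue;                   /* other control characters */
        dst[out++] = c;
    }
    return out;
}
```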

Linearize A Split-Screen Display, tc_relinearize()

The trailing block of an email message is often a split screen display, with name and email information on the left, and the street address on the right (for example). This routine rearranges the text, so that the left side is read first, then the right. Here is a split screen example.


Karl Dahlke                  4704 Bonniebrook
kdahlke@ptek.com             Troy, MI 48098
http://www.this.that.com     (248) 524-1004
http://some.other.site

Encode Whitespace, tc_whitespace()

Once this function is complete, the space character represents a single space, tab represents a sequence of spaces and tabs (also known as linear whitespace), cr represents a line break, and formfeed represents a paragraph break. The latter can appear in the original text, but more often it is derived. One or more blank lines indicate a paragraph break, and are replaced with a single formfeed. When literal mode is disabled, a line that contains nothing but graphics characters is also treated as a paragraph break. This is often a line of dashes or stars. When literal mode is active, the user wants to hear the dashes and stars, so they cannot be discarded.
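
Here is a rough sketch of the basic encoding, under simplifying assumptions: the input has already been through the first pass (so line breaks are cr), a blank line is taken to be two or more consecutive cr characters, and lines of graphics characters are not examined. The function name is invented.

```c
#include <stddef.h>

/* Encode whitespace from src into dst (at least len bytes).
 * Returns the number of bytes written. */
size_t encode_whitespace(const unsigned char *src, size_t len,
                         unsigned char *dst)
{
    size_t i = 0, out = 0;

    while (i < len) {
        unsigned char c = src[i];

        if (c == ' ' || c == '\t') {
            size_t start = i;
            while (i < len && (src[i] == ' ' || src[i] == '\t'))
                i++;
            /* one lone space stays a space; anything else is a tab */
            dst[out++] = (i - start == 1 && c == ' ') ? ' ' : '\t';
        } else if (c == '\r') {
            size_t start = i;
            while (i < len && src[i] == '\r')
                i++;
            /* one cr is a line break; more means blank lines */
            dst[out++] = (i - start >= 2) ? '\f' : '\r';
        } else {
            dst[out++] = c;
            i++;
        }
    }
    return out;
}
```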

When this preprocessor is built into the Linux kernel, as part of an adaptive package for the blind, literal mode is the default. In my experience, you usually want to run with literal mode on. The synthesizer can say the word "dot" faster than it will pause for the corresponding period, and time is a precious commodity. However, if the preprocessor is built into an application, such as an email reader, literal mode is disabled.

With literal mode off, emoticons, such as :-), are encoded here. Most constructs are encoded later, but emoticons are language independent, and consist entirely of graphics characters, so it makes sense to encode them now. In the same spirit, we translate a leading * or - or --, before text, into 0xb7, which is the 8-bit code for a bullet list marker.

Remove Garbage Lines, tc_ungarbage()

If a line looks like encoded data, as indicated by a lack of whitespace characters, that line is removed, along with small amounts of text that might precede or follow the garbage line, up to a paragraph break. Thus an entire uuencoded file will be skipped.
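
The core test might look something like the sketch below. The 40-character threshold and the function name are my assumptions for illustration; the actual routine may weigh other evidence as well.

```c
#include <stddef.h>

#define MIN_GARBAGE_LEN 40      /* assumed threshold, not from the source */

/* Returns 1 if the line is long and contains no whitespace at all,
 * which is typical of uuencoded or base64 data. */
int looks_like_garbage(const char *line, size_t len)
{
    if (len < MIN_GARBAGE_LEN)
        return 0;
    for (size_t i = 0; i < len; i++)
        if (line[i] == ' ' || line[i] == '\t')
            return 0;
    return 1;
}
```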

Isolate Titles, tc_titles()

Look for titles -- a sequence of words that are all upper case, surrounded by mixed case text. Or, a sequence of capitalized words surrounded by text that is predominantly lower case. The state machine that calculates this is rather complicated, taking newlines and periods into account. For instance, a title must begin with a newline, and a prior period adds points to the equation. When a title is identified, it becomes its own paragraph.

Encode List Items, tc_listItem()

The aforementioned byte 0xb7 is encoded as a bullet list marker, whether that byte was in the original text, or derived, from a leading star. A lead letter or number, followed by colon or period, is also encoded as a list item indicator. When the actual list item is short, the colon or period becomes a comma, for a short pause. When the list item is an entire sentence or paragraph, we use a period after the designator. Thus the lead letter or number is read as its own sentence. Then we read the list item. Of course these changes are not made in literal mode, where the user wants to hear every punctuation mark as it was written.

In-Line Replacements, tc_encode()

Certain hard-coded words are replaced with other words in the stream. The primary motivation for this is the management of exception words -- words that are not handled properly by the down-stream acronym detector. For instance, the word "kiwi" certainly looks strange. The current acronym detector would flag this as unpronounceable, and translate it into its constituent letters K-I-W-I. Yet kiwi is a perfectly good English word. This routine turns it into keywey, which looks, and is, perfectly pronounceable. In this case it is not worth "improving" the acronym detector, since most synthesizers do a rather poor job of pronouncing kiwi, and we'd have to replace it with something anyway. We may as well replace it here, and leave the acronizer alone. The same holds for most of the other exception words. If they weren't replaced with something, they would be turned into letters by the acronizer, or, they would be mispronounced by the synthesizer.

Common English and metric units are turned into words. Thus "3 lb 6 oz" becomes "3 pounds 6 ounces". We make an effort to say pound, rather than pounds, for 1 lb. Note that "1/2 gal" is read as "one half gallon", while "0.5 gal" is read as "0.5 gallons".

Small Roman numbers, less than 20, are turned into Arabic numbers, in the proper context. VIII becomes 8 no matter what, but we have to be a bit more careful with I, V, X, and VI. The first three could be somebody's initials; the last is a popular Unix screen editor. If the previous word is "chapter", "section", "phase", etc, the Roman number is translated.
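
The context rule can be sketched as follows. The function names are invented, the list of context words contains only the three named above (the real list is longer), and the comparison is case sensitive for simplicity.

```c
#include <stddef.h>
#include <string.h>

/* Value of a Roman numeral from I to XIX, or 0 if the word is not one. */
static int small_roman(const char *word)
{
    static const char *numerals[19] = {
        "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
        "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII", "XIX"
    };
    for (int i = 0; i < 19; i++)
        if (strcmp(word, numerals[i]) == 0)
            return i + 1;
    return 0;
}

/* Returns the Arabic value if word should be translated, else 0.
 * prev is the previous word, or NULL if there is none. */
int translate_roman(const char *prev, const char *word)
{
    static const char *ambiguous[] = { "I", "V", "X", "VI" };
    static const char *context[] = { "chapter", "section", "phase" };
    int value = small_roman(word);

    if (value == 0)
        return 0;
    for (size_t a = 0; a < sizeof ambiguous / sizeof ambiguous[0]; a++) {
        if (strcmp(word, ambiguous[a]) != 0)
            continue;
        /* ambiguous numeral: translate only after a context word */
        for (size_t c = 0; c < sizeof context / sizeof context[0]; c++)
            if (prev && strcmp(prev, context[c]) == 0)
                return value;
        return 0;
    }
    return value;       /* unambiguous, such as VIII */
}
```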

In the proper context, ie and eg become i.e. and e.g. respectively. If literal mode is off, a subsequent routine then translates these into "that is" and "for example" respectively. Again, we must be careful, for IE is now a popular browser written by Microsoft.

Standard prefixes such as mr., and standard suffixes such as Jr., are rendered as words. Roman numeral suffixes are handled as above. This is somewhat suboptimal, since "King George III" becomes "King George 3". It would be nice to look back, detect the name, and say "King George the third", but this is not yet implemented.

Standard shorthand fragments such as re=reply, vol=volume, inc=incorporated, corp=corporation, no=number, etc, are translated, provided the context is appropriate. After all, we wouldn't want to turn every "no" into "number". A few words are dropped entirely when enclosed in parentheses. One example is (tm). I suppose the lawyers think this is important, but I don't have time to read everybody's trademarks and copyrights. Other examples include [link] and [inline]. These are generated in large quantities by lynx, and are not terribly useful for blind programmers.

Certain abbreviations such as rd=road, ln=lane, st=street, etc, are replaced within addresses. Note that st could also be a prefix, as in "St. Louis Missouri".

Finally, the last period in a series of letters is removed, unless that period appears to end a sentence. In most contexts, the final period of U.S.A. is dropped. Periods are also stripped from the aforementioned abbreviations, such as mr. -> mister. Once this routine is complete, a period probably indicates the deliberate end of a sentence. It should not delimit a train of initials or an abbreviation.

Encoded Constructs, tc_encode()

Various multi-word constructs are encoded and passed onto the next routine. The special byte 0x80 begins and ends an encoded construct. These will be translated back into words downstream. The primary examples are dates, times, and days of the week, which can be represented by dozens of formats. When they are concisely encoded, subsequent software becomes simpler. For instance, we can render a single hyphen as the word "to" or "through" when it separates two dates or times. It would be difficult to make this translation if dates and times were not encoded. Here are just a few examples:

State names and abbreviations are also encoded inside addresses. This is used to expand MI into Michigan, but the concise encoding also helps manage the commas. The last line of an address is typically written "Troy, MI 48098" with the implicit pause after the city. Yet it is read with the pause after the state, hence the comma must be moved. By encoding both abbreviations and full names, the software automatically moves the comma in "Troy, Michigan 48098".

References to Bible verses, or ranges of Bible verses, are encoded. Thus "John 19:16" is rendered "John chapter 19 verse 16". That's better than "John 7 16 PM".

Common URL notation, such as http://, is encoded, and read using simple English words such as "web site". If the domain is followed by a long awkward path, the pathname is not read. Instead, the software says something like "a web page under www.microsoft.com". It wouldn't do you any good to hear the long pathname anyway. You really need to get to a terminal and click on it, or paste it into your bookmark file. If it is short enough to understand (verbally) and remember, such as www.feingold.org/research.shtml, there may be some point in reading the entire URL.

Implicit Sentences, tc_encode()

When text runs into a paragraph boundary (formfeed), that marks the end of the current sentence, whether there is a period or not. This is an implicit sentence boundary. A newline character, ASCII 10, is inserted to indicate the end of each sentence. Remember, any original newlines were turned into cr, so there are no newlines coming into this routine. All newlines going out mark the end of sentences. When literal mode is off, we also make sure each sentence ends with a period, if it does not already end in a period, question mark, or exclamation point. We assume the TTS engine only understands these three punctuation marks, and the comma, for a short pause.

Other implicit sentence boundaries occur at list items, a short line followed by a longer line, a line wholly contained in parentheses, a line that starts with P.S., and a very long line, exceeding 200 characters. These long lines usually result from a word processor, as it dumps its document to ASCII. Each line is actually an entire paragraph, hence we treat it as such. The previous line, and the current long line, are marked as sentences. Furthermore, the original cr separator is turned into formfeed, making the long line a paragraph unto itself.

In contrast to the very long line, a short line, 40 characters or less, is assumed to be part of a block address, or some other formatted construct. Thus each short line receives a comma, if it does not already end with a punctuation mark.
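
The short-line rule is simple enough to sketch in a few lines. The function name is invented, and "punctuation mark" is taken to mean anything that the C library's ispunct() accepts, which is an assumption on my part.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SHORT_LINE 40   /* the limit given in the text above */

/* Append a comma to a short line that lacks closing punctuation.
 * size is the capacity of the line buffer.  Returns 1 if a comma
 * was added, 0 otherwise. */
int add_line_comma(char *line, size_t size)
{
    size_t len = strlen(line);

    if (len == 0 || len > SHORT_LINE)
        return 0;
    if (ispunct((unsigned char)line[len - 1]))
        return 0;
    if (len + 2 > size)
        return 0;               /* no room for the comma and the null */
    line[len] = ',';
    line[len + 1] = '\0';
    return 1;
}
```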

Explicit Sentences, tc_encode()

Sentences are split according to the standard punctuation rules. Remember that Mr. Flintstone and Harry S. Truman all belong in one big sentence, even if cr follows one of the internal periods. Also, some sentences end in a closing quote or parenthesis, which must change places with the prior punctuation mark -- the one that actually ends the sentence. As above, a newline is appended to each sentence in the output stream.

Decode Constructs, expandCode()

The multi-word items that were encoded before are now rendered as words. As mentioned earlier, these words are all stored in static arrays at the top of the file. Thus they can be replaced with words from another language. Note that the "standard" words are rarely used. Instead, words are slightly misspelled, so that the majority of synthesizers will pronounce them correctly. Thus Zephaniah, a rather obscure book of the Bible, is rendered Zephinigha, because that spelling elicits the correct pronunciation from most commercial synthesizers.

Punctuation Marks, expandPunct()

In literal mode, each punctuation mark is turned into words. When literal mode is off, punctuation is often discarded. Some marks are passed directly through, if they indicate the end of a sentence or phrase.

Some punctuation marks may receive special translations in certain contexts. We've already described the role of the hyphen between two dates or times. Hyphen may also be read as "dash", inside a social security or ID number. A period may be read as "point" in 3.75 or as "dot" inside 10.86.9.27. In the appropriate contexts, @ is read "at", and # is read "number". Most of the time, slash is read as "slash", but special software detects phrases such as and/or and he/she.

Numbers And Money, expandNumber()

Short numbers, four digits or less, are read naturally, rather than digit by digit. Thus 12 becomes twelve, 338 becomes three thirty eight or three hundred thirty eight (depending on context), 1492 becomes fourteen ninety two, 2nd becomes second, and 104th becomes one hundred fourth. This translation is not done if the number begins with a zero. Multi-token numbers such as 12,345 are also translated, up to six digits with literal mode on, and 15 digits with literal mode off. This is one of many tradeoffs between clear English speech and accurate unambiguous information. In the C language, you might encounter an array of initializers {102,305,907,484}. We wouldn't want to read this as a number in the billions. Since C programmers always work with literal mode enabled, they will hear each 3 digit initializer, separated by commas. On the other hand, 1,000 is probably one thousand, even in a C program (perhaps a comment or print statement), so we read it that way. We make statistical choices and hope we are right most of the time.
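
As a rough illustration of the natural reading, here is a sketch that handles 0 through 999 in the long form, and four digit numbers as two pairs when the last pair is 10 or more. The names are invented, and the other cases (1905, 2000, ordinals, leading zeros, the shorter "three thirty eight" form) are deliberately left out.

```c
#include <stddef.h>
#include <stdio.h>

static const char *ones[20] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *tens[10] = {
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety"
};

/* Words for 0..99. */
static void two_digits(int n, char *out, size_t size)
{
    if (n < 20)
        snprintf(out, size, "%s", ones[n]);
    else if (n % 10 == 0)
        snprintf(out, size, "%s", tens[n / 10]);
    else
        snprintf(out, size, "%s %s", tens[n / 10], ones[n % 10]);
}

/* Returns 1 and fills out if n is covered by this sketch, else 0. */
int number_words(int n, char *out, size_t size)
{
    char hi[32], lo[32];

    if (n < 0 || n > 9999)
        return 0;
    if (n < 100) {
        two_digits(n, out, size);
        return 1;
    }
    if (n < 1000) {
        if (n % 100 == 0) {
            snprintf(out, size, "%s hundred", ones[n / 100]);
        } else {
            two_digits(n % 100, lo, sizeof lo);
            snprintf(out, size, "%s hundred %s", ones[n / 100], lo);
        }
        return 1;
    }
    if (n % 100 < 10)
        return 0;               /* 1905, 2000, etc: not covered here */
    two_digits(n / 100, hi, sizeof hi);
    two_digits(n % 100, lo, sizeof lo);
    snprintf(out, size, "%s %s", hi, lo);
    return 1;
}
```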

Strings that follow the format of the North American numbering plan are read as phone numbers. Thus 800-555-1212 becomes "area code 800, 5 5 5, 1 2 1 2".
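
A sketch of this translation, for the exact ddd-ddd-dddd form only. The function name is invented, and other common layouts, such as (800) 555-1212, are not handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Render "800-555-1212" as "area code 800, 5 5 5, 1 2 1 2".
 * Returns 1 on success, 0 if s is not in that form or out is too small. */
int phone_words(const char *s, char *out, size_t size)
{
    int n;

    if (strlen(s) != 12)
        return 0;
    for (int i = 0; i < 12; i++) {
        if (i == 3 || i == 7) {
            if (s[i] != '-')
                return 0;
        } else if (!isdigit((unsigned char)s[i])) {
            return 0;
        }
    }
    n = snprintf(out, size, "area code %.3s, %c %c %c, %c %c %c %c",
                 s, s[4], s[5], s[6], s[8], s[9], s[10], s[11]);
    return n > 0 && (size_t)n < size;
}
```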

Monetary amounts such as $29.37 are read "29 dollars and 37 cents". The biggest challenge here is the plural issue. "That $500 will sure come in handy, so please send the $500 check." Simple amounts such as $4 are not read as money in literal mode, because they are more likely to be positional parameters in an awk/shell script.
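
The simple part of this translation might be sketched as follows. The function name is invented. The sketch picks singular or plural from the amount alone; it makes no attempt at the harder case above, where "$500 check" should be read with the singular "dollar", and it ignores literal mode.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Render "$29.37" as "29 dollars and 37 cents".
 * Accepts $N or $N.CC only.  Returns 1 on success, 0 otherwise. */
int money_words(const char *s, char *out, size_t size)
{
    long dollars;
    int cents = 0;
    char *end;

    if (s[0] != '$' || !isdigit((unsigned char)s[1]))
        return 0;
    dollars = strtol(s + 1, &end, 10);
    if (*end == '.') {
        if (!isdigit((unsigned char)end[1]) ||
            !isdigit((unsigned char)end[2]) || end[3] != '\0')
            return 0;
        cents = (end[1] - '0') * 10 + (end[2] - '0');
    } else if (*end != '\0') {
        return 0;
    }
    if (cents == 0)
        snprintf(out, size, "%ld %s", dollars,
                 dollars == 1 ? "dollar" : "dollars");
    else
        snprintf(out, size, "%ld %s and %d %s", dollars,
                 dollars == 1 ? "dollar" : "dollars",
                 cents, cents == 1 ? "cent" : "cents");
    return 1;
}
```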

Words, expandWord()

When a token consists of numbers and letters, this routine first separates the token into its components, such that each piece contains either letters or numbers, but not both. Numeric pieces are read as described above. Letter components are then split at case boundaries. This is especially useful for programmers, who have to deal with runTogetherVariableNames in their work. Once a letter component is isolated, the acronizer is invoked. This determines whether the word is "pronounceable". If not, it is spelled letter by letter, as in "xyz". This software is not trivial, and will not be described in detail. Needless to say, it sometimes makes mistakes. Words longer than six letters are never read letter by letter. It may not be an English word, but you don't really want to hear all those letters either. A word that looks like a name, with a leading capital letter, is pronounced, unless it contains no vowels.
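
The splitting step, without the acronizer, can be sketched like this. The names are invented, the token is assumed to contain only letters and digits, and a case boundary is taken to be a lower case letter followed by an upper case letter, so a run such as HTMLParser stays in one piece.

```c
#include <ctype.h>
#include <stddef.h>

#define PIECE_MAX 32    /* capacity of each piece, including the null */

/* Split tok at letter/digit boundaries and at lower-to-upper case
 * boundaries.  Returns the number of pieces stored. */
int split_token(const char *tok, char pieces[][PIECE_MAX], int max_pieces)
{
    int count = 0;
    size_t len = 0;

    for (size_t i = 0; tok[i] != '\0'; i++) {
        unsigned char c = (unsigned char)tok[i];
        unsigned char p = i ? (unsigned char)tok[i - 1] : 0;
        int boundary = i > 0 &&
            (((isdigit(c) != 0) != (isdigit(p) != 0)) ||
             (islower(p) && isupper(c)));

        if (i == 0 || boundary) {
            if (count == max_pieces)
                break;
            count++;            /* start a new piece */
            len = 0;
        }
        if (len + 1 < PIECE_MAX) {
            pieces[count - 1][len++] = (char)c;
            pieces[count - 1][len] = '\0';
        }
    }
    return count;
}
```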

User Defined Dictionary, tc_userReplace()

This function is not part of the preprocessor software. Instead, you must supply it. The routine tc_userReplace() receives a string, which contains a word in the input text. It compares this word against a list of words, supplied by the user, and if it is found in the user's "dictionary", the alternate spelling is returned. This is used to correct pronunciation errors within the synthesizer, and establish the desired pronunciation for names and acronyms. Your application might store the user's dictionary in memory, a Unix file, or an SQL database. It may contain a dozen words, or 10,000. The implementation details are left to you. However, you must supply a function tc_userReplace(), even if it is simply a stub. Note that the Jupiter speech system provides a working version of this routine, so the blind programmer can establish the pronunciation of various words.
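
For a small dictionary held in memory, the lookup itself could be as simple as the sketch below. The name dictionary_lookup() and its signature are hypothetical; the exact prototype that tc_userReplace() must have is not given here, so check the header file and wrap a lookup like this one accordingly.

```c
#include <stddef.h>
#include <string.h>

struct dict_entry {
    const char *word;
    const char *respelling;
};

/* A tiny in-memory dictionary, for illustration only. */
static const struct dict_entry user_dict[] = {
    { "together", "tugether" },
    { "kiwi",     "keywey"   },
};

/* Returns the user's respelling of word, or NULL if there is none. */
const char *dictionary_lookup(const char *word)
{
    size_t n = sizeof user_dict / sizeof user_dict[0];

    for (size_t i = 0; i < n; i++)
        if (strcmp(word, user_dict[i].word) == 0)
            return user_dict[i].respelling;
    return NULL;
}
```

A linear search is fine for a dozen words. For 10,000 words, a sorted array with bsearch(), or a hash table, would be a better choice.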

When dealing with a runTogetherWord, the dictionary is first given the entire string, in case its pronunciation is specified as a whole. If that string is not present, the preprocessor invokes tc_userReplace() for each component. If your dictionary respells "together -> tugether", this change will be seen when runTogetherWord is translated. Finally, replacements are made on the roots of possessives, such as together's, and future contractions, such as together'll.

Punctuation Pronunciation, tc_speakChar()

This function is given a single character, usually a punctuation mark. It leaves the user-defined pronunciation for that punctuation mark in the global array tc_token[]. Like the previous routine, this is not part of the preprocessor; you must supply it. Note that this routine is only used in literal mode, where the user wants to hear every punctuation mark. If your application does not enable literal mode, you can simply provide a stub.

Cleanup, postCleanup()

The previous routines pepper the output with commas and periods in a rather uncoordinated fashion. Thus the end of a phrase or sentence might be marked with several commas and/or periods, surrounded by somewhat unpredictable bursts of whitespace. This routine compresses whitespace and punctuation, leaving only the strongest mark (comma < period < question mark < exclamation point).
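
The compression rule can be sketched as follows. The names are invented, only the space character is treated as whitespace (so the newline and formfeed markers pass through untouched), and the real routine surely handles more cases than this.

```c
#include <stddef.h>

/* Rank of a punctuation mark: comma < period < question < exclamation. */
static int strength(char c)
{
    switch (c) {
    case ',': return 1;
    case '.': return 2;
    case '?': return 3;
    case '!': return 4;
    default:  return 0;
    }
}

/* Collapse each run of spaces and punctuation in src to the strongest
 * mark in the run, followed by one space if more text follows.
 * dst must hold strlen(src) + 1 bytes.  Returns the output length. */
size_t compress_punct(const char *src, char *dst)
{
    size_t i = 0, out = 0;

    while (src[i] != '\0') {
        if (strength(src[i]) || src[i] == ' ') {
            char best = 0;

            while (src[i] != '\0' && (strength(src[i]) || src[i] == ' ')) {
                if (strength(src[i]) > strength(best))
                    best = src[i];
                i++;
            }
            if (best)
                dst[out++] = best;
            if (src[i] != '\0' && out > 0)
                dst[out++] = ' ';
        } else {
            dst[out++] = src[i++];
        }
    }
    dst[out] = '\0';
    return out;
}
```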

API, tc_prepTTS()

This is the access routine for the entire capability. It calls the previous routines in order, and manages the entire pipeline. Since this function has no arguments and no returns, you must communicate with it through global arrays. An awkward interface to be sure, but it works well within the Jupiter speech system, where it is tightly coupled with the Linux kernel. A better interface, for applications, is described below.

API, tc_prepTTSmsg()

This is the access routine that applications should use. Like tc_prepTTS(), it calls the previous routines in sequence and manages the entire pipeline, but it receives its input and returns its output through ordinary arguments. It is called as follows.

#include "tc_hdr.h"

ichar *tc_prepTTSmsg(ichar *msg, int len);

The header file tc_hdr.h establishes the typedef ichar, which is an international character. For now this is an 8-bit unsigned char, but it may become a 16-bit Unicode character in the future. The message (which may contain nulls), and the message length are passed. A pointer to the translated message is returned. This pointer belongs to an allocated block of memory, which is freed upon the next call to tc_prepTTSmsg(). If you want to retain the translated text beyond the next call, you had better make a copy.

Interactive Applications

Consider an application that reads the first sentence of a document or email message, then waits for a command from the user. This is an interactive speech application. The user might ask the application to read the next sentence, or reread the current sentence, or move to the next mail message. At each step the application needs to translate a sentence, or perhaps a paragraph, yet it has no idea where these boundaries are until it invokes the preprocessor. In a typical design, the application collects a bit more text than is necessary (statistically), passes it to the preprocessor, and scans the output for a sentence marker (newline) or paragraph marker (formfeed). This delimited block of text is then routed to the synthesizer. In addition to the reformatted words, the output includes back-pointers into the original text. By following the pointer associated with the delimiting newline character, the application can return to the original message and determine precisely where the first sentence ends. If the user asks for the next sentence, the application gathers another 200 bytes, starting where the first sentence left off. Once again it looks for the newline character, passes the translated sentence to the synthesizer, and uses the pointer to locate the end of the second sentence in the original text.
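
The scan for the first sentence might be sketched as below. The layout of the back-pointers is not specified in this document, so the sketch simply assumes a parallel array, back[], that holds for each output byte an offset into the original text. That array, and the function name, are hypothetical.

```c
#include <stddef.h>

/* Find the first sentence (newline) or paragraph (formfeed) marker
 * in the translated text.  Returns the number of output bytes up to
 * and including the marker, or 0 if there is none.  On success,
 * *src_end receives the original-text offset tied to the marker. */
size_t first_sentence(const unsigned char *out, const size_t *back,
                      size_t outlen, size_t *src_end)
{
    for (size_t i = 0; i < outlen; i++) {
        if (out[i] == '\n' || out[i] == '\f') {
            *src_end = back[i];
            return i + 1;
        }
    }
    return 0;
}
```

The application would send the first block of output to the synthesizer, then gather its next chunk of original text starting at the offset stored in src_end.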

For increased granularity, each word or phrase can also be mapped back to its location in the original text. Thus the application can proceed word by word, if that is what the user wants. Of course this degrades speech quality. A synthesizer always does a better job when it is handed an entire sentence. This is especially true in other languages, such as French, where the pronunciation of one word may depend on the next word. In an ideal design, the synthesizer speaks an entire sentence, yet it notifies the application as each word is read. Using the parallel array of pointers, the application maps these spoken words back to their ASCII counterparts in the original text. In other words, a reading cursor traverses the original text in lock-step with the synthesizer's speech. When the user hits the pause key, the application knows exactly what word the user was listening to. It can resume reading from that point, read the previous word, read the next word, etc.

All of the above has been implemented in my Jupiter speech system. My preprocessor translates an entire sentence, which is passed to the Doubletalk synthesizer. The Doubletalk renders the entire sentence, with emphasis and inflection, and returns index markers in real time as each word is read. I map these back to the original text. Thus I enjoy perfect tracking and high speech quality at the same time. Other synthesizers support index markers as well, including the Dectalk family.

Contact Information

If you have any questions or feedback, please contact me, Karl Dahlke, via email, or by phone at 248-524-1004 during regular business hours.