Hyphenation Exception Dictionary

A few weeks ago, I mentioned that I'd been working on a long, complex book that had to be typeset in Microsoft Word. I learned a lot from the experience, and I'll be passing on some of that hard-won knowledge in future issues. As I worked on the book, one problem quickly became apparent: Microsoft Word has no hyphenation exception dictionary. A hyphenation exception dictionary is a list that specifies how particular words should (or should not) be broken at the end of a line. For example, a really tiny hyphenation exception dictionary might list the following words as ones that shouldn't be broken at all:

people

little

create

It might also include the following words, with optional hyphens indicating breaking points:

con-vert-ible (not con-ver-ti-ble)

tan-gible (not tang-i-ble)

tri-angle (not trian-gle)

Microsoft Word will break all of those words badly.
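If it helps to picture what such a dictionary amounts to, here is a minimal sketch in Python. It is only an illustration, not how any typesetting program actually stores its exceptions: the names EXCEPTIONS, build_dictionary, and allowed_breaks are made up for this example. Each entry is written with hyphens at the permitted break points, and an entry with no hyphens is treated as "never break."

```python
# Hypothetical representation of a tiny hyphenation exception dictionary.
# Hyphens mark the only permitted break points; an entry without hyphens
# must never be broken.
EXCEPTIONS = ["people", "little", "create",
              "con-vert-ible", "tan-gible", "tri-angle"]


def build_dictionary(entries):
    """Map each plain word to the character offsets where a break is allowed."""
    table = {}
    for entry in entries:
        word = entry.replace("-", "")
        breaks = []
        offset = 0
        # Every piece except the last one ends at a permitted break point.
        for piece in entry.split("-")[:-1]:
            offset += len(piece)
            breaks.append(offset)
        table[word.lower()] = breaks
    return table


def allowed_breaks(word, table):
    """Return the permitted offsets, or None if the word is not an exception."""
    return table.get(word.lower())


table = build_dictionary(EXCEPTIONS)
print(allowed_breaks("convertible", table))  # [3, 7]
print(allowed_breaks("people", table))       # []
print(allowed_breaks("window", table))       # None
```

An empty list means "listed, but don't break it," while None means the word isn't an exception and the program's ordinary hyphenation rules would apply.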

Dedicated typesetting programs such as QuarkXPress will automatically check a hyphenation exception dictionary (if you've provided one) and break words accordingly. Microsoft Word won't, but there is a way around the problem. First, compile your hyphenation exception list. Then record a macro that finds each word on your list and replaces it with the same word including optional hyphens and zero-width nonbreaking spaces as needed. You can learn more about zero-width nonbreaking spaces here:

http://www.topica.com/lists/editorium/read/message.html?mid=1711888513

And you can learn more about optional hyphens here:

http://www.topica.com/lists/editorium/read/message.html?mid=1711932079

Using the words above, our list might look like this (I'm using a hyphen [-] to represent optional hyphens and a tilde [~] to represent a zero-width nonbreaking space):

peo~ple

lit~tle

cre~ate

con-vert-ible

tan-gible

tri-angle

So you'd find "people" and replace it with "peo~ple," "triangle" and replace it with "tri-angle," and so on. Then, when Word does its automatic hyphenation, the words will break in the way you've specified rather than in the (incorrect) way Microsoft Word uses by default (using, in my case, American English rules). It's not that Word does a bad job of hyphenation, mind you. It's actually pretty good. But even the best hyphenation algorithms need a little help.
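The find-and-replace idea can be sketched outside of Word, too. The Python below is a rough illustration that works on plain text, not on a Word document. It assumes the Unicode soft hyphen (U+00AD) as a stand-in for an optional hyphen and the Unicode word joiner (U+2060) as a stand-in for a zero-width nonbreaking space; Word may use different character codes internally. The function names are my own inventions for the example.

```python
import re

SOFT_HYPHEN = "\u00ad"  # assumed stand-in for Word's optional hyphen
WORD_JOINER = "\u2060"  # assumed stand-in for a zero-width nonbreaking space

# The list from above: "-" marks an optional hyphen, "~" a nonbreaking point.
EXCEPTION_LIST = ["peo~ple", "lit~tle", "cre~ate",
                  "con-vert-ible", "tan-gible", "tri-angle"]


def make_pairs(entries):
    """Turn each marked-up entry into a (plain word, replacement) pair."""
    pairs = []
    for entry in entries:
        plain = entry.replace("-", "").replace("~", "")
        marked = entry.replace("-", SOFT_HYPHEN).replace("~", WORD_JOINER)
        pairs.append((plain, marked))
    return pairs


def apply_exceptions(text, pairs):
    """Replace whole-word, exact-case occurrences of each plain word."""
    for plain, marked in pairs:
        text = re.sub(r"\b" + re.escape(plain) + r"\b", marked, text)
    return text


sample = apply_exceptions("The people create a triangle.",
                          make_pairs(EXCEPTION_LIST))
# Swap the invisible characters back to visible markers for display.
print(sample.replace(SOFT_HYPHEN, "-").replace(WORD_JOINER, "~"))
# The peo~ple cre~ate a tri-angle.
```

Because the match is whole-word and case-sensitive, capitalized and plural forms ("People," "triangles") would need their own entries on the list.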

A more elegant (and probably more reliable) way of preventing breaks is to mark the words in question so that they are not "proofed"--that is, so that they won't be checked for spelling, grammar, or (most important) hyphenation. To do that, select a word, click Tools > Language > Set Language, and put a check in the checkbox labeled "Do not check spelling or grammar." This has the advantage of not introducing an invisible character into the word, which will keep an unwanted space from showing up later if you use the document to create a Web page, an ebook, or whatever.

A better way than recording all of these words in a macro is to use our RazzmaTag program, which will run your hyphenation exception list on a whole folder full of documents at one time. It will also let you edit and add to your list as needed. I've prepared preliminary versions of such lists that you can download and play with. The one using the zero-width nonbreaking space is here (this list will work with MegaReplacer as well as RazzmaTag):

http://www.editorium.com/ftp/nonbreakinglist.zip

And the one marking the words so they won't be proofed is here (RazzmaTag only):

http://www.editorium.com/ftp/noproofinglist.zip
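To show the general idea of running one list over a whole folder, here is a small Python sketch for plain-text files. It is not how RazzmaTag or MegaReplacer works, and it does not read their list format; the tab-separated "find, tab, replace" layout and the function names are assumptions made purely for illustration.

```python
from pathlib import Path


def load_pairs(list_path):
    """Read hypothetical 'find<TAB>replace' pairs, one per line."""
    pairs = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        find, replace = line.split("\t", 1)
        pairs.append((find, replace))
    return pairs


def process_folder(folder, pairs, pattern="*.txt"):
    """Apply every pair to every matching file; return names of changed files."""
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        updated = original
        for find, replace in pairs:
            # Plain substring replacement, so "people" inside "peoples"
            # is changed as well.
            updated = updated.replace(find, replace)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

You would call it with something like process_folder("chapters", load_pairs("exceptions.tsv")), where both names are placeholders. Since it overwrites files in place, try it on copies first.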

But what I'd really like is for you to send me any hyphenation exception lists you already have (maybe check with your typesetter). Then I'll merge them and include the comprehensive list in next week's newsletter! Come on--what do you say? Please email your lists to editor@editorium.com.
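Merging several lists is mostly a matter of dropping duplicates and noticing where two lists disagree about the same word. Here is one way that might be sketched in Python, using the same hyphen and tilde notation as above; the function names are hypothetical, and the "first entry wins" rule is just an assumption for the example.

```python
def plain_form(entry):
    """Strip the break markers to recover the underlying word."""
    return entry.replace("-", "").replace("~", "").lower()


def merge_lists(*lists):
    """Combine lists, dropping duplicates and collecting conflicting entries."""
    merged = {}     # plain word -> first marked-up entry seen
    conflicts = {}  # plain word -> every differing entry seen
    for entries in lists:
        for entry in entries:
            key = plain_form(entry)
            if key not in merged:
                merged[key] = entry
            elif merged[key] != entry:
                conflicts.setdefault(key, {merged[key]}).add(entry)
    return sorted(merged.values(), key=plain_form), conflicts


merged, conflicts = merge_lists(
    ["peo~ple", "tri-angle"],
    ["tri-angle", "tan-gible"],
    ["trian-gle"],
)
print(merged)                         # ['peo~ple', 'tan-gible', 'tri-angle']
print(sorted(conflicts["triangle"]))  # ['tri-angle', 'trian-gle']
```

The conflicts are reported rather than resolved, since deciding between two proposed breaks is a job for a human editor.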

You can learn more about RazzmaTag here:

http://www.editorium.com/razzmatag.htm

And you can learn more about MegaReplacer here:

http://www.editorium.com/14843.htm

_________________________________________

READERS WRITE

After reading the article on finding and replacing weird WordPerfect characters, Fran wrote:

I work with Word 98 on a Mac and occasionally tangle with WP documents. A few days ago, a writer sent a WP file to me. I opened it in Word, as Text, and found lots of gibberish--my favorite character being the letter Y with two dots over it: ÿ. This character was interspersed between *every* legitimate letter and space. (I would see, basically, this: ÿlÿeÿtÿtÿeÿrÿ ÿaÿnÿdÿ ÿsÿpÿaÿcÿe.) Copying it and using Find and Replace was fruitless. Word refused to cooperate.

I decided to try opening the file with Word, but not as ASCII Text. Opening it as RTF gave me the same results, as did opening it as a Word Document. But I then tried an option called "Recover Text from Any File," and the document opened with text that was absolutely clean. I mean *really* clean.

The only caveat I can think of is that there was no special formatting in this file. I'm responsible for formatting the document and sending it on to my editor.

Yateendra Joshi (yateen@teri.res.in) wrote:

Thank you for the interesting and useful article on ellipses in the 16 January 2001 issue of Editorium Update.

Most often, ellipses stand for omitted matter, and the dots will represent it even better if they do not sit on the line but are raised a bit, say to the centre of the letter x (lowercase eks). The extent to which the dots should be raised will depend on the font (raising by 2 points works best with 11-point Georgia). The sequence is therefore to type the dots as you explain, then select them, and raise them by Format > Font > Character Spacing > Position > Raised By followed by typing in the appropriate value. It helps to see the text enlarged by 500%.

Thanks to Fran and Yateendra for the great tips!


  • The Fine Print

    Thanks for reading Editorium Update (ISSN 1534-1283), published by:

    The EDITORIUM, LLC
    http://www.editorium.com

    Articles © on date of publication by the Editorium. All rights reserved. Editorium Update and Editorium are trademarks of the Editorium.

    You may forward copies of Editorium Update to others (but not charge for it) and print or store it for your personal use. Any other broadcast, publication, retransmission, copying, or storage, without written permission from the Editorium, is strictly prohibited. If you’re interested in reprinting one of our articles, please send an email message to editor@editorium.com

    Editorium Update is provided for informational purposes only and without a warranty of any kind, either express or implied, including but not limited to implied warranties of merchantability, fitness for a particular purpose, and freedom from infringement. The user (you) assumes the entire risk as to the accuracy and use of this document.

    The Editorium is not affiliated with Microsoft Corporation or any other entity.

    We do not sell, rent, or give our subscriber list to anyone. Period.

    If you’d like to subscribe, please enter your name and email address below. We publish the newsletter once a week, and on rare occasions we may send an important announcement. We never, ever send spam. Thank you for signing up!