Hyphenation Exception Dictionary

A few weeks ago, I mentioned that I'd been working on a long, complex book that had to be typeset in Microsoft Word. I learned a lot from the experience, and I'll be passing on some of that hard-won knowledge in future issues. As I worked on the book, one problem quickly became apparent: Microsoft Word has no hyphenation exception dictionary. A hyphenation exception dictionary is a list of words that specifies how certain words should (or should not) be broken at the end of a line. For example, a really tiny hyphenation exception dictionary might include the following entries as words that shouldn't be broken at all:

people

little

create

It might also include the following words, with optional hyphens indicating breaking points:

con-vert-ible (not con-ver-ti-ble)

tan-gible (not tang-i-ble)

tri-angle (not trian-gle)

Microsoft Word will break all of those words badly.

Dedicated typesetting programs such as QuarkXPress will automatically check a hyphenation exception dictionary (if you've provided one) and break words accordingly. Microsoft Word won't, but there is a way around the problem. First, compile your hyphenation exception list. Then record a macro that finds each word on your list and replaces it with the same word including optional hyphens and zero-width nonbreaking spaces as needed. You can learn more about zero-width nonbreaking spaces here:

http://www.topica.com/lists/editorium/read/message.html?mid=1711888513

And you can learn more about optional hyphens here:

http://www.topica.com/lists/editorium/read/message.html?mid=1711932079

Using the words above, our list might look like this (I'm using a hyphen [-] to represent optional hyphens and a tilde [~] to represent a zero-width nonbreaking space):

peo~ple

lit~tle

cre~ate

con-vert-ible

tan-gible

tri-angle

So you'd find "people" and replace it with "peo~ple," "triangle" and replace it with "tri-angle," and so on. Then, when Word does its automatic hyphenation, the words will break in the way you've specified rather than in the (incorrect) way Microsoft Word uses by default (using, in my case, American English rules). It's not that Word does a bad job of hyphenation, mind you. It's actually pretty good. But even the best hyphenation algorithms need a little help.
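If you'd rather write the macro than record it, the find-and-replace passes described above might be sketched like this. It's only an illustration, not a finished tool: the macro name and the six-entry list are placeholders, it assumes Word 2000 or later (for the VBA Replace function and the Unicode character), and it matches case and whole words only, so capitalized or inflected forms would need their own entries.

```vba
Sub ApplyHyphenationExceptions()
    ' Hypothetical sample list. A hyphen (-) marks an optional hyphen and
    ' a tilde (~) marks a zero-width nonbreaking space, as in the list above.
    Dim entries As Variant
    Dim i As Long
    Dim plain As String
    Dim marked As String

    entries = Array("peo~ple", "lit~tle", "cre~ate", _
        "con-vert-ible", "tan-gible", "tri-angle")

    For i = LBound(entries) To UBound(entries)
        ' The word to find is the entry with its markers stripped out.
        plain = Replace(Replace(entries(i), "-", ""), "~", "")

        ' In the replacement, ^- is Word's code for an optional hyphen,
        ' and ChrW(65279) is the zero-width nonbreaking space (hex FEFF).
        marked = Replace(entries(i), "-", "^-")
        marked = Replace(marked, "~", ChrW(65279))

        With ActiveDocument.Content.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = plain
            .Replacement.Text = marked
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchCase = True        ' lowercase entries match lowercase text only
            .MatchWholeWord = True   ' "people" will not match "peoples"
            .MatchWildcards = False
            .Execute Replace:=wdReplaceAll
        End With
    Next i
End Sub
```

Run it on a copy of your document first, and check spelling beforehand, since the inserted characters can interfere with spell-checking.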

A more elegant (and probably more reliable) way of preventing breaks is to mark the words in question so that they are not "proofed"--that is, so that they won't be checked for spelling, grammar, or (most important) hyphenation. To do that, select a word, click Tools > Language > Set Language, and put a check in the checkbox labeled "Do not check spelling or grammar." This has the advantage of not introducing an invisible character into the word, which will keep an unwanted space from showing up later if you use the document to create a Web page, an ebook, or whatever.
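The same marking can be applied by macro. The sketch below assumes a version of Word whose VBA offers the NoProofing property on a range (Word 2000 and later, as far as I know); the macro name and the three-word list are placeholders for your own.

```vba
Sub MarkWordsNoProofing()
    ' Hypothetical list of words that should never be broken.
    Dim wordsToProtect As Variant
    Dim i As Long
    Dim rng As Range

    wordsToProtect = Array("people", "little", "create")

    For i = LBound(wordsToProtect) To UBound(wordsToProtect)
        ' Start each search from the top of the document.
        Set rng = ActiveDocument.Content
        With rng.Find
            .ClearFormatting
            .Text = wordsToProtect(i)
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchCase = False
            .MatchWholeWord = True
            .MatchWildcards = False
        End With

        ' Each successful Execute moves rng onto the word that was found.
        Do While rng.Find.Execute
            ' Same effect as checking "Do not check spelling or grammar."
            rng.NoProofing = True
            ' Continue searching after this occurrence.
            rng.Collapse Direction:=wdCollapseEnd
        Loop
    Next i
End Sub
```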

A better way than recording all of these words in a macro is to use our RazzmaTag program, which will run your hyphenation exception list on a whole folder full of documents at one time. It will also let you edit and add to your list as needed. I've prepared preliminary versions of such lists that you can download and play with. The one using the zero-width nonbreaking space is here (this list will work with MegaReplacer as well as RazzmaTag):

http://www.editorium.com/ftp/nonbreakinglist.zip

And the one marking the words so they won't be proofed is here (RazzmaTag only):

http://www.editorium.com/ftp/noproofinglist.zip

But what I'd really like is for you to send me any hyphenation exception lists you already have (maybe check with your typesetter). Then I'll merge them and include the comprehensive list in next week's newsletter! Come on--what do you say? Please email your lists to editor [at symbol] editorium.com.

You can learn more about RazzmaTag here:

http://www.editorium.com/razzmatag.htm

And you can learn more about MegaReplacer here:

http://www.editorium.com/14843.htm

_________________________________________

READERS WRITE

After reading the article on finding and replacing weird WordPerfect characters, Fran wrote:

I work with Word 98 on a Mac and occasionally tangle with WP documents. A few days ago, a writer sent a WP file to me. I opened it in Word, as Text, and found lots of gibberish--my favorite character being the letter Y with two dots over it: ÿ. This character was interspersed between *every* legitimate letter and space. (I would see, basically, this: lÿeÿtÿtÿeÿrÿ ÿaÿnÿdÿ ÿsÿpÿaÿcÿe.) Copying it and using Find and Replace was fruitless. Word refused to cooperate.

I decided to try opening the file with Word, but not as ASCII Text. Opening it as RTF gave me the same results, as did opening it as a Word Document. But I then tried an option called "Recover Text from Any File," and the document opened with text that was absolutely clean. I mean *really* clean.

The only caveat I can think of is that there was no special formatting in this file. I'm responsible for formatting the document and sending it on to my editor.

Yateendra Joshi (yateen@teri.res.in) wrote:

Thank you for the interesting and useful article on ellipses in the 16 January 2001 issue of Editorium Update.

Most often, ellipses stand for omitted matter, and the dots will represent it even better if they do not sit on the line but are raised a bit, say to the centre of the letter x (lowercase eks). The extent to which the dots should be raised will depend on the font (raising by 2 points works best with 11-point Georgia). The sequence is therefore to type the dots as you explain, then select them, and raise them by Format > Font > Character Spacing > Position > Raised By followed by typing in the appropriate value. It helps to see the text enlarged by 500%.

Thanks to Fran and Yateendra for the great tips!
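Yateendra's raising step could also be done with a small macro. Here is a minimal sketch, assuming you have already selected the dots. The macro name is made up, and the 2-point value is simply the one suggested above for 11-point Georgia, so adjust it for your font.

```vba
Sub RaiseSelectedDots()
    ' Number of points to raise the selection above the baseline.
    Const RaiseByPoints As Long = 2

    ' Do nothing unless some text is actually selected.
    If Selection.Type <> wdSelectionNormal Then Exit Sub

    ' Same setting as Format > Font > Character Spacing > Position > Raised.
    Selection.Font.Position = RaiseByPoints
End Sub
```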

Break That Word Here!

Last week's newsletter explained how to use a zero-width nonbreaking space to keep a word from breaking at the end of a line when hyphenation is turned on (Tools > Language > Hyphenation > Automatically hyphenate document). Fine as far as it goes. But what can you do to break a word at a place other than one Microsoft Word insists on using? For example, Word will happily break "convertible" as "converti-ble." Ugh. (See your favorite style manual for more information about how to break words properly; I prefer The Chicago Manual of Style.)

The solution is to insert an optional hyphen at any acceptable breaking points. In "convertible," for example, you could insert optional hyphens as follows: con-vert-ible. The optional hyphens will override Word's automatic hyphenation and break the word at one of the points you've specified.

To get an optional hyphen, click Insert > Symbol > Special Characters > Optional hyphen. Or, easier yet, press CTRL + - (on a Macintosh press COMMAND + -).

In our shop, proofreaders check galleys for bad breaks, which are then corrected manually by our typesetters, who insert optional hyphens as needed (although usually in QuarkXPress rather than Word). Wouldn't it be nice if there were a way to insert optional hyphens automatically? As it turns out, there is--even in Microsoft Word.

Stay tuned; next week I'll tell you all about it.

You can learn more about The Chicago Manual of Style here:

http://www.press.uchicago.edu/cgi-bin/hfs.cgi/00/12245.ctl

And you can see the FAQ here:

http://www.press.uchicago.edu/Misc/Chicago/cmosfaq/cmosfaq.html

_________________________________________

READERS WRITE

After reading last week's article about how to use a zero-width nonbreaking space to keep a word from breaking, Patsy Price sent a tip about an elegant alternative:

I too have been very frustrated when specific words insisted on breaking in Word 98 (Mac) whether I wanted them to or not. I tried everything I could think of, including inserting a nonbreaking hyphen before the word, but nothing worked. Then somebody on one of the lists I belong to made a suggestion that has worked for me so far: select the word and change the language to No Proofing [Tools > Language > Do not check spelling or grammar]. Even when the file is opened in Word 2000 PC the word doesn't hyphenate.

Patsy made the effort to track down the person who originally made the suggestion, Hélène Dion on the McEdit list. So thanks to Hélène for the tip and to Patsy for passing it on.

Bill Rubidge (wbr@aya.yale.edu) sent the following tip on how to simulate a zero-width optional hyphen in Word 97, along with a brilliant wildcard find-and-replace routine to keep words together at the end of a paragraph:

Interesting zero-width action. In my case I wanted to break long URLs in a narrow text column. Unfortunately, I am still using Word 97, so I had to resort to a conditional hyphen solution, but I set the hyphen size to 1 point and colored it white to hide it.

In any case, my experience on that issue and your description of the one below made me think you could take your "Don't break that word" solution a step further. I never use hyphenation, so I don't have your issue, but I dislike short words ending up all by their lonesome as the final line of a paragraph. My solution is:

Search for:

([A-Za-z0-9,.$?;:'"")!*]{1,8}) ([A-Za-z0-9,.$?;:'"")!*]{1,8})([^013])

Replace with:

\1^s\2\3

This forces the last two words (up to eight characters long) to be on the last line together.

Your hyphenation problem seems similar, but I shudder at the thought of inserting the Unicode characters manually. Would it do the job for you to search for the end of a paragraph and then insert the nonbreaking zero-width space character between EVERY letter of the last word? This way, you could run this macro automatically for the whole document.

By the way, I found I had to do an additional undo search to take out these things where I knew that the item was part of a small column. For example, if the found item was in a table, I would undo the nonbreaking material, as the table columns might be too narrow for this to be appropriate.

Thanks to Bill for the great tips.
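A routine like Bill's can be wrapped in a macro so it's easy to rerun. The sketch below is a simplified variation, not Bill's exact search: it treats any run of one to eight characters other than a space or paragraph mark as a "word," and it puts the paragraph mark in a third group so the replacement can put it back. The macro name is made up, and if your system's list separator isn't a comma, {1,8} may need to be typed as {1;8}.

```vba
Sub KeepLastTwoWordsTogether()
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting

        ' Groups 1 and 2: one to eight characters that are neither a space
        ' nor a paragraph mark. Group 3: the paragraph mark itself (^13).
        ' Note that group 1 can also match the tail end of a longer word.
        .Text = "([! ^13]{1,8}) ([! ^13]{1,8})(^13)"

        ' Rejoin the two groups with a nonbreaking space (^s) and
        ' put the paragraph mark back.
        .Replacement.Text = "\1^s\2\3"

        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

As Bill points out, you may want to undo the change wherever the text sits in a narrow column.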

_________________________________________

RESOURCES

Possibly the ultimate treatise on the subject, the fascinating book Hyphenation, by Ronald McIntosh and David Fawthrop, is available free online:

http://www.hyphenologist.co.uk/book/BOOK-ED3.HTM

Don't Break That Word!

I've recently been editing a long, scholarly tome that, for reasons I'll discuss in a future newsletter, my co-workers and I decided to typeset in Microsoft Word, following the techniques explained here:

http://www.topica.com/lists/editorium/read/message.html?mid=1708956278

Our intrepid typesetter has been fairly content except for one thing: there seems to be no way to keep a word from breaking at the end of a line. Microsoft Word happily breaks "Je-sus" and "Bud-dha," for example, which we'd like to avoid. We could force a word down with a soft return, but that doesn't seem like a very elegant solution. Clever idea: how about putting an optional hyphen (CTRL + -) at the *beginning* of the word? That works, but it also *displays* a hyphen at the beginning of the word, which certainly won't do. Can the optional hyphen go at the end of the word? No, that doesn't work at all. So where might we find an answer?

Well, Unicode fonts include all kinds of interesting things. Would they, by chance, include a zero-width nonbreaking space? If we had one of those, we could insert it at the spot where we didn't want a break to occur. I went to Alan Wood's spectacular Unicode Resources site and searched for "zero-width nonbreaking space":

http://www.alanwood.net/unicode/search.html

There it was, not under general punctuation but as the last entry under Arabic Presentation Forms, of all things:

http://www.alanwood.net/unicode/arabic_presentation_forms_b.html

The Web site told me the decimal number (65279) and hex number (FEFF) of the character, so I fired up Word 2002 (XP) and entered the character by typing the hex number followed by ALT + x. With nonprinting characters showing, I could see the little beauty--it looked like a gray box inside a gray box. When nonprinting characters *weren't* showing, the character was invisible, since it had no width. And sure enough, when I put the character into a word and then pushed that word to the end of the line, the word refused to break. Success!

I sent a sample to Word guru Steve Hudson, who tested the idea in various ways and pronounced it good. Thanks, Steve! So now I share this little marvel with you. If you'd like to see the character in action (and get a sample of the character that you can copy and use in your own documents), you can download the following document to play around with:

http://www.editorium.com/ftp/nonbreaking.zip

After you download, unzip, and open the document, notice the automatically hyphenated "excellent" on the first line. Now add a character somewhere in the middle of the *second* line--enough to make the second "excellent" break. But it won't!

Incidentally, the character works in Word 2000 and later versions as long as you have Unicode fonts installed on your computer. You can learn more here:

http://www.alanwood.net/unicode/fonts.html

Please note that using this character within a word will mess up spell-checking for that word, so you might want to check spelling *before* inserting the character hither and yon. If you need to get rid of the characters, display nonprinting characters; then search for ^u65279 and replace with nothing.
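That cleanup can also be run as a macro. This is a minimal sketch with a made-up macro name. It turns on nonprinting characters as described above, removes every occurrence of the character, and then restores the previous display setting.

```vba
Sub RemoveZeroWidthNonbreakingSpaces()
    Dim wasShowingAll As Boolean

    ' Display nonprinting characters first, remembering the
    ' current setting so it can be restored afterward.
    wasShowingAll = ActiveWindow.View.ShowAll
    ActiveWindow.View.ShowAll = True

    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "^u65279"          ' Unicode character 65279 (hex FEFF)
        .Replacement.Text = ""     ' replace with nothing
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll
    End With

    ActiveWindow.View.ShowAll = wasShowingAll
End Sub
```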

_________________________________________

READERS WRITE

After reading last week's article on finding and replacing weird WordPerfect characters, Jane Lyle (jlyle@indiana.edu), managing editor at Indiana University Press, sent the following macro. Thanks, Jane! The macro does its work by searching for characters formatted in the WP TypographicSymbols font, and it includes some characters that last week's macro overlooked. If you don't know how to use such macros, you can find out here.

'MACRO BEGINS HERE

Sub WPTyp()
    ' WPTyp Macro
    ' Macro recorded 10/25/2001 by Jane Lyle
    ' Replaces characters formatted in the WP TypographicSymbols font
    ' with ordinary characters formatted in Times New Roman.
    ReplaceWPChar "A", """"     ' double quotation mark
    ReplaceWPChar "@", """"     ' double quotation mark
    ReplaceWPChar ">", "'"      ' single quotation mark or apostrophe
    ReplaceWPChar "=", "'"      ' single quotation mark or apostrophe
    ReplaceWPChar "B", "^="     ' en dash
    ReplaceWPChar "C", "^+"     ' em dash
    ReplaceWPChar "?", """"     ' double quotation mark
    ReplaceWPChar "Y", ". . ."  ' ellipsis
End Sub

Sub ReplaceWPChar(FindText As String, ReplaceText As String)
    ' Finds every FindText formatted in WP TypographicSymbols and
    ' replaces it with ReplaceText formatted in Times New Roman.
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Font.Name = "WP TypographicSymbols"
        .Replacement.Font.Name = "Times New Roman"
        .Text = FindText
        .Replacement.Text = ReplaceText
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

'MACRO ENDS HERE

_________________________________________

RESOURCES

You can find lots of interesting spaces, hyphens, and other Unicode characters here:

http://www.alanwood.net/unicode/general_punctuation.html