Converting MS Word Documents to Text for Posting
------------------------------------------------

Version 2.1, February, 2000
by Titmouse


This is a substantial revision of a discussion that 
was originally published in August, 1999.  Earlier 
versions should be discarded.  Included in this 
version is a note on setting margins for A4 paper and 
metric measurements.


Like many others, I prefer to do my serious writing in 
Microsoft Word.  While it's not perfect, it has the 
tools and capabilities to do everything I need and 
most of what I want.  But messages and stories posted 
to Usenet newsgroups should be in plain ASCII text 
and, as many have discovered, Word does not make this 
conversion gracefully. This is a source of much 
complication, confusion and irritation.  After many 
trials and errors, I think I've finally figured most 
of it out, and this is a report on my conclusions.
 
The best and most successful tactic is problem 
avoidance.  It is much easier to prevent things that 
will cause conversion problems than to fix them after 
the fact.  Some of the information presented here is 
general and concerns basic formatting, but most of it 
deals with the specific issues of converting Word 
documents.  

The discussion assumes you are working with Microsoft 
Word 97 or Word 2000.  Although I have not made 
exhaustive tests with the new version, no changes 
appear to be necessary in this document.  The two 
macros presented below also work without modification.


Initial Considerations

Most conversion problems stem from three sources:  
document formats, paragraph formats, and the extended 
character set. If you can avoid introducing problems, 
the conversion should go smoothly.  Word is designed 
to defeat our purpose here, though, so we will have to 
force it to do what we want.  Defaults for all three 
of these problem areas are wrong for text documents 
posted to the Internet.  

Unless all you do is write stories for posting to the 
Internet, however, the changes you will need to make 
are not ones you will want for other kinds of 
documents.  A secondary problem, then, is how to avoid 
wrecking Word for other purposes.
 
I thought originally that it would be relatively 
simple -- just create a new template designed for 
plain text documents with all the bells and whistles 
turned off, and there should be no problem.  Not so.  
My second theory was that the issues could be resolved 
by creating an alternate version of the Normal.Dot 
template.  Also not so.  Rather than recount a long 
process of experimentation, I'll just report my 
conclusions.


The Document Format

Problems with both document and paragraph formats are 
most easily handled by creating a template that you 
can use whenever you start a new story or other longer 
document intended for the Internet.  The template will 
have the correct font, margins and paragraph format.  
Then, if you remember to turn off Word's fancy text 
gizmos as explained later, you can write your document 
without creating problems for yourself.

To create a new template, launch Word and modify the 
blank document as follows.  First, go to File, Page 
Setup and set the left and right margins.  The top and 
bottom margins don't matter, as they are ignored when 
the document is saved as text.

You'll be using a fixed width font, so the line length 
will determine the number of characters per line.  I 
use and recommend 55 characters per line (as in this 
document) and strongly recommend that you not exceed 
60 characters per line.  When you post to the 
Internet, your message will be handled and read by a 
wide variety of programs.  If your lines are too 
long, one or more of them may force an early line 
wrap.  You won't know about it until your story 
arrives on the newsgroup with alternating long and 
short lines.  I've never seen this problem occur with 
55 character lines, however.
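
Once the text file has been saved, a quick check 
outside Word can confirm that no line exceeds the 
limit.  The following Python sketch is only an 
illustration; the function name is hypothetical and 
the default of 60 is taken from the recommendation 
above.

```python
def find_long_lines(text, limit=60):
    """Return (line_number, length) for lines over limit."""
    long_lines = []
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        if len(line) > limit:
            long_lines.append((number, len(line)))
    return long_lines


# Made-up sample: the second line is 72 characters long.
sample = "short line\n" + "x" * 72 + "\nanother short line"
print(find_long_lines(sample))
```

An empty result means every line fits within the 
limit.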

How to set the margins depends on the font size.  
Word's default is 10-point type, but I recommend 12-
point type, which produces 10 characters per inch in 
fixed width fonts.  In this case, you'll need a 5.5-
inch line length.  Any combination of left and right 
margins that totals three inches will work.  I use a 
left margin of one inch and a right margin of two 
inches, but 1.25 and 1.75 work just as well.

If you insist on 10-point type, you'll need a 4.5-inch 
line length, so the margins should total four inches.  
That's actually 54 characters per line, if you're 
paying close attention, but near enough to 55.

METRIC NOTE:  If you use the common alternative 
standard of A4 paper and metric measurements, the 
above recommendations translate to left and right 
margins of 33mm, assuming the font is Courier New at 
12 points.  This produces 56 characters per line.
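
The arithmetic behind these figures can be sketched in 
a few lines of Python.  This is an illustration under 
stated assumptions, not part of Word: it assumes a 
fixed-width font that yields 120 divided by the point 
size in characters per inch (10 at 12 points, 12 at 10 
points, as above), a paper width of 8.5 inches for the 
inch examples (implied by a 5.5-inch line plus three 
inches of margin), and 210mm for A4.  The function 
name is hypothetical.

```python
import math

MM_PER_INCH = 25.4


def chars_per_line(paper_width, left, right, point_size):
    """Characters per line for a fixed-width font.

    All widths are in inches.  Assumes 120 / point_size
    characters per inch (10 at 12 points, 12 at 10).
    """
    line_length = paper_width - left - right
    per_inch = 120 / point_size
    # Round first so float noise cannot drop a character.
    return math.floor(round(line_length * per_inch, 6))


# 8.5-inch paper, margins of 1 and 2 inches, 12 points
print(chars_per_line(8.5, 1.0, 2.0, 12))
# 8.5-inch paper, margins totaling 4 inches, 10 points
print(chars_per_line(8.5, 2.0, 2.0, 10))
# A4 (210mm wide), 33mm margins, 12 points
a4 = 210 / MM_PER_INCH
margin = 33 / MM_PER_INCH
print(chars_per_line(a4, margin, margin, 12))
```

The three calls should reproduce the 55, 54 and 56 
characters per line quoted above.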

All the other settings on the Page Setup dialog should 
remain the same as usual, so just click OK to set 
the margins.  If you prefer to work in Page Layout 
View (rather than Normal View), set it now by 
selecting View, Page Layout.  Assuming you plan to use 
12-point type, click on the Zoom control near the 
right edge of the Standard toolbar (the one that 
begins with the New, Open and Save icons) and set the 
Zoom to 75%.  If you stick with 10-point type, skip 
this last step.

If your standard setup includes headers or footers, 
eliminate them from this document.  Go to View, 
Headers and Footers and delete anything in either of 
them.  Otherwise, this information will appear in the 
final text version.


Font and Paragraph Format

Now, press Ctrl-A (or use Edit, Select All) to select 
the entire document.  While the document may appear 
blank, it contains a paragraph marker.  In Word, the 
paragraph marker is much more than an end-of-line 
character; a great deal of formatting is stored with 
it.  If you don't include that initial paragraph 
marker in your changes, the defaults will remain and 
return to bite you later.

With the entire document selected, change the font to 
Courier New, 12 point.  If you have another fixed-
width font that you prefer, you can use it, since font 
information will not be saved in the final text 
version.

Now, with the entire document still selected, go to 
Format, Paragraph.  Make sure that Alignment is set to 
Left, Indentation and Spacing are zeroed, and the 
Outline level is set to Body Text.  The most vital 
setting is for Special.  The default is First Line 
with a half-inch indent.  Set this to (none).

This last setting causes problems for many users.  
Although the First Line setting will indent the first 
line of paragraphs, no tab or other character is 
actually placed in the document to cause it.  Instead, 
the setting is stored in the paragraph marker and 
disappears when converted to text, which is why you 
see a lot of documents where the indent appears to 
have been lost.  In fact, it was never really there in 
the first place.

The final option, Line Spacing, should be set to 
single.  Don't worry about tab settings.  Click OK 
to implement the paragraph format.

Now, we're ready to save our new template.  Select 
File, Save As.  Give your template a name -- I call 
mine 'Text' -- and change the type to 'Document 
Template.'  Word will automatically place it in your 
template folder.  Click the Save button.


Sticking with ASCII

Now for the thorniest problem, which is Word's 
insistence on putting extended character codes in 
documents and leaving them there even when you convert 
them to text.  A little explanation is needed here, 
although some will already be familiar with this 
information.

The Usenet standard for text-oriented newsgroups calls 
for plain ASCII text.  ASCII (American Standard Code 
for Information Interchange) predates widespread 
computer use and is most closely associated with 
Teletype machines.  It is a seven-bit coding scheme; 
seven bits provide 128 codes (0-127).  At the 
time, that seemed sufficient to represent the 52 
capital and lower case letters, the 10 digits, common 
punctuation symbols, and various control codes for 
line feeds, carriage returns, tabs, page feeds and so 
on.

Binary computers, though, use powers of two, most 
famously the eight-bit byte.  ASCII coding fit neatly 
into a byte, with one bit left over which was 
initially ignored.  That didn't last, of course, and 
several schemes evolved for extending the character 
set by using that spare bit to provide an additional 
128 codes (128-255).  The most popular of these today 
is the set Microsoft calls ANSI (after the American 
National Standards Institute), in which the first 128 
codes correspond to ASCII.  What 
the upper 128 represent, at least in the Microsoft 
world, depends on context, including language, font 
and software.

Here's the problem.  When Word converts a document to 
text, it uses ANSI, not ASCII.  Extended character 
codes above 127 remain in the text.  What shows up on 
the screen -- letters from other languages, math 
symbols, and little black boxes for anything the 
software can't display -- depends partly on which 
flavor of conversion you used but mostly on the 
software used to read it.

There is no cure within Word; your only choice is 
prevention.
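
Detection, at least, is easy to do outside Word.  Any 
byte in the saved text file with a value above 127 is 
an extended code.  The Python sketch below shows one 
way to list them; the function name and the sample 
bytes are made up for illustration.

```python
def find_non_ascii(data):
    """Return (line, column, code) for bytes above 127."""
    hits = []
    lines = data.split(b"\n")
    for line_no, line in enumerate(lines, start=1):
        for column, code in enumerate(line, start=1):
            if code > 127:  # the eighth bit is set
                hits.append((line_no, column, code))
    return hits


# Made-up sample with three extended codes on line 2.
sample = b"plain text\na caf\xe9 and \x93quotes\x94\n"
for line_no, column, code in find_non_ascii(sample):
    print(f"line {line_no}, column {column}: code {code}")
```

To check a real file, read it in binary mode so that 
no decoding takes place before the bytes are examined.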


Avoiding Extended Character Codes

While you can put extended codes in your documents 
intentionally -- nearly everything on the Insert menu 
will do so, for example -- the ones Word does for you 
without asking are the biggest source of problems.  
These mostly originate from the 'AutoFormat As You 
Type' tab of the AutoCorrect page of the Tools menu.  
The 'AutoCorrect' tab contributes a few additional 
gotchas, and the (plain) 'AutoFormat' tab can also 
cause problems.  

The crux of the problem is that these settings are not 
stored in any template.  They stay with the program, 
not the document, and they retain their settings until 
you change them explicitly.

Since you probably will want at least some of these 
features turned on for standard Word documents, there 
are only two choices.  One is to turn them on and off 
manually depending on what you're working on; the 
other choice is to use a pair of macros to do the work 
for you.  (You'll still have to remember to run the 
macros, of course.)  I have included the two macros 
necessary and will explain how to implement and use 
them later.
 

Copying Your AutoCorrect Setup

Before making any changes, make a copy of your current 
setup.  Start Word with a blank document, click on 
'Tools' on the menu bar at the top, and then choose 
'AutoCorrect...'  You will see the AutoCorrect page 
with four tabs: AutoCorrect, AutoFormat As You Type, 
AutoText, and AutoFormat.

The third of these, AutoText, provides boilerplate 
entries that require a manual step to insert in a 
document.  If you use this facility in documents 
intended for publication in Usenet newsgroups, just 
make sure such entries don't contain non-ASCII 
characters.  This caveat aside, AutoText is not 
relevant to our text-conversion problems.

We may change the other three tabs, though, so let's 
make a backup copy.  Click on the AutoCorrect heading 
to make sure the dialog has the focus, then hold down 
the Alt key and press PrintScreen.  This copies the 
dialog to the clipboard.  Close the dialog and, in 
your blank document, press Ctrl + V (or click Edit, 
Paste) to insert a picture of the dialog in your 
document.  Press Enter.

Now return to Tools, AutoCorrect.  Select the 
'AutoFormat As You Type' tab.  Press Alt + PrintScreen 
again.  Close the dialog and press Ctrl + V to insert 
a copy of this tab in your document.  Press Enter.

Finally, return to Tools, AutoCorrect one more time 
and select the 'AutoFormat' tab.  Copy it to the 
clipboard with Alt + PrintScreen, close the dialog, 
and insert it into your document with Ctrl + V.  Now, 
save the document as 'AutoCorrect Settings' and print 
a copy for reference.


The AutoCorrect Tab

This tab is concerned with typing mistakes.  In the 
top part are five checkbox options.  I have four of 
the five turned on normally, omitting the second, 
'Capitalize first letter of sentences.'  In my 
experience, checking this box makes Word capitalize 
things I don't want it to.  In any case, you can set 
the first four checkboxes according to your 
preferences.  They don't create conversion problems.

The fifth checkbox controls the bottom half of this 
tab and can cause problems, however.  In particular, 
it converts (c) and (r) to the Copyright and 
Registered symbols and three successive periods to the 
ellipsis symbol.  These, of course, all require 
extended codes.  In preparing a plain text document, 
you don't need to change any of the replacements.  
Just uncheck the 'Replace text as you type' checkbox, 
and Word will ignore the list.  This also means it 
will not correct the many common typographical errors 
on the list, however, so a spelling check becomes more 
important than ever.

There is an alternative, which is what I've chosen to 
do.  I deleted the first several entries in the table 
-- the ones that convert smilies as well as the 
copyright and registered symbols.  Now I can leave the 
autocorrection of common typos turned on without 
danger of substituting an illegal character.  It's 
something of an awkward choice, but personally I'd 
rather catch the typos.


The AutoFormat-As-You-Type Tab

This is the bad boy, responsible for most of the 
problems experienced in converting Word documents to 
plain text.  For standard documents, I have everything 
checked except for hyperlinks, the third from last.  
For text documents, I turn everything off.

As you can see, the middle section converts straight 
quotes to curly quotes, ordinals to superscript, 
common fractions to their graphic equivalents, dual 
hyphens to real dashes, and *bold* and _underlining_ 
to actual bold and underlining.  All of these rely on 
extended character codes or on formatting that plain 
text cannot hold, and most of them don't convert 
properly to text.
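
If a few of these characters slip through anyway, 
they can be mapped back to plain equivalents after 
the fact, outside Word.  The Python sketch below is a 
rough illustration rather than a complete cure.  It 
assumes the text has already been read and decoded 
(for example, as Windows code page 1252), and the 
table and function name are hypothetical.  It covers 
only the kinds of substitution mentioned above, and 
the characters are written as escapes so that the 
example itself stays plain ASCII.

```python
# Fancy character -> plain ASCII stand-in.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote, apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "--",   # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a9": "(c)",  # copyright symbol
    "\u00ae": "(r)",  # registered symbol
    "\u00bc": "1/4",
    "\u00bd": "1/2",
    "\u00be": "3/4",
}


def to_ascii_equivalents(text):
    """Replace each listed character with its stand-in."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text


sample = "\u201cHalf\u201d is \u00bd \u2014 honest\u2026"
print(to_ascii_equivalents(sample))
```

Bold, underlining and superscripts are formatting 
rather than characters, so nothing of them survives 
in the text file for a script like this to repair.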


The AutoFormat Tab

The settings on this tab are almost identical to 
those on the previous one.  Where the first makes its 
changes as you type, the changes on this tab are made 
only if you tell Word -- by selecting Format, 
AutoFormat -- to perform them.  If you don't do that, 
you can leave these settings alone.  Since I change my 
settings via macro, it's just as easy to switch them 
off and on.


Macros to Turn Text Settings On and Off

As you can see, considerable labor is required to 
change these settings manually, especially if you 
switch between document types frequently.  As 
mentioned earlier, you can't solve this problem by 
putting the desired settings in the Text.dot template.  
You can't even fix it by creating an alternate version 
of Normal.dot, the template Word always uses.  The 
AutoCorrect settings are independent of the template.

Instead, the simplest way to switch is with a pair of 
macros.  You could record them yourself if you know 
how, but I've provided copies here and directions on 
how to create them.

First, if you haven't already, save this document and 
load it into Word.  Find this location again, and 
follow the steps below.  Be sure you have a copy of 
your original AutoCorrect settings before proceeding.

As provided, the macros switch almost everything off 
for text documents and back on for others.  You may 
prefer a different setup.  It's easy to change.  The 
lines in the macro correspond exactly to the 
checkboxes on the three AutoCorrect tabs, with True 
meaning checked and False meaning unchecked.  Using 
the copy of your setup as a guide, change the Text_OFF 
settings in the provided example from True to False or 
vice versa.

The Text_OFF settings should correspond to your 
current, preferred setup for normal documents.  I 
recommend that you use the suggested settings for the 
Options section of Text_ON, but the first four entries 
in the AutoCorrect section can be changed as desired.  
The fifth entry under AutoCorrect toggles the Replace 
Text feature off and on.  If you delete the problem-
causing entries from the table, you can leave this 
alone.  Just delete the line from both macros and it 
won't be changed by either of them.

Keep a copy of this document with your preferred 
settings.  If you decide later to modify them, it's 
easy to change the macros.  First, edit the text to 
reflect your new preferences.  Then go to the Macros 
dialog (Alt + F8), delete the old versions, and then 
recreate them using your modified versions.
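
Because the two macros differ only in their True and 
False values, one way to keep them in step is to 
generate the text of each from a single list of 
settings.  The Python sketch below is an optional 
illustration, not something Word requires.  The 
function name is hypothetical, and only three of the 
AutoCorrect properties from the macros are shown.

```python
def vba_with_block(object_name, settings):
    """Build a VBA 'With' block from property values."""
    lines = ["With " + object_name]
    for prop, value in settings.items():
        word = "True" if value else "False"
        lines.append("  ." + prop + " = " + word)
    lines.append("End With")
    return "\n".join(lines)


# Values as given for Text_ON; extend the list as needed.
text_on = {
    "CorrectInitialCaps": True,
    "CorrectSentenceCaps": False,
    "ReplaceText": False,
}
print(vba_with_block("AutoCorrect", text_on))
```

The printed lines can then be pasted into the Visual 
Basic Editor in the same way as the copies provided 
in this document.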


Creating the Macros

In the section immediately below labeled TEXT_ON 
MACRO, highlight and copy the lines between START and 
STOP.  The shortcut for Copy is Ctrl + C.

TEXT_ON MACRO
START
' Settings from the AutoCorrect tab
With AutoCorrect
  .CorrectInitialCaps = True
  .CorrectSentenceCaps = False
  .CorrectDays = True
  .CorrectCapsLock = True
  .ReplaceText = False
End With
With Options
  ' Settings from the AutoFormat As You Type tab
  .AutoFormatAsYouTypeApplyHeadings = False
  .AutoFormatAsYouTypeApplyBorders = False
  .AutoFormatAsYouTypeApplyBulletedLists = False
  .AutoFormatAsYouTypeApplyNumberedLists = False
  .AutoFormatAsYouTypeApplyTables = False
  .AutoFormatAsYouTypeReplaceQuotes = False
  .AutoFormatAsYouTypeReplaceSymbols = False
  .AutoFormatAsYouTypeReplaceOrdinals = False
  .AutoFormatAsYouTypeReplaceFractions = False
  .AutoFormatAsYouTypeReplacePlainTextEmphasis = False
  .AutoFormatAsYouTypeReplaceHyperlinks = False
  .AutoFormatAsYouTypeFormatListItemBeginning = False
  .AutoFormatAsYouTypeDefineStyles = False
  ' Settings from the AutoFormat tab
  .AutoFormatApplyHeadings = False
  .AutoFormatApplyLists = False
  .AutoFormatApplyBulletedLists = False
  .AutoFormatApplyOtherParas = False
  .AutoFormatReplaceQuotes = False
  .AutoFormatReplaceSymbols = False
  .AutoFormatReplaceOrdinals = False
  .AutoFormatReplaceFractions = False
  .AutoFormatReplacePlainTextEmphasis = False
  .AutoFormatReplaceHyperlinks = False
  .AutoFormatPreserveStyles = False
  .AutoFormatPlainTextWordMail = False
End With
STOP

Now, press Alt + F8.  This brings up the Macros 
dialog.  If there's anything in the top box, Macro 
Name, press the Delete key to clear it.  Type Text_ON, 
then click the Create button.

This will open the Visual Basic Editor.  In the right 
pane, you should see the cursor on a blank line.  
Above it will be several lines beginning with 'Sub 
Text_ON.'  Immediately below will be a line that says 
'End Sub.'  Press Ctrl + V (or use Edit, Paste) to 
insert the text you copied.  Click the X in the upper 
right corner, which will close the Visual Basic Editor 
and return you to this document.

Now, repeat the process to create a Text_OFF macro.  
Begin by copying the following lines between START and 
STOP as before:

TEXT_OFF MACRO
START
' Settings from the AutoCorrect tab
With AutoCorrect
  .CorrectInitialCaps = True
  .CorrectSentenceCaps = False
  .CorrectDays = True
  .CorrectCapsLock = True
  .ReplaceText = True
End With
With Options
  ' Settings from the AutoFormat As You Type tab
  .AutoFormatAsYouTypeApplyHeadings = True
  .AutoFormatAsYouTypeApplyBorders = True
  .AutoFormatAsYouTypeApplyBulletedLists = True
  .AutoFormatAsYouTypeApplyNumberedLists = True
  .AutoFormatAsYouTypeApplyTables = True
  .AutoFormatAsYouTypeReplaceQuotes = True
  .AutoFormatAsYouTypeReplaceSymbols = True
  .AutoFormatAsYouTypeReplaceOrdinals = True
  .AutoFormatAsYouTypeReplaceFractions = True
  .AutoFormatAsYouTypeReplacePlainTextEmphasis = True
  .AutoFormatAsYouTypeReplaceHyperlinks = True
  .AutoFormatAsYouTypeFormatListItemBeginning = True
  .AutoFormatAsYouTypeDefineStyles = True
  ' Settings from the AutoFormat tab
  .AutoFormatApplyHeadings = True
  .AutoFormatApplyLists = True
  .AutoFormatApplyBulletedLists = True
  .AutoFormatApplyOtherParas = True
  .AutoFormatReplaceQuotes = True
  .AutoFormatReplaceSymbols = True
  .AutoFormatReplaceOrdinals = True
  .AutoFormatReplaceFractions = True
  .AutoFormatReplacePlainTextEmphasis = True
  .AutoFormatReplaceHyperlinks = True
  .AutoFormatPreserveStyles = True
  .AutoFormatPlainTextWordMail = True
End With
STOP

Once again, press Alt + F8 to bring up the Macros 
dialog.  Press the Delete key to clear the Macro Name 
box, and type Text_OFF, then click the Create button.

The cursor will again be on a blank line below several 
lines beginning with 'Sub Text_OFF' and above a line 
that says 'End Sub.'  Press Ctrl + V (or use Edit, 
Paste) to insert the text you copied.  Click the X in 
the upper right corner to close the Visual Basic 
Editor and return to this document.

You should now have two macros, Text_ON and Text_OFF.  
To test them, press Alt + F8, and double-click the 
Text_ON macro (or click Text_ON and then the Run 
button).  Go to the Tools, AutoCorrect dialog and 
check the 'AutoFormat As You Type' tab.  Everything 
should be turned off.  Now run the Text_OFF macro and 
check the dialog again.  Everything should be switched 
back to your preferred settings.


Creating Your Text

So, with these tools in hand, you're ready to start a 
new project.  To create a document, use File, New and 
select the Text template you created earlier.  Before 
doing anything, run the Text_ON macro.  You'll need to 
run the macro again each time you begin a new editing 
session and run the Text_OFF macro whenever you switch 
to another kind of document.

Now, all you have to do is to keep in mind the 
eventual goal.  Mostly that means not doing things you 
know won't convert, such as Word styles, bulleted 
lists, section breaks, columns, and so on.  Avoid 
bold, italics and underlining.  If you need this kind 
of emphasis, follow the plain text conventions of 
indicating bold by preceding and following the text 
with asterisks like *this* and underlining or italic 
with underscores like _this._  With the AutoFormat 
features turned off, these will not be converted.

For titles, I recommend a simple block at the left 
margin, as in the following example.

Converting Word Documents to Text
By Titmouse
(C) August, 1999

You may wish to use capital letters for the actual 
title.  For section headings, I recommend placing two 
blank lines before and one after.  I've used this 
convention throughout this document.

If you want to underline a heading, do so with hyphens 
on a separate line beneath.  Keep in mind, however, 
that if you do this in any font other than Courier (or 
some other monospaced font), you actually have no idea 
how many hyphens are needed unless you count the 
characters in the heading.  Most fonts are 
proportional.  Each character, that is, has a separate 
width, so that an 'm' and an 'i' take up different 
amounts of line space.  With monospaced fonts like 
Courier, each character has the same width.
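
Counting the characters is trivial for a program.  As 
a small illustration, this Python sketch (with a 
hypothetical function name) produces a heading with a 
row of hyphens of exactly the same length.

```python
def underline_heading(heading):
    """Return the heading plus a matching row of hyphens."""
    return heading + "\n" + "-" * len(heading)


print(underline_heading("Saving the Document as Text"))
```

The two lines will only appear to match, of course, 
when the reader views them in a monospaced font.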

You also need to decide how you want to separate 
paragraphs in your text.  There are two basic 
approaches.  In one, paragraphs are not indented and 
an extra blank line separates them.  In the other, 
paragraphs are indented with a tab or spaces and the 
extra line is omitted.  Either of these is acceptable, 
but the first is preferred.  Some software seems to 
strip out tabs and spaces.
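
The first approach, block paragraphs separated by a 
blank line, can be sketched in Python with the 
standard textwrap module.  This is only an 
illustration of the layout; the function name and the 
sample paragraphs are made up, and the width of 55 is 
the line length recommended earlier.

```python
import textwrap


def to_block_paragraphs(paragraphs, width=55):
    """Wrap each paragraph; separate with blank lines."""
    wrapped = [textwrap.fill(p, width=width)
               for p in paragraphs]
    return "\n\n".join(wrapped)


story = [
    "The first paragraph runs on for long enough that "
    "it has to wrap onto a second line.",
    "The second paragraph is short.",
]
print(to_block_paragraphs(story))
```

Because nothing in this layout depends on leading 
tabs or spaces, it survives software that strips 
them.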


Saving the Document as Text

While you're working on the story or article, save it 
as a normal Word document.  You'll probably want to 
maintain an archive version in that format anyway.  
When you've finished the final editing for your story 
and are ready to post it, save a new copy as

MS-DOS Text with Line Breaks.

Then close the document in Word (or exit Word), 
double-click on your new document to load it into 
Notepad or WordPad, and inspect it carefully for 
surprises.  If you need to make corrections other than 
centering titles and headings with spaces, go back to 
your Word document to make them and then resave over 
your text version, always specifying 'MS-DOS Text with 
Line Breaks.'

There is an alternative for those who use tab-indented 
paragraphs or spaces to provide formatting.  If you 
save your final text version as 'MS-DOS Text with 
Layout,' Word converts tabs to spaces and generally 
preserves the visual layout.  For reasons that escape 
me, an extension of 'asc' is used for such documents.  
You'll probably want to rename it with 'txt,' since 
the 'asc' extension probably won't be recognized.  Be 
aware, though, that some software eliminates "extra" 
spaces.  This is why block format is preferred.

When you're ready to post it, open the text version, 
copy the contents and paste into whatever software 
you're using to post with.  This should work in all 
cases except for longer stories that exceed the limits 
of certain providers (AOL, most notoriously).  If you 
have that problem, you'll need to go back to your 
original story and break it into segments that fit 
under the limits.
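
Breaking a story into segments can be sketched in 
Python as follows.  This is a rough illustration 
under stated assumptions: paragraphs are separated by 
blank lines, the limit is counted in characters, and 
both the function name and the sample limit are 
hypothetical.  Check your provider's actual limit 
before relying on any particular number.

```python
def split_into_parts(text, max_chars):
    """Split at paragraph breaks into parts <= max_chars.

    A single paragraph longer than max_chars is kept
    whole and becomes an oversized part of its own.
    """
    parts = []
    current = ""
    for para in text.split("\n\n"):
        if not current:
            current = para
        elif len(current) + 2 + len(para) <= max_chars:
            current = current + "\n\n" + para
        else:
            parts.append(current)
            current = para
    if current:
        parts.append(current)
    return parts


# Made-up story and an arbitrary limit of 10 characters.
story = "aaaa\n\nbbbb\n\ncccc"
for number, part in enumerate(split_into_parts(story, 10), 1):
    print(f"--- part {number} ---")
    print(part)
```

Splitting only at paragraph breaks keeps each segment 
readable on its own.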


Final Thoughts

Okay, that's more than enough.  I hope I haven't left 
out anything significant or made any stupid mistakes.  
I'm sure wiser heads will let me know, if so. I'll 
repost this note periodically with accumulated 
corrections.  A copy of the latest version will also 
be available on the FAQs pages (both web and ftp 
versions) at ASSTR.
 
After the original publication of this document, there 
was considerable discussion about various problems in 
converting existing documents to ASCII and correcting 
format problems in other people's documents.  I 
included some ideas in the original version, but this 
seems to me to be a topic of sufficient complexity to 
require its own discussion.  If there's enough 
interest, I would be willing to take it on.

Please note that if you want to e-mail me directly, 
the address is 'nitesweats@aol.com' not the dummy 
address in the header.

Peace,
Titmouse