Web Matters

Proper HTML quotation marks and dashes

A disadvantage of using a simple text editor to produce HTML is that it is relatively time-consuming to put in the proper typographical quotation marks and dashes: for example, “Welcome – come in” rather than "Welcome - come in". Furthermore, Windows applications that do insert these characters often use Windows-specific character codes instead of the proper platform-independent HTML characters.

I offer here a pair of sed scripts which automatically generate the proper characters.

sed

If you haven’t met sed before: it is a small (40kB – microscopic compared to your average Windows application!) batch editor. You can download it free from SourceForge, which also has tutorials and other material. (Not that you need tutorials to use sed scripts – only to write them.) There is further information on sed at http://www.student.northpark.edu/pemente/sed/.

I have tested these scripts on sed 3.59 on Windows.

Script functions

There are two scripts: one for Windows and one for other operating systems.

Windows version: proper_quotes_win.sed
Non-Windows version: proper_quotes_nonwin.sed

Feel free to download them and use them (for non-commercial purposes). If they’ve been useful I’d appreciate it if you’d tell me – and please tell me if you have problems with them.

What they do is:

  1. replace plain quotation marks (single and double) with the appropriate typographical symbols;
  2. replace a hyphen surrounded by spaces with an en-dash;
  3. replace a double hyphen with an em-dash and strip any surrounding spaces;
  4. (Windows version only) replace the non-standard Windows quote and dash characters in the range 128-159 with the proper HTML equivalents; a few other common characters in that range are also converted;
  5. convert any existing numeric character references for the quote and dash symbols to character entity references.
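
To give a feel for what rules of this kind look like, here is a minimal sketch of rules 2, 3 and 5, plus a deliberately naive version of rule 1 for double quotes only. It is not the content of either script: the function name convert and the exact patterns are assumptions made for illustration. It makes no attempt to skip tags, attributes or comments, so it is only safe on plain text such as the sample line.

```shell
#!/bin/sh
# Hypothetical sketch only: NOT the contents of proper_quotes_win.sed
# or proper_quotes_nonwin.sed.
# In a sed replacement, "&" means "the matched text", so a literal
# ampersand has to be written as "\&".
convert() {
  sed \
    -e 's/ *-- */\&mdash;/g' \
    -e 's/ - / \&ndash; /g' \
    -e 's/&#8211;/\&ndash;/g' \
    -e 's/&#8212;/\&mdash;/g' \
    -e 's/"\([A-Za-z]\)/\&ldquo;\1/g' \
    -e 's/\([A-Za-z.,!?]\)"/\1\&rdquo;/g'
}

# Prints:
# &ldquo;Welcome &ndash; come in&rdquo;&mdash;she said &ndash; twice
printf '%s\n' '"Welcome - come in" -- she said &#8211; twice' | convert
```

The quote rules here only fire when a letter (or closing punctuation) sits next to the quote mark, which is one simple way of deciding between left and right quotes; anything else is left alone.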

Script usage

They are simple to use. For example, to convert file magic.html, just type the following on the command line (assuming the script and file are in the same directory and sed itself is either in the same directory or on the path):

sed -f proper_quotes_win.sed magic.html >magic2.html

You can then rename magic2.html back to magic.html after you have done any checking you may wish to do.

And of course you can, if desired, reduce the amount of typing by creating a small batch file. For DOS/Windows, if you create a file pq.bat containing:

@echo off
sed -f D:\website\tools\proper_quotes_win.sed %1 >temp.html
copy temp.html %1

(Obviously the path must be adjusted to wherever you keep your tools.) Then all you need to type is pq magic.html (or whatever your file name is).
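
For non-Windows systems, a shell function can play the same role as pq.bat. The following is a sketch under assumptions, not something supplied with the scripts: the function name pq, the variable PQ_SCRIPT and its default location are all hypothetical, and it relies on mktemp, which is widely available but not part of POSIX.

```shell
#!/bin/sh
# Hypothetical Unix counterpart of pq.bat (a sketch, not part of the
# original scripts). PQ_SCRIPT is an assumed location; adjust it to
# wherever you keep your tools.
PQ_SCRIPT=${PQ_SCRIPT:-$HOME/website/tools/proper_quotes_nonwin.sed}

pq() {
  # Write to a temporary file first and overwrite the original only
  # if sed succeeded, so a failed run cannot empty the page.
  tmp=$(mktemp) || return 1
  if sed -f "$PQ_SCRIPT" "$1" > "$tmp"; then
    cp "$tmp" "$1"   # copying over the file keeps its permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

With the function loaded (for example from your shell start-up file), typing pq magic.html converts the file in place. Unlike the batch file, this version offers no chance to check the result before the original is overwritten, so keep a copy if that matters.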

Script restrictions

There are some minor restrictions:

  1. Where a quote is adjacent to non-alphabetic characters, there are some situations where the script can’t unambiguously determine whether a left or a right quote is required; such quotes are left untranslated.
  2. The scripts do not recognise multi-line HTML comments (handling these in sed would be quite complex), with the result that quotes within such comments will also be converted. This will not, of course, affect the rendered page.
  3. Similarly, the scripts can’t, within the body element, handle HTML attributes which span a line break (these are probably rare and arguably bad practice anyway). Note: tags which span a line break are no problem, as long as no individual attribute contains a line break.

    However, between the <HEAD> and </HEAD> tags no characters are converted, so HTML attributes there can span a line break (in practice this means within META elements).
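
The HEAD exemption can be pictured with a sed address range. sed normally processes one line at a time, which is also why an attribute or comment split across lines is hard to recognise. The following is only a guess at the general technique, not an extract from the scripts; the function name skip_head and the single en-dash rule are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: apply a substitution everywhere EXCEPT on the
# lines from <HEAD> to </HEAD>. The "!" after the address range
# negates it. Assumes the two tags are on different lines.
skip_head() {
  sed '/<[Hh][Ee][Aa][Dd]>/,/<\/[Hh][Ee][Aa][Dd]>/!s/ - / \&ndash; /g'
}

# The META attribute spanning two lines is left untouched; only the
# hyphen in the body line is converted.
printf '%s\n' \
  '<HEAD>' \
  '<META NAME="description"' \
  ' CONTENT="a - b">' \
  '</HEAD>' \
  '<BODY><P>a - b</P></BODY>' | skip_head
```

One caveat with this approach: if the opening and closing tags share a line, sed looks for the end of the range on later lines only, so the exemption would run on further than intended.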

And some very minor restrictions which you are unlikely to hit:

Acknowledgements

Thanks to David Warren Steel, who tested these for me on the SunOS and Irix dialects of Unix, and helped me track down a particularly obscure problem. Thanks also to Stan Brown, who found several extra cases which needed handling (I didn’t manage to address all of them, but have handled the ones most likely to occur).