htmlCharset - project overview

Introduction

htmlCharset (see the SourceForge project page) is a file conversion tool that replaces HTML entities (character references such as &agrave; or &alpha;) with the actual characters they represent, or vice versa. As a spin-off, it can also be used as a general charset converter for arbitrary text files.

In general, depending on the charset in use, the conversion from a given entity to its corresponding character will only be possible if the charset has an entry for that particular character. With this restriction in mind, the program will only attempt to decode entities that do fit in the chosen target charset, leaving the rest in their encoded HTML-entity form. This behavior guarantees that no information will be lost or corrupted by the conversion process.
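The rule described above can be illustrated with a minimal Java sketch. It is not htmlCharset's actual source: the class name EntityDecodeSketch, the decode method and the three-entry entity table are hypothetical, chosen only to show how java.nio.charset.CharsetEncoder.canEncode can decide whether an entity "fits" in the target charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the project's code.
final class EntityDecodeSketch {
    // Tiny sample table (assumption); a real tool would cover the full HTML entity set.
    private static final Map<String, Integer> NAMED = new HashMap<String, Integer>();
    static {
        NAMED.put("agrave", 0x00E0); // a with grave accent
        NAMED.put("alpha", 0x03B1);  // Greek small letter alpha
        NAMED.put("euro", 0x20AC);   // euro sign
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces each known entity by its character, but only when the target
    // charset can encode that character; everything else is left untouched.
    static String decode(String input, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // default: keep the entity as written
            Integer codePoint = NAMED.get(m.group(1));
            if (codePoint != null) {
                String ch = new String(Character.toChars(codePoint));
                if (encoder.canEncode(ch)) {
                    replacement = ch;
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With ISO-8859-1 as the target, this sketch decodes &agrave; but leaves &alpha; encoded, because Latin-1 has no entry for the Greek letter. Entities missing from the sample table, including &amp;, pass through unchanged.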

Converting to a Unicode charset such as UTF-8, for example, ensures that all non-ASCII entities are replaced by the corresponding characters. Conversely, setting US-ASCII as the target charset brings these characters back to their HTML-entity representation. The program can therefore be used for both decoding and encoding, simply by changing the target charset.
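The encoding direction can be sketched in the same spirit. The class EntityEncodeSketch below is hypothetical, and it writes decimal numeric character references (such as &#233;) as a simplifying assumption; whether htmlCharset emits named or numeric entities is not stated here.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical illustration, not the project's code.
final class EntityEncodeSketch {
    // Writes every character the target charset cannot encode as a decimal
    // numeric character reference; all other characters are copied as-is.
    static String encode(String input, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i); // handles surrogate pairs
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

With US-ASCII as the target, every non-ASCII character comes out as an entity; with UTF-8, the input is returned unchanged because every character fits.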

If no charset is specified, the underlying platform charset will be used, and all HTML entities covered by this charset will be decoded.
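In Java, the platform charset is available through Charset.defaultCharset(). The fallback described above could look roughly like the following sketch, where the class TargetCharsetSketch and the method resolve are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical illustration, not the project's code.
final class TargetCharsetSketch {
    // Returns the named charset, or the platform default when no name is given.
    // Charset.forName throws an unchecked exception for unknown or illegal names.
    static Charset resolve(String name) {
        if (name == null || name.length() == 0) {
            return Charset.defaultCharset();
        }
        return Charset.forName(name);
    }
}
```

Because the default depends on the machine, the same input file may be decoded differently on two systems unless a target charset is given explicitly.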

Features

Running htmlCharset

htmlCharset is written in Java (requires v1.5 or higher) and comes in two distinct flavors: as a graphical Java Swing application (named htmlCharset) and as a command line utility (named htmlCharsetCmd).

Win32 specific

The Win32 distribution contains .exe wrappers for the application's jars:

Command line usage

[options]
-?, -h, --help
   Show this usage info and exit.
-a, --ascii
   Encode to ASCII; i.e., write all characters beyond the 7-bit US-ASCII charset as HTML entities. Identical to option: -t US-ASCII.
-b, --backup-source
   Generate backup copies of the original source files.
-c, --supported-charsets
   Print the list of supported charsets and exit.
-s, --source-charset <charset>
   Read source files using the named charset. Defaults to your underlying platform charset.
-t, --target-charset <charset>
   Write output in the named charset. Defaults to your underlying platform charset. HTML entities representing a character that fits in the target charset will be replaced by the character itself. Characters lying outside the target charset will be written as HTML entities. This does not apply to the four markup-significant entities &quot; ("), &amp; (&), &lt; (<) and &gt; (>) which, by default, are not modified (see also options -m and -n).
-m, --decode-markup
   Output as characters all instances of the markup-significant entities &quot;, &amp;, &lt; and &gt;. No other transformations applied unless options -a or -t are also explicitly set.
-n, --encode-markup
   Output as entities all instances of the markup-significant characters ("), (&), (<) and (>). No other transformations applied unless options -a or -t are also explicitly set.
-r, --recursive
   Recurse into subdirectories.
-v, --verbose
   Warn of unknown entities encountered while parsing input files.
-x, --source-fileext <fileext>
   Process only source files with a file extension matching <fileext>.

<filenames>
   List of source files or directories to be processed.
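
The handling of the four markup-significant characters (options -m and -n above) can be sketched as follows. The class MarkupSketch and its methods are hypothetical. The sketch treats every & in the input as a literal character, so an existing entity such as &agrave; would be re-encoded as &amp;agrave;. How htmlCharset itself treats that case is not specified here.

```java
// Hypothetical illustration, not the project's code.
final class MarkupSketch {
    // Mirrors the idea behind -n: write ", &, < and > as entities.
    static String encodeMarkup(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': out.append("&quot;"); break;
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Mirrors the idea behind -m: write the four entities as characters.
    // &amp; is handled last so that "&amp;lt;" becomes "&lt;" rather than "<".
    static String decodeMarkup(String s) {
        return s.replace("&quot;", "\"")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }
}
```

Decoding the output of encodeMarkup returns the original string, which is a convenient sanity check for any implementation of this pair of options.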

Downloads

Download links for all distributions of htmlCharset are available from the SourceForge project download page.

Source code documentation

You can read the project javadoc here.

License

This program is free software, distributed under the GNU General Public License (GPL), version 3.

Credits

htmlCharset uses elements from the following free projects: