htmlCharset - project overview

Introduction

htmlCharset, hosted on SourceForge, is a file conversion tool for replacing HTML entities (character references such as &agrave; or &alpha;) with the actual characters they represent, or vice versa. As a spin-off, it can also be used as a general charset converter for arbitrary text files.

In general, depending on the charset in use, the conversion from a given entity to its corresponding character will only be possible if the charset has an entry for that particular character. With this restriction in mind, the program will only attempt to decode entities that do fit in the chosen target charset, leaving the rest in their encoded HTML-entity form. This behavior guarantees that no information will be lost or corrupted by the conversion process.

Converting to a Unicode charset such as UTF-8, for example, ensures that all non-ASCII entities are replaced by the corresponding characters. Conversely, setting US-ASCII as the target charset turns these characters back into their HTML-entity representation. The program can therefore be used for both decoding and encoding, simply by changing the target charset.
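The "fits in the target charset" rule described above can be illustrated with the standard java.nio.charset API. The following is a minimal sketch, not the project's own implementation: the class and method names are hypothetical, and for brevity it writes unencodable characters as hexadecimal numeric references rather than as named entities.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical illustration class; not part of htmlCharset.jar.
public class CharsetFitSketch {

    /**
     * Keeps every code point that the target charset can encode and
     * writes all others as hexadecimal numeric character references.
     */
    static String encodeForCharset(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);                       // character fits: keep it
            } else {
                out.append("&#x")                     // does not fit: keep it encoded
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
            i += Character.charCount(cp);             // step over surrogate pairs
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \u03b1";             // e-acute and Greek alpha
        // US-ASCII cannot hold either character: prints caf&#xE9; &#x3B1;
        System.out.println(encodeForCharset(text, Charset.forName("US-ASCII")));
        // ISO-8859-1 holds the e-acute but not the alpha.
        System.out.println(encodeForCharset(text, Charset.forName("ISO-8859-1")));
        // UTF-8 holds both, so nothing is left in encoded form.
        System.out.println(encodeForCharset(text, Charset.forName("UTF-8")));
    }
}
```

Because characters outside the target charset stay in reference form, the same text can later be decoded against a wider charset without loss.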

If no charset is specified, the underlying platform charset will be used, and all HTML entities covered by this charset will be decoded.
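To check which charset would be picked up by default on a given machine, and whether a particular charset name is available, the standard Java API can be queried directly. This is a small sketch using only java.nio.charset; the class name is hypothetical, and the assumption that htmlCharset resolves its default in the same way is not confirmed by the project documentation.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of htmlCharset.jar.
public class PlatformCharsetSketch {

    /** Returns the canonical name of the JVM's default charset. */
    static String platformCharsetName() {
        return Charset.defaultCharset().name();
    }

    /** Tells whether the running Java runtime supports the named charset. */
    static boolean isSupported(String charsetName) {
        return Charset.isSupported(charsetName);
    }

    public static void main(String[] args) {
        System.out.println("Platform charset: " + platformCharsetName());
        System.out.println("Charsets available: " + Charset.availableCharsets().size());
        System.out.println("ISO-8859-15 supported: " + isSupported("ISO-8859-15"));
    }
}
```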

Features

Capacity for batch multiple-file conversion, with the choice of pattern filename filtering and optional backup of original files.
Conversion between a large number of language encodings (as many charsets as your Java runtime version supports).
Can be configured for either HTML encoding or decoding.
Recognizes the complete set of character entity references defined in the HTML 4.0 specification.
Deals correctly with all Unicode characters, including supplementary characters (i.e., code points lying outside the Basic Multilingual Plane).
Deals correctly with numerical character references, either decimal or hexadecimal.
Independent transformation rules can be set for the markup-significant HTML entities/characters ('"', '&', '<', '>'): either leave them alone, or transform them to their entity or character form.
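As an illustration of what handling numerical character references and supplementary characters involves, here is a minimal Java sketch. It is not taken from the project's source: the class name, method name and regular expression are assumptions made for this example, and named entities are deliberately left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of htmlCharset.jar.
public class NumericReferenceSketch {

    // Group 1: hexadecimal digits (&#x3B1;), group 2: decimal digits (&#945;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Replaces decimal and hexadecimal character references by their characters. */
    static String decodeNumericReferences(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            int cp = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (Character.isValidCodePoint(cp) && !surrogate) {
                // appendCodePoint emits a surrogate pair for code points
                // beyond the Basic Multilingual Plane.
                out.appendCodePoint(cp);
            } else {
                out.append(m.group());   // leave invalid references untouched
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Greek alpha twice (decimal, hex) and U+1D11E, a supplementary character.
        String decoded = decodeNumericReferences("&#945; &#x3B1; &#x1D11E;");
        System.out.println(decoded.codePointCount(0, decoded.length())); // prints 5
    }
}
```

Working with code points rather than char values is what keeps supplementary characters intact: U+1D11E occupies two char positions in a Java String but counts as a single character.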

The project's jar file htmlCharset.jar can also be used in external Java applications as a conversion library.

The following code, for example,

      import org.htmlCharset.core.*;

      ...
      String inText = "Some accented vowels: &agrave;, &eacute;, &egrave;, etc.";
      String outText = new HTMLTransformer().transform(inText);
      ...

rewrites the HTML entities present in inText as characters.

Running htmlCharset

htmlCharset is written in Java (requires v1.5 or higher) and comes in two flavors: a graphical Java Swing application (named htmlCharset) and a command-line utility (named htmlCharsetCmd).

To run the htmlCharset application type:
```
      java -jar htmlCharset.jar
```

To run the htmlCharsetCmd application type:

      java -jar htmlCharsetCmd.jar [options] <filenames>

Win32 specific

The Win32 distribution contains .exe wrappers for the application's jars:

htmlCharset.exe (gui version)
htmlCharsetCmd.exe (command line version)

Command line usage

      [options]
        -?, -h, --help
              Show this usage info and exit.
        -a, --ascii
              Encode to ASCII; i.e., write all characters beyond the 7-bit
              US-ASCII charset as HTML entities. Identical to option: -t US-ASCII.
        -b, --backup-source
              Generate backup copies of the original source files.
        -c, --supported-charsets
              Print the list of supported charsets and exit.
        -s, --source-charset <charset>
              Read source files using the named charset.
              Defaults to your underlying platform charset.
        -t, --target-charset <charset>
              Write output in the named charset.
              Defaults to your underlying platform charset.
              HTML entities representing a character that fits in the target
              charset will be replaced by the character itself. Characters lying
              outside the target charset will be written as HTML entities.
              This does not apply to the four markup-significant entities
              &quot; ("), &amp; (&), &lt; (<) and &gt; (>) which, by default,
              are not modified (see also options -m and -n).
        -m, --decode-markup
              Output as characters all instances of the markup-significant
              entities &quot;, &amp;, &lt; and &gt;. No other transformations
              are applied unless options -a or -t are also explicitly set.
        -n, --encode-markup
              Output as entities all instances of the markup-significant
              characters ("), (&), (<) and (>). No other transformations
              are applied unless options -a or -t are also explicitly set.
        -r, --recursive
              Recurse into subdirectories.
        -v, --verbose
              Warn of unknown entities encountered while parsing input files.
        -x, --source-fileext <fileext>
              Process only source files with a file extension matching <fileext>.

      <filenames>
              List of source files or directories to be processed.
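The effect of the -m and -n options on the four markup-significant characters can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, the input is treated as plain text (so an ampersand that already starts an entity would be encoded again), and the actual program may handle these cases differently.

```java
// Hypothetical illustration class; not part of htmlCharset.jar.
public class MarkupEntitySketch {

    /** Roughly what -n does: write ", &, < and > as entities. */
    static String encodeMarkup(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"': out.append("&quot;"); break;
                case '&': out.append("&amp;");  break;
                case '<': out.append("&lt;");   break;
                case '>': out.append("&gt;");   break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Roughly what -m does: write the four entities as characters. */
    static String decodeMarkup(String text) {
        // &amp; is replaced last so that "&amp;lt;" becomes "&lt;", not "<".
        return text.replace("&quot;", "\"")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String encoded = encodeMarkup("a < b & \"c\"");
        System.out.println(encoded);               // prints a &lt; b &amp; &quot;c&quot;
        System.out.println(decodeMarkup(encoded)); // prints a < b & "c"
    }
}
```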

Downloads

Download links for all distributions of htmlCharset are available from the SourceForge project download page.

Source code documentation

You can read the project javadoc here.

License

This program is free software, distributed under the GNU GPL (version 3).

Credits

htmlCharset uses elements from the following free projects:

swing-layout library.
Silk icons from famfamfam icon sets.
jSmooth, for the generation of Win32 .exe wrappers.