htmlCharset (SourceForge project page) is a file conversion tool for replacing HTML entities (character encodings such as &agrave; or &alpha;) with the characters they represent, or vice versa. As a side benefit, it can also be used as a general charset converter for arbitrary text files.
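As a rough illustration of what plain charset conversion involves, the following Java sketch decodes bytes with one charset and re-encodes them with another. It is not htmlCharset's own code: the class and method names are hypothetical, and only standard JDK calls are used.

```java
import java.nio.charset.Charset;

class TranscodeSketch {
    // Decode the bytes using the source charset, then re-encode them in the target charset.
    static byte[] transcode(byte[] input, Charset source, Charset target) {
        return new String(input, source).getBytes(target);
    }

    public static void main(String[] args) {
        // 0xE9 is "e with acute accent" in ISO-8859-1.
        byte[] latin1 = { (byte) 0xE9 };
        byte[] utf8 = transcode(latin1, Charset.forName("ISO-8859-1"), Charset.forName("UTF-8"));
        System.out.println(utf8.length + " byte(s) after conversion to UTF-8");
    }
}
```

Note that `String.getBytes` silently substitutes a replacement byte for characters the target charset cannot represent, which is the kind of loss that leaving entities encoded avoids for HTML input.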
An entity can be converted to its corresponding character only if the charset in use has an entry for that character. With this restriction in mind, the program decodes only the entities that fit in the chosen target charset and leaves the rest in their encoded HTML-entity form. This behavior guarantees that no information is lost or corrupted by the conversion process.
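The rule described above can be sketched in Java with the JDK's `CharsetEncoder.canEncode`. This is a minimal illustration under stated assumptions, not the program's actual implementation: the class name, the `decode` method and the three-entry entity table are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectiveDecodeSketch {
    // Tiny illustrative table; a real converter would cover the full set of HTML entities.
    private static final Map<String, String> ENTITIES = new HashMap<String, String>();
    static {
        ENTITIES.put("agrave", "\u00E0"); // a with grave accent
        ENTITIES.put("eacute", "\u00E9"); // e with acute accent
        ENTITIES.put("alpha", "\u03B1");  // Greek small letter alpha
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Decode an entity only when it is known AND its character fits in the target charset.
    static String decode(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ch = ENTITIES.get(m.group(1));
            boolean decodable = ch != null && encoder.canEncode(ch);
            // Unknown or unrepresentable entities are copied through unchanged.
            String replacement = decodable ? ch : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "&agrave; &alpha; &amp;";
        // With ISO-8859-1 only &agrave; is decoded; &alpha; has no entry in that charset.
        System.out.println(decode(in, Charset.forName("ISO-8859-1")));
        // With UTF-8 both &agrave; and &alpha; are decoded.
        System.out.println(decode(in, Charset.forName("UTF-8")));
    }
}
```

In this sketch `&amp;` is left alone simply because it is absent from the table.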
Converting to a Unicode charset such as UTF-8, for example, replaces every entity that represents a non-ASCII character with the character itself. Conversely, setting US-ASCII as the target charset turns these characters back into their HTML-entity representation. The program can therefore be used for both decoding and encoding, simply by changing the target charset.
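The encoding direction can be sketched in the same way. The example below is an assumption-laden illustration rather than htmlCharset's code: the class and method names are hypothetical, and it emits numeric character references (such as `&#224;`) instead of named entities to avoid needing a lookup table.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodeForTargetSketch {
    // Keep characters the target charset can represent; write the rest as numeric references.
    static String encode(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars, depending on whether cp is a supplementary code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "\u00E0 la \u03B1";
        // prints "&#224; la &#945;"
        System.out.println(encode(in, Charset.forName("US-ASCII")));
    }
}
```

With UTF-8 as the target, the same call returns the input unchanged, since every character is representable.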
If no charset is specified, the underlying platform charset will be used, and all HTML entities covered by this charset will be decoded.
htmlCharset.jar can also be used in external Java applications as a conversion library. The following code, for example, will rewrite as characters the HTML entities present in inText:

    import org.htmlCharset.core.*;
    ...
    String inText = "Some accented vowels: &agrave;, &eacute;, &egrave;, etc.";
    String outText = new HTMLTransformer().transform(inText);
    ...
htmlCharset is written in Java (requires v1.5 or higher) and comes in two distinct flavors: as a graphical Java Swing application (named htmlCharset, see screenshot) and as a command line utility (named htmlCharsetCmd).
java -jar htmlCharset.jar
java -jar htmlCharsetCmd.jar [options] <filenames>
The Win32 distribution contains .exe wrappers for the application's jars:
htmlCharset.exe (GUI version)
htmlCharsetCmd.exe (command line version)
[options]
-?, -h, --help
Show this usage info and exit.
-a, --ascii
Encode to ASCII; i.e., write all characters beyond the 7-bit US-ASCII
charset as HTML entities. Identical to option: -t US-ASCII.
-b, --backup-source
Generate backup copies of the original source files.
-c, --supported-charsets
Print the list of supported charsets and exit.
-s, --source-charset <charset>
Read source files using the named charset. Defaults to your underlying
platform charset.
-t, --target-charset <charset>
Write output in the named charset. Defaults to your underlying platform
charset. HTML entities representing a character that
fits in the target charset will be replaced by the character itself.
Characters lying outside the target charset will be written as HTML
entities. This does not apply to the four markup-significant entities
&quot; ("), &amp; (&), &lt; (<) and &gt; (>) which, by default, are
not modified (see also options -m and -n).
-m, --decode-markup
Output as characters all instances of the markup-significant entities
&quot;, &amp;, &lt; and &gt;. No other transformations applied unless
options -a or -t are also explicitly set.
-n, --encode-markup
Output as entities all instances of the markup-significant characters
("), (&), (<) and (>). No other transformations applied unless options
-a or -t are also explicitly set.
-r, --recursive
Recurse into subdirectories.
-v, --verbose
Warn of unknown entities encountered while parsing input files.
-x, --source-fileext <fileext>
Process only source files with a file extension matching <fileext>.
<filenames>
List of source files or directories to be processed.
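The effect that options -m and -n have on the four markup-significant characters can be sketched in Java as follows. This is a naive illustration, not the program's own code: the class and method names are hypothetical, and how htmlCharset itself treats ampersands that already belong to an entity is not specified here.

```java
class MarkupEntitiesSketch {
    // Rough analogue of -n: write the four markup-significant characters as entities.
    // The ampersand is replaced first so the entities produced below are not re-encoded.
    static String encodeMarkup(String text) {
        return text.replace("&", "&amp;")
                   .replace("\"", "&quot;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Rough analogue of -m: write the four markup-significant entities as characters.
    // &amp; is handled last so that "&amp;lt;" decodes one level only, to "&lt;".
    static String decodeMarkup(String text) {
        return text.replace("&quot;", "\"")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp; &quot;c&quot;
        System.out.println(encodeMarkup("a < b & \"c\""));
        // prints: a < b & "c"
        System.out.println(decodeMarkup("a &lt; b &amp; &quot;c&quot;"));
    }
}
```

The order of the replacements matters in both directions, which is why the ampersand is treated first when encoding and last when decoding.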
Download links for all distributions of htmlCharset are available from the SourceForge project download page.
You can read the project javadoc here.
This program is free software, distributed under the GNU GPL (version 3).
htmlCharset uses elements from the following free projects: