[CHISE-en:101] CHISE IDS in IDSgrep, syntactically invalid IDSes

mskala at ansuz.sooke.bc.ca
Thu Jul 19 11:43:34 JST 2012


I've added preliminary support for a dictionary derived from CHISE IDS in
my IDSgrep utility.  IDSgrep allows advanced queries of the spatial
structure of characters, such as "Show all the characters that contain
the grass radical at the top, above something split into left and right
halves, but do not include the 大 element."

More information on IDSgrep is available from the Tsukurimashou project at
this URL:
   http://sourceforge.jp/projects/tsukurimashou/

This screenshot gives some idea of what it can do:
   http://sourceforge.jp/projects/tsukurimashou/images?id=2999

The CHISE IDS support is currently available only from version control
(SVN, or the mirror on GitHub), but it will also be included in the next
packaged release (version 0.3).

While working on this I found a significant number of errors in the CHISE
IDS 0.25 data.  Nearly 12000 entries, or 6% of the total (including all
the entries in IDS-UCS-Compat-Supplement.txt, and nearly all the entries
in IDS-UCS-Compat.txt), contain syntactically invalid IDSes.

Here's an example of what I mean, from the file IDS-JIS-X0208-1990.txt:

J90-734F   &GT-65035;      ⿺麥

The binary operator ⿺ takes two arguments, but here it is given only one.
There are also many cases where an operator has too many arguments, like
this example from IDS-HZK01.txt:

HZK01-B5B6  &I-HZK01-B5B6;  ⿷匚山王

And there are entries where the IDS consists of just a string of kanji
with no operators at all, as in this example from IDS-HZK12.txt:

HZK12-EE65  &HZK12-EE65;    毛晶冖且
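
To make "syntactically invalid" concrete, here is a rough sketch in Python
of the kind of arity check involved.  It is not IDSgrep's actual parser.
It assumes that the operators are the twelve characters U+2FF0 through
U+2FFB (with ⿲ and ⿳ ternary and the rest binary), that an entity reference
such as &GT-65035; counts as a single component, that a lone character is
a valid IDS, and that the IDS is the third whitespace-separated field on
each line.  The names is_well_formed and check_line are made up for this
illustration.

```python
import re

# Assumption: the operator set is U+2FF0..U+2FFB only.
TERNARY = {chr(0x2FF2), chr(0x2FF3)}                      # the two ternary operators
BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY

# Assumption: an entity reference like &GT-65035; is one component;
# anything else is split into single code points.
TOKEN = re.compile(r"&[^;&\s]+;|.", re.DOTALL)


def is_well_formed(ids):
    """Return True if ids is exactly one complete prefix-notation expression."""
    tokens = TOKEN.findall(ids)
    if not tokens:
        return False
    needed = 1                      # sub-expressions still required
    for tok in tokens:
        if needed == 0:
            return False            # leftover tokens: too many arguments
        if tok in BINARY:
            needed += 1             # consumes one slot, opens two
        elif tok in TERNARY:
            needed += 2             # consumes one slot, opens three
        else:
            needed -= 1             # a plain component fills one slot
    return needed == 0              # nonzero here means too few arguments


def check_line(line):
    """Check the third whitespace-separated field of a data line (assumed layout)."""
    fields = line.split(None, 2)
    return len(fields) == 3 and is_well_formed(fields[2].strip())
```

Under those assumptions, is_well_formed returns False for all three IDSes
quoted above (⿺麥, ⿷匚山王, and 毛晶冖且), and True for something like
⿰木木.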

Is there somewhere I should send reports of this sort of thing?

-- 
Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/

