[CHISE-en:102] Re: CHISE IDS in IDSgrep, syntactically invalid IDSes

守岡知彦 / MORIOKA Tomohiko tomo at chise.org
Sat Jul 21 19:34:03 JST 2012


Hi Matthew,

>>>>> In <alpine.LNX.2.02.1207182120150.12743 at tetsu.ansuz.sooke.bc.ca> 
>>>>>	mskala at ansuz.sooke.bc.ca wrote:

> I've added preliminary support for a dictionary derived from CHISE IDS in
> my IDSgrep utility.  IDSgrep allows advanced queries of the spatial
> structure of characters, such as "Show all the characters that contain
> the grass radical at the top, above something split into left and right
> halves, but do not include the 大 element."
> 
> More information on IDSgrep is available from the Tsukurimashou project at
> this URL:
>    http://sourceforge.jp/projects/tsukurimashou/
> 
> This screenshot gives some idea of what it can do:
>    http://sourceforge.jp/projects/tsukurimashou/images?id=2999

Thanks for introducing your interesting tool!

I would like to try it, so I downloaded and installed it, but the
data file(s) seem to be missing.  Does it require the ``Tsukurimashou''
package?


> The CHISE IDS support is currently only available by checking it out
> through the version control system (SVN, or the mirror on GitHub) but
> it will also be included in the next packaged release (version 0.3).

> While working on this I found a significant number of errors in the CHISE
> IDS 0.25 data.  Nearly 12000, or 6%, of the entries (including
> all the entries in IDS-UCS-Compat-Supplement.txt, and nearly all the
> entries in IDS-UCS-Compat.txt) contain syntactically invalid IDSes.

CHISE IDS 0.25 is too old.  It is better to use the latest version in
the Git repository:

    http://git.chise.org/gitweb/?p=chise/ids.git

    % git clone http://git.chise.org/git/chise/ids.git


> Here's an example of what I mean, from the file IDS-JIS-X0208-1990.txt:
> 
> J90-734F   麩      ⿺麥

It has been fixed in the latest version.  However, some incomplete
lines remain.  Some lines are intentionally left incomplete to mark
them as needing review, but that may not be a good approach.  Perhaps
we should introduce a comment notation.


> The binary operator ⿺ should have two arguments, but is only given one.
> There are also many cases of too many arguments for an operator, like this
> example from IDS-HZK01.txt:
> 
> HZK01-B5B6  &I-HZK01-B5B6;  ⿷匚山王
> 
> And there are entries where the IDS consists of just a string of kanji
> with no operators at all, as in this example from IDS-HZK12.txt:
> 
> HZK12-EE65  &I-HZK12-EE65;    毛晶冖且

The IDS-HZK*.txt files are not maintained.  Most of the characters in
them are included in IDS-UCS-*.txt, so there is no need to use them.
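
For illustration, the arity rule behind the quoted examples (an
operator such as ⿺ must be followed by exactly two components) can be
checked mechanically.  The following is only a minimal Python sketch
under stated assumptions; it is neither IDSgrep's parser nor a CHISE
tool.  The names IDC_ARITY, tokenize and check_ids are hypothetical,
the table covers only the twelve operators U+2FF0..U+2FFB, and entity
references of the form &...; are assumed to count as one component.

```python
# Arity of each Ideographic Description Character in U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three components; the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def tokenize(ids):
    """Split an IDS into tokens, keeping references like &I-HZK01-B5B6; whole."""
    tokens = []
    i = 0
    while i < len(ids):
        if ids[i] == "&":
            end = ids.find(";", i)
            if end == -1:           # unterminated reference: take the rest
                end = len(ids) - 1
            tokens.append(ids[i:end + 1])
            i = end + 1
        else:
            tokens.append(ids[i])
            i += 1
    return tokens


def check_ids(ids):
    """Return 'ok', 'too few arguments', or 'too many arguments'."""
    needed = 1  # number of components still expected (prefix notation)
    for tok in tokenize(ids):
        if needed == 0:
            # The sequence was already complete, yet more tokens follow.
            return "too many arguments"
        # An operator consumes one slot and opens `arity` new ones;
        # a plain component just consumes one slot.
        needed += IDC_ARITY.get(tok, 0) - 1
    return "ok" if needed == 0 else "too few arguments"


# The three IDSes quoted above:
print(check_ids("⿺麥"))        # too few arguments
print(check_ids("⿷匚山王"))    # too many arguments
print(check_ids("毛晶冖且"))    # too many arguments
```

A string of components with no operator at all is reported here as
"too many arguments", because a single component already forms a
complete sequence.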


> Is there somewhere I should send reports of this sort of thing?

Thanks for your report.  This mailing list is the right place.

-- 
tomo.

