Hello!

AnMin Li philosophy.dude at gmail.com
Sun May 24 08:34:31 JST 2009


Hi Erik

To your questions:

I) The CDP that you're talking about is what is known as
"Ideographic Description Characters" ("CDP" itself stands for
Chinese Document Processing Lab). CDP is like treating the "a" part of
the letter "æ" as a separate character. The folks in charge of Unicode
apparently never really decided whether they want to add CDP to
Unicode, since it would be like having two "a"s in Unicode: one for the
first letter of the alphabet in many European languages, and one
specifically for describing the left part of the letter "æ".

On a separate note, there ARE many non-Unicode hanzi characters. For
example, the texting generation has what is called "Martian"
(the Chinese equivalent of 1337). And there are some old, obscure
characters which only scholars like Dr. Tomo will be familiar with. (A
back-of-the-envelope calculation will show you that for the past 300
years or so, an average of 1 Chinese character has come into
circulation every 2 weeks, so non-Unicode Chinese is hardly a surprise.)

II) As for suggestions: since your main goal is teaching software
for non-native speakers, I suggest you first grab the Unihan
database at

http://www.unicode.org/reports/tr38/tr38-5.html#N100FB

Use data from Unihan to limit the scope of which characters to
include in your software. (Being too difficult is always a bad idea in
second language instruction. If you include EVERY character found in
CHISE in your software, the end result will be so user-unfriendly that
even native speakers won't understand what's going on. CHISE includes
some very, very obscure characters.)

In Unihan:

The kGradeLevel field tells you which characters students in Hong
Kong are expected to know by which grade.

The kBigFive field will tell you whether the character is commonly
used in Taiwan (if a character has an empty value in the kBigFive
field, it means that beginners should probably not bother with it).

The kFrequency field will tell you roughly how frequently the
character is used in Traditional Chinese Usenet postings.
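
As a rough sketch of how the three fields above could be combined into
a filter: the snippet below parses lines in the tab-separated
"U+XXXX <TAB> field <TAB> value" layout used by the Unihan data files
and keeps characters that have a Big5 mapping plus a low grade level or
a low frequency rank. The sample values, the function names and the
thresholds are all assumptions made for illustration, not real Unihan
data or a recommended cutoff.

```python
# Illustrative sample only: the layout follows the Unihan data files,
# but the field values below are placeholders, not real Unihan data.
SAMPLE = """\
# comment lines start with '#'
U+597D\tkGradeLevel\t1
U+597D\tkBigFive\tA66E
U+597D\tkFrequency\t1
U+5B50\tkGradeLevel\t1
U+5B50\tkBigFive\tA46C
U+9F98\tkFrequency\t5
"""


def parse_unihan(text):
    """Return {character: {field: value}} from Unihan-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        char = chr(int(codepoint[2:], 16))  # "U+597D" -> the character itself
        table.setdefault(char, {})[field] = value
    return table


def beginner_characters(table, max_grade=2, max_frequency=2):
    """Keep characters with a Big5 mapping and a low grade or frequency rank.

    The thresholds are hypothetical; tune them for your learners.
    """
    keep = []
    for char, fields in table.items():
        if "kBigFive" not in fields:
            continue  # no Big5 mapping: probably not for beginners
        grade = int(fields.get("kGradeLevel", 99))
        freq = int(fields.get("kFrequency", 99))  # 1 = most frequent
        if grade <= max_grade or freq <= max_frequency:
            keep.append(char)
    return sorted(keep)


table = parse_unihan(SAMPLE)
selected = beginner_characters(table)
print(selected)
```

With the real files you would read the text from disk instead of using
SAMPLE; the filtering logic stays the same.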

If a character has a simplified version, by default it means that the
character is considered "frequently used" in China. The set of
frequently used characters in China is roughly equal to the set of
Simplified characters + (the set of Big5 characters - the set of
Big5 characters with simplified variants).
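
The rule of thumb above is plain set arithmetic, so it can be sketched
in a few lines. The function name and the tiny example sets are
assumptions for illustration. In practice the third set could probably
be built from Unihan's kSimplifiedVariant field, but check that against
the data before relying on it.

```python
def mainland_common(simplified, big5, big5_with_simplified_variant):
    """Approximate the set of characters frequently used in China.

    simplified + (big5 - big5 characters that have a simplified variant)
    """
    return simplified | (big5 - big5_with_simplified_variant)


# Tiny hand-picked example: 國 and 學 have the simplified forms 国 and 学,
# while 好 and 子 are written the same way in both systems.
simplified = {"国", "学"}
big5 = {"國", "學", "好", "子"}
big5_with_simplified_variant = {"國", "學"}

common = mainland_common(simplified, big5, big5_with_simplified_variant)
print(sorted(common))
```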

Also, not found in any Unihan field is the list of joyo kanji, the
kanji that students in Japan are expected to learn in school. You can
find those here:

http://en.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji

III) CHISE is extremely comprehensive: it deals with the detailed
components and not just the usual semantic roots / radicals,
so expect a lot of hand coding to perfect your software. Personally
I'm not even too sure if CHISE is the right thing for your task.
(For example, there is a horizontal stroke in "好" and CHISE
recognizes this; however, for the purpose of language instruction, it
seems more prudent to treat 好 as being two parts, namely "女" and
"子".)
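
One way to get that coarser view is to stop the decomposition as soon
as you reach a component you want to teach as a unit. The sketch below
assumes a hypothetical decomposition table and a hand-picked set of
teaching units; the table is a made-up stand-in for illustration, not
actual CHISE output.

```python
# Hypothetical fine-grained decomposition table (not real CHISE data).
DECOMP = {
    "好": ["女", "子"],
    "子": ["了", "一"],
}

# Components that learners should see as whole units (hand-picked).
TEACHING_UNITS = {"女", "子"}


def teaching_parts(char, decomp, units):
    """Decompose char, but stop at any component listed in units."""
    if char in units or char not in decomp:
        return [char]  # teachable unit, or nothing further to split
    parts = []
    for component in decomp[char]:
        parts.extend(teaching_parts(component, decomp, units))
    return parts


coarse = teaching_parts("好", DECOMP, TEACHING_UNITS)
fine = teaching_parts("好", DECOMP, set())
print(coarse, fine)
```

The hand coding mentioned above would mostly go into maintaining the
set of teaching units.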

cheers

Lee






More information about the CHISE-en mailing list