Hello!

Erik Weitnauer eweitnauer at googlemail.com
Sun May 24 21:19:28 JST 2009


Hi Lee,
thanks for the explanations and suggestions!
At the moment I feel that understanding some aspects of this whole
character encoding / printing / information collection topic just
leads to even more questions and confusion ;)

So here is what I am wondering about after reading your reply and
doing some research:

> I) The CDP that you're talking about is what is known as
> "Ideographic Description Characters" ... ("CDP" itself stands for
> Chinese Document Processing Lab). CDP is like treating the "a" part
> in the letter "æ" as a different character. The folks in charge of
> Unicode apparently never really figured out whether they want to add
> CDP to Unicode, since it'd be like having two "a"s in Unicode, one
> for the first letter of many European languages, and one
> specifically for describing the left part of the letter "æ".

I see, that sounds very reasonable. Do you know of a way to display
these CDP characters, preferably under Linux? And is there any way to
translate these CDP codes to Unicode (of course only in the cases
where there is a Unicode entry for them)?
I was on the CDP website [http://cdp.sinica.edu.tw/service/], and they
even offer some files to download. However, it is some Windows- and
Visual-Basic-specific software, and I am not sure if I should follow
this path any further. (Also, Chinese websites are unfortunately still
kind of scary for me...)
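One idea I am toying with: the CHISE IDS files seem to reference
non-Unicode components through entity references such as
"&CDP-8BBF;", and if there is an IDS file for the CDP characters
themselves (I believe I saw an IDS-CDP.txt in the ids package, though
I have not verified this), one could try to expand such references
recursively until only Unicode components remain. A rough Python
sketch, where both the file name and the entity format are my
assumptions:

  import re

  def load_ids(path):
      """Parse a CHISE IDS file with code<TAB>char<TAB>IDS lines."""
      table = {}
      with open(path, encoding="utf-8") as f:
          for line in f:
              if line.startswith(";") or not line.strip():
                  continue  # skip comments and blank lines
              fields = line.rstrip("\n").split("\t")
              if len(fields) >= 3:
                  table[fields[1]] = fields[2]
      return table

  # Assumed file name; should map "&CDP-xxxx;" to its own IDS string.
  cdp_ids = load_ids("IDS-CDP.txt")

  def expand_cdp(ids_string):
      """Replace CDP entity references by their decomposition, if known."""
      def repl(match):
          sub = cdp_ids.get(match.group(0))
          return expand_cdp(sub) if sub else match.group(0)
      return re.sub(r"&CDP-[0-9A-F]{4};", repl, ids_string)

Does something along these lines sound sensible to you?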

> On a separate note, there ARE many non-Unicode hanzi characters; for
> example, the texting generation has what is called "Martians"
> (the Chinese equivalent of 1337). And there are some old, obscure
> characters which only scholars like Dr. Tomo will be familiar with.
> (A back-of-envelope calculation will show you that for the past 300
> years or so an average of one Chinese character has come into
> circulation every two weeks, so non-Unicode Chinese is hardly a
> surprise.)

I can imagine that!
But as you write below, I only need to include the most frequently
used characters (a few thousand), which are all available in Unicode.
So this might not be an issue for my project.

> Use data from Unihan to limit the scope of which characters to
> include in your software. (Being too difficult is always a bad idea
> in second language instruction; if you include EVERY character found
> in Chise in your software, the end result will be so user-unfriendly
> that even native speakers won't understand what's going on. Chise
> includes some very, very obscure characters.)

I totally agree. I already had a look at the Unihan database before
and was surprised to find so much information about each character.
As you suggested, I will use the kGradeLevel / kBigFive / kFrequency
fields to choose a gentle set of characters for a language learner.
Also the Joyo kanji list and maybe the list of characters you are
supposed to know when taking the HSK should be valuable sources for
character selection.
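To make this concrete for myself, here is roughly how I imagine the
selection step. kGradeLevel and kFrequency are actual Unihan fields,
and the single-file Unihan.txt layout (U+xxxx<TAB>field<TAB>value,
with "#" comment lines) is what I found when browsing it; the
thresholds below are only first guesses:

  from collections import defaultdict

  def load_unihan(path):
      """Collect all Unihan fields per code point."""
      data = defaultdict(dict)
      with open(path, encoding="utf-8") as f:
          for line in f:
              if line.startswith("#") or not line.strip():
                  continue  # skip comments and blank lines
              code, field, value = line.rstrip("\n").split("\t", 2)
              data[code][field] = value
      return data

  unihan = load_unihan("Unihan.txt")

  # Keep characters taught in the first school grades (kGradeLevel)
  # or in the highest frequency bands (kFrequency, 1 = most common).
  learner_set = set()
  for code, fields in unihan.items():
      grade = fields.get("kGradeLevel")
      freq = fields.get("kFrequency")
      if (grade and int(grade) <= 3) or (freq and int(freq) <= 2):
          learner_set.add(chr(int(code[2:], 16)))

  print(len(learner_set), "characters selected")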

> III) Chise is extremely comprehensive; it deals with the detailed
> components and not just the usual semantic roots / radicals.
> So expect a lot of hand coding to perfect your software; personally
> I'm not even too sure if Chise is the right thing for your task.
> (For example, there is a two-horizontal-stroke component in "好" and
> Chise recognizes this; however, for the purpose of language
> instruction, it seems more prudent to treat 好 as being two parts,
> namely "女" and "子".)

Chise provides a huge amount of information about characters, indeed.
So this means I will have to choose the information that I want to
display for each character quite carefully, so that it does not
confuse a language learner instead of helping them. However, I
believe that providing information about the decomposition of
characters is very helpful for learning. Often a decomposition into
more than just radical and remainder would be most helpful.
I did not find any database that can even give me the decomposition
into radical and remaining part (in Unihan you only get the radical).
In the Chise database I even get a decomposition into all parts, for
which I can look up pronunciation, meaning, etc.
So at the moment I don't see any alternative to using Chise...

The example with 好 you gave above is not yet clear to me. When I look
好 up in the IDS-UCS-Basic.txt file, there is the line "U+597D	好	⿰女子",
which is what I am looking for. How would I extract the two horizontal
strokes in 好, like you described above?
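In case it helps to show what I am after, this is how I currently
plan to use these entries: parse the file and recurse through the
components, skipping the layout operators (the Ideographic
Description Characters U+2FF0..U+2FFB). Just a sketch; entity
references like "&CDP-xxxx;" inside an IDS string would still need
extra handling, and I am only guessing that comment lines start
with ";":

  def load_ids(path):
      """Same parser as in the sketch further up in this mail."""
      table = {}
      with open(path, encoding="utf-8") as f:
          for line in f:
              if line.startswith(";") or not line.strip():
                  continue
              fields = line.rstrip("\n").split("\t")
              if len(fields) >= 3:
                  table[fields[1]] = fields[2]
      return table

  ids = load_ids("IDS-UCS-Basic.txt")

  def components(char):
      """Recursively collect the component characters of char."""
      decomp = ids.get(char)
      if not decomp or decomp == char:
          return [char]  # atomic: no (further) decomposition known
      result = []
      for c in decomp:
          if 0x2FF0 <= ord(c) <= 0x2FFB:
              continue  # skip the layout operators like ⿰ and ⿱
          result.extend([c] if c == char else components(c))
      return result

  print(components("好"))  # expected: ['女', '子']

But this would only ever give me 女 and 子 for 好, never the two
horizontal strokes you mentioned, which is why your example confuses
me.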

One final question:
Is there any place to get English documentation or sample code for
libchise and how to use it?

Kind regards,
Erik.
