I tried adding wcwidth()

Real Vim Hacks Project - Humanity
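Following mbstrlen(), I implemented wcwidth() as well.

id:mattn said he wanted wcwidth(), so I looked through mbyte.c and found that, just as with mbstrlen(), the functions I needed already existed, so adding it was easy.
I ended up touching a lot of mbyte.c this time, so here is a summary of what I learned.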

Excerpt from mb_init() in mbyte.c, starting at line 624

    /*
     * Set the function pointers.
     */
    if (enc_utf8)
    {
        mb_ptr2len = utfc_ptr2len;
        mb_ptr2len_len = utfc_ptr2len_len;
        mb_char2len = utf_char2len;
        mb_char2bytes = utf_char2bytes;
        mb_ptr2cells = utf_ptr2cells;
        mb_ptr2cells_len = utf_ptr2cells_len;
        mb_char2cells = utf_char2cells;
        mb_off2cells = utf_off2cells;
        mb_ptr2char = utf_ptr2char;
        mb_head_off = utf_head_off;
    }
    else if (enc_dbcs != 0)
    {
        mb_ptr2len = dbcs_ptr2len;
        mb_ptr2len_len = dbcs_ptr2len_len;
        mb_char2len = dbcs_char2len;
        mb_char2bytes = dbcs_char2bytes;
        mb_ptr2cells = dbcs_ptr2cells;
        mb_ptr2cells_len = dbcs_ptr2cells_len;
        mb_char2cells = dbcs_char2cells;
        mb_off2cells = dbcs_off2cells;
        mb_ptr2char = dbcs_ptr2char;
        mb_head_off = dbcs_head_off;
    }
    else
    {
        mb_ptr2len = latin_ptr2len;
        mb_ptr2len_len = latin_ptr2len_len;
        mb_char2len = latin_char2len;
        mb_char2bytes = latin_char2bytes;
        mb_ptr2cells = latin_ptr2cells;
        mb_ptr2cells_len = latin_ptr2cells_len;
        mb_char2cells = latin_char2cells;
        mb_off2cells = latin_off2cells;
        mb_ptr2char = latin_ptr2char;
        mb_head_off = latin_head_off;
    }

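Here, the encoding-specific functions are assigned to global function pointers named mb_*.
enc_utf8 and enc_dbcs each describe the current value of &encoding*1.
Here are just the enc_utf8 and enc_dbcs parts of the comment at the top of mbyte.c.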

* "enc_dbcs" When non-zero it tells the type of double byte character
* encoding (Chinese, Korean, Japanese, etc.).
* The cell width on the display is equal to the number of
* bytes. (exception: DBCS_JPNU with first byte 0x8e)
* Recognizing the first or second byte is difficult, it
* requires checking a byte sequence from the start.
* "enc_utf8" When TRUE use Unicode characters in UTF-8 encoding.
* The cell width on the display needs to be determined from
* the character value.
* Recognizing bytes is easy: 0xxx.xxxx is a single-byte
* char, 10xx.xxxx is a trailing byte, 11xx.xxxx is a leading
* byte of a multi-byte character.
* To make things complicated, up to six composing characters
* are allowed. These are drawn on top of the first char.
* For most editing the sequence of bytes with composing
* characters included is considered to be one character.

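In short, enc_utf8 is a flag saying that the current &encoding is utf-8,
and a non-zero enc_dbcs says that the current &encoding is a double-byte encoding (Chinese, Korean, Japanese, etc.) and which type it is.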
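
To make the function-pointer switch concrete, here is a minimal sketch of the same pattern outside Vim. Every demo_* name is made up for illustration, the DBCS branch is left out, and the UTF-8 length function is a simplified stand-in that ignores composing characters, so it is not what utfc_ptr2len actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Latin-style implementation: every non-NUL byte is one character. */
static int demo_latin_ptr2len(const unsigned char *p)
{
    assert(p != NULL);
    return *p == '\0' ? 0 : 1;
}

/* Simplified UTF-8 implementation: byte length of the character at p.
 * Returns 0 at NUL and 1 for an illegal or truncated sequence. */
static int demo_utf_ptr2len(const unsigned char *p)
{
    int len, i;

    assert(p != NULL);
    if (*p == '\0')
        return 0;
    if (*p < 0x80)
        return 1;
    else if ((*p & 0xE0) == 0xC0)
        len = 2;
    else if ((*p & 0xF0) == 0xE0)
        len = 3;
    else if ((*p & 0xF8) == 0xF0)
        len = 4;
    else
        return 1;                       /* stray trailing byte or invalid lead */
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)      /* also stops at a NUL */
            return 1;
    return len;
}

/* The global function pointer; callers never name an encoding directly. */
static int (*demo_mb_ptr2len)(const unsigned char *p) = demo_latin_ptr2len;

/* Plays the role of mb_init(): pick the implementation once. */
static void demo_mb_init(int use_utf8)
{
    demo_mb_ptr2len = use_utf8 ? demo_utf_ptr2len : demo_latin_ptr2len;
}

/* Encoding-agnostic caller: counts characters through the pointer. */
static int demo_char_count(const unsigned char *s)
{
    int n = 0;

    assert(s != NULL);
    while (*s != '\0')
    {
        s += demo_mb_ptr2len(s);
        n++;
    }
    return n;
}
```

After demo_mb_init(1) the bytes 'a' 0xE3 0x81 0x82 count as two characters; after demo_mb_init(0) the same bytes count as four. The caller does not change, only the pointer does.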

(By the way, the part that says

* Recognizing bytes is easy: 0xxx.xxxx is a single-byte
* char, 10xx.xxxx is a trailing byte, 11xx.xxxx is a leading
* byte of a multi-byte character.

should make sense if you look at the UTF-8 article on Wikipedia.)
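
The three bit patterns in that comment can be checked with two masks. The following is a small sketch, not code from Vim; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum demo_byte_kind { DEMO_SINGLE, DEMO_TRAIL, DEMO_LEAD };

/* Classify one byte by its top bits, as described in the mbyte.c comment. */
static enum demo_byte_kind demo_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00)     /* 0xxx.xxxx */
        return DEMO_SINGLE;
    if ((b & 0xC0) == 0x80)     /* 10xx.xxxx */
        return DEMO_TRAIL;
    return DEMO_LEAD;           /* 11xx.xxxx */
}

/* In well-formed UTF-8, every byte that is not a trailing byte starts a
 * character, so counting those bytes counts characters. */
static size_t demo_count_chars(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (demo_classify(*s) != DEMO_TRAIL)
            n++;
    return n;
}
```

For example, in the byte sequence 'a' 0xE3 0x81 0x82 the first byte is a single-byte character, 0xE3 is a leading byte, and 0x81 and 0x82 are trailing bytes, so demo_count_chars() returns 2.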

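These two variables are assigned only inside mb_init().
mb_init() itself is called from only two places.
Both are in option.c: one is the function set_init_1(),
and the other is the function did_set_string_option().
Judging by its name, the latter is probably called when a string-valued option is changed,
so mb_init() is evidently called from it when &encoding*2 changes.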

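To show what each function does, I'll excerpt its comment and translate it.

(TODO: Sorry, I'll do the translation when I have time orz. For now I'm just posting the comments.)

Incidentally, Vim's source code itself is, well, what it is, but the comments are very well written, so you can learn a great deal just by reading them.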

The functions

mb_ptr2len()

/*
* mb_ptr2len() function pointer.
* Get byte length of character at "*p" but stop at a NUL.
* For UTF-8 this includes following composing characters.
* Returns 0 when *p is NUL.
*/

mb_ptr2len_len()

/*
* mb_ptr2len_len() function pointer.
* Like mb_ptr2len(), but limit to read "size" bytes.
* Returns 0 for an empty string.
* Returns 1 for an illegal char or an incomplete byte sequence.
*/

mb_char2len()

/*
* mb_char2len() function pointer.
* Return length in bytes of character "c".
* Returns 1 for a single-byte character.
*/

mb_char2bytes()

/*
* mb_char2bytes() function pointer.
* Convert a character to its bytes.
* Returns the length in bytes.
*/

mb_ptr2cells()

/*
* mb_ptr2cells() function pointer.
* Return the number of display cells character at "*p" occupies.
* This doesn't take care of unprintable characters, use ptr2cells() for that.
*/

mb_ptr2cells_len()

/*
* mb_ptr2cells_len() function pointer.
* Like mb_ptr2cells(), but limit string length to "size".
* For an empty string or truncated character returns 1.
*/

mb_char2cells()

/*
* mb_char2cells() function pointer.
* Return the number of display cells character "c" occupies.
* Only takes care of multi-byte chars, not "^C" and such.
*/

mb_off2cells()

/*
* mb_off2cells() function pointer.
* Return number of display cells for char at ScreenLines[off].
* We make sure that the offset used is less than "max_off".
*/

mb_ptr2char()

/*
* mb_ptr2char() function pointer.
* Convert a byte sequence into a character.
*/

mb_head_off()

/*
* mb_head_off() function pointer.
* Return offset from "p" to the first byte of the character it points into.
* If "p" points to the NUL at the end of the string return 0.
* Returns 0 when already at the first byte of a character.
*/
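
As an illustration of the last comment, here is a rough sketch of how a head offset can be found for UTF-8 by walking backwards over trailing bytes. It assumes a (base, p) pair of arguments, ignores composing characters, and uses a hypothetical name, so treat it as a simplification rather than a copy of utf_head_off().

```c
#include <assert.h>
#include <stddef.h>

/* Return how many bytes p is past the first byte of the character it
 * points into. "base" is the start of the string, so we never walk
 * before it. Returns 0 at the terminating NUL and at a first byte. */
static int demo_utf_head_off(const unsigned char *base, const unsigned char *p)
{
    const unsigned char *q = p;

    assert(base != NULL && p != NULL && p >= base);
    if (*p == '\0')
        return 0;
    while (q > base && (*q & 0xC0) == 0x80)     /* 10xx.xxxx: trailing byte */
        q--;
    return (int)(p - q);
}
```

With the bytes 'a' 0xE3 0x81 0x82, a pointer at the last byte gives 2, a pointer at 0xE3 gives 0, and a pointer at the NUL gives 0.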

What I did this time

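When I implemented mbstrlen() last time, the MB_CHARLEN() macro was a big help,
so following its example I made a macro called MB_CHARWIDTH().
If has_mbyte (see mbyte.c) is true, it calls mb_charwidth(), a function I wrote this time.
If has_mbyte is false, for now it just uses STRLEN() as the width of the string*3.
As for mb_charwidth(), the processing in the function im_commit_cb() looked like it would do the job, so I lifted it and wrapped it up.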
http://github.com/tyru/vim/commit/d5949a91eee23771184a44f9f43db073be3a6426
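
For readers who don't want to open the commit, the shape of the idea can be sketched as below. This is not the code from the commit: all demo_* names are hypothetical, the sketch assumes UTF-8 only, and the "wide" test is a deliberately crude rule (U+3000 to U+9FFF counts as two cells) standing in for Vim's real mb_ptr2cells.

```c
#include <assert.h>
#include <string.h>

static int demo_has_mbyte = 1;      /* stand-in for Vim's has_mbyte */

/* Byte length of the UTF-8 character at p (1 for illegal/truncated). */
static int demo_ptr2len(const unsigned char *p)
{
    int len, i;

    if (*p < 0x80)
        return 1;
    else if ((*p & 0xE0) == 0xC0)
        len = 2;
    else if ((*p & 0xF0) == 0xE0)
        len = 3;
    else if ((*p & 0xF8) == 0xF0)
        len = 4;
    else
        return 1;
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 1;
    return len;
}

/* Crude cell width: 2 for code points U+3000..U+9FFF, otherwise 1.
 * ASSUMPTION: toy rule for illustration only. */
static int demo_ptr2cells(const unsigned char *p)
{
    if (demo_ptr2len(p) == 3)
    {
        unsigned cp = ((p[0] & 0x0Fu) << 12)
                    | ((p[1] & 0x3Fu) << 6)
                    | (p[2] & 0x3Fu);

        if (cp >= 0x3000 && cp <= 0x9FFF)
            return 2;
    }
    return 1;
}

/* Walk the string one character at a time and sum the cell widths. */
static int demo_mb_charwidth(const unsigned char *s)
{
    int cells = 0;

    assert(s != NULL);
    while (*s != '\0')
    {
        cells += demo_ptr2cells(s);
        s += demo_ptr2len(s);
    }
    return cells;
}

/* Multi-byte aware width when enabled, plain byte count otherwise. */
#define DEMO_MB_CHARWIDTH(s) \
    (demo_has_mbyte ? demo_mb_charwidth(s) : (int)strlen((const char *)(s)))
```

With demo_has_mbyte set, the bytes 'a' 0xE3 0x81 0x82 (an ASCII letter followed by U+3042) come out as 3 cells; with it cleared, the macro falls back to the byte count, 4.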

After that, all that's left is to call that function from wcwidth().
http://github.com/tyru/vim/commit/13f4290a75e8c44a242eea85460177ddd89f71d1

XXX

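One thing did bother me a little, though. As the mb_ptr2cells() comment above says,
the *_ptr2cells() functions don't take care of unprintable characters (= they aren't expected as input?),
so I added an XXX comment just in case.
By the way, ptr2cells() is in charset.c.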
http://github.com/tyru/vim/commit/0ced7a88cf2c74bc02d38c886001bb0221e39f92

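Just to check, I tried things like

:echo wcwidth("\<Tab>")

and the result was 1.
I don't know whether it is always 1 for input the functions weren't designed for.
It might be better to post this to the ML and ask whether it's a problem.
The same goes for mbstrlen(): it's probably best to just keep sending patches and ask about anything I'm unsure of along the way.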

*1: "EXTERN char_u *p_enc" in option.h

*2: p_enc

*3: I wonder if this is okay.