Description
We want to know when a buffer has accumulated enough bytes to decode a character, which depends on the current encoding.
babel doesn't provide a convenient (efficient) API to test that, but I hoped to be able to use OCTETS-TO-STRING for it.
Unfortunately, incomplete code sequences are not handled consistently across the different encodings:
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-8)
"¶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-8)
"�"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-16)
"슶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-16)
> Debug: Failed assertion: (= babel-encodings::i babel-encodings::end)
> While executing: (:internal swank::invoke-default-debugger), in process new-repl-thread(1481).
> Type cmd-/ to continue, cmd-. to abort, cmd-\ for a list of available restarts.
> If continued: test the assertion again.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<simple-error #x302006CBABDD>.
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 2 :encoding :eucjp)
"こ"
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 1 :encoding :eucjp)
> Debug: Illegal :eucjp character starting at position 0.
> While executing: (:internal swank::invoke-default-debugger), in process repl-thread(3921).
> Type cmd-. to abort, cmd-\ for a list of available restarts.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<babel-encodings:end-of-input-in-character #x302006CA4EAD>.
cl-user>
I suggest adding a keyword parameter to specify what to do in such a case:
| :on-invalid-code substitution-character | would insert the given substitution-character in place of the code. |
| :on-invalid-code :ignore | would ignore the code and go on. |
| :on-invalid-code :error | would signal a babel-encodings:character-decoding-error condition. |
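The three behaviours in the table can be illustrated with a toy decoder. This is only a sketch: `decode-ascii` and `toy-decoding-error` are hypothetical names, not part of babel, and the "encoding" is plain 7-bit ASCII where any octet of 128 or more counts as an invalid code. The toy condition stands in for babel-encodings:character-decoding-error.

```lisp
;; Hypothetical condition, standing in for
;; babel-encodings:character-decoding-error in this sketch.
(define-condition toy-decoding-error (error)
  ((index :initarg :index :reader toy-decoding-error-index))
  (:report (lambda (condition stream)
             (format stream "Invalid code at position ~D."
                     (toy-decoding-error-index condition)))))

;; Hypothetical toy decoder: 7-bit ASCII only, octets >= 128 are invalid.
(defun decode-ascii (octets &key (on-invalid-code :error))
  (with-output-to-string (out)
    (loop for octet across octets
          for index from 0
          do (cond ((< octet 128)
                    (write-char (code-char octet) out))
                   ;; A character: insert it in place of the invalid code.
                   ((characterp on-invalid-code)
                    (write-char on-invalid-code out))
                   ;; :ignore: skip the invalid code and go on.
                   ((eq on-invalid-code :ignore))
                   ;; :error (or anything else): signal a decoding error.
                   (t (error 'toy-decoding-error :index index))))))
```

With the input #(72 105 200 33), passing #\? gives "Hi?!", :ignore gives "Hi!", and :error signals the condition at position 2.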
I would also propose providing an efficient function to query the length of the code sequence for the next character:
(babel:decode-character bytes &key start end encoding)
--> character ;
sequence-valid-p ;
length
- character: if a character can be decoded, it is returned as the primary value; otherwise NIL.
- sequence-valid-p: NIL if the code sequence is definitely invalid, else T. Notably, if the sequence is merely too short but could become a valid code sequence once completed, T should be returned.
- length: if a character is decoded and returned, the length of the decoded code sequence; otherwise, if sequence-valid-p is T, the minimal length of a valid code sequence starting with the given prefix; otherwise the minimal length of any valid code sequence.
| character | sequence-valid-p | length |
|-----------+------------------+----------------------------------------------------------------|
| ch | T | length of the decoded sequence |
| ch | NIL | --impossible-- |
| NIL | T | minimal length of a valid code sequence with the given prefix. |
| NIL | NIL | minimal length of a valid code sequence. |
For example, in the case NIL T len, if len <= (- end start), the given code sequence is complete and valid, but the decoded code is not the code of a character. E.g., #(#xED #xA0 #x80) is UTF-8 for 55296 (#xD800, a surrogate code point), but (code-char 55296) --> nil.
(babel:decode-character (coerce #(65 32 66) '(vector (unsigned-byte 8)))
:start 0 :end 3 :encoding :utf-8)
--> #\A
T
1
(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 3 :encoding :utf-8)
--> #\À
T
2
(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 1 :encoding :utf-8)
--> NIL
T
2
(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 1 :encoding :utf-8)
--> NIL
T
2
(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 2 :encoding :utf-8)
--> NIL
NIL
1
(babel:decode-character (coerce #(#xED #xA0 #x80) '(vector (unsigned-byte 8)))
:start 0 :end 3 :encoding :utf-8)
--> NIL
T
3
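To make the proposed three-value protocol concrete, here is a minimal sketch for UTF-8 only, written without babel. The names `utf-8-sequence-length` and `utf-8-decode-character` are hypothetical and not part of any existing API. Several choices are my assumptions rather than part of the proposal: surrogate codes are always reported as "valid sequence, no character" regardless of what code-char does in a given implementation, an empty range returns NIL T 1, and overlong three- and four-byte forms are not rejected.

```lisp
;; Hypothetical helper: sequence length announced by a UTF-8 lead byte,
;; or NIL if the byte cannot start a sequence.
(defun utf-8-sequence-length (lead)
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

;; Hypothetical sketch of the proposed function, UTF-8 only.
;; Returns three values: character, sequence-valid-p, length.
(defun utf-8-decode-character (bytes &key (start 0) (end (length bytes)))
  (let ((available (- end start)))
    ;; Assumption: an empty range is a valid prefix needing at least 1 byte.
    (when (zerop available)
      (return-from utf-8-decode-character (values nil t 1)))
    (let* ((lead (aref bytes start))
           (needed (utf-8-sequence-length lead)))
      ;; Invalid lead byte: definitely invalid, minimal valid length is 1.
      (unless needed
        (return-from utf-8-decode-character (values nil nil 1)))
      (let ((code (if (= needed 1)
                      lead
                      (ldb (byte (- 7 needed) 0) lead))))
        ;; Check the continuation bytes that are actually available.
        (loop for i from (1+ start) below (min end (+ start needed))
              for octet = (aref bytes i)
              do (unless (= (logand octet #xC0) #x80)
                   (return-from utf-8-decode-character (values nil nil 1)))
                 (setf code (logior (ash code 6) (logand octet #x3F))))
        (cond
          ;; Too short, but valid so far: report the length required.
          ((< available needed) (values nil t needed))
          ;; Complete sequence whose code is not a character.
          ((or (<= #xD800 code #xDFFF) (>= code char-code-limit))
           (values nil t needed))
          (t (values (code-char code) t needed)))))))
```

Under these assumptions the sketch returns the same three values as each of the six examples above. A reader loop could call it after each read and wait for more input whenever the result is NIL T len with len greater than (- end start).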