Decoding Invalid Code Sequence Consistency

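We would like to know when a buffer has accumulated enough bytes to decode a character, depending on the current encoding.
Babel doesn't provide a convenient (efficient) API to test that, but I hoped to be able to use OCTETS-TO-STRING for it.
Unfortunately, the handling of incomplete code sequences is not consistent across encodings. In the transcript below, a truncated sequence decodes to a replacement character in :utf-8 (with :errorp nil), fails an internal assertion in :utf-16 (also with :errorp nil), and signals babel-encodings:end-of-input-in-character in :eucjp (with the default :errorp).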


```
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-8)
"¶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-8)
"�"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-16)
"슶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-16)
> Debug: Failed assertion: (= babel-encodings::i babel-encodings::end)
> While executing: (:internal swank::invoke-default-debugger), in process new-repl-thread(1481).
> Type cmd-/ to continue, cmd-. to abort, cmd-\ for a list of available restarts.
> If continued: test the assertion again.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<simple-error #x302006CBABDD>.
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 2 :encoding :eucjp)
"こ"
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 1 :encoding :eucjp)
> Debug: Illegal :eucjp character starting at position 0.
> While executing: (:internal swank::invoke-default-debugger), in process repl-thread(3921).
> Type cmd-. to abort, cmd-\ for a list of available restarts.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<babel-encodings:end-of-input-in-character #x302006CA4EAD>.
cl-user>
```


I suggest adding a keyword parameter to specify what to do in such a case:
```
| :on-invalid-code substitution-character | would insert the given substitution-character in place of the code. |
| :on-invalid-code :ignore                | would ignore the code and go on.                                    |
| :on-invalid-code :error                 | would signal a babel-encodings:character-decoding-error condition.  |
```
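
As a rough illustration of the three policies in the table above, here is a minimal, self-contained sketch of how a decoder could dispatch on such a parameter. None of this is Babel code: `handle-invalid-code`, `invalid-code-sequence` and `demo` are hypothetical names, and the local condition only stands in for babel-encodings:character-decoding-error so that the example runs without Babel loaded.

```lisp
;; Stand-in for babel-encodings:character-decoding-error (assumption:
;; a real implementation would signal Babel's own condition instead).
(define-condition invalid-code-sequence (error)
  ((offset :initarg :offset :reader invalid-code-offset))
  (:report (lambda (condition stream)
             (format stream "Invalid code sequence at offset ~D."
                     (invalid-code-offset condition)))))

(defun handle-invalid-code (policy offset output)
  "Apply a hypothetical :ON-INVALID-CODE POLICY for the code at OFFSET."
  (cond ((characterp policy) (write-char policy output)) ; substitute
        ((eq policy :ignore) nil)                        ; skip the code
        ((eq policy :error)
         (error 'invalid-code-sequence :offset offset))
        (t (error "Unknown :on-invalid-code policy: ~S" policy))))

(defun demo (policy)
  "Pretend that an invalid code was found between \"ab\" and \"cd\"."
  (with-output-to-string (out)
    (write-string "ab" out)
    (handle-invalid-code policy 2 out)
    (write-string "cd" out)))

;; (demo #\?)     --> "ab?cd"
;; (demo :ignore) --> "abcd"
;; (demo :error)  signals INVALID-CODE-SEQUENCE
```

Passing the substitution character itself as the policy value mirrors the first row of the table and avoids a second keyword.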


I also propose providing an efficient function to query the length of the code sequence for the next character:
```
(babel:decode-character bytes &key start end encoding)
--> character ;
    sequence-valid-p ;
    length
```

- character: if a character can be decoded, it is returned as the primary value; otherwise NIL.

- sequence-valid-p: NIL if the code sequence is definitely invalid, else T. Notably, if the sequence is merely too short but could become a valid code sequence once completed, T should be returned.

- length: if a character is decoded and returned, the length of its code sequence; otherwise, if sequence-valid-p is T, the minimal length of a valid code sequence starting with the given prefix; otherwise the minimal length of any valid code sequence.

```
| character | sequence-valid-p | length                                                         |
|-----------+------------------+----------------------------------------------------------------|
| ch        | T                | length of the decoded sequence                                 |
| ch        | NIL              | --impossible--                                                 |
| NIL       | T                | minimal length of a valid code sequence with the given prefix. |
| NIL       | NIL              | minimal length of a valid code sequence.                       |
```
For example, in the NIL, T, len case, if len <= (- end start), the given code sequence is complete and follows the encoding, but the decoded code is not the code of a character. E.g. `#(#xED #xA0 #x80)` is the UTF-8 bit pattern for 55296 (#xD800, a surrogate code point), but `(code-char 55296) --> nil`.
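
To make the table concrete from the caller's side, here is a small sketch of how buffering code might interpret the three proposed return values. `classify-decode-result` and its keyword results are hypothetical names chosen for illustration; `available` is assumed to be `(- end start)`.

```lisp
(defun classify-decode-result (character sequence-valid-p length available)
  "Interpret the three proposed return values, given AVAILABLE input bytes."
  (cond (character :character)                 ; ch  T   len
        ((not sequence-valid-p) :invalid)      ; NIL NIL len
        ((<= length available)                 ; NIL T   len, all bytes present
         :non-character-code)
        (t :need-more-bytes)))                 ; NIL T   len, sequence truncated

;; (classify-decode-result #\A t 1 3)   --> :CHARACTER
;; (classify-decode-result nil t 2 1)   --> :NEED-MORE-BYTES
;; (classify-decode-result nil nil 1 2) --> :INVALID
;; (classify-decode-result nil t 3 3)   --> :NON-CHARACTER-CODE
```

The :NEED-MORE-BYTES case is the one a buffering reader cares about: it should keep reading until at least `length` bytes are available and then retry.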


```
(babel:decode-character (coerce #(65 32 66) '(vector (unsigned-byte 8)))
                        :start 0 :end 3 :encoding :utf-8)
--> #\A
    T
    1

(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 3 :encoding :utf-8)
--> #\À
    T
    2

(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 1 :encoding :utf-8)
--> NIL
    T
    2

(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 1 :encoding :utf-8)
--> NIL
    T
    2

(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 2 :encoding :utf-8)
--> NIL
    NIL
    1

(babel:decode-character (coerce #(#xED #xA0 #x80) '(vector (unsigned-byte 8)))
                        :start 0 :end 3 :encoding :utf-8)
--> NIL
    T
    3
```
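
The examples above can be reproduced with a minimal UTF-8-only sketch in portable Common Lisp. This is not Babel's implementation and takes no :encoding argument: `decode-utf-8-character` and `utf-8-sequence-length` are hypothetical names. Two assumptions are worth flagging: surrogate codes are reported as NIL, T, length explicitly, because `code-char` returns a character for them on some implementations, and overlong forms are only detected once the sequence is complete.

```lisp
(defun utf-8-sequence-length (lead)
  "Length of the UTF-8 sequence that LEAD starts, or NIL if LEAD cannot start one."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun decode-utf-8-character (bytes &key (start 0) (end (length bytes)))
  "Hypothetical UTF-8-only stand-in for the proposed DECODE-CHARACTER.
Returns three values: CHARACTER, SEQUENCE-VALID-P and LENGTH."
  (let* ((available (- end start))
         (lead (and (plusp available) (aref bytes start)))
         (needed (and lead (utf-8-sequence-length lead))))
    (cond
      ;; No bytes yet: any valid sequence may still follow.
      ((null lead) (values nil t 1))
      ;; The first byte cannot start a sequence.
      ((null needed) (values nil nil 1))
      (t
       (let ((code (logand lead (ecase needed (1 #x7F) (2 #x1F) (3 #x0F) (4 #x07)))))
         ;; Fold in the continuation bytes that are present.
         (loop for i from 1 below (min needed available)
               do (let ((byte (aref bytes (+ start i))))
                    (if (= (logand byte #xC0) #x80)
                        (setf code (logior (ash code 6) (logand byte #x3F)))
                        (return-from decode-utf-8-character
                          (values nil nil 1)))))
         (cond
           ;; Valid so far, but truncated.
           ((< available needed) (values nil t needed))
           ;; Overlong form, or beyond the Unicode range.
           ((or (< code (ecase needed (1 0) (2 #x80) (3 #x800) (4 #x10000)))
                (> code #x10FFFF))
            (values nil nil 1))
           ;; Well-formed bit pattern, but no character for this code.
           ((or (<= #xD800 code #xDFFF) (>= code char-code-limit))
            (values nil t needed))
           (t (values (code-char code) t needed))))))))

;; (decode-utf-8-character #(195 128 32) :start 0 :end 1) --> NIL, T, 2
;; (decode-utf-8-character #(195 195 32) :start 0 :end 2) --> NIL, NIL, 1
```

A table-driven decoder inside Babel could presumably answer the same question per encoding without building a string, which is the efficiency point of the proposal.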


Decoding Invalid Code Sequence Consistency #41
