The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101

@guy038

Remark: This issue was first noticed by Alan Kilborn, revisited by Peter Jones, and discussed in this topic:

https://community.notepad-plus-plus.org/post/59397

Description of the Issue

When using the Search > Go to... feature with the Offset option ticked, the offsets of every byte after the first one in a multi-byte encoded character should be inaccessible!

Steps to Reproduce the Issue

  • Open a new tab in N++

  • If necessary, use Encoding > Convert to UTF-8 to get an empty UTF-8 encoded file

  • Just type in the text A👨Z on the first line

Note that, as the emoji MAN 👨 is the Unicode character with code point U+1F468, this line can be described, in a UTF-8 encoded file, as:

Characters :   A         👨        Z
Bytes      :   41  F0  9F  91  A8  5A
Offset     :   0   1   2   3   4   5

  • If you move the caret right before the A char, the Search > Go to... feature says you're at offset 0

  • If you move the caret right before the 👨 char, the Search > Go to... feature says you're at offset 1

  • If you move the caret right before the Z char, the Search > Go to... feature says you're at offset 5

All these offsets are correct. However, they should be the only offsets accepted in the "You want to go to" field!
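As an illustration only (plain Python, not Notepad++ or Scintilla code), the byte layout above and the offsets at which a character starts can be recomputed as follows. The helper name `char_start_offsets` is hypothetical and introduced just for this sketch.

```python
def char_start_offsets(text):
    """Return the UTF-8 byte offset at which each character of `text` starts."""
    offsets = []
    position = 0
    for ch in text:
        offsets.append(position)
        # Advance by the number of bytes this character occupies in UTF-8.
        position += len(ch.encode("utf-8"))
    return offsets


line = "A\U0001F468Z"  # A, MAN emoji (U+1F468), Z

print(line.encode("utf-8").hex(" ").upper())  # 41 F0 9F 91 A8 5A
print(char_start_offsets(line))               # [0, 1, 5]
```

Offsets 2, 3 and 4 never appear in the result because they fall inside the four-byte sequence of the emoji.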

Actual Behavior

Now, let's force a move to offset 3, exactly in the middle of the multi-byte sequence of the emoji char (byte 91), and then click on the Go button

  • The caret appears to sit right before the Z letter. In fact:

    • If you hit the Backspace key, you get the text Ax91xA8Z, so the first two bytes of the encoding xF0x9F, before the offset, are deleted

    • If you hit the Delete key, you get the text AxF0x9FxA8Z, so the next x91 byte, after the offset, is deleted

  • In addition, as you can see, the actions of the Backspace and Delete keys are not symmetrical: the former deletes two bytes (the beginning of the multi-byte sequence) whereas the latter deletes just one byte (x91)
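The two results reported above can be reproduced on the raw bytes with a short Python sketch. It does not run Notepad++; it only removes the same bytes the report says were removed and checks that what remains is no longer valid UTF-8. The helper `is_valid_utf8` is a name made up for this example.

```python
data = "A\U0001F468Z".encode("utf-8")  # 41 F0 9F 91 A8 5A
offset = 3                              # caret forced onto byte 91

# Reported Backspace result: bytes F0 9F (offsets 1 and 2) are removed.
after_backspace = data[:1] + data[offset:]
# Reported Delete result: only byte 91 (offset 3) is removed.
after_delete = data[:offset] + data[offset + 1:]


def is_valid_utf8(raw):
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(after_backspace.hex(" ").upper(), is_valid_utf8(after_backspace))  # 41 91 A8 5A False
print(after_delete.hex(" ").upper(), is_valid_utf8(after_delete))        # 41 F0 9F A8 5A False
```

In both cases the leftover bytes of the emoji are orphaned, which is consistent with the editor displaying them as raw x.. bytes.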

Expected Behavior

The offset values that point at the individual bytes of a multi-byte sequence, after the first byte, in a Unicode encoded file, should not be allowed! For instance, in the example above, the allowed values should be, exclusively, 0, 1 and 5

Then, Backspace and Delete would each act on exactly one character, as expected!
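One possible way to enforce this, sketched in Python under stated assumptions rather than taken from the Notepad++ source: in UTF-8, every byte after the first one of a multi-byte sequence is a continuation byte of the form 10xxxxxx, so an offset is acceptable only if the byte found there is not a continuation byte. The function names below are hypothetical, the buffer is assumed to hold valid UTF-8, and two choices are assumptions of this sketch, not requests made in this issue: the end-of-buffer offset is treated as valid, and a rejected offset is snapped back to the start of its character.

```python
def is_char_boundary(data, offset):
    """Return True if `offset` is at the start of a character in UTF-8 `data`."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # assumption: the end of the buffer is a valid caret position
    # Continuation bytes match the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def snap_to_char_start(data, offset):
    """Clamp `offset` to the buffer, then move it back to a character start."""
    offset = max(0, min(offset, len(data)))
    while offset > 0 and not is_char_boundary(data, offset):
        offset -= 1
    return offset


data = "A\U0001F468Z".encode("utf-8")  # 41 F0 9F 91 A8 5A

print([o for o in range(len(data) + 1) if is_char_boundary(data, o)])  # [0, 1, 5, 6]
print(snap_to_char_start(data, 3))                                     # 1
```

Rejecting the offset outright, instead of snapping, would only need the `is_char_boundary` check; which behavior is preferable is left open here.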

Best Regards,

guy038
