[BUG] Search Count can hang for a regex that matches null before a multi-byte utf-8 character

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Description of the Issue

When a regex search in UTF-8 matches null before a multi-byte character, a subsequent match can hang when backtracking across that character.

This problem was reported in [this post](https://community.notepad-plus-plus.org/topic/26642/) in the community forums.

### Steps To Reproduce

1.  Paste `a└c` into an empty Notepad++ tab.
2.  Open the **Find** dialog, set **Search Mode: Regular expression**, and **Find what:** `(?-i)\u*(?=[^\l])`.
3.  Click **Count**.

### Current Behavior

Notepad++ becomes unresponsive.

### Expected Behavior

Counting should proceed normally without hanging.

### Debug Information

```shell
Notepad++ v8.7.7   (64-bit)
Build time : Feb  6 2025 - 03:19:13
Path : C:\Program Files\Notepad++\notepad++.exe
Command Line : 
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
Periodic Backup : OFF
Placeholders : OFF
DirectWrite : ON
Multi-instance Mode : monoInst
File Status Auto-Detection : cdEnabledNew (for current file/tab only)
Dark Mode : OFF
OS Name : Windows 10 Pro (64-bit)
OS Version : 22H2
OS Build : 19045.5371
Current ANSI codepage : 1252
Plugins : 
    ColumnsPlusPlus (1.1.5.2)
    DSpellCheck (1.5)
    mimeTools (3.1)
    NppConverter (4.6)
    NppExport (0.4)
```

### Anything else?

I've traced the cause of the problem, but I don't yet have a proposed solution.

When counting, [this code](https://github.com/notepad-plus-plus/notepad-plus-plus/blob/fd2157729a973c30184137c61fc3a8e36663cdbe/boostregex/BoostRegExSearch.cxx#L432) is executed after a null match:

```cpp
Sci::Position BoostRegexSearch::SearchParameters::nextCharacter(Sci::Position position)
{
	if (_skip_windows_line_end_as_one_character && _document->CharAt(position) == '\r' && _document->CharAt(position+1) == '\n')
		return position + 2;
	else
		return position + 1;
}
```
to advance the match position before matching again. If the null match occurred before a multi-byte character, that places the starting position on the second byte of the character.
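To make the byte arithmetic concrete, here is a minimal standalone sketch (not Notepad++ code). It assumes the sample text is `a└c`, where `└` is U+2514 and encodes in UTF-8 as the three bytes `E2 94 94`. The helper names `sampleBytes`, `isContinuationByte`, and `advanceOneByte` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the UTF-8 bytes of "a└c" (61 E2 94 94 63).
// The literal is split so that "\x94" is not parsed together with the 'c'.
inline std::string sampleBytes()
{
    return std::string("a\xE2\x94\x94" "c");
}

// True if the byte has the UTF-8 continuation pattern 10xxxxxx.
inline bool isContinuationByte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Mirrors only the "position + 1" branch of nextCharacter quoted above.
inline std::size_t advanceOneByte(std::size_t position)
{
    return position + 1;
}
```

With this text, the null match sits at byte offset 1 (just before `└`), so `advanceOneByte(1)` yields offset 2, and `isContinuationByte` reports that offset 2 holds a continuation byte rather than the start of a character.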

A [UTF8DocumentIterator](https://github.com/notepad-plus-plus/notepad-plus-plus/blob/fd2157729a973c30184137c61fc3a8e36663cdbe/boostregex/UTF8DocumentIterator.cxx) is later initialized with this position. It processes the invalid start byte as if it were valid. In the example case, it calculates a length of two and does a bogus computation of the Unicode character it represents. It turns out this character satisfies `\u`, so `\u*` matches one character.

The next match in the example fails; when it does, the regex engine begins to unwind `\u*` by decrementing the iterator it saved as the end of that sub-expression. At this point, **UTF8DocumentIterator** has no record of the fact that it started in the middle of a character; when it decrements, it decrements to the start of the previous *valid* character, one byte before the position at which the match started.
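The backtracking step can be modeled with a simplified sketch. This is not the actual **UTF8DocumentIterator** code; `previousCharacterStart` and `backtrackSample` are hypothetical names, and the function only assumes that a decrement moves back one byte and then keeps moving back over continuation bytes until it reaches a lead byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the UTF-8 bytes of "a└c" (61 E2 94 94 63).
inline std::string backtrackSample()
{
    return std::string("a\xE2\x94\x94" "c");
}

// Simplified model of decrementing a UTF-8 iterator: step back one byte,
// then continue backwards while sitting on a continuation byte (10xxxxxx).
inline std::size_t previousCharacterStart(const std::string& text, std::size_t pos)
{
    if (pos == 0)
        return 0;
    --pos;
    while (pos > 0 && (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80)
        --pos;
    return pos;
}
```

In the example, the match attempt starts at offset 2 and the bogus two-byte character ends at offset 4. Under this model, `previousCharacterStart(backtrackSample(), 4)` returns 1, which lies before the starting offset 2, so the engine ends up outside the range it believes it is unwinding.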

I didn't trace the rest of the failure, since the above seems to be enough to explain the misbehavior. It's not yet clear to me what the best way to fix this is. If **BoostRegexSearch::SearchParameters::nextCharacter** is also used in ANSI searches (I have not verified whether it is), the problem can't be fixed by just changing that function to scan for valid UTF-8; either a flag or two versions would be needed, one for ANSI and one for UTF-8. A reasonable alternative might be to make **UTF8DocumentIterator** skip forward to the start of the next valid character if it is positioned within a character. However, the implications for backwards searches (if they are enabled) and for error bytes would have to be considered carefully.
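As a rough illustration of the skip-forward idea, the sketch below moves a byte offset forward past any continuation bytes. It is an assumption about what such a fix could look like, not a tested patch: `skipToCharacterStart` and `skipSample` are hypothetical names, and the function works on a plain `std::string` rather than on a Scintilla document.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the UTF-8 bytes of "a└c" (61 E2 94 94 63).
inline std::string skipSample()
{
    return std::string("a\xE2\x94\x94" "c");
}

// If pos sits on a continuation byte (10xxxxxx), advance until it reaches
// a byte that can start a character, or the end of the text.
// Caveat: a stray continuation byte (an error byte) is skipped as well,
// which is one of the cases that would need separate consideration.
inline std::size_t skipToCharacterStart(const std::string& text, std::size_t pos)
{
    while (pos < text.size() &&
           (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80)
        ++pos;
    return pos;
}
```

For the example text, `skipToCharacterStart(skipSample(), 2)` returns 4 (the `c`), while an offset that already sits on a character boundary, such as 1, is returned unchanged.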
