
[BUG] Search Count can hang for a regex that matches null before a multi-byte utf-8 character #16207

@Coises

Is there an existing issue for this?

  • I have searched the existing issues

Description of the Issue

When a regex search in a UTF-8 document produces a null (empty) match immediately before a multi-byte character, the next match attempt can hang while backtracking across that character.

This problem was reported in this post in the community forums.

Steps To Reproduce

  1. Paste a└c into an empty Notepad++ tab.
  2. Open the Find dialog, set Search Mode: Regular expression, and Find what: (?-i)\u*(?=[^\l]).
  3. Click Count.

Current Behavior

Notepad++ becomes unresponsive.

Expected Behavior

Counting should proceed normally without hanging.

Debug Information

Notepad++ v8.7.7   (64-bit)
Build time : Feb  6 2025 - 03:19:13
Path : C:\Program Files\Notepad++\notepad++.exe
Command Line : 
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
Periodic Backup : OFF
Placeholders : OFF
DirectWrite : ON
Multi-instance Mode : monoInst
File Status Auto-Detection : cdEnabledNew (for current file/tab only)
Dark Mode : OFF
OS Name : Windows 10 Pro (64-bit)
OS Version : 22H2
OS Build : 19045.5371
Current ANSI codepage : 1252
Plugins : 
    ColumnsPlusPlus (1.1.5.2)
    DSpellCheck (1.5)
    mimeTools (3.1)
    NppConverter (4.6)
    NppExport (0.4)

Anything else?

I’ve traced the cause of the problem, but I don’t yet have a proposed solution.

When counting, this code is executed after a null match:

Sci::Position BoostRegexSearch::SearchParameters::nextCharacter(Sci::Position position)
{
	if (_skip_windows_line_end_as_one_character && _document->CharAt(position) == '\r' && _document->CharAt(position+1) == '\n')
		return position + 2;
	else
		return position + 1;
}

to advance the match position before matching again. If the null match occurred before a multibyte character, that places the starting position on the second byte of the character.
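To make the byte arithmetic concrete, here is a minimal standalone sketch (not Notepad++ code). It assumes the sample text a└c is stored as UTF-8, where └ (U+2514) occupies the three bytes E2 94 94, and it models only the non-CRLF branch of nextCharacter. All helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for a UTF-8 continuation byte (10xxxxxx).
inline bool isContinuationByte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Models the non-CRLF branch of nextCharacter: advance by exactly one byte.
inline std::size_t advanceOneByte(std::size_t position) {
    return position + 1;
}

// "a└c" in UTF-8: 'a' at offset 0, E2 94 94 at offsets 1..3, 'c' at offset 4.
// The literal is split so that 'c' is not parsed as part of the hex escape.
inline std::string sampleText() {
    return std::string("a\xE2\x94\x94" "c");
}

// After a null match at `position`, report whether the next search would
// start on a continuation byte, that is, inside a multi-byte character.
inline bool nextStartIsMidCharacter(const std::string& text, std::size_t position) {
    const std::size_t next = advanceOneByte(position);
    return next < text.size() &&
           isContinuationByte(static_cast<unsigned char>(text[next]));
}
```

With the sample text, a null match at offset 1 (just before └) makes the next search start at offset 2, which is a continuation byte.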

A UTF8DocumentIterator is later initialized with this position. It processes the invalid start byte as if it were valid. In the example case, it calculates a length of two and does a bogus computation of the Unicode character it represents. It turns out this character satisfies \u, so \u* matches one character.
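The bogus computation can be reproduced with a simplified, non-validating decoder. This is an assumed model of a decoder that trusts the first byte, not the actual UTF8DocumentIterator code; the struct and function names are hypothetical, and the byte values assume └ is encoded as E2 94 94.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

struct Decoded {
    std::size_t length;       // number of bytes the decoder believes it consumed
    std::uint32_t codePoint;  // value computed from those bytes
};

// Simplified decoder (an assumption for illustration): it infers the sequence
// length from the high bits of the first byte and never checks that the first
// byte is a legal lead byte.
inline Decoded decodeWithoutValidation(const std::string& text, std::size_t pos) {
    const unsigned char first = static_cast<unsigned char>(text[pos]);
    if (first < 0x80) return {1, first};  // plain ASCII
    std::size_t length = 2;
    unsigned mask = 0x20;
    while ((first & mask) != 0 && length < 4) {
        mask >>= 1;
        ++length;
    }
    std::uint32_t cp = first & (mask - 1);  // payload bits of the first byte
    for (std::size_t i = 1; i < length && pos + i < text.size(); ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(text[pos + i]) & 0x3F);
    return {length, cp};
}
```

Decoding a└c from offset 1 yields length 3 and U+2514, as expected. Decoding from offset 2 (the continuation byte 0x94) yields length 2 and U+0514, an uppercase letter in the Cyrillic Supplement block, which is consistent with \u* matching one character there.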

The next match in the example fails; when it does, the regex engine begins to unwind \u* by decrementing the iterator it saved as the end of that sub-expression. At this point, UTF8DocumentIterator has no record of the fact that it started in the middle of a character; when it decrements, it decrements to the start of the previous valid character, one byte before the position at which the match started.
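The backward step can be modeled the same way. The sketch below is a hypothetical stand-in for the iterator's decrement, assuming it walks back over continuation bytes until it reaches a byte that can start a character. In the example the match started at offset 2 and the bogus two-byte character ended at offset 4.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model of an iterator decrement: step back one byte, then keep
// stepping back while the current byte is a UTF-8 continuation byte (10xxxxxx).
inline std::size_t previousCharacterStart(const std::string& text, std::size_t pos) {
    if (pos == 0) return 0;
    --pos;
    while (pos > 0 && (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80)
        --pos;
    return pos;
}
```

Applied to a└c (bytes 61 E2 94 94 63), stepping back from offset 4 lands on offset 1, the real start of └, which is before the offset 2 at which the match attempt began.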

I didn’t trace the rest of the failure, since the above seems to be enough to explain the misbehavior. It’s not yet clear to me what the best fix is. If BoostRegexSearch::SearchParameters::nextCharacter is also used in ANSI searches (I have not verified whether it is), the problem can’t be fixed just by changing that function to scan for valid UTF-8; either a flag or two versions would be needed, one for ANSI and one for UTF-8. A reasonable alternative might be to make UTF8DocumentIterator skip forward to the start of the next valid character whenever it is positioned within a character. However, the implications for backwards searches (if they are enabled) and for error bytes would have to be considered carefully.
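As a rough illustration of the "skip forward" idea combined with an explicit encoding flag, here is a standalone sketch. It is not a proposed patch: the function names and the isUtf8 parameter are hypothetical, it works on a std::string rather than a Scintilla document, and it omits the CRLF handling that nextCharacter performs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical: move `pos` forward past any UTF-8 continuation bytes so that
// it rests on a lead byte, an ASCII byte, or the end of the text.
inline std::size_t alignForward(const std::string& text, std::size_t pos) {
    while (pos < text.size() &&
           (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80)
        ++pos;
    return pos;
}

// Hypothetical variant of the advance step. The flag keeps the plain one-byte
// advance for single-byte (ANSI) documents, where bytes 0x80..0xBF are
// ordinary characters and must not be skipped.
inline std::size_t nextCharacterSketch(const std::string& text, std::size_t pos, bool isUtf8) {
    const std::size_t next = pos + 1;
    return isUtf8 ? alignForward(text, next) : next;
}
```

For a└c, advancing from offset 1 gives offset 4 in UTF-8 mode and offset 2 in ANSI mode. Note that this sketch also skips stray continuation bytes in malformed input, which is one of the error-byte cases that would need a deliberate decision.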
