
Fix tokenization of escaped backslashes in SQL string literals#837

Open
RamiNoodle733 wants to merge 1 commit into andialbrecht:master from RamiNoodle733:fix-issue-814-escaped-backslash-tokenization

Conversation

@RamiNoodle733

Fixes #814

Problem

Escaped backslashes in SQL string literals were not being tokenized correctly. For example, the query:

SELECT '\\', '\\'

currently results in incorrect tokenization where the two string literals are merged together with the comma.

Root Cause

The regex pattern for matching single-quoted strings did not treat an escaped backslash (\\) as a valid sequence within the string. Given '\\', the second backslash together with the closing quote matched the \' escape alternative, so the closing quote was consumed as string content and the match ran past the end of the literal.

Solution

Added \\\\ to the string literal patterns to match escaped backslashes (\\) as valid content within string literals. This ensures that:

  • '\\' is tokenized as a single string containing one backslash
  • '\\', '\\' is correctly tokenized as two separate string literals
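The effect of the added alternative can be sketched with simplified stand-in patterns. These are illustrative assumptions only; the real String.Single entries in sqlparse/keywords.py (SQL_REGEX) are longer:

```python
import re

SQL = r"SELECT '\\', '\\'"  # each literal contains two backslashes

# Simplified stand-ins for the String.Single pattern; not the actual
# sqlparse regexes, just enough to show the alternation at issue.
OLD = re.compile(r"'(?:''|\\'|[^'])*'")       # before: no \\\\ alternative
NEW = re.compile(r"'(?:''|\\'|\\\\|[^'])*'")  # after: \\\\ tried as content

# Before the fix, the second backslash plus the closing quote match the
# \\' escape alternative, so the closing quote is swallowed and the match
# runs on through the comma into the next literal's opening quote.
print([m.group(0) for m in OLD.finditer(SQL)])  # ["'\\\\', '"]

# With \\\\ in the alternation, each backslash pair is consumed as
# content and every literal terminates at its own closing quote.
print([m.group(0) for m in NEW.finditer(SQL)])  # ["'\\\\'", "'\\\\'"]
```

Ordering matters: \\\\ must appear in the alternation before the generic [^'] branch gets a chance to split the pair, so both backslashes are consumed in one step.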

Changes

  • Modified sqlparse/keywords.py: Added \\\\ to String.Single and String.Symbol patterns
  • Added test case in tests/test_tokenize.py for escaped backslash tokenization

All existing tests continue to pass.

…ndialbrecht#814)

The regex pattern for matching single-quoted strings did not handle
escaped backslashes (\\) properly. This caused strings like '\\', '\\'
to be incorrectly tokenized as a single string containing the comma
instead of two separate string literals.

Changes:
- Add \\\\ to the string literal patterns in SQL_REGEX to match
  escaped backslashes as valid content within string literals
- Add test case for escaped backslash tokenization

Fixes andialbrecht#814
Copilot AI review requested due to automatic review settings February 7, 2026 23:45

Copilot AI left a comment


Pull request overview

This PR fixes a lexer bug in sqlparse where SQL single-quoted string literals containing escaped backslashes (e.g. '\\') could incorrectly consume the closing quote and merge subsequent tokens (comma, whitespace, next string). It addresses #814 by updating the string-literal regex so pairs of backslashes are consumed as content, preventing the closing quote from being misinterpreted as escaped.

Changes:

  • Extend the String.Single and String.Symbol regex patterns to recognize escaped backslashes (\\\\) within quoted strings.
  • Add a regression test ensuring SELECT '\\', '\\' tokenizes into two separate string tokens (with punctuation/whitespace between).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • sqlparse/keywords.py: Updates lexer regex patterns for quoted strings to correctly consume escaped backslashes.
  • tests/test_tokenize.py: Adds a regression test covering correct tokenization of single-quoted strings containing escaped backslashes.


Comment on lines +107 to +114
tokens = list(lexer.tokenize(sql))
# Should be: SELECT, ws, '\\', ,, ws, '\\'
assert tokens[0] == (T.Keyword.DML, 'SELECT')
assert tokens[1] == (T.Whitespace, ' ')
assert tokens[2] == (T.String.Single, "'\\\\'")
assert tokens[3] == (T.Punctuation, ',')
assert tokens[4] == (T.Whitespace, ' ')
assert tokens[5] == (T.String.Single, "'\\\\'")

Copilot AI Feb 7, 2026


Consider asserting the full token stream shape (e.g., len(tokens) == 6) to ensure the lexer doesn’t emit extra tokens (like T.Error for an unterminated string) while still satisfying the first few positional assertions. This makes the regression test stronger and easier to diagnose if it fails.
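A sketch of the strengthened assertions, using a hard-coded token list as a stand-in for list(lexer.tokenize(sql)); in the real test the first tuple element would be a sqlparse.tokens ttype rather than a string:

```python
# Stand-in for list(lexer.tokenize("SELECT '\\', '\\'")) from the test
# above; ttypes abbreviated to plain strings for illustration.
tokens = [
    ('Keyword.DML', 'SELECT'),
    ('Whitespace', ' '),
    ('String.Single', r"'\\'"),
    ('Punctuation', ','),
    ('Whitespace', ' '),
    ('String.Single', r"'\\'"),
]

# Pinning the length first catches stray trailing tokens (e.g. a T.Error
# for an unterminated string) before the positional asserts run.
assert len(tokens) == 6
assert tokens[2][1] == tokens[5][1] == r"'\\'"
```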

Comment on lines +64 to 65
(r'"(""|\\"|\\\\|[^"])*"', tokens.String.Symbol),
(r'(""|".*?[^\\]")', tokens.String.Symbol),

Copilot AI Feb 7, 2026


String.Symbol was updated to treat escaped backslashes (\\) as valid content, but there's no test coverage for double-quoted tokens/identifiers containing escaped backslashes (e.g., "\\" or "a\\b"). Since this change can affect how quoted identifiers are tokenized/grouped, please add a focused regression test exercising the new branch.
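A quick illustration of what such a test would need to cover, running the String.Symbol pattern from the diff above as a standalone regex (the pattern is copied from the change; the surrounding lexer machinery is omitted):

```python
import re

# The String.Symbol pattern added in this PR (sqlparse/keywords.py),
# used here outside the lexer purely for illustration.
SYMBOL = re.compile(r'"(""|\\"|\\\\|[^"])*"')

IDENT = r'SELECT "a\\b", "\\"'  # quoted identifiers with backslash pairs

# Each double-quoted identifier should terminate at its own closing
# quote instead of merging with the following token.
print([m.group(0) for m in SYMBOL.finditer(IDENT)])  # ['"a\\\\b"', '"\\\\"']
```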



Successfully merging this pull request may close these issues.

Incorrect Tokenization of Escaped Backslashes in SQL String Literals
