Fix tokenization of escaped backslashes in SQL string literals#837
RamiNoodle733 wants to merge 1 commit into andialbrecht:master
Conversation
(andialbrecht#814) The regex pattern for matching single-quoted strings did not handle escaped backslashes (`\\`) properly. This caused strings like `'\\', '\\'` to be incorrectly tokenized as a single string with a comma instead of two separate string literals.

Changes:
- Add `\\\\` to the string literal patterns in SQL_REGEX to match escaped backslashes as valid content within string literals
- Add a test case for escaped backslash tokenization

Fixes andialbrecht#814
Pull request overview
This PR fixes a lexer bug in sqlparse where SQL single-quoted string literals containing escaped backslashes (e.g. '\\') could incorrectly consume the closing quote and merge subsequent tokens (comma, whitespace, next string). It addresses #814 by updating the string-literal regex so pairs of backslashes are consumed as content, preventing the closing quote from being misinterpreted as escaped.
Changes:
- Extend the `String.Single` and `String.Symbol` regex patterns to recognize escaped backslashes (`\\\\`) as valid content within quoted strings.
- Add a regression test ensuring `SELECT '\\', '\\'` tokenizes into two separate string tokens (with punctuation/whitespace between).
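To see the bug and the fix in isolation, here is a small standalone sketch using plain `re` patterns modeled on the `SQL_REGEX` entries (a sketch only; the exact patterns in `sqlparse/keywords.py` may be written slightly differently):

```python
import re

# Simplified stand-ins for the String.Single entry in SQL_REGEX
# (sketch only; the real patterns in sqlparse/keywords.py may differ).
OLD = re.compile(r"'(''|\\'|[^'])*'")        # before the fix
NEW = re.compile(r"'(''|\\'|\\\\|[^'])*'")   # after: \\\\ matches a backslash pair

sql = r"SELECT '\\', '\\'"  # two literals, each holding an escaped backslash

# Old pattern: the \\' branch eats the second backslash *and* the closing
# quote, so the match runs on through the comma into the next literal.
assert OLD.search(sql).group(0) == r"'\\', '"

# New pattern: the backslash pair is consumed as content, the literal closes
# at the right quote, and the two strings tokenize separately.
assert [m.group(0) for m in NEW.finditer(sql)] == [r"'\\'", r"'\\'"]
```

The key point is alternation order inside the group: without a `\\\\` branch, the engine can only explain a backslash pair as `[^']` followed by `\\'`, which misreads the closing quote as escaped content.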
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `sqlparse/keywords.py` | Updates lexer regex patterns for quoted strings to correctly consume escaped backslashes. |
| `tests/test_tokenize.py` | Adds a regression test covering correct tokenization of single-quoted strings containing escaped backslashes. |
```python
tokens = list(lexer.tokenize(sql))
# Should be: SELECT, ws, '\\', ,, ws, '\\'
assert tokens[0] == (T.Keyword.DML, 'SELECT')
assert tokens[1] == (T.Whitespace, ' ')
assert tokens[2] == (T.String.Single, "'\\\\'")
assert tokens[3] == (T.Punctuation, ',')
assert tokens[4] == (T.Whitespace, ' ')
assert tokens[5] == (T.String.Single, "'\\\\'")
```
Consider asserting the full token stream shape (e.g., len(tokens) == 6) to ensure the lexer doesn’t emit extra tokens (like T.Error for an unterminated string) while still satisfying the first few positional assertions. This makes the regression test stronger and easier to diagnose if it fails.
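A sketch of what that stronger whole-stream assertion could look like, using a toy stand-in tokenizer built from the fixed pattern (hypothetical names; sqlparse's real lexer and token types differ):

```python
import re

# Toy tokenizer built from the fixed String.Single pattern -- a stand-in
# for sqlparse's lexer, just to illustrate asserting the full stream shape.
TOKEN_RE = re.compile(r"""
    (?P<DML>SELECT)
  | (?P<WS>\s+)
  | (?P<STR>'(?:''|\\'|\\\\|[^'])*')
  | (?P<PUNCT>,)
""", re.VERBOSE)

def toy_tokenize(sql):
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(sql)]

tokens = toy_tokenize(r"SELECT '\\', '\\'")

# Assert the *entire* shape, not just the leading positions, so an extra
# token (e.g. an Error from an unterminated string) would fail loudly.
assert len(tokens) == 6
assert [t[0] for t in tokens] == ['DML', 'WS', 'STR', 'PUNCT', 'WS', 'STR']
```

In the real test the equivalent would be `assert len(tokens) == 6` after the existing positional assertions.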
```python
(r'"(""|\\"|\\\\|[^"])*"', tokens.String.Symbol),
(r'(""|".*?[^\\]")', tokens.String.Symbol),
```
`String.Symbol` was updated to treat escaped backslashes (`\\`) as valid content, but there's no test coverage for double-quoted tokens/identifiers containing escaped backslashes (e.g., `"\\"` or `"a\\b"`). Since this change can affect how quoted identifiers are tokenized/grouped, please add a focused regression test exercising the new branch.
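A minimal sketch of such a test, exercising the updated double-quoted pattern directly with `re` (the pattern is taken from the diff above; a real regression test would go through sqlparse's lexer instead):

```python
import re

# The updated String.Symbol pattern, checked in isolation.
SYMBOL = re.compile(r'"(""|\\"|\\\\|[^"])*"')

sql = r'SELECT "a\\b", "\\"'  # double-quoted identifiers with escaped backslashes

# Each identifier should match on its own; the backslash pair must not
# swallow the closing double quote.
assert [m.group(0) for m in SYMBOL.finditer(sql)] == [r'"a\\b"', r'"\\"']
```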
Fixes #814
Problem
Escaped backslashes in SQL string literals were not being tokenized correctly. For example, the query `SELECT '\\', '\\'` currently results in incorrect tokenization where the two string literals are merged together with the comma.
Root Cause
The regex pattern for matching single-quoted strings did not handle escaped backslashes (`\\`) as a valid sequence within the string. When a backslash followed by a quote was encountered, it was interpreted as an escape sequence, but standalone escaped backslashes weren't properly consumed.

Solution
Added `\\\\` to the string literal patterns to match escaped backslashes (`\\`) as valid content within string literals. This ensures that:

- `'\\'` is tokenized as a single string containing one backslash
- `'\\', '\\'` is correctly tokenized as two separate string literals
Changes

- `sqlparse/keywords.py`: Added `\\\\` to `String.Single` and `String.Symbol` patterns
- `tests/test_tokenize.py`: Added a test for escaped backslash tokenization

All existing tests continue to pass.
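Both claims under Solution can be sanity-checked with a quick standalone `re` sketch (a plain regex standing in for the real `SQL_REGEX` entry, which may be written slightly differently):

```python
import re

# Stand-in for the fixed String.Single pattern (sketch only).
SINGLE = re.compile(r"'(''|\\'|\\\\|[^'])*'")

# '\\' is one literal whose content is a single escaped backslash.
assert SINGLE.fullmatch(r"'\\'")

# '\\', '\\' yields two separate literals, not one merged token.
assert [m.group(0) for m in SINGLE.finditer(r"'\\', '\\'")] == [r"'\\'", r"'\\'"]
```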