Conversation

@RamiNoodle733
Problem

SQL strings containing escaped backslashes (e.g., '\\') were incorrectly tokenized, causing subsequent tokens to be parsed as errors.

Root Cause

The regex patterns for string literals in keywords.py didn't include \\\\ to match escaped backslashes.

Fix

Updated SQL_REGEX patterns:

  • Single-quoted strings: Added \\\\ to the pattern
  • Double-quoted strings: Added \\\\ to the pattern
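
The effect of the added alternative can be sketched with simplified stand-ins for the patterns (OLD and NEW below are illustrative names, not the actual SQL_REGEX entries in keywords.py, which are more involved):

```python
import re

# Simplified stand-ins for the single-quoted string pattern in
# sqlparse/keywords.py; the real SQL_REGEX entries are more involved.
OLD = re.compile(r"'(?:''|\\'|[^'])*'")       # before the fix: no \\\\ alternative
NEW = re.compile(r"'(?:''|\\\\|\\'|[^'])*'")  # after the fix: \\ pairs match as content

sql = r"SELECT '\\', 'b'"  # two string literals: '\\' and 'b'

# Without the \\\\ alternative, the \\' branch pairs the second backslash
# with the literal's closing quote, so the match overruns into the next token.
print(OLD.search(sql).group(0))  # prints: '\\', '
print(NEW.search(sql).group(0))  # prints: '\\'
```

With the old pattern the first match swallows the comma and the opening quote of the next literal, which is exactly how the downstream error tokens arise.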

Testing

Added test_tokenize_escaped_backslash() to verify correct tokenization.

Fixes #814

Add \\ to the regex patterns for single and double quoted strings
to correctly tokenize SQL strings containing escaped backslashes.

Previously, a string like '\\' would be incorrectly tokenized,
causing subsequent tokens to be parsed as errors.

Fixes andialbrecht#814
Copilot AI left a comment

Pull request overview

Fixes sqlparse tokenization for SQL string literals containing escaped backslashes (e.g. '\\'), preventing strings from being split incorrectly and causing downstream tokens to be mis-parsed (Fixes #814).

Changes:

  • Extend the single-quoted and double-quoted string regexes to correctly match escaped backslashes (\\\\).
  • Add a regression test to validate correct tokenization of escaped backslashes in single-quoted strings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File                    Description
sqlparse/keywords.py    Updates string-literal regex patterns to properly handle escaped backslashes.
tests/test_tokenize.py  Adds a regression test that verifies escaped backslashes in single-quoted strings tokenize correctly.


Comment on lines +252 to +253
import sqlparse
from sqlparse import tokens as T
Copilot AI Feb 8, 2026

This test file already imports sqlparse and tokens as T at module scope; re-importing them inside this test is redundant and inconsistent with the rest of the file. Prefer using the existing module-level imports to keep test style consistent.

Suggested change
import sqlparse
from sqlparse import tokens as T

tokens = list(sqlparse.parse(sql)[0].flatten())
token_types = [t.ttype for t in tokens]

# Should be: SELECT, whitespace, ',', ,, whitespace, ',', (6 tokens after keyword)

Copilot AI Feb 8, 2026

The inline comment describing the expected token sequence is incorrect/garbled (it mentions commas/quotes in a way that doesn’t match the SQL). Please update it to reflect the actual expected flattened token order for SELECT '\\', '\\' to avoid misleading future readers.

Suggested change
# Should be: SELECT, whitespace, ',', ,, whitespace, ',', (6 tokens after keyword)
# Expected flattened token order: SELECT, <WS>, "'\\'", ',', <WS>, "'\\'"

Comment on lines +64 to 65
(r'"(""|\\\\|\\"|[^"])*"', tokens.String.Symbol),
(r'(""|".*?[^\\]")', tokens.String.Symbol),

Copilot AI Feb 8, 2026

This PR also changes the double-quoted pattern, but the added test only covers single-quoted strings. Please add a regression test exercising a double-quoted value containing escaped backslashes (and verifying tokenization doesn’t produce T.Error / doesn’t merge tokens) so the \\ addition here is covered.
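Until such a test lands, the double-quoted pattern from the diff can at least be exercised standalone. This is an illustrative sketch, not part of the test suite; the capturing group is made non-capturing here so that findall returns whole matches:

```python
import re

# The double-quoted pattern from the diff above, with its group made
# non-capturing so findall returns whole matches.
DQ = re.compile(r'"(?:""|\\\\|\\"|[^"])*"')

sql = r'SELECT "\\", "col"'
for m in DQ.finditer(sql):
    print(m.group(0))  # prints: "\\"  then: "col"
```

Both double-quoted values come back as single, complete matches, so the \\\\ alternative keeps an escaped backslash pair from pairing with the closing quote.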


Development

Successfully merging this pull request may close these issues.

Incorrect Tokenization of Escaped Backslashes in SQL String Literals
