SqlQuantumLeap

Searching for Case-Sensitive patterns in SQL Server

Dec 4th, 2015
162
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
  1. -- This SQL pertains to the following DBA.StackExchange answer:
  2. -- http://dba.stackexchange.com/questions/122612/how-to-find-values-with-multiple-consecutive-upper-case-characters/122625#122625
  3.  
  4. SET NOCOUNT ON;
  5. DECLARE @SampleData TABLE (Name NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL);
  6. INSERT INTO @SampleData (Name) VALUES (N'mr john smith'); -- no match #1
  7. INSERT INTO @SampleData (Name) VALUES (N'Mr John Smith'); -- no match #2
  8. INSERT INTO @SampleData (Name) VALUES (N'Mr JOHN Smith');
  9. INSERT INTO @SampleData (Name) VALUES (N'MR John Smith');
  10. INSERT INTO @SampleData (Name) VALUES (N'Mr John SMITH');
  11. INSERT INTO @SampleData (Name) VALUES (N'Mr JOhn Smith');
  12. INSERT INTO @SampleData (Name) VALUES (N'Mr ÜLala Jones');
  13. INSERT INTO @SampleData (Name) VALUES (N'Mr üLala Jones'); -- no match #3
  14. INSERT INTO @SampleData (Name) VALUES (N'Mr U' + NCHAR(0x0308) + N'Lala Jonez');
  15. INSERT INTO @SampleData (Name) VALUES (N'Mr U' + NCHAR(0x0308) + NCHAR(0x0309) + NCHAR(0x0308) + N'Lala Jonezzz');
  16. INSERT INTO @SampleData (Name) VALUES (N'Mr ULala Jonah'); -- FULLWIDTH LATIN CAPITAL LETTER U ( http://unicode.org/cldr/utility/character.jsp?a=FF35 )
  17. INSERT INTO @SampleData (Name) VALUES (N'Mr ÖLala Jonah'); -- U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS
  18. INSERT INTO @SampleData (Name) VALUES (N'Mr ѮLala Jonah'); -- U+046E CYRILLIC CAPITAL LETTER KSI
  19. INSERT INTO @SampleData (Name) VALUES (N'Mr ⓊLala Jonah'); -- U+24CA CIRCLED LATIN CAPITAL LETTER U  (noteable exception since Unicode category is Other_Symbol instead of Uppercase_Letter) ( http://unicode.org/cldr/utility/character.jsp?a=24CA )
  20.  
  21.  
  22. -- Case-Sensitive collation pulls back everything when using range syntax: [A-Z]
  23. SELECT Name AS [Case-Sensitive_Range]
  24. FROM @SampleData
  25. WHERE Name COLLATE Latin1_General_100_CS_AS LIKE N'%[A-Z][A-Z]%';
  26.  
  27. -- Binary collation filters too much:
  28. --   Ü isn't in A-Z
  29. --   U + NCHAR(0x0308) puts NCHAR(0x0308) between the "U" and any upper-case letter that follows.
  30. SELECT Name AS [Binary_Range]
  31. FROM @SampleData
  32. WHERE Name COLLATE Latin1_General_100_BIN2 LIKE N'%[A-Z][A-Z]%';
  33.  
  34. -- Case-Sensitive collation pulls back just the upper-case when specifying each letter,
  35. -- but can't see variations with accents or letters not in the US-English alphabet.
  36. -- However, it does find the wide character "U" that a Binary collation would exclude.
  37. SELECT Name AS [Case-Sensitive_Individual]
  38. FROM @SampleData
  39. WHERE Name COLLATE Latin1_General_100_CS_AS LIKE N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%';
  40.  
  41. -- Case-Sensitive collation with additional non-US-English letters pulls back more than before:
  42. -- it finds the decomposed character "U + NCHAR(0x0308)" that a Binary collation would exclude.
  43. -- However, it is not always intuitively obvious since similar looking characters that are different
  44. -- are treated as being different: the "Ӧ" in the LIKE is U+04E6 CYRILLIC CAPITAL LETTER O WITH DIAERESIS
  45. -- whereas the "Ö" in the test name is U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS.
  46. -- It also does not find U + NCHAR(0x0308) + NCHAR(0x0309) + NCHAR(0x0308) since that doesn't match any
  47. -- precomposed (or composite) character that can be added to the LIKE definition.
  48. SELECT Name AS [Case-Sensitive_Individual_AdditionalLetters]
  49. FROM @SampleData
  50. WHERE Name COLLATE Latin1_General_100_CS_AS LIKE N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZÜӦ][ABCDEFGHIJKLMNOPQRSTUVWXYZÜӦ]%';
  51.  
  52. -- Case-Sensitive but Accent-INsensitive collation, even without the additional letters, pulls
  53. -- back even more, including both "CYRILLIC CAPITAL LETTER O WITH DIAERESIS" and
  54. -- "LATIN CAPITAL LETTER O WITH DIAERESIS", and now also U + NCHAR(0x0308) + NCHAR(0x0309) + NCHAR(0x0308),
  55. -- since only the base letter matters now that accent marks are being ignored.
  56. -- However, it cannot find the "Ѯ", which is U+046E CYRILLIC CAPITAL LETTER KSI, since it
  57. -- is not in the LIKE definition.
  58. SELECT Name AS [Case-Sensitive_Accent-INsensitive_Individual_NoAdditionalLetters]
  59. FROM @SampleData
  60. WHERE Name COLLATE Latin1_General_100_CS_AI LIKE N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%'; -- U+04E6 CYRILLIC CAPITAL LETTER O WITH DIAERESIS
  61.  
  62.  
  63.  
  64. --==================================================================================
  65. --
  66. -- The following two examples make use of a SQLCLR Scalar Function available in the
  67. -- Free version of SQL# ( http://www.SQLsharp.com )
  68. --
  69. --=====================================================
  70.  
  71. -- Regular Expressions to the rescue:
  72. SELECT Name AS [RegEx_AllButCircledLetters]
  73. FROM @SampleData
  74. WHERE SQL#.RegEx_IsMatch4k(Name, N'(?:\p{Lu}\p{M}*){2}', 1, '') = 1
  75.  
  76.  
  77. -- There are some that are not matched by \p{Lu}, such as: Ⓐ - Ⓩ  (U+24B6 - U+24CF).
  78. -- Not a problem, we will just add in an "or" condition (i.e. "|") and specify that range.
  79. -- Diacritical marks cannot be added to the "circled letters" so the "or" is between the
  80. -- range and the "upper-case plus zero-or-more diacritical marks".
  81. SELECT Name AS [OneRegExToMatchThemAll]
  82. FROM @SampleData
  83. WHERE SQL#.RegEx_IsMatch4k(Name, N'(?:[\u24B6-\u24CF]|\p{Lu}\p{M}*){2}', 1, '') = 1
RAW Paste Data