ETL

SQL Server Function: TitleCase

Posted on

T-SQL TitleCase Function: Beyond the Obvious

Periodically, you may need to convert text into mixed-case in SQL Server. I needed it because our legacy database stores all our customer names in upper case. For example, one of our customers might look like the following:

  • Legacy sample customer: “ACME CLIENT, LLC”
  • Desired customer name: “Acme Client, LLC”

Anyone can write a simplistic title-case function pretty quickly. I enhanced mine to handle the following special cases:

  • State initials ID/WA/OR/WY/UT/NV/MT are capitalized

Select dbo.TitleCase(‘This is id and it is better than mT or Wy’)

This Is ID And It Is Better Than MT OR WY

  • Leaves most 3-letter abbreviations capitalized

Select dbo.TitleCase(‘RYO CORRECTIONAL FACILITY’)

RYO Correctional Facility

  • Removes repeated spaces

Select dbo.TitleCase(‘Had  double  spaces’)

Had Double Spaces

  • Preserves special characters

Select dbo.TitleCase(‘Has special character ‘ + NChar(8631) + ‘ embedded’)

Has Special Character ↷ Embedded

  • Runs reasonably fast (but would run faster if we built a table to hold my list of 3-letter English words)

In order to preserve 3-letter abbreviations, I had to get a list of English words. When my function encounters a 3-letter word, it checks to see if exists in my list. If so, convert to mixed case; otherwise leave all upper case.

Here’s the script

-- Author:	ActualRandy
-- Create date: 1/06/2016
-- Description:	Sends a string to Title Case while preserving abbreviations, also cleans-up repeated spaces
/*
Usage:
	SELECT dbo.TitleCase('This has "quotes" in it.  AND IS MIXED upper AND LOWER CASE.') 
	Select dbo.TitleCase('THIS IS WA')
	Select dbo.TitleCase('Has special character ' + NChar(8631) + ' embedded') 
*/
ALTER FUNCTION [dbo].[TitleCase] (@InputString NVARCHAR(4000) )
 RETURNS NVARCHAR(4000)
 AS
 BEGIN
 
  DECLARE @Index         INT                  --Position within the input string
  DECLARE @Char          NCHAR(1)             --The character we are examining
  Declare @Word          NChar(3)             --3-letter word that might need to be upper case
  DECLARE @OutputString  NVARCHAR(4000) = ''  --What we return
  Declare @IsStartOfWord TinyInt = 1          --When 0, we are in the middle of a word
  Declare @StateAbbrv    NChar(2)             --2-Letter string that might be a state
  Declare @LastChar      NChar(1) = ''        --Previous character we examined

  If @InputString Is Not Null
  Begin
    Set @Index = 1
    While @Index <= Len(@InputString)
    Begin
      Set @Char = Substring(@InputString, @Index, 1)
      --If we found repeated spaces...
      If Not (@Char = ' ' And @LastChar = ' ')
      Begin
        If @Char = ''''
        Begin
          Set @OutputString = @OutputString + @char
        End
        Else If @Char in (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&', '(' )
        Begin
          Set @IsStartOfWord = 1
          Set @OutputString = @OutputString + @char
        End
        Else
        Begin
          If @IsStartOfWord = 1
          Begin
            --Are we 2 letters from the end?
            If  (@Index = Len(@InputString) - 1)
              Or
              --Are we more than 2 letters from end of InputString, but just 2 letters from end of the current word?
              (
                @Index < Len(@InputString) - 1
                And 
                SUBSTRING(@InputString, @Index + 2, 1) in (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&', '(' )
              )
            Begin
              Set @StateAbbrv = Substring(@InputString, @Index, 2)
              If @StateAbbrv in ('ID', 'OR', 'WA', 'MT', 'NV','UT', 'WY')
              Begin
                Set @OutputString = @OutputString + Upper(@StateAbbrv)
                Set @Index = @Index + 1
                Set @IsStartOfWord = 1
              End
              Else
              Begin
                --We have a 2-letter word, but it is not a state abbrv
                Set @OutputString = @OutputString + Upper(@char)
                Set @IsStartOfWord = 0
              End
            End
            --Are we looking at a 3-letter word?, i.e. are we within
            -- the string and the 4th character hence is not a letter
            Else If (@Index <= Len(@InputString) - 2)
                And
                SUBSTRING(@InputString, @Index + 3, 1) in (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&', '(' )
            Begin
              Set @Word = SUBSTRING(@InputString, @Index, 3)
              --If it is a 3-letter word in English
			  --For brevity, I shrunk-down my list of words, which is actually quite huge
              If @Word In ('aah', 'ace','act','add','adj','ado','zoo')
              Begin
                --Set it to normal casing because it is an english word: 
                --uppercase on letter 1 and lowercase for the last 2:
                Set @OutputString = @OutputString + Upper(@Char) + Lower(Substring(@InputString, @Index + 1, 2))
                Set @IsStartOfWord = 0
                Set @Index = @Index + 2
              End
              Else
              --Probably some initials or abbreviation; i.e. a 3-letter string, but not an English word
              Begin
                Set @OutputString = @OutputString + Upper(@Word)
                Set @Index = @Index + 2
              End
            End
            Else
            Begin
              --We are at the beginning of a word, and it is not state abbrv or 3-letter word
              Set @OutputString = @OutputString + Upper(@char)
              Set @IsStartOfWord = 0
            End
          End
          Else
          --Not beginning of a word, so send it to lower case
          Begin
            Set @OutputString = @OutputString + Lower(@Char)
          End
        End
      End
      Set @LastChar = @Char
      Set @Index = @Index + 1
    End
  End

  RETURN @OutputString
END 

Comments

The method makes a single pass through your string, so it is reasonably efficient in that regard. Of course, the slowest part of the function is determining whether a 3-letter word is an abbreviation or an English word. If speed is crucial, you can save all the words in a table and put an index on it. As I said above, it runs reasonably fast. When I tested it, it converted roughly 20,000 names in one second.

Usage

Because it is a function, you can use it in any query you use for moving data around. It works great when you have data from older sources that is all upper case. Naturally, my function is not perfect. One enhancement you could make would be to insert handling for 4-letter words, since I have seen some company names with 4-letter abbreviations.

Download the code here