Month: September 2017

Finding An Anagram with Linq and Regex

Posted on Updated on

Once again, thanks to Will Shortz and NPR for supplying a challenging puzzle so we can play with some fun .NET features!

Solution

This time, the puzzle was easy; the challenge was getting a list of song titles. I experienced the following issues getting a song list:

  • Wikipedia (normally the primo place for lists) didn’t have one.
  • The first decent alternate site
    • Spread its list over 85 pages
    • Used an ‘anti-robot’ policy that blocked my IP if I tried to access too many pages too fast. Argh!
  • Fortunately, I was able to find GoldStandardSongList.com, a nice site with 5,000 titles all on one page☺

Code Techniques Used

The Secret Trick to Solving Anagrams

SlowThe first time you try to solve an anagram, you might try some inefficient method, like building every letter-combination of your candidate. That would be super-slow, because there are a lot of combinations for a 15 letter song title. Listing all combinations is an O(n!) algorithm. (Check-out Wikipedia Big-O Notation if you never  took a class in algorithm analysis.)

­……

Faster Method: Sort Your Candidate’s Letters

FastHere’s a better way: sort the letters for the target and candidate, then compare the result. For example:

  • ‘COOL HIT FARE IN LA’ → ‘aacefhiillnoort’
  • ‘HOTEL CALIFORNIA’    → ‘aacefhiillnoort’

Obviously, because the the sorted strings are equal, the two original strings must be anagrams. This is much faster, because you can sort a string in O(n∙ ln) time.

 

Compare n-Factorial Vs. n ∙Ln(n)

For comparison,

  • 15!            = 1,307,674,368,000 – n factorial
  • 15∙ln(15) = 17                             – n ln n

Hint: if you don’t like staring at an hourglass, use my algorithm! Would you rather run 17 operations, or 1.3 billion?

TriumphantFaceFor fun, challenge your bone-headed co-worker to a race and laugh while he pounds his screen in frustration!

 

Use Linq to Clean-Up your Song Title and Sort Letters: Easy!

Here’s the method I wrote that takes a song title as input, cleans it up, and sorts the letters:

private string CleanAndSortLetters(string input) {
  string result = input.Where(c => char.IsLetter(c))
                        .Select(c => char.ToLower(c))
                        .OrderBy(c => c)
                        .Aggregate("", (p, c) => p + c);
  return result;
}

Highlights

  • char.IsLetter  – Should be obvious! It removes spaces, quote marks, parenthesis, etc.
  • OrderBy – Sorts the letters in the string. Note that Linq will treat your string as an IEnumerable. Basically a list of char’s.
  • Aggregate – combines the char list back into a string

Use Regular Expressions to Extract Song Titles from a Web Site

The challenge is to extract just the song title from all the HTML markup.

SongListDisplay
Here’s how the HTML list looks in your browser. To the novice, it looks easy to grab just column-two from the formatted table.

Here’s the Some of the Same Table in HTML Markup

HtmlMarkup
You can barely see the ‘payload’ (Hotel California) for all the markup. Grabbing just that text is not as easy as it looks!

Regular Expressions Find Your Pattern!

LookForAPatternRegular expressions are a powerful means to match patterns, but kinda hard to learn. They’re worth the effort for a task like this, because doing it differently would take forever and be just as complicated.

Here’s my Regular Expression in action – it looks crazy, but I’ll explain it below:

//Build a pattern to match the song titles:
string pattern = "class=\"table1column2\"[^>]+><p>" +
				".*?<span[^>]+>([^<]+)</span>";
//Use the pattern to build a regular expression
Regex re = new Regex(pattern, 
        RegexOptions.Compiled | RegexOptions.IgnoreCase 
		| RegexOptions.Singleline);

//Find all the matches in the string 'pageText':
Match m = re.Match(pageText);

Quick note: the RegexOptions “Compiled” and “IgnoreCase” should be obvious. But “SingleLine” is counter-intuitive: you should use it when you try to match text containing linebreaks.

Also, I have some other code to loop through the matches, very easy but shown below, and also to extract the capture group and use it, again shown below.

Explanation of the Regex Pattern

The Regex packs a lot of meaning into a small space. Here’s a detailed explanation of every section. As you can see, it takes longer to explain it than it does to just run it. Skip it unless you have a super attention span!

Regex Fragment Explanation Match Comment
class=”table1column2″ Match the exact text <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
[^>]+ Any character that is not >, one or more repetitions. <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
The so-called “negated character class” is a great way to skip over text until it
hits a special character you know is always present, like the >
> Literal match <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
.*? Any character, as few as possible (“lazy star”) <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span
style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
I would prefer to use a negated character class, but there is a line-break in the text
I wish to skip over. Lazy star grabs matches until we hit the expression below.
<span Literal match <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
This terminates the lazy star match above
[^>]+ Any character that is not >, one or more repetitions <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
Another negated character class – super helpful!
> Another literal match <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
([^<]+) Finally! My Capture Group; grab everything until we hit a < <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California</span>
Basically another negated character class, wrapped in parenthesis
</span> Literal match <td class=”table1column2″ width=”30%” style=”padding-top: 0in; padding-bottom: 0in” valign=”top”><p>
<span style=”font-weight: bold; font-family:Arial, sans-serif”>Hotel California<span>

Here’s the Code to Solve the Puzzle

As you can see, the actual code is quite short. Again, explaining it is harder than just doing it!

private const string SONG_URL = 
  @"https://www.goldstandardsonglist.com/Pages_Sort_2a/Sort_2a.htm";
  
private void btnSolve_Click(object sender, RoutedEventArgs e) {
  string anagramThis = "COOL HIT FARE IN LA";
  Mouse.OverrideCursor = Cursors.Wait;

  string matchThis = CleanAndSortLetters(anagramThis);

  string pageText;
  //Open the song URL in a web client and read all the text
  using (WebClient wc = new WebClient()) {
    using (StreamReader sr = new StreamReader(wc.OpenRead(SONG_URL))) {
      pageText = sr.ReadToEnd();
    }
  }

  string pattern = "class=\"table1column2\"[^>]+" +
           "><p>.*?<span[^>]+>([^<]+)</span>";
  Regex re = new Regex(pattern, 
      RegexOptions.Compiled | RegexOptions.IgnoreCase);

  Match m = re.Match(pageText);
  //Examine every match
  while (m.Success) {
    //Grab the capture group and remove line breaks etc.
    string songName = m.Groups[1].Value
      .Replace("\r", "")
      .Replace("\n", " ");
    string candidate = CleanAndSortLetters(songName);
    //The key to solving the puzzle:
    if (candidate == matchThis)
      txtAnswwer.Text += songName + "\n";

    m = m.NextMatch();
  }
  Mouse.OverrideCursor = null;
}

Code Comments

I left-out the method ‘CleanAndSortLetters‘ because I already listed it above.

If you download my code and run it, you can set a breakpoint and see that it takes several seconds to download the song list. But, after that slow operation, the rest of the code runs in a few milliseconds!

Summary

SummarySolving the puzzle is easy with Linq; you just use the Linq operators to sort the letters, remove spaces etc, and convert to lower case. After that, you compare it to the target; if they are the same, you’ve found a winner!

 

 

The hard part is getting your list of song titles. Get the raw HTML with WebClient.Open, then use Regular Expressions to extract patterns from your HTML markup. Most of my Regex pattern consists of literal matches, or else negated character classes. As you remember, negated character classes allow you to match until you hit a designated character, such as < or >.

Download the Code

You can download my code here from my Dropbox account; note that they will force you to create a free account before proceeding with the download.