Experiments with elementary Data Mining using PowerShell – Part 2


In this blog post, we will pick up where we left off in the first one, fixing the things that did not work. The first post was basically a failed attempt, but it covers some interesting topics and links to the original article that inspired this post. Kindly read it if you have the time.

What flummoxed me in the first post was that the Euclidean distances between articles that were related to each other were greater than the distances between ones that were not. So, I revisited the original article by Alexandru Nedelcu (Twitter|Blog) on data mining and started reading it in more detail. If you look at Euclidean distance closely, it really is a measure of dissimilarity, i.e. the greater the distance, the more dissimilar the two things are. Also, I had not given Ruby Array operations much thought; I had simply implemented what I thought was right.
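
To make the "distance is dissimilarity" point concrete, here is a minimal sketch. The helper Get-SketchDistance and the five-tag vectors are made up for illustration and are not part of the original code. With 0/1 tag vectors, every tag on which two items disagree adds 1 to the sum of squares, so the distance is simply the square root of the number of mismatched tags.

```powershell
# Hypothetical helper (not from the original post): Euclidean distance
# between two equal-length numeric vectors.
function Get-SketchDistance([double[]]$a, [double[]]$b) {
    $sumSq = 0.0
    for ($i = 0; $i -lt $a.Length; $i++) {
        $diff = $a[$i] - $b[$i]
        $sumSq += $diff * $diff
    }
    return [Math]::Sqrt($sumSq)
}

# 1 = tag present, 0 = tag absent, over a made-up five-tag space.
$query     = 1, 1, 1, 0, 0
$related   = 1, 1, 1, 1, 0   # disagrees with the query on one tag
$unrelated = 0, 0, 0, 1, 1   # disagrees with the query on all five tags

$dRelated   = Get-SketchDistance $query $related     # sqrt(1) = 1
$dUnrelated = Get-SketchDistance $query $unrelated   # sqrt(5), about 2.236

"Related:   $dRelated"
"Unrelated: $dUnrelated"
```

The smaller number belongs to the more similar pair, which is the behavior the rest of this post relies on.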

The curious thing about Ruby Arrays is a method called zip that converts each argument passed to it into an array and merges the elements of the caller with the corresponding elements from each argument. What this means is that if you have two arrays ‘arrOne’ and ‘arrTwo’ which are defined as follows:

arrOne = [4, 5, 6]

arrTwo = [7, 8, 9]

Then, an operation like [1,2,3].zip(arrOne, arrTwo) would result in an output similar to this:

[[1, 4, 7], [2, 5, 8], [3, 6, 9]]

The first thing to do was to mimic this behavior in PowerShell, at least for two arrays. For our purposes that would suffice. One other reason for doing this is that it makes the code very small (you will see why in a minute).

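Function Get-RbyZip($a,$b){
 # Pairs $a[i] with $b[i]; pads with $null when $b is shorter, like Ruby's zip
 $lenA = $a.length
 $lenB = $b.length
 $aryRes = @()
 try{
  for($i=0;$i -lt $lenA; $i++)
  {
   if($i -ge $lenB){
    $aryRes += , @($a[$i],$null)
   }else{
    $aryRes += , @($a[$i],$b[$i])
   }
  }
  # The leading comma stops PowerShell from unrolling a result that holds a single pair
  return ,$aryRes
 }
 catch{
  Write-Host "Error occurred in [Get-RbyZip]" -Fore Red
 }
}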

We need to tweak our Get-Point function as well, since it is the heart of our code and needed some stupidity fixed (I was to blame).

Function Get-Point($tagSet,$tagSpace)
{
 try{
  $pntInSpace = $tagSpace|%{if($tagSet -contains $_){1}else{0}}
  return $pntInSpace
 }catch{
  Write-Host "Error occurred in [Get-Point]" -Fore Red
 }
}
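
As a quick illustration of what Get-Point produces, here is the same expression run inline against a made-up tag space. The tag names below are only examples; the real tag space is built from the articles' own tags later in the post.

```powershell
# Made-up tag space (every distinct tag we know about) and one set of tags to place in it.
$tagSpace = "csv", "perfmon", "PowerShell", "twitter", "XML"
$tagSet   = "PowerShell", "csv", "perfmon"

# Same idea as Get-Point: emit 1 where the tag is present and 0 where it is not.
# Note that -contains compares strings case-insensitively by default.
$point = $tagSpace | ForEach-Object { if ($tagSet -contains $_) { 1 } else { 0 } }

$point -join ","   # 1,1,1,0,0
```

Each position in the result corresponds to one tag in the tag space, so two points are only comparable when they were built against the same tag space.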

So, now our functions for getting the distances come out, and they look smaller than the last time you saw them thanks to that one call to Get-RbyZip. Since we have a matrix to work with, the calculations are also a lot easier to follow and understand.

Function Get-EuclideanDistance($pntA, $pntB)
{
 try{
  # Each row of $zipC is a pair: (coordinate of A, coordinate of B)
  $zipC = Get-RbyZip $pntA $pntB
  $sumSq = 0
  for($i=0; $i -lt $zipC.length; $i++){
   $sumSq += [Math]::Pow(($zipC[$i][0] - $zipC[$i][1]),2)
  }
  
  $euclidDist = [Math]::Sqrt($sumSq)
  return $euclidDist
 }
 catch{
  Write-Host "Error occurred in [Get-EuclideanDistance]" -Fore Red
 }
}

Function Get-CosineSimilarity($pntA, $pntB)
{
 try{
  # Each row of $zipC is a pair: (coordinate of A, coordinate of B)
  $zipC = Get-RbyZip $pntA $pntB
  $sumSqA = 0
  $sumSqB = 0
  $dotPrd_AB = 0
  for($i=0;$i -lt $zipC.length; $i++){
   $sumSqA += [Math]::Pow($zipC[$i][0],2)
   $sumSqB += [Math]::Pow($zipC[$i][1],2)
   $dotPrd_AB += (($zipC[$i][0])*($zipC[$i][1]))
  }
  
  $magA = [Math]::Sqrt($sumSqA)
  $magB = [Math]::Sqrt($sumSqB)
  
  # A zero-length vector has no direction, so report no similarity
  if(($magA -gt 0) -and ($magB -gt 0)){
   $cosSimilarity = $dotPrd_AB/($magA*$magB)
  }else{
   $cosSimilarity = 0
  }
  
  return $cosSimilarity
 }catch{
  Write-Host "Error occurred in [Get-CosineSimilarity]" -Fore Red
 }
}
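
To see how the two measures can disagree, here is a small sketch over made-up vectors. The helpers Get-SketchDot, Get-SketchCosine and Get-SketchEuclid are hypothetical names used only for this illustration; they are not the functions above. The query has three tags, one article shares all three but carries four extra tags, and another article has a single unrelated tag.

```powershell
# Hypothetical helpers for illustration only.
function Get-SketchDot([double[]]$a, [double[]]$b) {
    $sum = 0.0
    for ($i = 0; $i -lt $a.Length; $i++) { $sum += $a[$i] * $b[$i] }
    return $sum
}

function Get-SketchCosine([double[]]$a, [double[]]$b) {
    $magA = [Math]::Sqrt((Get-SketchDot $a $a))
    $magB = [Math]::Sqrt((Get-SketchDot $b $b))
    if ($magA -eq 0 -or $magB -eq 0) { return 0 }   # zero vector: treat as not similar
    return (Get-SketchDot $a $b) / ($magA * $magB)
}

function Get-SketchEuclid([double[]]$a, [double[]]$b) {
    # |a - b|^2 = a.a + b.b - 2(a.b)
    return [Math]::Sqrt((Get-SketchDot $a $a) + (Get-SketchDot $b $b) - 2 * (Get-SketchDot $a $b))
}

# Made-up eight-tag space: 1 = tag present, 0 = tag absent.
$query     = 1, 1, 1, 0, 0, 0, 0, 0
$related   = 1, 1, 1, 1, 1, 1, 1, 0   # shares all three query tags, plus four others
$unrelated = 0, 0, 0, 0, 0, 0, 0, 1   # shares nothing with the query

$edRelated   = Get-SketchEuclid $query $related     # sqrt(4) = 2
$edUnrelated = Get-SketchEuclid $query $unrelated   # sqrt(4) = 2
$csRelated   = Get-SketchCosine $query $related     # 3 / sqrt(21), about 0.655
$csUnrelated = Get-SketchCosine $query $unrelated   # 0

"Euclidean: related=$edRelated unrelated=$edUnrelated"
"Cosine:    related=$csRelated unrelated=$csUnrelated"
```

In this contrived case the Euclidean distance cannot tell the two articles apart, while the cosine similarity can, because it looks at the angle between the vectors rather than at how many tags each one carries.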

Now here is the biggest change, and probably the one that fixed most of the issues: the way we were generating reference points for our comparisons. Instead of considering the whole dataset, we had limited ourselves to only the data from one subset belonging to a particular item, and tried to fix its point in space relative to a different set that was not connected to this one in any way. Identifying this problem was the eureka moment of my day. Once this was fixed, so that every point is built against one tag space covering all the articles plus the tags we are matching on, the results started looking much better.

Function Measure-Similarity($hshArticle, $byTheseTags){
 
 $articleTags = $hshArticle.Keys|%{$hshArticle[$_].Tags}
 $tagSpace = ($byTheseTags + $articleTags)|Sort|Select -Unique
 
 $thisPoint = Get-Point $byTheseTags $tagSpace
 
 $articleToPoint = @{}
 
 foreach($key in $hshArticle.Keys){
  $keyTags = $hshArticle[$key].Tags
  $articleToPoint += @{$key = (Get-Point $keyTags $tagSpace)}
 }
 
 $euclidDist = @{}
 $cosSimilarity = @{}
 foreach($article in ($articleToPoint.Keys|Sort)){
  $euclidDist += @{$article = (Get-EuclideanDistance $thisPoint $articleToPoint[$article])}
  $cosSimilarity += @{$article = (Get-CosineSimilarity $thisPoint $articleToPoint[$article])}
 }
 
 return $euclidDist, $cosSimilarity
}

If you are looking to run some quick tests using the new code, here is some data that I was using.

$hshArticle = @{}
$hshArticle["Working with Perfmon CSV logs in Powershell - Part 3"] = @{Tags = @("Fun-Stuff")}
$hshArticle["Working with Perfmon CSV logs in Powershell - Part 2"] = @{Tags = @("PowerShell","csv","csv file","csv format","DBD","perfmon","perl")}
$hshArticle["Working with Perfmon CSV logs in Powershell - Part 1"] = @{Tags = @("PowerShell","csv","csv file","measure","perfmon","Powershell","summarize")}
$hshArticle["Reading a tweeters timeline using PowerShell-Take2"] = @{Tags = @("PowerShell","Powershell","Twitter API","XML","proof of concept","twitter","formatting")}
$hshArticle["Reading a tweeters timeline using PowerShell"]= @{Tags = @("PowerShell","Powershell","Twitter API","XML","proof of concept","twitter","initial release")}

$matchTags = "PowerShell", "csv","perfmon"

$ed, $cs = Measure-Similarity $hshArticle $matchTags

Write-Host "For Euclidean Distance, the smaller the value, the more similar two things are" -Fore Green
Write-Host "For Cosine Similarity, the closer to 1 the better, as the cosine of zero is 1" -Fore Green
foreach($article in ($ed.Keys|Sort)){
 Write-Host " Article:" $article -Fore Green
 Write-Host "  Euclidean Distance:" $ed[$article] -Fore Yellow
 Write-Host "  Cosine Similarity: " $cs[$article] -Fore Yellow
}

And here is how the output looks with this test data:

[Image: Datamining. Caption: Fixed it hopefully]

That is all for today folks!

[Coming Soon: Learning from Turing|How to prove you really are human? using fun Powershell Quizzes]

About

By profession, I’m a SQL Server Database Administrator. I love to poke my nose into different corners and see how stuff looks in there. I keep looking for new things to do as my mind refuses to settle on one topic.

Posted in Fun-Stuff, PowerShell
