# Experiments with Elementary Data Mining using PowerShell – Part 2

In this blog post, we will pick up where we left off in the first one and fix the things that did not work. The first post was basically a failed attempt, but it covers some interesting topics and links to the original article that inspired this post. Kindly read it if you have the time.

What flummoxed me in the first post was that the Euclidean distances between articles that are related to each other were greater than the distances between articles that are not. So, I went back to the original article on data mining by Alexandru Nedelcu (Twitter|Blog) and read it in more detail. If you look at Euclidean distance closely, it really is a measure of dissimilarity, i.e. the greater the distance, the more dissimilar the two things are. Also, I had not given Ruby Array operations much thought; I had simply implemented what I thought was right.
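For reference, this is the textbook definition of Euclidean distance between two points in a space with n tags. The symbols are my own shorthand, not taken from the original article:

$$
A = (a_1, \dots, a_n), \quad B = (b_1, \dots, b_n), \qquad d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

When every coordinate is 0 or 1, each term under the root is 1 exactly when a tag is present in one point but not in the other. The distance is therefore the square root of the number of tags on which the two points disagree: identical tag sets give 0, and every mismatch pushes the value up.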

The curious thing about Ruby Arrays is a method called zip, which converts each argument passed to it into an array and merges the elements of the caller with the corresponding elements of each argument. This means that if you have two arrays, ‘arrOne’ and ‘arrTwo’, defined as follows:

arrOne = [4, 5, 6]

arrTwo = [7, 8, 9]

Then an operation like [1,2,3].zip(arrOne, arrTwo) would produce output like this:

[[1, 4, 7], [2, 5, 8], [3, 6, 9]]

The first thing to do was to mimic this behavior in PowerShell. The version below zips only two arrays, but that suffices for our purposes. Another reason for doing this is that it makes the rest of the code very small (you will see why in a minute).

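```
# Pairs up the elements of $a and $b, like Ruby's Array#zip with one argument.
Function Get-RbyZip($a, $b){
    [int]$i = 0
    $lenA = $a.length
    $lenB = $b.length
    $aryRes = @()
    try{
        for($i = 0; $i -lt $lenA; $i++)
        {
            if($i -ge $lenB){
                # $b is shorter than $a: pad the pair with $null
                $aryRes += , @($a[$i], $null)
            }else{
                # The unary comma appends the pair as one nested element
                $aryRes += , @($a[$i], $b[$i])
            }
        }
        return $aryRes
    }
    catch{
        Write-Host "Error occurred in [Get-RbyZip]" -Fore Red
    }
}
```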
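The `+= ,` in Get-RbyZip is easy to misread, so here is a small standalone sketch of why the unary comma is there. The variable names are made up for the illustration; the point is only that `+=` on its own concatenates arrays, while the comma wraps the pair so that it is appended as a single element.

```powershell
# Without the unary comma, += concatenates and the pairs are flattened.
$flat = @()
$flat += @(1, 4)
$flat += @(2, 5)

# With the unary comma, each pair is appended as one nested element.
$nested = @()
$nested += , @(1, 4)
$nested += , @(2, 5)

"flat:   $($flat.Count) elements"          # 4 scalars
"nested: $($nested.Count) elements"        # 2 pairs
"second pair: $($nested[1] -join ', ')"    # 2, 5
```

Without the comma, the distance functions would have nothing to index into as a pair.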

We need to tweak our Get-Point function as well, since it is the heart of our code and needed some stupidity fixed (mine; I was to blame).

```
# Maps a tag set onto the tag space: 1 where the tag is present, 0 where it is not.
Function Get-Point($tagSet, $tagSpace)
{
    try{
        $pntInSpace = $tagSpace|%{if($tagSet -contains $_){1}else{0}}
        return $pntInSpace
    }catch{
        Write-Host "Error occurred in [Get-Point]" -Fore Red
    }
}
```
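To make the idea of a "point" concrete, here is a minimal sketch that uses the same expression as Get-Point on a tiny, made-up tag space. Note that `-contains` compares strings case-insensitively by default, so 'powershell' matches 'PowerShell'.

```powershell
# Hypothetical tag space (every known tag) and the tags of one article.
$tagSpace = 'csv', 'perfmon', 'perl', 'PowerShell'
$tagSet   = 'powershell', 'csv'

# Same expression as in Get-Point: 1 where the tag is present, else 0.
$point = $tagSpace | ForEach-Object { if ($tagSet -contains $_) { 1 } else { 0 } }

$point -join ', '    # 1, 0, 0, 1
```

Every article ends up as a vector of the same length, one coordinate per tag in the tag space, which is what lets us compare them.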

So, now our functions for calculating the distances look smaller than the last time you saw them because of that one call to Get-RbyZip. And since we now have a matrix of pairs to work with, the calculations are a lot easier to follow and understand.

```
# Euclidean distance: 0 means identical points, larger means less similar.
Function Get-EuclideanDistance($pntA, $pntB)
{
    try{
        $sumSq = 0
        $zipC = Get-RbyZip $pntA $pntB
        for($i=0; $i -lt $zipC.length; $i++){
            # Each row of $zipC is a pair: [0] comes from $pntA, [1] from $pntB
            $sumSq += [Math]::Pow(($zipC[$i][0] - $zipC[$i][1]),2)
        }

        $euclidDist = [Math]::Sqrt($sumSq)
        return $euclidDist
    }
    catch{
        Write-Host "Error occurred in [Get-EuclideanDistance]" -Fore Red
    }
}

# Cosine similarity: dot product divided by the product of the magnitudes.
Function Get-CosineSimilarity($pntA, $pntB)
{
    try{
        $sumSqA = 0; $sumSqB = 0; $dotPrd_AB = 0
        $zipC = Get-RbyZip $pntA $pntB
        for($i=0;$i -lt $zipC.length; $i++){
            $sumSqA += [Math]::Pow($zipC[$i][0],2)
            $sumSqB += [Math]::Pow($zipC[$i][1],2)
            $dotPrd_AB += (($zipC[$i][0])*($zipC[$i][1]))
        }

        $magA = [Math]::Sqrt($sumSqA)
        $magB = [Math]::Sqrt($sumSqB)

        # Guard against division by zero when either point has no tags set
        if(($magA -gt 0) -and ($magB -gt 0)){
            $cosSimilarity = $dotPrd_AB/($magA*$magB)
        }else{
            $cosSimilarity = 0
        }

        return $cosSimilarity
    }catch{
        Write-Host "Error occurred in [Get-CosineSimilarity]" -Fore Red
    }
}
```
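If you want to check the arithmetic by hand, here is a standalone sketch that computes both measures for two made-up points without going through Get-RbyZip. The points and variable names are assumptions for the illustration only.

```powershell
# Two hypothetical points in a three-tag space.
$pntA = 1, 1, 0
$pntB = 1, 0, 1

$sumSq = 0; $dot = 0; $sumSqA = 0; $sumSqB = 0
for ($i = 0; $i -lt $pntA.Length; $i++) {
    $sumSq  += [Math]::Pow($pntA[$i] - $pntB[$i], 2)   # squared difference
    $dot    += $pntA[$i] * $pntB[$i]                   # dot product term
    $sumSqA += [Math]::Pow($pntA[$i], 2)
    $sumSqB += [Math]::Pow($pntB[$i], 2)
}

$euclid = [Math]::Sqrt($sumSq)
$cosine = $dot / ([Math]::Sqrt($sumSqA) * [Math]::Sqrt($sumSqB))

"Euclidean distance: $([Math]::Round($euclid, 4))"   # 1.4142
"Cosine similarity:  $([Math]::Round($cosine, 4))"   # 0.5
```

The two points disagree on two tags, so the distance is the square root of 2, and they share one tag out of two each, which gives a cosine of one half.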

Now here is the biggest change, and probably the one that fixed most of the issues: the way we were generating reference points for our comparisons. Previously, instead of considering the whole dataset, we limited ourselves to the data from one subset belonging to a particular item and tried to fix its point in space relative to a different set that was not connected to it in any way. The new Measure-Similarity builds a single tag space from the tags we are matching on plus the tags of every article, so all points live in the same space. Identifying this problem was the eureka moment of my day. Once this was fixed, the results started looking much better.

```
Function Measure-Similarity($hshArticle, $byTheseTags){

    $articleTags = $hshArticle.Keys|%{$hshArticle[$_].Tags}
    $tagSpace = ($byTheseTags + $articleTags)|Sort|Select -Unique

    $thisPoint = Get-Point $byTheseTags $tagSpace

    $articleToPoint = @{}

    foreach($key in $hshArticle.Keys){
        $keyTags = $hshArticle[$key].Tags
        $articleToPoint += @{$key = (Get-Point $keyTags $tagSpace)}
    }

    $euclidDist = @{}
    $cosSimilarity = @{}
    foreach($article in ($articleToPoint.Keys|Sort)){
        $euclidDist += @{$article = (Get-EuclideanDistance $thisPoint $articleToPoint[$article])}
        $cosSimilarity += @{$article = (Get-CosineSimilarity $thisPoint $articleToPoint[$article])}
    }

    return $euclidDist, $cosSimilarity
}
```
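One detail worth knowing about the tag space: `Select -Unique` compares strings case-sensitively, while the `-contains` in Get-Point does not. Tags that differ only in casing therefore become two dimensions, and an article carrying either spelling gets a 1 in both. The sketch below, on made-up tags, compares this with `Sort-Object -Unique`, which ignores case by default. Treat it as a possible tweak rather than something the code above does.

```powershell
# Hypothetical tags, with one tag spelled in two casings and one repeated.
$tags = 'PowerShell', 'csv', 'Powershell', 'perfmon', 'csv'

# As in Measure-Similarity: Select-Object -Unique is case-sensitive.
$caseSensitive = $tags | Sort-Object | Select-Object -Unique

# Possible alternative: Sort-Object -Unique is case-insensitive by default.
$caseInsensitive = $tags | Sort-Object -Unique

"Select-Object -Unique: $($caseSensitive.Count) dimensions"     # 4
"Sort-Object -Unique:   $($caseInsensitive.Count) dimensions"   # 3
```

Whether the extra dimension matters depends on your data, but it is something to keep in mind if the numbers look slightly off.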

If you are looking to run some quick tests using the new code, here is some data that I was using.

```
# Sample articles keyed by title, each with its list of tags
$hshArticle = @{}
$hshArticle["Working with Perfmon CSV logs in Powershell - Part 3"] = @{Tags = @("Fun-Stuff")}
$hshArticle["Working with Perfmon CSV logs in Powershell - Part 2"] = @{Tags = @("PowerShell","csv","csv file","csv format","DBD","perfmon","perl")}
$hshArticle["Working with Perfmon CSV logs in Powershell - Part 1"] = @{Tags = @("PowerShell","csv","csv file","measure","perfmon","Powershell","summarize")}

# Tags we want to find similar articles for
$matchTags = "PowerShell", "csv","perfmon"

$ed, $cs = Measure-Similarity $hshArticle $matchTags

Write-Host "For Euclidean Distance, the smaller the value, the more similar two things are" -Fore Green
Write-Host "For Cosine Similarity, the closer to 1 the better, since the cosine of a zero angle is 1" -Fore Green
foreach($article in ($ed.Keys|Sort)){
    Write-Host " Article:" $article -Fore Green
    Write-Host "  Euclidean Distance:" $ed[$article] -Fore Yellow
    Write-Host "  Cosine Similarity: " $cs[$article] -Fore Yellow
}
```

And here is how the output looks with this test data:

That is all for today folks!

[Coming Soon: Learning from Turing|How to prove you really are human? using fun Powershell Quizzes] 