A few days ago, I was reading an article by Alexandru Nedelcu (Twitter | Blog) on data mining, titled “Data Mining: Finding Similar Items and Users”. I found it very interesting, as I had no clue how data mining is done. My advice: read Alex's post, as it is very good and it is fun to see the concepts instantly in action. One tiny problem is that the examples are in Ruby and, to be honest, Ruby is lost on me. Still, I think I managed to get the gist of the code. Wikipedia is always such a darling when you are looking for specifics, and it came through with some more information that was helpful to me. Along the way, I learnt some interesting things. For example, I learnt that clustering things into groups by their similarity/dissimilarity (or distance) from each other is a big field within data mining, and that there are some clever ways to do this grouping based on a set of rules.

So, armed with the article by Alex and some Wikipedia searches, I tried to do some data mining on my blog, more out of interest than anything else. I wanted to find out whether it was easy to find posts similar to one another. If I succeeded, extending this to find similar posts across the various blogs I follow would be the next logical step.

As mentioned in the first paragraph, grouping things by similarity/dissimilarity, or distance from one another, is our main aim. The most common of these distance measures is the Euclidean distance, defined as the geometric (straight-line) distance in multidimensional space. It is computed as:
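
Written out for two points p and q with n coordinates each (the symbol names are chosen here for illustration and are not taken from Alex's article), the standard textbook definition is:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
$$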

There is a slight problem with the Euclidean distance: we need to represent the data as points in a multi-dimensional space, so we first have to convert our data into points. One easy way to do this, as pointed out by Alex, is to represent a match in the collection by 1 and a miss by 0. OK, seems easy, so let’s do it:

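    Function Get-Point($tagValue, $tagsToMatch) {
        # One coordinate per entry in $tagsToMatch: 1 for a match, 0 for a miss.
        # String comparison with -eq is case-insensitive in PowerShell.
        $pntInSpace = $tagsToMatch | %{ if ($_ -eq $tagValue) { 1 } else { 0 } }
        $pntInSpace
    }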
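
As a quick illustration of the 1/0 encoding, here is a minimal sketch with made-up tag values (they are assumptions for the example, not data from the feed). It inlines the same comparison that Get-Point performs:

```powershell
# Illustrative values only; these tags are made up for this example.
$tagsToMatch = "PowerShell", "csv", "perfmon"
$tagValue = "CSV"

# Same idea as Get-Point: 1 where the tag matches, 0 elsewhere.
# -eq on strings is case-insensitive, so "CSV" matches "csv".
$point = @($tagsToMatch | ForEach-Object { if ($_ -eq $tagValue) { 1 } else { 0 } })

"$point"   # 0 1 0
```

Note that the resulting point has one coordinate per entry in the list being matched against, so its length depends on that list and not on the tag being tested.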

Now that we have a way of converting our data into points, the next thing to do is to calculate the Euclidean distance, since we already know the formula from above 🙂

    Function Get-EuclideanDistance($a, $b) {
        try {
            if ($a.Length -eq $b.Length) {
                $lenArr = $a.Length
                # Sum of squared coordinates for each point
                $sumSqB = ($b | %{ [Math]::Pow($_, 2) } | Measure-Object -Sum).Sum
                $sumSqA = ($a | %{ [Math]::Pow($_, 2) } | Measure-Object -Sum).Sum
                # Dot product of the two points
                $sumAB = 0
                for ($i = 0; $i -lt $lenArr; $i++) {
                    $sumAB += ($a[$i] * $b[$i])
                }
                $sumTwoAB = 2 * $sumAB
                # Expanded form of the squared differences, then the square root
                $euclideanDist = [Math]::Sqrt(($sumSqA + $sumSqB - $sumTwoAB))
                return $euclideanDist
            } else {
                throw "The points have different number of dimensions"
            }
        }
        catch {
            # Errors are only printed; the function then returns nothing
            Write-Host $_ -Fore Red
        }
    }
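
Rather than subtracting coordinate by coordinate, the function works from the expanded form of the squared differences. Using a and b for the two points, to match the parameter names, the identity it relies on is:

$$
\sum_{i=1}^{n} (a_i - b_i)^2 = \sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - 2\sum_{i=1}^{n} a_i b_i
$$

The three sums on the right correspond to `$sumSqA`, `$sumSqB` and `$sumAB` in the code, and the square root of the whole expression is the Euclidean distance.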

A slight variation, the squared Euclidean distance, puts more weight on points that are farther away from one another by removing the square root from the picture:

    Function Get-SquaredEuclideanDistance($a, $b) {
        try {
            if ($a.Length -eq $b.Length) {
                $lenArr = $a.Length
                $disAtoB = 0
                for ($i = 0; $i -lt $lenArr; $i++) {
                    $disAtoB += [Math]::Pow(($b[$i] - $a[$i]), 2)
                }
                # No square root here: taking it would give the plain Euclidean distance
                $sqeuclideanDist = $disAtoB
                return $sqeuclideanDist
            } else {
                throw "The points have different number of dimensions"
            }
        }
        catch {
            Write-Host $_ -Fore Red
        }
    }

The two methods above give dramatically different results if the scale of measurement changes, which is disastrous if you are grouping objects by fixed weights. So this may be a problem, and if so, is there a better way? Yes, there is, and it is introduced in the next paragraph.

The third method, called cosine similarity, measures the similarity between two vectors by the cosine of the angle between them. It is commonly used for text mining of documents. The diagrammatic representation and formulation are given below.
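
In the usual notation, with a and b as the two vectors and θ as the angle between them (the symbols are chosen here to match the parameter names used in the code), the standard definition is:

$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

For vectors with no negative coordinates, such as the 1/0 points used here, the value lies between 0 (nothing in common) and 1 (same direction).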

Now that we know what cosine similarity is and the formula to calculate it, let us look at the code 🙂

    Function Get-CosineSimilarity($a, $b) {
        try {
            if ($a.Length -eq $b.Length) {
                $lenArr = $a.Length
                $sumSqB = ($b | %{ [Math]::Pow($_, 2) } | Measure-Object -Sum).Sum
                $sumSqA = ($a | %{ [Math]::Pow($_, 2) } | Measure-Object -Sum).Sum
                if (($sumSqA -gt 0) -and ($sumSqB -gt 0)) {
                    # Dot product divided by the product of the two magnitudes
                    $sumAB = 0
                    for ($i = 0; $i -lt $lenArr; $i++) {
                        $sumAB += ($a[$i] * $b[$i])
                    }
                    $cosineSimilarity = $sumAB / ([Math]::Sqrt($sumSqA) * [Math]::Sqrt($sumSqB))
                } else {
                    # A zero vector has no direction; treat the similarity as 0
                    $cosineSimilarity = 0
                }
                return $cosineSimilarity
            } else {
                throw "The points have different number of dimensions"
            }
        }
        catch {
            Write-Host $_ -Fore Red
        }
    }
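
To see the scale sensitivity mentioned earlier side by side with cosine similarity, here is a small self-contained sketch. The helper names (Get-Dot, Get-Distance, Get-Cosine) and the sample vectors are made up for this demo; they are compact stand-ins, not the functions from this post.

```powershell
# Compact stand-ins written only for this demo (hypothetical helper names).
function Get-Dot($a, $b) {
    $sum = 0
    for ($i = 0; $i -lt $a.Length; $i++) { $sum += $a[$i] * $b[$i] }
    $sum
}
function Get-Distance($a, $b) {
    [Math]::Sqrt((Get-Dot $a $a) + (Get-Dot $b $b) - 2 * (Get-Dot $a $b))
}
function Get-Cosine($a, $b) {
    (Get-Dot $a $b) / ([Math]::Sqrt((Get-Dot $a $a)) * [Math]::Sqrt((Get-Dot $b $b)))
}

# Made-up sample vectors; $bScaled is $b measured on a scale 10 times larger.
$a = 3, 4
$b = 4, 3
$bScaled = $b | ForEach-Object { $_ * 10 }

$d1 = Get-Distance $a $b         # about 1.41
$d2 = Get-Distance $a $bScaled   # about 45.22
$c1 = Get-Cosine $a $b           # 0.96
$c2 = Get-Cosine $a $bScaled     # 0.96

"Euclidean: $d1 vs $d2"
"Cosine:    $c1 vs $c2"
```

Rescaling one vector changes the Euclidean distance considerably, while the cosine similarity stays the same because only the angle between the vectors matters.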

Now that we have written the functions we need, the next point of interest is getting the relevant data to work on, which in our case is the RSS feed. Interacting with the feed is easy since it is in XML format, and we can quickly build a hashtable out of it:

    $objWebClient = New-Object System.Net.WebClient
    $hshArticle = @{}
    $feedUrl = 'https://sqlchow.wordpress.com/feed'
    $feedContent = [xml]$objWebClient.DownloadString($feedUrl)

    foreach ($item in $feedContent.rss.channel.item) {
        $articleTitle = $item.title
        $articleCategory = $item.category | %{ $_ }
        $hshArticle["$articleTitle"] = @{ Tags = $articleCategory }
    }

    #quickly check, if we have all the post titles
    $hshArticle.Keys | %{ $_ }

    #quickly check, if we have all the tags from each post
    $hshArticle.Keys | %{ $hshArticle[$_].Tags }

Finally, putting it all together:

    $matchTags = "PowerShell", "csv", "perfmon"

    Foreach ($key in $hshArticle.Keys) {
        Write-Host $key -Fore Green
        $pntA = $hshArticle[$key].Tags | %{ Get-Point $_ $matchTags }
        $dimInPntA = $pntA.Length
        Write-Host $pntA -ForegroundColor Yellow
        $pntB = 1..$dimInPntA | %{ 1 }
        #$pntB = $matchTags|%{Get-Point $_ $hshArticle[$key].Tags}
        Write-Host $pntB -ForegroundColor Yellow
        $ed = Get-EuclideanDistance $pntA $pntB
        $cs = Get-CosineSimilarity $pntA $pntB
        Write-Host "Euclidean distance: $ed" -Fore Cyan
        Write-Host "Cosine Similarity: $cs" -Fore Cyan
    }

And the output is as below:

Although the output looks in line with the principles shown above, I believe the output is wrong. I am still trying to figure out why, though. Also, it is 4:20 AM, so maybe I will get back to this topic tomorrow.
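
One possible lead, offered as a guess from reading the loop rather than a confirmed diagnosis: the pipeline calls Get-Point once per article tag and PowerShell flattens the results, so the point for a post ends up with (number of tags) times (number of match tags) coordinates instead of one coordinate per match tag. Posts with different tag counts then sit in spaces of different dimension, and comparing each one to an all-ones vector of the same length mostly measures how many tags the post has. The sketch below uses made-up sample tags and a hypothetical helper, Get-TagVector, to contrast the two encodings:

```powershell
# Hypothetical helper (not part of the original code):
# one coordinate per tag in $tagsToMatch, 1 if the article carries that tag.
function Get-TagVector($articleTags, $tagsToMatch) {
    foreach ($tag in $tagsToMatch) {
        if ($articleTags -contains $tag) { 1 } else { 0 }
    }
}

$matchTags  = "PowerShell", "csv", "perfmon"
$sampleTags = "PowerShell", "SQL Server", "perfmon", "tips"   # made-up tags

# Encoding as in the loop above: one 3-element result per article tag, flattened.
$flattened = @($sampleTags | ForEach-Object {
    $articleTag = $_
    $matchTags | ForEach-Object { if ($_ -eq $articleTag) { 1 } else { 0 } }
})

# Alternative encoding: always exactly one coordinate per match tag.
$vector = @(Get-TagVector $sampleTags $matchTags)

"Flattened length: $($flattened.Length)"   # Flattened length: 12
"Fixed vector:     $vector"                # Fixed vector:     1 0 1
```

With a fixed three-coordinate vector per post, an all-ones vector of length three stands for "has every tag in `$matchTags`", and the distances become comparable across posts.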

If you find anything that can help me, kindly let me know.
