Some fun with #Powershell, build quizzes O_O!!!


Tolerate my verbosity as it is not everyday that you find interesting topics to explore.

Another day, and this time it is Brent Ozar’s (Blog|Twitter) turn to inspire me. In his blog post ‘Blog Posts I couldn’t develop in 2011’, he mentioned a curious little stored procedure called sp_Captcha which he developed for a client (Ok, I will not lie, I saw this post in January and I hope Brent does not mind my dalliance with Powershell). I was immediately intrigued and wanted to try and figure out if I could replicate that in T-SQL. As I started reading more about the topic, I found myself getting fascinated with methods used to distinguish between machines and humans. Then I went on to read about how machines learn. And finally ended up at MIST (Minimum Intelligence Signal Test), developed by Chris McKinstry.

MIST is an alternative to the Turing test which measures the ‘humanness’ of the responder and traditionally has been used for training artificial intelligence systems to imitate human responses. It seems that Chris, collected a set of 80,000 propositions called MindPixels that can be answered either by ‘yes’ or ‘no’, or by 1 or 0. The basic premise is that, given a set of 20 questions if a program is able to correctly guess the results of all propositions without having prior access to them would be considered intelligent. This, is where leave that concept and move in a completely different direction.

I set out to write T-SQL code. But, since finding the GAC-80K list was a pain. I hope you made the correlation between the two 80k’s. Anyways, since I was searching for the GAC-80K for a while and finally found it here, what you will find in the link is a C# project. If you unzip the project you will find a file ‘gac80k-06-july-2005.html’. You could rename it, something.txt did not find any html code in it. But, what it does contain is the list of propositions that Chris collected plus some 20K odd questions that the C# program collected by mining Wikipedia (that is my assumption).

So, what do mindpixels look like?  Here are a few examples:

  • 0.15        How hot does water need to boil?
  • 0.44        can a dog see in color?
  • 0.67        Does WSYIWYG mean what you see is what you get?

These are all questions with a yes or no answer, with the coherence value representing the average decision.  Meaning, if the coherence is close to zero or close to one then you can say that people confidently made a consistent reply one way or the other.  The first and third questions are typical examples of fairly confident decisions.  If the coherence is close to 0.5 then that means that the decision was uncertain, with no clear vote in either direction similar to the second question.

I was very excited, when I saw this file and went bonkers on the file. But, after initial euphoria settled I thought: “Dang, this is what I need to convert to T-SQL”. Like any good DBA I wanted to first investigate if there was intelligent way to get to these questions into a format that facilitates their insertion into a database. But, once Powershell came out of the box I completely forgot about the T-SQL part.

To put simply, our problem is that we have a file in which we can identify a format. We need to bring it into a format that can be used to tokenize it and put it into a database or build a hash. My vote goes to the hash table as it is much easier for me to test out the result set on the fly.

So, if you downloaded the file and explored it a little you would find that the first 33 lines are copyright information and some html links. Since, we are not interested in reading the whole file (the file contains 118831 lines, and is about 5Megs you could read the whole thing into memory if you so please) we will restrict our selection to 5940 lines (5% of the total number of lines in the file) after skipping the first 100 lines or so.

$gac80K = Get-Content -LiteralPath 'C:\ Reading-Materials\minimind\gac80k-06-july-2005.html'|?{$_ -match '^\s+\d\.\d{2}'}|Select -Skip 100 -First 5940

Let us further reduce this set to a 1000 randomly selected questions.

$quizQuest =  $gac80K | Get-Random -Count 1000

Once you, have the random set of questions we will try to build our hash from it. But, before that we need to parse our data and tokenize it.

[regex]$pattern = '^\s+(\d.\d{2})\t'
$var = [regex]::Split($quizQuest[100], $pattern)
$var

1.00
Is a year longer than a day?

$var[1]
1.00
$var[2]
Is a year longer than a day?

Now that we can tokenize the data, the only thing left to do is build a hash with this and play around with it as  you please.

$hashOfQuestions = @{}
[regex]$pattern = '^\s+(\d\.\d{2})\t'

foreach($question in $quizQuest)
{
$var = [regex]::Split($question, $pattern
$hashOfQuestions[$var[2]] = @{Answer=$var[1]}
}

$hashQuestions.Keys|Select -First 10 | %{Write-Host $_ -> $hashQuestions[$_].Answer -Fore Green}

Here is the output, I got:The output from command line

We have only reached a point where things start to get exciting. Maybe, we can build a fun quiz for our friends to test their humanness or maybe, just maybe you can build a system that will query Google or some similar engine for information and try to tweak the coherence of the question and see where you end up.  Or maybe we will be ‘Applying Common Sense’.

Very exciting, indeed.

Advertisements
About

By profession, I’m a SQL Server Database Administrator. I love to poke my nose into different corners and see how stuff looks in there. I keep looking for new things to do as my mind refuses to settle on one topic.

Tagged with:
Posted in Fun-Stuff, PowerShell

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: