Today at work we ended up talking about simhashing (a hash function which generates similar hashes for similar inputs) and I found this nice article with a step by step explanation of the algorithm, so I wrote a quick Ruby version (needs 1.9):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
<span class='line'><span class="k">class</span> <span class="nc">String</span>
</span><span class='line'>
</span><span class='line'> <span class="k">def</span> <span class="nf">shingles</span>
</span><span class='line'> <span class="nb">self</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">//</span><span class="p">)</span><span class="o">.</span><span class="n">each_cons</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&</span><span class="ss">:join</span><span class="p">)</span><span class="o">.</span><span class="n">uniq</span>
</span><span class='line'> <span class="k">end</span>
</span><span class='line'>
</span><span class='line'> <span class="k">def</span> <span class="nf">simhash</span>
</span><span class='line'> <span class="n">v</span> <span class="o">=</span> <span class="nb">Array</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span><span class='line'> <span class="n">hashes</span> <span class="o">=</span> <span class="n">shingles</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&</span><span class="ss">:hash</span><span class="p">)</span>
</span><span class='line'> <span class="n">hashes</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="nb">hash</span><span class="o">|</span>
</span><span class='line'> <span class="nb">hash</span><span class="o">.</span><span class="n">to_s</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">chars</span><span class="o">.</span><span class="n">each_with_index</span> <span class="k">do</span> <span class="o">|</span><span class="n">bit</span><span class="p">,</span> <span class="n">i</span><span class="o">|</span>
</span><span class='line'> <span class="n">bit</span><span class="o">.</span><span class="n">to_i</span> <span class="o">&</span> <span class="mi">1</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">?</span> <span class="n">v</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="p">:</span> <span class="n">v</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-=</span> <span class="mi">1</span>
</span><span class='line'> <span class="k">end</span>
</span><span class='line'> <span class="k">end</span>
</span><span class='line'> <span class="n">v</span><span class="o">.</span><span class="n">map</span><span class="p">{</span> <span class="o">|</span><span class="n">i</span><span class="o">|</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">0</span> <span class="o">?</span> <span class="mi">1</span> <span class="p">:</span> <span class="mi">0</span> <span class="p">}</span><span class="o">.</span><span class="n">join</span>
</span><span class='line'> <span class="k">end</span>
</span><span class='line'>
</span><span class='line'> <span class="k">def</span> <span class="nf">hamming_distance</span><span class="p">(</span><span class="n">other</span><span class="p">)</span>
</span><span class='line'> <span class="n">other_sh</span> <span class="o">=</span> <span class="n">other</span><span class="o">.</span><span class="n">simhash</span>
</span><span class='line'> <span class="nb">self</span><span class="o">.</span><span class="n">simhash</span><span class="o">.</span><span class="n">chars</span><span class="o">.</span><span class="n">each_with_index</span><span class="o">.</span><span class="n">inject</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">total</span><span class="p">,</span> <span class="p">(</span><span class="n">bit</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span><span class="o">|</span>
</span><span class='line'> <span class="n">total</span> <span class="o">+=</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">bit</span> <span class="o">!=</span> <span class="n">other_sh</span><span class="o">[</span><span class="n">i</span><span class="o">]</span>
</span><span class='line'> <span class="n">total</span>
</span><span class='line'> <span class="k">end</span>
</span><span class='line'> <span class="k">end</span>
</span><span class='line'>
</span><span class='line'><span class="k">end</span>
</span>
Getting the “features” of a string:
1
2
>> "This is a test string" . shingles
#=> ["Th", "hi", "is", "s ", " i", " a", "a ", " t", "te", "es", ...]
Simhashing a string:
1
2
>> "This is a test string" . simhash
#=> "0100110001100001001100010000000010001001001000110110000010101100"
Calculating the Hamming distance between the symhashes of two strings (higher numbers mean less similar, with 64 being the highest possible score):
1
2
"This is a test string" . hamming_distance ( "This is another test string" )
#=> 8