citizen428.blog()

Try to learn something about everything

Quick and Dirty Simhash in Ruby

Today at work we ended up talking about simhashing (a hash function which generates similar hashes for similar inputs) and I found this nice article with a step by step explanation of the algorithm, so I wrote a quick Ruby version (needs 1.9):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
<span class='line'><span class="k">class</span> <span class="nc">String</span>
</span><span class='line'>
</span><span class='line'>  <span class="k">def</span> <span class="nf">shingles</span>
</span><span class='line'>    <span class="nb">self</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">//</span><span class="p">)</span><span class="o">.</span><span class="n">each_cons</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:join</span><span class="p">)</span><span class="o">.</span><span class="n">uniq</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'>  <span class="k">def</span> <span class="nf">simhash</span>
</span><span class='line'>    <span class="n">v</span> <span class="o">=</span> <span class="nb">Array</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span><span class='line'>    <span class="n">hashes</span> <span class="o">=</span> <span class="n">shingles</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:hash</span><span class="p">)</span>
</span><span class='line'>    <span class="n">hashes</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="nb">hash</span><span class="o">|</span>
</span><span class='line'>      <span class="nb">hash</span><span class="o">.</span><span class="n">to_s</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">chars</span><span class="o">.</span><span class="n">each_with_index</span> <span class="k">do</span> <span class="o">|</span><span class="n">bit</span><span class="p">,</span> <span class="n">i</span><span class="o">|</span>
</span><span class='line'>        <span class="n">bit</span><span class="o">.</span><span class="n">to_i</span> <span class="o">&amp;</span> <span class="mi">1</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">?</span> <span class="n">v</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="p">:</span> <span class="n">v</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-=</span> <span class="mi">1</span>
</span><span class='line'>      <span class="k">end</span>
</span><span class='line'>    <span class="k">end</span>
</span><span class='line'>    <span class="n">v</span><span class="o">.</span><span class="n">map</span><span class="p">{</span> <span class="o">|</span><span class="n">i</span><span class="o">|</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="o">?</span> <span class="mi">1</span> <span class="p">:</span> <span class="mi">0</span> <span class="p">}</span><span class="o">.</span><span class="n">join</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'>  <span class="k">def</span> <span class="nf">hamming_distance</span><span class="p">(</span><span class="n">other</span><span class="p">)</span>
</span><span class='line'>    <span class="n">other_sh</span> <span class="o">=</span> <span class="n">other</span><span class="o">.</span><span class="n">simhash</span>
</span><span class='line'>    <span class="nb">self</span><span class="o">.</span><span class="n">simhash</span><span class="o">.</span><span class="n">chars</span><span class="o">.</span><span class="n">each_with_index</span><span class="o">.</span><span class="n">inject</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">total</span><span class="p">,</span> <span class="p">(</span><span class="n">bit</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span><span class="o">|</span>
</span><span class='line'>      <span class="n">total</span> <span class="o">+=</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">bit</span> <span class="o">!=</span> <span class="n">other_sh</span><span class="o">[</span><span class="n">i</span><span class="o">]</span>
</span><span class='line'>      <span class="n">total</span>
</span><span class='line'>    <span class="k">end</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'><span class="k">end</span>
</span>

Getting the “features” of a string:

1
2
>> "This is a test string".shingles
#=> ["Th", "hi", "is", "s ", " i", " a", "a ", " t", "te", "es", ...]

Simhashing a string:

1
2
>> "This is a test string".simhash
#=> "0100110001100001001100010000000010001001001000110110000010101100"

Calculating the Hamming distance between the symhashes of two strings (higher numbers mean less similar, with 64 being the highest possible score):

1
2
"This is a test string".hamming_distance("This is another test string")
#=> 8

Comments

Copyright © 2016 - Michael Kohl - Powered by Octopress