citizen428.blog()

Try to learn something about everything

XML Diffs With Bash and Awk

Slightly modified version of a post I originally wrote for our company blog.

When importing data at work, we often have to deal with XML. This generally works fine, but the format’s structured nature also means that you can’t just treat it like any old text file.

That’s something we recently had to work around when we wanted to generate a daily XML diff that contains only the elements that changed since the previous feed. Of course there are several open source tools for diffing XML (e.g. diffxml or xmldiff), but since we couldn’t get them to do what we wanted in a reasonable amount of time, we decided to roll our own.

The final solution is a 71-line bash script that downloads a zip, extracts it, generates MD5 sums for every element and then creates a diff between this new list and the previous list of MD5 sums. Once we know which elements have changed, we merge them into a new feed, which then gets handed to our importer. The awesome xmlstarlet was a great help in this, as was battle-tested old awk.

Let’s look at an interesting snippet from the script:

```bash
# Emit one "guid|<item>...</item>" record per item, then hash each item.
xmlstarlet sel -I -t -m "//item" -v "./guid" -o "|" -c "." -n - |
  sed -e '...' |
  awk \
    'BEGIN {
      FS="|"
      RS="\n"
    }
    {
      id=$1
      # Build a shell pipeline that prints the MD5 sum of the item ($2)
      command="printf \"%s\" \"" $2 "\" | md5sum | cut -d\" \" -f1"
      command | getline md5
      close(command)
      print id":"md5
    }' >> "$MD5_DIR/vendor-md5-$TODAY"
```

Here we use xmlstarlet to iterate over all the items in the feed (the XPath "//item"), print the value of the "guid" element (-v "./guid"), output a pipe character (-o "|") and then copy the current element followed by a newline (-c "." -n). This output then gets piped through sed for some cleanup (omitted here for brevity’s sake) before awk takes the part after each "|", generates an MD5 sum for it and finally produces a file that looks like this:

```
rKKTZ:4012fced7c4cd77da607d294fbb8b5b6
hC7Jr:39245a0f9a976e6d47c0e2d76abf6238
...
```
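One caveat with the awk approach: it pastes each element into a shell command string, so an element containing a double quote, a `$` or a backtick would break the command unless the sed step escapes it first. As a rough sketch of a variation (not what our script does), the hashing can also be done in a plain shell loop that never re-parses the element as shell code. The function name `hash_records` is made up for this example, and it assumes the cleaned-up input has exactly one `guid|element` record per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original script: reads "guid|xml"
# records from stdin (one per line) and prints "guid:md5" for each.
hash_records() {
  local id body md5
  # IFS='|' splits at the first pipe: id gets the guid, body the rest of the line
  while IFS='|' read -r id body; do
    # Skip lines without a guid, e.g. blank lines
    [ -n "$id" ] || continue
    # printf '%s' feeds the body to md5sum as-is, without a trailing newline
    md5=$(printf '%s' "$body" | md5sum | cut -d' ' -f1)
    printf '%s:%s\n' "$id" "$md5"
  done
}
```

It would take the place of the awk stage, reading the sed output on stdin and appending to the same daily MD5 file. It still starts one `md5sum` process per element, just like the awk version.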

Now that we are able to create a daily list of MD5 sums, it’s easy to generate the diff feed:

```bash
if [ -e "$MD5_DIR/vendor-md5-last" ] ; then
  # Ids of the lines that are new or different in today's list ("> " lines in the diff)
  changed=$(diff "$MD5_DIR/vendor-md5-last" "$MD5_DIR/vendor-md5-$TODAY" |
            grep "^>" |
            cut -d":" -f 1 |
            cut -b 1-2 --complement)

  for record in $changed ; do
    # $FILE_PATTERN is left unquoted, presumably so the shell expands it to the feed files
    f=$(grep -F -l "<guid>$record</guid>" $FILE_PATTERN)
    xmlstarlet sel -I -t -c "/rss/channel/item[guid='$record']" $f >> "vendor-import-$TODAY.xml"
  done
fi
```

Here we collect the ids of the changed elements into a whitespace-separated list, over which we then iterate. In the loop, grep first finds the feed file that contains the right guid, and then we once again use xmlstarlet to extract the matching item from that file.
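Since diff compares the two lists line by line, items that merely move to a different position in the feed can show up as changed too. If that ever becomes a problem, the comparison could be made independent of line order. The following is only a sketch of that idea, not the code we run, and the function name `changed_ids` is invented for the example. It assumes both files use the `id:md5` format shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original script.
# $1: previous "id:md5" list, $2: current "id:md5" list.
# Prints the id of every line in $2 that does not appear verbatim in $1,
# i.e. items that are new or whose MD5 sum changed.
changed_ids() {
  awk -F: '
    # First file: remember every old id:md5 line
    FILENAME == ARGV[1] { seen[$0] = 1; next }
    # Second file: print the id of any line not seen before
    !($0 in seen)       { print $1 }
  ' "$1" "$2"
}
```

With an old list of `a:1` and `b:2` and a new list of `a:1`, `b:3` and `c:4`, this prints `b` and `c`, one per line. Like the diff version, it ignores items that disappeared from the feed.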

I’m quite happy with the result: it does exactly what we want it to do and is also reasonably fast. This is a good example of how familiar Unix tools can be combined to create fairly concise solutions for non-trivial problems.

Copyright © 2016 - Michael Kohl - Powered by Octopress