XML Diffs With Bash and Awk

Slightly modified version of a post I originally wrote for our company blog.

When importing data at work, we often have to deal with XML. This generally works fine, but the format’s structured nature also means that you can’t just treat it like any old text file.

That’s something we recently had to work around when we wanted to generate a daily XML diff containing only the elements that changed since the previous feed. Of course there are several open source tools for diffing XML (e.g. diffxml or xmldiff), but since we couldn’t get them to do what we wanted in a reasonable amount of time, we decided to roll our own.

The final solution is a 71-line Bash script that downloads a zip file, extracts it, generates an MD5 sum for every element and then diffs this new list against the previous list of MD5 sums. Once we know which elements have changed, we merge them into a new feed, which then gets handed to our importer. The awesome xmlstarlet was a great help in this, as was battle-tested old awk.

Let’s look at an interesting snippet from the script:


xmlstarlet sel -I -t -m "//item" -v "./guid" -o "|" -c "." -n - |
  sed -e '...' |
  awk \
    'BEGIN {
      FS="|"
      RS="\n"
    }
    {
      id=$1
      command="printf \"%s\" \"" $2 "\" | md5sum | cut -d\" \" -f1"
      command | getline md5
      close(command)
      print id":"md5
    }' >> $MD5_DIR/vendor-md5-$TODAY

Here we use xmlstarlet to iterate over all the items in the feed (the XPath "//item"), print the value of the "guid" element (-v "./guid"), output a pipe character (-o "|") and then copy the current element followed by a newline (-c "." -n). This then gets piped through sed for some cleaning up (which I omitted here for brevity’s sake) before awk splits each line at the "|", generates an MD5 sum of the part after it and appends an id:md5 line to the day’s file, which ends up looking like this:

rKKTZ:4012fced7c4cd77da607d294fbb8b5b6
hC7Jr:39245a0f9a976e6d47c0e2d76abf6238
...
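
One caveat with the awk step: it splices each item’s text into a shell command string, so an item containing a double quote, a `$` or a backtick could change or break that command. As a minimal sketch of an alternative (not what our script does), the hashing could happen in a plain shell loop instead. The function name `hash_items` and the sample ids are hypothetical, and the sketch assumes that the sed step has already produced one `id|payload` line per item and that `md5sum` is available.

```shell
# Hypothetical helper: reads "id|payload" lines on stdin, prints "id:md5".
# The payload reaches md5sum through a pipe, never through a command string,
# so quotes or dollar signs in the XML cannot alter the command being run.
hash_items() {
  # IFS='|' splits at the first pipe; the last variable keeps the rest of
  # the line (unlike awk's $2, which stops at the next pipe).
  while IFS='|' read -r id payload; do
    md5=$(printf '%s' "$payload" | md5sum | cut -d' ' -f1)
    printf '%s:%s\n' "$id" "$md5"
  done
}

# Made-up input: the second item holds characters that would trip up a
# command built by string concatenation.
printf '%s\n' 'id1|<item><guid>id1</guid></item>' \
              'id2|<item><title>5" "$HOME"</title></item>' | hash_items
```

This spawns one `md5sum` per item, just like the awk version, so it should not be noticeably slower or faster; the only point is that the item text is treated as data.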

Now that we are able to create a daily list of MD5 sums, it’s easy to generate the diff feed:

if [ -e $MD5_DIR/vendor-md5-last ] ; then
  changed=`diff $MD5_DIR/vendor-md5-last $MD5_DIR/vendor-md5-$TODAY |
           grep "^>" |
           cut -d":" -f 1 |
           cut -b 1-2 --complement`

for record in $changed ; do
  f=`fgrep -l "<guid>$record</guid>" $FILE_PATTERN`
  xmlstarlet sel -I -t -c "/rss/channel/item[guid='$record']" $f >> vendor-import-$TODAY.xml
done

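Here we build a whitespace-separated list of the ids of all new or changed elements (a plain string rather than a true Bash array) and iterate over it. In the loop, fgrep first finds the file that contains the current guid, and xmlstarlet then extracts the matching item from that file and appends it to the new feed. Note that the excerpt leaves out the fi that closes the if block.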
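
To make the diff step concrete, here is a minimal sketch that runs it on two tiny checksum lists. The function name `changed_ids` and the sample data are made up for illustration. The pipeline is the one from the snippet, except that `cut -c3-` stands in for the GNU-only `cut -b 1-2 --complement` and the ids are read line by line instead of relying on word splitting.

```shell
# Hypothetical helper: print the ids whose "id:md5" line is new or different
# in the file $2 compared to the file $1.
changed_ids() {
  # diff exits with status 1 when the files differ, which is the expected
  # case here, so that status is deliberately ignored. The same goes for
  # grep, which exits with 1 when nothing changed.
  { diff "$1" "$2" || true; } |
    { grep '^>' || true; } |   # keep only lines that come from the new list
    cut -d':' -f1 |            # "> id:md5" becomes "> id"
    cut -c3-                   # drop the leading "> "
}

# Demo with made-up data: b has a new checksum, c is a new item.
old=$(mktemp)
new=$(mktemp)
printf '%s\n' 'a:111' 'b:222' > "$old"
printf '%s\n' 'a:111' 'b:333' 'c:444' > "$new"

changed_ids "$old" "$new" | while IFS= read -r record; do
  echo "changed: $record"
done
# prints "changed: b" and then "changed: c"

rm -f "$old" "$new"
```

Two properties of this approach are worth keeping in mind: items that disappeared from the feed only show up as `<` lines and are therefore ignored, and because diff compares line by line, items that merely moved to a different position in the feed can be reported as changed.
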
I’m quite happy with the result: it does exactly what we want it to do and is also reasonably fast. This is a good example of how familiar Unix tools can be combined to create fairly concise solutions for non-trivial problems.


  
    
      
  

Posted by Michael Kohl on Oct 1st, 2010 in programming, shell
  



    
    
      
  
  