citizen428.blog()

Slightly modified version of a post I originally wrote for our company blog.
When importing data at work, we often have to deal with XML. This generally works fine, but the format’s structured nature also means that you can’t just treat it like any old text file.
That’s something we recently had to work around when we wanted to generate a daily XML diff containing only the elements that changed since the previous feed. Of course there are several open source tools for diffing XML (e.g. diffxml or xmldiff), but since we couldn’t get them to do what we wanted in a reasonable amount of time, we decided to roll our own.
The final solution is a 71-line bash script that downloads a zip, extracts it, generates an MD5 sum for every element and then diffs this new list against the previous list of MD5 sums. Once we know which elements have changed, we merge them into a new feed, which then gets handed to our importer. The awesome xmlstarlet was a great help in this, as was battle-tested old awk.
Let’s look at an interesting snippet from the script:
xmlstarlet sel -I -t -m "//item" -v "./guid" -o "|" -c "." -n - |
  sed -e '...' |
  awk \
    'BEGIN {
      FS="|"
      RS="\n"
    }
    {
      id=$1
      # build a shell command that hashes the item XML held in $2
      command="printf \"%s\" \"" $2 "\" | md5sum | cut -d\" \" -f1"
      command | getline md5
      close(command)
      print id":"md5
    }' >> "$MD5_DIR/vendor-md5-$TODAY"

Here we use xmlstarlet to iterate over all the items in the feed (the XPath “//item”), print the value of the “guid” element (-v “./guid”), output a pipe character (-o “|”) and then copy the current element followed by a newline (-c “.” -n). This then gets piped through sed for some cleaning up (which I omitted here for brevity’s sake) before awk takes the part after each “|”, generates an MD5 sum for it and finally produces a file that looks like this:
rKKTZ:4012fced7c4cd77da607d294fbb8b5b6
hC7Jr:39245a0f9a976e6d47c0e2d76abf6238
...
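
The hashing step can also be sketched without awk. This is a minimal sketch and not what our script does: it assumes the "guid|item" lines arrive on stdin with one item per line, it assumes GNU md5sum is available, and the function name hash_items is made up for illustration. Handing the item to printf as an argument, instead of splicing it into a command string, sidesteps the shell quoting that the awk version has to guard against when an item contains double quotes.

```shell
# hash_items: hypothetical helper, not part of the original script.
# Reads "guid|payload" lines on stdin and prints "guid:md5" lines.
hash_items() {
  local id payload md5
  # With IFS set to "|", read puts the text before the first pipe into
  # $id and the remainder of the line into $payload.
  while IFS='|' read -r id payload; do
    md5=$(printf '%s' "$payload" | md5sum | cut -d' ' -f1)
    printf '%s:%s\n' "$id" "$md5"
  done
}

# Example with two made-up items, one per line.
printf '%s\n' \
  'a1|<item><guid>a1</guid></item>' \
  'b2|<item><guid>b2</guid></item>' | hash_items
```

The output has the same "id:md5" shape as the file shown above, so it could feed the same diff step.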

Now that we are able to create a daily list of MD5 sums, it’s easy to generate the diff feed:
if [ -e "$MD5_DIR/vendor-md5-last" ] ; then
  # ids whose line is new or different in the newer list
  changed=$(diff "$MD5_DIR/vendor-md5-last" "$MD5_DIR/vendor-md5-$TODAY" |
            grep "^>" |
            cut -d":" -f 1 |
            cut -b 1-2 --complement)

  for record in $changed ; do
    # $FILE_PATTERN is left unquoted in case it is a glob
    f=$(grep -F -l "<guid>$record</guid>" $FILE_PATTERN)
    xmlstarlet sel -I -t -c "/rss/channel/item[guid='$record']" $f >> "vendor-import-$TODAY.xml"
  done
fi

Here we build a whitespace-separated list of the ids of the changed elements (a plain string rather than a real bash array), over which we then iterate. In the loop we first find the file that contains the right guid and then once again use xmlstarlet to extract the matching item from it.
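
To see the comparison step in isolation, here is a small sketch that runs the same diff pipeline on made-up sample data. The function name changed_ids and the sample ids are assumptions for illustration, and I swapped the GNU-only `cut --complement` for the equivalent `cut -c3-`. Like the snippet above, it only reports new or changed items; ids that disappeared from the feed are ignored.

```shell
# changed_ids: hypothetical helper wrapping the diff pipeline.
# Prints the ids whose "id:md5" line is new or different in the second file.
changed_ids() {
  diff "$1" "$2" |
    grep '^>' |        # keep only lines that come from the newer list
    cut -d':' -f1 |    # drop the MD5 sum, leaving "> id"
    cut -c3-           # strip the leading "> " that diff adds
}

# Made-up sample data: b2 changed and c3 is new.
tmp=$(mktemp -d)
printf '%s\n' 'a1:1111' 'b2:2222' > "$tmp/last"
printf '%s\n' 'a1:1111' 'b2:9999' 'c3:3333' > "$tmp/today"
changed_ids "$tmp/last" "$tmp/today"   # prints b2 and c3, one per line
rm -r "$tmp"
```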
I’m quite happy with the result: it does exactly what we want it to do and is also reasonably fast. This is a good example of how familiar Unix tools can be combined to create fairly concise solutions for non-trivial problems.
  
  


    
  
  
    
      
  
    
      Information Overload 2010-09-26
    
    
      
        








  


Sep 26th, 2010
        
      
    
  


  Being sick this week I had a lot of time to read, but most of it went into Bruce Sterling’s Hacker Crackdown and Joe Dunthorne’s Submarine. Anyway, here we go:

Did ‘Star Wars’ become a toy story? (http://herocomplex.latimes.com/2010/08/12/star-wars-was-born-a-long-time-ago-but-not-all-that-far-far-away-in-1972-filmmakers-george-lucas-and-gary-kurtz-wer/)
I admit to being quite a Star Wars nerd, the Han shot first kind who pretends Episode I-III never happened and who actually sat through the entire Holiday special. If you are remotely like me, this article providing some insider information by Gary Kurtz will be quite an interesting read.
Computers and Mathematical Notation
Interesting thoughts by Turing Award winner Kenneth E. Iverson on how programming languages (J in this case) could influence mathematical notation.
Common Programmer Health Problems
Some sane advice we all occasionally are guilty of ignoring.
We Are All Talk Radio Hosts
On strawberry jam, reasoning and cognitive bias.
The Millennium Development Goal that really does work has been forgotten
Sure, we’ll help you. Just not in a way so that you eventually can help yourself.
The Anthropology of Hackers
Outline of anthropologist Gabriella Coleman’s course about hacker culture at NYU.
Ten things to look for in a circumvention tool
By Roger Dingledine of Tor fame.
This Ain’t Your Daddy’s Wumpus
Example chapter from the upcoming book Land of Lisp (PDF).
In Arabian Desert, a Sustainable City Rises
The first residents are starting to move into Masdar, a planned city built in Abu Dhabi. There seems to be some discussion about how much of a “real” city Masdar will be, but then Dubai didn’t feel like a “real” place to me either, but more like a soulless agglomeration of skyscrapers and shopping malls.

  
  


    
  
  
    
      
  
    
      Emacsclient on OS X
    
    
      
        








  


Sep 22nd, 2010
        
      
    
  


  If you are running a non-system Emacs on OS X and have tried to use “emacsclient”, you may have seen the following error message despite having started the Emacs server:
emacsclient: can't find socket; have you started the server?
To start the server in Emacs, type "M-x server-start".
emacsclient: No socket or alternate editor.  Please use:

	--socket-name
	--server-file      (or environment variable EMACS_SERVER_FILE)
	--alternate-editor (or environment variable ALTERNATE_EDITOR)

This doesn’t work because you are invoking “/usr/bin/emacsclient” which came with the OS, instead of “/Applications/Emacs.app/Contents/MacOS/bin/emacsclient”. This can easily be fixed by symlinking the latter to “/usr/local/bin/emacsclient” and making sure that “/usr/local/bin” is listed in your PATH before “/usr/bin”.
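
As a rough sketch, the symlink step could look like the following. The function name link_emacsclient is made up, its two arguments default to the paths mentioned above, and writing to “/usr/local/bin” may need sudo depending on your setup.

```shell
# link_emacsclient: hypothetical helper; both arguments default to the
# paths named in the post.
link_emacsclient() {
  src=${1:-/Applications/Emacs.app/Contents/MacOS/bin/emacsclient}
  dest_dir=${2:-/usr/local/bin}
  if [ ! -x "$src" ]; then
    echo "no emacsclient at $src" >&2
    return 1
  fi
  # -f replaces an existing link of the same name
  mkdir -p "$dest_dir" &&
    ln -sf "$src" "$dest_dir/emacsclient"
}

# Usage (not run here):
#   link_emacsclient
#   command -v emacsclient   # should now print /usr/local/bin/emacsclient
```

If `command -v emacsclient` still prints “/usr/bin/emacsclient” afterwards, the PATH order is the part that needs fixing.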
Not a big deal, but it took me a couple of minutes to figure out and I thought I might as well save others some time…
  
  


    
  
  
    
      
  
    
      Information Overload 2010-09-19
    
    
      
        








  


Sep 19th, 2010
        
      
    
  


  This week “Information Overload” is back in full swing, and there really were lots of interesting things I stumbled upon:

The Biggest Company You’ve Never Heard Of
Interesting two-and-a-half-minute video about Serco, a company I had indeed never heard of before.
Goran Lindberg and Sweden’s dark side
Are Stieg Larsson’s novels closer to the truth than expected?
Third Kim lucky?
This article suggests that Kim Jong-un will become Kim Jong-il’s successor. I’d have expected a kind of military/party government after the latter’s retirement/death, let’s see what actually happens.
Genetic Scars of the Holocaust: Children Suffer Too
Epigenetics in the context of Holocaust victims and their children. 
The Most Powerful Colors in the World
With so many colors to choose from, companies don’t seem to be overly creative when deciding on their logos.
Todesopfer rechter Gewalt
Interactive map of people who died due to right-wing violence in Germany. Also see 137 Todesopfer rechter Gewalt for more information on the victims (both in German).
Itunes Remove duplicated songs
It’s always nice to see something based on my blog posts, especially in this case where it was the first time ever I wrote something about Erlang.
The Myth of the Boy Wizard
After the recently discovered flaw in Haystack, Austin Heap got a lot of flak. This article is an interesting take on the media’s role and responsibility in all of this.
Zeros to heroes: 10 unlikely ideas that changed the world
A good reminder to keep on hacking despite the odds.
Clojure is Fast
If you ever were curious just how fast you can get your Clojure code to run, this will provide some valuable insight.
Lawsuit challenges Obama’s power to kill citizens without due process
You know a country is fucked once it needs a lawsuit like this.

  
  


    
  
  
    
      
  
    
      Rubinius Has a New Fan
    
    
      
        








  


Sep 18th, 2010
        
      
    
  


  Being kinda sick I decided to use the weekend for emptying out my Instapaper account a little. Doing so I finally read Rubinius wants to help YOU make Ruby better on the Engine Yard blog. This reminded me that it’s been over a year since I last looked at Rubinius, so I used the excellent RVM to get the latest version and started my experiments. Basically everything I threw at it just worked, except for some of my scripts using 1.9’s new lambda syntax. Speedwise it seems to be more in the MRI 1.8.7 than the 1.9.2 range, but that’s fair enough. Getting adventurous I decided to try how Rubinius would handle one of my all-time favorite Ruby annoyances, the inability to override to_s in subclasses of String (don’t ask, but this once cost me almost an entire afternoon).
Example:

class SubclassedString < String
  def to_s
    "overriden"
  end
end

puts SubclassedString.new("original")

In MRI 1.8.7, MRI 1.9.2, JRuby HEAD and MacRuby 0.6 this will output “original”, which I believe to have tracked down to rb_obj_as_string in string.c in the MRI source (no idea about the other implementations). To my great surprise Rubinius 1.0.1 actually output “overriden”, which instantly won it a new fan. :-)
  
  


    
  
  
    
      
  
    
      My First Week With the Kindle 3
    
    
      
        








  


Sep 14th, 2010
        
      
    
  


  tl;dr: Kindle 3 == teh awesum.
One week ago I finally got my Kindle 3 and it’s about time for a review. Here we go: Awesome! That’s it, ’nuff said. In all seriousness, the Kindle may well be my favorite gadget and that comes from somebody who owns a MacBook Pro, an Android phone, an iPod Touch and a Nintendo DS.
First off, at around 180 Euro for the WiFi+3G version it’s quite a bargain. I love the form factor, and the software is quite OK too, especially after the latest update. The built-in dictionary has already proven useful on several occasions and I’m sure I’ll start using the annotation feature sooner rather than later. But now for the most important part, the e-paper display. It’s an absolute pleasure to read on, and I find the experience highly immersive. Since Saturday I have read most of Cory Doctorow’s novel Little Brother on it, and am surprised how the Kindle just seems to disappear while I read. Hands-free reading without the need to keep the book open is pretty sweet too.
As a book nerd and regular reader (I usually read between 50 and 60 books a year) I still can’t entirely abandon paper books though; there’s just something magical about their feel and smell. I also tend to pick up a lot of my reading material from second-hand shops, Bookcrossing or Offener Buecherschrank, something which is not possible for eBooks (yet). However, the Kindle is great for reading all the freely available books like Peter Watts’s Rifters trilogy, or the Project Gutenberg texts which I mostly ignored so far because I find it too annoying to read them on a computer screen (programming eBooks are a different matter, I want to read them on my screen so I can easily switch to an editor and try out things).
As should be obvious by now, I’m pretty stoked about my new toy. There is however one app that makes it even more awesome, Calibre. While it has a pretty ugly UI, it’s jam-packed with useful features every eBook user will appreciate. I especially love the ability to fetch RSS feeds and convert them into Kindle “magazines”: every morning Calibre fetches the feeds of several of my favorite online publications, converts them and emails them to my Kindle where I can later read them. Users can contribute their own recipes for this and some of them are just amazing (e.g. the one for Austrian newspaper Der Standard is basically of equal or better quality than the commercial offers in the Kindle store). Of course the program can also deal with authentication, so it’s no problem to access my subscription to the English edition of Le Monde diplomatique or my unread Instapaper items. Recipes are Python scripts by the way, so it’s easy to modify or create them. All in all an absolutely fantastic piece of software, which I happily donated money to! :-)
If you are looking for an eBook reader, it’s probably hard to find better value for money than the new Kindle. I’ve been wanting to buy one for the last six months or so, but am very happy that I waited until now. It’s everything I expected from such a device, plus a bit more.
  
  


    
  
  
    
      
  
    
      Information Overload 2010-09-13
    
    
      
        








  


Sep 13th, 2010
        
      
    
  


  Due to an extended weekend trip this issue of “Information Overload” is much shorter and a bit later than usual. The next one should be back to normal:

Who Killed Prolog
Interesting read for programming language nerds, which also contains interesting info on how the Japanese economy was feared by the US in the 80s.
Obama ist nur ein normaler, gemaessigter Demokrat im Stil von Bill Clinton
Interview with Noam Chomsky in German.

  
  


    
  
  
    
      
  
    
      Information Overload 2010-09-05
    
    
      
        








  


Sep 5th, 2010
        
      
    
  


  
Clojure or: How I Learned to Stop Worrying and Love the Parentheses
Blog post explaining what draws the author – and many others – to Clojure.
tactics, tactics, tactics
How to get better at programming. And chess.
Die Sankore-Schriften
Little history of the Sankore university in Timbuktu, Mali (in German).
The Making Of: Little Computer People
I love video game history, especially when it deals with games that influenced very successful titles later on.
Sleep Deprivation May Spur Serious Mental Problems, Study Finds
A little reminder to all us geeks, nerds and hackers that sleep does play an important role in our overall well-being.
Experience: I spent 29 years in solitary confinement
Robert King, who spent 29 years in solitary confinement for crimes he didn’t commit, talks about his experience.
Why can you turn clothing right-side-out?
Math is everywhere.
Brazilian agriculture: The miracle of the cerrado
Very interesting article on how Brazil transformed itself into one of the world’s biggest food exporters.

  
  


    
  
  
    
      
  
    
      Erlang Bit Syntax and ID3
    
    
      
        








  


Sep 4th, 2010
        
      
    
  


  A couple of days ago I finally started properly looking at Erlang for the first time. One aspect I find especially interesting is the bit syntax, so I wrote a small program for parsing ID3v1 tags for practice. There’s definitely room for improvement (I ignored ID3v1.1), but it was a fun little exercise.
Here’s the code:

-module(mp3).
-export([get_id3/1, get_tags/2]).

get_id3(File) ->
    case file:open(File, [read, binary]) of
        {ok, MP3} ->
            Result = case file:pread(MP3, {eof, -128}, 128) of
                eof -> eof;
                {error, Reason} -> Reason;
                {ok, <<"TAG", Tags/binary>>} -> parse_id3(Tags);
                {ok, _} -> no_id3
            end,
            file:close(MP3),
            Result;
        {error, Reason} -> Reason
    end.

get_tags(Tags, L) ->
    lists:map(fun (Tag) -> proplists:get_value(Tag, L) end, Tags).

parse_id3(<<T:30/binary,Ar:30/binary,Al:30/binary,Y:4/binary,C:30/binary,G:1/binary>>) ->
    Clean = lists:map(fun cleanup/1, [T, Ar, Al, Y, C, G]),
    {id3v1, lists:zip([title, artist, album, year, comment, genre], Clean)}.

cleanup(T) ->
    lists:filter(fun(X) -> X =/= 0 end, binary_to_list(T)).



Let’s see this in action in the Erlang shell (the MP3 comes from a similar exercise in RubyLearning’s core Ruby course):

147> % file doesn't exist
147> mp3:get_id3("./test.txt").
enoent
148> % file is not an MP3
148> mp3:get_id3("./test.clj").
no_id3
150> % get the tags
150> {id3v1, Tags} = mp3:get_id3("song.mp3").
{id3v1,[{title,"Dancing Shoes"},
 {artist,"Cliff Richard and The Shadows"},
 {album,"(SUMMER HOLIDAY 1963)"},
 {year,"2000"},
 {comment,"Rubylearningr"},
 {genre,[24]}]}



I’m too new to Erlang to judge if this is a proper use of a property list, but it allowed me to write get_tags/2 as a wrapper for proplists:get_value/2, which is rather nice:

151> mp3:get_tags([artist], Tags).
["Cliff Richard and The Shadows"]
152> mp3:get_tags([artist, year], Tags).
["Cliff Richard and The Shadows","2000"]



Some initial help came from this related blog post, but I think our versions came out quite differently in the end.

All in all Erlang feels quite nice, except for minor syntactic quirks like different statement modifiers depending on context or the need to “extract” a local function with fun for the call in lists:map/2. Any feedback would be much appreciated; I’m sure there are plenty of things I could have done better.
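
For comparison, the fixed layout that the binary pattern in parse_id3 describes (a 3-byte “TAG” marker followed by 30+30+30+4+30+1 bytes, 128 in total) can also be read by plain byte offsets. The following shell sketch is only an illustration of that layout and not a port of the Erlang module; the helper names id3v1_field and id3v1_summary are made up, and like the code above it ignores ID3v1.1.

```shell
# id3v1_field FILE OFFSET LENGTH: hypothetical helper, not part of mp3.erl.
# Prints LENGTH bytes starting at OFFSET within the last 128 bytes of FILE,
# with NUL padding removed.
id3v1_field() {
  tail -c 128 "$1" | dd bs=1 skip="$2" count="$3" 2>/dev/null | tr -d '\000'
}

# id3v1_summary FILE: prints title, artist, album and year, or "no_id3" if
# the last 128 bytes do not start with the "TAG" marker.
id3v1_summary() {
  if [ "$(id3v1_field "$1" 0 3)" != "TAG" ]; then
    echo "no_id3"
    return 1
  fi
  # Offsets follow the field sizes in parse_id3, shifted by the 3-byte marker.
  echo "title:  $(id3v1_field "$1" 3 30)"
  echo "artist: $(id3v1_field "$1" 33 30)"
  echo "album:  $(id3v1_field "$1" 63 30)"
  echo "year:   $(id3v1_field "$1" 93 4)"
}

# Usage (not run here): id3v1_summary song.mp3
```

Seeing the offsets spelled out like this makes me appreciate how much bookkeeping the bit syntax does for you in a single pattern.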

  
  


    
  
  
    
    
  


  
    
  About citizen428
  I’m Michael Kohl, generally known as citizen428 online. I mainly
  write about programming, and do a regular blog post series collecting interesting
  articles I enjoyed throughout the week.







  
  
  



  


    
  
  
  Copyright © 2016 - Michael Kohl -
  Powered by Octopress

Try to learn something about everything

Information Overload 2010-10-02

XML Diffs With Bash and Awk

Information Overload 2010-09-26

Emacsclient on OS X

Information Overload 2010-09-19

Rubinius Has a New Fan

My First Week With the Kindle 3

Information Overload 2010-09-13

Information Overload 2010-09-05

Erlang Bit Syntax and ID3