citizen428.blog()

Try to learn something about everything

Information Overload 2010-10-02

On a weekend trip again, but managed to squeeze in quite a bit of reading during the week.

  • Friends Without Benefits
    Was there a paradigm shift in Silicon Valley from hard science to pointless web 2.0 startups?
  • How Universities Work
    As the title implies this is about universities in the US, but a lot of it also holds true for academic institutions in Europe.
  • First World War officially ends
    92 years after the end of the war, Germany will pay the last chunk of reparations imposed by the Treaty of Versailles.
  • The Shell Hater’s Handbook
    Despite the name, this presentation by GitHubber Ryan Tomayko is a nice intro to shell scripting. If you know a shell hater, send him a link to this presentation.
  • Virtual vs. Real Protests
    Twitter “revolutions” and the confusion between “mobilization” and “organization”.
  • Small Change
    Very much in the same vein as the previous article, Malcolm Gladwell talks about hierarchies vs. networks, strong vs. weak ties and why joining a Facebook group is not the same sort of activism as putting your life at risk in a real world conflict.
  • Pay The Bills
    Interesting experiment in earning some money while looking for a job.
  • Software Development for Developing Regions: Another Hat for Hackers
    Interesting idea about using skills like reverse engineering, code injection etc. for making software better suited to third world demands.
  • India’s surprising economic miracle
    Will India’s economy thrive when the global economy becomes more knowledge-intensive?

XML Diffs With Bash and Awk

Slightly modified version of a post I originally wrote for our company blog.

When importing data at work, we often have to deal with XML. This generally works fine, but the format’s structured nature also means that you can’t just treat it like any old text file.

That’s something we recently had to work around when we wanted to generate a daily XML diff containing only the elements that changed since the previous feed. Of course there are several open source tools for diff-ing XML (e.g. diffxml or xmldiff), but since we couldn’t get them to do what we wanted in a reasonable amount of time, we decided to roll our own.

The final solution is a 71-line Bash script, which downloads a zip, extracts it, generates MD5 sums for every element and then creates a diff between this new file and the previous list of MD5 sums. Once we know which elements have changed, we merge them into a new feed which then gets handed to our importer. The awesome xmlstarlet was a great help in this, as was battle-tested old awk.

Let’s look at an interesting snippet from the script:

xmlstarlet sel -I -t -m "//item" -v "./guid" -o "|" -c "." -n - |
  sed -e '...' |
  awk \
    'BEGIN {
      FS="|"
      RS="\n"
    }
    {
      id=$1
      command="printf \"%s\" \"" $2 "\" | md5sum | cut -d\" \" -f1"
      command | getline md5
      close(command)
      print id":"md5
    }' >> $MD5_DIR/vendor-md5-$TODAY

Here we use xmlstarlet to iterate over all the items in the feed (the XPath “//item”), print the value of the “guid” element (-v “./guid”), output a pipe character (-o “|”) and then copy the current element followed by a newline (-c “.” -n). This then gets piped through sed for some cleaning up (which I omitted here for brevity’s sake) before awk takes the part after each “|”, generates an MD5 sum and finally produces a file that looks like this:

rKKTZ:4012fced7c4cd77da607d294fbb8b5b6
hC7Jr:39245a0f9a976e6d47c0e2d76abf6238
...
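
The awk version builds a shell command string around each item, which means characters such as double quotes or $ in the XML need care. As a minimal sketch, assuming input lines of the form guid|item like the ones xmlstarlet emits above, the same hashing step could also be written in plain Bash. The function name hash_records and the two sample records are made up for illustration and are not part of the original script:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): reads lines of the
# form "guid|<item>...</item>" on stdin and prints "guid:md5" for each.
hash_records() {
  local id payload md5
  # With IFS='|' and two variables, id receives the text before the first
  # "|" and payload receives the rest of the line.
  while IFS='|' read -r id payload; do
    # printf '%s' hashes the payload without a trailing newline,
    # mirroring the printf call in the awk version.
    md5=$(printf '%s' "$payload" | md5sum | cut -d' ' -f1)
    printf '%s:%s\n' "$id" "$md5"
  done
}

# Example run with two made-up records
printf '%s\n' \
  'a1|<item><guid>a1</guid></item>' \
  'b2|<item><guid>b2</guid></item>' | hash_records
```

Because the payload is passed to printf as an argument instead of being spliced into a command string, no extra escaping is needed. Unlike the awk version, which only hashes the second “|”-separated field, this sketch hashes everything after the first “|”.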

Now that we are able to create a daily list of MD5 sums, it’s easy to generate the diff feed:

if [ -e $MD5_DIR/vendor-md5-last ] ; then
  changed=`diff $MD5_DIR/vendor-md5-last $MD5_DIR/vendor-md5-$TODAY |
           grep "^>" |
           cut -d":" -f 1 |
           cut -b 1-2 --complement`

  for record in $changed ; do
    f=`fgrep -l "<guid>$record</guid>" $FILE_PATTERN`
    xmlstarlet sel -I -t -c "/rss/channel/item[guid='$record']" $f >> vendor-import-$TODAY.xml
  done
fi

Here we collect the ids of the changed elements into a whitespace-separated list (not a real Bash array), over which we then iterate. In the loop we first use fgrep to find the feed file that contains the current guid and then once again use xmlstarlet to extract the matching item from that file.
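
The diff step can be tried in isolation on two small checksum lists. This is only a sketch under assumptions: the function name changed_ids, the ids and the shortened fake checksums are invented for the example, and sed is swapped in for the GNU-specific "cut --complement" to strip the leading "> ":

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): prints the ids whose
# "id:md5" line is new or changed in the second file compared to the first.
changed_ids() {
  local old=$1 new=$2
  # diff marks lines that only exist in the new file with a leading "> ".
  # grep keeps those lines, sed strips the marker, cut keeps the id.
  diff "$old" "$new" | grep '^>' | sed 's/^> //' | cut -d':' -f1
}

# Example with made-up ids and shortened fake checksums
tmp=$(mktemp -d)
printf '%s\n' 'aaa:111' 'bbb:222' 'ccc:333' > "$tmp/old"
printf '%s\n' 'aaa:111' 'bbb:999' 'ccc:333' 'ddd:444' > "$tmp/new"
changed_ids "$tmp/old" "$tmp/new"   # prints bbb and ddd, one per line
rm -r "$tmp"
```

Items that were removed from the feed only show up as "<" lines in the diff output, so they are ignored here, just like in the snippet above.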

I’m quite happy with the result: it does exactly what we want it to do and is also reasonably fast. This is a good example of how familiar Unix tools can be combined to create fairly concise solutions for non-trivial problems.

Information Overload 2010-09-26

Being sick this week I had a lot of time to read, but most of it went into Bruce Sterling’s Hacker Crackdown and Joe Dunthorne’s Submarine. Anyway, here we go:

Emacsclient on OS X

If you are running a non-system Emacs on OS X and have tried to use “emacsclient”, you may have seen the following error message despite having started the Emacs server:

emacsclient: can't find socket; have you started the server?
To start the server in Emacs, type "M-x server-start".
emacsclient: No socket or alternate editor.  Please use:

	--socket-name
	--server-file      (or environment variable EMACS_SERVER_FILE)
	--alternate-editor (or environment variable ALTERNATE_EDITOR)

This doesn’t work because you are invoking “/usr/bin/emacsclient” which came with the OS, instead of “/Applications/Emacs.app/Contents/MacOS/bin/emacsclient”. This can easily be fixed by symlinking the latter to “/usr/local/bin/emacsclient” and making sure that “/usr/local/bin” is listed in your path before “/usr/bin”.
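
A minimal sketch of that fix, wrapped in a small function so it can be tried against any pair of paths. The function name link_emacsclient is made up for illustration; the paths in the commented call are the ones mentioned above, and writing to /usr/local/bin may require sudo on your machine:

```shell
#!/usr/bin/env bash
# Hypothetical helper: symlink a given emacsclient binary into a bin directory.
link_emacsclient() {
  local target=$1 bindir=$2
  mkdir -p "$bindir"
  # -s creates a symbolic link, -f replaces an existing link of the same name
  ln -sf "$target" "$bindir/emacsclient"
}

# Typical call with the paths from the text:
# link_emacsclient /Applications/Emacs.app/Contents/MacOS/bin/emacsclient /usr/local/bin

# Make sure /usr/local/bin comes before /usr/bin, e.g. in your shell profile:
# export PATH="/usr/local/bin:$PATH"

# Then check which binary the shell picks up first:
# command -v emacsclient
```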

Not a big deal, but it took me a couple of minutes to figure out and I thought I might as well save others some time…

Information Overload 2010-09-19

This week “Information Overload” is back in full swing, and there really were lots of interesting things I stumbled upon:

Rubinius Has a New Fan

Being kinda sick I decided to use the weekend for emptying out my Instapaper account a little. Doing so I finally read Rubinius wants to help YOU make Ruby better on the Engine Yard blog. This reminded me that it’s been over a year since I last looked at Rubinius, so I used the excellent RVM to get the latest version and started my experiments. Basically everything I threw at it just worked, except for some of my scripts using 1.9’s new lambda syntax. Speedwise it seems to be more in the MRI 1.8.7 than the 1.9.2 range, but that’s fair enough. Getting adventurous I decided to try how Rubinius would handle one of my all-time favorite Ruby annoyances, the inability to override to_s in subclasses of String (don’t ask, but this once cost me almost an entire afternoon).

Example:

class SubclassedString < String
  def to_s
    "overriden"
  end
end

puts SubclassedString.new("original")

In MRI 1.8.7, MRI 1.9.2, JRuby HEAD and MacRuby 0.6 this will output “original”, which I believe to have tracked down to rb_obj_as_string in string.c in the MRI source (no idea about the other implementations). To my great surprise Rubinius 1.0.1 actually output “overriden”, which instantly won it a new fan. :-)

My First Week With the Kindle 3

tl;dr: Kindle 3 == teh awesum.

One week ago I finally got my Kindle 3 and it’s about time for a review. Here we go: Awesome! That’s it, ’nuff said. In all seriousness, the Kindle may well be my favorite gadget and that comes from somebody who owns a MacBook Pro, an Android phone, an iPod Touch and a Nintendo DS.

First off, at around 180 Euro for the WiFi+3G version it’s quite a bargain. I love the form factor, and the software is quite ok too, especially after the latest update. The built-in dictionary has already proven to be useful on several occasions and I’m sure I’ll start using the annotation feature sooner rather than later. But now for the most important part, the e-paper display. It’s an absolute pleasure to read on, and I find the experience to be highly immersive. Since Saturday I have read most of Cory Doctorow’s novel Little Brother on it, and am surprised how the Kindle just seems to disappear while I read. Hands-free reading without the need to keep the book open is pretty sweet too.

As a book nerd and regular reader (I usually read between 50 and 60 books a year) I still can’t entirely abandon paper books though; there’s just something magical about their feel and smell. I also tend to pick up a lot of my reading material from second-hand shops, Bookcrossing or Offener Buecherschrank, something which is not possible for eBooks (yet). However, the Kindle is great for reading all the great freely available books like Peter Watts’s Rifters trilogy, or the Project Gutenberg texts which I have mostly ignored so far because I find it too annoying to read them on a computer screen (programming eBooks are a different matter, I want to read them on my screen so I can easily switch to an editor and try out things).

As should be obvious by now, I’m pretty stoked about my new toy. There is however one app that makes it even more awesome: Calibre. While it has a pretty ugly UI, it’s jam-packed with useful features every eBook user will appreciate. I especially love the ability to fetch RSS feeds and convert them into Kindle “magazines”: every morning Calibre fetches the feeds of several of my favorite online publications, converts them and emails them to my Kindle where I can later read them. Users can contribute their own recipes for this and some of them are just amazing (e.g. the one for Austrian newspaper Der Standard is basically as good as, or better than, the commercial offers in the Kindle store). Of course the program can also deal with authentication, so it’s no problem to access my subscription to the English edition of Le Monde diplomatique or my unread Instapaper items. Recipes are Python scripts by the way, so it’s easy to modify or create them. All in all an absolutely fantastic piece of software, which I happily donated money to! :-)

If you are looking for an eBook reader, it’s probably hard to find better value for money than the new Kindle. I’ve been wanting to buy one for the last six months or so, but am very happy that I waited until now. It’s everything I expected from such a device, plus a bit more.

Information Overload 2010-09-05

Erlang Bit Syntax and ID3

A couple of days ago I finally started properly looking at Erlang for the first time. One aspect I find especially interesting is the bit syntax, so I wrote a small program for parsing ID3v1 tags for practice. There’s definitely room for improvement (I ignored ID3v1.1), but it was a fun little exercise. Here’s the code:

-module(mp3).
-export([get_id3/1, get_tags/2]).

get_id3(File) ->
    case file:open(File, [read, binary]) of
        {ok, MP3} ->
            % The ID3v1 tag occupies the last 128 bytes of the file.
            Result = case file:pread(MP3, {eof, -128}, 128) of
                eof -> eof;
                {error, Reason} -> Reason;
                {ok, <<"TAG", Tags/binary>>} -> parse_id3(Tags);
                {ok, _} -> no_id3
            end,
            file:close(MP3),
            Result;
        {error, Reason} -> Reason
    end.

get_tags(Tags, L) ->
    lists:map(fun (Tag) -> proplists:get_value(Tag, L) end, Tags).

parse_id3(<<T:30/binary,Ar:30/binary,Al:30/binary,Y:4/binary,C:30/binary,G:1/binary>>) ->
    Clean = lists:map(fun cleanup/1, [T, Ar, Al, Y, C, G]),
    {id3v1, lists:zip([title, artist, album, year, comment, genre], Clean)}.

cleanup(T) ->
    lists:filter(fun(X) -> X =/= 0 end, binary_to_list(T)).

Let’s see this in action in the Erlang shell (the MP3 comes from a similar exercise in RubyLearning’s core Ruby course):

147> % file doesn't exist
147> mp3:get_id3("./test.txt").
enoent
148> % file is not an MP3
148> mp3:get_id3("./test.clj").
no_id3
150> % get the tags
150> {id3v1, Tags} = mp3:get_id3("song.mp3").
{id3v1,[{title,"Dancing Shoes"},
 {artist,"Cliff Richard and The Shadows"},
 {album,"(SUMMER HOLIDAY 1963)"},
 {year,"2000"},
 {comment,"Rubylearningr"},
 {genre,[24]}]}

I’m too new to Erlang to judge if this is a proper use of a property list, but it allowed me to write get_tags/2 as a wrapper for proplists:get_value/2 which is rather nice:

151> mp3:get_tags([artist], Tags).
["Cliff Richard and The Shadows"]
152> mp3:get_tags([artist, year], Tags).
["Cliff Richard and The Shadows","2000"]

Some initial help came from this related blog post, but I think our versions came out quite differently in the end.

All in all Erlang feels quite nice, except for minor syntactic quirks like different statement terminators depending on context or the need to “extract” a local function with fun for the call in lists:map/2. Any feedback would be much appreciated; I’m sure there are plenty of things I could have done better.

Copyright © 2016 - Michael Kohl - Powered by Octopress