Monday, March 13, 2006

DMoz vs RDF Repository (Sesame)

What can you do - there are some days that just should not happen. I think one of them was the day I decided to integrate the DMoz ontology into JeromeDL and FOAFRealm. It all looked so harmless - especially when I got a (small!) part of the ontology from Andreas. I built my whole mental model around that, and finally even set up the JOnto project to deliver a unified API for handling taxonomies. And ...
Then I decided to download the DMoz RDF, or whatever it is they claim to be RDF :( It took me some time to realize what was wrong. Eventually I got some help from Hee Chul and Krystian, who provided nice conversion scripts. I thought that was the end of the problems - I had a sample RDF file (a true one) that worked, and a real RDF version of the full DMoz ontology. But it was not the end of the problems :(
I decided to upload the 800MB RDF-DMoz file to Sesame. But after a couple of hours of waiting, with 100% CPU usage and my laptop's CPU temperature at almost 80C, I gave up.


Daniel suggested I should just point to the RDF file and create an in-memory repository. Well - that went quickly - straight to an "out of heap memory" error :( Later I took the "divide and conquer" approach and cut the 800MB file into 10 smaller ones. The first one got uploaded very quickly (relatively). Encouraged by that, I started uploading the remaining 9. Each one took much, much longer to upload than the one before, until the 10th, which must have made Sesame hang - there was no progress for the whole night (I went to sleep, btw).
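For anyone who wants to try the same "divide and conquer" step, here is a minimal sketch in Python. It is not the script I actually used. It assumes the RDF has first been converted to N-Triples, where every line is a complete statement, so the file can be cut at any line boundary (cutting RDF/XML this way would break the document). The names chunked, split_file and the "chunk" file prefix are made up for the illustration.

```python
from itertools import islice


def chunked(lines, chunk_size):
    """Yield consecutive lists of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def split_file(path, chunk_size, prefix="chunk"):
    """Write each group of lines to its own file and return the file names.

    Assumption: `path` is an N-Triples file (one triple per line).
    """
    out_paths = []
    with open(path, encoding="utf-8") as src:
        # Reading the file object line by line keeps memory use bounded
        # by chunk_size, not by the size of the whole dump.
        for i, chunk in enumerate(chunked(src, chunk_size)):
            out_path = f"{prefix}-{i:02d}.nt"
            with open(out_path, "w", encoding="utf-8") as dst:
                dst.writelines(chunk)
            out_paths.append(out_path)
    return out_paths
```

Something like split_file("dmoz.nt", 1_000_000) would then produce chunk-00.nt, chunk-01.nt and so on; the file name and chunk size here are placeholders, not values from my setup.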

I cleaned the repository and uploaded only the first chunk again. But using it - whether browsing or querying with SeRQL - was way too sluggish. Finally I came to my senses and "slimmed" the DMoz RDF, removing (with a modified version of Krystian's script) all information that was not defining a dmoz:Topic or using dc:title and dmoz:narrow{12}. Luckily I got a 200MB RDF file that went smoothly into Sesame.
And now JOnto-DMoz is finally kicking ass :)
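The "slimming" idea can be sketched roughly like this in Python. This is not Krystian's script, just an illustration under a few assumptions: the input is N-Triples (one triple per line), the dmoz: prefix maps to http://dmoz.org/rdf/ (a guess, adjust it to whatever your converted file uses), and dmoz:narrow{12} stands for the properties narrow, narrow1 and narrow2. The names keep_triple and slim are made up.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DMOZ = "http://dmoz.org/rdf/"  # assumed namespace, not taken from the dump

# Predicates that survive the slimming (narrow variants are an assumption).
KEEP_PREDICATES = {
    DC_TITLE,
    DMOZ + "narrow",
    DMOZ + "narrow1",
    DMOZ + "narrow2",
}

# Very small N-Triples line matcher: subject, <predicate>, object, final dot.
TRIPLE = re.compile(r"^\s*(\S+)\s+<([^>]*)>\s+(.*?)\s*\.\s*$")


def keep_triple(line):
    """Return True for dmoz:Topic type statements and the kept predicates."""
    match = TRIPLE.match(line)
    if not match:
        return False  # blank lines, comments, anything unparsable
    predicate, obj = match.group(2), match.group(3)
    if predicate == RDF_TYPE:
        return obj == "<" + DMOZ + "Topic>"
    return predicate in KEEP_PREDICATES


def slim(lines):
    """Lazily yield only the lines worth keeping, so big files stream through."""
    for line in lines:
        if keep_triple(line):
            yield line
```

Because slim is a generator, it can be fed an open file object and written straight to another file without holding the whole dump in memory.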

[I will upload the scripts and the final RDF file to jonto.sf.net soon]

2 comments:

Unknown said...

Yes... I'm also having the same issues with DMoz's half-hearted attempt at RDF. :( Even the namespaces are wrong and have to be converted. Ugh. I was looking at doing a flyweight extension on the Model to only load the components that are important, but I'm not yet sure how to do it without first creating an expensive in-memory index... ugh.

Unknown said...

Please watch out for news at http://jonto.sf.net/ - Mateusz has just taken JOnto over from me, and he should be announcing a new version pretty soon