Concerns about the efficiency of RDF storages are not new; many people I meet ask me whether they are scalable enough to be used in industrial solutions.
I was not sure about it for a long time. I was not happy with Jena, so we switched JeromeDL and FOAFRealm to Sesame. It showed some improvement. I was hoping to switch to YARS, but being unable to write to this storage kept me at bay.
Anyway, over time my confidence in the scalability of RDF storages grew. When DERI announced the breakthrough with SWSE/YARS2, I felt pretty confident that we had reached the stage where the industrial world could start building upon Semantic Web technologies.
And so, I became reckless. Until only recently ...
During my summer holiday, just to play around a little, I made some changes to the TagsTreeMaps (TTM) component, preparing it for the evaluation I will need for my thesis. Since a broadband connection and a sunny environment are mutually exclusive (at least they were in my case), I switched from the original del.icio.us tagging provider module, developed last year, to an internal notitio.us provider module. The latter operated on an RDF storage (Sesame) holding a copy of my own and some of my colleagues' taggings from del.icio.us. The graph with taggings was built following Tom Gruber's Tagging ontology (TagCommons).
If you have played with TTM at any time in the past, you know that the first step requires a list of all tags used by a given user, together with the number of times each tag has been used. Since none of the RDF query languages supported by Sesame (at least to my knowledge) allows for aggregations like COUNT(*), I decided to do the counting myself. Still, I needed a list of all tags.
The obvious query, to me, was the following:
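Counting on the client side is simple once the labels are in hand. Here is a minimal sketch in Python (not the actual TTM code, which is Java), assuming the query result has already been flattened into a plain list of tag labels, one entry per use; the function name `count_tags` and the sample labels are made up for illustration.

```python
from collections import Counter


def count_tags(labels):
    """Return a mapping {tag label: number of uses}.

    `labels` is assumed to hold one entry per use of a tag,
    i.e. one entry per row returned by the query.
    """
    return Counter(labels)


# Made-up sample data standing in for the query result.
labels = ["rdf", "sesame", "rdf", "semweb", "rdf", "sesame"]
tag_counts = count_tags(labels)

# Tags ordered from most to least used, as a tree map would need them.
for tag, uses in tag_counts.most_common():
    print(tag, uses)
```

This replaces what COUNT(*) with a GROUP BY would do in SQL, at the cost of shipping one row per tag use to the client.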
SELECT term
FROM
{document} tagging:hasTagging {tagging},
{tagging} dc:creator {<USER-ID>};
tagging:hasTerm {} rdfs:label {term}
USING NAMESPACE
tagging = <http://ttm.corrib.org/tagging#>,
dc = <http://purl.org/dc/elements/1.1/>
In other words, for all documents tagged by the user with the given USER-ID, get all literals representing the tags used in these taggings.
To my surprise, the whole application slowed down to a snail's pace. Why? Quick profiling with a Logger in the right places of the algorithm (I could not get the Tomcat profilers in Eclipse running on my Mac) hinted that it was the query execution by Sesame that took ages.
I even posted this query through the web interface of Sesame. The result was even worse: 25k ms (!) to compute the query for roughly 400+ documents with 2.5 tags each (on average). That is BAD.
Luckily, I am blessed with a group of people (apparently smarter than me) working under my supervision in my SemInf Lab at DERI.
I described the problem to Maciej and asked him what query he would write. His response was:
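For readers who want to try the same poor man's profiling, here is a rough sketch of the idea in Python (my code was Java, so this is an illustration, not what I ran). The logger name `ttm.timing`, the helper `timed` and the stand-in `run_query` are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ttm.timing")  # hypothetical logger name


@contextmanager
def timed(step, sink=None):
    """Log how long the wrapped block took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        log.info("%s took %.1f ms", step, elapsed_ms)
        if sink is not None:
            sink[step] = elapsed_ms  # keep the number for later comparison


def run_query():
    # Stand-in for the real call to the RDF store.
    time.sleep(0.01)
    return ["rdf", "sesame", "rdf"]


timings = {}
with timed("run_query", timings):
    labels = run_query()
with timed("count_tags", timings):
    counts = {label: labels.count(label) for label in set(labels)}
```

Wrapping each step separately is what makes it obvious which one eats the time; with a single timer around the whole request I would only have known that something was slow.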
SELECT term
FROM
{tagging} dc:creator {<USER-ID>};
tagging:hasTerm {} rdfs:label {term}
USING NAMESPACE
tagging = <http://ttm.corrib.org/tagging#>,
dc = <http://purl.org/dc/elements/1.1/>
... and Sesame managed to compute it, giving the same results (!) in 200ms (!!!!!).
The question is whether I can use his query instead of mine. Quick answer: YES,
... but what if the RDF does not conform to our ontology? For example, what if there are resources with dc:creator and tagging:hasTerm properties that are not of type Tagging and are not associated with any document? Unlikely to happen in the old world of SQL, but quite possible in the open Semantic Web environment.
For the purpose of the TTM evaluation I will stick to Maciej’s query. Hopefully, there will be a better solution out there by the time notitio.us goes commercial.
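To make the worry concrete, here is a toy illustration in Python over a handful of hand-written triples. It only mimics the shape of the two queries, not how Sesame evaluates them, and every identifier in the data (`doc:1`, `t:1`, `x:9`, `user:me` and so on) is made up.

```python
# Toy triples (subject, predicate, object); all identifiers are made up.
USER = "user:me"
TRIPLES = {
    ("doc:1", "tagging:hasTagging", "t:1"),
    ("t:1", "dc:creator", USER),
    ("t:1", "tagging:hasTerm", "term:rdf"),
    ("term:rdf", "rdfs:label", "rdf"),
    # A stray resource: it has dc:creator and tagging:hasTerm,
    # but no document points at it via tagging:hasTagging.
    ("x:9", "dc:creator", USER),
    ("x:9", "tagging:hasTerm", "term:spam"),
    ("term:spam", "rdfs:label", "spam"),
}


def objects(subject, predicate):
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def labels_of(taggings):
    labels = []
    for tagging in taggings:
        for term in objects(tagging, "tagging:hasTerm"):
            labels.extend(objects(term, "rdfs:label"))
    return sorted(labels)


def short_query(user):
    """Shape of the shorter query: anything created by `user`."""
    return labels_of(subjects("dc:creator", user))


def long_query(user):
    """Shape of the longer query: only taggings attached to a document."""
    attached = {o for s, p, o in TRIPLES if p == "tagging:hasTagging"}
    return labels_of(subjects("dc:creator", user) & attached)


print(short_query(USER))  # includes the label from the stray resource
print(long_query(USER))   # only labels reachable from a document
```

On clean data the two functions agree, which is why the shorter query was safe for my evaluation data set; the moment a stray resource like `x:9` shows up, they diverge.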