User offline. Last seen 26 weeks 4 days ago. Offline
Joined: 08/25/2008

I was surprised to see that submitting the same document with the same XmlParams under REST gives slightly different results than submitting it under SOAP. By slightly different I mean that most of the relevance scores differ and several of the <c:detection> texts include an additional word at the beginning or end. I suppose the difference in relevance scores could be significant for some applications, but it is not for mine.

Still, I would expect the same algorithms to be applied regardless of which protocol was used to submit the request, so the relevance scores and detection text should be identical between a REST submission and a SOAP submission.

Submitting the attached file gives several differences in relevance scores and detection text. For instance, the relevance score for Robert Hunt is 0.193 when submitted via REST and 0.169 when submitted via SOAP. Can you verify that you get these differences as well, and comment on whether this should be expected?

 

Attachment: And now a word from our Les Miz geek 18 times and counting.txt (3.36 KB)

Joined: 06/01/2008

Hello John,

Thank you for your post: it will allow us to shed light on some important issues.

Firstly, I want to make it absolutely clear that the underlying algorithms are the SAME, and are common to both methods of invocation (SOAP and REST).

The most notable difference between the two is that SOAP layers (both client and server side) do not handle the '\r' character well, and for the most part just drop it altogether.

So, the text "hello\r\nworld" arrives at our end as "hello\nworld" when using SOAP.
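To compare the two protocols on equal footing, one option is to strip the '\r' characters yourself before submitting, so that REST receives the same bytes SOAP would deliver. The snippet below is only a sketch of that idea: the helper name `drop_carriage_returns` is hypothetical, and it models the "drop it altogether" behavior described above rather than any documented Calais or SOAP-toolkit function.

```python
def drop_carriage_returns(text: str) -> str:
    # Assumption: the SOAP layers simply remove every carriage return,
    # as described in the reply above. Line feeds are left untouched.
    return text.replace("\r", "")


if __name__ == "__main__":
    original = "hello\r\nworld"
    cleaned = drop_carriage_returns(original)
    # repr() makes the control characters visible when printed.
    print(repr(original), "->", repr(cleaned))
```

Submitting the cleaned text through both REST and SOAP should remove line endings as a source of differences in the detection text.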

This does not affect the extraction process, but it accounts for the "extra word" in the detection: we try not to include broken words in the prefix/suffix of the detection, so we count a certain number of characters back and forward, and then look for the nearest word boundary. Obviously, with this logic a single extra character can shift a whole word into or out of the detection.
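The effect is easy to reproduce with a toy version of this logic. The sketch below is not the actual Calais implementation: the function `prefix_words`, the fixed `width` of 10 characters, and the rule of moving forward to the next whitespace when the cut lands mid-word are all assumptions made for illustration.

```python
def prefix_words(text: str, start: int, width: int) -> list[str]:
    """Return the whole words found in roughly `width` characters before `start`."""
    # Count back a fixed number of characters from the detection start.
    cut = max(0, start - width)
    # If the cut lands inside a word, move forward to the next whitespace
    # so that the broken word is left out of the prefix (assumed rule).
    if cut > 0 and not text[cut - 1].isspace():
        while cut < start and not text[cut].isspace():
            cut += 1
    return text[cut:start].split()


if __name__ == "__main__":
    rest_text = "one two\r\nthree Robert Hunt"  # as sent over REST
    soap_text = "one two\nthree Robert Hunt"    # '\r' dropped by SOAP
    for label, text in (("REST", rest_text), ("SOAP", soap_text)):
        start = text.index("Robert")
        print(label, prefix_words(text, start, width=10))
```

With these inputs the REST text yields the prefix `['three']` while the SOAP text yields `['two', 'three']`: the single missing '\r' moves the cut from the middle of "two" to its first letter, so the whole word is kept.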

Secondly, about the relevance scores: I was not able to reproduce the difference, and got identical scores from both.

My guess is that your two requests were submitted at different times.

Our relevance algorithm is based on an indexed collection of documents, and this "reference collection" is updated on a regular basis. It seems to me that your more recent request reached the system after this "DB" was updated, and so received different scores.

Could you please confirm this by re-submitting both (SOAP and REST) close together in time?

HTH,

Regards,

Meir

 

Joined: 08/25/2008

Thanks for your reply. Please see my separate "Re: Diffs explained" post giving my results with file attachments: http://opencalais.com/forums/sig-developers/re-diffs-explained