Week 5 Readings

IIR 1.3 and 1.4
The intersection algorithm is a simple way to advance through two sorted postings lists in step, keeping only the document IDs that appear in both. Deciding which part of the query to process first is important for response time: processing terms in order of increasing document frequency keeps the intermediate results small. In-place intersection reuses the slots of intermediate results that are no longer valid to store the new, still possibly correct results. As the book puts it, "Boolean queries are precise: a document either matches the query or it does not."
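The merge-style intersection described above can be sketched roughly as follows. This is a minimal illustration, assuming postings lists are plain Python lists of document IDs sorted in ascending order; the function names `intersect` and `intersect_many` are hypothetical and not taken from the book.

```python
def intersect(p1, p2):
    """Intersect two postings lists of doc IDs, both sorted ascending."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Doc ID is in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """Intersect several lists, shortest first, to keep intermediates small."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # nothing can match any more
        result = intersect(result, postings)
    return result
```

Sorting by list length stands in for sorting by document frequency, since under this assumption the length of a term's postings list is its document frequency.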

IIR 6
Free text queries are queries in which the user does not tell the system the correct word order or the relative weights of the terms making up the query. Boolean retrieval handles these queries poorly: intersection may retrieve many documents that match all the words, but it gives no way to rank them.

Weighted zone scores are an attempt to make up for this. The system creator assigns a weight to each zone or field (for example title, author, body), and if the query term is found in a zone, that zone's weight is added to the document's score. These scores are then compared across documents, which allows ranking. Another way to assign weights to the different zones of a document is through machine-learned relevance.
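As a rough sketch of weighted zone scoring for a single query term: the zone names, the weights, and the function name `weighted_zone_score` below are all assumptions chosen for illustration, and a document is assumed to be a dict mapping zone names to text.

```python
def weighted_zone_score(term, doc_zones, weights):
    """Sum the weights of every zone whose text contains the term."""
    term = term.lower()
    score = 0.0
    for zone, weight in weights.items():
        tokens = doc_zones.get(zone, "").lower().split()
        if term in tokens:
            score += weight
    return score


# Hypothetical weights; they are assumed to sum to 1.
weights = {"author": 0.2, "title": 0.3, "body": 0.5}
doc = {
    "author": "anonymous",
    "title": "shakespeare and his sonnets",
    "body": "an essay on how shakespeare wrote",
}
score = weighted_zone_score("shakespeare", doc, weights)
```

Here the term matches the title and body zones but not the author zone, so only those two weights contribute to the score.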

Weighted scores can handle free text queries by simply adding together the weighted scores for each query term. This does not require boolean intersection, because the highest-scoring documents are the ones most likely to contain all the terms of the query.
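A minimal sketch of that idea, assuming the same kind of zone weights as a scoring function: each document's score is the sum of the per-term zone scores, and no intersection step is performed. The name `rank_free_text`, the document layout, and the weights are hypothetical.

```python
def rank_free_text(query, docs, weights):
    """Rank docs by the sum of zone weights matched by each query term."""
    terms = query.lower().split()
    scores = {}
    for doc_id, zones in docs.items():
        total = 0.0
        for term in terms:
            for zone, weight in weights.items():
                if term in zones.get(zone, "").lower().split():
                    total += weight
        scores[doc_id] = total
    # Highest score first; documents matching more terms tend to rise.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


weights = {"title": 0.4, "body": 0.6}  # assumed weights
docs = {
    "d1": {"title": "auto insurance", "body": "cheap auto insurance quotes"},
    "d2": {"title": "auto repair", "body": "fixing your auto at home"},
    "d3": {"title": "gardening", "body": "how to grow tomatoes"},
}
ranking = rank_free_text("auto insurance", docs, weights)
```

Documents that match only some of the terms still receive a score, which is what lets them be ranked below the documents that match everything rather than being dropped.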

Inverse document frequency helps to refine these results. As the example in the book states, if the document collection is about the automobile industry, the term auto will appear in almost every document. This term should be weighted less than other terms in a query, so that the scores show a meaningful difference between documents. It is important that the measure is document frequency (the number of documents the term appears in), not collection frequency (the total number of occurrences in the collection). If one document contained the word auto 1,000 times, that would artificially inflate the term's frequency when looking at the whole collection.
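The difference between document frequency and collection frequency can be sketched as below, using the common definition idf = log10(N / df). The toy collection and the function names are assumptions for illustration, and returning 0.0 for an unseen term is a choice made here, not something the book prescribes.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): each doc counts a term once
    return df


def collection_frequency(docs):
    """Count every occurrence of each term across the whole collection."""
    cf = Counter()
    for tokens in docs:
        cf.update(tokens)
    return cf


def idf(term, docs):
    """Inverse document frequency, log base 10."""
    df = document_frequency(docs)[term]
    if df == 0:
        return 0.0  # assumption: unseen terms contribute nothing
    return math.log10(len(docs) / df)


docs = [
    ["auto", "car"],
    ["auto", "insurance"],
    ["auto"] * 1000,  # one document repeating "auto" 1,000 times
    ["best", "car"],
]
df = document_frequency(docs)
cf = collection_frequency(docs)
```

In this toy collection the repeated document raises the collection frequency of auto sharply, but it still counts only once toward document frequency, so the idf value is unaffected by the repetition.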

The rest of the chapter deals with vector space models and refinements to these models. The available refinements are summarized in the SMART notation, and they can be mixed and matched separately for documents and queries to achieve the best results.
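As a small illustration of the vector space idea, documents and queries can be treated as term-weight vectors and compared by cosine similarity. The sketch below assumes sparse vectors stored as dicts from term to weight; the weights shown are made up for the example and are not the output of any particular SMART weighting scheme.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm1 * norm2)


# Hypothetical weights for a query and two documents.
query = {"auto": 0.1, "insurance": 1.2}
doc_a = {"auto": 0.1, "insurance": 1.0, "quotes": 0.5}
doc_b = {"auto": 0.1, "repair": 1.4}

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
```

Dividing by the vector lengths normalizes for document length, so a long document does not score higher simply because it contains more terms.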

IS2140 - Readings and Muddiest Points

Thursday, January 30, 2014
