Friday, January 10, 2014

Week 2 Readings

IIR 1.2, 2 and 3
I really like how they build up the theory behind the index, starting from a simple term-document incidence matrix, where each cell is 1 or 0 depending on whether the term appears in the document, and working up to the more complex form that a real index takes. Building it up this way shows why each feature of the index is useful, starting with the frequency counts.
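To make that progression concrete for myself, here is a minimal sketch of an inverted index that stores term frequencies instead of plain 0/1 entries. The names `build_index` and `boolean_and` are my own, the tokenizer is just a whitespace split, and the three documents are toy examples, so treat this as an illustration rather than how the book or any real engine does it.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def boolean_and(index, t1, t2):
    """Doc IDs containing both terms, ignoring the frequency counts."""
    return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_index(docs)
```

Looking only at which doc IDs are present gives the Boolean view (`boolean_and(index, "home", "july")`), while the stored counts (`index["in"][3]`) are what a weighting scheme would build on.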
"users want things to work with their data as is.  Often, they just think of documents as text inside applications and are not even aware of how it is encoded on the disk."  This quote really speaks to the many different types of encoded and unencoded text that a search engine must parse.
The move in modern IR from long stop lists to none at all is really interesting, and a change that I didn't know had happened. The techniques used to weight indexed documents are fascinating too, especially considering how much document length can vary in a web system (discussed earlier in the chapter).
The suggested heuristic is to case-fold tokens that appear at the beginning of a sentence, or in a title or sentence made up mostly of capitalized words. Capitalized words in the middle of a sentence tend to be correctly capitalized and are left alone, although in many cases they still need to be treated as equivalent to the case-folded term, since users often type queries in lowercase.
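A rough sketch of that case-folding heuristic, as I understand it. The function name `fold_case` and the 0.5 threshold for "mostly capitalized" are my own assumptions, not something taken from the book.

```python
def fold_case(tokens, threshold=0.5):
    """Heuristic case-folding for the tokens of one sentence or title."""
    if not tokens:
        return []
    capitalized = sum(1 for t in tokens if t[:1].isupper())
    # Mostly capitalized (e.g. a headline): lowercase every token.
    # The threshold value is a hypothetical choice.
    if capitalized / len(tokens) > threshold:
        return [t.lower() for t in tokens]
    # Otherwise lowercase only the sentence-initial token and keep
    # mid-sentence capitals, which are likely proper nouns.
    return [tokens[0].lower()] + tokens[1:]
```

For example, `fold_case(["Yesterday", "the", "board", "of", "General", "Motors", "met"])` lowercases only "Yesterday", while a headline like `["Breaking", "News", "From", "Wall", "Street"]` is lowercased entirely.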
The methods for processing phrase queries (biword indexes, positional indexes, etc.) have presumably been greatly improved upon by Google and Bing, since disk and memory space are essentially a non-issue for them. But they keep all this information as a trade secret! Not that it would necessarily make good reading for an introductory text, but still.
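For my own reference, here is a small sketch of the positional approach to phrase queries: store the positions of each term in each document, then check for consecutive positions. The function names and the whitespace tokenizer are my assumptions, and this version checks positions naively instead of merging postings lists efficiently.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return sorted doc IDs in which the terms of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can match.
    common = set(index[terms[0]])
    for t in terms[1:]:
        common &= set(index[t])
    hits = []
    for doc_id in sorted(common):
        # A match starts at p if the i-th term occurs at position p + i.
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

docs = {1: "to be or not to be", 2: "be to not or"}
pos_index = build_positional_index(docs)
```

Both toy documents contain "to" and "be", but only document 1 has them next to each other in that order, which is exactly the distinction a plain non-positional index cannot make.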
The use of data structures in IR really speaks to me. B-trees are a really powerful way to store huge numbers of keys on disk, and if you use a B+ tree for the dictionary, I think it lets a trailing wildcard query (like mon*) run as a range scan: roughly O(log n) to find the first matching term, plus the number of matches, instead of a scan over the whole dictionary. And rebalancing algorithms for trees are awesome!
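A B-tree is too much code for a blog post, so as a stand-in here is the same idea for trailing wildcards on a sorted in-memory list, using binary search from Python's `bisect` module. The function name `prefix_matches` and the sample vocabulary are mine; the point is only that sorted order turns a query like mon* into "find the start, then read forward".

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with `prefix` from a sorted list of terms."""
    # Binary search finds the first term >= prefix in O(log n).
    i = bisect_left(sorted_terms, prefix)
    matches = []
    # All terms sharing the prefix are contiguous in sorted order,
    # so read forward until the prefix stops matching.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

terms = sorted(["moon", "monday", "money", "month", "more", "mon", "apple"])
```

With this vocabulary, `prefix_matches(terms, "mon")` gives `['mon', 'monday', 'money', 'month']`. A B+ tree gives the same behavior on disk, since its leaves are linked in key order.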
Spelling correction in queries is pretty difficult. Online search engines guess the intended query from how often other users have entered similar queries. Without this data to rely on, however, other IR systems have to fall back on less successful techniques, such as edit distance or k-gram overlap against the dictionary.
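Here is a minimal sketch of the dictionary-based fallback: Levenshtein edit distance, then picking the closest vocabulary term. The `correct` helper and its alphabetical tie-breaking rule are my own assumptions; a real system would also limit which candidates it compares against and would probably weigh term frequency.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def correct(word, vocabulary):
    """Closest vocabulary term by edit distance; ties broken alphabetically."""
    return min(vocabulary, key=lambda term: (edit_distance(word, term), term))
```

For instance, `correct("retreival", ["retrieval", "retrieve", "relative"])` picks "retrieval". Nothing here knows what users actually meant, which is why query logs do so much better.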
