Thursday, January 30, 2014

Week 5 Readings

IIR 1.3 and 1.4
The intersection algorithm is a pretty simple way to advance through two lists sequentially.  Deciding which part of the query should be processed first is important for optimizing response time.  In-place intersection reuses the slots of intermediate results that are no longer valid to store the new, still-possibly-correct results. "Boolean queries are precise: a document either matches the query or it does not."
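A minimal sketch of that merge-style intersection, following the book's pseudocode (the function name is mine):

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists in O(len(p1) + len(p2)):
    advance the pointer of whichever list currently holds the smaller ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer
```

Processing the rarest term first keeps the intermediate result (and so each later merge) as short as possible.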

IIR 6
Free text queries are queries where the user does not indicate to the system a required word order or the relative weights of the terms making up the query.  These queries are not well served by Boolean retrieval: intersection may retrieve many documents that match all the words, but it gives no way to rank them.

Weighted zone scores are an attempt to make up for this.  The system creator assigns weights to zones, and if a query term is found in a zone, that zone's weight is added to the document's score.  These scores are compared across documents, allowing ranking.  Another way to assign weights to different zones of a document is through machine-learned relevance.
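A toy version of the idea, assuming hand-picked zone weights and a Boolean "any query term present" test per zone (the names and the whitespace tokenizer are illustrative, not from the book):

```python
def weighted_zone_score(doc_zones, query_terms, weights):
    """Score one document: for each zone that contains at least one
    query term, add that zone's hand-assigned weight."""
    score = 0.0
    for zone, text in doc_zones.items():
        tokens = set(text.lower().split())
        if any(t in tokens for t in query_terms):
            score += weights[zone]
    return score
```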

Weighted scores can handle free text queries by simply adding the weighted score for each term together.  This does not require Boolean intersection, because the highest-scoring document is the one most likely to contain all the terms of the query.

Inverse document frequency helps to refine these results.  As the example in the book states, if the document collection is about the automobile industry, the term auto will appear in almost every document. This word should be weighted less than other words to generate meaningful differences between documents.  It is important that this is the frequency of documents the term appears in, not collection frequency.  If one document used the word auto 1,000 times, that would artificially inflate the term's frequency when looking at the whole collection.
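The document-frequency point can be made concrete.  In this sketch each document is a *set* of terms, so a word repeated 1,000 times in one document still counts that document only once:

```python
import math

def idf(term, docs):
    """Inverse document frequency: log10(N / df), where df counts the
    number of documents containing the term, not total occurrences."""
    N = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df) if df else 0.0
```

A term like "auto" that appears in nearly every document gets an idf near zero, while a rare term gets a large weight.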

The rest of the chapter deals with vector space models and refinements to these models.  The available refinements are summarized in the SMART notation, and they can be mixed and matched for both documents and queries to achieve the best results.
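As one concrete instance of those choices, here is a sketch of ltc-style weighting, i.e. (1 + log tf) times log(N/df) with cosine normalization, plus a dot-product similarity.  This is a simplification; real systems pick a SMART variant separately for documents and queries:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """ltc-style weights: (1 + log10 tf) * log10(N / df), then
    normalize the vector to unit length for cosine comparison."""
    tf = Counter(tokens)
    vec = {t: (1 + math.log10(c)) * math.log10(N / df[t])
           for t, c in tf.items() if df.get(t, 0) > 0}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(v1, v2):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())
```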

Tuesday, January 28, 2014

Week 4 Muddiest Point

Gamma Code

Is gamma code stored as a string, like the index-as-string compression methods?  The unary code then acts as a delimiter for the next few bits, letting the program processing the postings know where each posting begins.  Storing all the postings as one long bit string means that no set number of bytes must be used for each posting, saving space in the postings list.  The binary value (recovered from the unary marker and the bits that follow) is then processed in memory according to the conventions of binary numbers.

If gamma code is not stored as a string, then I do not understand how gamma code works.
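My current understanding, as a runnable sketch: the unary prefix is self-delimiting, so a decoder can walk one long bit string and know where each number ends without any fixed-width slots.  (I use a Python string of '0'/'1' characters for clarity; a real index packs these bits into bytes.)

```python
def gamma_encode(n):
    """Elias gamma: unary length prefix, then the 'offset'
    (the number's binary form with its leading 1 bit dropped). n >= 1."""
    offset = bin(n)[3:]              # binary digits after the leading '1'
    return '1' * len(offset) + '0' + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into numbers."""
    nums, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == '1':        # read the unary length prefix
            length += 1
            i += 1
        i += 1                       # skip the '0' delimiter
        nums.append(int('1' + bits[i:i + length], 2))
        i += length
    return nums
```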

General Compression Question
Memory and storage are always increasing.  Does compression also mean faster access?  Otherwise, why do open source systems like Indri and Lucene compress index terms and postings automatically?

Wednesday, January 22, 2014

Week 4 Readings

IIR Ch 4
The two blocked indexing algorithms are interesting.  BSBI is elegant and simple, while SPIMI eliminates an unnecessary pass through the data to sort terms.  MapReduce is fast and highly scalable.  One aspect not touched on here is what to do if the master fails.  It is, after all, built of the same cheap parts as each parser and inverter.  My guess is that there are several layers of redundancy, where each backup master only monitors the cluster's output to disk, ready to kick in if needed.
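A single-block, in-memory sketch of SPIMI's key idea: postings accumulate directly in a growing dictionary, and terms are sorted only once at the end, instead of sorting every (term, docID) pair as BSBI does.  (The real algorithm also spills full blocks to disk and merges them, which I omit.)

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """One SPIMI pass over (term, doc_id) pairs: append each doc ID to
    the term's postings list, then sort the dictionary keys once."""
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]
        if not postings or postings[-1] != doc_id:   # avoid duplicates
            postings.append(doc_id)
    return dict(sorted(index.items()))
```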

Of the two options presented for dynamic indexing, neither seems like an ideal solution.  Rebuilding the entire index from scratch means that there will be queries where a better answer could have been found, but the user was unfortunate enough to query before the rebuild completed.  For a solution to this problem, the book offers keeping the main index on disk alongside a smaller auxiliary index in memory.  However, this requires each query to hit both indexes, merge the results, and check them against a list of deleted documents, also in memory.  This makes for slower query processing, and that is a cardinal sin!
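The extra work per query can be sketched for a one-term query (dictionaries stand in for the on-disk and in-memory indexes; names are mine):

```python
def query(term, main_index, aux_index, deleted):
    """Answer a one-term query: union the postings from the main index
    and the auxiliary index, then filter out deleted documents."""
    postings = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(postings - deleted)
```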

IIR Ch 5
The benefit of compression is not in saving disk space.  Disks are cheap and always getting cheaper.  Compression means that more can be cached in memory, meaning fewer disk seeks and faster return of results for those queries that are cached.  Similarly, each transfer from disk is so much smaller than an uncompressed result that the time savings are enormous.
It's interesting to think of casefolding and stemming as compression techniques, because they were first presented as measures to assist in recall.  But since each also means that one more distinct term will not be stored in the dictionary, this is a compression of the index as a whole.
Heaps' law states that the larger a collection, the larger the term dictionary, another indicator that compression is necessary.
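The law itself is just M = k * T^b, vocabulary size as a power of the token count.  With the typical parameter values IIR quotes for the RCV1 collection (k ≈ 44, b ≈ 0.49; these are not universal constants):

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law estimate of distinct-term count M = k * T**b
    for a collection of T tokens."""
    return k * T ** b
```

A million tokens already implies a dictionary of tens of thousands of terms, and it keeps growing with the collection.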
The dictionary can be compressed significantly through a variety of techniques.  My favorite of these is front coding.  This is similar to JPEG compression, and makes a lot of sense to me.  Blocking is not a big improvement over simply storing the dictionary as one long string, but in a very large dictionary or a very small memory, every bit helps.
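Front coding in miniature: within a sorted block, store the first term whole, then each later term as (length of prefix shared with the previous term, remaining suffix).  A sketch with Python tuples standing in for the on-disk byte layout:

```python
def front_code(terms):
    """Front-code a sorted block: first term verbatim, then
    (shared-prefix length, suffix) pairs for the rest."""
    coded = [terms[0]]
    for prev, term in zip(terms, terms[1:]):
        p = 0
        while p < min(len(prev), len(term)) and prev[p] == term[p]:
            p += 1
        coded.append((p, term[p:]))
    return coded

def front_decode(coded):
    """Rebuild the terms by re-expanding each shared prefix."""
    terms = [coded[0]]
    for p, suffix in coded[1:]:
        terms.append(terms[-1][:p] + suffix)
    return terms
```

Sorted dictionaries share long prefixes ("automat..." below), so the suffixes that actually get stored are short.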
Variable byte encoding and bit-level encoding are a bit more difficult than the book lets on.  I do not have a firm enough grasp of binary arithmetic to follow how encoding gaps this way makes things faster.
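For what it's worth, here is variable byte encoding worked through in code: each byte carries 7 payload bits, and the high bit is set only on the last byte of a number, so small gaps cost one byte and the decoder always knows where a number ends:

```python
def vb_encode(n):
    """Variable byte encode one gap: 7 bits of payload per byte,
    high bit set on the final byte as a terminator."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                   # mark the last byte
    return bytes(out)

def vb_decode(stream):
    """Decode a concatenation of variable-byte-encoded gaps."""
    nums, n = [], 0
    for b in stream:
        if b < 128:
            n = 128 * n + b          # continuation byte
        else:
            nums.append(128 * n + (b - 128))
            n = 0
    return nums
```

The book's example number 824 comes out as the two bytes 00000110 10111000.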

Wednesday, January 15, 2014

Week 2 Muddiest Point

Do n-gram methods of tokenization always mean that the terms are indexed together (precoordination)?  Or can this be done at run time, given the processing power available nowadays?
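For reference, the index-time (precoordinated) version is cheap to state: each term is expanded into its character k-grams, with boundary markers, and those k-grams become dictionary keys.  A sketch of the expansion step:

```python
def kgrams(term, k=3):
    """Character k-grams of a term with '$' boundary markers,
    as used for wildcard and spelling lookups."""
    padded = f"${term}$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]
```

Computing these at query time instead would mean scanning the whole dictionary per query, which is the trade-off the question is really about.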

Friday, January 10, 2014

Week 2 Readings

IIR 1.2, 2 and 3
I really like how they build up the theory behind the index, from just a term-document matrix, where each document is either a 0 or 1 for each term, to the complex form it really takes.  This proves the usefulness of various features of the index, starting with the frequency counts.
"users want things to work with their data as is.  Often, they just think of documents as text inside applications and are not even aware of how it is encoded on the disk."  This quote really speaks to the many different types of encoded and unencoded text that a search engine must parse.
The transition from long stoplists to none in modern IR is really interesting, and a change that I didn't know had happened.  The techniques used to weight indexed documents seem very fascinating, especially considering how the length of documents can vary in a web system (discussed earlier in the chapter).
Case-folding should be performed on tokens that appear at the beginning of a sentence, or in a sentence containing mostly capitalized words.  Other terms tend to be correctly capitalized already, though in many cases these too must be made term-equivalent to the case-folded form.
The methods for processing queries (biword, positional, etc) have presumably been greatly improved upon by either Google or Bing, because disk and memory space is essentially a non-issue for them.  But they keep all this information as a trade secret!  Not that it would necessarily make good reading for an introductory text, but still.
The use of data structures in IR really speaks to me.  B-trees are a really powerful way to store huge amounts of data on disk, and if you use a B+ tree for the dictionary, I think trailing wildcard queries can run in roughly logarithmic time in the dictionary size, plus the time to enumerate the matches.  And rebalancing algorithms for trees are awesome!
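A sorted array makes a convenient stand-in for the B-tree idea: two binary searches bound exactly the run of terms matching a trailing wildcard like mon*.  A sketch (the '\uffff' sentinel assumes terms use characters below that code point):

```python
import bisect

def trailing_wildcard(sorted_terms, prefix):
    """All terms matching 'prefix*': two binary searches find the
    contiguous range of matches, O(log n) plus the output size."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + '\uffff')
    return sorted_terms[lo:hi]
```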
Spelling correction in queries is pretty difficult, and the approach of online search engines is to guess the intended query from the incidence of other users' queries.  Without this data to rely on, however, other IR systems have to use less effective spelling correction techniques.
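The classic log-free technique is edit distance: propose the dictionary term closest to the misspelled query term.  A standard dynamic-programming sketch of Levenshtein distance:

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions
    turning s into t, computed row by row to save memory."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,        # delete from s
                            curr[j - 1] + 1,    # insert into s
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]
```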

Monday, January 6, 2014

Week 1 Readings

FOA 1.1
I have not heard the term 'finding out about' before.  I like it!  This seems like a personal knowledge management system.  Framing information retrieval in this way makes a compelling argument for the usefulness of IR.  Without IR, we are less well equipped to guarantee that we are prepared to make decisions.  The IR process as a dialog is a very important concept to keep in mind; without knowing that the user is satisfied, your search engine has not met their need.

IES 1.1-1.2
Their definition restricts IR to material expressed in a human language.  I don't know if I agree with this.  If the user knows how to create a useful query in a non-human language, and can judge the results as relevant or not, that is IR.  A Google image search using a photo as the query is an example of this that even I have done successfully.
I appreciate the discussion of desktop file search systems and enterprise IR.  These are applications that are equally useful to a web search engine professionally, and from my limited experience, have much room for improvement.
The Probabilistic Ranking Principle is important to consider.  It seems likely that this is always true, and that it is indeed the best way to sort results, but I would like to read the findings of anyone who has challenged this precept.  I know that, for example, the fifth result was most clicked in one study, and others have found that the second page of Google results is not statistically significantly less relevant than the first page.  Maybe the estimated relevance probabilities across these pages differ only very slightly.

MIR 1.1-1.4
The IR problem is to retrieve all the relevant documents while returning as few non-relevant documents as possible.  It is interesting that the goal for relevant documents is "all," but the goal for non-relevant documents is "as few as possible," not zero.  Why is the goal not to return zero non-relevant documents?  Is this due to partial relevance, or to the idea that relevance shifts as the user reformulates the query?
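Those two goals are exactly recall and precision, which can be stated in a few lines (treating relevance as binary for the sake of the sketch):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```

Demanding zero non-relevant results would mean perfect precision, which in practice can only be bought by sacrificing recall.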
There are many problems in IR currently, including spam, legality, size, speed, and efficiency.

Week 1 Muddiest Points

Are Salton's vector space model and other matching models still relevant since Google's citation (link) analysis?  Should all models strive to be as broad and accurate as Google?

In the list of types of short-term information needs, it was brought up that just as the comprehensive, exhaustive search has an opposite, the null-existence search, the question-answering search could have a similar null opposite.  I tried Googling 'who was the 55th president of the united states'.  The first link was incorrect: it was about George W. Bush's second inauguration (the 55th inauguration).  The next two were correct.  Do you feel that this type of information need exists, or could exist with better natural language processing of search queries?