Friday, April 11, 2014

Week 13 Readings


IIR Ch. 13
Text classification is a problem in IR.  When you encounter a new document, which other documents are similar to it, and which group of documents should it belong to?  The classification algorithms presented here are very similar to the algorithms we use for finding document similarity.  Conditional independence is also assumed during feature selection for classification models, because modeling the dependence of terms, and constantly updating that model with new data, would be too complex.  I like that the chapter introduces many statistical measures, such as chi-square, because these are complex problems with many possible solutions we should be exposed to.
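The chi-square statistic the chapter uses for feature selection can be computed directly from the 2x2 term/class contingency counts. Here is a minimal sketch (the variable names are mine, not the book's):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square test of independence between a term and a class.

    n11: docs in the class containing the term
    n10: docs in the class without the term
    n01: docs outside the class containing the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0
```

If term presence is independent of class membership the statistic is 0; larger values mean the term is more useful as a feature for that class.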

IIR Ch. 14
The contiguity hypothesis is challenging for me to accept.  Even in the simplest example we see, with the topics Kenya, UK, and China, the hypothesis states that each document falls into one class only.  But a document about trade relations between the UK and China fits into both.  So a new class must be formed.  But then this can be subdivided again and again, until each class contains only one document!  Specificity is critical here.  kNN and Voronoi tessellation seem very complex, and I am not sure how to tessellate the vector space; from the reading it seems there should be one cell per training document, not per the much smaller k.
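To convince myself kNN is simpler than the tessellation picture suggests, here is a toy sketch: a test document just takes the majority label of its k nearest training documents under cosine similarity. The vectors and labels below are invented for illustration, not from the chapter.

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, doc, k=3):
    """train: list of (vector, label). Return majority label of k nearest."""
    neighbors = sorted(train, key=lambda t: cosine(t[0], doc), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

No explicit tessellation is ever built; the Voronoi picture is just a way of visualizing which regions of the space each training document "wins".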

Monday, April 7, 2014

Week 12 Muddiest Point


It seems that whatever takes the most processing power will end up being the best way to do anything in IR.  Does creating a personalized search system require the most processing?  Do we also see this model of adaptive IR perform better than modifying queries with personalized terms or reranking returned documents with a user model?

Friday, April 4, 2014

Week 12 Readings


Gauch et al.
I am super interested in this! Especially explicit user profiling; I wonder how systems can get users to want to give this information.  The interchangeability of implicit and explicit feedback in studies is interesting.  One would think that explicit feedback would provide more powerful information, but perhaps this is not true.
There is a lot of research into getting at the semantic concepts behind a word.  This is a really difficult research issue because it is not static for a user.  Even if you can figure out which bank they mean right now, there is no guarantee that this is the only way they use the word.  And that first part is difficult, too.  The project that explored selected webpages and compared results to WordNet concepts was pretty cool.
Bloedorn et al.'s suggestion to use a hierarchy rings true; in my searching of PubMed, sometimes the expanded term higher up the hierarchy is what I later find I am aiming for. There are a lot of different ideas for how to get user feedback and build a profile of user interests from it.

Pazzani & Billsus
Many of the mathematical techniques for content recommendation are algorithms we have learned to use in general retrieval models: Rocchio feedback, naive Bayes, multivariate Bernoulli, etc.  I like the idea that sometimes the explicit feedback a user gives is as simple as clicking a giant thumbs-up or thumbs-down button.  That's really easy, and if you make the change in results meaningful, that's enough incentive to do it, at least for me.  Finally, telling the difference between a funny joke and an unfunny one is impossible with any of these techniques.

Ahn et al.
This paper presents a method for helping users in exploratory search by creating a model of their current task and integrating it into the result set.  I like the idea of highlighting task terms in snippets.  It does bring to mind that users may not be used to this, and will not freely enter queries that stray from the task terms.  We are trained to mold our queries into what we know Google likes, and entering a Google search without my actual search terms feels crazy.  In TaskSieve, this is actually a valid way to search.

Monday, March 31, 2014

Week 11 Muddiest Point


Is there any more recent data on the language spoken by Internet users and the language of pages?  How do you even collect this data?

Images can be seen as a universal language (well, almost).  Can tagging webpages with images be a good way to get at the topic of a page, thereby bypassing the semantic issues and the need for things like Darwish's probabilistic analysis?  Has anyone tried this yet?

Wednesday, March 26, 2014

Week 11 Readings


IES Ch 14
Parallel query processing is simple: increase the number of machines accepting queries, and give each a copy of the index.  Query throughput then increases in direct proportion to the number of machines.  Of course, if the index is too large to be stored on one system, this is not possible.  Intra-query parallelism is thus more common, where each machine holds only part of the index and queries on terms in that mini-index are directed to the appropriate machine.
Replication seems like a very easy way to introduce fault tolerance.  If each query is run on each of the machines, then the chance that every single one of these will fail at once is really small.  If one fails, then the next in line can finish the query with no loss of time.  Also, this multiple redundancy makes it easy to simply replace the failed machines as needed.
This reading also presents a fantastic buildup of MapReduce, by introducing the paper from its simplest form to the more advanced problems encountered in a large search engine.
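The map/shuffle/reduce shape that the chapter builds up can be sketched on a single machine for inverted-index construction. This is only a toy to show the three phases; a real MapReduce runs each phase distributed across many machines, and the sample documents below are made up.

```python
from collections import defaultdict

def map_phase(docs):
    """Mappers emit (term, doc_id) pairs from their share of the documents."""
    for doc_id, text in docs.items():
        for term in text.lower().split():
            yield term, doc_id

def shuffle(pairs):
    """The framework groups all pairs for a term at one reducer."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_phase(grouped):
    """Each reducer sorts and deduplicates one term's postings list."""
    return {term: sorted(set(ids)) for term, ids in grouped.items()}

docs = {1: "trade with china", 2: "china and kenya trade"}
index = reduce_phase(shuffle(map_phase(docs)))
```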

Monday, March 24, 2014

Week 10 Muddiest Point


How can we trust user generated anchor text and hyperlinks, when <meta> tags are so corrupted by the same users? If I have a spam page and try lots of SEO techniques, should my outlinks still contribute to PageRank or a similar link analysis?

Is there any other user generated content that can help us improve search?  Are they willing to provide this information?  How can we incentivize this process?  Have any search engines tried using Mechanical Turks or paying users a very small amount for relevance judgments?

Friday, February 28, 2014

Week 9 Readings


MIR Chapter 10
The user interface should be as unobtrusive as possible, and help to bring the user clarity in their information seeking behavior.  There are more complex metrics to rank systems on than recall and precision alone.  A user who has to spend a lot of time with a system may rate it poorly, or some may not be interested in recall at all for their information need.
"An empty screen or a blank entry form does not provide clues to help a user decide how to start the search process" - wow this isn't a negative at all!
I like the Venn diagram visualization of Boolean queries; it seems very useful.  I also like that this reading presents old methods that have been tried in search interfaces.  Before Google made every search display look just one way, so many ideas were attempted, and this is an important reference point to make sure we don't waste our time on something already tried and failed.

SUI Chapter 1
The user interface of search has hardly changed since 1997.  Novice users are not naturally inclined to issue keyword queries, though; they want to ask natural language questions instead.  Interface design needs to be targeted to a certain user group!  Then the interface should be designed, tested, and redesigned to find a balance that meets this group's needs.  This can be a challenge because some principles of interface design oppose each other (keep it consistent, and keep it simple), and because UI design is not an exact science.

SUI Chapter 11
Concordance visualizations are very interesting!  SeeSoft and TextArc look very different from each other, but are both really nice.

Week 8 Muddiest Point


No muddiest point this week!

Friday, February 21, 2014

Week 8 Readings


IIR 9
Synonymy is a problem that I am really interested in!  I look forward to reading this chapter, especially the sections on query expansion.
Relevance feedback poses a fundamental problem for me: how can you separate query reformulation from an evolving query?  I think the end result of either will be improved through relevance feedback, but it seems theoretically unsound. I'm really glad they begin with the Rocchio algorithm, because my first thought was that relevance feedback is an extension of statistical language models; that would work by computing a new language model that incorporates the relevant documents into the query's language model.  Rocchio's algorithm was designed for vector space retrieval, but relevance feedback can also be applied straightforwardly in a probabilistic retrieval model.  However, one major difficulty of using relevance feedback in general is that users do not want to spend a long time interacting with their search engine.  Also, if a user does perform relevance feedback and the newly generated result set still contains some poor matches, most users see this as a big failure.  Global relevance feedback works with a thesaurus to expand concepts.  But each kind of thesaurus has major challenges, and this is not a very good expansion method.
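The Rocchio update itself is compact: move the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. A minimal sketch, using the common textbook default weights (alpha, beta, gamma are tunable, not prescribed):

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback in the vector space model."""
    cr = centroid(relevant)
    cn = centroid(nonrelevant)
    # Negative components are clipped to 0, as is common practice.
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, cr, cn)]
```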

Xu and Croft
The term mismatch problem has been recognized for a very long time.  The two existing solutions, global and local analysis, each have issues.  For this reason, the authors propose a new technique called local context analysis.  It builds on co-occurrence of terms between the query and the top returned documents: if a term is infrequent in the collection but co-occurs heavily with query terms, it is a good term to expand the query with.
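Roughly, the intuition can be sketched like this: score a candidate expansion term by how often it co-occurs with query terms in top-ranked passages, discounted by how common it is in the whole collection. This simplified score is only illustrative of the idea, not Xu and Croft's exact formula; all names and data are mine.

```python
import math

def lca_score(candidate, query_terms, passages, collection_freq, n_docs):
    """Toy local-context-analysis-style score for an expansion candidate.

    passages: sets of terms from top-ranked passages.
    collection_freq: how many documents contain each term overall.
    """
    # Rare terms in the collection get a higher idf-style discount.
    idf = math.log(n_docs / (1 + collection_freq.get(candidate, 0)))
    # Count passages where the candidate co-occurs with any query term.
    cooc = sum(1 for p in passages
               if candidate in p and any(q in p for q in query_terms))
    return cooc * idf
```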

Wang, Fang, and Zhai
Using negative feedback (no clickthroughs) to load better results on pages 2 and beyond is such an awesome idea!  Four different negative feedback models are presented, along with two ideas for making the TREC collection contain a good number of negative-result queries.  This lets future researchers perform similar experiments and compare against these initial results.

Harman
The author identifies three issues in relevance feedback.
1- The probabilistic model could not originally be extended with relevance judgments.  Modifying the Sparck Jones weighting makes this possible.
2- Deciding which terms to include from documents selected as relevant.  Harman recommends the 20 most relevant terms, though how to weight them is left up in the air.
3- Diminishing returns in multiple feedback passes.  Harman actually argues against this concern, and in a very pre-WWW mindset recommends looking through many non-relevant documents for thoroughness.

Monday, February 17, 2014

Week 7 Muddiest Point


Is the cost to the user used in ranking relevant documents in the Binary Independence Model?  Or is this simply one probabilistic algorithm that can be implemented among many choices (such as the classic probabilistic model)?

Saturday, February 15, 2014

Week 6 Readings


Hiemstra and de Vries
This paper presents the possibilities of the statistical language retrieval model.  It is a great introduction to the model because it is framed as a comparison with similar concepts from the three retrieval models we have already learned about.  It seems this model will work much better on longer queries; for a short query of only one or two words, it will essentially perform like tf-idf weighting in the vector space model.
It's really interesting that a Boolean query may not be valid when using language retrieval.  An AND joined with an OR will weigh a two term query versus a one term query and this does not work in the language retrieval model.  Instead of stemming, the query can use all variants of a stem joined with OR to perform the same function.  This is pretty cool, and not just useful in the language model, but it seems like it will be slower and use more resources to perform.

IIR Chapter 11
Probabilistic retrieval is possible because users can only give an estimation of their information need as a query, and the system can only give an estimation of each document's representation.  Matching these estimations can therefore itself be estimated.

IIR Chapter 12
The language retrieval models are all based on building a language model and asking how likely the words on the other side of the comparison are under that model; the models most likely to have generated those words are ranked highest.  There are three basic ways of doing the comparison: make each document a language model, make the query a language model, or build both and compare them.  Smoothing is important here too, because a word that does not appear in the small sample underlying the model (the document) may still be part of the language.  A real-world equivalent would be hearing someone speak and trying to guess which language the sounds belong to.
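The document-as-language-model version can be sketched as a query-likelihood scorer with Jelinek-Mercer smoothing: each query term's probability is a mix of its document frequency and its collection frequency, so an unseen term dampens the score instead of zeroing it out. The mixing weight 0.5 below is an arbitrary choice for illustration.

```python
from collections import Counter

def query_likelihood(query, doc_tokens, collection_tokens, lam=0.5):
    """Score a document by P(query | document model), with smoothing."""
    doc = Counter(doc_tokens)
    coll = Counter(collection_tokens)
    score = 1.0
    for t in query:
        p_doc = doc[t] / len(doc_tokens)             # term's share of the doc
        p_coll = coll[t] / len(collection_tokens)    # term's share of the collection
        # Jelinek-Mercer smoothing: mix the two estimates.
        score *= (1 - lam) * p_doc + lam * p_coll
    return score
```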

Friday, February 7, 2014

Week 5 Muddiest Point


Is the dominance of the exact match model (Boolean retrieval) a form of job security for librarians?  Our book says that Lexis-Nexis vector space retrieval returned better results than the expert librarians Lexis-Nexis employed for Boolean retrieval reference services.  Are there empirical studies of the relevance of results under each model?  I don't understand why those professional searchers still have a job after what the book mentioned.  Or, at least, why haven't they transitioned to performing natural language searches for patrons after engaging them to find their information need?

At my internship, I use Boolean retrieval over several medical databases.  And more often than not, I get frustrated after an hour of fruitless searching and Google it, with perfect results.

Thursday, January 30, 2014

Week 5 Readings


IIR 1.3 and 1.4
The intersection algorithm is a pretty simple way to advance through two lists sequentially.  Deciding which part of the query should be processed first is important for optimizing response time.  In-place intersection reuses the slots of intermediate results that are no longer valid to store new, still-possibly-correct results. "Boolean queries are precise: a document either matches the query or it does not."
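The book's INTERSECT algorithm translates almost line for line into code: walk two sorted postings lists with two pointers, advancing whichever points at the smaller docID.

```python
def intersect(p1, p2):
    """Two-pointer intersection of sorted postings lists of docIDs."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer
```

Each list is traversed at most once, which is why starting with the shortest postings lists (the query-optimization point above) pays off.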

IIR 6
Free text queries are queries where the user does not indicate correct word order or the relative weights of the terms making up the query.  These queries cannot be handled well by Boolean retrieval: many documents that match all the words will be retrieved via intersection, but they cannot be ranked.

Weighted zone scores are an attempt to make up for this.  The system creator assigns weights to fields, and if a query term is found in a zone, that zone's weight is added to the document's score.  These scores are compared across documents, allowing ranking.  Another way to assign weights to different zones of a document is through machine-learned relevance.

Weighted scores can handle free text queries by simply adding the weighted scores for each term together.  This does not require Boolean intersection, because the highest-weighted document is the one most likely to contain all the terms of the query.
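The zone-scoring idea above fits in a few lines. The zone names, weights, and sample document here are invented for illustration; the book only requires that the zone weights sum to 1.

```python
def zone_score(term, doc_zones, weights):
    """Sum the weights of the zones in which the term appears.

    doc_zones: {zone_name: set of terms in that zone}
    weights:   {zone_name: weight}, summing to 1.
    """
    return sum(weights[z] for z, terms in doc_zones.items() if term in terms)

# Hypothetical document and weighting: title matches count more than body.
doc = {"title": {"auto", "insurance"}, "body": {"car", "auto", "best"}}
weights = {"title": 0.7, "body": 0.3}
```

A free text query is then scored by summing `zone_score` over its terms, with no intersection step.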

Inverse document frequency helps refine these results.  As the example in the book states, if the document collection is about the automobile industry, the term auto will appear in almost every document, so this word should be weighted less than others to generate meaningful differences between documents.  It is important that this is the frequency of documents the term appears in, not collection frequency: if one document used the word auto 1,000 times, that would artificially inflate the term's frequency measured over the whole collection.
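The document-frequency point is easy to see in code. A tiny sketch with made-up postings: `auto` appears in every document, so its idf (and hence its tf-idf weight) is zero no matter how often one document repeats it.

```python
import math

def idf(term, postings, n_docs):
    """log10(N / df), where df counts DOCUMENTS containing the term,
    not total occurrences across the collection."""
    df = len(postings.get(term, []))
    return math.log10(n_docs / df) if df else 0.0

def tf_idf(tf, term, postings, n_docs):
    """Weight a term's tf in one document by its collection-wide idf."""
    return tf * idf(term, postings, n_docs)
```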

The rest of the chapter deals with vector space models and refinements to them.  The available refinements can be seen in the SMART notation, and they can be mixed and matched for both documents and queries to achieve the best results.

Tuesday, January 28, 2014

Week 4 Muddiest Point


Gamma Code

Is gamma code stored as a bit string, like the dictionary-as-a-string compression method?  The unary code then acts as a delimiter for the next few bits, letting the program processing the postings know where each posting begins.  Storing all the postings as one long string means that no fixed number of bytes must be used for each posting, saving space in the postings list.  The binary string (recovered from the unary marker and the bits that follow) is then processed in memory according to the conventions of binary numbers.

If gamma code is not stored as a string, then I do not understand how gamma code works.
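To check my own understanding, here is gamma encoding and decoding over plain '0'/'1' strings (a real system packs actual bits, but the delimiting works the same way): the unary length prefix tells the decoder how many offset bits follow, so codes can be concatenated with no separate delimiter and no fixed width.

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1, as a '0'/'1' string."""
    offset = bin(n)[3:]               # binary of n with its leading 1 removed
    length = "1" * len(offset) + "0"  # unary code for len(offset)
    return length + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of ints."""
    out, i = [], 0
    while i < len(bits):
        k = 0
        while bits[i] == "1":          # read the unary length prefix
            k += 1
            i += 1
        i += 1                         # skip the terminating 0
        out.append(int("1" + bits[i:i + k], 2))  # restore the implicit leading 1
        i += k
    return out
```

So yes, it behaves like one long self-delimiting string: the decoder never needs to know posting boundaries in advance.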

General Compression Question
Memory and storage are always increasing.  Does compression also mean faster access?  Otherwise, why would open source software like Indri and Lucene compress index terms and postings automatically?

Wednesday, January 22, 2014

Week 4 Readings


IIR Ch 4
The two block-based index construction algorithms are interesting.  BSBI is elegant and simple, while SPIMI eliminates an unnecessary pass through the data to sort terms.  MapReduce is fast and super scalable.  One aspect not touched on here is what to do if the master fails; it is, after all, built from the same cheap parts as each parser and inverter.  My guess is that there are several layers of redundancy, where each backup master only monitors the cluster's output to disk, ready to kick in if needed.
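The SPIMI trick, as I read it, can be sketched in a few lines: terms go straight into a growing in-memory dictionary with postings appended as they stream past, so no intermediate sort of (term, docID) pairs is needed; terms are sorted only once, when the block is written out. A toy single-block version (real SPIMI also handles running out of memory by starting a new block):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """token_stream: iterable of (term, doc_id) pairs, in document order.
    Returns one block's inverted index as {term: [doc_ids]}."""
    dictionary = defaultdict(list)
    for term, doc_id in token_stream:
        postings = dictionary[term]          # term is added on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # postings stay in docID order
    # Sort the terms only once, when writing the block out.
    return dict(sorted(dictionary.items()))
```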

Of the two options presented for dynamic indexing, neither seems ideal.  Rebuilding the entire index from scratch means there will be queries where a better answer existed, but the user was unfortunate enough to query before the rebuild completed.  As a solution to this problem, the book offers keeping the main index on disk alongside an auxiliary, temporary index in memory.  However, this requires each query to hit both indexes, merge the results, and check them against a delete list, also in memory.  That makes for slower query processing, and that is a cardinal sin!

IIR Ch 5
The benefit of compression is not in saving disk space; disks are cheap and always getting cheaper.  Compression means that more can be cached in memory, which means fewer disk seeks and faster return of results for cached queries.  Similarly, each transfer from disk is so much smaller than an uncompressed result that the time savings are enormous.
It's interesting to think of casefolding and stemming as compression techniques, because they were first presented as measures to assist in recall.  But since each also means that one more distinct term will not be stored in the dictionary, this is a compression of the index as a whole.
Heaps' law states that the larger a collection, the larger the term dictionary, another indicator that compression is necessary.
The dictionary can be compressed significantly, through a variety of techniques.  My favorite of these is front coding.  This is similar to JPEG compression, and makes a lot of sense to me.  Blocking is not a big improvement over simply stringing all terms, but in a very large dictionary or a very small memory every bit helps.
Variable byte encoding and bit-level encoding are a bit more difficult than this book lets on.  I do not have a firm enough grasp of binary math to follow how these gap encodings make things faster.
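Working through variable byte encoding in code helped me: each byte carries 7 payload bits, and the high bit is set only on the last byte of a number, so the decoder knows where each gap ends without any fixed width.

```python
def vb_encode(n):
    """Variable byte code for one non-negative gap, as bytes."""
    out = []
    while True:
        out.insert(0, n % 128)     # low 7 bits go in the last byte so far
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # set the continuation bit on the final byte
    return bytes(out)

def vb_decode(data):
    """Decode a concatenation of variable byte codes into a list of ints."""
    out, n = [], 0
    for b in data:
        if b < 128:
            n = 128 * n + b        # an intermediate byte: accumulate
        else:
            out.append(128 * n + (b - 128))  # final byte: emit the number
            n = 0
    return out
```

The speedup comes from decoding being a handful of shifts and adds per byte, cheap enough that reading fewer (compressed) bytes from disk wins overall.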

Wednesday, January 15, 2014

Week 2 Muddiest Point


Do n-gram methods of tokenization always mean that the terms are indexed together (precoordination)?  Or can this be done at run time, given the processing power available nowadays?

Friday, January 10, 2014

Week 2 Readings


IIR 1.2, 2 and 3
I really like how they build up the theory behind the index, from a simple term-document matrix, where each document scores either 0 or 1 for each term, to the complex form it really takes.  This motivates the usefulness of various features of the index, starting with the frequency counts.
"users want things to work with their data as is.  Often, they just think of documents as text inside applications and are not even aware of how it is encoded on the disk."  This quote really speaks to the many different types of encoded and unencoded text that a search engine must parse.
The transition from long stoplists to none in modern IR is really interesting, and a change that I didn't know had happened.  The techniques used to weight indexed documents seem very fascinating, especially considering how the length of documents can vary in a web system (discussed earlier in the chapter).
Case-folding should be performed on tokens that appear at the beginning of a sentence, or in a sentence containing mostly capitalized words.  The remaining terms tend to be correctly capitalized already, but in many cases they must still be treated as equivalent to the case-folded term.
The methods for processing queries (biword, positional, etc) have presumably been greatly improved upon by either Google or Bing, because disk and memory space is essentially a non-issue for them.  But they keep all this information as a trade secret!  Not that it would necessarily make good reading for an introductory text, but still.
The use of data structures in IR really speaks to me.  B-trees are a really powerful way to store huge amounts of data on disk, and if you use a B+ tree for the dictionary, I think trailing wildcard queries can run in slightly better than O(n log n) time.  And rebalancing algorithms for trees are awesome!
Spelling correction in queries is pretty difficult, and online search engines' approach is to guess the intended query from the incidence of other users' queries.  Without this data to rely on, however, other IR systems have to use less successful spelling correction techniques.

Monday, January 6, 2014

Week 1 Readings


FOA 1.1
I have not heard the term 'finding out about' before.  I like it!  It sounds like a personal knowledge management system.  Framing information retrieval this way makes a compelling argument for the usefulness of IR: without it, we are less well equipped to be prepared to make decisions.  The IR process as a dialog is a very important concept to keep in mind; without knowing that the user is satisfied, your search engine has not filled their need.

IES 1.1-1.2
Their definition of IR restricts it to material expressed in a human language.  I don't know if I agree with this.  If the user knows how to create a useful query in a non-human language, and can judge the results as relevant or not, that is IR.  A Google image search using a photo as the query is an example of this that even I have done successfully.
I appreciate the discussion of desktop file search systems and enterprise IR.  These are applications that are professionally as useful as a web search engine, and from my limited experience, have much room for improvement.
The Probabilistic Ranking Principle is important to consider.  It seems likely to always hold, and to indeed be the best way to sort results, but I would like to read the findings of anyone who has challenged this precept.  I know that, for example, the fifth result was most clicked in one study, and others have found that the second page of Google results is not statistically significantly less relevant than the first page.  Maybe the probability of relevance across these pages differs only very slightly.

MIR 1.1-1.4
The IR problem is to retrieve all the relevant documents while returning as few non-relevant documents as possible.  It is interesting that the "all" is absolute but the "few" is not zero.  Why is the goal not to return zero non-relevant documents?  Is this due to partial relevance, or to the idea that relevance changes as the user reformulates the query?
There are many problems in IR currently, including spam, legality, size, speed, and efficiency.

Week 1 Muddiest Points


Are Salton's vector space model and other matching models still relevant since Google's citation analysis?  Should all models strive to be as broad and accurate as Google?

In the list of types of short-term information needs, it was brought up that, just as the comprehensive exhaustive search has an opposite in the null existence search, the question-answering search might have a similar null opposite.  I tried Googling 'who was the 55th president of the united states'.  The first link was incorrect: it was about George W. Bush's second inauguration (the 55th inauguration).  The next two were correct.  Do you feel that this type of information need exists, or could exist with better natural language processing of search queries?