Friday, February 28, 2014

Week 9 Readings

MIR Chapter 10
The user interface should be as unobtrusive as possible and help bring the user clarity in their information-seeking behavior.  There are more complex metrics for ranking systems than recall and precision alone.  A user who has to spend a lot of time with a system may rate it poorly, and some users may not care about recall at all for their information need.
"An empty screen or a blank entry form does not provide clues to help a user decide how to start the search process" - wow, this isn't a negative at all!
I like the Venn diagram visualization of Boolean queries; it seems very useful.  I also like that this reading presents old methods that have been tried in search interfaces.  Before Google standardized every search display into a single style, so many ideas were attempted, and this is an important reference point so we don't waste our time on something already tried and failed.

SUI Chapter 1
The user interface of search has hardly changed at all since 1997.  Novice users are not naturally inclined to pose keyword queries, though; they want to ask natural language questions instead.  Interface design needs to be targeted at a specific user group!  The interface should then be designed, tested, and redesigned to find a balance that meets that group's needs.  This can be a challenge because some principles of interface design pull against each other (keep it consistent, but keep it simple), and because UI design is not an exact science.

SUI Chapter 11
Concordance visualizations are very interesting!  SeeSoft and TextArc look very different from each other, but are both really nice.

Week 8 Muddiest Point

No muddiest point this week!

Friday, February 21, 2014

Week 8 Readings

IIR 9
Synonymy is a problem that I am really interested in!  I look forward to reading this chapter, especially the sections on query expansion.
Relevance feedback has a fundamental problem to me: how can you separate query reformulation from an evolving information need?  I think the end result of either will be improved through relevance feedback, but it seems theoretically unsound.  I'm really glad they begin with the Rocchio algorithm, because my first thought was that relevance feedback is an extension of statistical language models.  That would work by computing a new language model that incorporates the relevant documents into the query's language model.  Rocchio's algorithm was designed for vector space retrieval, but it can also be applied fairly simply to a probabilistic retrieval model.  However, one major difficulty with relevance feedback in general is that users do not want to spend a long time interacting with their search engine.  Also, if a user does perform relevance feedback and the newly generated results set still contains some poor matches, most users will see that as a big failure.  Global relevance feedback works with a thesaurus to expand concepts, but each kind of thesaurus has major challenges, and this is not a very good expansion method.
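The Rocchio update can be sketched in a few lines.  This is a minimal, hypothetical version over dict-based term vectors; the alpha/beta/gamma defaults below are common textbook values, and the toy vocabulary is made up for illustration:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query modification for the vector space model: move the
    query vector toward the centroid of the relevant documents and away
    from the centroid of the non-relevant ones.  Vectors are dicts of
    term -> weight."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms.update(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)  # negative term weights are usually clipped to zero
    return new_q

# toy example over a three-term vocabulary
q1 = rocchio({"jaguar": 1.0},
             relevant=[{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "cat": 0.5}],
             nonrelevant=[{"car": 1.0}])
```

The expanded query keeps "jaguar", picks up "cat" from the relevant documents, and drives "car" to zero, which is exactly the intuition behind the feedback loop.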

Xu and Croft
The term mismatch problem has been identified as an issue for a very long time.  The two classes of solutions, global and local, each have problems.  For this reason, the authors propose a new technique called local context analysis.  It uses co-occurrence of terms between the query and the top returned documents: if a term is infrequent in the collection but co-occurs strongly with the query terms, it is a good term to expand the query with.
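That intuition can be sketched roughly as follows.  This is a simplified stand-in I wrote for the idea (rare in the collection, but co-occurring with the query terms in top passages), not Xu and Croft's actual formula, and the passages and frequencies are invented for illustration:

```python
import math

def lca_score(candidate, query_terms, passages, n_docs, doc_freq):
    """Simplified flavor of local context analysis: a candidate expansion
    term scores high when it co-occurs with the query terms in the
    top-ranked passages but is rare in the collection overall."""
    # idf-style rarity factor from the whole collection
    idf = math.log((n_docs + 1) / (doc_freq.get(candidate, 0) + 1))
    score = 1.0
    for q in query_terms:
        # count passages where the candidate and the query term co-occur
        co = sum(1 for p in passages if candidate in p and q in p)
        score *= (0.1 + co) * idf  # small constant so zero co-occurrence isn't fatal
    return score

passages = [{"heart", "attack", "cardiac"},
            {"heart", "cardiac", "surgery"},
            {"weather", "rain"}]
doc_freq = {"cardiac": 50, "rain": 5000}
s_cardiac = lca_score("cardiac", ["heart"], passages, 100000, doc_freq)
s_rain = lca_score("rain", ["heart"], passages, 100000, doc_freq)
```

Here "cardiac" outscores "rain" as an expansion term for "heart" because it is both rarer in the collection and actually co-occurs in the top passages.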

Wang, Fang, and Zhai
Using negative feedback (no clickthroughs) to load better results on pages 2 and beyond is such an awesome idea!  Four different negative feedback models are presented, along with two ideas for making the TREC collection contain a good number of queries with negative results.  This is so future researchers can perform similar experiments and compare against these initial results.

Harman
The author found three issues in relevance feedback.
1- The probabilistic model could not originally be extended with relevance judgments.  By modifying the Sparck Jones weights, this becomes possible.
2- How to decide which terms to include from the documents judged relevant.  Harman recommends the 20 most relevant terms, though the weighting used to pick them is left open.
3- Diminishing returns over multiple feedback passes.  Harman actually argues against stopping, and in a very pre-WWW mindset recommends looking through many non-relevant documents for thoroughness.
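Since the weighting for picking those 20 terms is left open, here is one hypothetical way it could be done: score each candidate term by its frequency in the relevant documents times an idf factor.  This is my own stand-in, not Harman's actual scheme, and the documents and frequencies are invented:

```python
import math
from collections import Counter

def top_expansion_terms(relevant_docs, doc_freq, n_docs, n=20):
    """Pick the n best terms from the relevant documents to add to the
    query, scoring by frequency-in-relevant-docs times an idf factor."""
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc)
    score = {t: tf[t] * math.log(n_docs / (1 + doc_freq.get(t, 0)))
             for t in tf}
    return sorted(score, key=score.get, reverse=True)[:n]

docs = [["solar", "panel", "cost"], ["solar", "panel", "install"]]
doc_freq = {"solar": 10, "panel": 20, "cost": 5000, "install": 300}
best = top_expansion_terms(docs, doc_freq, n_docs=10000, n=2)
```

Common words like "cost" are pushed down by the idf factor, so the expansion favors terms that are both characteristic of the relevant set and rare overall.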

Monday, February 17, 2014

Week 7 Muddiest Point

Is the cost to the user used in ranking relevant documents in the Binary Independence Model?  Or is this simply one probabilistic algorithm that can be implemented among many choices (such as the classic probabilistic model)?

Saturday, February 15, 2014

Week 6 Readings

Hiemstra and de Vries
This paper presents the possibilities of the statistical language model for retrieval.  It is a great text to introduce the model, because it is framed as a comparison to similar concepts from the three retrieval models we have already learned about.  It seems that this model will work much better on longer queries; for a short query of only one or two words, it will essentially perform just like tf-idf weighting in the vector space model.
It's really interesting that a Boolean query may not be valid under language model retrieval.  An AND joined with an OR ends up weighing a two-term query against a one-term query, and that does not work in the language model.  Instead of stemming, the query can use all variants of a stem joined with OR to perform the same function.  This is pretty cool, and not just useful in the language model, but it seems like it would be slower and use more resources.
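The OR-of-variants trick can be sketched with a tiny Boolean matcher.  The variant lists here are hand-written for illustration; a real system would generate them from a morphological dictionary:

```python
def matches(doc_terms, or_groups):
    """True when the document contains at least one variant from every
    OR-group (the groups themselves are implicitly AND-ed together)."""
    return all(any(v in doc_terms for v in group) for group in or_groups)

# "walk AND dog", with each term replaced by its surface forms
groups = [["walk", "walks", "walked", "walking"], ["dog", "dogs"]]
hit = matches({"she", "walked", "two", "dogs"}, groups)
miss = matches({"she", "runs", "with", "dogs"}, groups)
```

The first document satisfies both groups ("walked", "dogs") while the second fails the walk-group, which is the same behavior stemming would have produced, just paid for with a wider query.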

IIR Chapter 11
Probabilistic retrieval is possible because a user's query is only an approximation of their information need, and the system's document representation is likewise an approximation.  Matching these approximations can therefore itself be estimated, as a probability.

IIR Chapter 12
The language retrieval models are all based on building a language model and asking how likely the words in the other part of the pair are to have come from that language.  The models most likely to have generated the words are ranked highest.  There are three basic ways of doing these comparisons: make each document a language, make each query a language, or do both and compare the two models.  Smoothing is important here too, because a word that does not appear in the small sample defining the model (the document) may still be part of the language.  A real-world equivalent would be hearing someone speak and trying to guess which language the sounds belong to.
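The document-as-language version of this can be sketched as a simple query-likelihood scorer.  This is a minimal illustration using Jelinek-Mercer smoothing (mixing the document model with the collection model), with a made-up two-document collection:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score a document by the (log) probability that its language model
    generated the query.  Jelinek-Mercer smoothing mixes the document's
    term distribution with the collection's, so a query word missing
    from the document does not zero out the whole score."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    log_p = 0.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)
        p_coll = coll_tf[t] / len(collection)
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

doc1 = "the cat sat on the mat".split()
doc2 = "the dog barked at the mailman".split()
collection = doc1 + doc2
s1 = query_likelihood(["cat", "mat"], doc1, collection)
s2 = query_likelihood(["cat", "mat"], doc2, collection)
```

The first document scores higher for the query, but the second still gets a finite score thanks to the collection-model mixture, which is exactly what smoothing buys.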

Friday, February 7, 2014

Week 5 Muddiest Point

Is the dominance of the exact match model (Boolean retrieval) a form of job security for librarians?  Our book says that Lexis-Nexis's vector space retrieval returned better results than the expert librarians Lexis-Nexis employed for Boolean retrieval reference services.  Are there empirical studies of the relevance of results under each model?  I don't understand why those professional searchers still have jobs after what the book mentioned.  Or, at least, why haven't they transitioned to performing natural language searches for patrons after interviewing them to find their information need?

At my internship, I use Boolean retrieval over several medical databases.  And more often than not, I get frustrated after an hour of fruitless searching, Google it, and get perfect results.