Week 13 Readings
IIR Ch. 13
Text classification is a core problem in IR: when you encounter a new document, which existing documents are similar to it, and which group of documents should it belong to? The classification algorithms presented here are very similar to the algorithms we use for finding document similarity. Conditional independence of terms is also assumed in classification models like Naive Bayes, because actually modeling the dependence between terms is too complex for a learning model that must be constantly updated with new data. I like that the chapter introduces many statistical measures, such as chi-square for feature selection, because these are complex problems with many possible solutions, and we should be exposed to the options.
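As a note to myself, the chi-square measure for feature selection boils down to a single formula over a 2x2 contingency table of term/class co-occurrence. Here is a minimal sketch (the counts in the example are made up, not from the book):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one (term, class) pair from a 2x2 contingency table.

    n11: docs in the class containing the term   n10: docs outside the class containing it
    n01: docs in the class lacking the term      n00: docs outside the class lacking it
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

# Hypothetical counts; in feature selection we would rank all terms by this
# score and keep the highest-scoring ones for the classifier.
print(chi_square(30, 10, 20, 40))  # ≈ 16.67
```

A high score means the term and the class are far from independent, so the term is a useful feature for that class.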
IIR Ch. 14
The contiguity hypothesis is challenging for me to accept. Even in the simplest example we see, containing the topics Kenya, UK, and China, this hypothesis states that each document falls into one class only. But a document about trade relations between the UK and China fits into both. So, a new class must be formed. But then classes can be subdivided again and again, until each class contains only one document! Choosing the right level of specificity is critical here. kNN and Voronoi tessellation seem very complex, and I am not sure how to tessellate the vector space; from the reading it seems that there should be one cell per training document, not k cells, and the number of documents is much larger than k.
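Writing out the algorithm helped me here: the Voronoi cells belong to the training documents (one cell per document, which is why there are so many of them), while k only controls how many neighbors vote at classification time. A minimal sketch with made-up two-dimensional document vectors and labels:

```python
from collections import Counter

def knn_classify(doc, train_docs, train_labels, k=3):
    """Label a document by majority vote among its k nearest training documents.

    Similarity is the dot product, which equals cosine similarity if the
    vectors are length-normalized.
    """
    sims = [sum(a * b for a, b in zip(doc, d)) for d in train_docs]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two topics from the chapter's example.
train = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]
labels = ["UK", "UK", "China", "China", "China"]
print(knn_classify((0.3, 0.7), train, labels, k=3))  # → China
```

With k=1 the decision boundary is exactly the Voronoi tessellation of the training documents; larger k smooths that boundary by letting several cells vote.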
Friday, April 11, 2014
Monday, April 7, 2014
Week 12 Muddiest Point
It seems that whatever takes the most processing power will end up being the best way to do anything in IR. Does creating a personalized search system require the most processing? Do we also see this model of adaptive IR perform better than modifying queries with personalized terms or reranking returned documents with a user model?
Friday, April 4, 2014
Week 12 Readings
Gauch et al.
I am super interested in this! Especially explicit user profiling; I wonder how systems can get users to want to give this information. The interchangeability of implicit and explicit feedback in studies is interesting. One would think that explicit feedback would provide more powerful information, but perhaps this is not true.
There is a lot of research into getting at the semantic concepts behind a word. This is a really difficult research issue because it is not static for a user. Even if you can figure out which bank they mean right now, there is no guarantee that this is the only way they use the word. And that first part is difficult, too. The project that explored selected webpages and compared results to WordNet concepts was pretty cool.
Bloedorn et al.'s suggestion to use a hierarchy rings true; in my searching of PubMed, sometimes the expanded term higher up the hierarchy is what I later find I am aiming for. There are a lot of different ideas for how to get user feedback and create a profile of user interest based on it.
Pazzani & Billsus
Many of the mathematical techniques for content recommendation are algorithms we have learned to use in general retrieval models: Rocchio feedback, naive Bayes, multivariate Bernoulli, etc. I like the idea that sometimes the explicit feedback a user gives is as simple as clicking a giant thumbs-up or thumbs-down button. That's really easy, and if you make the change in results meaningful, that's enough incentive to do it, at least for me. Finally, telling the difference between a funny joke and an unfunny one is impossible with any of these techniques.
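To convince myself the thumbs-up/thumbs-down idea really is just the classifiers we already know, here is a minimal sketch of a multivariate Bernoulli naive Bayes recommender trained on that feedback (the documents and labels are made up):

```python
import math

def train_bernoulli_nb(docs, labels):
    """docs: list of term sets; labels: 'up' or 'down' per document.

    Returns class priors and Laplace-smoothed per-class term probabilities.
    """
    vocab = set().union(*docs)
    priors, cond = {}, {}
    for c in set(labels):
        idx = [i for i, label in enumerate(labels) if label == c]
        priors[c] = len(idx) / len(docs)
        # P(term present | class), smoothed: (count + 1) / (docs in class + 2)
        cond[c] = {t: (sum(t in docs[i] for i in idx) + 1) / (len(idx) + 2)
                   for t in vocab}
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    def log_score(c):
        s = math.log(priors[c])
        for t in vocab:  # the Bernoulli model scores absent terms too
            p = cond[c][t]
            s += math.log(p if t in doc else 1 - p)
        return s
    return max(priors, key=log_score)

# Hypothetical feedback: the user liked funny pages, disliked tax pages.
docs = [{"cats", "funny"}, {"funny", "video"}, {"tax", "forms"}, {"tax", "news"}]
labels = ["up", "up", "down", "down"]
priors, cond, vocab = train_bernoulli_nb(docs, labels)
print(classify({"funny", "cats"}, priors, cond, vocab))  # → up
```

The point the authors make still holds, though: this model only learns which terms co-occur with a thumbs-up; it has no way to tell a funny joke from an unfunny one that uses the same words.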
Ahn et al.
This paper presents a method for helping users in exploratory search, by creating a model of their current task and integrating this into the results set. I like the idea of highlighting task terms in snippets. It does bring to mind that users may not be used to this, and will not freely enter queries that stray from the task terms. We are trained to mold our queries into what we know Google likes, and entering a Google search without my actual search terms is crazy. In TaskSieve, this is actually a valid way to search.