Week 13 Readings
IIR Ch. 13
Text classification is a core problem in IR: when you encounter a new document, which existing documents is it similar to, and which group of documents should it belong to? The classification algorithms presented here are very similar to the algorithms we use for finding document similarity. Conditional independence of terms is also assumed in the classification models and in feature selection, because modeling the dependence between terms, and constantly updating it with new data, is so complex. I like that the chapter introduces many statistical measures, such as chi-square, because these are complex problems with many possible solutions that we should be exposed to.
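As a rough illustration of the chi-square measure mentioned above, here is a minimal sketch that scores one term against one class from a 2x2 table of document counts. The function name, the argument names, and the toy counts are my own assumptions for illustration, not taken from the chapter.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one term and one class.

    n11: documents that contain the term and are in the class
    n10: documents that contain the term and are not in the class
    n01: documents that lack the term and are in the class
    n00: documents that lack the term and are not in the class
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denominator == 0:
        # A row or column of the table is empty, so the term carries no signal.
        return 0.0
    return numerator / denominator


# Hypothetical counts: the term is spread evenly across the class and its complement.
independent = chi_square(10, 10, 10, 10)

# Hypothetical counts: the term appears in every class document and nowhere else.
associated = chi_square(10, 0, 0, 10)
```

A score near zero suggests the term and the class occur independently, so the term is a weak feature; a larger score suggests the term is worth keeping.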
IIR Ch. 14
The contiguity hypothesis is challenging for me to accept. Even in the simplest example we see, with the topics Kenya, UK, and China, this hypothesis implies that every document falls into exactly one class. But a document about trade relations between the UK and China fits into both, so a new class must be formed. That class can then be subdivided again and again, until each class contains only one document! Choosing the right level of specificity is critical here. kNN and Voronoi tessellation seem very complex, and I am not sure how to tessellate the vector space: from the reading, it seems that there should be a cell for each document rather than k cells, and k is much smaller than the number of documents.
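A minimal sketch of the kNN rule may help with the tessellation question. It assumes Euclidean distance and made-up 2D document vectors; the function name `knn_classify` and all of the data are hypothetical. With k=1, the query simply takes the label of the single nearest training document, which is why each training document ends up with its own cell.

```python
import math
from collections import Counter


def knn_classify(query, training, k=3):
    """Label a query vector by majority vote among its k nearest training documents.

    training is a list of (vector, label) pairs; distance is Euclidean (an assumption).
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2D document vectors for the three topics in the example.
training = [
    ((1.0, 1.0), "Kenya"),
    ((1.2, 0.8), "Kenya"),
    ((5.0, 5.0), "UK"),
    ((5.2, 4.8), "UK"),
    ((9.0, 1.0), "China"),
    ((8.8, 1.2), "China"),
]

label_1nn = knn_classify((5.1, 4.0), training, k=1)
label_3nn = knn_classify((8.0, 2.0), training, k=3)
```

Note that the rule never builds the tessellation explicitly: the cells are only implied by which training documents turn out to be nearest to each possible query.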