latent semantic indexing (LSI) - SVD of the term-document matrix can be used to cluster documents and carry out information retrieval by matching concepts instead of exact words
data is encoded in a term-frequency (TF) matrix
encode the documents to search in a matrix: rows=words, columns=documents, cells=# of times that word appears in that document
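a minimal sketch of building such a matrix in plain Python. the function name `term_document_matrix`, the whitespace tokenizer, and the two sample documents are assumptions made for illustration, not part of the original notes:

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] = count of vocab[i] in docs[j]."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per word, one column per document.
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

docs = ["the cat sat", "the dog sat on the mat"]  # hypothetical corpus
vocab, tf_matrix = term_document_matrix(docs)
print(vocab)                          # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(tf_matrix[vocab.index("the")])  # [1, 2]
```

sorting the vocabulary just keeps the row order reproducible; any fixed word-to-row mapping would work.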
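create a query as an N x 1 vector (N = # of words in the vocabulary; put 1 in the row of each query word and 0 elsewhere -- a query word must be in the vocabulary to be matched), i.e. treat the query as a one-document column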
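the query encoding can be sketched like this. `query_vector` is a hypothetical helper, and silently skipping out-of-vocabulary words is an assumption (the notes only say the word must be in the vocabulary):

```python
def query_vector(query, vocab):
    """Encode a query as a 0/1 vector with one entry per vocabulary word."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0] * len(vocab)
    for word in query.lower().split():
        if word in index:  # assumption: unknown words are ignored
            vec[index[word]] = 1
    return vec

vocab = ['cat', 'dog', 'mat', 'on', 'sat', 'the']  # hypothetical vocabulary
q = query_vector("dog mat bird", vocab)
print(q)  # [0, 1, 1, 0, 0, 0]
```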
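then compare this vector against each document column in the term-frequency matrix to find the most relevant document; use the normalized dot product (cosine similarity) rather than the euclidean distance, because dividing by the vector lengths removes the scale problem: a long document has large counts and ends up far from a short query even when it uses the same words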
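a small sketch of the comparison step, using the normalized dot product (cosine similarity). the names `cosine` and `rank_documents` and the toy matrix are assumptions for illustration. document 1 is document 0 with every count multiplied by 10, so the two should score the same:

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query, matrix):
    """Score every column of a words-by-documents matrix against the query."""
    n_docs = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_docs)]
    scores = [cosine(query, col) for col in columns]
    order = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
    return order, scores

# rows = words (cat, dog, sat), columns = 3 hypothetical documents
matrix = [[1, 10, 0],
          [0,  0, 1],
          [1, 10, 1]]
query = [1, 0, 0]  # query for "cat"
order, scores = rank_documents(query, matrix)
print(order, scores)
```

with a raw dot product, document 1 would score ten times higher than document 0 just because it is longer; the normalization is what makes the two comparable.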
also, to down-weight words that are useless or occur everywhere, e.g. "the", weight the term frequency by the inverse document frequency (tf-idf):
term frequency:
tf(w, d) = # of times w appears in d / total # of words in d
inverse document frequency:
idf(w) = log(D / (1 + Dw)), where D: # of documents in the corpus, Dw: # of documents the word w appears in
tf-idf(w, d) = tf(w, d) * idf(w)
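the two formulas can be tried out with a short sketch. the function names and the three-document corpus are made up for illustration; multiplying tf by idf is the usual tf-idf weighting. note that with the 1 + Dw in the denominator, a word that appears in every document gets a slightly negative idf:

```python
import math

def tf(word, doc_tokens):
    """# of times word appears in the document / total # of words in it."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus):
    """log(D / (1 + Dw)) with D documents, Dw of which contain the word."""
    d_w = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / (1 + d_w))

def tf_idf(word, doc_tokens, corpus):
    return tf(word, doc_tokens) * idf(word, corpus)

# hypothetical corpus, already tokenized
corpus = [["the", "cat", "sat"],
          ["the", "dog", "sat"],
          ["the", "bird", "flew"]]

print(tf_idf("the", corpus[0], corpus))  # negative: "the" is in every document
print(tf_idf("cat", corpus[0], corpus))  # positive: "cat" is in one document
```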