Thursday, September 15, 2011

latent semantic indexing

latent semantic indexing (lsi) - applies SVD to the term-document matrix so that documents can be clustered and retrieved by concept instead of exact word-matching

data encoded in term-frequency matrix (TF)

encode documents to search in a matrix: rows=words, columns=documents, cells=# of times that word appears in that document
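a minimal sketch of building that matrix, assuming whitespace tokenization and lowercasing (the function name and the sample documents are made up for illustration):

```python
from collections import Counter

def build_term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[i] in docs[j]."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # Rows are the sorted set of all words seen in any document.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per word, one column per document.
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

docs = ["the cat sat", "the dog sat on the cat", "dogs chase cats"]
vocab, tf_matrix = build_term_document_matrix(docs)

for word, row in zip(vocab, tf_matrix):
    print(word, row)
```

note that "cat" and "cats" end up in separate rows here; a real pipeline would probably stem or lemmatize first.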

create a query as an Nx1 vector (N = # of words in the vocabulary; mark 1 in the row of each query word, 0 elsewhere; every query word must be in the vocabulary), i.e. treat the query as a 1-document column

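then compare this vector against each document column in the term-frequency matrix to find the most relevant document; use the dot product of length-normalized vectors (cosine similarity) rather than the euclidean distance, because it compares direction only and so avoids the scale problem: a long document and a short one with the same word proportions score the same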
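a small sketch of the comparison, assuming the matrix layout described above (rows=words, columns=documents); the helper names and the toy counts are hypothetical:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def score_documents(query_words, vocab, matrix):
    """Return one cosine score per document column."""
    # Query vector: 1 in the row of each query word, 0 elsewhere.
    query = [1 if word in query_words else 0 for word in vocab]
    n_docs = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_docs)]
    return [cosine_similarity(query, column) for column in columns]

vocab = ["cat", "dog", "sat"]
matrix = [
    [1, 0, 4],  # cat
    [0, 2, 0],  # dog
    [1, 1, 4],  # sat
]
scores = score_documents({"cat"}, vocab, matrix)
print(scores)
```

document 2 is document 0 repeated four times, so the raw dot product with the query would be 4 versus 1, while the cosine scores of the two agree up to rounding.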

normalization

also, to down-weight words that are useless or occur everywhere, e.g. "the", weight each cell by term frequency times inverse document frequency (tf-idf)

term frequency:
  tf = total # of times w appears in d / total number of words in d

inverse document frequency:
  idf = log (D / (1+Dw))        D:# of documents in corpus, Dw:# of documents the word appears in
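a sketch of the weighting under the two formulas above, assuming the natural log and the same rows=words, columns=documents layout; the function name and the toy counts are made up:

```python
import math

def tf_idf(matrix):
    """Weight each count by tf * idf, with idf = log(D / (1 + Dw))."""
    n_docs = len(matrix[0])  # D
    # Total number of words in each document (column sums).
    doc_lengths = [sum(row[j] for row in matrix) for j in range(n_docs)]
    weighted = []
    for row in matrix:
        docs_with_word = sum(1 for count in row if count > 0)  # Dw
        idf = math.log(n_docs / (1 + docs_with_word))
        weighted.append([
            (count / doc_lengths[j]) * idf if doc_lengths[j] else 0.0
            for j, count in enumerate(row)
        ])
    return weighted

matrix = [
    [2, 1, 1, 3],  # "the": appears in every document
    [1, 0, 0, 0],  # "cat": appears in one document
    [1, 1, 0, 0],  # "sat": appears in two documents
]
weights = tf_idf(matrix)
for row in weights:
    print([round(w, 3) for w in row])
```

with the 1+Dw in the denominator, a word that appears in every document gets a slightly negative idf (log(D/(D+1))), so "the" ends up below both "sat" and "cat" in document 0.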
