Random indexing of text samples for latent semantic analysis (Paper)

2000, eScholarship (California Digital Library), cited by 387
Topic Modeling; Natural Language Processing Techniques; Advanced Text Analysis Techniques

Abstract

[...] SVD, the result is not nearly as good: only 36% correct. The authors conclude that the reorganization of information by SVD somehow corresponds to human psychology. We have studied high-dimensional random distributed representations as models of brainlike representation of information (Kanerva, 1994; Kanerva & Sjodin, 1999). In this poster we report on the use of such a representation to reduce the dimensionality of the original words-by-contexts matrix. The method can be explained by looking at the 60,000 × 30,000 matrix of frequencies above. Assume that each text sample is represented by a 30,000-bit vector with a single 1 marking the place of the sample in a list of all samples, and call it the sample's index vector (i.e., the nth bit of the index vector for the nth text sample is 1; the representation is unitary, or local). Then the words-by-contexts matrix of frequencies can be obtained by the following procedure: every time that the word w occurs in the nth text sample, the [...]
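The abstract breaks off before the procedure is fully stated. As a minimal sketch, assuming the missing step is "add the nth sample's index vector to the row for word w", the construction might look like the following. The function names (`index_vector`, `accumulate_matrix`) and the three-sample toy corpus are hypothetical and chosen only for illustration; they stand in for the 30,000 text samples and do not come from the paper.

```python
def index_vector(n, num_samples):
    """Unitary (one-hot) index vector for the n-th text sample (0-based)."""
    vec = [0] * num_samples
    vec[n] = 1
    return vec


def accumulate_matrix(samples):
    """Build a words-by-contexts frequency matrix from index vectors.

    Assumption (the source text is truncated here): each occurrence of a
    word in sample n adds that sample's index vector to the word's row.
    """
    num_samples = len(samples)
    rows = {}  # word -> list of counts, one entry per text sample
    for n, sample in enumerate(samples):
        idx = index_vector(n, num_samples)
        for word in sample.split():
            row = rows.setdefault(word, [0] * num_samples)
            # Element-wise addition of the index vector to the word's row.
            for j, bit in enumerate(idx):
                row[j] += bit
    return rows


# Hypothetical toy corpus standing in for the real text samples.
samples = ["the cat sat", "the dog sat down", "the cat ran"]
matrix = accumulate_matrix(samples)
```

Under this assumption, each row ends up holding the word's frequency in every text sample, which is exactly the words-by-contexts matrix. With unitary index vectors the row length equals the number of samples, so this step alone does not reduce dimensionality.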