Merging a distributional and a semantic vector space in complex Hilbert space
Posted on 15 July 2013
As described in Wittek et al. (2013), two random indices can easily be merged in a complex space. A random index is useful for querying and text classification, and the same indexing method works for a concept representation. Merging the two representations lets us leverage the strengths of both, yielding higher recall rates. The steps below outline how the merge works.
The modified source code of SemanticVectors is available here. There are notable changes to the original SemanticVectors classes. Most importantly, the distance function in the ComplexVector class was rewritten and the dominant mode was set at
CompoundVectorBuilder class strictly does not normalize vectors. Since all distance functions are inner products, normalization is unnecessary. The concepts are identified by numbers, which SemanticVectors filters out unless the flag filteroutnumbers is set to false. Unfortunately, this flag cannot be flipped from the command line because of a bug in how boolean variables are parsed. A modified FlagConfig class therefore takes false as the default value for filteroutnumbers. There are also extra classes that help with the new complex space (
We assume that the document collection is indexed by Lucene in the folder lucene-index-term_representation. We further assume that a concept representation is available, e.g., created by MetaMap (Aronson and Lang, 2010). The Lucene index of the concept representation is in lucene-index-concept_representation, the folder passed to the second BuildIndex command further down.
The instructions below assume that all data, including the Lucene index directories, are in the same folder as the jar file. Set up environment variables. Notice that there is a new jar dependency:
Build random indices for the term and concept spaces:
java pitt.search.semanticvectors.BuildIndex -luceneindexpath lucene-index-term_representation
java pitt.search.semanticvectors.BuildIndex -docvectorsfile conceptdocvectors -termvectorsfile concepttermvectors -luceneindexpath lucene-index-concept_representation
This will produce four vector spaces: termvectors.bin, docvectors.bin, concepttermvectors.bin, and conceptdocvectors.bin. Note that docvectors.bin and conceptdocvectors.bin are both document vector spaces of the same collection, yet the order in which the document vectors are stored in each file is arbitrary. This does not cause a problem with regular searches, but it is a problem when merging the two document spaces in a complex space, because the merge must pair the two vectors that describe the same document. A shell script is provided to fix the order:
./fix_order.sh conceptdocvectors.bin
./fix_order.sh docvectors.bin
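To illustrate why the order matters, here is a minimal Java sketch of the alignment idea. It is not a port of fix_order.sh, whose contents are not reproduced here. The class DocOrderSketch and its methods are hypothetical, and the sketch assumes that each document vector is keyed by a document identifier and that sorting by that identifier is an acceptable common order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper; not part of SemanticVectors or fix_order.sh. */
final class DocOrderSketch {

    /** Returns the document ids present in both stores, sorted lexicographically. */
    static List<String> sharedSortedIds(Map<String, float[]> termDocs,
                                        Map<String, float[]> conceptDocs) {
        TreeSet<String> ids = new TreeSet<>(termDocs.keySet());
        ids.retainAll(conceptDocs.keySet()); // assumption: unmatched documents are dropped
        return new ArrayList<>(ids);
    }

    /** Rebuilds a store so that iteration follows the given id order. */
    static Map<String, float[]> reorder(Map<String, float[]> store, List<String> ids) {
        Map<String, float[]> ordered = new LinkedHashMap<>();
        for (String id : ids) {
            ordered.put(id, store.get(id));
        }
        return ordered;
    }
}
```

Once both stores are reordered with the same id list, the n-th vector in one store and the n-th vector in the other describe the same document.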
Then the spaces can be merged:
java pitt.search.semanticvectors.ComplexSpaceMerger docvectors.bin conceptdocvectors.bin complexdocvectors.bin
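Conceptually, the merge places one real document vector in the real part and the other in the imaginary part of a single complex vector. The Java sketch below shows that idea only. It is not the ComplexSpaceMerger implementation: the class ComplexMergeSketch is hypothetical, and both the interleaved storage layout and the choice of the term space for the real part are assumptions made for illustration.

```java
/** Illustrative only; not the ComplexSpaceMerger implementation. */
final class ComplexMergeSketch {

    /**
     * Combines two real vectors of equal dimension into one complex vector,
     * stored as interleaved coordinates [re0, im0, re1, im1, ...].
     */
    static float[] merge(float[] termVec, float[] conceptVec) {
        if (termVec.length != conceptVec.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        float[] z = new float[2 * termVec.length];
        for (int j = 0; j < termVec.length; j++) {
            z[2 * j] = termVec[j];        // real part (assumed: term space)
            z[2 * j + 1] = conceptVec[j]; // imaginary part (assumed: concept space)
        }
        return z;
    }
}
```

The sketch only makes sense if both inputs belong to the same document, which is why the order of the two document spaces has to be fixed first.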
If the purpose is information retrieval, the next step is to map the query vectors into the same complex space. Again, the real and imaginary components of a query vector must belong to the same query. Apart from that, querying is straightforward.
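As a rough illustration of scoring a complex query against a complex document, the Java sketch below computes a Hermitian inner product. This is one common choice of inner product on a complex space; the rewritten distance function in the modified SemanticVectors code may differ. The class ComplexQuerySketch is hypothetical, and keeping the real and imaginary parts in separate arrays is an assumption made for readability.

```java
/** Illustrative only; not the distance function of the modified ComplexVector class. */
final class ComplexQuerySketch {

    /**
     * Returns {real, imaginary} of the Hermitian inner product
     * sum_j conj(q_j) * d_j, with q_j = qRe[j] + i*qIm[j] and d_j likewise.
     */
    static double[] hermitian(double[] qRe, double[] qIm, double[] dRe, double[] dIm) {
        int n = qRe.length;
        if (qIm.length != n || dRe.length != n || dIm.length != n) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double re = 0.0;
        double im = 0.0;
        for (int j = 0; j < n; j++) {
            re += qRe[j] * dRe[j] + qIm[j] * dIm[j]; // sum of the two real inner products
            im += qRe[j] * dIm[j] - qIm[j] * dRe[j]; // cross terms between the two parts
        }
        return new double[] {re, im};
    }
}
```

With this definition, the real part is the inner product of the two real components plus the inner product of the two imaginary components, while the imaginary part mixes the two representations. This also shows why the two components of a query must come from the same query: otherwise the real part would add similarities that belong to different information needs.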
Wittek, P.; Koopman, B.; Zuccon, G. & Darányi, S. Combining Word Semantics within Complex Hilbert Space for Information Retrieval. Proceedings of QI-13, 7th International Quantum Interaction Symposium, July, 2013. Leicester, UK.