What is TRESPI
TRESPI stands for TRansformers for Expansion of SParse Indexes. It is the end product of our research project.
Who are we
We are UC Berkeley MIDS program students: Wade Holmes, Manpreet Khural, Stacy Irwin, and Joanna Wang.
Research Question
We want to know whether we can help a user or institution find documents faster by returning the best document in the top position of the results. This matters because index-based retrieval is core to even the most advanced multi-stage retrieval systems today, so small improvements in these systems have a material effect on the technology experience and performance. Our research question is: Can document indexes for information retrieval incorporate context AND be supplemented to address vocabulary mismatch, as measured by document return rank?
Project Description
For our project, we attempted to combine the HDCT model (Dai and Callan’s HDCT GitHub) and the docT5query model (Nogueira and Lin's GitHub). We built two document indexes, both generated from the MS MARCO Document dataset using HDCT. One index was generated using the standard HDCT framework. The other was generated from documents composed solely of queries produced by docT5query. The two indexes were then combined into a single index. Our goal is an inverted index that reflects a term’s importance within a document, but can also match query terms that are related to a document’s topic yet do not appear in the document. We named our approach TRansformers for Expansion of SParse Indexes, or TRESPI.
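To illustrate the merge conceptually, here is a minimal sketch that linearly combines two HDCT-style per-document term-weight maps. The dictionary layout, function name, and combination weight `alpha` are illustrative assumptions, not the exact code from our repository.

```python
# Hypothetical sketch of the index merge. Both inputs are per-document
# term-weight maps of the kind HDCT produces: {doc_id: {term: weight}}.
# The linear combination weight alpha is an illustrative assumption.
from collections import defaultdict

def merge_indexes(source_weights, query_weights, alpha=0.5):
    """Linearly combine two {doc_id: {term: weight}} maps into one index."""
    merged = defaultdict(dict)
    for doc_id in set(source_weights) | set(query_weights):
        src = source_weights.get(doc_id, {})
        qry = query_weights.get(doc_id, {})
        for term in set(src) | set(qry):
            # Terms that appear only in the generated queries still receive
            # weight, which is what lets the index bridge vocabulary mismatch.
            merged[doc_id][term] = alpha * src.get(term, 0.0) + (1 - alpha) * qry.get(term, 0.0)
    return dict(merged)
```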
Overview of TRESPI
Our Approach
Here is a summary of our approach with TRESPI (a sketch of the query-generation step follows the list):
- Generate an index from the source documents with HDCT
- Generate an index from queries with HDCT:
  - Generate queries from each document with docT5query
  - Combine the queries into a set of query documents
  - Generate an index from the query documents with HDCT
- Merge the source document and query indexes to get the Document Index
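To make the query-generation step concrete, here is a minimal sketch using the publicly released doc2query T5 checkpoint via Hugging Face transformers. The checkpoint name, sampling parameters, and helper function are illustrative assumptions rather than our exact pipeline code.

```python
# Sketch of the query-generation step with docT5query via Hugging Face
# transformers. The checkpoint name refers to the publicly released
# doc2query T5 model; sampling parameters are illustrative, not tuned.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("castorini/doc2query-t5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/doc2query-t5-base-msmarco")

def generate_queries(document_text, num_queries=5):
    """Sample predicted queries for one document."""
    input_ids = tokenizer.encode(document_text, truncation=True,
                                 max_length=512, return_tensors="pt")
    outputs = model.generate(input_ids, max_length=64, do_sample=True,
                             top_k=10, num_return_sequences=num_queries)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# The sampled queries for a document are concatenated into a synthetic
# "query document", which is then fed through HDCT like any other document.
```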
How well did we do
Data set
For our research we used Microsoft’s Machine Reading Comprehension dataset (MS MARCO), which is designed for research into NLP and information retrieval. MS MARCO was generated from Bing queries and enriched with human-verified “best results”. The table below shows an example of the dataset format: a query paired with its most relevant document.
| Query | Relevant Document ID | Relevant Document Title |
|---|---|---|
| What are the two essential constituent elements of plain carbon steel? | D1862900 | Difference Between Alloy Steel and Carbon Steel |
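As a hedged illustration of working with the collection, here is a minimal sketch of streaming documents from the public MS MARCO release. The msmarco-docs.tsv file name and its docid/url/title/body column layout follow the standard distribution; the function itself is illustrative, not code from our repository.

```python
# Minimal sketch of reading the MS MARCO document collection. Assumes the
# standard msmarco-docs.tsv layout: docid, url, title, body (tab-separated).
def load_documents(path="msmarco-docs.tsv"):
    """Yield (doc_id, title, body) tuples from the MS MARCO document TSV."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc_id, url, title, body = line.rstrip("\n").split("\t")
            yield doc_id, title, body
```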
Demo
Here is our Demo Notebook, where you can see the results of two queries run through TRESPI.
Module Evaluation
Our module evaluation is based on the Mean Reciprocal Rank (MRR) criterion. All results are baselined against DeepCT with the default recommended hyperparameters. Using a T5-generated document body, we outperformed the baseline by 0.058. Improving the passage normalization function gained another 0.04, and with hyperparameter tuning we exceeded the baseline by 0.084.
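For readers unfamiliar with the metric, MRR averages the reciprocal of the rank at which the first relevant document appears for each query. A minimal sketch, with illustrative variable names:

```python
# Minimal sketch of Mean Reciprocal Rank: for each query, take the
# reciprocal of the 1-based rank of the first relevant document in the
# returned list (0 if it never appears), then average over all queries.
def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    total = 0.0
    for query_id, ranked_ids in ranked_ids_per_query.items():
        relevant_id = relevant_id_per_query[query_id]
        rr = 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == relevant_id:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_ids_per_query)
```

For example, if the relevant document is returned in position 1 for one query and position 4 for another, the MRR is (1 + 0.25) / 2 = 0.625.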
Carry on the work
For those who are interested in our research project and want to continue the work, we have some recommendations:
- Explore different evaluation metrics that take into account more than just the predicted rank of the most relevant document
  - MRR vs. Precision vs. nDCG vs. Spearman’s Rank Correlation Coefficient
- Explore larger datasets such as ClueWeb, and look at other datasets with passage-level context; we observed incremental gains based on how passage-level term weights are handled
- Train docT5query specifically for the task of vocabulary expansion
- Refine query generation to eliminate useless terms and increase synonym capture. Zero in on the vocabulary expansion problem to relentlessly promote terms relevant to the topic; this model still places too much emphasis on term frequency
- Investigate neural-network-based term weight combination
Additional Resources
The code for our project, along with detailed explanations, is included in our GitHub repository.
References
- Z. Dai, J. Callan. 2019. Context-Aware Sentence/Passage Term Importance Estimation for First Stage Retrieval. https://arxiv.org/abs/1910.10687
- Z. Dai, J. Callan. 2020. Context-Aware Document Term Weighting for Ad-Hoc Search. In Proc. of The Web Conference 2020, pp. 1897-1907. https://dl.acm.org/doi/10.1145/3366423.3380258
- J. Devlin, M. Chang, K. Lee, K. Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. https://aclanthology.org/N19-1423.pdf
- J. Lin, R. Nogueira, A. Yates. 2020. Pretrained Transformers for Text Ranking: BERT and Beyond. https://arxiv.org/pdf/2010.06467
- R. Nogueira, W. Yang, J. Lin, K. Cho. 2019. Document Expansion by Query Prediction. https://arxiv.org/abs/1904.08375
- R. Nogueira, J. Lin. 2019. From doc2query to docTTTTTquery. https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery.pdf
- S. Robertson, H. Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. Vol. 3, No. 4 (2009) pp. 333-389. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.437.660&rep=rep1&type=pdf