Paper Recommendations
Integrate your Zotero library to get paper recommendations from arXiv
Overview
This project explores the development of a research paper recommendation system that integrates directly with the user’s Zotero library. The system recommends papers from arXiv at a user-defined frequency, helping researchers keep up with the latest publications in their chosen categories, including submissions to prominent conferences such as NeurIPS and CVPR. The project also investigates how well fine-tuned language models, specifically a BERT-based architecture, generate semantic paper embeddings.
Problem Statement
Researchers often struggle to discover relevant academic papers aligned with their existing libraries and research interests. This project addresses the problem with a system that recommends contextually relevant papers by combining four pieces: the user’s Zotero library, arXiv’s repository, contextual embeddings generated by a fine-tuned language model, and efficient vector similarity search in Pinecone’s vector database.
Methodology
The core methodology uses the SciNCL model, a fine-tuned BERT variant served through Hugging Face’s Inference API. Building on SciBERT (a language model optimized for scientific text), SciNCL uses contrastive learning to separate research papers more clearly, creating larger margins between them in embedding space. The system processes paper titles and abstracts to generate discriminative embeddings that capture semantic relationships between publications.
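The embedding step can be sketched with the Transformers library. This is a minimal sketch, not the project's actual code: it assumes the model is published on Hugging Face under the ID malteos/scincl, that title and abstract are joined with the tokenizer's separator token, and that the [CLS] vector of the last layer serves as the paper embedding. The function names build_inputs and embed_papers are hypothetical.

```python
from typing import Dict, List

# Assumed Hugging Face model ID for SciNCL; adjust if the project uses another checkpoint.
MODEL_NAME = "malteos/scincl"


def build_inputs(papers: List[Dict[str, str]], sep_token: str = "[SEP]") -> List[str]:
    """Join each paper's title and abstract with the separator token."""
    return [p["title"] + sep_token + (p.get("abstract") or "") for p in papers]


def embed_papers(papers: List[Dict[str, str]]):
    """Return one embedding per paper as a tensor of shape (n_papers, hidden_size)."""
    # Imported lazily so build_inputs can be used without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    texts = build_inputs(papers, sep_token=tokenizer.sep_token)
    inputs = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Use the first token ([CLS]) of the last hidden layer as the paper embedding.
    return outputs.last_hidden_state[:, 0, :]
```

Calling embed_papers([{"title": "...", "abstract": "..."}]) downloads the model on first use, so the first call is slow; later calls reuse the local cache.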
Implementation Details
Setup
- Initialization: Creates a personalized research preference index in Pinecone by processing the user’s Zotero library to generate embeddings
- Routine Operation: Periodically fetches new papers from arXiv using predefined filters, converts them to embeddings, and queries the user’s index for matches. A recommendation is triggered when a similarity score exceeds a defined threshold
- Gradio Interface: Provides a web interface with predefined variables for immediate visualization, including a demo using a computer vision-focused Zotero library
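The routine operation step can be sketched as follows, under several assumptions. The sketch uses the public arXiv API query endpoint and its Atom feed. The helper names (build_query_url, fetch_recent, parse_feed, should_recommend) and the 0.8 threshold are hypothetical. An in-memory cosine similarity check stands in for the Pinecone index query, which the project uses in practice.

```python
import math
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Dict, List, Sequence

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # namespace of the arXiv Atom feed


def build_query_url(category: str, max_results: int = 50) -> str:
    """Build a query URL for the newest submissions in one arXiv category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def fetch_recent(category: str, max_results: int = 50) -> str:
    """Download the Atom feed for a category (performs a network request)."""
    with urllib.request.urlopen(build_query_url(category, max_results)) as resp:
        return resp.read().decode("utf-8")


def parse_feed(xml_text: str) -> List[Dict[str, str]]:
    """Extract id, title, and abstract from each entry of an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            "title": " ".join(title.split()),      # collapse line breaks and extra spaces
            "abstract": " ".join(summary.split()),
        })
    return papers


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_recommend(candidate: Sequence[float],
                     library: Sequence[Sequence[float]],
                     threshold: float = 0.8) -> bool:
    """Recommend when the best match in the library reaches the threshold.

    The threshold value is a placeholder, and the linear scan stands in
    for a query against the user's vector index.
    """
    best = max((cosine(candidate, v) for v in library), default=0.0)
    return best >= threshold
```

A scheduled job could call fetch_recent for each category, embed the parsed papers, and keep only those for which should_recommend returns True. The right threshold depends on the embedding model and would need tuning against a real library.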
Challenges
Latency Optimization: The initial implementation was bottlenecked by the latency of the Hugging Face API. Moving to local model deployment with the Transformers library improved performance significantly.
Technology Stack
- Python
- Pinecone Vector DB
- Hugging Face Inference API
- Zotero API
- arXiv API
- Gradio