Paper Recommendations | Ankith Savio Arogya Dass

Overview

This project develops a research paper recommendation system integrated directly with the user’s Zotero library. At a user-defined frequency, the system recommends new arXiv papers in the user’s chosen categories, including preprints submitted to prominent conferences such as NeurIPS and CVPR, so researchers can keep up with the latest publications. The project also investigates how well a fine-tuned language model, specifically a BERT-based architecture, generates semantic paper embeddings.

Application demonstration video

Problem Statement

Researchers often struggle to discover relevant academic papers aligned with their existing libraries and research interests. This project addresses this challenge by developing a system that recommends contextually relevant papers through analysis of Zotero libraries, arXiv’s repository, contextual embeddings generated via a fine-tuned language model, and efficient vector similarity search using Pinecone’s vector database.

Methodology

The core methodology uses SciNCL, a fine-tuned BERT variant accessed through Hugging Face’s Inference API. SciNCL builds on SciBERT (a language model pre-trained on scientific text) and uses contrastive learning to separate research papers more clearly, producing larger margins between dissimilar papers in the embedding space. The system feeds each paper’s title and abstract to the model to generate discriminative embeddings that capture semantic relationships between publications.
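To make the idea of "larger margins in embedding space" concrete, here is a minimal sketch of a generic triplet margin loss, one common contrastive objective. It is an illustration only: the function names, the Euclidean distance, and the margin value of 1.0 are assumptions, and SciNCL's actual training objective and sampling strategy are described in its own paper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (illustrative, not SciNCL's exact code).

    The loss is zero once the negative paper is at least `margin` farther
    from the anchor than the positive paper is; otherwise it is positive,
    which pushes the negative away and pulls the positive closer.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D "embeddings" (hypothetical values, not real model output).
anchor = [0.0, 0.0]
related = [0.0, 1.0]        # a paper that should sit close to the anchor
far_unrelated = [3.0, 4.0]  # already well separated: no penalty
near_unrelated = [0.0, 1.5] # too close to the anchor: penalized

print(triplet_margin_loss(anchor, related, far_unrelated))
print(triplet_margin_loss(anchor, related, near_unrelated))
```

A training step that minimizes this quantity only moves embeddings for triplets that violate the margin, which is what widens the gap between related and unrelated papers.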

System architecture sequence diagram

Implementation Details

Setup

Initialization: Creates a personalized research preference index in Pinecone by generating embeddings for the papers in the user’s Zotero library.
Routine Operation: Periodically retrieves new papers from arXiv using predefined filters, converts them to embeddings, and queries the user’s index for matches. A paper is recommended when its similarity score exceeds a defined threshold.
Gradio Interface: Provides a web interface with predefined variables for immediate visualization, including a demo that uses a computer vision-focused Zotero library.
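The routine operation described above can be sketched roughly as follows. This is a simplified illustration, not the project's code: the helper names, the threshold of 0.8, and the use of cosine similarity are assumptions, and a plain in-memory dictionary stands in for the Pinecone index so the example runs without credentials. The URL parameters follow the public arXiv API query interface.

```python
import math
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def arxiv_query_url(category, max_results=50):
    """Build an arXiv API URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(candidates, library, threshold=0.8):
    """Return (paper_id, score) pairs whose best library match exceeds threshold.

    `candidates` and `library` map paper ids to embedding vectors. In the
    real system the library side would be a query against the vector
    database rather than a loop over a dict (assumption for illustration).
    """
    hits = []
    for paper_id, vector in candidates.items():
        best = max((cosine(vector, v) for v in library.values()), default=0.0)
        if best > threshold:
            hits.append((paper_id, best))
    # Highest-scoring recommendations first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 2-D vectors (hypothetical values, not real embeddings).
library = {"zotero-item-1": [1.0, 0.0]}
new_papers = {"arxiv-a": [0.9, 0.1], "arxiv-b": [0.0, 1.0]}

print(arxiv_query_url("cs.CV", max_results=10))
print(recommend(new_papers, library))
```

Only "arxiv-a" points in nearly the same direction as the library item, so it is the only paper returned; raising the threshold makes the recommendations stricter.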

Challenges

Latency Optimization: The initial implementation was bottlenecked by Hugging Face API latency. Switching to local model deployment with the Transformers library significantly improved performance.
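A minimal sketch of what local embedding generation with the Transformers library might look like is shown below. It assumes the publicly available `malteos/scincl` checkpoint on the Hugging Face Hub and follows the usage pattern from that model card (title and abstract joined by the separator token, with the first token's hidden state used as the embedding). The function names and batch size are illustrative choices, not the project's actual code.

```python
def format_paper(paper, sep_token):
    """Join title and abstract with the tokenizer's separator token."""
    return paper["title"] + sep_token + (paper.get("abstract") or "")


def embed_papers(papers, model_name="malteos/scincl", batch_size=16):
    """Embed papers locally; returns one list of floats per paper.

    `model_name` is an assumed checkpoint id. The heavy imports live inside
    the function so the module can be loaded without torch installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    vectors = []
    with torch.no_grad():  # no gradients needed for inference
        for start in range(0, len(papers), batch_size):
            batch = papers[start:start + batch_size]
            texts = [format_paper(p, tokenizer.sep_token) for p in batch]
            inputs = tokenizer(
                texts,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors="pt",
            )
            outputs = model(**inputs)
            # The hidden state at position 0 ([CLS]) is the paper embedding.
            vectors.extend(outputs.last_hidden_state[:, 0, :].tolist())
    return vectors


# Hypothetical input record for illustration.
example = {"title": "A Sample Title", "abstract": "A sample abstract."}
print(format_paper(example, "[SEP]"))
```

Loading the model once and embedding papers in batches avoids a network round trip per paper, which is the likely source of the latency gain described above.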

Technology Stack

Python
Pinecone Vector DB
Hugging Face Inference API
Zotero API
ArXiv API
Gradio

Inspiration & Acknowledgments

SciNCL - Fine-tuned embedding model
SciBERT - Base pre-trained language model