{"arxiv_id": "2503.16581v1", "title": "Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models", "authors": ["Zahra Khalila", "Arbi Haza Nasution", "Winda Monika", "Aytug Onan", "Yohei Murakami", "Yasir Bin Ismail Radi", "Noor Mohammad Osmani"], "year": "2025", "abstract": "Accurate and contextually faithful responses are critical when applying large language models (LLMs) to sensitive and domain-specific tasks, such as answering queries related to quranic studies. General-purpose LLMs often struggle with hallucinations, where generated responses deviate from authoritative sources, raising concerns about their reliability in religious contexts. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness. In this study, we investigate 13 open-source LLMs categorized into large (e.g., Llama3:70b, Gemma2:27b, QwQ:32b), medium (e.g., Gemma2:9b, Llama3:8b), and small (e.g., Llama3.2:3b, Phi3:3.8b). A Retrieval-Augmented Generation (RAG) is used to make up for the problems that come with using separate models. This research utilizes a descriptive dataset of Quranic surahs including the meanings, historical context, and qualities of the 114 surahs, allowing the model to gather relevant knowledge before responding. The models are evaluated using three key metrics set by human evaluators: context relevance, answer faithfulness, and answer relevance. The findings reveal that large models consistently outperform smaller models in capturing query semantics and producing accurate, contextually grounded responses. The Llama3.2:3b model, even though it is considered small, does very well on faithfulness (4.619) and relevance (4.857), showing the promise of smaller architectures that have been well optimized. This article examines the trade-offs between model size, computational efficiency, and response quality while using LLMs in domain-specific applications.", "pdf_url": "https://arxiv.org/pdf/2503.16581v1", "local_path": "data\\papers\\2503_16581v1.pdf"}
{"arxiv_id": "2411.18583v1", "title": "Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation", "authors": ["Nurshat Fateh Ali", "Md. Mahdi Mohtasim", "Shakil Mosharrof", "T. Gopi Krishna"], "year": "2024", "abstract": "This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles provides a huge challenge for manual literature review. It has resulted in an increased demand for automation. Developing a system capable of automatically generating the literature reviews from only the PDF files as input is the primary objective of this research work. The effectiveness of several Natural Language Processing (NLP) strategies, such as the frequency-based method (spaCy), the transformer model (Simple T5), and retrieval-augmented generation (RAG) with Large Language Model (GPT-3.5-turbo), is evaluated to meet the primary objective. The SciTLDR dataset is chosen for this research experiment and three distinct techniques are utilized to implement three different systems for auto-generating the literature reviews. The ROUGE scores are used for the evaluation of all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364. The transformer model comes in second place and spaCy is at the last position. Finally, a graphical user interface is created for the best system based on the large language model.", "pdf_url": "https://arxiv.org/pdf/2411.18583v1", "local_path": "data\\papers\\2411_18583v1.pdf"}
{"arxiv_id": "2506.06962v3", "title": "AR-RAG: Autoregressive Retrieval Augmentation for Image Generation", "authors": ["Jingyuan Qi", "Zhiyang Xu", "Qifan Wang", "Lifu Huang"], "year": "2025", "abstract": "We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level visual references, enabling the model to respond to evolving generation needs while avoiding limitations (e.g., over-copying, stylistic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.", "pdf_url": "https://arxiv.org/pdf/2506.06962v3", "local_path": "data\\papers\\2506_06962v3.pdf"}
{"arxiv_id": "2510.22344v1", "title": "FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation", "authors": ["Mohammad Aghajani Asl", "Majid Asgari-Bidhendi", "Behrooz Minaei-Bidgoli"], "year": "2025", "abstract": "While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.", "pdf_url": "https://arxiv.org/pdf/2510.22344v1", "local_path": "data\\papers\\2510_22344v1.pdf"}
{"arxiv_id": "2504.13684v1", "title": "Intelligent Interaction Strategies for Context-Aware Cognitive Augmentation", "authors": ["Xiangrong Zhu", "Yuan Xu", "Tianjian Liu", "Jingwei Sun", "Yu Zhang", "Xin Tong"], "year": "2025", "abstract": "Human cognition is constrained by processing limitations, leading to cognitive overload and inefficiencies in knowledge synthesis and decision-making. Large Language Models (LLMs) present an opportunity for cognitive augmentation, but their current reactive nature limits their real-world applicability. This position paper explores the potential of context-aware cognitive augmentation, where LLMs dynamically adapt to users' cognitive states and task environments to provide appropriate support. Through a think-aloud study in an exhibition setting, we examine how individuals interact with multi-modal information and identify key cognitive challenges in structuring, retrieving, and applying knowledge. Our findings highlight the need for AI-driven cognitive support systems that integrate real-time contextual awareness, personalized reasoning assistance, and socially adaptive interactions. We propose a framework for AI augmentation that seamlessly transitions between real-time cognitive support and post-experience knowledge organization, contributing to the design of more effective human-centered AI systems.", "pdf_url": "https://arxiv.org/pdf/2504.13684v1", "local_path": "data\\papers\\2504_13684v1.pdf"}
{"arxiv_id": "2402.12317v2", "title": "EVOR: Evolving Retrieval for Code Generation", "authors": ["Hongjin Su", "Shuyang Jiang", "Yuhang Lai", "Haoyuan Wu", "Boao Shi", "Che Liu", "Qian Liu", "Tao Yu"], "year": "2024", "abstract": "Recently the retrieval-augmented generation (RAG) has been successfully applied in code generation. However, existing pipelines for retrieval-augmented code generation (RACG) employ static knowledge bases with a single source, limiting the adaptation capabilities of Large Language Models (LLMs) to domains they have insufficient knowledge of. In this work, we develop a novel pipeline, EVOR, that employs the synchronous evolution of both queries and diverse knowledge bases. On two realistic settings where the external knowledge is required to solve code generation tasks, we compile four new datasets associated with frequently updated libraries and long-tail programming languages, named EVOR-BENCH. Extensive experiments demonstrate that EVOR achieves two to four times of execution accuracy compared to other methods such as Reflexion (Shinn et al., 2024), DocPrompting (Zhou et al., 2023), etc. We demonstrate that EVOR is flexible and can be easily combined with them to achieve further improvement. Further analysis reveals that EVOR benefits from the synchronous evolution of queries and documents and the diverse information sources in the knowledge base. We hope that our studies will inspire more insights into the design of advanced RACG pipelines in future research. Our model, code, and data are available at https://arks-codegen.github.io.", "pdf_url": "https://arxiv.org/pdf/2402.12317v2", "local_path": "data\\papers\\2402_12317v2.pdf"}
{"arxiv_id": "2504.17204v1", "title": "Factually: Exploring Wearable Fact-Checking for Augmented Truth Discernment", "authors": ["Chitralekha Gupta", "Hanjun Wu", "Praveen Sasikumar", "Shreyas Sridhar", "Priambudi Bagaskara", "Suranga Nanayakkara"], "year": "2025", "abstract": "Wearable devices are transforming human capabilities by seamlessly augmenting cognitive functions. In this position paper, we propose a voice-based, interactive learning companion designed to amplify and extend cognitive abilities through informal learning. Our vision is threefold: (1) to enable users to discover new knowledge on-the-go through contextual interactive quizzes, fostering critical thinking and mindfulness, (2) to proactively detect misinformation, empowering users to critically assess information in real time, and (3) to provide spoken language correction and prompting hints for second language learning and effective communication. As an initial step toward this vision, we present Factually - a proactive, wearable fact-checking system integrated into devices like smartwatches or rings. Factually discreetly alerts users to potential falsehoods via vibrotactile feedback, helping them assess information critically. We demonstrate its utility through three illustrative scenarios, highlighting its potential to extend cognitive abilities for real-time misinformation detection. Early qualitative feedback suggests that Factually can enhance users' fact-checking capabilities, offering both practical and experiential benefits.", "pdf_url": "https://arxiv.org/pdf/2504.17204v1", "local_path": "data\\papers\\2504_17204v1.pdf"}
{"arxiv_id": "2502.00306v2", "title": "Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation", "authors": ["Ali Naseh", "Yuefeng Peng", "Anshuman Suri", "Harsh Chaudhari", "Alina Oprea", "Amir Houmansadr"], "year": "2025", "abstract": "Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generate grounded responses by leveraging external knowledge databases without altering model parameters. Although the absence of weight tuning prevents leakage via model parameters, it introduces the risk of inference adversaries exploiting retrieved documents in the model's context. Existing methods for membership inference and data extraction often rely on jailbreaking or carefully crafted unnatural queries, which can be easily detected or thwarted with query rewriting techniques common in RAG systems. In this work, we present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore. By crafting natural-text queries that are answerable only with the target document's presence, our approach demonstrates successful inference with just 30 queries while remaining stealthy; straightforward detectors identify adversarial prompts from existing methods up to ~76x more frequently than those generated by our attack. We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations, all while costing less than $0.02 per document inference.", "pdf_url": "https://arxiv.org/pdf/2502.00306v2", "local_path": "data\\papers\\2502_00306v2.pdf"}
{"arxiv_id": "2605.12335v1", "title": "EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records", "authors": ["Saeed Shurrab", "Mariam Al-Omari", "Dana El Samad", "Farah E. Shamout"], "year": "2026", "abstract": "Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.", "pdf_url": "https://arxiv.org/pdf/2605.12335v1", "local_path": "data\\papers\\2605_12335v1.pdf"}
{"arxiv_id": "2504.14689v1", "title": "Designing AI Systems that Augment Human Performed vs. Demonstrated Critical Thinking", "authors": ["Katelyn Xiaoying Mei", "Nic Weber"], "year": "2025", "abstract": "The recent rapid advancement of LLM-based AI systems has accelerated our search and production of information. While the advantages brought by these systems seemingly improve the performance or efficiency of human activities, they do not necessarily enhance human capabilities. Recent research has started to examine the impact of generative AI on individuals' cognitive abilities, especially critical thinking. Based on definitions of critical thinking across psychology and education, this position paper proposes the distinction between demonstrated and performed critical thinking in the era of generative AI and discusses the implication of this distinction in research and development of AI systems that aim to augment human critical thinking.", "pdf_url": "https://arxiv.org/pdf/2504.14689v1", "local_path": "data\\papers\\2504_14689v1.pdf"}
{"arxiv_id": "2503.01763v2", "title": "Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models", "authors": ["Zhengliang Shi", "Yuhan Wang", "Lingyong Yan", "Pengjie Ren", "Shuaiqiang Wang", "Dawei Yin", "Zhaochun Ren"], "year": "2025", "abstract": "Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from the real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even the models with strong performance in conventional IR benchmarks, exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially optimizes the tool retrieval ability of IR models.", "pdf_url": "https://arxiv.org/pdf/2503.01763v2", "local_path": "data\\papers\\2503_01763v2.pdf"}
{"arxiv_id": "2507.23334v2", "title": "MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation", "authors": ["Daeyong Kwon", "SeungHeon Doh", "Juhan Nam"], "year": "2025", "abstract": "Recent advancements in Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilizes context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.", "pdf_url": "https://arxiv.org/pdf/2507.23334v2", "local_path": "data\\papers\\2507_23334v2.pdf"}
{"arxiv_id": "2311.04589v3", "title": "TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models", "authors": ["Zhen Yang", "Yingxue Zhang", "Fandong Meng", "Jie Zhou"], "year": "2023", "abstract": "Despite Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they are still struggling to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we propose TEAL (Tokenize and Embed ALl), an approach to treat the input from any modality as a token sequence and learn a joint embedding space for all modalities. Specifically, for the input from any modality, TEAL first discretizes it into a token sequence with the off-the-shelf tokenizer and embeds the token sequence into a joint embedding space with a learnable embedding matrix. MM-LLMs just need to predict the multi-modal tokens autoregressively as the textual LLMs do. Finally, the corresponding de-tokenizer is applied to generate the output in each modality based on the predicted token sequence. With the joint embedding space, TEAL enables the frozen LLMs to perform both understanding and generation tasks involving non-textual modalities, such as image and audio. Thus, the textual LLM can just work as an interface and maintain its high performance in textual understanding and generation. Experiments show that TEAL achieves substantial improvements in multi-modal understanding, and implements a simple scheme for multi-modal generations.", "pdf_url": "https://arxiv.org/pdf/2311.04589v3", "local_path": "data\\papers\\2311_04589v3.pdf"}
{"arxiv_id": "2505.21439v1", "title": "Towards Better Instruction Following Retrieval Models", "authors": ["Yuchen Zhuang", "Aaron Trinh", "Rushi Qiang", "Haotian Sun", "Chao Zhang", "Hanjun Dai", "Bo Dai"], "year": "2025", "abstract": "Modern information retrieval (IR) models, trained exclusively on standard <query, passage> pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, passage> triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries, then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR further enable efficient representation learning for smaller encoder-only models, facilitating direct embedding-based retrieval. Using this corpus, we train InF-Embed, an instruction-aware Embedding model optimized through contrastive learning and instruction-query attention mechanisms to align retrieval outcomes precisely with user intents. Extensive experiments across five instruction-based retrieval benchmarks demonstrate that InF-Embed significantly surpasses competitive baselines by 8.1% in p-MRR, measuring the instruction-following capabilities.", "pdf_url": "https://arxiv.org/pdf/2505.21439v1", "local_path": "data\\papers\\2505_21439v1.pdf"}
{"arxiv_id": "2309.15217v2", "title": "Ragas: Automated Evaluation of Retrieval Augmented Generation", "authors": ["Shahul Es", "Jithin James", "Luis Espinosa-Anke", "Steven Schockaert"], "year": "2023", "abstract": "We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With Ragas, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.", "pdf_url": "https://arxiv.org/pdf/2309.15217v2", "local_path": "data\\papers\\2309_15217v2.pdf"}
{"arxiv_id": "2501.05032v2", "title": "Enhancing Human-Like Responses in Large Language Models", "authors": ["Ethem Yağız Çalık", "Talha Rüzgar Akkuş"], "year": "2025", "abstract": "This paper explores the advancements in making large language models (LLMs) more human-like. We focus on techniques that enhance natural language understanding, conversational coherence, and emotional intelligence in AI systems. The study evaluates various approaches, including fine-tuning with diverse datasets, incorporating psychological principles, and designing models that better mimic human reasoning patterns. Our findings demonstrate that these enhancements not only improve user interactions but also open new possibilities for AI applications across different domains. Future work will address the ethical implications and potential biases introduced by these human-like attributes.", "pdf_url": "https://arxiv.org/pdf/2501.05032v2", "local_path": "data\\papers\\2501_05032v2.pdf"}
{"arxiv_id": "2310.14025v1", "title": "Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation", "authors": ["Anastasia Kritharoula", "Maria Lymperaiou", "Giorgos Stamou"], "year": "2023", "abstract": "Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of retrieving an image among a set of candidates, which better represents the meaning of an ambiguous word within a given context. In this paper, we make a substantial step towards unveiling this interesting task by applying a varying set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve ambiguity related to the target word. We also study VWSD as a unimodal problem by converting to text-to-text and image-to-image retrieval, as well as question-answering (QA), to fully explore the capabilities of relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable answer generation. On top of all, we train a learn to rank (LTR) model in order to combine our different modules, achieving competitive ranking results. Extensive experiments on VWSD demonstrate valuable insights to effectively drive future directions.", "pdf_url": "https://arxiv.org/pdf/2310.14025v1", "local_path": "data\\papers\\2310_14025v1.pdf"}
{"arxiv_id": "2509.12382v1", "title": "LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation", "authors": ["Anu Pradhan", "Alexandra Ortan", "Apurv Verma", "Madhavan Seshadri"], "year": "2025", "abstract": "The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet's AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.", "pdf_url": "https://arxiv.org/pdf/2509.12382v1", "local_path": "data\\papers\\2509_12382v1.pdf"}
{"arxiv_id": "1912.02145v1", "title": "An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering", "authors": ["Shayne Longpre", "Yi Lu", "Zhucheng Tu", "Chris DuBois"], "year": "2019", "abstract": "To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, as well as query and context paraphrases generated by back-translation. We find a simple negative sampling technique to be particularly effective, even though it is typically used for datasets that include unanswerable questions, such as SQuAD 2.0. When applied in conjunction with per-domain sampling, our XLNet (Yang et al., 2019)-based submission achieved the second best Exact Match and F1 in the MRQA leaderboard competition.", "pdf_url": "https://arxiv.org/pdf/1912.02145v1", "local_path": "data\\papers\\1912_02145v1.pdf"}
{"arxiv_id": "2510.25621v1", "title": "FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering", "authors": ["Mohammad Aghajani Asl", "Behrooz Minaei Bidgoli"], "year": "2025", "abstract": "The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.", "pdf_url": "https://arxiv.org/pdf/2510.25621v1", "local_path": "data\\papers\\2510_25621v1.pdf"}
{"arxiv_id": "2108.12898v1", "title": "Generating Answer Candidates for Quizzes and Answer-Aware Question Generators", "authors": ["Kristiyan Vachev", "Momchil Hardalov", "Georgi Karadzhov", "Georgi Georgiev", "Ivan Koychev", "Preslav Nakov"], "year": "2021", "abstract": "In education, open-ended quiz questions have become an important tool for assessing the knowledge of students. Yet, manually preparing such questions is a tedious task, and thus automatic question generation has been proposed as a possible alternative. So far, the vast majority of research has focused on generating the question text, relying on question answering datasets with readily picked answers, and the problem of how to come up with answer candidates in the first place has been largely ignored. Here, we aim to bridge this gap. In particular, we propose a model that can generate a specified number of answer candidates for a given passage of text, which can then be used by instructors to write questions manually or can be passed as an input to automatic answer-aware question generators. Our experiments show that our proposed answer candidate generation model outperforms several baselines.", "pdf_url": "https://arxiv.org/pdf/2108.12898v1", "local_path": "data\\papers\\2108_12898v1.pdf"}
{"arxiv_id": "1911.11403v1", "title": "SemEval-2015 Task 3: Answer Selection in Community Question Answering", "authors": ["Preslav Nakov", "Lluís Màrquez", "Walid Magdy", "Alessandro Moschitti", "James Glass", "Bilal Randeree"], "year": "2019", "abstract": "Community Question Answering (cQA) provides new interesting research directions to the traditional Question Answering (QA) field, e.g., the exploitation of the interaction between users and the structure of related posts. In this context, we organized SemEval-2015 Task 3 on \"Answer Selection in cQA\", which included two subtasks: (a) classifying answers as \"good\", \"bad\", or \"potentially relevant\" with respect to the question, and (b) answering a YES/NO question with \"yes\", \"no\", or \"unsure\", based on the list of all answers. We set subtask A for Arabic and English on two relatively different cQA domains, i.e., the Qatar Living website for English, and a Quran-related website for Arabic. We used crowdsourcing on Amazon Mechanical Turk to label a large English training dataset, which we released to the research community. Thirteen teams participated in the challenge with a total of 61 submissions: 24 primary and 37 contrastive. The best systems achieved an official score (macro-averaged F1) of 57.19 and 63.7 for the English subtasks A and B, and 78.55 for the Arabic subtask A.", "pdf_url": "https://arxiv.org/pdf/1911.11403v1", "local_path": "data\\papers\\1911_11403v1.pdf"}
{"arxiv_id": "1803.07724v1", "title": "Attention on Attention: Architectures for Visual Question Answering (VQA)", "authors": ["Jasdeep Singh", "Vincent Ying", "Alex Nutkiewicz"], "year": "2018", "abstract": "Visual Question Answering (VQA) is an increasingly popular topic in deep learning research, requiring coordination of natural language processing and computer vision modules into a single architecture. We build upon the model which placed first in the VQA Challenge by developing thirteen new attention mechanisms and introducing a simplified classifier. We performed 300 GPU hours of extensive hyperparameter and architecture searches and were able to achieve an evaluation score of 64.78%, outperforming the existing state-of-the-art single model's validation score of 63.15%.", "pdf_url": "https://arxiv.org/pdf/1803.07724v1", "local_path": "data\\papers\\1803_07724v1.pdf"}
{"arxiv_id": "2105.14013v1", "title": "Feature extraction and evaluation for BioMedical Question Answering", "authors": ["Ankit Shah", "Srishti Singh", "Shih-Yen Tao"], "year": "2021", "abstract": "In this paper, we present our work on the BioASQ pipeline. The goal is to answer four types of questions: summary, yes/no, factoids, and list. Our goal is to empirically evaluate different modules involved: the feature extractor and the sentence selection block. We used our pipeline to test the effectiveness of each module for all kinds of question types and perform error analysis. We defined metrics that are useful for future research related to the BioASQ pipeline critical to improve the performance of the training pipeline.", "pdf_url": "https://arxiv.org/pdf/2105.14013v1", "local_path": "data\\papers\\2105_14013v1.pdf"}
{"arxiv_id": "2106.11517v1", "title": "Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering", "authors": ["Shamane Siriwardhana", "Rivindu Weerasekera", "Elliott Wen", "Suranga Nanayakkara"], "year": "2021", "abstract": "In this paper, we illustrate how to fine-tune the entire Retrieval Augment Generation (RAG) architecture in an end-to-end manner. We highlighted the main engineering challenges that needed to be addressed to achieve this objective. We also compare how end-to-end RAG architecture outperforms the original RAG architecture for the task of question answering. We have open-sourced our implementation in the HuggingFace Transformers library.", "pdf_url": "https://arxiv.org/pdf/2106.11517v1", "local_path": "data\\papers\\2106_11517v1.pdf"}
{"arxiv_id": "2412.07420v1", "title": "RAG-based Question Answering over Heterogeneous Data and Text", "authors": ["Philipp Christmann", "Gerhard Weikum"], "year": "2024", "abstract": "This article presents the QUASAR system for question answering over unstructured text, structured tables, and knowledge graphs, with unified treatment of all sources. The system adopts a RAG-based architecture, with a pipeline of evidence retrieval followed by answer generation, with the latter powered by a moderate-sized language model. Additionally and uniquely, QUASAR has components for question understanding, to derive crisper input for evidence retrieval, and for re-ranking and filtering the retrieved evidence before feeding the most informative pieces into the answer generation. Experiments with three different benchmarks demonstrate the high answering quality of our approach, being on par with or better than large GPT models, while keeping the computational cost and energy consumption orders of magnitude lower.", "pdf_url": "https://arxiv.org/pdf/2412.07420v1", "local_path": "data\\papers\\2412_07420v1.pdf"}
{"arxiv_id": "2407.04255v1", "title": "Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge", "authors": ["Xiangyu Wu", "Zhouyang Chi", "Yang Yang", "Jianfeng Lu"], "year": "2024", "abstract": "In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks(e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task, where the input is an image and a question, guiding the model to answer the question and display the answer as a bounding box on the image. We designed a three-stage solution for this task. Specifically, we used the visual-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage, and continued to fine-tune the model on the competition dataset, transferring the semantic information learned in the first stage to the competition task. Finally, we designed a bounding box matching and replacing post-processing strategy to correct the model's prediction results. Our team achieved a score of 76.342 on the final leaderboard, ranking second.", "pdf_url": "https://arxiv.org/pdf/2407.04255v1", "local_path": "data\\papers\\2407_04255v1.pdf"}
{"arxiv_id": "cs/0107006v1", "title": "Looking Under the Hood : Tools for Diagnosing your Question Answering Engine", "authors": ["Eric Breck", "Marc Light", "Gideon S. Mann", "Ellen Riloff", "Brianne Brown", "Pranav Anand", "Mats Rooth", "Michael Thelen"], "year": "2001", "abstract": "In this paper we analyze two question answering tasks : the TREC-8 question answering task and a set of reading comprehension exams. First, we show that Q/A systems perform better when there are multiple answer opportunities per question. Next, we analyze common approaches to two subproblems: term overlap for answer sentence identification, and answer typing for short answer extraction. We present general tools for analyzing the strengths and limitations of techniques for these subproblems. Our results quantify the limitations of both term overlap and answer typing to distinguish between competing answer candidates.", "pdf_url": "https://arxiv.org/pdf/cs/0107006v1", "local_path": "data\\papers\\cs_0107006v1.pdf"}
{"arxiv_id": "1601.03541v2", "title": "Question Answering on Linked Data: Challenges and Future Directions", "authors": ["Saeedeh Shekarpour", "Denis Lukovnikov", "Ashwini Jaya Kumar", "Kemele Endris", "Kuldeep Singh", "Harsh Thakkar", "Christoph Lange"], "year": "2016", "abstract": "Question Answering (QA) systems are becoming the inspiring model for the future of search engines. While recently, underlying datasets for QA systems have been promoted from unstructured datasets to structured datasets with highly semantic-enriched metadata, but still question answering systems involve serious challenges which cause to be far beyond desired expectations. In this paper, we raise the challenges for building a Question Answering (QA) system especially with the focus of employing structured data (i.e. knowledge graph). This paper provide an exhaustive insight of the known challenges, so far. Thus, it helps researchers to easily spot open rooms for the future research agenda.", "pdf_url": "https://arxiv.org/pdf/1601.03541v2", "local_path": "data\\papers\\1601_03541v2.pdf"}
{"arxiv_id": "1912.00730v1", "title": "SemEval-2017 Task 3: Community Question Answering", "authors": ["Preslav Nakov", "Doris Hoogeveen", "Lluís Màrquez", "Alessandro Moschitti", "Hamdy Mubarak", "Timothy Baldwin", "Karin Verspoor"], "year": "2019", "abstract": "We describe SemEval-2017 Task 3 on Community Question Answering. This year, we reran the four subtasks from SemEval-2016: (A) Question-Comment Similarity, (B) Question-Question Similarity, (C) Question-External Comment Similarity, and (D) Rerank the correct answers for a new question in Arabic, providing all the data from 2015 and 2016 for training, and fresh data for testing. Additionally, we added a new subtask E in order to enable experimentation with Multi-domain Question Duplicate Detection in a larger-scale scenario, using StackExchange subforums. A total of 23 teams participated in the task, and submitted a total of 85 runs (36 primary and 49 contrastive) for subtasks A-D. Unfortunately, no teams participated in subtask E. A variety of approaches and features were used by the participating systems to address the different subtasks. The best systems achieved an official score (MAP) of 88.43, 47.22, 15.46, and 61.16 in subtasks A, B, C, and D, respectively. These scores are better than the baselines, especially for subtasks A-C.", "pdf_url": "https://arxiv.org/pdf/1912.00730v1", "local_path": "data\\papers\\1912_00730v1.pdf"}
{"arxiv_id": "1909.09192v1", "title": "Learning Sparse Mixture of Experts for Visual Question Answering", "authors": ["Vardaan Pahuja", "Jie Fu", "Christopher J. Pal"], "year": "2019", "abstract": "There has been a rapid progress in the task of Visual Question Answering with improved model architectures. Unfortunately, these models are usually computationally intensive due to their sheer size which poses a serious challenge for deployment. We aim to tackle this issue for the specific task of Visual Question Answering (VQA). A Convolutional Neural Network (CNN) is an integral part of the visual processing pipeline of a VQA model (assuming the CNN is trained along with entire VQA model). In this project, we propose an efficient and modular neural architecture for the VQA task with focus on the CNN module. Our experiments demonstrate that a sparsely activated CNN based VQA model achieves comparable performance to a standard CNN based VQA model architecture.", "pdf_url": "https://arxiv.org/pdf/1909.09192v1", "local_path": "data\\papers\\1909_09192v1.pdf"}
{"arxiv_id": "2109.15120v1", "title": "SUper Team at SemEval-2016 Task 3: Building a feature-rich system for community question answering", "authors": ["Tsvetomila Mihaylova", "Pepa Gencheva", "Martin Boyanov", "Ivana Yovcheva", "Todor Mihaylov", "Momchil Hardalov", "Yasen Kiprov", "Daniel Balchev", "Ivan Koychev", "Preslav Nakov", "Ivelina Nikolova", "Galia Angelova"], "year": "2021", "abstract": "We present the system we built for participating in SemEval-2016 Task 3 on Community Question Answering. We achieved the best results on subtask C, and strong results on subtasks A and B, by combining a rich set of various types of features: semantic, lexical, metadata, and user-related. The most important group turned out to be the metadata for the question and for the comment, semantic vectors trained on QatarLiving data and similarities between the question and the comment for subtasks A and C, and between the original and the related question for Subtask B.", "pdf_url": "https://arxiv.org/pdf/2109.15120v1", "local_path": "data\\papers\\2109_15120v1.pdf"}
{"arxiv_id": "1911.08743v1", "title": "SemanticZ at SemEval-2016 Task 3: Ranking Relevant Answers in Community Question Answering Using Semantic Similarity Based on Fine-tuned Word Embeddings", "authors": ["Todor Mihaylov", "Preslav Nakov"], "year": "2019", "abstract": "We describe our system for finding good answers in a community forum, as defined in SemEval-2016, Task 3 on Community Question Answering. Our approach relies on several semantic similarity features based on fine-tuned word embeddings and topics similarities. In the main Subtask C, our primary submission was ranked third, with a MAP of 51.68 and accuracy of 69.94. In Subtask A, our primary submission was also third, with MAP of 77.58 and accuracy of 73.39.", "pdf_url": "https://arxiv.org/pdf/1911.08743v1", "local_path": "data\\papers\\1911_08743v1.pdf"}
{"arxiv_id": "1912.01972v1", "title": "SemEval-2016 Task 3: Community Question Answering", "authors": ["Preslav Nakov", "Lluís Màrquez", "Alessandro Moschitti", "Walid Magdy", "Hamdy Mubarak", "Abed Alhakim Freihat", "James Glass", "Bilal Randeree"], "year": "2019", "abstract": "This paper describes the SemEval--2016 Task 3 on Community Question Answering, which we offered in English and Arabic. For English, we had three subtasks: Question--Comment Similarity (subtask A), Question--Question Similarity (B), and Question--External Comment Similarity (C). For Arabic, we had another subtask: Rerank the correct answers for a new question (D). Eighteen teams participated in the task, submitting a total of 95 runs (38 primary and 57 contrastive) for the four subtasks. A variety of approaches and features were used by the participating systems to address the different subtasks, which are summarized in this paper. The best systems achieved an official score (MAP) of 79.19, 76.70, 55.41, and 45.83 in subtasks A, B, C, and D, respectively. These scores are significantly better than those for the baselines that we provided. For subtask A, the best system improved over the 2015 winner by 3 points absolute in terms of Accuracy.", "pdf_url": "https://arxiv.org/pdf/1912.01972v1", "local_path": "data\\papers\\1912_01972v1.pdf"}
{"arxiv_id": "2409.03708v2", "title": "RAG based Question-Answering for Contextual Response Prediction System", "authors": ["Sriram Veturi", "Saurabh Vaichal", "Reshma Lal Jagadheesh", "Nafis Irtiza Tripto", "Nian Yan"], "year": "2024", "abstract": "Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.", "pdf_url": "https://arxiv.org/pdf/2409.03708v2", "local_path": "data\\papers\\2409_03708v2.pdf"}
{"arxiv_id": "2110.10780v3", "title": "An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)", "authors": ["Sijia Liu", "Andrew Wen", "Liwei Wang", "Huan He", "Sunyang Fu", "Robert Miller", "Andrew Williams", "Daniel Harris", "Ramakanth Kavuluru", "Mei Liu", "Noor Abu-el-rub", "Dalton Schutte", "Rui Zhang", "Masoud Rouhizadeh", "John D. Osborne", "Yongqun He", "Umit Topaloglu", "Stephanie S Hong", "Joel H Saltz", "Thomas Schaffter", "Emily Pfaff", "Christopher G. Chute", "Tim Duong", "Melissa A. Haendel", "Rafael Fuentes", "Peter Szolovits", "Hua Xu", "Hongfang Liu", "National COVID Cohort Collaborative Natural Language Processing Subgroup", "National COVID Cohort Collaborative"], "year": "2021", "abstract": "While we pay attention to the latest advances in clinical natural language processing (NLP), we can notice some resistance in the clinical and translational research community to adopt NLP models due to limited transparency, interpretability, and usability. In this study, we proposed an open natural language processing development framework. We evaluated it through the implementation of NLP algorithms for the National COVID Cohort Collaborative (N3C). Based on the interests in information extraction from COVID-19 related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow to generate texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three different institutions (Mayo Clinic, University of Kentucky, University of Minnesota). The gold standard annotations were tested with a single institution's (Mayo) ruleset. This resulted in performances of 0.876, 0.706, and 0.694 in F-scores for Mayo, Minnesota, and Kentucky test datasets, respectively. The study as a consortium effort of the N3C NLP subgroup demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP study and adoption. Although we use COVID-19 as a use case in this effort, our framework is general enough to be applied to other domains of interest in clinical NLP.", "pdf_url": "https://arxiv.org/pdf/2110.10780v3", "local_path": "data\\papers\\2110_10780v3.pdf"}
{"arxiv_id": "2306.06371v1", "title": "A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text", "authors": ["Jessica López Espejel", "Mahaman Sanoussi Yahaya Alassan", "El Mehdi Chouham", "Walid Dahhane", "El Hassane Ettifouri"], "year": "2023", "abstract": "Java Code Generation consists in generating automatically Java code from a Natural Language Text. This NLP task helps in increasing programmers' productivity by providing them with immediate solutions to the simplest and most repetitive tasks. Code generation is a challenging task because of the hard syntactic rules and the necessity of a deep understanding of the semantic aspect of the programming language. Many works tried to tackle this task using either RNN-based, or Transformer-based models. The latter achieved remarkable advancement in the domain and they can be divided into three groups: (1) encoder-only models, (2) decoder-only models, and (3) encoder-decoder models. In this paper, we provide a comprehensive review of the evolution and progress of deep learning models in Java code generation task. We focus on the most important methods and present their merits and limitations, as well as the objective functions used by the community. In addition, we provide a detailed description of datasets and evaluation metrics used in the literature. Finally, we discuss results of different models on CONCODE dataset, then propose some future directions.", "pdf_url": "https://arxiv.org/pdf/2306.06371v1", "local_path": "data\\papers\\2306_06371v1.pdf"}
{"arxiv_id": "2006.16212v1", "title": "Towards the Study of Morphological Processing of the Tangkhul Language", "authors": ["Mirinso Shadang", "Navanath Saharia", "Thoudam Doren Singh"], "year": "2020", "abstract": "There is no or little work on natural language processing of Tangkhul language. The current work is a humble beginning of morphological processing of this language using an unsupervised approach. We use a small corpus collected from different sources of text books, short stories and articles of other topics. Based on the experiments carried out, the morpheme identification task using morphessor gives reasonable and interesting output despite using a small corpus.", "pdf_url": "https://arxiv.org/pdf/2006.16212v1", "local_path": "data\\papers\\2006_16212v1.pdf"}
{"arxiv_id": "2103.14757v1", "title": "An Automated Multiple-Choice Question Generation Using Natural Language Processing Techniques", "authors": ["Chidinma A. Nwafor", "Ikechukwu E. Onyenwe"], "year": "2021", "abstract": "Automatic multiple-choice question generation (MCQG) is a useful yet challenging task in Natural Language Processing (NLP). It is the task of automatic generation of correct and relevant questions from textual data. Despite its usefulness, manually creating sizeable, meaningful and relevant questions is a time-consuming and challenging task for teachers. In this paper, we present an NLP-based system for automatic MCQG for Computer-Based Testing Examination (CBTE). We used NLP technique to extract keywords that are important words in a given lesson material. To validate that the system is not perverse, five lesson materials were used to check the effectiveness and efficiency of the system. The manually extracted keywords by the teacher were compared to the auto-generated keywords and the result shows that the system was capable of extracting keywords from lesson materials in setting examinable questions. This outcome is presented in a user-friendly interface for easy accessibility.", "pdf_url": "https://arxiv.org/pdf/2103.14757v1", "local_path": "data\\papers\\2103_14757v1.pdf"}
{"arxiv_id": "2408.13040v1", "title": "SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks", "authors": ["Kai-Wei Chang", "Haibin Wu", "Yu-Kai Wang", "Yuan-Kuei Wu", "Hua Shen", "Wei-Cheng Tseng", "Iu-thing Kang", "Shang-Wen Li", "Hung-yi Lee"], "year": "2024", "abstract": "Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneer research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experiment results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, with the advanced speech LMs coming into the stage, the proposed prompting framework attains great potential.", "pdf_url": "https://arxiv.org/pdf/2408.13040v1", "local_path": "data\\papers\\2408_13040v1.pdf"}
{"arxiv_id": "cmp-lg/9803002v1", "title": "Time, Tense and Aspect in Natural Language Database Interfaces", "authors": ["I. Androutsopoulos", "G. D. Ritchie", "P. Thanisch"], "year": "1998", "abstract": "Most existing natural language database interfaces (NLDBs) were designed to be used with database systems that provide very limited facilities for manipulating time-dependent data, and they do not support adequately temporal linguistic mechanisms (verb tenses, temporal adverbials, temporal subordinate clauses, etc.). The database community is becoming increasingly interested in temporal database systems, that are intended to store and manipulate in a principled manner information not only about the present, but also about the past and future. When interfacing to temporal databases, supporting temporal linguistic mechanisms becomes crucial. We present a framework for constructing natural language interfaces for temporal databases (NLTDBs), that draws on research in tense and aspect theories, temporal logics, and temporal databases. The framework consists of a temporal intermediate representation language, called TOP, an HPSG grammar that maps a wide range of questions involving temporal mechanisms to appropriate TOP expressions, and a provably correct method for translating from TOP to TSQL2, TSQL2 being a recently proposed temporal extension of the SQL database language. This framework was employed to implement a prototype NLTDB using ALE and Prolog.", "pdf_url": "https://arxiv.org/pdf/cmp-lg/9803002v1", "local_path": "data\\papers\\cmp-lg_9803002v1.pdf"}
{"arxiv_id": "2106.10153v1", "title": "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", "authors": ["Carmelo Scribano", "Davide Sapienza", "Giorgia Franchini", "Micaela Verucchi", "Marko Bertogna"], "year": "2021", "abstract": "Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, (ii) a convolutional backbone along with a Transformer model to embed the visual information. For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.", "pdf_url": "https://arxiv.org/pdf/2106.10153v1", "local_path": "data\\papers\\2106_10153v1.pdf"}
{"arxiv_id": "2105.14897v1", "title": "Connecting Language and Vision for Natural Language-Based Vehicle Retrieval", "authors": ["Shuai Bai", "Zhedong Zheng", "Xiaohan Wang", "Junyang Lin", "Zhu Zhang", "Chang Zhou", "Yi Yang", "Hongxia Yang"], "year": "2021", "abstract": "Vehicle search is one basic task for the efficient traffic management in terms of the AI City. Most existing practices focus on the image-based vehicle matching, including vehicle re-identification and vehicle tracking. In this paper, we apply one new modality, i.e., the language description, to search the vehicle of interest and explore the potential of this task in the real-world scenario. The natural language-based vehicle search poses one new challenge of fine-grained understanding of both vision and language modalities. To connect language and vision, we propose to jointly train the state-of-the-art vision models with the transformer-based language model in an end-to-end manner. Except for the network structure design and the training strategy, several optimization objectives are also re-visited in this work. The qualitative and quantitative experiments verify the effectiveness of the proposed method. Our proposed method has achieved the 1st place on the 5th AI City Challenge, yielding competitive performance 18.69% MRR accuracy on the private test set. We hope this work can pave the way for the future study on using language description effectively and efficiently for real-world vehicle retrieval systems. The code will be available at https://github.com/ShuaiBai623/AIC2021-T5-CLV.", "pdf_url": "https://arxiv.org/pdf/2105.14897v1", "local_path": "data\\papers\\2105_14897v1.pdf"}
{"arxiv_id": "1807.09844v2", "title": "Modular Mechanistic Networks: On Bridging Mechanistic and Phenomenological Models with Deep Neural Networks in Natural Language Processing", "authors": ["Simon Dobnik", "John D. Kelleher"], "year": "2018", "abstract": "Natural language processing (NLP) can be done using either top-down (theory driven) and bottom-up (data driven) approaches, which we call mechanistic and phenomenological respectively. The approaches are frequently considered to stand in opposition to each other. Examining some recent approaches in deep learning we argue that deep neural networks incorporate both perspectives and, furthermore, that leveraging this aspect of deep learning may help in solving complex problems within language technology, such as modelling language and perception in the domain of spatial cognition.", "pdf_url": "https://arxiv.org/pdf/1807.09844v2", "local_path": "data\\papers\\1807_09844v2.pdf"}
{"arxiv_id": "2101.11436v1", "title": "Challenges Encountered in Turkish Natural Language Processing Studies", "authors": ["Kadir Tohma", "Yakup Kutlu"], "year": "2021", "abstract": "Natural language processing is a branch of computer science that combines artificial intelligence with linguistics. It aims to analyze a language element such as writing or speaking with software and convert it into information. Considering that each language has its own grammatical rules and vocabulary diversity, the complexity of the studies in this field is somewhat understandable. For instance, Turkish is a very interesting language in many ways. Examples of this are agglutinative word structure, consonant/vowel harmony, a large number of productive derivational morphemes (practically infinite vocabulary), derivation and syntactic relations, a complex emphasis on vocabulary and phonological rules. In this study, the interesting features of Turkish in terms of natural language processing are mentioned. In addition, summary info about natural language processing techniques, systems and various sources developed for Turkish are given.", "pdf_url": "https://arxiv.org/pdf/2101.11436v1", "local_path": "data\\papers\\2101_11436v1.pdf"}
{"arxiv_id": "2307.10652v5", "title": "Exploring the Landscape of Natural Language Processing Research", "authors": ["Tim Schopf", "Karim Arabi", "Florian Matthes"], "year": "2023", "abstract": "As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption in recent years. Given the increasing research work in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established topics, identifies trends, and outlines areas for future research remains absent. Contributing to closing this gap, we have systematically classified and analyzed research papers in the ACL Anthology. As a result, we present a structured overview of the research landscape, provide a taxonomy of fields of study in NLP, analyze recent developments in NLP, summarize our findings, and highlight directions for future work.", "pdf_url": "https://arxiv.org/pdf/2307.10652v5", "local_path": "data\\papers\\2307_10652v5.pdf"}
{"arxiv_id": "2312.04649v1", "title": "PyThaiNLP: Thai Natural Language Processing in Python", "authors": ["Wannaphong Phatthiyaphaibun", "Korakot Chaovavanich", "Charin Polpanumas", "Arthit Suriyawongkul", "Lalita Lowphansirikul", "Pattarawat Chormai", "Peerat Limkonchotiwat", "Thanathip Suntorntip", "Can Udomcharoenchaikit"], "year": "2023", "abstract": "We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.", "pdf_url": "https://arxiv.org/pdf/2312.04649v1", "local_path": "data\\papers\\2312_04649v1.pdf"}
{"arxiv_id": "cmp-lg/9705013v1", "title": "FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text", "authors": ["Jerry R. Hobbs", "Douglas Appelt", "John Bear", "David Israel", "Megumi Kameyama", "Mark Stickel", "Mabry Tyson"], "year": "1997", "abstract": "FASTUS is a system for extracting information from natural language text for entry into a database and for other applications. It works essentially as a cascaded, nondeterministic finite-state automaton. There are five stages in the operation of FASTUS. In Stage 1, names and other fixed form expressions are recognized. In Stage 2, basic noun groups, verb groups, and prepositions and some other particles are recognized. In Stage 3, certain complex noun groups and verb groups are constructed. Patterns for events of interest are identified in Stage 4 and corresponding ``event structures'' are built. In Stage 5, distinct event structures that describe the same event are identified and merged, and these are used in generating database entries. This decomposition of language processing enables the system to do exactly the right amount of domain-independent syntax, so that domain-dependent semantic and pragmatic processing can be applied to the right larger-scale structures. FASTUS is very efficient and effective, and has been used successfully in a number of applications.", "pdf_url": "https://arxiv.org/pdf/cmp-lg/9705013v1", "local_path": "data\\papers\\cmp-lg_9705013v1.pdf"}
{"arxiv_id": "2011.05911v1", "title": "Situated Data, Situated Systems: A Methodology to Engage with Power Relations in Natural Language Processing Research", "authors": ["Lucy Havens", "Melissa Terras", "Benjamin Bach", "Beatrice Alex"], "year": "2020", "abstract": "We propose a bias-aware methodology to engage with power relations in natural language processing (NLP) research. NLP research rarely engages with bias in social contexts, limiting its ability to mitigate bias. While researchers have recommended actions, technical methods, and documentation practices, no methodology exists to integrate critical reflections on bias with technical NLP methods. In this paper, after an extensive and interdisciplinary literature review, we contribute a bias-aware methodology for NLP research. We also contribute a definition of biased text, a discussion of the implications of biased NLP systems, and a case study demonstrating how we are executing the bias-aware methodology in research on archival metadata descriptions.", "pdf_url": "https://arxiv.org/pdf/2011.05911v1", "local_path": "data\\papers\\2011_05911v1.pdf"}
{"arxiv_id": "2405.10845v1", "title": "Natural Language Processing for Requirements Traceability", "authors": ["Jin L. C. Guo", "Jan-Philipp Steghöfer", "Andreas Vogelsang", "Jane Cleland-Huang"], "year": "2024", "abstract": "Traceability, the ability to trace relevant software artifacts to support reasoning about the quality of the software and its development process, plays a crucial role in requirements and software engineering, particularly for safety-critical systems. In this chapter, we provide a comprehensive overview of the representative tasks in requirement traceability for which natural language processing (NLP) and related techniques have made considerable progress in the past decade. We first present the definition of traceability in the context of requirements and the overall engineering process, as well as other important concepts related to traceability tasks. Then, we discuss two tasks in detail, including trace link recovery and trace link maintenance. We also introduce two other related tasks concerning when trace links are used in practical contexts. For each task, we explain the characteristics of the task, how it can be approached through NLP techniques, and how to design and conduct the experiment to demonstrate the performance of the NLP techniques. We further discuss practical considerations on how to effectively apply NLP techniques and assess their effectiveness regarding the data set collection, the metrics selection, and the role of humans when evaluating the NLP approaches. Overall, this chapter prepares the readers with the fundamental knowledge of designing automated traceability solutions enabled by NLP in practice.", "pdf_url": "https://arxiv.org/pdf/2405.10845v1", "local_path": "data\\papers\\2405_10845v1.pdf"}
{"arxiv_id": "2005.03812v1", "title": "Comparative Analysis of Word Embeddings for Capturing Word Similarities", "authors": ["Martina Toshevska", "Frosina Stojanovska", "Jovan Kalajdjieski"], "year": "2020", "abstract": "Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.", "pdf_url": "https://arxiv.org/pdf/2005.03812v1", "local_path": "data\\papers\\2005_03812v1.pdf"}
{"arxiv_id": "2604.17982v1", "title": "Mitigating Multimodal Hallucination via Phase-wise Self-reward", "authors": ["Yu Zhang", "Chuyang Sun", "Kehai Chen", "Xuefeng Bai", "Yang Xiang", "Min Zhang"], "year": "2026", "abstract": "Large Vision-Language Models (LVLMs) still struggle with vision hallucination, where generated responses are inconsistent with the visual input. Existing methods either rely on large-scale annotated data for fine-tuning, which incurs massive computational overhead, or employ static post-hoc strategies that overlook the dynamic nature of hallucination emergence. To address these, we introduce a new self-rewarding framework, enabling dynamic hallucination mitigation at inference time without external supervision. On the empirical side, we reveal that visual hallucination exhibits phase-wise dynamic patterns, peaking at the onset of each semantic phase. Drawing on these insights, we propose PSRD (Phase-wise Self-Reward Decoding) for online hallucination correction guided by phase-wise self-reward signals. To reduce the cost of repeated self-evaluation during decoding, we distill the hallucination guidance signal from LVLMs into a lightweight reward model. The reward model subsequently provides on-the-fly guidance for targeted intervention during the decoding process, enabling precise hallucination suppression. The proposed PSRD significantly reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and consistently outperforms existing post-hoc methods across five hallucination evaluation benchmarks for four LVLMs. Further analysis confirms that PSRD effectively mitigates hallucination propagation and achieves a highly controllable trade-off between strong performance and inference efficiency.", "pdf_url": "https://arxiv.org/pdf/2604.17982v1", "local_path": "data\\papers\\2604_17982v1.pdf"}
{"arxiv_id": "2510.18439v3", "title": "Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation", "authors": ["Yasser Hamidullah", "Koel Dutta Chowdhury", "Yusser Al Ghussin", "Shakib Yazdani", "Cennet Oguz", "Josef van Genabith", "Cristina España-Bonet"], "year": "2025", "abstract": "Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.", "pdf_url": "https://arxiv.org/pdf/2510.18439v3", "local_path": "data\\papers\\2510_18439v3.pdf"}
{"arxiv_id": "2606.05868v1", "title": "YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition", "authors": ["PSBC LLM Team", "Huawei LLM Team", "Ruihan Long", "Junjie Wu", "Tianan Zhang", "Duo Zhang", "Yaozong Wu", "Jinbin Fu", "Chang Liu", "Zhentao Tang", "Wenshuang Yang", "Xin Wang", "Zhihao Song", "Ning Huang", "Wenjing Xu", "Shuai Zong", "Shupei Sun", "Sen Wang", "Jing Hu", "Bin Wang", "Xinyu Wang", "Junkui Ju", "Zequn Ding", "Jie Ran", "Man Luo", "Shixiong Kai", "Linkai Hou", "Kaichao Liang", "Hu Zhao", "Yang Zhao", "Shucheng Lin", "Wei Yu", "Chenghan Jiang", "Jingjing Ding", "Jiahui Zhang", "Tian Jin", "Yuhang Zhang", "Dong Guo", "Wei Sun", "Jun Xie", "Jianwei Li", "Lei Cao", "Pei Li", "Jiabin Li", "Jia Yuan", "Rui Yuan", "Jing Zhu", "Mingxuan Yuan", "Zhangcheng Lv", "Xin Jiang", "Xiuhong Fei", "Xiaozhe Ren", "Yulong Li", "Zhipeng Zhang", "Hang Wang", "Zhaohui Xu", "Rui Zhao", "Yibo He", "Xinzhuang Niu"], "year": "2026", "abstract": "Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.", "pdf_url": "https://arxiv.org/pdf/2606.05868v1", "local_path": "data\\papers\\2606_05868v1.pdf"}
{"arxiv_id": "2606.00898v1", "title": "Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs", "authors": ["Volodymyr Ovcharov"], "year": "2026", "abstract": "Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components -- citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) -- enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems -- four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system -- reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.", "pdf_url": "https://arxiv.org/pdf/2606.00898v1", "local_path": "data\\papers\\2606_00898v1.pdf"}
{"arxiv_id": "physics/0512170v1", "title": "Active Amplification of the Terrestrial Albedo to Mitigate Climate Change: An Exploratory Study", "authors": ["Robert M. Hamwey"], "year": "2005", "abstract": "This study explores the potential to enhance the reflectance of solar insolation by the human settlement and grassland components of the Earth's terrestrial surface as a climate change mitigation measure. Preliminary estimates derived using a static radiative transfer model indicate that such efforts could amplify the planetary albedo enough to offset the current global annual average level of radiative forcing caused by anthropogenic greenhouse gases by as much as 30 percent or 0.76 W/m2. Terrestrial albedo amplification may thus extend, by about 25 years, the time available to advance the development and use of low-emission energy conversion technologies which ultimately remain essential to mitigate long-term climate change. However, additional study is needed to confirm the estimates reported here and to assess the economic and environmental impacts of active land-surface albedo amplification as a climate change mitigation measure.", "pdf_url": "https://arxiv.org/pdf/physics/0512170v1", "local_path": "data\\papers\\physics_0512170v1.pdf"}
{"arxiv_id": "2509.03518v1", "title": "Can LLMs Lie? Investigation beyond Hallucination", "authors": ["Haoran Huan", "Mihir Prabhudesai", "Mengning Wu", "Shantanu Jaiswal", "Deepak Pathak"], "year": "2025", "abstract": "Large language models (LLMs) have demonstrated impressive capabilities across a variety of tasks, but their increasing autonomy in real-world applications raises concerns about their trustworthiness. While hallucinations-unintentional falsehoods-have been widely studied, the phenomenon of lying, where an LLM knowingly generates falsehoods to achieve an ulterior objective, remains underexplored. In this work, we systematically investigate the lying behavior of LLMs, differentiating it from hallucinations and testing it in practical scenarios. Through mechanistic interpretability techniques, we uncover the neural mechanisms underlying deception, employing logit lens analysis, causal interventions, and contrastive activation steering to identify and control deceptive behavior. We study real-world lying scenarios and introduce behavioral steering vectors that enable fine-grained manipulation of lying tendencies. Further, we explore the trade-offs between lying and end-task performance, establishing a Pareto frontier where dishonesty can enhance goal optimization. Our findings contribute to the broader discourse on AI ethics, shedding light on the risks and potential safeguards for deploying LLMs in high-stakes environments. Code and more illustrations are available at https://llm-liar.github.io/", "pdf_url": "https://arxiv.org/pdf/2509.03518v1", "local_path": "data\\papers\\2509_03518v1.pdf"}
{"arxiv_id": "2502.11306v1", "title": "Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation", "authors": ["Hieu Nguyen", "Zihao He", "Shoumik Atul Gandre", "Ujjwal Pasupulety", "Sharanya Kumari Shivakumar", "Kristina Lerman"], "year": "2025", "abstract": "Large language models (LLMs) often suffer from hallucination, generating factually incorrect or ungrounded content, which limits their reliability in high-stakes applications. A key factor contributing to hallucination is the use of hard labels during training, which enforce deterministic supervision, encourage overconfidence, and disregard the uncertainty inherent in natural language. To address this, we propose mitigating hallucination through knowledge distillation (KD), where a teacher model provides smoothed soft labels to a student model, reducing overconfidence and improving factual grounding. We apply KD during supervised finetuning on instructional data, evaluating its effectiveness across LLMs from different families. Experimental results on summarization benchmarks demonstrate that KD reduces hallucination compared to standard finetuning while preserving performance on general NLP tasks. These findings highlight KD as a promising approach for mitigating hallucination in LLMs and improving model reliability.", "pdf_url": "https://arxiv.org/pdf/2502.11306v1", "local_path": "data\\papers\\2502_11306v1.pdf"}
{"arxiv_id": "2506.09886v2", "title": "Probabilistic distances-based hallucination detection in LLMs with RAG", "authors": ["Rodion Oblovatny", "Alexandra Kuleshova", "Konstantin Polev", "Alexey Zaytsev"], "year": "2025", "abstract": "Detecting hallucinations in large language models (LLMs) is critical for their safety in many applications. Without proper detection, these systems often provide harmful, unreliable answers. In recent years, LLMs have been actively used in retrieval-augmented generation (RAG) settings. However, hallucinations remain even in this setting, and while numerous hallucination detection methods have been proposed, most approaches are not specifically designed for RAG systems. To overcome this limitation, we introduce a hallucination detection method based on estimating the distances between the distributions of prompt token embeddings and language model response token embeddings. The method examines the geometric structure of token hidden states to reliably extract a signal of factuality in text, while remaining friendly to long sequences. Extensive experiments demonstrate that our method achieves state-of-the-art or competitive performance. It also has transferability from solving the NLI task to the hallucination detection task, making it a fully unsupervised and efficient method with a competitive performance on the final task.", "pdf_url": "https://arxiv.org/pdf/2506.09886v2", "local_path": "data\\papers\\2506_09886v2.pdf"}
{"arxiv_id": "2506.00448v1", "title": "Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization", "authors": ["Suhas BN", "Han-Chin Shing", "Lei Xu", "Mitch Strong", "Jon Burnsky", "Jessica Ofor", "Jordan R. Mason", "Susan Chen", "Sundararajan Srinivasan", "Chaitanya Shivade", "Jack Moriarty", "Joseph Paul Cohen"], "year": "2025", "abstract": "Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain, and construct two datasets for the purpose: A fact-controlled Leave-N-out dataset -- generated by systematically removing facts from source dialogues to induce hallucinated content in summaries; and a natural hallucination dataset -- arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We then develop fact-based approaches that count hallucinations, offering explainability not available with existing methods. Notably, our LLM-based detectors, which we developed using fact-controlled hallucinations, generalize well to detecting real-world clinical hallucinations. This research contributes a suite of specialized metrics supported by expert-annotated datasets to advance faithful clinical summarization systems.", "pdf_url": "https://arxiv.org/pdf/2506.00448v1", "local_path": "data\\papers\\2506_00448v1.pdf"}
{"arxiv_id": "2603.27898v1", "title": "SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation", "authors": ["Tripti Shukla", "Zsolt Kira"], "year": "2026", "abstract": "Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Hallucinations are strongly correlated with attention sink tokens - punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual grounding using both self-attention maps and gradient-based attribution, and measures their spatial agreement. Based on this signal, self-attention distributions are adaptively sharpened or broadened to reinforce grounded regions or suppress unreliable ones. Extensive experiments across diverse hallucination benchmarks demonstrate that SAGE consistently outperforms existing decoding strategies, achieving substantial reductions in hallucination while preserving descriptive coverage, without requiring model retraining or architectural modifications. Our method achieves an average relative improvement of 10.65% on MSCOCO and 7.19% on AMBER across diverse VLM architectures, demonstrating consistent gains in hallucination mitigation.", "pdf_url": "https://arxiv.org/pdf/2603.27898v1", "local_path": "data\\papers\\2603_27898v1.pdf"}
{"arxiv_id": "2510.19507v2", "title": "Teaming LLMs to Detect and Mitigate Hallucinations", "authors": ["Demian Till", "John Smeaton", "Peter Haubrick", "Gouse Saheb", "Florian Graef", "David Berman"], "year": "2025", "abstract": "Recent work has demonstrated state-of-the-art results in large language model (LLM) hallucination detection and mitigation through consistency-based approaches which involve aggregating multiple responses sampled from a single LLM for a given prompt. These approaches help offset limitations stemming from the imperfect data on which LLMs are trained, which includes biases and under-representation of information required at deployment time among other limitations which can lead to hallucinations. We show that extending these single-model consistency methods to combine responses from multiple LLMs with different training data, training schemes and model architectures can result in substantial further improvements in hallucination detection and mitigation capabilities beyond their single-model consistency counterparts. We evaluate this \"consortium consistency\" approach across many model teams from a pool of 15 LLMs and explore under what conditions it is beneficial to team together different LLMs in this manner. Further, we show that these performance improvements often come with reduced inference costs, offsetting a significant drawback with single-model consistency methods.", "pdf_url": "https://arxiv.org/pdf/2510.19507v2", "local_path": "data\\papers\\2510_19507v2.pdf"}
{"arxiv_id": "2509.21473v1", "title": "Are Hallucinations Bad Estimations?", "authors": ["Hude Liu", "Jerry Yao-Chieh Hu", "Jennifer Yuntong Zhang", "Zhao Song", "Han Liu"], "year": "2025", "abstract": "We formalize hallucinations in generative models as failures to link an estimate to any plausible cause. Under this interpretation, we show that even loss-minimizing optimal estimators still hallucinate. We confirm this with a general high probability lower bound on hallucinate rate for generic data distributions. This reframes hallucination as structural misalignment between loss minimization and human-acceptable outputs, and hence estimation errors induced by miscalibration. Experiments on coin aggregation, open-ended QA, and text-to-image support our theory.", "pdf_url": "https://arxiv.org/pdf/2509.21473v1", "local_path": "data\\papers\\2509_21473v1.pdf"}
{"arxiv_id": "2507.20836v4", "title": "First Hallucination Tokens Are Different from Conditional Ones", "authors": ["Jakob Snel", "Seong Joon Oh"], "year": "2025", "abstract": "Large Language Models (LLMs) hallucinate, and detecting these cases is key to ensuring trust. While many approaches address hallucination detection at the response or span level, recent work explores token-level detection, enabling more fine-grained intervention. However, the distribution of hallucination signal across sequences of hallucinated tokens remains unexplored. We leverage token-level annotations from the RAGTruth corpus and find that the first hallucinated token is far more detectable than later ones. This structural property holds across models, suggesting that first hallucination tokens play a key role in token-level hallucination detection. Our code is available at https://github.com/jakobsnl/RAGTruth_Xtended.", "pdf_url": "https://arxiv.org/pdf/2507.20836v4", "local_path": "data\\papers\\2507_20836v4.pdf"}
{"arxiv_id": "2603.16664v1", "title": "Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation", "authors": ["Jiawei Mao", "Hardy Chen", "Haoqin Tu", "Yuhan Wang", "Letian Zhang", "Zeyu Zheng", "Huaxiu Yao", "Zirui Wang", "Cihang Xie", "Yuyin Zhou"], "year": "2026", "abstract": "Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of these evidence, Kestrel verifies them via an LVLM judge for evidence checking, then iteratively self-refine answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., both the integrated self-refinement module and grounding agent contributing an average +2.0% gain on POPE.", "pdf_url": "https://arxiv.org/pdf/2603.16664v1", "local_path": "data\\papers\\2603_16664v1.pdf"}
{"arxiv_id": "2511.09018v1", "title": "Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs", "authors": ["Liu Yu", "Zhonghao Chen", "Ping Kuang", "Zhikun Feng", "Fan Zhou", "Lan Wang", "Gillian Dobbie"], "year": "2025", "abstract": "Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones -- letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL", "pdf_url": "https://arxiv.org/pdf/2511.09018v1", "local_path": "data\\papers\\2511_09018v1.pdf"}
{"arxiv_id": "2409.20550v2", "title": "LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation", "authors": ["Ziyao Zhang", "Yanlin Wang", "Chong Wang", "Jiachi Chen", "Zibin Zheng"], "year": "2024", "abstract": "Code generation aims to automatically generate code from input requirements, significantly enhancing development efficiency. Recent large language models (LLMs) based approaches have shown promising results and revolutionized code generation task. Despite the promising performance, LLMs often generate contents with hallucinations, especially for the code generation scenario requiring the handling of complex contextual dependencies in practical development process. Although previous study has analyzed hallucinations in LLM-powered code generation, the study is limited to standalone function generation. In this paper, we conduct an empirical study to study the phenomena, mechanism, and mitigation of LLM hallucinations within more practical and complex development contexts in repository-level generation scenario. First, we manually examine the code generation results from six mainstream LLMs to establish a hallucination taxonomy of LLM-generated code. Next, we elaborate on the phenomenon of hallucinations, analyze their distribution across different models. We then analyze causes of hallucinations and identify four potential factors contributing to hallucinations. Finally, we propose an RAG-based mitigation method, which demonstrates consistent effectiveness in all studied LLMs. The replication package including code, data, and experimental results is available at https://github.com/DeepSoftwareAnalytics/LLMCodingHallucination", "pdf_url": "https://arxiv.org/pdf/2409.20550v2", "local_path": "data\\papers\\2409_20550v2.pdf"}
{"arxiv_id": "2511.15005v1", "title": "Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation", "authors": ["Moses Kiprono"], "year": "2025", "abstract": "Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.", "pdf_url": "https://arxiv.org/pdf/2511.15005v1", "local_path": "data\\papers\\2511_15005v1.pdf"}
{"arxiv_id": "1906.07008v1", "title": "Hallucinated Adversarial Learning for Robust Visual Tracking", "authors": ["Qiangqiang Wu", "Zhihui Chen", "Lin Cheng", "Yan Yan", "Bo Li", "Hanzi Wang"], "year": "2019", "abstract": "Humans can easily learn new concepts from just a single exemplar, mainly due to their remarkable ability to imagine or hallucinate what the unseen exemplar may look like in different settings. Incorporating such an ability to hallucinate diverse new samples of the tracked instance can help the trackers alleviate the over-fitting problem in the low-data tracking regime. To achieve this, we propose an effective adversarial approach, denoted as adversarial \"hallucinator\" (AH), for robust visual tracking. The proposed AH is designed to firstly learn transferable non-linear deformations between a pair of same-identity instances, and then apply these deformations to an unseen tracked instance in order to generate diverse positive training samples. By incorporating AH into an online tracking-by-detection framework, we propose the hallucinated adversarial tracker (HAT), which jointly optimizes AH with an online classifier (e.g., MDNet) in an end-to-end manner. In addition, a novel selective deformation transfer (SDT) method is presented to better select the deformations which are more suitable for transfer. Extensive experiments on 3 popular benchmarks demonstrate that our HAT achieves the state-of-the-art performance.", "pdf_url": "https://arxiv.org/pdf/1906.07008v1", "local_path": "data\\papers\\1906_07008v1.pdf"}
{"arxiv_id": "1704.05295v1", "title": "Semantic Similarity from Natural Language and Ontology Analysis", "authors": ["Sébastien Harispe", "Sylvie Ranwez", "Stefan Janaqi", "Jacky Montmain"], "year": "2017", "abstract": "Artificial Intelligence federates numerous scientific fields in the aim of developing machines able to assist human operators performing complex treatments -- most of which demand high cognitive skills (e.g. learning or decision processes). Central to this quest is to give machines the ability to estimate the likeness or similarity between things in the way human beings estimate the similarity between stimuli. In this context, this book focuses on semantic measures: approaches designed for comparing semantic entities such as units of language, e.g. words, sentences, or concepts and instances defined into knowledge bases. The aim of these measures is to assess the similarity or relatedness of such semantic entities by taking into account their semantics, i.e. their meaning -- intuitively, the words tea and coffee, which both refer to stimulating beverage, will be estimated to be more semantically similar than the words toffee (confection) and coffee, despite that the last pair has a higher syntactic similarity. The two state-of-the-art approaches for estimating and quantifying semantic similarities/relatedness of semantic entities are presented in detail: the first one relies on corpora analysis and is based on Natural Language Processing techniques and semantic models while the second is based on more or less formal, computer-readable and workable forms of knowledge such as semantic networks, thesaurus or ontologies. (...) Beyond a simple inventory and categorization of existing measures, the aim of this monograph is to convey novices as well as researchers of these domains towards a better understanding of semantic similarity estimation and more generally semantic measures.", "pdf_url": "https://arxiv.org/pdf/1704.05295v1", "local_path": "data\\papers\\1704_05295v1.pdf"}
{"arxiv_id": "2604.25605v1", "title": "Health System Scale Semantic Search Across Unstructured Clinical Notes", "authors": ["Faith Wavinya Mutinda", "Spandana Makeneni", "Anna Lin", "Shivaji Dutta", "Irit R. Rasooly", "Patrick Dibussolo", "Shivani Kamath Belman", "Hessam Shahriari", "Kevin Murphy", "Alex B. Ruan", "Barbara H. Chaiyachati", "Sanjay Chainani", "Robert W. Grundmeier", "Scott M. Haag", "Jeffrey M. Miller", "Heather M. Griffis", "Ian M. Campbell"], "year": "2026", "abstract": "Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement. Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.", "pdf_url": "https://arxiv.org/pdf/2604.25605v1", "local_path": "data\\papers\\2604_25605v1.pdf"}
{"arxiv_id": "2009.13836v1", "title": "SIR: Similar Image Retrieval for Product Search in E-Commerce", "authors": ["Theban Stanley", "Nihar Vanjara", "Yanxin Pan", "Ekaterina Pirogova", "Swagata Chakraborty", "Abon Chaudhuri"], "year": "2020", "abstract": "We present a similar image retrieval (SIR) platform that is used to quickly discover visually similar products in a catalog of millions. Given the size, diversity, and dynamism of our catalog, product search poses many challenges. It can be addressed by building supervised models to tagging product images with labels representing themes and later retrieving them by labels. This approach suffices for common and perennial themes like \"white shirt\" or \"lifestyle image of TV\". It does not work for new themes such as \"e-cigarettes\", hard-to-define ones such as \"image with a promotional badge\", or the ones with short relevance span such as \"Halloween costumes\". SIR is ideal for such cases because it allows us to search by an example, not a pre-defined theme. We describe the steps - embedding computation, encoding, and indexing - that power the approximate nearest neighbor search back-end. We also highlight two applications of SIR. The first one is related to the detection of products with various types of potentially objectionable themes. This application is run with a sense of urgency, hence the typical time frame to train and bootstrap a model is not permitted. Also, these themes are often short-lived based on current trends, hence spending resources to build a lasting model is not justified. The second application is a variant item detection system where SIR helps discover visual variants that are hard to find through text search. We analyze the performance of SIR in the context of these applications.", "pdf_url": "https://arxiv.org/pdf/2009.13836v1", "local_path": "data\\papers\\2009_13836v1.pdf"}
{"arxiv_id": "1612.07710v2", "title": "Set Similarity Search Beyond MinHash", "authors": ["Tobias Christiani", "Rasmus Pagh"], "year": "2016", "abstract": "We consider the problem of approximate set similarity search under Braun-Blanquet similarity $B(\\mathbf{x}, \\mathbf{y}) = |\\mathbf{x} \\cap \\mathbf{y}| / \\max(|\\mathbf{x}|, |\\mathbf{y}|)$. The $(b_1, b_2)$-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets $P$ such that, given a query set $\\mathbf{q}$, if there exists $\\mathbf{x} \\in P$ with $B(\\mathbf{q}, \\mathbf{x}) \\geq b_1$, then we can efficiently return $\\mathbf{x}' \\in P$ with $B(\\mathbf{q}, \\mathbf{x}') > b_2$. We present a simple data structure that solves this problem with space usage $O(n^{1+ρ}\\log n + \\sum_{\\mathbf{x} \\in P}|\\mathbf{x}|)$ and query time $O(|\\mathbf{q}|n^ρ \\log n)$ where $n = |P|$ and $ρ= \\log(1/b_1)/\\log(1/b_2)$. Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of $ρ$ is tight across the parameter space, i.e., for every choice of constants $0 < b_2 < b_1 < 1$. In the case where all sets have the same size our solution strictly improves upon the value of $ρ$ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).", "pdf_url": "https://arxiv.org/pdf/1612.07710v2", "local_path": "data\\papers\\1612_07710v2.pdf"}
{"arxiv_id": "2502.14620v1", "title": "Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity", "authors": ["Xinghan Pan"], "year": "2025", "abstract": "This paper investigates the efficacy of RWKV, a novel language model architecture known for its linear attention mechanism, for generating sentence embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate the semantic similarity captured by embeddings from different hidden layers of a pre-trained RWKV model. The performance is assessed on the Microsoft Research Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared against a GloVe-based baseline. My results indicate that while RWKV embeddings capture some semantic relatedness, they underperform compared to the GloVe baseline in terms of Spearman correlation. I also analyze the inference time and GPU memory usage, highlighting the computational trade-offs associated with RWKV embeddings. The findings suggest that while RWKV offers potential advantages in terms of linear scaling, its zero-shot sentence embedding quality for semantic similarity tasks requires further investigation and potential task-specific fine-tuning to match or exceed simpler baselines.", "pdf_url": "https://arxiv.org/pdf/2502.14620v1", "local_path": "data\\papers\\2502_14620v1.pdf"}
{"arxiv_id": "2401.08281v4", "title": "The Faiss library", "authors": ["Matthijs Douze", "Alexandr Guzhva", "Chengqi Deng", "Jeff Johnson", "Gergely Szilvasy", "Pierre-Emmanuel Mazaré", "Maria Lomeli", "Lucas Hosseini", "Hervé Jégou"], "year": "2024", "abstract": "Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-off space of vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability.", "pdf_url": "https://arxiv.org/pdf/2401.08281v4", "local_path": "data\\papers\\2401_08281v4.pdf"}
{"arxiv_id": "1312.5150v1", "title": "Semantic Jira - Semantic Expert Finder in the Bug Tracking Tool Jira", "authors": ["Velten Heyn", "Adrian Paschke"], "year": "2013", "abstract": "The semantic expert recommender extension for the Jira bug tracking system semantically searches for similar tickets in Jira and recommends experts and links to existing organizational (Wiki) knowledge for each ticket. This helps to avoid redundant work and supports the search and collaboration with experts in the project management and maintenance phase based on semantically enriched tickets in Jira.", "pdf_url": "https://arxiv.org/pdf/1312.5150v1", "local_path": "data\\papers\\1312_5150v1.pdf"}
{"arxiv_id": "1706.00957v1", "title": "Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines", "authors": ["Jan Rygl", "Jan Pomikálek", "Radim Řehůřek", "Michal Růžička", "Vít Novotný", "Petr Sojka"], "year": "2017", "abstract": "Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to `vector similarity searching' over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.", "pdf_url": "https://arxiv.org/pdf/1706.00957v1", "local_path": "data\\papers\\1706_00957v1.pdf"}
{"arxiv_id": "1401.2517v1", "title": "The semantic similarity ensemble", "authors": ["Andrea Ballatore", "Michela Bertolotto", "David C. Wilson"], "year": "2014", "abstract": "Computational measures of semantic similarity between geographic terms provide valuable support across geographic information retrieval, data mining, and information integration. To date, a wide variety of approaches to geo-semantic similarity have been devised. A judgment of similarity is not intrinsically right or wrong, but obtains a certain degree of cognitive plausibility, depending on how closely it mimics human behavior. Thus selecting the most appropriate measure for a specific task is a significant challenge. To address this issue, we make an analogy between computational similarity measures and soliciting domain expert opinions, which incorporate a subjective set of beliefs, perceptions, hypotheses, and epistemic biases. Following this analogy, we define the semantic similarity ensemble (SSE) as a composition of different similarity measures, acting as a panel of experts having to reach a decision on the semantic similarity of a set of geographic terms. The approach is evaluated in comparison to human judgments, and results indicate that an SSE performs better than the average of its parts. Although the best member tends to outperform the ensemble, all ensembles outperform the average performance of each ensemble's member. Hence, in contexts where the best measure is unknown, the ensemble provides a more cognitively plausible approach.", "pdf_url": "https://arxiv.org/pdf/1401.2517v1", "local_path": "data\\papers\\1401_2517v1.pdf"}
{"arxiv_id": "1307.2669v1", "title": "Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm", "authors": ["Hubert Haoyang Duan", "Vladimir Pestov", "Varun Singla"], "year": "2013", "abstract": "We present a supervised learning algorithm for text categorization which has brought the team of authors the 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize overall. The algorithm is quite different from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate to every class label (document category) a point in the space of question. Unlike it is usual in clustering, this point is not a centroid of the category but rather an outlier, a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned a text category as defined by the closest labeled neighbour to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters 21578 dataset.", "pdf_url": "https://arxiv.org/pdf/1307.2669v1", "local_path": "data\\papers\\1307_2669v1.pdf"}
{"arxiv_id": "2110.13151v2", "title": "Self-supervised similarity search for large scientific datasets", "authors": ["George Stein", "Peter Harrington", "Jacqueline Blaum", "Tomislav Medan", "Zarija Lukic"], "year": "2021", "abstract": "We present the use of self-supervised learning to explore and exploit large unlabeled datasets. Focusing on 42 million galaxy images from the latest data release of the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys, we first train a self-supervised model to distill low-dimensional representations that are robust to symmetries, uncertainties, and noise in each image. We then use the representations to construct and publicly release an interactive semantic similarity search tool. We demonstrate how our tool can be used to rapidly discover rare objects given only a single example, increase the speed of crowd-sourcing campaigns, and construct and improve training sets for supervised applications. While we focus on images from sky surveys, the technique is straightforward to apply to any scientific dataset of any dimensionality. The similarity search web app can be found at https://github.com/georgestein/galaxy_search", "pdf_url": "https://arxiv.org/pdf/2110.13151v2", "local_path": "data\\papers\\2110_13151v2.pdf"}
{"arxiv_id": "2006.07180v1", "title": "High-Level ETL for Semantic Data Warehouses -- Full Version", "authors": ["Rudra Pratap Deb Nath", "Oscar Romero", "Torben Bach Pedersen", "Katja Hose"], "year": "2020", "abstract": "The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by the traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Different to other ETL tools, we automate the ETL data flows by creating metadata at the schema level. Therefore, it relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset using it and compare it with the previous programmable framework SETLPROG in terms of productivity, development time and performance. The evaluation shows that 1) SETLCONSTRUCT uses 92% fewer Number of Typed Characters (NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT for generating ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% using SETLAUTO; 3) SETLCONSTRUCT is scalable and has similar performance compared to SETLPROG.", "pdf_url": "https://arxiv.org/pdf/2006.07180v1", "local_path": "data\\papers\\2006_07180v1.pdf"}
{"arxiv_id": "1105.1406v1", "title": "Comparison Latent Semantic and WordNet Approach for Semantic Similarity Calculation", "authors": ["I Wayan Simri Wicaksana", "Bambang Wahyudi"], "year": "2011", "abstract": "Information exchange among many sources in Internet is more autonomous, dynamic and free. The situation drive difference view of concepts among sources. For example, word 'bank' has meaning as economic institution for economy domain, but for ecology domain it will be defined as slope of river or lake. In this paper, we will evaluate latent semantic and WordNet approach to calculate semantic similarity. The evaluation will be run for some concepts from different domain with reference by expert or human. Result of the evaluation can provide a contribution for mapping of concept, query rewriting, interoperability, etc.", "pdf_url": "https://arxiv.org/pdf/1105.1406v1", "local_path": "data\\papers\\1105_1406v1.pdf"}
{"arxiv_id": "2509.05750v1", "title": "Toward Efficient and Scalable Design of In-Memory Graph-Based Vector Search", "authors": ["Ilias Azizi", "Karima Echihab", "Themis Palpanas", "Vassilis Christophides"], "year": "2025", "abstract": "Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, thus, increasing the complexity of their analysis. Vector search is the backbone of many critical analytical tasks, and graph-based methods have become the best choice for analytical tasks that do not require guarantees on the quality of the answers. Although several paradigms (seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer) have been employed to design in-memory graph-based vector search algorithms, a systematic comparison of the key algorithmic advances is still missing. We conduct an exhaustive experimental evaluation of twelve state-of-the-art methods on seven real data collections, with sizes up to 1 billion vectors. We share key insights about the strengths and limitations of these methods; e.g., the best approaches are typically based on incremental insertion and neighborhood diversification, and the choice of the base graph can hurt scalability. Finally, we discuss open research directions, such as the importance of devising more sophisticated data adaptive seed selection and diversification strategies.", "pdf_url": "https://arxiv.org/pdf/2509.05750v1", "local_path": "data\\papers\\2509_05750v1.pdf"}
{"arxiv_id": "2405.05431v2", "title": "Searching for Programmatic Policies in Semantic Spaces", "authors": ["Rubens O. Moraes", "Levi H. S. Lelis"], "year": "2024", "abstract": "Syntax-guided synthesis is commonly used to generate programs encoding policies. In this approach, the set of programs, that can be written in a domain-specific language defines the search space, and an algorithm searches within this space for programs that encode strong policies. In this paper, we propose an alternative method for synthesizing programmatic policies, where we search within an approximation of the language's semantic space. We hypothesized that searching in semantic spaces is more sample-efficient compared to syntax-based spaces. Our rationale is that the search is more efficient if the algorithm evaluates different agent behaviors as it searches through the space, a feature often missing in syntax-based spaces. This is because small changes in the syntax of a program often do not result in different agent behaviors. We define semantic spaces by learning a library of programs that present different agent behaviors. Then, we approximate the semantic space by defining a neighborhood function for local search algorithms, where we replace parts of the current candidate program with programs from the library. We evaluated our hypothesis in a real-time strategy game called MicroRTS. Empirical results support our hypothesis that searching in semantic spaces can be more sample-efficient than searching in syntax-based spaces.", "pdf_url": "https://arxiv.org/pdf/2405.05431v2", "local_path": "data\\papers\\2405_05431v2.pdf"}
{"arxiv_id": "2512.14336v1", "title": "Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure", "authors": ["Jooyeol Yun", "Jaegul Choo"], "year": "2025", "abstract": "Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.", "pdf_url": "https://arxiv.org/pdf/2512.14336v1", "local_path": "data\\papers\\2512_14336v1.pdf"}
{"arxiv_id": "0807.4618v1", "title": "AceWiki: A Natural and Expressive Semantic Wiki", "authors": ["Tobias Kuhn"], "year": "2008", "abstract": "We present AceWiki, a prototype of a new kind of semantic wiki using the controlled natural language Attempto Controlled English (ACE) for representing its content. ACE is a subset of English with a restricted grammar and a formal semantics. The use of ACE has two important advantages over existing semantic wikis. First, we can improve the usability and achieve a shallow learning curve. Second, ACE is more expressive than the formal languages of existing semantic wikis. Our evaluation shows that people who are not familiar with the formal foundations of the Semantic Web are able to deal with AceWiki after a very short learning phase and without the help of an expert.", "pdf_url": "https://arxiv.org/pdf/0807.4618v1", "local_path": "data\\papers\\0807_4618v1.pdf"}
{"arxiv_id": "2105.00813v2", "title": "Transformers: \"The End of History\" for NLP?", "authors": ["Anton Chernyavskiy", "Dmitry Ilvovsky", "Preslav Nakov"], "year": "2021", "abstract": "Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but fundamentally, they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain information sources, which was easy for pre-existing models. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice on two general types of tasks -- segmentation and segment labeling -- and on four datasets that these limitations are indeed harmful and that addressing them, even in some very simple and naive ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion on desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.", "pdf_url": "https://arxiv.org/pdf/2105.00813v2", "local_path": "data\\papers\\2105_00813v2.pdf"}
{"arxiv_id": "2104.12405v2", "title": "A dissemination workshop for introducing young Italian students to NLP", "authors": ["Lucio Messina", "Lucia Busso", "Claudia Roberta Combei", "Ludovica Pannitto", "Alessio Miaschi", "Gabriele Sarti", "Malvina Nissim"], "year": "2021", "abstract": "We describe and make available the game-based material developed for a laboratory run at several Italian science festivals to popularize NLP among young students.", "pdf_url": "https://arxiv.org/pdf/2104.12405v2", "local_path": "data\\papers\\2104_12405v2.pdf"}
{"arxiv_id": "2505.22202v2", "title": "Latent Reasoning via Sentence Embedding Prediction", "authors": ["Hyeonbin Hwang", "Byeongguk Jeon", "Seungone Kim", "Jiyeon Kim", "Hoyeon Chang", "Sohee Yang", "Seungpil Won", "Dohaeng Lee", "Youbin Ahn", "Minjoon Seo"], "year": "2025", "abstract": "Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions - sentences, propositions, and concepts. This contrast raises a central question- Can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains - mathematics, logic, commonsense, and planning - contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.", "pdf_url": "https://arxiv.org/pdf/2505.22202v2", "local_path": "data\\papers\\2505_22202v2.pdf"}
{"arxiv_id": "2104.12422v2", "title": "Teaching NLP with Bracelets and Restaurant Menus: An Interactive Workshop for Italian Students", "authors": ["Ludovica Pannitto", "Lucia Busso", "Claudia Roberta Combei", "Lucio Messina", "Alessio Miaschi", "Gabriele Sarti", "Malvina Nissim"], "year": "2021", "abstract": "Although Natural Language Processing (NLP) is at the core of many tools young people use in their everyday life, high school curricula (in Italy) do not include any computational linguistics education. This lack of exposure makes the use of such tools less responsible than it could be and makes choosing computational linguistics as a university degree unlikely. To raise awareness, curiosity, and longer-term interest in young people, we have developed an interactive workshop designed to illustrate the basic principles of NLP and computational linguistics to high school Italian students aged between 13 and 18 years. The workshop takes the form of a game in which participants play the role of machines needing to solve some of the most common problems a computer faces in understanding language: from voice recognition to Markov chains to syntactic parsing. Participants are guided through the workshop with the help of instructors, who present the activities and explain core concepts from computational linguistics. The workshop was presented at numerous outlets in Italy between 2019 and 2021, both face-to-face and online.", "pdf_url": "https://arxiv.org/pdf/2104.12422v2", "local_path": "data\\papers\\2104_12422v2.pdf"}
{"arxiv_id": "2507.01991v1", "title": "FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports", "authors": ["Muhammad Bilal Zafar"], "year": "2025", "abstract": "The proliferation of artificial intelligence (AI) in financial services has prompted growing demand for tools that can systematically detect AI-related disclosures in corporate filings. While prior approaches often rely on keyword expansion or document-level classification, they fall short in granularity, interpretability, and robustness. This study introduces FinAI-BERT, a domain-adapted transformer-based language model designed to classify AI-related content at the sentence level within financial texts. The model was fine-tuned on a manually curated and balanced dataset of 1,586 sentences drawn from 669 annual reports of U.S. banks (2015 to 2023). FinAI-BERT achieved near-perfect classification performance (accuracy of 99.37 percent, F1 score of 0.993), outperforming traditional baselines such as Logistic Regression, Naive Bayes, Random Forest, and XGBoost. Interpretability was ensured through SHAP-based token attribution, while bias analysis and robustness checks confirmed the model's stability across sentence lengths, adversarial inputs, and temporal samples. Theoretically, the study advances financial NLP by operationalizing fine-grained, theme-specific classification using transformer architectures. Practically, it offers a scalable, transparent solution for analysts, regulators, and scholars seeking to monitor the diffusion and framing of AI across financial institutions.", "pdf_url": "https://arxiv.org/pdf/2507.01991v1", "local_path": "data\\papers\\2507_01991v1.pdf"}
{"arxiv_id": "2204.03251v3", "title": "Towards Automatic Construction of Filipino WordNet: Word Sense Induction and Synset Induction Using Sentence Embeddings", "authors": ["Dan John Velasco", "Axel Alba", "Trisha Gail Pelagio", "Bryce Anthony Ramirez", "Unisse Chua", "Briane Paul Samson", "Jan Christian Blaise Cruz", "Charibeth Cheng"], "year": "2022", "abstract": "Wordnets are indispensable tools for various natural language processing applications. Unfortunately, wordnets get outdated, and producing or updating wordnets can be slow and costly in terms of time and resources. This problem intensifies for low-resource languages. This study proposes a method for word sense induction and synset induction using only two linguistic resources, namely, an unlabeled corpus and a sentence embeddings-based language model. The resulting sense inventory and synonym sets can be used in automatically creating a wordnet. We applied this method on a corpus of Filipino text. The sense inventory and synsets were evaluated by matching them with the sense inventory of the machine translated Princeton WordNet, as well as comparing the synsets to the Filipino WordNet. This study empirically shows that the 30% of the induced word senses are valid and 40% of the induced synsets are valid in which 20% are novel synsets.", "pdf_url": "https://arxiv.org/pdf/2204.03251v3", "local_path": "data\\papers\\2204_03251v3.pdf"}
{"arxiv_id": "2101.10642v1", "title": "Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks", "authors": ["Hyunjin Choi", "Judong Kim", "Seongho Joe", "Youngjune Gwon"], "year": "2021", "abstract": "Contextualized representations from a pre-trained language model are central to achieve a high performance on downstream NLP task. The pre-trained BERT and A Lite BERT (ALBERT) models can be fine-tuned to give state-of-the-art results in sentence-pair regressions such as semantic textual similarity (STS) and natural language inference (NLI). Although BERT-based models yield the [CLS] token vector as a reasonable sentence embedding, the search for an optimal sentence embedding scheme remains an active research area in computational linguistics. This paper explores on sentence embedding models for BERT and ALBERT. In particular, we take a modified BERT network with siamese and triplet network structures called Sentence-BERT (SBERT) and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT). We also experiment with an outer CNN sentence-embedding network for SBERT and SALBERT. We evaluate performances of all sentence-embedding models considered using the STS and NLI datasets. The empirical results indicate that our CNN architecture improves ALBERT models substantially more than BERT models for STS benchmark. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive to BERT in downstream NLP evaluations.", "pdf_url": "https://arxiv.org/pdf/2101.10642v1", "local_path": "data\\papers\\2101_10642v1.pdf"}
{"arxiv_id": "2412.04784v2", "title": "NLP-ADBench: NLP Anomaly Detection Benchmark", "authors": ["Yuangang Li", "Jiaqi Li", "Zhuo Xiao", "Tiankai Yang", "Yi Nian", "Xiyang Hu", "Yue Zhao"], "year": "2024", "abstract": "Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.", "pdf_url": "https://arxiv.org/pdf/2412.04784v2", "local_path": "data\\papers\\2412_04784v2.pdf"}
{"arxiv_id": "2412.08520v1", "title": "GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek", "authors": ["Lefteris Loukas", "Nikolaos Smyrnioudis", "Chrysa Dikonomaki", "Spyros Barbakos", "Anastasios Toumazatos", "John Koutsikakis", "Manolis Kyriakakis", "Mary Georgiou", "Stavros Vassos", "John Pavlopoulos", "Ion Androutsopoulos"], "year": "2024", "abstract": "We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP) toolkit developed specifically for modern Greek. The toolkit provides state-of-the-art performance in five core NLP tasks, namely part-of-speech tagging, morphological tagging, dependency parsing, named entity recognition, and Greeklish-to-Greek transliteration. The toolkit is based on pre-trained Transformers, it is freely available, and can be easily installed in Python (pip install gr-nlp-toolkit). It is also accessible through a demonstration platform on HuggingFace, along with a publicly available API for non-commercial use. We discuss the functionality provided for each task, the underlying methods, experiments against comparable open-source toolkits, and future possible enhancements. The toolkit is available at: https://github.com/nlpaueb/gr-nlp-toolkit", "pdf_url": "https://arxiv.org/pdf/2412.08520v1", "local_path": "data\\papers\\2412_08520v1.pdf"}
{"arxiv_id": "2110.15725v1", "title": "Batch-Softmax Contrastive Loss for Pairwise Sentence Scoring Tasks", "authors": ["Anton Chernyavskiy", "Dmitry Ilvovsky", "Pavel Kalinin", "Preslav Nakov"], "year": "2021", "abstract": "The use of contrastive loss for representation learning has become prominent in computer vision, and it is now getting attention in Natural Language Processing (NLP). Here, we explore the idea of using a batch-softmax contrastive loss when fine-tuning large-scale pre-trained transformer models to learn better task-specific sentence embeddings for pairwise sentence scoring tasks. We introduce and study a number of variations in the calculation of the loss as well as in the overall training procedure; in particular, we find that data shuffling can be quite important. Our experimental results show sizable improvements on a number of datasets and pairwise sentence scoring tasks including classification, ranking, and regression. Finally, we offer detailed analysis and discussion, which should be useful for researchers aiming to explore the utility of contrastive loss in NLP.", "pdf_url": "https://arxiv.org/pdf/2110.15725v1", "local_path": "data\\papers\\2110_15725v1.pdf"}
{"arxiv_id": "2501.15876v1", "title": "Optimizing Sentence Embedding with Pseudo-Labeling and Model Ensembles: A Hierarchical Framework for Enhanced NLP Tasks", "authors": ["Ziwei Liu", "Qi Zhang", "Lifu Gao"], "year": "2025", "abstract": "Sentence embedding tasks are important in natural language processing (NLP), but improving their performance while keeping them reliable is still hard. This paper presents a framework that combines pseudo-label generation and model ensemble techniques to improve sentence embeddings. We use external data from SimpleWiki, Wikipedia, and BookCorpus to make sure the training data is consistent. The framework includes a hierarchical model with an encoding layer, refinement layer, and ensemble prediction layer, using ALBERT-xxlarge, RoBERTa-large, and DeBERTa-large models. Cross-attention layers combine external context, and data augmentation techniques like synonym replacement and back-translation increase data variety. Experimental results show large improvements in accuracy and F1-score compared to basic models, and studies confirm that cross-attention and data augmentation make a difference. This work presents an effective way to improve sentence embedding tasks and lays the groundwork for future NLP research.", "pdf_url": "https://arxiv.org/pdf/2501.15876v1", "local_path": "data\\papers\\2501_15876v1.pdf"}
{"arxiv_id": "2602.19174v5", "title": "TurkicNLP: An NLP Toolkit for Turkic Languages", "authors": ["Sherzod Hakimov"], "year": "2026", "abstract": "Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .", "pdf_url": "https://arxiv.org/pdf/2602.19174v5", "local_path": "data\\papers\\2602_19174v5.pdf"}
{"arxiv_id": "2208.02402v2", "title": "Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models", "authors": ["Vilém Zouhar", "Marius Mosbach", "Dietrich Klakow"], "year": "2022", "abstract": "Although masked language models are highly performant and widely adopted by NLP practitioners, they can not be easily used for autoregressive language modelling (next word prediction and sequence probability estimation). We present an LSTM-based autoregressive language model which uses prefix embeddings (from a pretrained masked language model) via fusion (e.g. concatenation) to obtain a richer context representation for language modelling. We find that fusion helps reliably in lowering the perplexity (16.74 $\\rightarrow$ 15.80), which is even preserved after a transfer to a dataset from a different domain than the training data. We also evaluate the best-performing fusion model by correlating its next word surprisal estimates with human reading times. Contradicting our expectation, and despite the improvement in perplexity overall, the correlation remains the same as for the baseline model. Lastly, while we focus on language models pre-trained on text as the sources for the fusion, our approach can be possibly extended to fuse any information represented as a fixed-size vector into an auto-regressive language model. These include e.g. sentence external information retrieved for a knowledge base or representations of multi-modal encoders.", "pdf_url": "https://arxiv.org/pdf/2208.02402v2", "local_path": "data\\papers\\2208_02402v2.pdf"}
{"arxiv_id": "1906.01575v1", "title": "Pitfalls in the Evaluation of Sentence Embeddings", "authors": ["Steffen Eger", "Andreas Rücklé", "Iryna Gurevych"], "year": "2019", "abstract": "Deep learning models continuously break new records across different NLP tasks. At the same time, their success exposes weaknesses of model evaluation. Here, we compile several key pitfalls of evaluation of sentence embeddings, a currently very popular NLP paradigm. These pitfalls include the comparison of embeddings of different sizes, normalization of embeddings, and the low (and diverging) correlations between transfer and probing tasks. Our motivation is to challenge the current evaluation of sentence embeddings and to provide an easy-to-access reference for future research. Based on our insights, we also recommend better practices for better future evaluations of sentence embeddings.", "pdf_url": "https://arxiv.org/pdf/1906.01575v1", "local_path": "data\\papers\\1906_01575v1.pdf"}
{"arxiv_id": "1911.03895v2", "title": "A Bilingual Generative Transformer for Semantic Sentence Embedding", "authors": ["John Wieting", "Graham Neubig", "Taylor Berg-Kirkpatrick"], "year": "2019", "abstract": "Semantic sentence embedding models encode natural language sentences into vectors, such that closeness in embedding space indicates closeness in the semantics between the sentences. Bilingual data offers a useful signal for learning such embeddings: properties shared by both sentences in a translation pair are likely semantic, while divergent properties are likely stylistic or language-specific. We propose a deep latent variable model that attempts to perform source separation on parallel sentences, isolating what they have in common in a latent semantic vector, and explaining what is left over with language-specific latent vectors. Our proposed approach differs from past work on semantic sentence encoding in two ways. First, by using a variational probabilistic framework, we introduce priors that encourage source separation, and can use our model's posterior to predict sentence embeddings for monolingual data at test time. Second, we use high-capacity transformers as both data generating distributions and inference networks -- contrasting with most past work on sentence embeddings. In experiments, our approach substantially outperforms the state-of-the-art on a standard suite of unsupervised semantic similarity evaluations. Further, we demonstrate that our approach yields the largest gains on more difficult subsets of these evaluations where simple word overlap is not a good indicator of similarity.", "pdf_url": "https://arxiv.org/pdf/1911.03895v2", "local_path": "data\\papers\\1911_03895v2.pdf"}
{"arxiv_id": "2204.00820v2", "title": "Efficient comparison of sentence embeddings", "authors": ["Spyros Zoupanos", "Stratis Kolovos", "Athanasios Kanavos", "Orestis Papadimitriou", "Manolis Maragoudakis"], "year": "2022", "abstract": "The domain of natural language processing (NLP), which has greatly evolved over the last years, has highly benefited from the recent developments in word and sentence embeddings. Such embeddings enable the transformation of complex NLP tasks, like semantic similarity or Question and Answering (Q&A), into much simpler to perform vector comparisons. However, such a problem transformation raises new challenges like the efficient comparison of embeddings and their manipulation. In this work, we will discuss about various word and sentence embeddings algorithms, we will select a sentence embedding algorithm, BERT, as our algorithm of choice and we will evaluate the performance of two vector comparison approaches, FAISS and Elasticsearch, in the specific problem of sentence embeddings. According to the results, FAISS outperforms Elasticsearch when used in a centralized environment with only one node, especially when big datasets are included.", "pdf_url": "https://arxiv.org/pdf/2204.00820v2", "local_path": "data\\papers\\2204_00820v2.pdf"}
{"arxiv_id": "2106.02359v3", "title": "How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact", "authors": ["Zhijing Jin", "Geeticka Chauhan", "Brian Tse", "Mrinmaya Sachan", "Rada Mihalcea"], "year": "2021", "abstract": "Recent years have seen many breakthroughs in natural language processing (NLP), transitioning it from a mostly theoretical field to one with many real-world applications. Noting the rising number of applications of other machine learning and AI techniques with pervasive societal impact, we anticipate the rising importance of developing NLP technologies for social good. Inspired by theories in moral philosophy and global priorities research, we aim to promote a guideline for social good in the context of NLP. We lay the foundations via the moral philosophy definition of social good, propose a framework to evaluate the direct and indirect real-world impact of NLP tasks, and adopt the methodology of global priorities research to identify priority causes for NLP research. Finally, we use our theoretical framework to provide some practical guidelines for future NLP research for social good. Our data and code are available at http://github.com/zhijing-jin/nlp4sg_acl2021. In addition, we curate a list of papers and resources on NLP for social good at https://github.com/zhijing-jin/NLP4SocialGood_Papers.", "pdf_url": "https://arxiv.org/pdf/2106.02359v3", "local_path": "data\\papers\\2106_02359v3.pdf"}
{"arxiv_id": "2105.00895v1", "title": "Teaching NLP outside Linguistics and Computer Science classrooms: Some challenges and some opportunities", "authors": ["Sowmya Vajjala"], "year": "2021", "abstract": "NLP's sphere of influence went much beyond computer science research and the development of software applications in the past decade. We see people using NLP methods in a range of academic disciplines from Asian Studies to Clinical Oncology. We also notice the presence of NLP as a module in most of the data science curricula within and outside of regular university setups. These courses are taken by students from very diverse backgrounds. This paper takes a closer look at some issues related to teaching NLP to these diverse audiences based on my classroom experiences, and identifies some challenges the instructors face, particularly when there is no ecosystem of related courses for the students. In this process, it also identifies a few challenge areas for both NLP researchers and tool developers.", "pdf_url": "https://arxiv.org/pdf/2105.00895v1", "local_path": "data\\papers\\2105_00895v1.pdf"}
{"arxiv_id": "2011.14743v1", "title": "IPPOG : Bridging the gap between science education at school and modern scientific research", "authors": ["Barbora Bruant Gulejova"], "year": "2020", "abstract": "The International Particle Physics Outreach Group (IPPOG) has been making concerted and systematic efforts to present and popularise particle physics across all audiences and age groups since 1997. Today the scientific community has in IPPOG a strategic pillar in fostering long-term, sustainable support for fundamental research around the world. One of the main tools IPPOG has been offering to the scientific community, teachers and educators for almost 10 years is the Resource Database (RDB), an online platform containing a collection of high quality engaging education and outreach materials in particle physics and related sciences.", "pdf_url": "https://arxiv.org/pdf/2011.14743v1", "local_path": "data\\papers\\2011_14743v1.pdf"}
{"arxiv_id": "1911.08755v1", "title": "Global Thread-Level Inference for Comment Classification in Community Question Answering", "authors": ["Shafiq Joty", "Alberto Barrón-Cedeño", "Giovanni Da San Martino", "Simone Filice", "Lluís Màrquez", "Alessandro Moschitti", "Preslav Nakov"], "year": "2019", "abstract": "Community question answering, a recent evolution of question answering in the Web context, allows a user to quickly consult the opinion of a number of people on a particular topic, thus taking advantage of the wisdom of the crowd. Here we try to help the user by deciding automatically which answers are good and which are bad for a given question. In particular, we focus on exploiting the output structure at the thread level in order to make more consistent global decisions. More specifically, we exploit the relations between pairs of comments at any distance in the thread, which we incorporate in a graph-cut and in an ILP frameworks. We evaluated our approach on the benchmark dataset of SemEval-2015 Task 3. Results improved over the state of the art, confirming the importance of using thread level information.", "pdf_url": "https://arxiv.org/pdf/1911.08755v1", "local_path": "data\\papers\\1911_08755v1.pdf"}
{"arxiv_id": "1608.04185v3", "title": "Learning to Rank Questions for Community Question Answering with Ranking SVM", "authors": ["Minh-Tien Nguyen", "Viet-Anh Phan", "Truong-Son Nguyen", "Minh-Le Nguyen"], "year": "2016", "abstract": "This paper presents our method to retrieve relevant queries given a new question in the context of Discovery Challenge: Learning to Re-Ranking Questions for Community Question Answering competition. In order to do that, a set of learning to rank methods was investigated to select an appropriate method. The selected method was optimized on training data by using a search strategy. After optimizing, the method was applied to development and test set. Results from the competition indicate that the performance of our method outperforms almost participants and show that Ranking SVM is efficient for retrieving relevant queries in community question answering.", "pdf_url": "https://arxiv.org/pdf/1608.04185v3", "local_path": "data\\papers\\1608_04185v3.pdf"}
{"arxiv_id": "1906.01727v1", "title": "SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums", "authors": ["Tsvetomila Mihaylova", "Georgi Karadjov", "Pepa Atanasova", "Ramy Baly", "Mitra Mohtarami", "Preslav Nakov"], "year": "2019", "abstract": "We present SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums, which features two subtasks. Subtask A is about deciding whether a question asks for factual information vs. an opinion/advice vs. just socializing. Subtask B asks to predict whether an answer to a factual question is true, false or not a proper answer. We received 17 official submissions for subtask A and 11 official submissions for Subtask B. For subtask A, all systems improved over the majority class baseline. For Subtask B, all systems were below a majority class baseline, but several systems were very close to it. The leaderboard and the data from the competition can be found at http://competitions.codalab.org/competitions/20022", "pdf_url": "https://arxiv.org/pdf/1906.01727v1", "local_path": "data\\papers\\1906_01727v1.pdf"}
{"arxiv_id": "1912.02998v1", "title": "Machine Translation Evaluation Meets Community Question Answering", "authors": ["Francisco Guzmán", "Lluís Màrquez", "Preslav Nakov"], "year": "2019", "abstract": "We explore the applicability of machine translation evaluation (MTE) methods to a very different problem: answer ranking in community Question Answering. In particular, we adopt a pairwise neural network (NN) architecture, which incorporates MTE features, as well as rich syntactic and semantic embeddings, and which efficiently models complex non-linear interactions. The evaluation results show state-of-the-art performance, with sizeable contribution from both the MTE features and from the pairwise NN architecture.", "pdf_url": "https://arxiv.org/pdf/1912.02998v1", "local_path": "data\\papers\\1912_02998v1.pdf"}
{"arxiv_id": "2005.11401v4", "title": "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", "authors": ["Patrick Lewis", "Ethan Perez", "Aleksandra Piktus", "Fabio Petroni", "Vladimir Karpukhin", "Naman Goyal", "Heinrich Küttler", "Mike Lewis", "Wen-tau Yih", "Tim Rocktäschel", "Sebastian Riedel", "Douwe Kiela"], "year": "2020", "abstract": "Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.", "pdf_url": "https://arxiv.org/pdf/2005.11401v4", "local_path": "data\\papers\\2005_11401v4.pdf"}
{"arxiv_id": "2112.08688v2", "title": "Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks", "authors": ["Akari Asai", "Matt Gardner", "Hannaneh Hajishirzi"], "year": "2021", "abstract": "Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks such as open question answering and fact verification. These models are trained to generate the final output given the retrieved passages, which can be irrelevant to the original query, leading to learning spurious cues or answer memorization. This work introduces a method to incorporate the evidentiality of passages -- whether a passage contains correct evidence to support the output -- into training the generator. We introduce a multi-task learning framework to jointly generate the final output and predict the evidentiality of each passage, leveraging a new task-agnostic method to obtain silver evidentiality labels for supervision. Our experiments on five datasets across three knowledge-intensive tasks show that our new evidentiality-guided generator significantly outperforms its direct counterpart with the same-size model and advances the state of the art on FaVIQ-Ambig. We attribute these improvements to both the auxiliary multi-task learning and silver evidentiality mining techniques.", "pdf_url": "https://arxiv.org/pdf/2112.08688v2", "local_path": "data\\papers\\2112_08688v2.pdf"}
{"arxiv_id": "2202.08772v1", "title": "A Survey of Knowledge-Intensive NLP with Pre-Trained Language Models", "authors": ["Da Yin", "Li Dong", "Hao Cheng", "Xiaodong Liu", "Kai-Wei Chang", "Furu Wei", "Jianfeng Gao"], "year": "2022", "abstract": "With the increasing of model capacity brought by pre-trained language models, there emerges boosting needs for more knowledgeable natural language processing (NLP) models with advanced functionalities including providing and making flexible use of encyclopedic and commonsense knowledge. The mere pre-trained language models, however, lack the capacity of handling such knowledge-intensive NLP tasks alone. To address this challenge, large numbers of pre-trained language models augmented with external knowledge sources are proposed and in rapid development. In this paper, we aim to summarize the current progress of pre-trained language model-based knowledge-enhanced models (PLMKEs) by dissecting their three vital elements: knowledge sources, knowledge-intensive NLP tasks, and knowledge fusion methods. Finally, we present the challenges of PLMKEs based on the discussion regarding the three elements and attempt to provide NLP practitioners with potential directions for further research.", "pdf_url": "https://arxiv.org/pdf/2202.08772v1", "local_path": "data\\papers\\2202_08772v1.pdf"}
{"arxiv_id": "1711.01505v1", "title": "Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task", "authors": ["Allyson Ettinger", "Sudha Rao", "Hal Daumé", "Emily M. Bender"], "year": "2017", "abstract": "This paper presents a summary of the first Workshop on Building Linguistically Generalizable Natural Language Processing Systems, and the associated Build It Break It, The Language Edition shared task. The goal of this workshop was to bring together researchers in NLP and linguistics with a shared task aimed at testing the generalizability of NLP systems beyond the distributions of their training data. We describe the motivation, setup, and participation of the shared task, provide discussion of some highlighted results, and discuss lessons learned.", "pdf_url": "https://arxiv.org/pdf/1711.01505v1", "local_path": "data\\papers\\1711_01505v1.pdf"}
{"arxiv_id": "2604.16915v1", "title": "KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains", "authors": ["Parthaw Goswami", "Jaynto Goswami Deep"], "year": "2026", "abstract": "Retrieval augmented generation (RAG) has transformed text based question answering, yet its extension to visual domains remains hindered by fundamental challenges: bridging the modality gap between image queries and text heavy knowledge bases, constructing semantically meaningful visual knowledge bases, performing multihop reasoning over retrieved images, and verifying that generated answers are faithfully grounded in visual evidence. We present KIRA (Knowledge Intensive Image Retrieval and Reasoning Architecture), a unified five stage framework that addresses ten core problems in visual RAG for specialized domains. KIRA introduces: (1) hierarchical semantic chunking with DINO based region detection for multi granularity knowledge base construction, (2) domain adaptive contrastive encoders with fewshot adaptation for rare visual concepts, (3) dualpath crossmodal retrieval with chainOfThought query expansion, (4) chainOfRetrieval for multihop visual reasoning with temporal and multiview support, and (5) evidence conditioned grounded generation with posthoc hallucination verification. We also propose DOMAINVQAR, a benchmark suite that evaluates visual RAG along three axes (retrieval precision, reasoning faithfulness, and domain correctness) going beyond standard recall metrics. Experiments across four specialized domains (medical Xray, circuit diagrams, satellite imagery, and histopathology) with a progressive six variant ablation demonstrate that KIRA achieves 0.97 retrieval precision, 1.0 grounding scores, and 0.707 domain correctness averaged across domains, while the ablation reveals actionable insights about when each component helps and when components introduce precision diversity tradeoffs that must be managed. Code will be released upon acceptance.", "pdf_url": "https://arxiv.org/pdf/2604.16915v1", "local_path": "data\\papers\\2604_16915v1.pdf"}
{"arxiv_id": "2410.08918v1", "title": "Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing", "authors": ["Isaac Johnson", "Lucie-Aimée Kaffee", "Miriam Redi"], "year": "2024", "abstract": "Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated to use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.", "pdf_url": "https://arxiv.org/pdf/2410.08918v1", "local_path": "data\\papers\\2410_08918v1.pdf"}
{"arxiv_id": "2302.04700v1", "title": "Augmenting NLP data to counter Annotation Artifacts for NLI Tasks", "authors": ["Armaan Singh Bhullar"], "year": "2023", "abstract": "In this paper, we explore Annotation Artifacts - the phenomena wherein large pre-trained NLP models achieve high performance on benchmark datasets but do not actually \"solve\" the underlying task and instead rely on some dataset artifacts (same across train, validation, and test sets) to figure out the right answer. We explore this phenomenon on the well-known Natural Language Inference task by first using contrast and adversarial examples to understand limitations to the model's performance and show one of the biases arising from annotation artifacts (the way training data was constructed by the annotators). We then propose a data augmentation technique to fix this bias and measure its effectiveness.", "pdf_url": "https://arxiv.org/pdf/2302.04700v1", "local_path": "data\\papers\\2302_04700v1.pdf"}
{"arxiv_id": "2112.09924v2", "title": "The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus", "authors": ["Aleksandra Piktus", "Fabio Petroni", "Vladimir Karpukhin", "Dmytro Okhonko", "Samuel Broscheit", "Gautier Izacard", "Patrick Lewis", "Barlas Oğuz", "Edouard Grave", "Wen-tau Yih", "Sebastian Riedel"], "year": "2021", "abstract": "In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks which rely on knowledge - either factual or common sense, and ask systems to use a subset of CCNet - the Sphere corpus - as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval from Sphere enables a state of the art system to match and even outperform Wikipedia-based models on several tasks. We also observe that while a dense index can outperform a sparse BM25 baseline on Wikipedia, on Sphere this is not yet possible. To facilitate further research and minimise the community's reliance on proprietary, black-box search engines, we share our indices, evaluation metrics and infrastructure.", "pdf_url": "https://arxiv.org/pdf/2112.09924v2", "local_path": "data\\papers\\2112_09924v2.pdf"}
{"arxiv_id": "2408.05664v1", "title": "Training an NLP Scholar at a Small Liberal Arts College: A Backwards Designed Course Proposal", "authors": ["Grusha Prasad", "Forrest Davis"], "year": "2024", "abstract": "The rapid growth in natural language processing (NLP) over the last couple years has generated student interest and excitement in learning more about the field. In this paper, we present two types of students that NLP courses might want to train. First, an \"NLP engineer\" who is able to flexibly design, build and apply new technologies in NLP for a wide range of tasks. Second, an \"NLP scholar\" who is able to pose, refine and answer questions in NLP and how it relates to the society, while also learning to effectively communicate these answers to a broader audience. While these two types of skills are not mutually exclusive -- NLP engineers should be able to think critically, and NLP scholars should be able to build systems -- we think that courses can differ in the balance of these skills. As educators at Small Liberal Arts Colleges, the strengths of our students and our institution favors an approach that is better suited to train NLP scholars. In this paper we articulate what kinds of skills an NLP scholar should have, and then adopt a backwards design to propose course components that can aid the acquisition of these skills.", "pdf_url": "https://arxiv.org/pdf/2408.05664v1", "local_path": "data\\papers\\2408_05664v1.pdf"}
{"arxiv_id": "2508.16910v1", "title": "Unbiased Reasoning for Knowledge-Intensive Tasks in Large Language Models via Conditional Front-Door Adjustment", "authors": ["Bo Zhao", "Yinghao Zhang", "Ziqi Xu", "Yongli Ren", "Xiuzhen Zhang", "Renqiang Luo", "Zaiwen Feng", "Feng Xia"], "year": "2025", "abstract": "Large Language Models (LLMs) have shown impressive capabilities in natural language processing but still struggle to perform well on knowledge-intensive tasks that require deep reasoning and the integration of external knowledge. Although methods such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) have been proposed to enhance LLMs with external knowledge, they still suffer from internal bias in LLMs, which often leads to incorrect answers. In this paper, we propose a novel causal prompting framework, Conditional Front-Door Prompting (CFD-Prompting), which enables the unbiased estimation of the causal effect between the query and the answer, conditional on external knowledge, while mitigating internal bias. By constructing counterfactual external knowledge, our framework simulates how the query behaves under varying contexts, addressing the challenge that the query is fixed and is not amenable to direct causal intervention. Compared to the standard front-door adjustment, the conditional variant operates under weaker assumptions, enhancing both robustness and generalisability of the reasoning process. Extensive experiments across multiple LLMs and benchmark datasets demonstrate that CFD-Prompting significantly outperforms existing baselines in both accuracy and robustness.", "pdf_url": "https://arxiv.org/pdf/2508.16910v1", "local_path": "data\\papers\\2508_16910v1.pdf"}
{"arxiv_id": "2010.03061v1", "title": "A Survey on Recognizing Textual Entailment as an NLP Evaluation", "authors": ["Adam Poliak"], "year": "2020", "abstract": "Recognizing Textual Entailment (RTE) was proposed as a unified evaluation framework to compare semantic understanding of different NLP systems. In this survey paper, we provide an overview of different approaches for evaluating and understanding the reasoning capabilities of NLP systems. We then focus our discussion on RTE by highlighting prominent RTE datasets as well as advances in RTE dataset that focus on specific linguistic phenomena that can be used to evaluate NLP systems on a fine-grained level. We conclude by arguing that when evaluating NLP systems, the community should utilize newly introduced RTE datasets that focus on specific linguistic phenomena.", "pdf_url": "https://arxiv.org/pdf/2010.03061v1", "local_path": "data\\papers\\2010_03061v1.pdf"}