Daily Dose of Large Language Models
Published:
Update LLM paper everyday!
09.Oct.
SCALABLE AND ACCURATE GRAPH REASONING WITH LLM-BASED MULTI-AGENTS
Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu2∗
In graph reasoning tasks, traditional methods often use a single LLM whereas the paper propose a framework based on multi-agent collaboration to solve graph reasoning problems. On each node of the graph, a LLM receives and passes messages until maximum iterations.
KeyWord: Multi-agent, Graph Reasoning
REVISEVAL: IMPROVING LLM-AS-A-JUDGE VIA RESPONSE-ADAPTED REFERENCES
Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma1
Good related work on LLM-as-Judge. But the methodology seems trivial. For a generated response, the paper use an LLM to generate a reference, and use another LLM to evaluate the input-output-reference combination.
KeyWord: LLM-as-Judge
EFFICIENT INFERENCE FOR LARGE LANGUAGE MODELBASED GENERATIVE RECOMMENDATION\
Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua
A Speculative Decoding variant
KeyWord: LLM Inference
03.Oct.
Loki: An Open-Source Tool for Fact Verification
Haonan Li, Xudong Han, Hao Wang, Yuxia Wang, Minghan Wang, Rui Xing, Yilin Geng, Zenan Zhai, Preslav Nakov, Timothy Baldwin (LibrAI, MBZUAI, Monash University, The University of Melbourne)
The paper use direct prompting to decompose the task of fact checking into five processes, and the sub-moudles are: Decomposer, Checkworthiness Identifier, Query Generator, Evidence Retriever and Claim Verifier. I like the idea of breaking the thinking process into pre-defined stages and direct prompting is a straightforward implementation. The paper also talk about practical parallel implementation.
KeyWord: Direct Prompting
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths (Yale, OpenAI, Princeton, )
A technical report finding that o1 scores substantially better on examples with high-probability outputs than ones with low-probability outputs. And o1 shows substantially less sensitivity to task frequency than the other LLMs. The problem of auto-regression still exists.
KeyWord: Evaluation, O1
DreamGarden: A Designer Assistant for Growing Games from a Single Prompt
Sam Earle, Samyak Parajuli, Andrzej Banburski-Fahey LLM assistant for Game design through direct prompting. The key is to decompose the task and assign the subtasks to each agents.
KeyWord: Direct Prompting, Human-AI Interaction
Investigating on RLHF methodology
Alexey Kutalev, Sergei Markoff
A survey on RLHF (Well it save my time :))
KeyWord: Survey, RLHF
OPEN-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models
Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez
LLM + Moe, train the model to generate retrieval/no_retrieval reflection tokens and measure the confidence of outputs conditioned on enforced no_retrieval during inference, to decide whether to do retrival.
KeyWord: RAG
Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context
Spencer Frei, Gal Vardi (UC Davis)
Investigate the generalization abilities of a linear transformer on linear classification tasks. It generalize nicely when data has label-flipping noise, or in ICL, the model remember the noise but still generalize.
KeyWord: fundation model, generalization
DEFINE: ENHANCING LLM DECISION-MAKING WITH FACTOR PROFILES AND ANALOGICAL REASONING
Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu (University of Central Florida)
This one is interesting. The question it tries to answer is: how LLMs can guide investment decisions by analyzing earnings call transcripts? The goal is to dig for key factors that are important for decision-making. The paper interprete in a Bayesian inference way. Given an article, the LLM is firstly used (as a prior proposal) to summerize factors, then they use train a Bradle-Terry Model to score the likelihood (conditiaonl likelihood function). A direct posterior sampling analogy.
KeyWord: factor analysis, bayesian sense
QUANTIFYING GENERALIZATION COMPLEXITY FOR LARGE LANGUAGE MODELS
Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass (Harvard, MIT, Chicago, Meta)
The paper propose an evaluation pipeline to test generalization. The key insight here is to query the LLM for its in-distribution data, and sample OOD data from its complement, then test LLM on both of the synthetic data. A dynamic generalization evaluation piptline, also intersting. Wonder what an open-ended version will look like?
KeyWord: evaluation, generalization
Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models
Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen (Unversity of Pennsylvania)
LLM for bioinformatics! Always happy to see these cross-area paper. The data quality is always the bottleneck due to high dimension, missing data or small size. The paper utilize prior knowledge embedded in LLM and apply self-consistent inference to select key features from tabular data.
KeyWord: bioinformatics, feature selection, application of LLM
LOCRET: ENHANCING EVICTION IN LONG-CONTEXT LLM INFERENCE WITH TRAINED RETAINING HEADS
Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu (TsingHua)
The paper propose a novel way to do selective KV cache eviction for long-text inference. They train a retaining head to estimate the causal importance of each cache unit. Such a training paradigm is able to provide accurate token importance scoring prediction and can be integrated with other efficient inference algorithms.
KeyWord: hardward-level optimization
02.Oct.
On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag (Google)
In machine translation, verbosity is bad and prevalent. The paper conduct exploration of what causing verbosity and how it affect evalution.
KeyWord: NLP
Quantifying reliance on external information over parametric knowledge during Retrieval Augmented Generation (RAG) using mechanistic analysis
Reshmi Ghosh, Rahul Seetharaman, Hitesh Wadhwa, Somyaa Aggarwal, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari, Ehsan Aghazadeh (Microsoft, University of Massachusetts, Amherst, University of Maryland, College Park,)
RAG-related LLM has shortcut to answer question using context information instead of model prior. The paper proposes three mechanistical way to identify how LLM utilize information: causal inference method, attention weight checking, and attention knockout (remove one attention edge and see how the performance degrade).
KeyWord: RAG
ADDITION IS ALL YOU NEED FOR ENERGY-EFFICIENT LANGUAGE MODELS
Hongyin Luo & Wei Sun (BitEnergy AI, Inc)
Approximate tensor multiplication with integer addition. The paper do detailed computational complexity analysis. Achieve similar performance to full-scale model but use less energy. Worth reading.
KeyWord: computation complexity, hardward-level optimization
Generative AI and Perceptual Harms: Who’s Suspected of using LLMs?
Kowe Kadoma, Danaë Metaxa, Mor Naaman
How much harm caused to users when others perceive or suspect them of using AI. The paper have done social experiments to verify perceptual harms.
KeyWord: AI for social good
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh (University of Wisconsin-Madison, Adobe Research)
Solve the data inbalance problem in image-text alignment problems, by generating negative samples from the positive.
KeyWord: data augmentation
Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences
Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, Werner Geyer (IBM)
The paper investigate how human behave as an evaluator for LLLM’s outputs, trying to understand what practitioners prioritize in evaluation criteria when using LLMs as judges and how these priorities differ. The paper also provide a tool to help practitioners refine evaluation criteria using both direct and pairwise assessment strategies.
KeyWord: Trustworthy AI, human-AI interaction