<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://yansong97.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yansong97.github.io/" rel="alternate" type="text/html" /><updated>2026-05-08T10:39:48+00:00</updated><id>https://yansong97.github.io/feed.xml</id><title type="html">Yan Song</title><subtitle>personal description</subtitle><author><name>Yan Song</name><email>yan.song.24@ucl.ac.uk</email></author><entry><title type="html">Code BreakDown - MCTS in LLM</title><link href="https://yansong97.github.io/posts/2024/10/blog-post-2/" rel="alternate" type="text/html" title="Code BreakDown - MCTS in LLM" /><published>2024-10-24T00:00:00+00:00</published><updated>2024-10-24T00:00:00+00:00</updated><id>https://yansong97.github.io/posts/2024/10/blog-post-2</id><content type="html" xml:base="https://yansong97.github.io/posts/2024/10/blog-post-2/"><![CDATA[<p>Re-implementation of rStar MCTS</p>

<h2 id="rstar">rStar</h2>

<h3 id="1020">10.20</h3>

<p><a href="https://arxiv.org/abs/2408.06195">rStar</a> is a fairly simple method that achieves strong performance on small models. The basic idea is to introduce various types of queries (actions) during MCTS: decomposing the question, responding to sub-questions, or rephrasing the sub-questions. This mimics cognitive reasoning and adds exploration to the tree search.</p>

<p>Its <a href="https://github.com/zhentingqi/rStar">codebase</a> is interesting to read. The key elements live in a single Python file, <code class="language-plaintext highlighter-rouge">run_src/MCTS_for_reasoning.py</code>, which defines three components:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Generator</code></p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Reasoning_MCTS_Node</code></p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">search_for_answers</code></p>
  </li>
</ul>

<p>In short, the codebase defines a complex language-node data structure, <code class="language-plaintext highlighter-rouge">Reasoning_MCTS_Node</code>, and sets up an MCTS searcher that navigates the tree by creating child nodes inside <code class="language-plaintext highlighter-rouge">Reasoning_MCTS_Node</code>, each of which inherits all the necessary information.</p>

<p>I found it interesting because I was participating in another open-source project, <a href="https://github.com/openreasoner/openr"><strong><em>OpenR</em></strong></a> (big shoutout to the team!), where we also support a vanilla version of MCTS reasoning. The main implementation difference is that rStar passes the LLM call between nested nodes, whereas <strong><em>OpenR</em></strong> follows the conventional RL framework and makes the LLM call in a centralized <code class="language-plaintext highlighter-rouge">env</code> entity. I feel the latter is more user-friendly, and I would like to port rStar to our developing codebase.</p>

<p>The language node class <code class="language-plaintext highlighter-rouge">Reasoning_MCTS_Node</code> contains basic attributes such as parent, depth and node type, as well as the key information for building children, such as the generator function and the question status. It first inherits almost everything from its parent and then defines detailed rules for action generation. Simple as that.</p>

<p>In its <code class="language-plaintext highlighter-rouge">_create_children</code> function you will see the essence of the project. There are five ways of generating actions, written as <code class="language-plaintext highlighter-rouge">def do_action_generate_xxx</code>. Each of them queries the generator to create specific prompts and generate responses, which form the children nodes. All children nodes are added to the current language node and wait for selection and value update.</p>
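
<p>To make the structure concrete, here is a minimal sketch of a node that builds its children by querying a generator once per action type. It is not the rStar code: <code class="language-plaintext highlighter-rouge">SketchNode</code>, <code class="language-plaintext highlighter-rouge">stub_generator</code> and the action names are hypothetical, and the generator is a stub rather than an LLM call.</p>

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical action names; the real repo defines its own node types.
ACTIONS = ["direct_answer", "subquestion", "re_answer", "rephrase", "one_step_thought"]


def stub_generator(action: str, context: str) -> List[str]:
    """Stand-in for the LLM call: returns one canned response per action."""
    return [f"[{action}] response to: {context}"]


@dataclass(eq=False)
class SketchNode:
    text: str
    node_type: str = "root"
    parent: Optional["SketchNode"] = field(default=None, repr=False)
    generator: Callable[[str, str], List[str]] = stub_generator
    depth: int = 0
    children: List["SketchNode"] = field(default_factory=list)

    def create_children(self) -> List["SketchNode"]:
        """Query the generator once per action type and attach the results."""
        for action in ACTIONS:
            for response in self.generator(action, self.text):
                child = SketchNode(
                    text=response,
                    node_type=action,
                    parent=self,
                    generator=self.generator,  # children inherit the generator
                    depth=self.depth + 1,
                )
                self.children.append(child)
        return self.children


root = SketchNode(text="What is 12 * 7?")
kids = root.create_children()
print([k.node_type for k in kids])
```

<p>Each child keeps a reference to its parent and to the same generator, which mirrors the "inherit almost everything" pattern described above. A real searcher would then select among the children and back up their values.</p>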

<h3 id="1025">10.25</h3>

<p>OK, I have now almost finished the re-implementation of rStar in <a href="https://github.com/openreasoner/openr"><strong><em>OpenR</em></strong></a>. At least I made it executable :). The motivation is that I hope to scale it up by borrowing the way <strong><em>OpenR</em></strong> uses the Ray package. I have also noticed what looks like a tiny bug in the original rStar repo (I might be wrong): <code class="language-plaintext highlighter-rouge">is_valid_solution_node</code> seems to accept the <code class="language-plaintext highlighter-rouge">DIRECT ANSWER</code>, <code class="language-plaintext highlighter-rouge">SUBQUESTION</code> and <code class="language-plaintext highlighter-rouge">OST</code> node types, but answer extraction throws <code class="language-plaintext highlighter-rouge">OST</code> away. My current task is to run experiments and demonstrate that this week’s work was not wasted! Fingers crossed!</p>]]></content><author><name>Yan Song</name><email>yan.song.24@ucl.ac.uk</email></author><category term="Code" /><summary type="html"><![CDATA[Re-implementation of rStar MCTS]]></summary></entry><entry><title type="html">Daily Dose of Large Language Models</title><link href="https://yansong97.github.io/posts/2012/08/blog-post-1/" rel="alternate" type="text/html" title="Daily Dose of Large Language Models" /><published>2024-10-02T00:00:00+00:00</published><updated>2024-10-02T00:00:00+00:00</updated><id>https://yansong97.github.io/posts/2012/08/blog-post-1</id><content type="html" xml:base="https://yansong97.github.io/posts/2012/08/blog-post-1/"><![CDATA[<p>Update LLM paper everyday!</p>

<h1 id="09oct">09.Oct.</h1>

<h3 id="scalable-and-accurate-graph-reasoning-with-llm-based-multi-agents">SCALABLE AND ACCURATE GRAPH REASONING WITH LLM-BASED MULTI-AGENTS</h3>
<p><em>Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu</em></p>

<p>In graph reasoning tasks, traditional methods often use a single LLM, whereas this paper proposes a framework based on multi-agent collaboration. On each node of the graph, an LLM receives and passes messages until the maximum number of iterations is reached.</p>

<p>KeyWord: Multi-agent, Graph Reasoning</p>

<h3 id="reviseval-improving-llm-as-a-judge-via-response-adapted-references">REVISEVAL: IMPROVING LLM-AS-A-JUDGE VIA RESPONSE-ADAPTED REFERENCES</h3>
<p><em>Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma</em></p>

<p>Good related work on LLM-as-a-Judge, but the methodology seems trivial. For a generated response, the paper uses one LLM to generate a reference and another LLM to evaluate the input-output-reference combination.</p>

<p>KeyWord: LLM-as-Judge</p>

<h3 id="efficient-inference-for-large-language-modelbased-generative-recommendation">EFFICIENT INFERENCE FOR LARGE LANGUAGE MODEL-BASED GENERATIVE RECOMMENDATION</h3>
<p><em>Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua</em></p>

<p>A Speculative Decoding variant</p>

<p>KeyWord: LLM Inference</p>

<h1 id="03oct">03.Oct.</h1>

<h3 id="loki-an-open-source-tool-for-fact-verification">Loki: An Open-Source Tool for Fact Verification</h3>
<p><em>Haonan Li, Xudong Han, Hao Wang, Yuxia Wang, Minghan Wang, Rui Xing, Yilin Geng, Zenan Zhai, Preslav Nakov, Timothy Baldwin (LibrAI, MBZUAI, Monash University, The University of Melbourne)</em></p>

<p>The paper uses direct prompting to decompose the task of fact checking into five processes. The sub-modules are: Decomposer, Checkworthiness Identifier, Query Generator, Evidence Retriever and Claim Verifier. I like the idea of breaking the thinking process into pre-defined stages, and direct prompting is a straightforward implementation. The paper also talks about a practical parallel implementation.</p>

<p>KeyWord: Direct Prompting</p>

<h3 id="when-a-language-model-is-optimized-for-reasoning-does-it-still-show-embers-of-autoregression-an-analysis-of-openai-o1">When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1</h3>
<p><em>R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths (Yale, OpenAI, Princeton)</em></p>

<p>A technical report finding that o1 scores substantially better on examples with high-probability outputs than on ones with low-probability outputs. o1 also shows substantially less sensitivity to task frequency than the other LLMs. The problem of autoregression still exists.</p>

<p>KeyWord: Evaluation, O1</p>

<h3 id="dreamgarden-a-designer-assistant-for-growing-games-from-a-single-prompt">DreamGarden: A Designer Assistant for Growing Games from a Single Prompt</h3>
<p><em>Sam Earle, Samyak Parajuli, Andrzej Banburski-Fahey</em></p>

<p>An LLM assistant for game design through direct prompting. The key is to decompose the task and assign the subtasks to individual agents.</p>

<p>KeyWord: Direct Prompting, Human-AI Interaction</p>

<h3 id="investigating-on-rlhf-methodology">Investigating on RLHF methodology</h3>
<p><em>Alexey Kutalev, Sergei Markoff</em></p>

<p>A survey on RLHF (well, it saves me time :))</p>

<p>KeyWord: Survey, RLHF</p>

<h3 id="open-rag-enhanced-retrieval-augmented-reasoning-with-open-source-large-language-models">OPEN-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models</h3>
<p><em>Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez</em></p>

<p>LLM + MoE. The model is trained to generate retrieval/no_retrieval reflection tokens. During inference, it measures the confidence of outputs conditioned on enforced no_retrieval to decide whether to do retrieval.</p>

<p>KeyWord: RAG</p>

<h3 id="trained-transformer-classifiers-generalize-and-exhibit-benign-overfitting-in-context">Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context</h3>
<p><em>Spencer Frei, Gal Vardi (UC Davis)</em></p>

<p>Investigates the generalization abilities of a linear transformer on linear classification tasks. It generalizes well even when the data has label-flipping noise: in ICL, the model memorizes the noisy labels but still generalizes.</p>

<p>KeyWord: foundation model, generalization</p>

<h3 id="define-enhancing-llm-decision-making-with-factor-profiles-and-analogical-reasoning">DEFINE: ENHANCING LLM DECISION-MAKING WITH FACTOR PROFILES AND ANALOGICAL REASONING</h3>
<p><em>Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu (University of Central Florida)</em></p>

<p>This one is interesting. The question it tries to answer is: how can LLMs guide investment decisions by analyzing earnings call transcripts? The goal is to dig out the key factors that matter for decision-making. The paper interprets this in a Bayesian-inference way. Given an article, the LLM is first used (as a prior proposal) to summarize factors, then they train a Bradley-Terry model to score the likelihood (the conditional likelihood function). A direct posterior sampling analogy.</p>
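
<p>A toy sketch of that analogy, assuming nothing about the paper's actual pipeline: the textbook Bradley-Terry preference probability plays the role of the likelihood, and it is combined with prior weights by a plain Bayes normalization. The function names and all the numbers below are hypothetical.</p>

```python
import math
from typing import List


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """P(a is preferred over b) under Bradley-Terry with log-strength scores."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def posterior_weights(prior: List[float], likelihood: List[float]) -> List[float]:
    """Normalize prior * likelihood over the candidate factors."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


prior = [0.5, 0.3, 0.2]      # hypothetical LLM proposal weights for three factors
scores = [0.2, 1.5, -0.3]    # hypothetical learned Bradley-Terry strengths
baseline = 0.0               # hypothetical reference strength to compare against

likelihood = [bradley_terry_prob(s, baseline) for s in scores]
posterior = posterior_weights(prior, likelihood)
print(posterior)
```

<p>A factor with a modest prior can still end up with the largest posterior weight if its Bradley-Terry score is high enough, which is the point of combining the two.</p>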

<p>KeyWord: factor analysis, bayesian sense</p>

<h3 id="quantifying-generalization-complexity-for-large-language-models">QUANTIFYING GENERALIZATION COMPLEXITY FOR LARGE LANGUAGE MODELS</h3>
<p><em>Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass (Harvard, MIT, Chicago, Meta)</em></p>

<p>The paper proposes an evaluation pipeline to test generalization. The key insight is to query the LLM for its in-distribution data, sample OOD data from the complement, and then test the LLM on both sets of synthetic data. A dynamic generalization evaluation pipeline, also interesting. I wonder what an open-ended version would look like.</p>

<p>KeyWord: evaluation, generalization</p>

<h3 id="knowledge-driven-feature-selection-and-engineering-for-genotype-data-with-large-language-models">Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models</h3>
<p><em>Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen (University of Pennsylvania)</em></p>

<p>LLM for bioinformatics! Always happy to see these cross-area papers. Data quality is always the bottleneck here due to high dimensionality, missing data or small sample size. The paper utilizes the prior knowledge embedded in the LLM and applies self-consistent inference to select key features from tabular data.</p>

<p>KeyWord: bioinformatics, feature selection, application of LLM</p>

<h3 id="locret-enhancing-eviction-in-long-context-llm-inference-with-trained-retaining-heads">LOCRET: ENHANCING EVICTION IN LONG-CONTEXT LLM INFERENCE WITH TRAINED RETAINING HEADS</h3>
<p><em>Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu (Tsinghua)</em></p>

<p>The paper proposes a novel way to do selective KV cache eviction for long-context inference. They train a retaining head to estimate the causal importance of each cache unit. This training paradigm gives accurate token importance scores and can be integrated with other efficient inference algorithms.</p>
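
<p>The eviction step itself can be pictured with a small sketch: given one importance score per cache unit, keep only the highest-scoring units within a budget. In the paper the scores come from the trained retaining heads; here they are hard-coded, and <code class="language-plaintext highlighter-rouge">evict_kv_cache</code> is a hypothetical helper, not the paper's implementation.</p>

```python
from typing import List, Sequence


def evict_kv_cache(cache: Sequence[str], scores: Sequence[float], budget: int) -> List[str]:
    """Keep the `budget` highest-scoring cache units, preserving their original order."""
    if len(cache) != len(scores):
        raise ValueError("cache and scores must have the same length")
    # Rank positions by importance, highest first, and keep the top `budget`.
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore positional order for the survivors
    return [cache[i] for i in keep]


cache = ["k0", "k1", "k2", "k3", "k4"]   # stand-ins for cached key/value units
scores = [0.9, 0.1, 0.5, 0.8, 0.2]       # hypothetical importance scores
print(evict_kv_cache(cache, scores, budget=3))
```

<p>Keeping the survivors in positional order matters because the remaining cache still has to look like a prefix-ordered context to the attention layers.</p>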

<p>KeyWord: hardware-level optimization</p>

<h1 id="02oct">02.Oct.</h1>

<h3 id="on-the-implications-of-verbose-llm-outputs-a-case-study-in-translation-evaluation">On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation</h3>
<p><em>Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag (Google)</em></p>

<p>In machine translation, verbosity is both bad and prevalent. The paper explores what causes verbosity and how it affects evaluation.</p>

<p>KeyWord: NLP</p>

<h3 id="quantifying-reliance-on-external-information-over-parametric-knowledge-during-retrieval-augmented-generation-rag-using-mechanistic-analysis">Quantifying reliance on external information over parametric knowledge during Retrieval Augmented Generation (RAG) using mechanistic analysis</h3>
<p><em>Reshmi Ghosh, Rahul Seetharaman, Hitesh Wadhwa, Somyaa Aggarwal, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari, Ehsan Aghazadeh (Microsoft; University of Massachusetts, Amherst; University of Maryland, College Park)</em></p>

<p>RAG-based LLMs take a shortcut: they answer questions from the context information instead of the model prior. The paper proposes three mechanistic ways to identify how an LLM utilizes information: a causal inference method, attention weight checking, and attention knockout (remove one attention edge and see how the performance degrades).</p>
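
<p>Attention knockout is easy to picture on a toy example. The sketch below is only an illustration under simplifying assumptions (a single query, scalar values, made-up scores), not the paper's setup: it zeroes the weight of one attention edge, renormalizes over the rest, and compares the output with the baseline.</p>

```python
import math
from typing import Iterable, Optional, Sequence


def attention_output(
    scores: Sequence[float],
    values: Sequence[float],
    knocked_out: Optional[Iterable[int]] = None,
) -> float:
    """Softmax-weighted sum of scalar values; knocked-out positions get zero weight."""
    blocked = set(knocked_out or [])
    exps = [0.0 if i in blocked else math.exp(s) for i, s in enumerate(scores)]
    total = sum(exps)
    if total == 0.0:
        raise ValueError("cannot knock out every attention edge")
    return sum(e / total * v for e, v in zip(exps, values))


scores = [2.0, 0.5, 0.1]   # hypothetical attention logits for three context tokens
values = [1.0, 0.0, 0.0]   # only token 0 carries the signal in this toy setup

baseline = attention_output(scores, values)
ablated = attention_output(scores, values, knocked_out=[0])
print(baseline - ablated)  # drop in output caused by removing the edge to token 0
```

<p>A large drop after removing an edge suggests the output relied on that token, which is the kind of evidence the knockout analysis looks for.</p>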

<p>KeyWord: RAG</p>

<h3 id="addition-is-all-you-need-for-energy-efficient-language-models">ADDITION IS ALL YOU NEED FOR ENERGY-EFFICIENT LANGUAGE MODELS</h3>
<p><em>Hongyin Luo &amp; Wei Sun (BitEnergy AI, Inc)</em></p>

<p>Approximates tensor multiplication with integer addition. The paper does a detailed computational complexity analysis and achieves performance similar to the full-scale model while using less energy. Worth reading.</p>
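
<p>To get a feel for why addition can stand in for multiplication, here is a small sketch under simplifying assumptions. It is not the paper's algorithm: each positive float is written as (1 + a) * 2^e, the exponents are added, the mantissa fractions are added, and the a * b cross term is simply dropped. <code class="language-plaintext highlighter-rouge">approx_mul</code> is a hypothetical name, and the paper's own correction term is not reproduced.</p>

```python
import math


def approx_mul(x: float, y: float) -> float:
    """Approximate x * y for positive floats using only additions on mantissa and exponent."""
    if not (x > 0 and y > 0):
        raise ValueError("this sketch only handles positive inputs")
    mx, ex = math.frexp(x)   # x == mx * 2**ex, with mx in [0.5, 1)
    my, ey = math.frexp(y)
    a = 2.0 * mx - 1.0       # fraction so that x == (1 + a) * 2**(ex - 1)
    b = 2.0 * my - 1.0
    # Exact product mantissa is 1 + a + b + a*b; the sketch drops a*b.
    return math.ldexp(1.0 + a + b, (ex - 1) + (ey - 1))


for x, y in [(2.0, 4.0), (1.5, 1.5), (3.0, 5.0)]:
    print(x, y, approx_mul(x, y), x * y)
```

<p>Powers of two come out exact because their fractions are zero. Otherwise the result undershoots by the dropped cross term, which is the kind of error a correction term has to absorb.</p>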

<p>KeyWord: computational complexity, hardware-level optimization</p>

<h3 id="generative-ai-and-perceptual-harms-whos-suspected-of-using-llms">Generative AI and Perceptual Harms: Who’s Suspected of using LLMs?</h3>
<p><em>Kowe Kadoma, Danaë Metaxa, Mor Naaman</em></p>

<p>How much harm is caused to users when others perceive or suspect them of using AI? The paper runs social experiments to verify these perceptual harms.</p>

<p>KeyWord: AI for social good</p>

<h3 id="removing-distributional-discrepancies-in-captions-improves-image-text-alignment">Removing Distributional Discrepancies in Captions Improves Image-Text Alignment</h3>
<p><em>Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh (University of Wisconsin-Madison, Adobe Research)</em></p>

<p>Solves the data imbalance problem in image-text alignment by generating negative samples from the positive ones.</p>

<p>KeyWord: data augmentation</p>

<h3 id="aligning-human-and-llm-judgments-insights-from-evalassist-on-task-specific-evaluations-and-ai-assisted-assessment-strategy-preferences">Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences</h3>
<p><em>Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, Werner Geyer (IBM)</em></p>

<p>The paper investigates how humans behave as evaluators of LLM outputs, trying to understand what practitioners prioritize in evaluation criteria when using LLMs as judges and how these priorities differ. The paper also provides a tool to help practitioners refine evaluation criteria using both direct and pairwise assessment strategies.</p>

<p>KeyWord: Trustworthy AI, human-AI interaction</p>]]></content><author><name>Yan Song</name><email>yan.song.24@ucl.ac.uk</email></author><category term="LLM" /><category term="Daily" /><summary type="html"><![CDATA[Update LLM paper everyday!]]></summary></entry></feed>