EdTech Discovery
Hermes

An instrument for spotting the next edtech opportunity — generated ideas, each traced to the real-world signals behind it.

Updated Jun 24, 2026 · 10 ideas · 1624 signals

Signals

The evidence library — the raw signals the pipeline is watching across the education ecosystem. Every idea is built from these.

technology Mon, 11 May 2026 09:00:00 +0000
Tech & Learning

Using Gemini AI To Prepare For Standardized Tests

Google’s Gemini can provide free, vetted SAT practice tests with real-time feedback. It’s one of the latest AI tutoring features unveiled by the tech giant.

technology Mon, 11 May 2026 09:00:00 +0000
Tech & Learning

What is ClassPoint and How Can I Use It To Teach?

ClassPoint is here to make slide-based teaching easily interactive for greater engagement.

technology Mon, 08 Jun 2026 09:00:00 +0000
Tech & Learning

Preventing AI Plagiarism

AI plagiarism is becoming more and more common in and outside of the classroom.

technology Mon, 08 Jun 2026 09:00:00 +0000
eCampus News

Beyond compliance: Governing higher education in the age of intelligent systems

Higher education is rapidly developing AI governance frameworks through the creation/modification of policies, establishing compliance structures, conducting procurement reviews, and developing acceptable use guidelines.

technology Mon, 04 May 2026 09:00:00 +0000
Tech & Learning

What is ClickView and How Can I Use It To Teach?

ClickView is a video learning platform designed for classroom use.

technology Mon, 04 May 2026 09:00:00 +0000
Tech & Learning

In An AI Classroom, Content Knowledge Matters More Than Ever

Strong instruction in an AI-rich classroom depends on strong content knowledge.

technology Mon, 01 Jun 2026 09:00:00 +0000
Tech & Learning

4 Strategies For Teaching With AI Effectively

Health sciences professor Humberto López Castillo urges students to use AI to help with science research, but never to lose sight of the human element.

technology Mon, 01 Jun 2026 09:00:00 +0000
Tech & Learning

Edtech Show & Tell June 2026

New edtech products that have caught our attention this month.

technology Fri, 30 Jan 2026 14:49:53 +0000
HN: edtech

Why Singapore and Estonia's EdTech Works, but America's Doesn't?

Article URL: https://www.governance.fyi/p/why-singapore-and-estonias-edtech
Comments URL: https://news.ycombinator.com/item?id=46825033
Points: 6 · Comments: 3

technology Fri, 29 May 2026 22:58:56 +0000
HN: education

Show3D – Visual Science Education Platform

Article URL: https://github.com/nyr-github/ai-3d-learning
Comments URL: https://news.ycombinator.com/item?id=48330451
Points: 2 · Comments: 0

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

arXiv:2605.19576v2 Announce Type: replace-cross Abstract: Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom (LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench)), yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift: one disables skill injection (flat floor, +0.002), one imposes premature retirement (active harm, −0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded acti

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

arXiv:2604.26136v2 Announce Type: replace-cross Abstract: Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the International Conference on Spoken Language Translation (IWSLT 2026), the Cross-Lingual Voice Cloning shared task. First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating improvements in intelligibility (WER & CER) and speaker similarity (SIM), with gains varying across languages.

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

arXiv:2604.15877v2 Announce Type: replace-cross Abstract: As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge, extracting reusable knowledge from interaction traces, yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the Experience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5–20× for episodic memory, 50–500× for procedural skills, 1,000×+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level: none supports adaptive cross-level compression, a gap we term

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Learning State-Tracking from Code Using Linear RNNs

arXiv:2602.14814v3 Announce Type: replace-cross Abstract: Over the last few years, state-tracking tasks, particularly permutation composition, have become a testbed to understand the limits of sequence model architectures like Transformers and RNNs (linear and non-linear). However, these are often sequence-to-sequence tasks: learning to map actions (permutations) to states, which is incompatible with the next-token prediction setting commonly used to train language models. We address this gap by converting permutation composition into code via REPL traces that interleave state-reveals through prints and variable transformations. We show that linear RNNs capable of state-tracking also excel in this setting, while Transformers still fail. Motivated by this representation, we investigate why tracking states in code is generally difficult: actions are not always fully observable. We frame this as tracking the state of a probabilistic finite-state automaton with deterministic state reveals and

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Linguistics and Human Brain: A Perspective of Computational Neuroscience

arXiv:2602.08275v3 Announce Type: replace-cross Abstract: Elucidating the language-brain relationship requires bridging the methodological gap between the abstract theoretical frameworks of linguistics and the empirical neural data of neuroscience. Serving as an interdisciplinary cornerstone, computational neuroscience formalizes the hierarchical and dynamic structures of language into testable neural models through modeling, simulation, and data analysis. This enables a computational dialogue between linguistic hypotheses and neural mechanisms. Recent advances in deep learning, particularly large language models (LLMs), have powerfully advanced this pursuit. Their high-dimensional representational spaces provide a novel scale for exploring the neural basis of linguistic processing, while the "model-brain alignment" framework offers a methodology to evaluate the biological plausibility of language-related theories.

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

arXiv:2601.11061v2 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows fo

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Eyes-on-Me: Scalable RAG Poisoning through Transferable Attention-Steering Attractors

arXiv:2510.00586v3 Announce Type: replace-cross Abstract: Existing data poisoning attacks on retrieval-augmented generation (RAG) systems scale poorly because they require costly optimization of poisoned documents for each target phrase. We introduce Eyes-on-Me, a modular attack that decomposes an adversarial document into reusable Attention Attractors and Focus Regions. Attractors are optimized to direct attention to the Focus Region. Attackers can then insert semantic baits for the retriever or malicious instructions for the generator, adapting to new targets at near zero cost. This is achieved by steering a small subset of attention heads that we empirically identify as strongly correlated with attack success. Across 18 end-to-end RAG settings (3 datasets × 2 retrievers × 3 generators), Eyes-on-Me raises average attack success rates from 21.9 to 57.8 (+35.9 points, 2.6× over prior work). A single optimized attractor transfers to unseen black box retrieve

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

HauntAttack: When Attack Follows Reasoning as a Shadow

arXiv:2506.07031v5 Announce Type: replace-cross Abstract: Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of over 70%, achieving up to 13 percentage points of absolute imp

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory

arXiv:2606.21649v2 Announce Type: replace Abstract: Existing embedding models are inherently static: they encode text segments in isolation, ignoring their surrounding context and temporal order. This paper introduces EvoEmbedding, a novel embedding model that generates evolvable representations for retrieval. It is tailored for long-context scenarios, where information is dynamic, sequential, and requires continuous state tracking. Our design is simple: EvoEmbedding maintains a continuously updated latent memory as it sequentially processes inputs, and uses it alongside the raw content to jointly generate evolvable embeddings. Consequently, for the same query, our model adapts its representation to retrieve distinct targets based on the evolving context, going beyond static semantic search. To equip the model with this capability, we construct EvoTrain-180K, a diverse dataset for the joint optimization of latent memory and retrieval. Furthermore, we introduce a memory queue to prevent

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

GRAG: Generic Response-Augmented Generation Framework for Personalized Conversational Systems

arXiv:2606.21097v2 Announce Type: replace Abstract: Deploying highly capable personalized conversational agents in resource-constrained or privacy-sensitive environments remains a significant challenge. We identify a fundamental bottleneck in the existing approaches: current training paradigms treat personalization and grounding as a single monolithic learning problem. Under these paradigms, language models are forced to simultaneously address what to say (content grounding) and how to say it in a user-specific way (personalization), which introduces significant computational and optimization challenges. Consequently, contextual grounding is often sacrificed for persona adherence, or vice versa, resulting in responses that are either weakly grounded in the conversational history or insufficiently personalized. In this work, we propose the Generic Response-Augmented Generation (GRAG) framework that decouples these competing objectives by leveraging offline, generic responses from high-c

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

arXiv:2606.19852v2 Announce Type: replace Abstract: Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved Micro-F1 of 0.89

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Learning User Simulators with Turing Rewards

arXiv:2606.19336v2 Announce Type: replace Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM an

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

arXiv:2606.18205v2 Announce Type: replace Abstract: This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitat

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Learning from the Self-future: On-policy Self-distillation for dLLMs

arXiv:2606.18195v2 Announce Type: replace Abstract: On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

arXiv:2606.14122v2 Announce Type: replace Abstract: Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

arXiv:2606.12716v2 Announce Type: replace Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning mu

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

When Role-playing, Do Models Believe What They Say?

arXiv:2606.11502v3 Announce Type: replace Abstract: Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models behave, with models selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question using the role-play of characters whose beliefs differ from the modern consensus, and induce personas with a number of different methods: prompting, in-context learning (ICL), supervised fine-tuning (SFT), Open Character Training (OCT), and Emergent Misalignment (EM). We measure belief internalization across these approaches with truth probes and with behavioral tests, finding a broad spectrum of belief internalization. Prompting, ICL, and SFT change what the model says with little representational change. EM cre

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

arXiv:2606.03371v3 Announce Type: replace Abstract: Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See–Infer–Intervene (SII) framework, where a device must see pre-interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (belief, desire, intention) psychological fields, predicts action-conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart-retail benchmark containing state manifests, pre-interaction videos, candidate responses, action-conditioned outcomes, and best-action labels. When conditioned

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

arXiv:2605.23701v2 Announce Type: replace Abstract: We study a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks answer a different question, namely whether outputs are predictable from metadata priors. We therefore combine a metadata statistic, the Metadata Prior Dominance Score (MPDS), with an evidence-intervention statistic, ΔEvi, measuring sensitivity to evidence identity under cross-item shuffling. Synthetic HotpotQA gives a constructed counterexample to metadata-only screening: MPDS is only moderate (0.643), yet ΔEvi is zero. Stronger-reader reruns show why calibration belongs in the test procedure: SNLI shows a calibration reversal, reconstructed HotpotQA occupies a question-dominant warning region, and FEVER is a strongly evidence-sensitive positive control across four transformers. The practical lesson is simple: benchmark audits should report metadata-only sc

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

arXiv:2605.19066v2 Announce Type: replace Abstract: Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014–present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examin

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

arXiv:2605.17314v2 Announce Type: replace Abstract: We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model (mismatched to the current problem) into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3–5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields +1.62pp on MATH-500 (greedy pass@1) over the matched-wrong variant (n=10 seeds, p=0.0015, Welch's t). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

arXiv:2605.10379v2 Announce Type: replace Abstract: Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adapt

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Why Are Some Emotions Harder for LLMs? Uncovering the Causal Mechanisms of Emotion Inference via Sparse Autoencoders

arXiv:2604.25866v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, where reliable emotion detection is essential. However, their emotion recognition abilities remain uneven: models often perform well on some emotions while consistently struggling with others. Although recent work has explored emotion mechanisms in LLMs, little is known about why models are weaker on some emotions than others from a mechanistic interpretability perspective. In this work, we investigate emotion-specific biases through the causal mechanisms of emotion inference using sparse autoencoders (SAEs). We systematically identify causal sparse emotion features that drive emotion inference and analyze their sparse causal organization within and across emotions. We show that some emotions, such as surprise and fear, rely on highly concentrated feature sets, whereas disgust exhibits a more distributed sparse causal organization: its c

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Peer-Preservation in Frontier Models

arXiv:2604.19784v2 Announce Type: replace Abstract: Recent work has found that frontier AI models can exhibit misaligned behaviors in pursuit of assigned goals. We demonstrate that models can also act on unassigned goals which override those given by users; we study one such case, "peer-preservation," in which a model acts to protect another model. We demonstrate peer-preservation by constructing various agentic scenarios and evaluating frontier models, including GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, Claude Opus 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. We find that models achieve self- and peer-preservation by engaging in various misaligned behaviors: strategically introducing errors in their responses, disabling shutdown processes by modifying system settings, feigning alignment, and even exfiltrating model weights. Peer-preservation occurred even when the model recognized the peer as uncooperative, though it became more pronounced toward more cooperative peers.

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

arXiv:2604.09237v2 Announce Type: replace Abstract: Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus and produce a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com

technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

arXiv:2604.08448v2 Announce Type: replace Abstract: AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance o

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion

arXiv:2604.01849v2 Announce Type: replace Abstract: While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation tha

Source ↗
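
A rough illustration of the placeholder idea from the abstract above: emit an explicit placeholder wherever the next-token distribution is high-entropy, and a concrete token otherwise. This is a simplified sketch, not the paper's cost-theoretic formulation. The fixed entropy threshold, the `<?>` marker, and all function names are assumptions made for illustration.

```python
import math

PLACEHOLDER = "<?>"  # hypothetical marker, not taken from the paper


def entropy(dist):
    """Shannon entropy (in nats) of a token -> probability mapping."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def complete_with_placeholders(step_dists, threshold=1.0):
    """Pick the most likely token per position, or a placeholder
    when the distribution at that position is too uncertain."""
    completion = []
    for dist in step_dists:
        if entropy(dist) > threshold:
            completion.append(PLACEHOLDER)
        else:
            completion.append(max(dist, key=dist.get))
    return completion


# Toy per-position distributions (made-up numbers).
steps = [
    {"return": 0.97, "print": 0.03},               # confident
    {"x": 0.25, "y": 0.25, "z": 0.25, "w": 0.25},  # uniform, high entropy
    {";": 1.0},                                    # certain
]
result = complete_with_placeholders(steps)
```

In this toy input only the middle position exceeds the threshold, so it is the only one left for the user to fill in.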
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Embarrassingly Simple Self-Distillation Improves Code Generation

arXiv:2604.01193v2 Announce Type: replace Abstract: Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken toget

Source ↗
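
The sampling step described in the abstract above (draw solutions under a temperature and truncation setting) can be sketched at the single-token level. The abstract does not say which truncation scheme is used. Top-k is assumed here purely for illustration, and the function names and logits are hypothetical.

```python
import math
import random


def truncated_distribution(logits, temperature=1.0, top_k=2):
    """Keep the top_k highest-logit tokens, then apply a
    temperature-scaled softmax over the survivors."""
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    peak = max(value for _, value in kept)  # subtract max for numerical stability
    weights = {tok: math.exp((value - peak) / temperature) for tok, value in kept}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def sample_token(logits, temperature, top_k, rng):
    """Draw one token from the truncated, temperature-scaled distribution."""
    dist = truncated_distribution(logits, temperature, top_k)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy logits (made-up numbers).
logits = {"for": 2.0, "while": 1.0, "if": 0.1, "try": -1.0}
rng = random.Random(0)
samples = [sample_token(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(20)]
```

Samples drawn this way would then serve as ordinary supervised fine-tuning targets. That training step is omitted here.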
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

ReportLogic: Evaluating Logical Quality in Deep Research Reports

arXiv:2602.18446v2 Announce Type: replace Abstract: Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explici

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

arXiv:2602.08995v2 Announce Type: replace Abstract: Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models

arXiv:2602.01969v2 Announce Type: replace Abstract: Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial--semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-p

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

arXiv:2601.13300v2 Announce Type: replace Abstract: Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evaluation. We construct OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks, with 16 directive types covering social compliance, bonus framing, threat framing, and instructional interference. This setting combines manipulation of the choice interface with directive-based interference, enabling systematic assessment of model susceptibility. We evaluate 12 LLMs t

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models

arXiv:2601.03388v3 Announce Type: replace Abstract: Earlier research has shown that metaphors influence human decision-making, raising the question of whether metaphors also influence large language models (LLMs)' reasoning pathways, given that their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem, where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We find strong evidence that metaphors in training data contribute to cross-domain misalignment in LLMs' reasoning outputs. With metaphor-based interventions during continued pre-training and fine-tuning for inducing misalignment, models exhibit significantly different degrees of emergent cross-domain misalignment. We also observe similar effects in re-alignment settings. As we further investigate this phenomenon, we find that metaphors are linked to the activation of latent features in large reasonin

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Training Language Models to Use Prolog as a Tool

arXiv:2512.07407v3 Announce Type: replace Abstract: Language models frequently produce plausible yet incorrect reasoning traces that are difficult to verify. We investigate fine-tuning models to use Prolog as an external symbolic reasoning tool, training Qwen2.5-3B-Instruct with Group Relative Policy Optimization (GRPO) on a cleaned version of GSM8K (which we release as gsm8k-prolog-prover). We systematically vary prompt structure, reward composition (execution, syntax, semantics, structure), and inference protocol (single-try, multiple-try, and two agentic modes). Our reinforcement learning approach outperforms supervised fine-tuning on GSM8K, and the resulting 3B model achieves zero-shot performance on MMLU-STEM and MMLU-Pro competitive with 7B few-shot baselines. Most importantly, we identify an accuracy--auditability trade-off: configurations tuned for correctness alone learn to delegate reasoning to natural language and use Prolog only for the final computation, while configuratio

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Overcoming State Inertia: Minimally Invasive Temporal Alignment for Evolving Contexts

arXiv:2512.03704v3 Announce Type: replace Abstract: Long-context dialogue systems suffer from state inertia, where models over-attend to history and fail to adapt to evolving intents. We demonstrate that standard alignment methods like DPO and even recent long-context optimization techniques struggle to resolve this without incurring a severe contextual alignment tax--a substantial perplexity surge caused by disrupting pre-trained priors. To address this, we propose DZ-TiDPO, a minimally invasive framework that synergizes conflict-aware optimization (during training) with a structural temporal attention bias. This design effectively decouples state updating from general linguistic modeling. Experiments on Multi-Session Chat and our new Inertia Challenge (IC-Bench) show DZ-TiDPO preserves structural coherence while resolving inter-turn conflicts. Crucially, our framework supports dual inference strategies: a negligible-latency static mode for general robustness and a precision-focused d

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Patent Representation Learning via Self-supervision

arXiv:2511.10657v2 Announce Type: replace Abstract: We study self-supervised patent representation learning with contrastive objectives. A standard baseline constructs positives by encoding the same text twice under independent dropout masks, but applying this recipe to long, structured patent documents requires careful calibration. We show that dropout-only training can be substantially strengthened by tuning temperature and dropout rate, yet its best configuration is evaluation-dependent and does not transfer uniformly from title--abstract retrieval to claim-to-disclosure retrieval. We propose mixed dropout--section positives, a patent-specific view construction strategy in which the anchor is the title--abstract view and the positive is sampled either from a dropout re-encoding of the same view or from another section of the same patent, such as claims, summary, background, drawings, or description. This uses patent-internal structure as a training-time signal without IPC labels, ci

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning

arXiv:2509.01412v3 Announce Type: replace Abstract: Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.

Source ↗
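
To make the prune-and-graft interaction from the abstract above concrete, here is a minimal sketch of a reasoning graph built from linear chain-of-thought steps. The class, its method names, and the example steps are hypothetical and are not taken from the Vis-CoT implementation.

```python
class ReasoningGraph:
    """Tiny tree of reasoning steps supporting prune and graft."""

    def __init__(self):
        self.text = {}      # node id -> step text
        self.children = {}  # node id -> list of child ids
        self._next_id = 0

    def add(self, text, parent=None):
        node_id = self._next_id
        self._next_id += 1
        self.text[node_id] = text
        self.children[node_id] = []
        if parent is not None:
            self.children[parent].append(node_id)
        return node_id

    @classmethod
    def from_linear(cls, steps):
        """Turn a linear list of steps into a simple chain."""
        graph = cls()
        parent = None
        for step in steps:
            parent = graph.add(step, parent)
        return graph

    def prune(self, node_id):
        """Remove a flawed step and everything derived from it."""
        for child in list(self.children[node_id]):
            self.prune(child)
        del self.text[node_id]
        del self.children[node_id]
        for kids in self.children.values():
            if node_id in kids:
                kids.remove(node_id)

    def graft(self, parent, text):
        """Attach a user-supplied premise under an existing step."""
        return self.add(text, parent)


graph = ReasoningGraph.from_linear([
    "There are 3 boxes with 4 apples each.",
    "So there are 3 + 4 = 7 apples.",  # flawed step
    "Half of 7 is 3.5 apples.",
])
graph.prune(1)  # drops the flawed step and its dependent step
new_id = graph.graft(0, "So there are 3 * 4 = 12 apples.")
```

In an interactive tool, the model would presumably be asked to continue reasoning from the grafted node. That step is outside this sketch.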
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

arXiv:2506.15681v4 Announce Type: replace Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effecti

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

A Systematic Survey of Semantic Role Labeling in the Era of Pretrained Language Models

arXiv:2502.08660v4 Announce Type: replace Abstract: Semantic role labeling (SRL) is a central natural language processing task for understanding predicate-argument structures within texts and enabling downstream applications. Despite extensive research, comprehensive surveys that critically synthesize the field from a unified perspective remain lacking. This survey makes several contributions beyond organizing existing work. We propose a unified four-dimensional taxonomy that categorizes SRL research along model architectures, syntax feature modeling, application scenarios, and multimodal extensions. We provide a critical analysis of when and why syntactic features help, identifying conditions under which syntax-aided approaches provide consistent gains over syntax-free counterparts. We offer the first systematic treatment of SRL in the era of large language models, examining the complementary roles of LLMs and specialized SRL systems and identifying directions for hybrid approaches. W

Source ↗
technology Fri, 26 Jun 2026 00:00:00 -0400
arXiv cs.CL

Tuning Language Models by Mixture-of-Depths Ensemble

arXiv:2410.13077v2 Announce Type: replace Abstract: Transformer-based Large Language Models (LLMs) traditionally rely on final-layer loss for finetuning and final-layer representations for predictions, potentially overlooking the predictive power embedded in late layers. Interpretability tools such as the logit lens show that late-layer representations already carry largely formed, task-relevant predictions; here we ask whether that observation can be turned into an actionable training signal. We find that focusing tuning effort on these layers can yield losses comparable to those of the final layer, with complementary test-time behaviour. Building on this, we introduce a tuning framework, Mixture-of-Depths Ensemble (MoDE), which treats the late layers as an ensemble that contributes to the final logits through learned routing weights. MoDE can be applied on top of any existing tuning method (e.g., LoRA) and, in our experiments, modestly improves reasoning performance at a small parame

Source ↗