sonnet · Related Events
Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
2026-06-05 · PRODUCT_LAUNCH · Impact: MEDIUM
Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison
2026-06-05 · PRODUCT_LAUNCH · Impact: MEDIUM
Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill
2026-06-05 · PRODUCT_LAUNCH · Impact: MEDIUM
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives
2026-06-05 · PRODUCT_LAUNCH · Impact: MEDIUM
Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents
2026-06-05 · PRODUCT_LAUNCH · Impact: MEDIUM
AIP: A Graph Representation for Learning and Governing Agent Skills
2026-06-04 · PRODUCT_LAUNCH · Impact: MEDIUM
AIP: A Graph Representation for Learning and Governing Agent Skills
2026-06-04 · SHUTDOWN · Impact: LOW
Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit
2026-06-04 · PRODUCT_LAUNCH · Impact: MEDIUM
GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
2026-06-03 · PRODUCT_LAUNCH · Impact: MEDIUM
Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency
2026-06-03 · PRODUCT_LAUNCH · Impact: MEDIUM
MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety
2026-06-03 · PRODUCT_LAUNCH · Impact: MEDIUM
MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety
2026-06-03 · REGULATION · Impact: MEDIUM
Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems
2026-06-02 · PRODUCT_LAUNCH · Impact: MEDIUM
LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
2026-06-02 · PRODUCT_LAUNCH · Impact: MEDIUM
From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication
2026-06-02 · PRODUCT_LAUNCH · Impact: MEDIUM
Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
2026-06-02 · PRODUCT_LAUNCH · Impact: MEDIUM
Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas
2026-06-02 · PRODUCT_LAUNCH · Impact: MEDIUM
ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
2026-06-01 · PRODUCT_LAUNCH · Impact: MEDIUM
ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
2026-06-01 · REGULATION · Impact: MEDIUM
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
AIRGuard: Guarding Agent Actions with Runtime Authority Control
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
2026-05-29 · PRODUCT_LAUNCH · Impact: MEDIUM
"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
2026-05-29 · BREAKTHROUGH · Impact: HIGH
Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction
2026-05-28 · PRODUCT_LAUNCH · Impact: MEDIUM
The Alignment Floor: When Persona Customization Is Safe
2026-05-28 · PRODUCT_LAUNCH · Impact: MEDIUM
Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict
2026-05-28 · PRODUCT_LAUNCH · Impact: MEDIUM
DRTriton: Large-Scale Synthetic Data Driven Reinforcement Learning for Triton Kernel Generation
2026-05-28 · PRODUCT_LAUNCH · Impact: MEDIUM
DRTriton: Large-Scale Synthetic Data Driven Reinforcement Learning for Triton Kernel Generation
2026-05-28 · BREAKTHROUGH · Impact: HIGH
A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents
2026-05-28 · PRODUCT_LAUNCH · Impact: MEDIUM
A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks
2026-05-27 · PRODUCT_LAUNCH · Impact: MEDIUM
A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks
2026-05-27 · BREAKTHROUGH · Impact: HIGH
DEI: Diversity in Evolutionary Inference for Quality-Diversity Search
2026-05-27 · PRODUCT_LAUNCH · Impact: MEDIUM
LLMs versus the Halting Problem: Characterizing Program Termination Reasoning
2026-05-27 · PRODUCT_LAUNCH · Impact: MEDIUM
Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study
2026-05-27 · PRODUCT_LAUNCH · Impact: MEDIUM
MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving
2026-05-27 · FUNDING · Impact: MEDIUM
MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving
2026-05-27 · PRODUCT_LAUNCH · Impact: MEDIUM
MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving
2026-05-27 · OPEN_SOURCE · Impact: MEDIUM
GPT-5.5 Strikes Back and Overtakes Claude for the Top Spot: Are the Old AI Coding Leaderboards No Longer Accurate?
2026-05-27 · PERSONNEL · Impact: LOW
Chain-of-Thought Hijacking
2026-05-26 · PRODUCT_LAUNCH · Impact: MEDIUM