Conference · 2026

DAC 2026

543 papers
RESEARCH004

Warp-STAR: High-Performance, Differentiable GPU-Accelerated Static Timing Analysis Through Warp-Oriented Parallel Orchestration

En-Ming Huang; Shih-Hao Hung

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

Static timing analysis (STA) is crucial for Electronic Design Automation (EDA) flows but remains a computational bottleneck. While existing GPU-based STA engines are faster than CPU-based ones, they suffer from inefficiencies, particularly intra-warp load imbalance caused by irregular circuit graphs. This paper introduces Warp-STAR, a novel GPU-accelerated STA engine that eliminates this imbalance by orchestrating parallel computations at the warp level. This approach achieves a 2.4X speedup over the previous state-of-the-art (SoTA) GPU-based STA. When integrated into a timing-driven global placement framework, Warp-STAR delivers a 1.7X speedup over SoTA frameworks. The method also proves effective for differentiable gradient analysis with minimal overhead.

EDA · EDA3. Timing Analysis and Optimization · AI · DES5. Emerging Device and Interconnect Technologies
RESEARCH012

ModuPlace: LLM-Assisted Modular PCB Placement via Preference-Optimized Constraint Graph Generation

Yaohui Han; Beichen Li; Mingyang Zhao; Rongliang Fu; Qunsong Ye; Tinghuan Chen; Bei Yu; Tsung-Yi Ho

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

Despite the existence of various automated PCB placement frameworks, the industry still relies heavily on human engineers. This is because these frameworks cannot perform placement according to specific user requirements and preferences, which limits their flexibility and industrial adoption. Moreover, the existing frameworks lack a modular perspective in placement, resulting in outputs that often perform poorly and fail to meet practical requirements. To address these problems, we propose ModuPlace, leveraging the powerful capabilities of LLMs to perform modular PCB placement with preference-optimized constraint graph generation. In ModuPlace, the entire component set is partitioned into modules at different granularities, enabling hierarchical and modular placement. Moreover, ModuPlace utilizes fine-tuning and preference optimization to enhance the quality of constraint graph generation and align the results with those of human experts. Experimental results demonstrate that ModuPlace outperforms all baselines, achieving superior placement quality across all metrics.

EDA · EDA7-II. Physical Design and Verification · Design
RESEARCH025

Powerpath: Internal-Probe-Free Search Interval Reduction for Efficient Statistical Sequential Constraint Characterization

Junzhuo Zhou; Li Huang; Wei Xing; Ting-Jung Lin; Lei He

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

Sequential-constraint characterization consumes up to 80% of library signoff runtime. Traditional path-based methods rely on internal-node observability and fragile topology analysis, requiring extra simulations. They fail fundamentally for toggle-unexpected arcs such as hold and removal, where no robust internal voltage transitions exist. We present power-path, an internal-probe-free method that extracts a physics-grounded supply-current signature from the mandatory reference-delay simulation, enabling universal constraint estimation across all arc types without additional SPICE costs. An affine calibration using lookup table vertices refines the raw estimator for nominal characterization; for statistical deployment, an online ridge regression progressively tightens search intervals across Monte Carlo samples. Evaluated on post-layout TSMC 5 nm and 12 nm libraries across 5 PVT corners, power-path achieves 1.5× speedup with affine calibration and 3.6× with statistical calibration, while maintaining signoff-grade accuracy. The approach requires no topology analysis or additional simulations. We open-source our code and data.
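
The affine calibration mentioned in this abstract can be pictured as a least-squares fit of `reference ≈ a * raw + b` over a handful of anchor points. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `fit_affine` and all sample values are hypothetical, and the abstract does not say how the lookup table vertices are chosen or weighted.

```python
def fit_affine(raw, ref):
    """Ordinary least-squares fit of ref ~ a * raw + b; returns (a, b)."""
    n = len(raw)
    mean_x = sum(raw) / n
    mean_y = sum(ref) / n
    var_x = sum((x - mean_x) ** 2 for x in raw)
    if var_x == 0:
        raise ValueError("raw estimates must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw, ref))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Hypothetical raw estimates and reference constraint values at four
# lookup table vertices (arbitrary units, invented for illustration).
raw_at_vertices = [10.0, 20.0, 30.0, 40.0]
ref_at_vertices = [25.0, 45.0, 65.0, 85.0]

a, b = fit_affine(raw_at_vertices, ref_at_vertices)

# Calibrate a raw estimate taken at a point that is not a vertex.
calibrated = a * 25.0 + b
```

With the fitted pair `(a, b)`, every other raw estimate in the table can be corrected without further simulation, which is the role the abstract assigns to the calibration step.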

EDA · EDA3. Timing Analysis and Optimization · AI · DES5. Emerging Device and Interconnect Technologies
RESEARCH031

Quantum Circuit Synthesis Using an Exact T Library

Hanyu Wang; Mingfei Yu; Xinrui Wu; Jason Cong

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Pioneering the Foundations of Quantum Circuit Synthesis, Compilation and Algorithmic Optimization +1
Abstract

In fault-tolerant quantum circuit synthesis, T gates supplied via magic states dominate space–time cost, while Clifford gates incur negligible overhead. Conventional flows minimize AND count in an {XOR, AND, NOT} basis as a proxy for T, which neglects phase cancellation and can be far from T-optimal. We instead formulate an exact T synthesis problem and canonicalize Boolean functions under Clifford equivalence. By precomputing T-optimal implementations up to seven variables and developing a specialized mapper, we reduce the T count by up to 16% on EPFL benchmarks and improve the best-known T counts of several cryptographic modules.

Design · Quantum · DES6. Quantum Computing · AI
RESEARCH032

TDH-GNN: An Efficient Topology-Driven Accelerator for Dynamic Heterogeneous GNN

Hui Yu; Wei Zhang; Ligang He; Jin Zhao; Yu Zhang; Zixiao Wang

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

Dynamic Heterogeneous Graph Neural Networks (DHGNNs) effectively capture complex and evolving structural and semantic information through metapath learning. Despite notable advancements, current solutions still suffer from redundant metapath matching. To overcome this challenge, we introduce TDH-GNN, a topology-driven accelerator tailored for high-performance DHGNN inference. Specifically, we build an efficient hyperedge-centric incremental execution approach into the accelerator design, using hyperedges to encapsulate dependencies among metapath instances, which enables incremental metapath updates and reduces redundant matching and computation. Implemented on a Xilinx U280, TDH-GNN delivers average speedups of 4.3x and 2.8x, along with 5.9x and 3.7x energy savings, over leading HGNN accelerators.

AI · AI4-I. AI/ML Architecture Design · DES5. Emerging Device and Interconnect Technologies
RESEARCH043

Strix: Re-Thinking NPU Reliability from a System Perspective

Jiapeng Guan; Jie Zhang; Hao Zhou; Ran Wei; Dean You; Hui Wang; Yingquan Wang; Tinglue Wang; Xudong Zhao; Jing Li; Zhe Jiang

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only 1.04x slowdown and minimal hardware overhead.

Systems · SYS6. Time-Critical and Fault-Tolerant System Design · AI · AI5-I. AI/ML System and Platform Design
RESEARCH046

GEMIR: Graph-Based Joint Modeling of Electromigration and IR Drop for Power Grid

Feng Guo; Yueyue Xi; Jingyu Jia; Jiawei Liu; Tianshu Hou; Yuyang Ye; Jianwang Zhai; Kang Zhao; Chuan Shi

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

The combined effects of electromigration (EM) and IR drop critically affect the reliability and power integrity of power grids (PGs). Existing numerical methods are computationally prohibitive for large-scale designs, while most machine learning (ML) approaches analyze EM and IR drop in isolation, overlooking their shared physical structure and correlations in practical flows. To address this limitation, we propose GEMIR, a graph-based multi-task learning framework for the joint prediction of node-level static IR drop and edge-level EM-induced stress. GEMIR employs a cross-layer node-edge attention mechanism to effectively capture the mutual dependence between these two physical fields and integrates a physics-informed neural network (PINN) to enhance physical consistency in the EM path. Furthermore, we establish a composite optimization objective by incorporating physics-informed constraints that embed Kirchhoff's current law (KCL) and Korhonen's PDE to enhance model interpretability. To manage the inherently coupled yet sometimes conflicting optimization dynamics resulting from these constraints, we then develop a Conflict-Gated (CG) multi-task optimization that adaptively fuses or decouples task gradients based on their alignment, thereby achieving mutual optimality. Extensive experiments demonstrate that GEMIR outperforms existing single-task and multi-task baselines in accuracy and generalization. Specifically, it reduces the IR drop MAE by 40.79% and EM-induced stress RMSE by 19.35%, while maintaining high computational efficiency.
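
The idea of fusing or decoupling task gradients based on their alignment can be illustrated with a small gate on cosine similarity. This is only a sketch under stated assumptions: the abstract does not give the Conflict-Gated rule, so the threshold, the function name `conflict_gated_combine`, and the use of a PCGrad-style projection for the conflicting case are all stand-ins chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conflict_gated_combine(g_ir, g_em, threshold=0.0):
    """Combine two task gradients (plain lists of floats).

    Aligned gradients (cosine >= threshold) are summed directly.
    Conflicting gradients are first projected onto the normal plane of
    the other task's gradient (PCGrad-style, an assumption), then summed.
    """
    norm_ir = math.sqrt(dot(g_ir, g_ir))
    norm_em = math.sqrt(dot(g_em, g_em))
    if norm_ir == 0.0 or norm_em == 0.0:
        return [a + b for a, b in zip(g_ir, g_em)]

    cosine = dot(g_ir, g_em) / (norm_ir * norm_em)
    if cosine >= threshold:
        # Fuse: the two tasks agree on the update direction.
        return [a + b for a, b in zip(g_ir, g_em)]

    # Decouple: remove from each gradient its component along the other.
    scale_ir = dot(g_ir, g_em) / (norm_em ** 2)
    scale_em = dot(g_em, g_ir) / (norm_ir ** 2)
    proj_ir = [a - scale_ir * b for a, b in zip(g_ir, g_em)]
    proj_em = [b - scale_em * a for a, b in zip(g_ir, g_em)]
    return [a + b for a, b in zip(proj_ir, proj_em)]


# Hypothetical two-parameter gradients for the IR-drop and EM tasks.
aligned = conflict_gated_combine([1.0, 0.0], [1.0, 1.0])
conflicting = conflict_gated_combine([1.0, 0.0], [-1.0, 1.0])
```

In the conflicting case the combined update no longer points against either task's original gradient, which is the property a gate of this kind is meant to provide.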

EDA · EDA4. Power Analysis and Optimization · AI2-I. AI/ML Algorithms and Models · Design
RESEARCH052

Brain-Like Hyper-Dimensional Graph Learning System with Hardware-Efficient Adaptive Sparsity

Haomin Li; Fangxin Liu; Zongwu Wang; Shiyuan Huang; Ning Yang; Chenyang Guan; Tao Yang; Xinran Liang; Haibing Guan

Date:Monday, July 27 Location:Mtg Room 203AB Session:Advancing the Frontier of Neuromorphic Learning Systems +1
Abstract

Graph neural networks (GNNs) are crucial for numerous applications, yet their huge computational demands often lead to suboptimal performance. Hyper-Dimensional Computing (HDC) is a brain-inspired learning approach for efficient and robust learning. HDC-based Graph Learning (HDGL) shows significantly improved computational efficiency and accuracy by learning graph representations in a high-dimensional space. Despite these advantages, general-purpose computing platforms such as CPUs and GPUs are insufficient for efficiently handling HDGL tasks. In this paper, we propose an accelerator called HDGAS through algorithm and hardware co-design for HDGL. Based on the insight that not all node features in a graph are equally important, we propose to jointly optimize a lightweight filter with the HDGL model to dynamically identify and eliminate less significant node features during runtime. Moreover, we design a specialized system architecture for end-to-end HDGL acceleration, harnessing the proposed dynamic sparsification technique in tandem with the inherent SpMM operations within HDGL. Extensive experiments demonstrate that HDGAS achieves 6.76× (69.31×) speedup and 7.58× (80.12×) energy-efficiency improvements over GNN accelerators (GPU).

Design · DES3. Emerging Models of Computation · Chiplet · SYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH054

GEMM-GS: Accelerating 3D Gaussian Splatting on Tensor Cores with GEMM-Compatible Blending

Haomin Li; Bowen Zhu; Fangxin Liu; Zongwu Wang; Xinran Liang; Li Jiang; Haibing Guan

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Scaling Intelligence for the Physical World +1
Abstract

3D Gaussian Splatting (3DGS) has emerged as a popular representation technique for efficient novel view synthesis due to its explicit Gaussian-based formulation. However, achieving real-time rendering at 90 frames per second (FPS) remains challenging. Prior hardware accelerators have explored custom ASIC or FPGA implementations that deliver higher performance than GPUs, but these designs often rely on advanced technology nodes (e.g., 7nm) and substantial on-chip storage, which significantly increase deployment cost and limit practicality. Meanwhile, existing studies largely overlook the potential of modern GPUs, as the 3DGS pipeline lacks native General Matrix Multiplication (GEMM) operations, which are essential for utilizing Tensor Cores. In this paper, we propose GEMM-GS, a GPU acceleration framework that enables efficient 3DGS execution on Tensor Cores through a GEMM-compatible reformulation of the blending stage. GEMM-GS transforms the original blending process into matrix-multiplication-friendly operations and incorporates a high-performance CUDA kernel with a three-stage double-buffered pipeline to overlap computation and memory accesses. The design is plug-and-play on commodity GPUs, requiring no hardware modifications and thus ensuring practical deployability. Experimental results demonstrate that GEMM-GS achieves a 1.42× speedup over the baseline 3DGS implementation and provides an additional 1.47× average speedup when integrated with existing acceleration techniques.
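
To see why blending can be phrased as a matrix product at all, consider standard front-to-back alpha blending: once each pixel's per-Gaussian weights are known, the final colours are a weights matrix times a colour matrix. The sketch below shows only that generic observation, assuming all pixels share one depth-sorted list of Gaussians. It is not the GEMM-GS reformulation, which the abstract does not spell out, and the names and values are hypothetical.

```python
def blend_weights(alphas):
    """Per-Gaussian weights for one pixel, front to back.

    weight_i = alpha_i * product over j < i of (1 - alpha_j)
    """
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def matmul(a, b):
    """Plain (rows x inner) @ (inner x cols) matrix product on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


# Three Gaussians with RGB colours, already sorted by depth (hypothetical).
colors = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]

# Opacity of each Gaussian as seen by two pixels (hypothetical).
alphas_per_pixel = [[0.5, 0.5, 0.5],
                    [1.0, 0.0, 0.0]]

weights = [blend_weights(alphas) for alphas in alphas_per_pixel]  # pixels x Gaussians
image = matmul(weights, colors)                                   # pixels x RGB
```

The sequential part is the transmittance recurrence inside `blend_weights`; the final accumulation is already a dense product and is the kind of operation that maps onto matrix hardware.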

Systems · SYS1. Autonomous Systems (Automotive, Robotics, Drones) · AI · Quantum
RESEARCH068

A Differentiable Approach to Task Graph Partitioning: A Case Study in RTL Simulation

Aditya Das Sarma; Wan Luan Lee; Shui Jiang; Boyang Zhang; Tsung-Wei Huang

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

Graph partitioning is essential for many EDA applications that leverage task graph parallelism for faster execution. For instance, RTL simulators partition an input RTL design into dependent tasks and schedule them across threads. However, existing partitioners are largely limited to general-purpose heuristics that overlook real threading costs, resulting in suboptimal performance. Consequently, we introduce DiffPart, a differentiable task graph partitioning framework that automatically learns high-quality partitions under real operating conditions. Applied to RTL simulation, DiffPart improves the partitioning quality of the state-of-the-art Verilator, delivering 1.22x to 55.25x faster simulation across diverse designs.

Design · DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures · EDA7-II. Physical Design and Verification
RESEARCH073

BLADE: Bi-Level Bayesian Optimization for Metal-Density-Constrained Multi-Layer Package Power/Ground Plane Synthesis

Siyuan Liang; Shanyi Li; Leilei Jin; Yuan Pu; Yushen Zhang; Zhen Zhuang; Kai-Yuan Chao; Ulf Schlichtmann; Tsung-Yi Ho

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

In advanced packages, large power/ground (P/G) planes are essential for power integrity but prone to warpage. While a uniform metal-density constraint combats warpage, mandatory degassing holes complicate adherence and degrade IR-drop. We propose BLADE, the first P/G plane synthesis methodology that enforces this constraint while optimizing power integrity. BLADE expands P/G planes from skeletons under metal density control to ensure connectivity, naturally leaving holes for degassing. Guided by a fast IR-drop evaluator, our bi-level Bayesian optimization framework efficiently navigates the design space. Experiments on six SiP cases demonstrate that BLADE satisfies the metal-density constraint and achieves superior IR-drop performance.

EDA · EDA4. Power Analysis and Optimization · AI2-I. AI/ML Algorithms and Models · Design
RESEARCH080

Beyond Flat Netlist: Hierarchical Graph Representation Learning for Scalable Analysis of Sequential Circuits

Jingyi Zhou; Zhengyuan Shi; Jiaying Zhu; Ziyang Zheng; Qiang Xu

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

Circuit Representation Learning (CRL) offers a powerful paradigm to guide and optimize core Electronic Design Automation (EDA) tasks, but its practical adoption is hindered by the immense scale of industrial netlists and a failure to explicitly model register-level temporal dynamics. To overcome these barriers, we introduce DeepSeq3, a novel hierarchical framework that abstracts circuits into a two-level representation: fine-grained combinational subgraphs partitioned by flip-flops (FFs), and a high-level Super-Node Graph (SNG) that models the register-transfer structure. A dual Graph Neural Network (GNN) architecture learns representations at both levels, capturing local Boolean logic and global state transitions. Crucially, we introduce a state-centric pre-training scheme that predicts the reachability between FF states, endowing the model with a deep understanding of temporal behavior. Demonstrated on large-scale benchmarks, DeepSeq3's approach yields superior scalability and richer representations, reducing bounded model checking (BMC) solving time by 18% while guaranteeing correctness. Our code is available at https://anonymous.4open.science/r/DeepSeq3-6760

AI · AI1. AI/ML Frontiers for Hardware Design · DES5. Emerging Device and Interconnect Technologies
RESEARCH083

DataFlowGen: An MLIR-Based Compiler for Efficient Dataflow Accelerator Generation

Jiangnan Li; Kaixiang Zhu; Zhengyi Zhang; Lingli Wang

Date:Monday, July 27 Location:Mtg Room 201B Session:Advances in 3D/2.5D Physical Design +1
Abstract

High-level synthesis (HLS) has been widely adopted to map high-level languages to hardware, significantly improving hardware design productivity. Existing HLS tools have gaps in generating custom hardware due to data dependencies and differences in intermediate representation (IR) levels. Separating the algorithmic behaviour from the microarchitecture description allows an appropriate IR to be created for each, which is an effective way to reduce development costs and achieve competitive hardware designs. In this paper, we propose DataFlowGen, an open-source framework built on MLIR for efficient dataflow accelerator generation. DataFlowGen explicitly introduces a two-level IR to perform operations at suitable abstraction levels, capturing dataflow characteristics and multi-level hierarchy. Leveraging these representations, we develop an automated optimizer that outlines the application kernel and performs dataflow transformations to derive a hardware-oriented control dataflow graph (H-CDFG), which enables a concise representation of hardware architectures and improves their resource efficiency. Experiments show that DataFlowGen achieves a performance improvement with significant resource reduction compared to state-of-the-art HLS tools. The results show that our optimizer effectively leverages the expressive power of the IRs, thus capturing kernel parallelism.

EDA · Chiplet · EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package · SYS4. Embedded System Design Tools and Methodologies
RESEARCH094

KL-MoE: A Hierarchical MoE Pruning Framework Exploiting KL Divergence

Zeyu Zhu; Gang Li; Minnan Pei; Zitao Mo; Peihuan Ni; Peisong Wang; Tielong Liu; Jian Cheng

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Mixture of Experts (MoE) improves LLM expressiveness without proportional compute, yet its large parameter footprint hinders deployment in storage- and latency-constrained environments. Existing pruning methods attempt to reduce redundancy but often suffer severe accuracy loss at high pruning ratios due to limited pruning metrics. To tackle this challenge, we propose KL-MoE, a hierarchical pruning framework based on a KL-divergence scoring mechanism. KL-MoE first clusters experts by similarity. We then develop a KL-based Scoring mechanism that retains the most representative expert within each cluster by jointly capturing local and global functionality. In addition, we introduce the Linear Restore strategy, a lightweight mapping strategy that refines the outputs of the pruned MoE layer to approximate the original layer, thus recovering the accuracy of the pruned models. Extensive experiments across multiple models and tasks demonstrate that KL-MoE yields average gains of 12.89%, 7.24% and 6.14% over the state-of-the-art methods O-Prune, MoE-I² and HC-SMoE, respectively, while delivering a 1.31x inference speedup. Our code is available at https://anonymous.4open.science/r/KL-MoE-a.
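
A KL-based score for choosing which expert to keep in a cluster could look like the following. This is a minimal sketch under a simplifying assumption: each expert is scored by the KL divergence between the original layer's output distribution and that expert's own output distribution on the same input, and the lowest score wins. The abstract's combination of local and global functionality is not specified, so it is not modelled here, and every name and number is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pick_representative(reference_logits, expert_logits):
    """Return (best_expert, scores); lower KL to the reference is better."""
    reference = softmax(reference_logits)
    scores = {name: kl_divergence(reference, softmax(logits))
              for name, logits in expert_logits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical outputs on one calibration input for a two-expert cluster.
reference_logits = [2.0, 1.0, 0.0]
cluster = {
    "expert_0": [2.0, 1.0, 0.1],   # close to the original layer
    "expert_1": [0.0, 1.0, 2.0],   # ranks the classes in reverse
}

best, scores = pick_representative(reference_logits, cluster)
```

In practice a score like this would be averaged over many calibration inputs rather than computed on a single one.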

AI · AI2-I. AI/ML Algorithms and Models · Design
RESEARCH108

CANNON: A CXL-Based Near-Memory Processing Architecture for Approximate Nearest Neighbor Search on Real Hardware

Ganghyun Kim; Hyeonjun An; Hyunjoon Kim; Rathijit Sen; Kangkyu Park; Jinho Baek; Kwangsik Shin; Youngpyo Joo; Kwanghyun Park

Date:Monday, July 27 Location:Mtg Room 202C Session:Intelligent Fetch & Match Architectures +1
Abstract

Approximate Nearest Neighbor (ANN) search is a foundational primitive for AI applications such as Retrieval-Augmented Generation (RAG). CPU and GPU-based solutions face scalability bottlenecks due to limited local memory, while a multi-tier architecture using SSDs introduces high latency from coarse-grained I/O, mismatched with fine-grained data access patterns inherent to ANN search. We present CANNON (A CXL-Based Near-Memory Processing Architecture for Approximate Nearest Neighbor Search on Real Hardware), a fully offloaded Near-Memory Processing (NMP) architecture implemented on real CXL hardware. CANNON transforms the ANN search pipeline into a fine-grained, deeply pipelined dataflow architecture to maximize throughput, and introduces asynchronous hashing, a speculative execution mechanism that hides hash-check latency to prevent pipeline stalls. Evaluated on large-scale vector datasets, CANNON achieves up to two orders of magnitude performance improvement over state-of-the-art CPU and GPU baselines.

Design · DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems · Chiplet · SYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH121

ExQuant: Global Expert Ranking–guided Mixed-Precision Quantization for Efficient MoE Inference

Chenyang Guan; Fangxin Liu; Junjie Wang; Ning Yang; Haibing Guan

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Mixture-of-Experts (MoE) large language models (LLMs) leverage dynamic routing and sparse activation to improve efficiency and scalability, achieving high performance with reduced computational cost. However, their complex architecture and large memory footprint pose significant challenges for deployment, particularly on resource-constrained hardware. Post-training quantization (PTQ) is a widely used technique to reduce model size and memory usage. Existing PTQ approaches for MoE models are predominantly layer-wise and task-agnostic, optimizing reconstruction error independently within each layer. As a result, they ignore cross-layer differences in expert importance and fail to leverage task-specific signals, causing pivotal experts to be over-compressed, rarely activated experts to be over-provisioned, and overall accuracy to degrade. To overcome these limitations, we propose ExQuant, a PTQ framework for MoE LLMs that enables global, expert-level mixed-precision quantization. ExQuant first constructs a Globally Comparable Expert Importance Metric by integrating expert routing frequency and post-ablation performance. Based on this metric, it assigns tiered bit-widths to experts and employs a precision-aware load balancing strategy to dynamically schedule computation across processing elements, fully exploiting slack between low- and high-precision workloads. Experiments demonstrate that ExQuant significantly reduces memory footprint, improves inference efficiency, and achieves 2.87–5.93% accuracy improvement over existing MoE quantization methods. These results validate the effectiveness of global, expert-level mixed-precision quantization for efficient and accurate deployment of MoE LLMs.

AI · AI2-I. AI/ML Algorithms and Models · Design
RESEARCH124

FlashFPS: Efficient Farthest Point Sampling for Large-Scale Point Clouds via Pruning and Caching

Yuzhe Fu; Hancheng Ye; Cong Guo; Junyao Zhang; Qinsi Wang; Yueqian Lin; Changchun Zhou; Hai (Helen) Li; Yiran Chen

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Point-based Neural Networks (PNNs) have become a key approach for point cloud processing. However, a core operation in these models, Farthest Point Sampling (FPS), often introduces significant inference latency, especially for large-scale processing. Despite existing CUDA- and hardware-level optimizations, FPS remains a major bottleneck due to exhaustive computations across multiple network layers in PNNs, which hinders scalability. Through systematic analysis, we identify three substantial redundancies in FPS: unnecessary full-cloud computations, redundant late-stage iterations, and predictable inter-layer outputs that make later FPS computations avoidable. To address these, we propose FlashFPS, a hardware-agnostic, plug-and-play framework for FPS acceleration, composed of FPS-Prune and FPS-Cache. FPS-Prune introduces candidate pruning and iteration pruning to reduce redundant computations in FPS while preserving sampling quality, and FPS-Cache eliminates layer-wise redundancy via cache-and-reuse. Integrated into existing CUDA libraries and state-of-the-art PNN accelerators, FlashFPS achieves 5.16× speedup over the standard CUDA baseline on GPU and 2.69× on PNN accelerators, with negligible accuracy loss, enabling efficient and scalable PNN inference.
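
For readers unfamiliar with the operation being accelerated, the following is a minimal sketch of the standard, unoptimized Farthest Point Sampling loop. It is the textbook baseline only, not FPS-Prune or FPS-Cache, and the sample points are made up. Each iteration rescans the whole cloud, which is the full-cloud cost the abstract refers to.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen by greedy farthest point sampling.

    min_dist[i] holds the squared distance from point i to its nearest
    already-selected point; each step picks the point maximizing it.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(points)")

    selected = [start]
    min_dist = [squared_distance(p, points[start]) for p in points]

    for _ in range(k - 1):
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        # Full-cloud update: every point is compared with the new sample.
        for i, p in enumerate(points):
            d = squared_distance(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Four collinear points (hypothetical data), sampling three of them.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
sample = farthest_point_sampling(cloud, k=3)
```

The cost is proportional to the number of points times the number of samples, and a PNN repeats it at several layers, which is why pruning the scan and reusing earlier results are natural targets.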

AI · AI2-II. AI/ML Algorithms and Models · Systems
RESEARCH127

Optimizing Dynamic-Shape Neural Networks: A Unified Approach to Adaptive Tuning

Zheng Zhang; Donglin Yang; Xiaobo Zhou; Yili Gong; Dazhao Cheng

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

Dynamic-shape neural networks are essential for handling variable-length data in tasks, but their optimization is challenging. Existing methods, such as hand-tuned libraries or static kernel compilation, are inefficient due to a lack of flexibility or the high overhead from repeated compilations. We propose DyTuner, an adaptive operator-level tuning framework for dynamic-shape neural networks. DyTuner uses a unified, system-compatible interface and an adaptive algorithm to efficiently tune kernels for varying shapes, significantly improving end-to-end performance and reducing redundant compilations. Experiments demonstrate that DyTuner effectively optimizes dynamic-shape workloads, achieving speedups of up to 1.41x and 1.64x over state-of-the-art solutions, BladeDISC and TorchInductor.

AI · AI5-II. AI/ML System and Platform Design · Systems
RESEARCH130

C2C-Explorer: An Exploration Framework for Chip-to-Chip Interconnect Architectures in LLM Cloud Computing Systems

Jiayi Li; Di Wu; Qingxu Li; Hongxiao Zhao; Jiaqi Yang; Anjunyi Fan; Wenbin Zhang; Boqiang Wu; Shuting Liu; Shifeng Fang; Jianbo Dong; Dimin Niu; Bonan Yan

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

The scaling-up of large language models (LLMs) requires computing systems with multi-processor-chip architectures, elevating the importance of chip-to-chip (C2C) communication. However, designing efficient C2C hardware architectures for LLM workloads faces three key challenges: generating realistic LLM-specific C2C traffic, accurately simulating hardware-level communication at scale, and efficiently exploring the exponentially large C2C design space. We propose C2C-Explorer, an adaptive Bayesian DSE framework that integrates an LLM-workload-driven traffic generator, a scalable interconnect simulator (switch/full-mesh, up to 512 chips), and a metric-guided evaluator into a workload-to-hardware optimization pipeline, enabling systematic C2C architectural co-design under realistic LLM workloads. Validated against FPGA-based C2C prototypes, the C2C simulator achieves 2.46–8.23% end-to-end timing error across diverse traffic patterns. Its hybrid cycle & event model further accelerates large-scale simulation by up to 7.8× over a pure cycle-accurate baseline. Applied to a 32-XPU DeepSeek-R1-671B inference workload, C2C-Explorer identifies configurations that improve goodput by 44.1% and reduce memory by 98.4%. C2C-Explorer is open-source and available at https://anonymous.4open.science/r/C2C-Explorer.

AI · AI5-I. AI/ML System and Platform Design · EDA7-II. Physical Design and Verification · Design
RESEARCH132

LegoTherm: Modular and Reusable Thermal Model for Chiplet-Based Heterogeneous Integration

Pengju Chen; Dan Niu; Depeng Xie; Gang Wang; Chen Wu; Wei Xing; Lei He

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

Thermal management critically affects the reliability of chiplet-based heterogeneous integrated circuits in the post-Moore era. The design process demands hundreds of thermal simulations across varying power distributions, cooling conditions, and structural configurations. Conventional approaches based on the finite element method (FEM) and compact thermal models (CTMs) lack the flexibility for efficient multi-scenario evaluations and incur prohibitive computational cost when modeling complex vertical stacks with cross-scale interconnects. We propose LegoTherm, a modular thermal modeling framework that decomposes chiplet-based systems into reusable reduced-order components. The framework exploits the inherent hierarchical structure of heterogeneous integration to construct high-fidelity 3D meshes capturing micro-interconnection details through modular discretization, then generates reduced-order modules by aggregating thermally coupled interface ports. This approach preserves critical hotspot accuracy while enabling rapid reassembly for different design scenarios. Evaluated on industrial-scale 2.5D and 3D packaging benchmarks, LegoTherm achieves up to 10.39X speedup compared to COMSOL while maintaining mean relative error below 0.40%. The framework reduces thermal design iteration cycles from weeks to half a day, addressing a critical bottleneck in modern IC design.

EDA · EDA4. Power Analysis and Optimization · AI2-I. AI/ML Algorithms and Models · Design
RESEARCH138

PFGEMM: Fully Exploiting the Parallelism of SpGEMM With Element-Wise Out-of-Order Execution Flow and Prefix Decomposition of Indices

Yufeng Gao; Chen Zhang; Shuyao Cheng; Di Huang; Zhipeng Zhao; Xuehai Zhou

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Sparse matrix-matrix multiplication (SpGEMM) is fundamental to various applications, yet conventional processors suffer from poor parallelism due to bandwidth limitations and pipeline stalls. Therefore, we propose PFGEMM, an accelerator that exploits hardware parallelism with three key methods. First, we introduce a novel sparse format named prefix-CSR (PFCSR), which merges redundant prefixes of nearby indices to reduce the amount of I/O. Second, with element-wise out-of-order scheduling, PFGEMM achieves locally optimal data reuse to avoid pipeline stalls. Finally, PFGEMM's cache employs a frequency-aware replacement policy to extend data residency. Overall, experiments demonstrate a geometric-mean 1.8x off-chip traffic reduction and 2.1x speedup over state-of-the-art accelerators.
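
The idea of merging redundant prefixes of nearby indices can be illustrated by splitting each column index into high-order bits (the prefix) and low-order bits (the suffix), and storing a prefix once per run of consecutive indices that share it. The actual PFCSR layout is not described in the abstract, so the encoding below, including the names `prefix_encode`, `prefix_decode`, and the `suffix_bits` parameter, is a hypothetical sketch of the general idea only.

```python
def prefix_encode(indices, suffix_bits=4):
    """Group sorted indices into (prefix, [suffixes]) runs.

    prefix = index >> suffix_bits, suffix = low suffix_bits of the index.
    Consecutive indices with the same prefix share one stored prefix.
    """
    mask = (1 << suffix_bits) - 1
    groups = []
    for index in indices:
        prefix, suffix = index >> suffix_bits, index & mask
        if groups and groups[-1][0] == prefix:
            groups[-1][1].append(suffix)
        else:
            groups.append((prefix, [suffix]))
    return groups


def prefix_decode(groups, suffix_bits=4):
    """Inverse of prefix_encode: rebuild the original index list."""
    return [(prefix << suffix_bits) | suffix
            for prefix, suffixes in groups
            for suffix in suffixes]


# Column indices of one sparse row (hypothetical), sorted ascending.
row_indices = [16, 17, 19, 35, 36, 100]

encoded = prefix_encode(row_indices)
decoded = prefix_decode(encoded)
```

Here six full-width indices become three prefixes plus six 4-bit suffixes, so clustered nonzeros take fewer bits to transfer, while scattered nonzeros gain little.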

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH142

PRO-V-R1: Reasoning Enhanced Programming Agent for RTL Verification

Yujie Zhao; Zhijing Wu; Boqin Yuan; Zhongming Yu; Hejia Zhang; Wentao Ni; Chia-Tung Ho; Haoxing Ren; Jishen Zhao

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

Register-Transfer Level (RTL) verification is a primary bottleneck, consuming 60–70% of development time. While Large Language Models (LLMs) show promise for RTL automation, their performance and research focus have overwhelmingly centered on RTL generation rather than verification. Current methods for RTL verification rely on large-scale proprietary models (e.g., GPT-4o) to generate Python-based functional references, incurring a high cost and raising data-privacy risks. To date, an end-to-end open-source solution for autonomous verification remains absent. We introduce PRO-V-R1, the first trainable open-source agentic framework for autonomous RTL verification. Our contributions are threefold: (1) we design PRO-V sys, a modular agentic system that couples LLM-based reasoning with programmatic tool use for RTL verification; (2) we establish a data construction pipeline that leverages existing RTL datasets to build simulation-validated, expert-level trajectories tailored for supervised fine-tuning (SFT) of RTL verification agents; and (3) we implement an efficient reinforcement learning (RL) algorithm that uses verification-specific rewards derived from program-tool feedback to optimize the end-to-end verification workflow. Our empirical evaluation demonstrates that PRO-V-R1 achieves a 57.7% functional correctness rate and 34.0% in robust fault detection, significantly outperforming the base model's 25.7% and 21.8% (respectively) from the state-of-the-art (SOTA) automatic verification system. This configuration also outperforms large-scale proprietary LLMs in functional correctness and shows comparable robustness for fault detection.

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH144

Syndrome Extraction Circuits with Near-Optimal Depths for Practical Quantum Error Correcting Code Families

Daniel Bochen Tan; J. Pablo Bonilla Ataides; Varun Menon; Jin Ming Koh; Andrei Diaconu; Mikhail Lukin

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Charting the Architecture of Tomorrow’s Fault Tolerant Quantum Machines by Harmonizing Codes to Systems +1
Abstract

Fault-tolerant quantum computing relies on low-depth and high-fidelity syndrome extraction in quantum error correcting codes. We propose a scheduling strategy for syndrome extraction circuits in the broad class of quasi-abelian lifted-product codes, encompassing both hypergraph product and bivariate bicycle codes. Our approach constructs syndrome extraction circuits with CNOT depths no more than one layer above the fundamental lower bound, frequently achieving optimal depth. A pipelined variant further reduces the average depth per round, and the strategy generalizes naturally to higher-dimensional product codes. This scheduling framework enables computational speedups for quantum computing architectures built on these codes, all while preserving high-fidelity error correction.

DesignQuantumDES6. Quantum ComputingDES3. Emerging Models of Computation +1 more
RESEARCH146

Freebit: Unleashing the Performance Potential of Low-Bit LLMs Through PIM

Jiaxian Chen; Haoran Duan; Zhenxuan Ou; Qin Chen; Shangyu Wu; Kecheng Huang; Rui Mao; Yi Wang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Move Less, Compute More: Memory-Centric Architectures for AI +1
Abstract

Low-bit large language models (LLMs) use quantization to compress weights to 1/2/4 bits, significantly shrinking model size while preserving accuracy. Existing work leverages the reduced precision to cut computation. However, roofline analysis on the NVIDIA A100 shows limited inference speedup even with a 6.7× reduction in computation. This limitation is caused by insufficient memory bandwidth. Processing-in-memory (PIM) provides a promising solution for the memory bottleneck by integrating compute near data. To support efficient low-bit LLM inference, PIM should be carefully designed with joint software and hardware optimizations. This paper presents FreeBit, a PIM-based architecture that unleashes the performance potential of low-bit LLMs. The objective is to capture the low-bit nature to better exploit PIM through hardware-software co-design. At the hardware level, a lookup-table (LUT)-centric architecture is designed to support quantized computation and minimize redundant computation. A sparsity-aware memory optimization is introduced to optimize memory access and leverage PIM bandwidth. At the software level, a static-dynamic decoupled scheduling strategy is presented to exploit PIM parallelism. Experimental results show that FreeBit effectively reduces redundant computation and memory access, and delivers notable performance improvements over CPUs, GPUs, and prior PIM baselines.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIQuantum
RESEARCH149

TRIDENT: An End-to-End Streaming Accelerator for TriSpGEMM

Yiyue Hu; an Hu; Hui Guo; Luchen Zhou; Yongzhang Nie; Runzhang Mao; Gaoyang Zhao; Xia Zhao; Yongwen Wang

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

Triple Sparse General Matrix-Matrix Multiplication (TriSpGEMM) is a major performance bottleneck in scientific computing and graph analytics. Existing domain-specific accelerators support only Sparse General Matrix-Matrix Multiplication (SpGEMM). Performing TriSpGEMM by sequencing two SpGEMM kernels introduces a global synchronization barrier, forces intermediate data to be written to and read from off-chip memory, and leads to frequent pipeline stalls, all of which significantly degrade performance. A natural solution is to design a fused architecture that executes TriSpGEMM directly, but such a design must overcome three key challenges, i.e., pipeline overflow caused by highly irregular and bursty sparse dataflows, locality inefficiency stemming from the trade-off between input reuse and output aggregation, and severe head-of-line blocking due to variable memory access latencies. In this work, we present TRIDENT, a transient-data-centric accelerator for fused TriSpGEMM processing. TRIDENT integrates a hierarchical flow-control mechanism, a hybrid dataflow co-designed with a windowed sparse format, and an out-of-order decoupled execution scheme to ensure pipeline stability, locality efficiency, and latency tolerance. Comprehensive evaluations demonstrate that TRIDENT achieves an average of 3.56x performance speedup over state-of-the-art SpGEMM accelerators, and up to 49.6x and 6.3x improvements over CPU and GPU baselines, respectively.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH150

RISCSmith: Finding RISC-V CPU Bugs via Rich Instruction Construction and On-the-Fly Differential Analysis

Xudong Zhang; Yuanliang Chen; Zehong Yu; Zhen Yan; Fuchen Ma; Dalong Shi; Yu Jiang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Robust and Trusted AI Systems: Attacks, Defenses, and Hardware Reliability +1
Abstract

Processors are the foundation of all computing systems, yet complex RTL and microarchitectural front ends make logic defects difficult to eliminate and extremely costly to fix after tape-out. Although RISC-V fuzzing has revealed many bugs in practice, existing approaches remain limited: they rely on manually maintained instruction models and generate insufficiently rich CPU test programs. Moreover, they lack precise register and memory monitoring, and either perform high-overhead per-instruction differential checks or coarse end-of-program comparisons -- both making bug analysis and localization inefficient. In this work, we propose RISCSmith (30K+ Rust LoC), a fuzzing framework aimed at detecting bugs in RISC-V CPUs. First, RISCSmith automatically builds rich instruction models by parsing the RISC-V UnifiedDB to extract structured instruction metadata, resolve inconsistencies, and generate strongly typed models capturing operand roles and runtime semantics. Second, RISCSmith instruments RISC-V implementations with lightweight logging to collect per-instruction register, memory, and exception data, performing on-the-fly differential analysis to pinpoint the first divergence, classify its cause, and minimize the reproducing test case. We implemented and evaluated RISCSmith on six widely used RISC-V CPUs. In total, it uncovered 18 previously unknown bugs. Compared to state-of-the-art CPU fuzzers like Cascade and RISCV-DV, RISCSmith detects 3.5x and 2.6x more bugs and covers 37% and 61% more branches, respectively.

SecuritySEC1. AI/ML Security/PrivacyAIQuantum
RESEARCH155

Accelcim: Systematic Dataflow Exploration for SRAM Compute-in-Memory Accelerator

Chenhao Xue; Yukun Wang; An Guo; Yuhui Shi; Jinwei Zhou; Xiping Dong; Yihan Yin; Yuanpeng Zhang; Tianyu Jia; Wei Gao; Qiang Wu; Xin Si; Jun Yang; Guangyu Sun

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Move Less, Compute More: Memory-Centric Architectures for AI +1
Abstract

SRAM-based compute-in-memory (CIM) offers high computational density and energy efficiency for deep neural network (DNN) accelerators, but its limited capacity causes on/off-chip data movement overhead for large DNN models. Existing CIM accelerator studies typically assume that DNN models fit entirely on-chip, leaving efficient dataflow design largely untapped. This paper introduces AccelCIM, a systematic dataflow exploration framework for SRAM CIM accelerator, which addresses two key limitations of prior work. (1) It formulates a systematic dataflow design space spanning CIM macro configurations and macro-array organizations. (2) It introduces rigorous design evaluation using cycle-accurate architectural simulation and post-layout PPA analysis. We conduct an extensive design space exploration and apply AccelCIM to representative LLM applications, providing practical insights for the principled design of CIM accelerators.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIQuantum
RESEARCH158

Neural Domain Decomposition for Scalable Multi-Physics: Chip-Scale Thermal-Stress Analysis

Min-Chul Park; Ji-Hye Lee; Sungyeop Lee; Giyong Hong; Seokki Lee; In Huh; Hong-hyun Park; Hyunjae Jang; Changwook Jeong; Young-Gu Kim; Dae Sin Kim

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

Large neural surrogate models accelerate scientific simulation but still face scalability limits from model size and memory. We present FP-DDM (Foundation-model-based Physics-guided adaptation for the Domain Decomposition Method), a scalable framework for large-scale physics analysis. FP-DDM divides the domain into overlapping subdomains and enforces interface consistency by explicitly optimizing boundary values on overlaps via automatic differentiation, enabling fast alignment. Specifically, we propose a foundation architecture that effectively learns multi-physics and transfers to new domains. The combination of foundation-model adaptation and physics-guided test-time adaptation enables generalization to new physics and unseen domains without labels. On thermal and stress problems, FP-DDM achieves numerical-solver-level accuracy, scales to 10-billion-cell domains, and supports layout design optimization, establishing its potential as a multi-physics platform for next-generation System–Technology Co-Optimization.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH160

ROSA: Robust and Energy-Efficient Microring-Based Optical Neural Networks via Optical Shift–and-Add and Layer-Wise Hybrid Mapping

Huifan Zhang; Yun Hu; Caizhi Sheng; Yurui Qu; Pingqiang Zhou

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

This work presents ROSA, a microring-based optical neural network architecture that improves robustness and energy efficiency using an optical shift-and-add (OSA) module and a layer-wise hybrid mapping strategy. It introduces a noise-aware voltage-to-weight model considering DAC and thermal variations, and a workload-aware framework to co-optimize MRR array size and layer-wise dataflow. Optimized arrays reduce aggregated relative energy-delay-product (EDP) by 64% and 26% compared to DEAP-CNNs and a general compact array, respectively. OSA further contributes a 29% EDP reduction. The proposed hybrid mapping strategy improves CIFAR-10 accuracy by 8.3% over weight-stationary mapping while achieving an average 54.7% lower EDP than DEAP-CNNs.

AIAI5-I. AI/ML System and Platform Design
RESEARCH163

Per Flow Asynchronous Traffic Shaping in Time Sensitive Networking

Wenxue Wu; Tong Zhang; Wanchun Jiang; Wufan Wang; Zhen Li; Fengyuan Ren

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

Asynchronous Traffic Shaping (ATS) bounds end-to-end delay in Time-Sensitive Networking (TSN) without relying on global synchronization, but its per-group queuing introduces inter-flow interference that causes large delay jitter for time-critical flows. To address this limitation, we present PFATS, a novel per-flow asynchronous traffic shaping architecture that integrates a per-flow frame eligibility time calculator, per-flow queues, and a Synchronous Shift Register Matrix (SSRM) scheduler. We implement PFATS on an FPGA-based TSN switch; it supports 192 flows while incurring only 7.14% additional on-chip memory usage compared to conventional ATS. In realistic TSN scenarios, PFATS achieves microsecond-level end-to-end delay and bounds delay jitter within 5 μs, while maintaining high throughput.
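
To make the notion of a per-flow eligibility time concrete, here is a simplified sketch of a token-bucket eligibility computation in the style commonly used for ATS, with the group eligibility time and maximum residence time check left out. The class `FlowShaper`, its parameters, and the units (bits and microseconds) are hypothetical; the abstract does not describe how the PFATS calculator is actually built.

```python
class FlowShaper:
    """Token-bucket state for a single flow (simplified, illustrative only)."""

    def __init__(self, rate_bits_per_us, burst_bits):
        self.rate = rate_bits_per_us
        self.burst = burst_bits
        # Start with a full bucket at t = 0: it was "empty" burst/rate ago.
        self.bucket_empty_time = -burst_bits / rate_bits_per_us

    def eligibility_time(self, arrival_us, frame_bits):
        """Return the earliest time this frame may be forwarded."""
        length_recovery = frame_bits / self.rate   # time to earn the frame's tokens
        empty_to_full = self.burst / self.rate     # time to refill an empty bucket
        scheduler_eligibility = self.bucket_empty_time + length_recovery
        bucket_full_time = self.bucket_empty_time + empty_to_full
        eligibility = max(arrival_us, scheduler_eligibility)
        if eligibility < bucket_full_time:
            # Bucket was still filling: account only for this frame's tokens.
            self.bucket_empty_time = scheduler_eligibility
        else:
            # Bucket had been full for a while: skip the idle period.
            self.bucket_empty_time = (
                scheduler_eligibility + eligibility - bucket_full_time
            )
        return eligibility


shaper = FlowShaper(rate_bits_per_us=100, burst_bits=12000)
arrivals_us = [0, 10, 500]
times = [shaper.eligibility_time(t, frame_bits=12000) for t in arrivals_us]
```

With one such state per flow, a burst on one flow delays only that flow's own frames, which is the intuition behind shaping per flow rather than per group.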

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH177

Finding Behavior Bugs in IoT Messaging Protocols with Automated Oracle Generation and Monitoring

Qingpeng Du; Zhengxiong Luo; Yuanliang Chen; Fuchen Ma; Yu Jiang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

IoT messaging protocols face critical security risks from behavior bugs - specification violations that enable unauthorized data access and device compromise. Detecting such bugs requires comprehensive understanding of protocol specifications and communication semantics. This paper introduces ARES, a fully automated framework that extracts executable behavior oracles from protocol specifications for real-time compliance monitoring using LLM-driven behavior filtering. ARES evaluates six widely-used IoT protocol implementations, identifying 25 new bugs with 87.5% precision, including 21 behavior bugs. Of these, 18 have been confirmed or fixed and 10 CVEs assigned due to their severity.

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH179

LLMVA: LLMs Empowered Verilog-A Iterative for MRAM Design Technology Co-Optimization

Jiongzhe Su; Quanhai Zhu; Jie Zhou; Xiangyi Meng; Haoran Du; Huanghui Wang; Yusheng Yin; Bo Liu; Yan Cui; Hugo Jiang; Hao Cai

Date:Tuesday, July 28 Location:Mtg Room 101B Session:AI Graces the Cell Party: Transformers, LLMs and Zero Drama Extraction +1
Abstract

Emerging device characteristics modeling is indispensable for potential circuit design integration. In the modeling process, the Verilog-A language is employed, leveraging the physical parameters and test results of the device. This work demonstrates the large language model (LLM) empowered Verilog-A iterative (LLMVA) for the modeling process of spin-transfer-torque magnetic-tunnel-junction (STT-MTJ). The LLM enables quick interaction with device-level characteristics and circuit-level indicators, and realizes design technology co-optimization (DTCO). At the macro level, a 4-Mb 28-nm magnetic-random-access-memory (MRAM) macro is designed and then taped out. To the best of the authors' knowledge, this is the first work to use LLM for device modeling and MRAM DTCO. The iterative test results show that the deviation between Front-test/Post-test results of MTJ does not exceed 9.2%, proving the effectiveness of the proposed LLMVA modeling process. With the proposed LLMVA agent, we have decreased our designer and time cost by about 50%.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH194

TRACE: Learning to Compute on Circuit Graphs

Ziyang Zheng; Jiaying Zhu; Jingyi Zhou; Qiang Xu

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Learning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. This flawed assumption, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.
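
One way to picture a "function shift" is with signal probabilities on a toy circuit. The sketch below is an illustration under assumptions, not TRACE's training target: it uses the probability of a node being 1 as a stand-in for the "function", and the circuit, helper name `true_probability`, and variable names are hypothetical.

```python
from itertools import product


def true_probability(fn, n_inputs):
    """Exact probability that fn outputs 1 under uniform, independent primary inputs."""
    ones = sum(fn(*bits) for bits in product((0, 1), repeat=n_inputs))
    return ones / 2 ** n_inputs


# Toy circuit with reconvergent fanout: y = a AND (a OR b).
def g(a, b):
    return a | b


def y(a, b):
    return a & g(a, b)


p_a = true_probability(lambda a, b: a, 2)
p_g = true_probability(g, 2)

# Local approximation at the AND gate: treat its two inputs as independent.
local_approx = p_a * p_g

# True global behavior, obtained here by exhaustive enumeration.
global_truth = true_probability(y, 2)

# The residual a model would be asked to predict in this simplified picture.
function_shift = global_truth - local_approx
```

Because the signal `a` reaches the AND gate along two paths, the independence assumption underestimates the output probability, and the shift captures exactly that correlation effect.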

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH200

Ternainfer: Pushing the Compression Boundary of Ternary LLMs Towards 1.58-Bit GPU Inference

Jie Gu; Edwin Hsing-Mean Sha; Longshan Xu; Yuhong Song; Yunfan Chi; Qingfeng Zhuge

Date:Monday, July 27 Location:Mtg Room 101B Session:Computing Without Waste: Lean and Scalable Neural Architectures +1
Abstract

Recent advances in Microsoft's ternary BitNet, restricting weights to {-1, 0, +1}, have highlighted the potential of low-bit large language models (LLMs). Ideally, each ternary weight requires only log_2(3), approximately 1.58 bits. However, existing inference systems still use redundant bit representations for computational efficiency. To push the compression boundary and maintain efficiency, we propose TernaInfer, a GPU inference framework for ternary LLMs that integrates lossless compression with ternary-optimized sparse matrix multiplication. To our knowledge, TernaInfer is the first work to realize 1.58 bits per weight on the ternary BitNet while providing 1.53 times throughput improvement over half-precision GPU inference.
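
The gap between a 2-bit encoding and the log_2(3) bound can be seen with simple base-3 packing: since 3^5 = 243 fits in one byte, five ternary weights can share 8 bits, or 1.6 bits per weight. The sketch below shows only this generic packing; it is not TernaInfer's compression scheme, which the abstract does not detail, and the function names are hypothetical.

```python
import math

TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five ternary digits fit in one byte


def pack_trits(weights):
    """Pack weights in {-1, 0, +1} into bytes as base-3 numbers, five per byte."""
    out = bytearray()
    for start in range(0, len(weights), TRITS_PER_BYTE):
        chunk = weights[start:start + TRITS_PER_BYTE]
        value = 0
        for w in reversed(chunk):
            value = value * 3 + (w + 1)  # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)
    return bytes(out)


def unpack_trits(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    weights = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights[:count]


weights = [-1, 0, 1, 1, 0, -1, 1]
packed = pack_trits(weights)
restored = unpack_trits(packed, len(weights))

ideal_bits_per_weight = math.log2(3)          # information-theoretic bound
packed_bits_per_weight = 8 / TRITS_PER_BYTE   # cost of this simple packing
```

Plain base-3 packing stops at 1.6 bits per weight; closing the remaining gap to about 1.58 bits requires a denser code, which is the regime the abstract targets.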

AIAI3-I. AI/ML Application and InfrastructureChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH208

S2CIM: A Secure-Computation and Secure-Storage Compute-in-Memory Architecture with Circuit-Algorithm Co-Design for Efficient and Trustworthy Edge Inference

Hanyong Shao; Zhiyuan Ning; Wenshuai Yao; Runteng Zhu; Wenpu Luo; Xiaolei Wang; Xunzhao Yin; Meng Li; Kechao Tang; Ru Huang

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Edge computing that offloads AI inference from the cloud to local devices can significantly reduce interaction latency and improve data privacy, while simultaneously imposing stringent requirements on both efficiency and security. Non-volatile compute-in-memory (nvCIM) boosts efficiency by executing multiply-accumulate (MAC) inside memory arrays, reducing data movement and eliminating weight refresh. However, deploying nvCIM in untrusted environments exposes both user inputs and model weights to multiple security threats: during storage, non-volatility keeps weights vulnerable to physical extraction; during computation, plaintext inputs must be transferred and exposed at runtime. Recent in-memory encryption CIM (IME-CIM) schemes integrate an XOR cipher and in-situ decryption to protect stored weights, yet still require plaintext keys and inputs during inference and often incur up to 2x array area overhead, leaving a security-efficiency gap. This paper proposes S²CIM, a FeFET-based nvCIM architecture with circuit-algorithm co-design to bridge this gap. For security, an Affine-Transform Splitting (ATS) scheme based on randomization strategies protects both offline weights (XORed weights) and online computation (obfuscated inputs) without explicitly exposing plaintext keys or inputs. For efficiency, a Drain-Input Gate-Scan (DIGS) based on 1FeFET-1R achieves low-variation signed multi-bit MAC in minimal cycles with high area efficiency comparable to common nvCIMs. We validate its functionality using an S²CIM macro with 16x16 arrays. Compared with recent IME-CIMs, S²CIM improves area efficiency by up to 16.8x and reduces energy per MAC by up to 6.7x, while ensuring 97.4% encrypted ViT inference accuracy and reducing attack success rates to 50% under all potential threat models.

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH209

Pushing the Limits of Inverse Lithography with Generative Reinforcement Learning

Haoyu Yang; Haoxing Ren

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Differentiable Dreams: GPU-Powered Litho Gets Its Gradient Groove Back +1
Abstract

Inverse lithography (ILT) is critical for modern semiconductor manufacturing but faces the challenge of non-convex optimization, which can lead to sub-optimal local minima. Although advances in model architectures and loss functions have improved performance and reduced the number of post-ILT refinement iterations, two critical issues remain unresolved, leaving generative AI–based inverse solutions sub-optimal: 1) The training dataset is often sub-optimal; simply mimicking its behavior does not necessarily yield higher-quality masks. 2) Post-training ILT refinement involves navigating a highly non-convex manifold. To address this, we reformulate the generative model G as a learned distribution over the mask space conditioned on designs. Using a Style-Aware GAN pre-trained on a large design dataset, we introduce a fine-tuning stage that combines policy optimization with imitation learning. This trains the GAN to generate masks that are both high-quality and robust, requiring minimal subsequent numerical refinement. Our hybrid framework mitigates the sub-optimal traps of conventional ILT, improves mask quality, and reduces optimization time, offering advantages beyond what traditional solvers can achieve.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH210

Cryozip: An Efficient Cryogenic Compressor for Quantum Error Correction Syndromes

Guanchen Tao; Alexander Knapen; Jacob Mack; Gokul Subramanian Ravi; Qirui Zhang; Mehdi Saligane; Dennis Sylvester

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

Scaling fault-tolerant quantum computing is increasingly constrained by the limited bandwidth and power budget across the 4 K to room-temperature (RT) interface. We present CryoZip, a cross-layer cryogenic compression framework that cooperates with a lightweight in-house quantum error correction (QEC) predecoder to reduce syndrome transmission under realistic, circuit-level noise. CryoZip targets sparse syndrome vectors with a sliding-window compression architecture sized under strict decoding-latency constraints to maximize energy efficiency. We implement and evaluate the design in 22 nm FDSOI characterized at 4 K, using vector-based power, performance, and area analysis to obtain realistic hardware data. CryoZip achieves up to 48.3× compression (1.81× higher than state-of-the-art compressors) across various QEC codes; when paired with the predecoder it yields over 14,238.86× bandwidth reduction (48.3× without predecoding), and delivers 3.97–25.74× energy savings for cryo-to-RT links alone, rising to 42.19× when accounting for predecoding and realistic QEC interface overheads.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH213

Exploiting Per-Core Leakage: Electromagnetic Side-Channel Monitoring of Multicore Architectures

Daehyeon Bae; Sujin Park; Insup Lee; Young-Giu Jung; Kyeongsik Lee; Heeseok Kim; Seokhie Hong

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

Multicore processors are increasingly adopted in embedded systems to meet growing performance demands. However, physical side-channel analysis of multicore architectures remains underexplored, as obtaining usable leakage is inherently challenging. Consequently, side-channel security research on such systems has lagged far behind, leaving a critical security gap. To address this gap, we reveal the electromagnetic leakage mechanisms in multicore architectures and, for the first time, demonstrate per-core leakage exploitation, thereby enabling physical side-channel analysis for these systems. As a practical extension, we present a non-intrusive side-channel monitoring method that achieves per-core granularity. To validate its feasibility and practicality, we implement a prototype on a heterogeneous SoC platform with an RF front-end, and evaluate on a commercial off-the-shelf quad-core embedded system, the Raspberry Pi 4B with ARM Cortex-A72 cores.

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH214

Miter-Aware LUT Mapping: Aligning Structure and Solvability for Efficient Logic Equivalence Checking

Jiaying Zhu; Zhengyuan Shi; Mengxia Tao; Kezhi Li; Min Li; Qiang Xu

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

Logic Equivalence Checking (LEC) is often bottlenecked by synthesis-induced structural perturbations and XOR-dense regions that hinder SAT solving. We contend that the modeling of the miter is as critical as the SAT solver itself and propose a miter-aware mapping framework that reformulates the problem before solving by constructing a LUT-based miter preserving structural correspondence and exposing high-level logic relations. It integrates equivalence-preserving mapping, Gaussian-guided XOR modeling, and solver-oriented LUT selection to produce solver-efficient representations. Experiments on comprehensive benchmarks show up to 92.1% reduction across state-of-the-art SAT solvers, highlighting the importance of solver-aware modeling in enhancing LEC efficiency.

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH223

Ihyperg: Incremental Hypergraph Partitioning on GPU

Wan Luan Lee; Aditya Das Sarma; Che Chang; Chih-Chun Chang; Tsung-Wei Huang

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

Recent advances in GPU-accelerated hypergraph partitioning have achieved substantial performance gains but remain limited to full partitioning. In particular, the lack of support for incrementality is a critical limitation for many CAD applications, where circuit hypergraphs iteratively undergo incremental modifications as part of optimization loops. To overcome this limitation, we present iHyperG, the first GPU-parallel incremental k-way hypergraph partitioner. iHyperG introduces a scalable delta-based hypergraph data structure for efficient incremental modifications on a GPU, along with an effective incremental partitioning algorithm that rebalances partitions in a single pass and refines only cut-critical vertices. Experimental results show that iHyperG achieves average speedups of 190x for modification and 83x for partitioning over a state-of-the-art GPU-parallel hypergraph partitioner, while maintaining comparable partitioning quality.

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH235

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

Juntao Zhao; Jiuru Li; Chuan Wu

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints—existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Evaluated on five x86/ARM CPU platforms, Sandwich achieves average 2.01× end-to-end speedup and up to 3.40× latency reduction over SOTA systems. Its kernels match static compiler performance with three orders of magnitude less tuning cost.

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH239

Edgesc: Universal Stochastic Computing Architecture for Efficient Edge Detection

Xincheng Feng; Wenyong Zhou; Taiqiang Wu; Zhengwu Liu; Meng Li; Ngai Wong

Date:Wednesday, July 29 Location:Mtg Room 101A Session:The Edge-Lord Chronicles: Tiny Gestures, Big LLMs, and Cluster Chaos +1
Abstract

Stochastic computing (SC) enables compact and low-complexity hardware but remains underexplored for vision applications. We propose EdgeSC, a unified stochastic framework that extends edge detection to diverse and composite operators. Pixel intensities are encoded as stochastic bitstreams, and gradients are computed through finite-state machine (FSM) ensembles operating in the probability domain. A differentiable MUX mapping learns operator-specific behaviors without changing the architecture. Fabricated in 28-nm CMOS, EdgeSC achieves 15.9× smaller area, 4.4× lower power, and 6.3× better area-delay product than 8-bit baselines while maintaining comparable accuracy and throughput.
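
For readers new to stochastic computing, the sketch below shows the two textbook building blocks that the probability-domain view rests on: a value in [0, 1] encoded as the fraction of ones in a bitstream, multiplication via a bitwise AND, and scaled addition via a MUX. It does not model EdgeSC's FSM ensembles or its learned MUX mapping; the function names, stream length, and seed are arbitrary choices for illustration.

```python
import random


def to_bitstream(p, length, rng):
    """Unipolar encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_bitstream(bits):
    """Decode a stream back to a value as the fraction of ones."""
    return sum(bits) / len(bits)


def sc_multiply(x_bits, y_bits):
    """AND of two independent streams approximates the product x * y."""
    return [a & b for a, b in zip(x_bits, y_bits)]


def sc_scaled_add(x_bits, y_bits, select_bits):
    """MUX driven by a p = 0.5 select stream approximates (x + y) / 2."""
    return [a if s else b for a, b, s in zip(x_bits, y_bits, select_bits)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
N = 20000                # longer streams give lower estimation noise

x = to_bitstream(0.8, N, rng)   # e.g. a normalized pixel intensity
y = to_bitstream(0.3, N, rng)
sel = to_bitstream(0.5, N, rng)

product_estimate = from_bitstream(sc_multiply(x, y))          # near 0.8 * 0.3
average_estimate = from_bitstream(sc_scaled_add(x, y, sel))   # near (0.8 + 0.3) / 2
```

Each operation costs a single gate per bit, which is the source of the area and power savings SC designs aim for, at the price of accuracy that depends on stream length.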

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH240

A Locality-Aware Temporal Motif Mining Accelerator with Chunk-Based Search Tree Expansion

Yinbo Hou; Hao Qi; Jin Zhao; Yu Zhang; Yiling Lu; Hui Yu; Longlong Lin; Wenbin Jiang; Xiaofei Liao; Hai Jin

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Where Biology Meets Computation: Next‑Gen Bio‑Cyber‑Physical Innovations +1
Abstract

Mining temporal motifs in temporal graphs is essential for many critical applications. Although several software and hardware temporal motif mining solutions have been proposed, they still suffer from substantial redundant and irregular off-chip communication due to misaligned search tree expansions across different motif matching tasks. In this work, we observe that different tasks traverse the same temporal graph edges in strict chronological order, exhibiting strong data locality among these tasks. Motivated by this insight, we propose LTMiner, a locality-aware hardware accelerator designed to efficiently handle temporal motif mining. Specifically, LTMiner integrates a novel chunk-based search tree expansion mechanism into the accelerator design to align the graph traversals of different tasks at the granularity of data chunks, substantially boosting the data locality among these tasks and lowering data access cost. The results show that LTMiner achieves speedups of 1.1×–652.6× and 1.8×–70.3×, and energy savings of 3.9×–2050.9× and 1.2×–17.3×, compared to cutting-edge software and hardware solutions, respectively.

DesignDES3. Emerging Models of ComputationQuantumEDA
RESEARCH241

PCBgen: Self-Evolving LLM Agent for Specification-to-Schematic PCB Design with Parameter Tuning

Yiming Zhang; Yangbo Wei; Zhangqi Huang; Yuhao Gao; Ting-Jung Lin; Jinmei Lai; Chen Wu; Lei He

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Printed circuit board (PCB) design is increasingly critical as artificial intelligence (AI) systems demand higher integration density and stricter electrical constraints, while purely manual workflows struggle to keep up. We introduce PCBgen, a spec-to-schematic framework that integrates PCB knowledge graph (PCB-KG)-based retrieval, constraint-aware pre-filtering, a multi-modal intermediate representation (IR), and SPICE-in-the-loop, training-free Group Relative Policy Optimization (TF-GRPO) into a unified closed-loop pipeline for board-level design. This pipeline compresses an otherwise combinatorial search space (up to 10^20 candidates) to roughly 10^3 simulatable designs while preserving semantic and electrical validity. Extensive experiments for power supply designs demonstrate that, on a 336-case benchmark spanning six topology families and three design regimes, PCBgen improves topology adaptation (TA) by up to 57%, raises Pass@5 from 41% to 74%, and cuts token usage and wall-clock time by more than 50% compared with LLM-only baselines. In a 24-case comparison with human designers using existing vendor tools, it achieves 3.70x and 4.74x speedups in topology selection and schematic verification with comparable Pass@1, yielding an overall return on investment (ROI) of 4.39x and pointing toward a practical route to agentic, closed-loop PCB power supply design.

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH256

PREBA: A Hardware/Software Co-Design for Multi-Instance GPU Based AI Inference Servers

Jiin Kim; Gwangoo Yeo; Yujeong Choi; Minsoo Rhu

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

NVIDIA's Multi-Instance GPU (MIG) technology, which enables reconfiguration of large GPUs into smaller, independent slices, is a promising feature for high-performance AI inference servers. Our characterization reveals that the data preprocessing stage of AI inference causes a significant performance bottleneck in these MIG-based inference systems. We present PREBA, a hardware/software co-design targeting MIG inference servers. PREBA offloads data preprocessing to a latency-optimized, FPGA-based accelerator. Simultaneously, PREBA's analytical model-based dynamic batching system maximizes small vGPU utilization by creating optimal, input-aware batches. PREBA provides improvements of 3.7x in throughput, 3.5x in energy efficiency, and 3.0x in cost efficiency over a baseline that uses CPU-based preprocessing.

AIAI5-I. AI/ML System and Platform Design
RESEARCH257

SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

Jehyeon Bang; Eunyeong Cho; Ranggi Hwang; Jinha Chung; Minsoo Rhu

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training. Our system improves inference throughput by up to 4.30×, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
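As background on speculative decoding in general, the sketch below shows one greedy draft-then-verify step. It is not SpecMoE's self-assisted algorithm; the function `speculative_step` and the toy `target_next` and `draft_next` models are hypothetical stand-ins. The draft proposes k tokens, the target keeps the longest prefix that matches its own greedy choice, and then it contributes one token of its own (a correction on mismatch, or a bonus token when everything matched).

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative-decoding step; returns the newly accepted tokens."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2) The target model verifies the proposal. A real system would score all
    #    positions in one batched forward pass; this loop is for clarity only.
    accepted, ctx = [], list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    accepted.append(target_next(ctx))   # all matched: one bonus target token
    return accepted

# Toy models over digit tokens (assumptions for illustration only).
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    # Agrees with the target except after token 3, where it guesses wrongly.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

partial = speculative_step([1], draft_next, target_next, k=4)
full = speculative_step([1], target_next, target_next, k=3)
```

With greedy verification the output always equals what the target model alone would have generated, so the draft's quality affects only speed, never the result.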

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH259

HierPAS: Accelerating Video Generation Models with Hierarchical Precision and Adaptive Sparsity

Changxu Liu; Yifan Song; Ruibin Chen; Fan Yang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Fast, Furious, and Fault-Tolerant: Accelerating the Generative Grind +1
Abstract

Diffusion Transformer (DiT)-based video generation models (VGMs) achieve state-of-the-art visual quality through global attention modeling but incur heavy computational overhead due to the quadratic complexity of attention. In high-resolution or long-duration videos, attention often dominates inference latency. However, much of this computation is redundant: many attention scores contribute little to the output and can be safely skipped or approximated. Fully exploiting this redundancy requires (1) identifying important regions in large attention maps, (2) determining adaptive retention ratios across heads and blocks, and (3) handling scores of varying importance efficiently. We present HierPAS, a hardware–software co-optimized design that accelerates VGMs using hierarchical precision and adaptive sparsity. HierPAS employs a lightweight eager attention method to estimate attention patterns and a sampling-based entropy analysis to derive head-wise retention ratios with minimal cost. It applies progressively reduced precision to less critical regions and integrates a configurable top-k engine with a unified multi-precision GEMM engine supporting multiple precisions in one datapath. Evaluations show that HierPAS improves energy efficiency by up to 178× over NVIDIA H20 and 7.7× over state-of-the-art accelerators, with negligible loss in video quality.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH261

ELMBA: Escape from Local Minima in Buffer and Splitter Insertion for AQFP Circuits

Yanshuang Teng; Rongliang Fu; Xijie Wang; Dan Niu; Zhou Jin

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

Adiabatic Quantum-Flux-Parametron (AQFP) is a promising superconducting logic family that combines ultra-low power consumption with high-speed switching. However, the extensive insertion of buffers and splitters (B/S) to satisfy fan-out and synchronization constraints significantly increases circuit area and depth, becoming a key bottleneck in AQFP synthesis. Existing heuristic approaches are lightweight but prone to local optima, while global or exact methods achieve higher solution quality at the expense of runtime and scalability. In this work, we propose the first Large Neighborhood Search (LNS)-guided framework for AQFP B/S insertion, which combines multi-granularity group movement with a destruct-and-repair paradigm to systematically escape local minima while ensuring legality through constraint-aware repair. Extensive experiments on ISCAS'85 and EPFL benchmarks show that our framework achieves up to 14.0% fewer B/S insertions and 13.1% fewer Josephson junctions (JJs) on large EPFL circuits, while yielding the lowest circuit depth among state-of-the-art methods. It also attains up to a 3.1× runtime speedup on large benchmarks, and on the ISCAS'85 suite reduces B/S insertions and JJs by 13.0% and 8.1%, respectively.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH270

3D-ReSAC: A Near-3D-Stacked-DRAM Processor with Area-Efficient Sequential Refresh-Skip and High-Link-Utilization ArrowMesh Communication for Fast Low-Batch LLM Inference

Wenbin Jia; Haocheng Li; Xiang Li; Xinyuan Lin; Yongpan Liu

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

Low-batch LLM inference on edge hardware places stringent demands on both memory bandwidth and computational capacity. While 3D-stacked DRAM accelerators offer a promising solution, they introduce two critical overheads that are frequently under-optimized: DRAM refresh and collective communication. To mitigate these issues, we propose 3D-ReSAC, a near-memory processor based on 3D-stacked DRAM, equipped with an area-efficient sequential refresh-skip method and high-link-utilization ArrowMesh communication. Our evaluations show that 3D-ReSAC reduces refresh and communication overheads by 7–100% and 53–75%, respectively, leading to a 1.12× to 2.02× latency reduction across low-batch LLM inference workloads.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH271

Learning the "why": Causal-Aware Learning in Explainable and Efficient CPU Design Optimization

Yiyang Zhao; Tianning Gao; Hongyang Pan; Liangji Wu; Yiming Wei; Zhaori Bi; Changhao Yan; Xuan Zeng

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

As modern CPUs grow increasingly configurable, design space exploration (DSE) has become essential for navigating complex architectural trade-offs. However, existing DSE approaches predominantly rely on statistical correlations, resulting in opaque decision processes. They cannot clearly explain how configurations affect PPA outcomes, which in turn reduces designers' confidence in automated recommendations. To address this limitation, we introduce causal learning into the DSE pipeline and develop CAL-DSE, a framework that constructs a validated causal graph combining statistical evidence with LLM-informed domain knowledge. Building on this structure, the causal graph decomposes the high-dimensional design space, enabling both interpretability and efficient exploration. Experimental results on a RISC-V processor show that CAL-DSE achieves up to 4.12× hypervolume improvement while revealing validated causal pathways between design parameters and PPA outcomes.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH272

ZK-Flex: A Flexible and Scalable Framework for Accelerating Zero-Knowledge Proofs

Adiwena Putra; Cuong Duong; Anh Pham; Joo Kim

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

Zero-knowledge proofs (ZKP) enable verification without revealing private data, but proof generation remains compute-intensive, dominated by polynomial (POLY) and elliptic-curve (EC) operations over large-bitwidth fields. Efficient acceleration requires flexible multi-precision arithmetic and high utilization across shifting POLY and EC workloads, yet existing reconfigurable designs address these demands only partially. We propose ZK-Flex, a software–hardware co-designed framework that reduces computation through hardware- and workload-aware POLY and EC optimizers, and employs TCore, a Toom–Cook–based multi-precision core supporting diverse bitwidths. Across representative benchmarks, ZK-Flex delivers 5-11x speedup and up to 3.8x higher area efficiency than prior accelerators.
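For readers unfamiliar with Toom–Cook multiplication, the sketch below shows Karatsuba multiplication, which is the two-way special case (Toom-2) of the family. It is a software illustration only and does not represent TCore's hardware design. The function name and the 64-bit base-case threshold are assumptions. Each operand is split into two halves and the product is rebuilt from three half-size multiplications instead of four.

```python
import random

def karatsuba(x, y, limb_bits=64):
    """Multiply non-negative integers with Karatsuba (Toom-2) splitting."""
    # Base case: at least one operand fits in a single limb.
    if x < (1 << limb_bits) or y < (1 << limb_bits):
        return x * y

    # Split both operands at the same bit position: v = v1 * 2^half + v0.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask
    y1, y0 = y >> half, y & mask

    # Three recursive products instead of four.
    z0 = karatsuba(x0, y0, limb_bits)                       # low * low
    z2 = karatsuba(x1, y1, limb_bits)                       # high * high
    z1 = karatsuba(x0 + x1, y0 + y1, limb_bits) - z0 - z2   # cross terms

    return (z2 << (2 * half)) + (z1 << half) + z0

# Check against Python's built-in multiplication on 381-bit and 768-bit
# operands (sizes picked only as examples of large-bitwidth inputs).
rng = random.Random(1)
a = rng.getrandbits(381)
b = rng.getrandbits(768)
product = karatsuba(a, b)
```

Higher-order Toom–Cook variants split operands into more pieces and trade fewer multiplications for more additions and small constant divisions.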

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH278

CellE: Automated Standard Cell Library Extension via Equality Saturation

Yi Ren; Yukun Wang; Xiang Meng; Guoyao Cheng; Baokang Peng; Lining Zhang; Yibo Lin; Runsheng Wang; Guangyu Sun

Date:Tuesday, July 28 Location:Mtg Room 101B Session:AI Graces the Cell Party: Transformers, LLMs and Zero Drama Extraction +1
Abstract

Automated standard cell library extension is crucial for maximizing Quality of Results (QoR) in modern VLSI design. We introduce CellE, a novel framework that leverages formal methods to achieve exhaustive discovery of functionally equivalent subcircuits. CellE applies equality saturation to the post-mapping netlist, generating an e-graph to cluster all functionally equivalent implementations. This canonical representation enables an efficient pattern mining algorithm to select the most area-optimal standard cells. Experimental results show a 15.41% average area reduction (up to 23.64% over prior work). Furthermore, characterization in a commercial flow demonstrates an 8.00% average delay reduction, confirming CellE's superior QoR optimization capabilities.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH279

SiTA: Exploiting Sparsity in Tensor-Product for Accelerating Quantum Readout Error Mitigation

Hanyu Zhang; Zhiwei Ye; Liqiang Lu; Kaiwen Zhou; Fangxu Guo; Siwei Tan; Fangxin Liu; Size Zheng; Jianwei Yin

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

Though quantum computing theoretically provides exponential computational advantage, an end-to-end quantum speedup should account for encoding, compilation, and many other peripheral processes that rely on classical computers. Among them, readout error mitigation turns out to be the bottleneck, requiring mitigation of noise errors to achieve high fidelity. To this end, leveraging the sparsity in the tensor-product has emerged as an appealing approach to improve computational efficiency. However, existing approaches, based on intermediate data pruning or threshold pruning, suffer from limited acceleration and fidelity loss. In this paper, we propose SiTA, a quantum error mitigation accelerator that exploits the inherent sparsity of Hilbert space. The key insight is that the superposition states generated by quantum algorithms cover only a subset of the Hilbert space (output sparsity), and the basis state naturally exhibits bit-level sparsity (input sparsity). Therefore, we introduce a sparse dataflow that features probability-level and state-level parallelism, which effectively skips calculations related to zero values. Finally, we design a hardware architecture that supports various computational paradigms of tensor-product, enabling acceleration for readout error mitigation. Experiments show that SiTA achieves an end-to-end speedup of 4.9X to 1605X over prior mitigation methods, without sacrificing fidelity.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH282

HiBand: A Hierarchical Bandwidth-Aware HB-DRAM Accelerator with Array-Level Co-Execution and Reconfigurable NoC for LLM Inference

Huanyu Wang; Yang Wang; Yutong Su; Jiaxin Yang; Yang Hu; Shouyi Yin

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

Hybrid-bonding 3D DRAM (HB-DRAM) delivers massive in-situ bandwidth but shifts LLM accelerator bottlenecks from off-chip memory to the Network-on-Chip, where decentralized bank-local bandwidth must be coordinated across arrays. We present HiBand, a bandwidth-aware HB-DRAM accelerator that co-designs array-level execution with a four-mode reconfigurable NoC. HiBand combines grouped tensor parallelism and cross-head array-level co-execution to confine most traffic to short-range links while overlapping bandwidth-bound attention with compute-bound feed-forward layers. On LLM models mapped to a 32-array HB-DRAM accelerator, HiBand achieves up to 4.28× speedup over an HBM3 GPU-style accelerator and 1.67× speedup over a state-of-the-art HB-DRAM–based NMP design.

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH284

PRS: An Efficient Parallel SAT Framework

Zhihan Chen; Xindi Zhang; Yuhang Qian; Congyi Zhang; Shaowei Cai

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

The propositional Satisfiability (SAT) problem is fundamental to many applications in Electronic Design Automation (EDA). This paper presents PRS, an efficient and comprehensive parallel SAT framework. We introduce two key techniques to enhance solver performance. The first is a lightweight preprocessing method called Resolution Checking, which efficiently simplifies circuit-encoded CNFs. The second is a new hybrid diversification strategy that combines a Regular Shifting method for the initial branching order with a parallel local search to generate diverse initial variable phases. PRS also supports extensive preprocessing, dynamic clause sharing, reproducible parallel solving, and parallel proof generation. Extensive experiments on the SAT Competition 2025 (SC25) benchmark demonstrate the effectiveness and scalability of our framework: PRS outperforms the SC25 Parallel-Track winner MallobSAT by solving 12 more instances with an 11.4% better PAR2 score, while also achieving a 4.7x speedup on 64 cores.
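The PAR2 metric cited above is the penalized average runtime commonly used to rank SAT solvers: a solved instance counts its runtime, and an unsolved instance counts twice the time limit, so lower is better. The sketch below assumes that convention. The function name, the use of `None` for unsolved instances, and the sample runtimes are illustrative and are not data from the paper.

```python
def par2(runtimes, timeout):
    """Penalized average runtime with factor 2.

    runtimes: per-instance solve times in seconds, or None if unsolved.
    timeout:  time limit in seconds; unsolved or over-limit runs cost 2 * timeout.
    """
    total = 0.0
    for t in runtimes:
        solved = t is not None and t <= timeout
        total += t if solved else 2 * timeout
    return total / len(runtimes)

# Hypothetical results for two solvers on three instances, 5000 s limit.
solver_a = par2([10.0, None, 100.0], timeout=5000)   # one timeout
solver_b = par2([20.0, 400.0, 150.0], timeout=5000)  # all solved
```

Because each timeout costs twice the limit, solving a few more instances usually moves PAR2 more than shaving seconds off instances that were already solved.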

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH285

Parallel Combinational Equivalence Checking via Sweeping-Based Task Scheduling

Zhihan Chen; Congyi Zhang; Shaohuang Chen; MengYu Zhao; Shaowei Cai

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

Combinational Equivalence Checking (CEC) is a cornerstone of modern IC design and verification flows, and state-of-the-art CEC solvers predominantly rely on SAT-sweeping frameworks. As circuit scale and complexity continue to increase, purely serial sweeping becomes a critical performance bottleneck, motivating the exploration of parallelism. However, existing parallel CEC approaches primarily focus on accelerating the verification of individual candidate pairs, while leaving the sweeping process itself essentially serial. This paper presents HydraCEC, a novel, general-purpose parallel CEC framework that, to the best of our knowledge, is the first to parallelize the sweeping process itself by concurrently executing its verification tasks. Its effectiveness is further enhanced by a dynamic benefit-aware scheduling policy guided by a Task Benefit Graph (TBG), and an asynchronous equivalence sharing mechanism that enables cooperation without global synchronization. Experimental results on ISCAS/ITC, Datapath, and Large-Scale benchmarks show that HydraCEC consistently outperforms existing parallel CEC solvers, particularly on large-scale instances, where it achieves a 10.5x speedup over the best competitor and exhibits excellent scalability, reaching 72.3x speedup with 128 cores.

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH292

Throughput-Oriented Speculative Decoding via Intra-Device Parallelism

Zehua Li; Zijian Zhu; Zhanhong Tan; Fei Ren; Kaisheng Ma

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

Large Language Model (LLM) serving is crucial for applications like chatbots and code assistants, yet its implementation remains bottlenecked by low hardware utilization. Speculative decoding improves effective throughput, but still suffers from suboptimal resource partitioning. We introduce SmoothSpec, a speculative decoding system designed to maximize serving throughput by fully leveraging intra-device parallelism. SmoothSpec deploys draft and target models with the same parallelism configurations but allocates asymmetric compute resources at runtime. We design a novel scheduling strategy that dynamically adjusts both the draft parameters for the next step and the resource allocation ratio between the two models, based on the generation quality of draft tokens. We implement SmoothSpec on a single-node LLM serving system and demonstrate its effectiveness across diverse workloads.

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH293

SkiST: Memory-Efficient Fine-Tuning of Spiking Neural Networks via Spatio-Temporal Adaptation

Zhibai Huang; James Yen; Zhixiang Wei; Yun Wang; Fangxin Liu; Zhengwei Qi

Date:Monday, July 27 Location:Mtg Room 203AB Session:Advancing the Frontier of Neuromorphic Learning Systems +1
Abstract

Building high-performance Spiking Neural Networks (SNNs) involves a trade-off among competing paradigms. While unsupervised rules like STDP offer efficiency and ANN-to-SNN conversion provides a simple path, direct training with Backpropagation Through Time (BPTT) is the most effective route for achieving high accuracy. However, BPTT's effectiveness comes at a steep price: its memory cost scales linearly with timesteps, creating a resource barrier that limits the information capacity and practicality of SNNs. This makes memory-efficient training methods critical for real-world deployment. To this end, we propose the Skipping-Side-Tuning (SkiST) framework, which leverages both spatial and temporal redundancy in SNNs for memory-efficient fine-tuning. On the spatial side, SkiST introduces side networks and applies low-rank approximation with adaptive rank filtering to compress trainable parameters. On the temporal side, it employs Dynamic Sparse BPTT, selectively skipping non-critical timesteps during gradient propagation while compensating for information loss with exponentially decayed gradients. Experiments show SkiST reduces GPU memory by up to 50% over full fine-tuning while maintaining competitive accuracy. On an SNN-based language model, SkiST requires 23.2 GB of memory for 1024-token sequences, compared to 40.3 GB for the baseline, with minimal accuracy degradation. This work enables the deployment of adaptable, efficient spiking models on resource-constrained devices.

DesignDES3. Emerging Models of ComputationChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH296

A Remote Side-Channel Attack on Post-Quantum Key Encapsulation Mechanisms: A Case Study of ML-KEM on Intel x86

Mona Hashemi; Qianmei Wu; Fan Zhang; Shivam Bhasin; Trevor E. Carlson

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

As post-quantum cryptographic (PQC) schemes are standardized, evaluating their resilience to side-channel attacks (SCA) becomes critical. While most prior studies focus on physical SCAs, the practicality of remote SCAs on full implementations of standardized PQCs remains largely unexplored. In this paper, we present the first generic remote power SCA on the Module-Lattice-Based Key Encapsulation Mechanism (ML-KEM), evaluated on a modern Intel x86 processor. Our result demonstrates that power traces can be exploited remotely as a plaintext-checking oracle, enabling secret key recovery despite the scheme's theoretical IND-CCA security. Using ML-KEM as a case study, we show that complex microarchitectural mechanisms such as speculative execution and dynamic power management do not eliminate exploitable power leakage. We evaluated our attack on the PQClean implementation, achieving secret key recovery with a success rate up to 99.5%. These findings provide a realistic assessment of PQC leakage behavior on high-end processors and underscore the need for architecture-aware leakage models and co-designed hardware–software defenses to ensure secure PQC deployment in practice.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH297

EtherSSD: An In-Storage Ethereum Analytics Platform with Minimized I/O and Authentication Overhead

Chenlin Ma; Yongbiao Zhu; Tianyu Wang; Jiaxian Chen; Kecheng Huang; Rui Mao; Yi Wang

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

Ethereum's massive data requires auxiliary proofs (e.g., Merkle proofs) for trusted queries, creating significant I/O and data movement overhead. We introduce EtherSSD, an in-storage Ethereum analytics platform built on computational storage devices (CSDs) and designed for real-time data analysis. EtherSSD bridges the semantic gap between the host and CSDs, decomposing authenticated data queries into flattened (highly concurrent) in-CSD page accesses that sharply reduce host-side I/O operations. Additionally, EtherSSD incorporates an authentication engine that offloads cryptographic verification computations from host CPUs. Evaluations under real-world workloads demonstrate that it reduces authenticated query execution time, particularly the I/O and authentication overhead.
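To illustrate why authenticated queries add hashing and data-fetch work, here is a minimal sketch of Merkle proof generation and verification. It is a deliberate simplification: it assumes a plain binary Merkle tree with SHA-256, whereas Ethereum itself uses Merkle Patricia tries with Keccak-256, and it does not reflect EtherSSD's authentication engine. All function names and sample leaves are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _padded(level):
    """Duplicate the last node when a level has an odd number of nodes."""
    return level + [level[-1]] if len(level) % 2 else level

def build_levels(leaves):
    """Return all tree levels, from hashed leaves up to the single root."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = _padded(level)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:          # the root level needs no sibling
        level = _padded(level)
        sibling = index ^ 1            # neighbor within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Recompute the path to the root using only the leaf and its proof."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

records = [b"tx0", b"tx1", b"tx2"]     # hypothetical stored records
levels = build_levels(records)
root = levels[-1][0]
proof_for_tx2 = prove(levels, 2)
ok = verify(b"tx2", proof_for_tx2, root)
tampered = verify(b"txX", proof_for_tx2, root)
```

Each verified record requires one sibling hash per tree level, so proof size and hashing work grow with tree depth; that per-query cost is the kind of overhead the abstract refers to.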

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH315

Capacitive Touchscreens at Risk: Recovering Handwritten Trajectory on Smartphone via Electromagnetic Emanations

Yukun Cheng; Shiyu Zhu; Changhai Ou; Xingshuo Han; Yuan Li; Shihui Zheng

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

This paper reveals and exploits a critical security vulnerability: the electromagnetic (EM) side channel of capacitive touchscreens leaks sufficient information to recover fine-grained, continuous handwriting trajectories. We present Touchscreen Electromagnetic Side-channel Leakage Attack (TESLA), a non-contact attack framework that captures EM signals generated during on-screen writing and regresses them into two-dimensional (2D) handwriting trajectories in real time. Extensive evaluations across a variety of commercial off-the-shelf (COTS) smartphones show that TESLA achieves 77% character recognition accuracy and a Jaccard index of 0.74, demonstrating its capability to recover highly recognizable motion trajectories that closely resemble the original handwriting under realistic attack conditions.

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH320

SAAP: An Efficient Spatial-Aware Analytic Partitioning Algorithm of VLSI Netlists for Parallel Routing

Chen Liu; Hongxin Kong; Lang Feng; Wenchao Qian; Wuxi Li

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

As VLSI designs grow in complexity, partitioning is widely adopted to accelerate physical design through parallel computing. However, traditional hypergraph partitioning methods often degrade in performance when applied to 2D layouts due to spatial constraints. For routers with post-placement locations, a spatial-aware partitioning method fully utilizing placement data is preferable. Existing works can only consider soft spatial constraints, leading to a scattered distribution in one partition. We propose SAAP, an analytic partitioning algorithm enforcing hard spatial constraints while efficiently minimizing cut sizes. It includes analytic boundary modeling with regularity-guided simulated annealing and region embedding. Given placed netlists, it generates timing-friendly k-way spatially continuous partitions for parallel routing. Experiments show that it can quickly provide several to dozens of times smaller spatial cut sizes than previous state-of-the-art, with better spatial continuity.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH335

BSPDN-Elite: A Comprehensive Framework for Optimizing Timing, Power and Routing Resources in BSPDN Designs

Leilei Jin; Haoyang Xu; Siyuan Liang; Zhen Zhuang; Zhou Hu; Zixiao Wang; Bei Yu; Rongmei Chen; Tsung-Yi Ho

Date:Monday, July 27 Location:Mtg Room 201B Session:Advances in 3D/2.5D Physical Design +1
Abstract

Backside power delivery networks (BSPDN) provide superior power integrity while freeing backside routing resources. However, existing works fail to fully exploit these resources for power, performance, and area (PPA) optimization through strategic net allocation. This paper presents a co-optimization framework that maximizes PPA by strategically routing both clock and signal nets on the backside, enabling double-side routing. Experimental results demonstrate 88.1% IR-drop reduction, 23.0% frequency improvement, 10.9% power savings, and timing improvements (66.3% WNS, 40.2% TNS) over FSPDN with negligible nTSV overhead.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageSYS4. Embedded System Design Tools and Methodologies
RESEARCH341

TAG: A Topology-Aware Architecture for Configurable and Memory-Efficient GNN Acceleration

Ruoheng Yao; Yichen Ouyang; Wenneng Jiang; Tianji Wang; Yangyi Zhang; Zhiyue Gao; Zhiyong Lai; Lei Chen; Fengwei An

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

Graph Neural Networks (GNNs) offer powerful graph-structured data modeling capabilities, yet their acceleration is challenging. Rigid hardware parallelism struggles to accommodate algorithmic diversity and the sparse, irregular topology inherent to GNNs. To resolve this conflict, we propose TAG, a topology-aware GNN accelerator that achieves high performance through synergistic innovations at three levels: dataflow, scheduling, and memory hierarchy. For algorithmic diversity, we employ a configurable, topology-driven dataflow that is aware of both algorithmic needs and graph structure. To mitigate irregularity, a contention-aware scheduler orchestrates irregular memory accesses by reordering them into a conflict-free stream. Furthermore, an algorithm-architecture co-designed memory hierarchy, combined with a coarse-to-fine graph partitioning algorithm, maximizes data reuse from sparse graphs and significantly reduces off-chip traffic. Evaluations demonstrate that TAG achieves an average 3.22x speedup and 3.04x higher energy efficiency over state-of-the-art GNN accelerators.

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH343

SpecANNS: Accelerating Graph-Based Approximate Nearest Neighbor Search with Speculative In-Storage Computing

Yi Wang; Yongxin Shen; Yongbiao Zhu; Tianyu Wang; Chenlin Ma; Jiaxian Chen; Kecheng Huang; Rui Mao

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

Disk-based Approximate Nearest Neighbor Search (ANNS) suffers from high I/O latency due to random node access, which dominates over 90% of search time. To address this, we propose SpecANNS, an In-Storage Computing (ISC)-based solution leveraging in-storage FPGAs for fast distance computation. SpecANNS speculatively identifies pages likely to be accessed in subsequent hops and exploits NAND flash parallelism to minimize page read latency. Implemented on real Computational Storage Device (CSD) hardware with a simple interface, SpecANNS significantly reduces query latency and improves energy efficiency compared to state-of-the-art disk-based ANNS methods.

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH358

Lyra: A Hardware-Accelerated RISC-V Verification Framework with Generative Model-Based Processor Fuzzing

Juncheng Huo; Yunfan Gao; Xinxin Liu; Sa Wang; Yungang Bao; Xitong Gao; Kan Shi

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

As processor designs grow more complex, verification remains bottlenecked by slow software simulation and low-quality random test stimuli. Recent research has applied software fuzzers to hardware verification, but these rely on semantically blind random mutations that may generate shallow, low-quality stimuli unable to explore complex behaviors. These limitations result in slow coverage convergence and prohibitively high verification costs. In this paper, we present Lyra, a heterogeneous RISC-V verification framework that addresses both challenges by pairing hardware-accelerated verification with an ISA-aware generative model. Lyra executes the DUT and reference model concurrently on an FPGA SoC, enabling high-throughput differential checking and hardware-level coverage collection. Instead of creating verification stimuli randomly or through simple mutations, we train a domain-specialized generative model, LyraGen, with inherent semantic awareness to generate high-quality, semantically rich instruction sequences. Empirical results show Lyra achieves up to 1.27x higher coverage and accelerates end-to-end verification by up to 107x to 3343x compared to state-of-the-art software fuzzers, while consistently demonstrating lower convergence difficulty.

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH363

Network Design for Wafer-Scale Systems with Wafer-on-Wafer Hybrid Bonding

Patrick Iff; Tommaso Bonato; Maciej Besta; Luca Benini; Torsten Hoefler

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Interconnect-Driven System Architecture Innovation +1
Abstract

Transformer-based large language models are increasingly constrained by data movement as communication bandwidth drops sharply beyond the chip boundary. Wafer-scale integration using wafer-on-wafer hybrid bonding alleviates this limitation by providing ultra-high bandwidth between reticles on bonded wafers. In this paper, we investigate how the physical placement of reticles on wafers influences the achievable network topology and the resulting communication performance. Starting from a 2D mesh-like baseline, we propose four reticle placements (Aligned, Interleaved, Rotated, and Contoured) that improve throughput by up to 250%, reduce latency by up to 36%, and decrease energy per transmitted byte by up to 38%.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageQuantum +1 more
RESEARCH365

Advancing Macro Placement: Integrating Proven Design Practices with Reinforcement Learning

Hsin-Chuan Kuo; Chia-Wei Chen; Yi-Ying Liao; Tai-Yu Cheng; Bo-Jiun Hsu; Sheng-Tai Tseng; Chun-Chih Yang; Kun-Chin Huang; Tailai Tung

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

Macro placement is a critical stage in physical design, directly impacting the quality and performance of VLSI circuits. We propose a reinforcement learning (RL)-based macro placement framework that integrates proven design practices through a design-practice-embedded action mask, including peripheral placement, dead space avoidance, and proximal placement for macros with shared design hierarchy and physical footprint. Unlike previous RL-based methods, our method uses macro clusters, formed according to design hierarchy and physical footprint, as the basic placement units, which reduces the number of placement steps and consequently accelerates convergence and improves runtime. The proposed framework also introduces a novel compaction method to minimize wasted area caused by grid granularity, and jointly optimizes macro cluster location and tiling pattern for more effective exploration. Experimental results show that our approach achieves expert-level placement quality and consistently outperforms three leading commercial macro placers on industrial designs, with reduced turnaround time. On public benchmarks, our method achieves up to 25.37% and 39.51% improvements in worst negative slack (WNS) and total negative slack (TNS), respectively, over five state-of-the-art (SOTA) placers. These results demonstrate the effectiveness and real-world applicability of our RL-based framework, paving the way for further advancements in physical design automation.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH368

TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading

Yudong Pan; Yintao He; Tianhua Han; Lian Liu; Shixin Zhao; Zhirong Chen; Mengdi Wang; Cangyuan Li; Yinhe Han; Ying Wang

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU–CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU–CPU–NDP architecture that fills this gap by synergistically leveraging an AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83× speedup over state-of-the-art solutions.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH373

LATTE: Legality-Assured Differentiable Timing-Driven Detailed Placement

Jing Mai; Yi-Chen Lu; Yibo Lin; Haoxing Ren

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

Differentiable techniques for timing optimization have become the dominant paradigm in global placement, surpassing weight-tuning heuristics by directly leveraging gradients that encompass full-chip context. However, in detailed placement, state-of-the-art flows still rely on rule-based or enumeration-centric methods that ignore the existing gradient landscape, sacrificing significant wirelength only for marginal timing gains. To overcome this issue, we present LATTE, the first legality-assured, end-to-end differentiable timing-driven detailed placer that fuses continuous gradient smoothness with detailed discrete granularity. In particular, at each iteration, LATTE relaxes local density constraints around selected timing-critical cells, steering relocation from exact timing and wirelength gradients with step sizes in annealed schedules, and enforces end-of-place legality via incremental legalization and bad-move filtering. Across input placements from mainstream commercial and academic placers, LATTE consistently delivers significant timing improvement with minimal routed wirelength impact while ensuring legal final solutions. On 15 designs in a 7nm technology node, LATTE on average improves state-of-the-art timing-driven detailed placers including DREAMPlace-4.0 TCAD and Rsyn by 29.7% in TNS and 44.4% in routed wirelength, with all metrics verified by an industry-leading commercial tool.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH374

Sparsity- and Tolerance-Aware Temporally Redundant Neural Networks

Kyoji Awaki; Ali Alvi; Yamato Saikawa; Shogo Semba; Yuichi Okuyama; Hiroshi Saito; Hirohide Demura; Yuta Takahashi; Sumio Morioka; Yoichi Tomioka

Date:Monday, July 27 Location:Mtg Room 101B Session:Computing Without Waste: Lean and Scalable Neural Architectures +1
Abstract

Recent progress in satellite formation flight in Low Earth Orbit (LEO) is driving demand for energy-efficient AI accelerators with strong soft-error resilience. This paper proposes a co-design of a fault-tolerant neural network and hardware that combines tolerance-aware temporal redundancy with sparse outer-product computation. A theoretical bound on single-bit-flip effects in residual-quantized dot products enables selective protection of only the most vulnerable computations. In addition, a zero-skip outer-product unit exploits activation sparsity created by residual quantization to reduce redundancy overhead. Implementation results show that the proposed design achieves a 16.0% speedup over the non-fault-tolerant model while keeping area overhead to 0.20–1.26%.

AIAI3-I. AI/ML Application and InfrastructureChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH375

Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle

Zihan Wang; Cheng Tang; Lei Gong; Cheng Li; Chao Wang; Teng Wang; Wenqi Lou; Xuehai Zhou

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks, yet incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache. Unlike traditional generation tasks where all tokens are uniformly important, CoT emphasizes the final answer, rendering conventional KV compression strategies ineffective. In this paper, we present Crystal-KV, an efficient KV cache management framework tailored for CoT reasoning. Our key insight is the answer-first principle. By mapping answer preferences into the think-stage attention map, we distinguish between SlipKV, which mainly maintains the reasoning flow but may occasionally introduce misleading context, and CrystalKV, which truly contributes to the correctness of the final answer. Next, we propose an attention-based Least Recently Frequently Used algorithm. It precisely identifies when a SlipKV entry's utility expires and evicts it, retaining CrystalKV without disrupting reasoning flow. Finally, we introduce an adaptive cache budget allocation algorithm. Based on the dynamic proportion of CrystalKV, it estimates the importance of each layer/head and adjusts the KV cache budget during inference, amplifying critical components to improve budget utilization. Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and enables faster response time, while maintaining, or even improving, answer accuracy for CoT reasoning.
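
For readers unfamiliar with Least Recently Frequently Used (LRFU) eviction, the following is a minimal sketch of the classic policy, not the paper's algorithm. Each entry keeps a score that decays with time and grows on every reference, and the lowest-scoring entry is evicted. The class name `LRFUCache`, the decay parameter `lam`, and the use of a per-reference `weight` as a stand-in for an attention score are all assumptions made for illustration.

```python
class LRFUCache:
    """Toy LRFU cache: score = sum of reference weights, each halved every 1/lam time units."""

    def __init__(self, capacity, lam=0.1):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.lam = lam          # hypothetical decay rate; 0 behaves like LFU
        self.score = {}         # key -> (score at last reference, time of last reference)

    def _decayed(self, key, now):
        """Score of `key` as seen at time `now`, after exponential decay."""
        value, last = self.score[key]
        return value * 0.5 ** (self.lam * (now - last))

    def touch(self, key, now, weight=1.0):
        """Record a reference to `key` at time `now`; return the evicted key, if any."""
        evicted = None
        if key in self.score:
            # Existing entry: decay the old score, then add the new reference weight.
            self.score[key] = (weight + self._decayed(key, now), now)
        else:
            if len(self.score) >= self.capacity:
                # Evict the entry whose decayed score is currently the lowest.
                evicted = min(self.score, key=lambda k: self._decayed(k, now))
                del self.score[evicted]
            self.score[key] = (weight, now)
        return evicted


cache = LRFUCache(capacity=2, lam=0.5)
cache.touch("a", now=0)
cache.touch("a", now=1)
cache.touch("b", now=2)
victim = cache.touch("c", now=3)  # "a" is older but was referenced twice
```

In this run a pure recency policy would evict "a", whereas the blended score keeps it because it was referenced twice. Passing a larger `weight` for strongly attended entries is one way such a score could be tied to attention, though how Crystal-KV does this is not specified in the abstract.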

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH378

Scheduling Cause-Effect Chains Without Timing Anomalies in End to End Latency

Yixuan Zhu; Bo Zhang; Yinkang Gao; Haoyuan Ren; Cheng Tang; Zhao; Lei Gong; Teng Wang; Wenqi Lou; Xi Li

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

In real-time systems, both individual task execution and data propagation must meet strict timing constraints. Cause–effect (CE) chains are widely used to analyze such behaviors in terms of end-to-end latency. However, timing anomalies (TAs), in which a local reduction in execution time leads to an increase in the overall end-to-end latency, can distort this analysis. As a result, precisely analyzing the upper bounds of the latency becomes challenging, and such systems typically exhibit larger upper bounds than TA-eliminated systems. Existing studies either eliminate TAs by completely sacrificing average latency to simplify analysis or, despite adopting complex safe analysis methods, do not eliminate TAs effectively and still suffer from high latencies. To address this issue, we identify two basic causes of TAs in end-to-end latency. Based on these causes, we propose the first treatment that eliminates TAs in the latency with negligible average latency loss using Deterministic Data Flow (DDF). We further formally prove its TA-free property. Therefore, we can get a precise upper bound for latency when all jobs execute with their worst-case execution times. Experimental results show that it effectively reduces the maximum end-to-end latency, the average latency, and latency jitter compared with the state-of-the-art (SOTA) method.

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH380

ChiPlanner: Physically-Aware and Timing-Driven Design Planner for 2.5D Multi-Chiplet Systems

Zixuan Li; Kanglin Tian; Fan Hu; Zirui Li; Xinyue Wu; Hengyuan Zhang; Shixin Chen; Jianwang Zhai; Xinfei Guo; Kang Zhao

Date:Monday, July 27 Location:Mtg Room 201B Session:Advances in 3D/2.5D Physical Design +1
Abstract

While chiplet-based 2.5D integration offers scalable high-performance computing with reduced manufacturing cost, it also poses significant challenges in design closure. Achieving timing closure in chiplet-based systems is notoriously difficult and frequently demands repeated engineering change orders (ECOs) and architectural-level iterations. To overcome these challenges, we propose ChiPlanner, a holistic early-stage design planner that integrates chiplet partitioning and physical planning. ChiPlanner first employs approximate placement to estimate post-chiplet-placement timing in advance of partitioning, which provides guidance for partitioning and helps reduce timing gaps. It further models inter-chiplet delay and incorporates a parallel net-weighting strategy in timing-driven physical planning to optimize overall system performance. Moreover, Bayesian optimization is leveraged to systematically explore the trade-off between chiplet manufacturing cost and system timing. Experimental results demonstrate that incorporating ChiPlanner's early-stage planning enables downstream chiplet design tools to achieve notably superior optimization outcomes, delivering average improvements of 42.4% in total negative slack (TNS) and 16.8% in worst negative slack (WNS). These results confirm that accurate early-stage planning provides a far better initialization for chiplet design and significantly reduces the burden on later ECOs and architectural-level iterations.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageSYS4. Embedded System Design Tools and Methodologies
RESEARCH384

AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices

Zirui Ma; Zhihua Fan; Wenxing Li; Haibin Wu; Fulin Zhang; Wenming Li; Xiaochun Ye

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile single-NPU-PIM system faces idle overhead in traditional operator-level synchronous execution and wasted computation in asynchronous execution due to fluctuations in draft length. This paper introduces AHASD, a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding. Notably, AHASD achieves parallel drafting on the PIM and verification on a single NPU through task-level DLM-TLM decoupling. Specifically, it incorporates Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to dynamically manage adaptive drafting algorithm execution and pre-verification timing, suppressing invalid drafting based on low-confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM to enable attention link localization and sub-microsecond task switching on the PIM side. Experimental results for different LLMs and adaptive drafting algorithms show that AHASD achieves improvements of up to 4.2× in throughput and 5.6× in energy efficiency over a GPU-only baseline, and gains of 1.5× in throughput and 1.24× in energy efficiency over the state-of-the-art GPU+PIM baseline, with hardware overhead below 3% of the DRAM area.
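
The draft-then-verify loop that speculative decoding relies on can be sketched as follows. This is a simplified greedy variant under stated assumptions, not AHASD's design: `draft_next` and `target_next` are hypothetical callables that map a token list to the next token, and the per-token loop stands in for the single batched forward pass a real target model would use for verification.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Draft k tokens with the small model, then keep the prefix the target model agrees with.

    Returns the tokens to append to `prefix`: every accepted draft token, followed by
    one token from the target model (a correction on mismatch, or a bonus token if
    all k drafts were accepted).
    """
    # Drafting phase: the small model proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # Verification phase: compare each draft token with the target model's choice.
    accepted = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)   # replace the first wrong draft token and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))   # all drafts accepted: one extra target token
    return accepted


# Toy models: the target counts positions modulo 5; the draft agrees only early on.
target = lambda ctx: len(ctx) % 5
drafter = lambda ctx: len(ctx) % 5 if len(ctx) < 3 else 9
out = speculative_step([0], drafter, target, k=4)
```

Here two of the four drafted tokens are accepted and the other two are discarded, which illustrates the wasted drafting work that fluctuating acceptance lengths cause.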

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH387

CODA: Cooperative Register Distribution and Arbitration Among Partitioned GPU Sub-Cores

Bo Yuan; Sheng Liu; Yang Guo; Zekun Jiang; Jianfeng Cui

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

The partitioned microarchitecture of the Streaming Multiprocessor (SM) is a fundamental design that enables massive parallelism in modern General-Purpose Graphics Processing Units (GPGPUs). This approach, however, inherently introduces two critical inefficiencies: vast data redundancy across distributed private registers and severe conflicts and load imbalance in the limited banks. To holistically address these cross-partition bottlenecks, this paper introduces CODA, a cooperative Register File (RF) distribution and arbitration framework. CODA is composed of two synergistic mechanisms: the Cooperative Register Renaming (CRR), which eliminates data redundancy by maintaining a single physical copy of shared data across sub-cores, and the Dual-Skewed Arbitrator (DSA), which mitigates fine-grained bank conflicts by incorporating the operand collector ID into its arbitration logic. Evaluations on a diverse suite of deep learning workloads show that CODA exhibits superior performance compared to its baseline architecture, achieving a speedup of 28.9% alongside a 15.6% reduction in RF power consumption. These gains are directly attributed to a 15.9% reduction in RF load imbalance and a 12.7% lower bank conflict rate.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH389

Toki: Profiling HBM Performance on FPGA Systems with RISC-V Soft Cores and PCIe Host DMA Traffic

Andrea Galimberti; Andrea Motta; Gianni Antichi; Davide Zoni

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

Programmable RISC-V soft cores are becoming more widespread in data-center scenarios, making it crucial to design efficient systems that deploy them on FPGA chips with HBM memory. Toki, released as open source, is the first hardware-software framework that enables profiling the performance of HBM on FPGA accelerator cards by jointly considering (i) the execution of workloads on RISC-V soft cores instantiated on the FPGA and (ii) the injection of memory traffic from the host system via DMA over PCIe, providing insights that cannot be obtained with synthetic traffic generators alone. Extensive experiments target an AMD Alveo U55C card, deploying up to 60 RISC-V compute cores and stressing its HBM2 memory through real-world applications and microbenchmarks with user-defined access patterns. Results showcase how Toki can effectively profile the impact on HBM performance of the compute cores' organization, the workload's memory access patterns, data locality, contention over the memory controllers, and host traffic.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH397

Lipcon: A Physics-Aware Spatially Resolved Light Penetration Compensation Method for Low-Cost 3D-Printed Microfluidics

Yushen Zhang; Siyuan Liang; Tsun-Ming Tseng; Ulf Schlichtmann

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Where Biology Meets Computation: Next‑Gen Bio‑Cyber‑Physical Innovations +1
Abstract

Three-dimensional (3D) printing enables low-cost, rapid prototyping of complex microfluidic biochips, but light penetration during printing distorts small, multi-layered structures. Existing solutions are either inaccessible or overlook localized geometric effects, leading to inaccuracies that degrade device performance. We propose a novel physics-aware, data-driven design-for-manufacturing approach to improve dimensional fidelity. Using a new dataset of design-fabrication discrepancies, our machine learning-assisted method predicts spatially resolved compensation values. Experiments on complex multi-layer devices demonstrate that our method significantly enhances geometric accuracy and consistently outperforms existing compensation strategies, even with low-cost hobby-grade printers and off-the-shelf resins.

DesignDES3. Emerging Models of ComputationQuantumEDA
RESEARCH399

ChiSER: Surrogate-Enhanced Resource-Aware Exploration of Chiplet-Based Deep Neural Network Accelerators

Yi-Cheng Lo; Siva Satyendra Sahoo

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

Co-designing chiplet-based neural-network accelerators spans partitioning, placement, dataflow, and microarchitecture under tight energy and latency limits. Fast system-level estimators scale to large searches but blur intra-core effects that dominate energy; fine-grain reference models capture them but are too slow in-loop. We close this speed–fidelity gap with a coarse-to-fine flow: a compact, architecture-aware surrogate of intra-core cost guides the global search, and only top designs are rechecked with the reference model. Our contributions are an energy-focused feature space and a lightweight predictor across convolutional and GEMM-like workloads. On ResNet-50 and a Transformer, we reduce a manufacturing-weighted energy–delay objective by up to 62% and 44%, respectively.

AIAI5-I. AI/ML System and Platform Design
RESEARCH401

PAGE: Processing-Using-DRAM Architecture-Circuits Co-Optimization for Efficient Acceleration of General Matrix-Vector Multiplication

Yifan He; Hyung Joon Byun; Jimin Lee; Do-Yoon Lim; Houxiang Ji; Jungmin Choi; Hoshik Kim; Jae-sun Seo

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

As large language models (LLMs) continue to scale in size and complexity, DRAM access has become a performance bottleneck, especially during the memory-bound decoding phase dominated by general matrix-vector multiplication (GEMV) operations. Recent work has explored near-/in-memory computing to address bandwidth limitations. Among these, Processing-Using-DRAM (PUD) offers a unique advantage as it performs logic operations entirely within DRAM peripherals with minimal circuit changes, enabling deployment on commercial off-the-shelf (COTS) DRAM. While prior work reported multi-input logic operations, its potential to accelerate GEMV operations in LLMs remains underexplored due to the limited compute capability of DRAM sense amplifiers (SAs). To overcome this challenge, we propose PAGE, a PUD architecture that accelerates GEMV through four key optimizations: (1) mat-level parallelism that exploits row-level parallelism by activating SAs across multiple mats; (2) inversion-enabled SA supporting NOT operations within a mat; (3) AP-based row copy mechanism that performs simultaneous source–destination activation with only control-signal changes; and (4) adaptive adder-tree accumulation that reduces accumulation cycles. We demonstrate the baseline PUD architecture on a COTS DRAM–FPGA platform to verify the functionality and timing of PUD operations, implement the modified circuits in TSMC 16nm for circuit-level evaluation, and build an in-house simulator for system-level throughput and energy analysis. Overall, PAGE achieves up to 14× and 3.1× end-to-end throughput and energy improvement for GEMV workloads over the SoTA PUD designs, demonstrating its feasibility for accelerating memory-bound LLM workloads with PUD.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH406

PS-CAM: An Efficient Content-Addressable Memory Based Accelerator for Point Search in Point Clouds

Xipeng Lin; Shaoxuan Li; Shanshi Huang; Hongwu Jiang

Date:Monday, July 27 Location:Mtg Room 202C Session:Intelligent Fetch & Match Architectures +1
Abstract

Point-based neural networks have become the mainstream approaches for point cloud processing. However, they heavily rely on computationally intensive neighbor search operations like K-Nearest Neighbors (KNN) and ball query, which create significant bottlenecks for deployment in resource-constrained environments. In this paper, we present PS-CAM, a novel NOR-CAM-based accelerator designed to support both ball query and KNN efficiently. PS-CAM converts the ball-query operation into a set of range search problems, formulating a one-shot search via the RENÉ range query scheme. To enhance energy efficiency within this scheme, a Two-Level Region-Partitioned (TLRP) technique is adopted, which reduces the number of simultaneously activated CAM banks. For KNN search, PS-CAM employs a progressive method through a series of ball queries with increasing radius. The search iterations are bounded by the Density-Aware Radius Estimation (DARE) method, which rapidly approximates the KNN-equivalent ball radius (KEBR), thereby drastically reducing the need for repeated queries. Compared with prior works, PS-CAM demonstrates significant performance gains, achieving speedups of 1.25~1.33× and an energy efficiency improvement of 142× in ball query. For the KNN task, PS-CAM delivers a speedup of 4.6~27.2× and energy-efficiency improvements of 7.1~2578×.
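
The idea of answering a KNN query through a series of ball queries with increasing radius can be sketched in software as below. This is an illustration under stated assumptions: the brute-force `ball_query` stands in for the CAM range search, and the fixed starting radius `r0` with geometric `growth` is a placeholder for the paper's DARE radius estimation, which the abstract does not detail.

```python
import math


def ball_query(points, query, radius):
    """Return the indices of all points within `radius` of `query` (boundary included)."""
    return [i for i, p in enumerate(points) if math.dist(p, query) <= radius]


def knn_by_ball_queries(points, query, k, r0=0.5, growth=2.0):
    """Find the k nearest neighbors by repeating ball queries with a growing radius."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    if r0 <= 0 or growth <= 1:
        raise ValueError("r0 must be positive and growth must exceed 1")
    radius = r0
    while True:
        hits = ball_query(points, query, radius)
        if len(hits) >= k:
            # Any ball holding at least k points contains the k nearest ones,
            # so sorting the hits by distance gives the exact answer.
            hits.sort(key=lambda i: (math.dist(points[i], query), i))
            return hits[:k]
        radius *= growth  # too few hits: enlarge the ball and query again


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
nearest = knn_by_ball_queries(points, (0.0, 0.0), k=3)
```

With these defaults the example needs four ball queries (radii 0.5, 1, 2, 4) before three points fall inside the ball, which shows why a good initial radius estimate reduces the number of repeated queries.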

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH407

AMS Layout Automation Using Circuit-Level Analog Standard Cells and Self-Biasing Techniques in Digital PnR Tools

Wonsik Oh; Luke Theogarajan

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

This work presents a silicon-verified analog–mixed-signal (AMS) layout-automation framework integrating a circuit-level analog standard-cell (CLAS) library with self-biasing circuits in a digital place-and-route (PnR) environment. Unlike transistor-level stem-cell approaches, the CLAS library standardizes matched circuit-level blocks, including self-biased amplifiers, current mirrors, and delay cells. Fabricated 180-nm and 65-nm CMOS chips validate fully automated, DRC/LVS-clean layouts for a current source, adaptive-bandwidth PLL, and probabilistic computer. All benchmarks show strong pre-/post-layout and silicon correlation, with the PLL exhibiting 0.98–1.50 ps simulated and 4.1 ps measured jitter. The framework achieves ≥96 % area utilization, 14.4–69.5 % DCAP density, and <1 min–1 hr runtime, demonstrating scalable, silicon-consistent AMS automation.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH440

Activity-Aware Partitioning for Effective Multi-Threaded Event-Driven RTL Simulation

Kexing Zhou; Youwei Zhuo; Yibo Lin; Weikang Qian; Pengpeng Ren; Yun (Eric) Liang

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Register-transfer-level (RTL) simulation is commonly accelerated through multi-threading and event-driven techniques. However, when these techniques are combined, severe load imbalance arises: in event-driven execution, circuit regions exhibit highly uneven activity, leaving many threads idle with very low utilization. Existing multi-threaded RTL partitioning models often optimize full-cycle execution and ignore the dynamic activity variation that dominates event-driven workloads. We present PGSIM, a multi-threaded event-driven RTL simulator that incorporates runtime activity information into its partitioning model. Our model augments hypergraph weights with measured activity frequency and inter-block activation correlation, which is efficiently estimated through a lightweight MinHash profiler. By aligning thread assignment with true activity patterns, PGSIM substantially reduces idle time and improves utilization. Across experiments on large benchmarks, PGSIM achieves 20–30% higher performance than its activity-unaware version, delivers a simulation speed of 200–400 kHz and up to 8.9× speedup over Verilator. These results demonstrate that runtime activity modeling is essential for scalable multi-threaded event-driven simulation.
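
As background on how MinHash can estimate the correlation between two blocks' activity, the sketch below treats each block's activity as the set of cycle numbers in which it was active and estimates the Jaccard similarity of two such sets from short signatures. It is a generic MinHash illustration, not PGSIM's profiler; the function names, the linear hash family, and the choice of 64 hash functions are assumptions.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any cycle number used here


def make_hashes(num_hashes, seed=0):
    """Create `num_hashes` random linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def signature(active_cycles, hashes):
    """MinHash signature: for each hash function, the minimum hash over the active cycles."""
    if not active_cycles:
        raise ValueError("a block must be active in at least one cycle")
    return [min((a * c + b) % PRIME for c in active_cycles) for a, b in hashes]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of |A & B| / |A | B|."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


hashes = make_hashes(64)
block_a = signature(range(0, 100), hashes)     # active in cycles 0..99
block_b = signature(range(50, 150), hashes)    # overlaps block_a in cycles 50..99
block_c = signature(range(200, 300), hashes)   # never active together with block_a
similarity_ab = estimated_jaccard(block_a, block_b)
similarity_ac = estimated_jaccard(block_a, block_c)
```

The true Jaccard similarity of the first two blocks is 50/150, and the estimate approaches it as the number of hash functions grows. The appeal for profiling is that each block needs only a fixed-size signature regardless of how many cycles are simulated.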

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH441

Cachence: Fine-Grained Cache Partitioning in Both Time and Space

Liujia Li; Yuanlong Li; Yiming Yao; Jianyu Wu; Yi Fan; Jinhao Guo; Liren Zhu; Jie Zhang; Xiaolin Wang; Yingwei Luo; Zhenlin Wang; Diyu Zhou

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

Cache partitioning mechanisms improve system performance by judiciously adjusting cache space among applications. Their effectiveness can be enhanced by partitioning in both time and space. However, existing solutions are limited to only one dimension. This paper presents Cachence, a fine-grained cache partitioning mechanism in both time and space. Cachence profiles cache access patterns with theoretically guaranteed accuracy while incurring fixed storage overhead. Based on runtime microarchitectural statistics, we also propose a lightweight yet precise performance prediction model. Our evaluation shows that Cachence outperforms existing approaches by an average of 9.5% to 28.2%, and up to 48.8% on a 16-core system.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH442

VALVE: Accelerating Visible Version Searching in HTAP Systems via In-Storage Computing

Yi Wang; Yang Zheng; Tianyu Wang; Chenlin Ma; Jiaxian Chen; Kecheng Huang; Jianbin Qin; Rui Mao

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

Hybrid Transactional and Analytical Processing (HTAP) workloads are widely deployed in production systems. Under HTAP workloads, frequent updates lengthen the version chains maintained in Multi-Version-Concurrency-Control (MVCC) systems. Longer version chains require Visible Version Searching (VVS) to scan more versions during analytical queries. In-Storage Computing (ISC) can alleviate the resulting I/O burden. We propose VALVE to offload the VVS process to a Computational-Storage-Device (CSD). VALVE first solves the consistency issue under host-CSD concurrency. A frugal skip list is introduced in VALVE to further accelerate the version-chain traversal. Experimental results show that VALVE reduces end-to-end scan latency and improves analytical throughput.
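
To make the cost of visible version searching concrete, the sketch below walks a newest-to-oldest version chain until it finds the first version a reader may see. It is a minimal model under stated assumptions, not VALVE's implementation: visibility is reduced to comparing a commit timestamp with the reader's snapshot timestamp (in-flight and aborted transactions are ignored), and the names `Version`, `commit_ts`, and `visible_version` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Version:
    commit_ts: int                     # timestamp at which this version was committed
    value: str
    prev: Optional["Version"] = None   # link to the next-older version


def visible_version(head: Optional[Version], snapshot_ts: int) -> Tuple[Optional[Version], int]:
    """Return the newest version committed at or before `snapshot_ts`, plus versions scanned."""
    scanned = 0
    node = head
    while node is not None:
        scanned += 1
        if node.commit_ts <= snapshot_ts:
            return node, scanned
        node = node.prev               # too new for this reader: step to an older version
    return None, scanned               # the record did not exist at the snapshot


# Three updates to one record; the chain head is the newest version.
v1 = Version(commit_ts=10, value="a")
v2 = Version(commit_ts=20, value="b", prev=v1)
v3 = Version(commit_ts=30, value="c", prev=v2)
found, scanned = visible_version(v3, snapshot_ts=25)
```

An analytical query with an old snapshot has to step over every newer version, so the scan length grows with the update rate. That linear walk is the work a skip structure over the chain is meant to shorten.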

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH444

GWrite: Offloading Replication in Disaggregated Memory Systems

Jianglang Zhu; Mingxing Zhang; Yingdi Shan; Yongwei Wu

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

Disaggregated memory systems replicate data across memory nodes to ensure strong consistency and durability, but synchronous replication quickly saturates the compute node's RDMA NIC as the replica count grows. We propose GWrite, a hardware-oriented RDMA primitive, and GWrap, a co-designed replication protocol that offload replication fan-out to memory-side RNICs while preserving one-sided RDMA semantics. A software prototype that emulates GWrite on commodity RNICs achieves up to 2.18× higher throughput in a distributed transaction system and reduces replication-induced throughput degradation from 46.7% to 19.7% in a replicated hash table, showing that exploiting memory-side RNIC outbound capacity can substantially alleviate compute-side IOPS bottlenecks in write-intensive disaggregated deployments.

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH445

Hyperanalog: Domain-Aware Hypergraph Transformer for Analog Circuit Representation Learning

Taejin Paik; Suwan Kim

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Taming the Physical World: AI Strategies for Circuits and Cores to Quantum Machines +1
Abstract

Analog circuits exhibit rich multi-terminal relationships that are poorly captured by conventional graph-based learning methods. We introduce Hyperanalog, a domain-aware foundation model that operates directly on a hypergraph representation tailored to transistor-level behavior. Hyperanalog integrates specialized transistor modeling, voltage-aware positional encoding derived from topological distances to VDD/VSS, and a transformer-enhanced hypergraph neural network to improve expressivity and structural discrimination. We evaluate Hyperanalog on circuit classification and specification regression. Experimental results show that Hyperanalog significantly outperforms both graph- and hypergraph-based baselines, demonstrating the efficacy of our domain-aware approach.

AIAI3-II. AI/ML Application and InfrastructureQuantumDES3. Emerging Models of Computation +1 more
RESEARCH461

MoE-Balance: Unleashing the Full Potential of CPU-GPU Heterogeneity for Efficient MoE Inference

Tiantian Lin; Xiaohang Wang; Yingtao Jiang; Amit Kumar Singh; Kui Ren

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

Sparse Mixture-of-Experts (MoE) models offer a powerful way to increase the capacity of large language models without proportionally increasing computation. Yet, their extremely large expert weight parameters create severe memory challenges for consumer-grade GPUs and edge devices. Existing offloading and hybrid CPU–GPU approaches fall short in fully utilizing CPU resources, and their coarse-grained scheduling and poorly overlapped CPU-GPU data transfers leave CPU cores frequently idle, thus causing pipeline bubbles and exacerbating overall system inefficiency. We present MoE-Balance, a system that efficiently utilizes heterogeneous CPU-GPU resources for MoE inference. The key innovations include a fine-grained dynamic scheduler that distributes expert tasks based on real-time load, and a cross-layer prefetching and chunked transfer mechanism that overlaps computation with weight movement via PCIe. Experiments on Mixtral-8x7B show that MoE-Balance achieves 1.6x higher throughput and 37.5% lower end-to-end latency compared with state-of-the-art hybrid baselines, demonstrating strong performance gains under memory-constrained settings.
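
The load-based distribution of expert tasks mentioned in this abstract can be illustrated with a simple greedy earliest-finish-time dispatcher. This is a stand-in technique chosen for illustration, not the MoE-Balance scheduler: the function name, the per-token cost estimates, and the device labels are hypothetical, and weight-transfer time over PCIe is ignored.

```python
def dispatch_experts(expert_tokens, cost_per_token, initial_load=None):
    """Greedily assign each expert task to the device expected to finish it first.

    expert_tokens:  dict expert_id -> number of tokens routed to that expert
    cost_per_token: dict device -> estimated time per token (arbitrary units)
    initial_load:   optional dict device -> time already queued on the device
    """
    initial_load = initial_load or {}
    load = {dev: float(initial_load.get(dev, 0.0)) for dev in cost_per_token}
    assignment = {}
    # Place the largest tasks first so small ones can fill the gaps.
    ordered = sorted(expert_tokens.items(), key=lambda kv: kv[1], reverse=True)
    for expert, tokens in ordered:
        best = min(cost_per_token,
                   key=lambda dev: load[dev] + tokens * cost_per_token[dev])
        load[best] += tokens * cost_per_token[best]
        assignment[expert] = best
    return assignment, load
```

With a GPU that is four times faster per token than the CPU, the sketch still sends the smallest expert to the CPU once the GPU queue is long enough. This shows in miniature how a load-aware policy keeps otherwise idle CPU cores busy.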

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH462

SCIMI: Semantic-Aware Collaborative Intrusion Detection for Multi-Domain In-Vehicle Networks

Chang Zhu; Xiaohang Wang; Abla Smahi; Kaiwei Wu; Yinhe Shen; Yingtao Jiang; Amit Kumar Singh; Kui Ren

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Scaling Intelligence for the Physical World +1
Abstract

Modern vehicles integrate multiple in-vehicle networks (IVNs), such as Controller Area Network (CAN), Automotive Ethernet, and host systems, to support advanced automation and connectivity. While these networks enable new functionalities, they also expand the attack surface across multiple domains. Sophisticated multi-stage attacks exploit these surfaces to propagate covertly, and existing vehicular intrusion detection systems (IDSs) cannot detect them because they cause only minor changes in IVNs. This paper presents SCIMI, a Semantic-aware Collaborative Intrusion detection system for Multi-domain In-vehicle networks, which learns inter-domain causal relationships to identify subtle, coordinated attacks that evade single-domain intrusion detection systems. SCIMI aligns asynchronous traffic patterns through a dual-window temporal model, extracts semantic features via domain-adaptive encoders, and applies cross-domain attention to correlate suspicious behaviors. A dataset is also released for collaborative IDS. Experiments show that SCIMI achieves over 99% F1-score with a 17% improvement over state-of-the-art methods, while maintaining a low false-positive rate and real-time inference efficiency, underscoring its applicability to real-time automotive cyber-security.

SystemsSYS1. Autonomous Systems (Automotive, Robotics, Drones)AIQuantum
RESEARCH484

SmartSwap: Swap-Based Memory Optimization for LLM Training Under Varying Operator Sequences

Zibo Wang; Yuhang Zhou; Zhibin Wang; Shipeng Li; Xinjing Huang; Chendong Cai; Bingxu Mu; Yuqing Sun; Zhiheng Hu; Bin She; Shu You; Guanghuan Fang; Rong Gu; Wanchun Dou; Guihai Chen; Chen Tian

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

The increasing size of large language models (LLMs) has led to a surge in memory requirements during training, often exceeding the capacity of high-bandwidth memory (HBM). Swap-based memory optimization incurs neither accuracy loss nor additional end-to-end overhead when effectively overlapped, making it an attractive solution. However, existing swap methods assume consistent operator sequences, which is impractical in Eager Mode, where operator sequences can vary across iterations. We propose SmartSwap, which redesigns the end-to-end process of swap-based memory optimization and is the first work to consider varying operator sequences in Eager Mode. SmartSwap (i) introduces a lightweight online profiler to enable continuous profiling for monitoring operator sequences, (ii) generates effective swap policies with limited operator information, and (iii) optimizes the policy execution module for accurate policy application and better performance. Experimental results demonstrate that SmartSwap reduces profiling overhead by 84.25%, enables training models up to 4× larger than hardware memory while adapting to changes in operator sequences, and improves performance by up to 38.94% compared to recomputation or high-degree parallelism.

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH509

MBA: Mega-Buffer Architecture Based on 1T1C IGZO eDRAM for Efficient LLM Inference

Siyuan Chen; Ruiqi Chen; Jiaqi Li; Shiyang Li; Ao Shi; Yiyang Chen; Yi Xiao; Zhuohua Tang; Lifeng Liu; Peng Huang

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Substantial weight and KV cache transfers between the GPU and HBM severely constrain Large Language Model (LLM) inference system performance. While LLM compression techniques can reduce memory footprint and transfer volume, existing methods are often task-specific and compromise model accuracy. This work proposes a mega-buffer architecture (MBA) that integrates gigabyte-scale on-chip buffers utilizing the embedded DRAM (eDRAM) technology recently introduced by TSMC. To fully leverage this large on-chip memory, we introduce a KV cache prioritized mapping (KVP) scheme that minimizes inefficient KV cache traffic between the HBM and the chip. Furthermore, a highly efficient pipeline integrating a double-buffering mechanism (DB) is co-designed with an iteration-aware eviction strategy (IA) to enhance data reuse and sustain high compute utilization. Evaluation results show that MBA attains 5.99x and 3.63x end-to-end speedup, and 8.86x and 4.93x energy efficiency improvement, against GPU and LLM accelerator baselines, respectively, demonstrating a highly efficient architectural solution for LLM inference.

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH513

AsymCheck: Asymmetric Partitioned Checkpointing for Efficient Large Language Model Training

Zhangqiang Ming; Yuchong Hu; Zhiyuan Luo; Patrick P. C. Lee; Yuanhao Shu; Xinjue Zheng; Wenxiang Zhou; Dan Feng

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

Distributed large language model (LLM) training faces prevalent failures and requires efficient checkpointing. State-of-the-art approaches employ partition-based pipelined checkpointing, splitting checkpoints into partitions for concurrent processing. However, existing solutions rely on a fixed partition size, which our analysis reveals is suboptimal for LLM training: large partitions cause bandwidth stalls during forward passes, while small partitions incur substantial startup overhead during backward passes. We propose AsymCheck, which employs asymmetric partitioning: small partitions for forward passes and large partitions for backward passes, plus selective partition compression and batched flushing optimizations. Evaluation on 64 GPUs shows AsymCheck reduces training time by 20.1%-48.2% over state-of-the-art methods, approaching no-checkpointing efficiency.
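
The asymmetric partitioning idea in this abstract (small partitions during forward passes, large ones during backward passes) can be sketched as a partition plan. The function names, the byte budgets per phase, and the partition sizes below are assumptions made for illustration; the abstract does not state how AsymCheck chooses them, and compression and batched flushing are not modeled.

```python
def split(total_bytes, partition_size):
    """Cut total_bytes into chunks of at most partition_size bytes."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    full, rest = divmod(total_bytes, partition_size)
    return [partition_size] * full + ([rest] if rest else [])


def asymmetric_plan(forward_bytes, backward_bytes, small_size, large_size):
    """Plan checkpoint partitions per training phase.

    Small partitions are used in the forward phase to avoid long bandwidth
    stalls; large partitions are used in the backward phase so that the fixed
    startup cost is paid fewer times.
    """
    return {
        "forward": split(forward_bytes, small_size),
        "backward": split(backward_bytes, large_size),
    }
```

For example, `asymmetric_plan(10, 20, 4, 16)` yields three forward partitions and two backward partitions, where a fixed-size scheme would use the same granularity in both phases.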

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH514

The Phantom of PCIe: Constraining Generative Artificial Intelligences for Practical Peripherals Trace Synthesizing

Zhibai Huang; Chen Chen; James Yen; Yihan Shen; Yongchen Xie; Zhixiang Wei; Kailiang Xu; Yun Wang; Fangxin Liu; Tao Song; Mingyuan Xia; Zhengwei Qi

Date:Monday, July 27 Location:Mtg Room 101A Session:Self-Driving EDA: Agents, Benchmarks, and Evolving Toolchains +1
Abstract

Peripheral Component Interconnect Express (PCIe) is the de facto interconnect standard for high-speed peripherals and CPUs. The development of PCIe devices for emerging applications requires realistic Transaction Layer Packet (TLP) traces that accurately simulate device-CPU interactions. While generative AI offers a promising avenue for synthesizing complex TLP sequences, it is prone to a critical challenge inherent in all generation tasks: hallucination. Naively applying these models often produces traces that violate fundamental PCIe protocol rules, such as ordering and causality, rendering them unusable for device simulation. To resolve this, our work introduces a methodology to bridge the gap between generative AI and high-fidelity device simulation. This paper presents Phantom, a framework that systematically addresses AI-generated hallucinations in TLP synthesis. Phantom achieves this by coupling a generative backbone with a novel post-processing filter that enforces PCIe-specific constraints, effectively eliminating invalid TLP sequences. We validate Phantom's effectiveness by synthesizing TLP traces for an actual PCIe network interface card. Experimental results show that Phantom produces practical, large-scale TLP traces, significantly outperforming existing models, with improvements of up to 1000× in task-specific metrics and up to 2.19× in Fréchet Inception Distance (FID) compared to backbone-only methods. The prototype implementation has been made open-source.

AIAI2-I. AI/ML Algorithms and ModelsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH518

FlexiCTS: CPPR-Aware 3D Clock Tree Synthesis for Face-to-Face Bonded ICs

Shanyi Li; Leilei Jin; Siyuan Liang; Zhen Zhuang; Rongmei Chen; Bei Yu; Tsung-Yi Ho

Date:Monday, July 27 Location:Mtg Room 201B Session:Advances in 3D/2.5D Physical Design +1
Abstract

Face-to-Face 3D ICs enable advanced vertical integration but pose critical challenges for clock tree synthesis (CTS). Existing pseudo-3D flows rigidly partition clock paths across dies, fragmenting common paths and crippling Common Path Pessimism Removal (CPPR), resulting in excessive skew and inefficient hybrid bonding terminal (HBT) utilization. We present FlexiCTS, a correct-by-construction CPPR-aware framework featuring adaptive cross-die buffer assignment to maximize path sharing and optimize HBT allocation. Experimental results show that FlexiCTS achieves 4.1× skew reduction and 88% fewer HBTs than state-of-the-art methods, while simultaneously matching 2D timing quality with superior resource efficiency.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageSYS4. Embedded System Design Tools and Methodologies
RESEARCH520

Duplication-Aware Retiming and Cell Interface Redesign for Superconductor Circuit Minimization

Panagiotis Papanikolaou; Alex Vanasse; Haoran Jin; Georgios Tzimpragos; Jennifer Volk

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

Superconductor electronics have increasingly shifted away from RSFQ and its variants toward logic families that eliminate explicit gate-level clocking. While this transition enables simpler circuits and more efficient architectures, it also introduces an implicit reliance on dual-rail codes, resulting in inherent gate duplication. This work presents a duplication-aware retiming methodology for Josephson junction (JJ) count minimization, co-optimizing register placement and polarity assignment. The approach applies beyond SFQ to any monotonic circuit. We further identify cell interfaces—specifically, interconnect drivers, receivers, and fanout (FO) elements—as dominant contributors to JJ count in each cell. A new amplifier design is introduced to reduce these costs, integrated within existing SFQ cells, experimentally verified, and characterized to form a new SFQ cell library. Our results demonstrate a 63-71% JJ count reduction in single-cycle implementations and 41-66% reduction in multi-cycle implementations compared to the prior state-of-the-art. The latter establishes a new Pareto frontier, achieving both shorter critical paths and lower JJ counts than the best-to-date single-cycle implementations.

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH523

LeGend: A Data-Driven Framework for Lemma Generation in Hardware Model Checking

Mingkai Miao; Guangyu Hu; Wei Zhang; Hongce Zhang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

The IC3/PDR algorithm is a cornerstone of hardware model checking, and machine learning (ML) has been explored to guide its critical inductive generalization step. However, prior ML methods are built upon a per-clause graph analysis paradigm, requiring repetitive and costly graph processing for every clause, creating a severe scalability bottleneck. Therefore, we introduce LeGend, a framework that completely replaces this paradigm with one-time global representation learning. LeGend architects a domain-adapted self-supervised learning task to generate latch embeddings that encode global circuit properties. These pre-computed embeddings enable a lightweight model to predict high-quality lemmas with negligible overhead, effectively decoupling expensive learning from fast inference. Experiments show our approach accelerates two state-of-the-art IC3/PDR engines across a diverse set of benchmarks, presenting a promising path to scale up formal verification.

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH529

Onyx: Efficient Transaction Processing with Real Processing-in-Memory Prototypes

Menglei Chen; Yixiao Wang; Yu Hua

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

A transaction ensures an all-or-nothing guarantee for a set of read/write operations. While modern transaction processing systems provide atomicity and consistency for data center applications, their performance is fundamentally limited by a critical memory bandwidth bottleneck, caused by massive numbers of read/write operations. Processing-in-memory (PIM) architectures offer a promising solution with their high aggregate bandwidth. However, directly applying PIM becomes inefficient in practice due to load imbalance and costly data transfers among PIM processors. To bridge the gap between transaction processing and PIM architectures, this paper presents Onyx, a high-throughput transaction processing system that efficiently executes transactions on a real UPMEM PIM prototype. To fully leverage the high parallelism and bandwidth of PIM, Onyx orchestrates the transaction execution workflow to maximize local accesses while maintaining load balance across PIM processors. It further employs rank-level asynchronous data transfer to reduce communication overhead. Extensive evaluation using real-world YCSB and TPC-C benchmarks shows that Onyx significantly outperforms state-of-the-art transaction processing systems.

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH535

MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

Rongliang Fu; Yi Liu; Qiang Xu; Tsung-Yi Ho

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimization scripts, their potential for core algorithm enhancement remains untapped. We introduce MappingEvolve, an open-source framework that pioneers the use of LLMs to directly evolve technology mapping code. Our method abstracts the mapping process into distinct optimization operators and employs a hierarchical agent-based architecture, comprising a Planner, Evolver, and Evaluator, to guide the evolutionary search. This structured approach enables strategic and effective code modifications. Experiments show our method significantly outperforms direct evolution and strong baselines, achieving 10.04% area reduction versus ABC and 7.93% versus mockturtle, with 46.6%–96.0% S_overall improvement on EPFL benchmarks, while explicitly navigating the area–delay trade-off. We have open-sourced our framework to foster reproducibility and further research.

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH538

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services

Haoyu Chen; Xue Li; Kun Qian; Yu Guan; Jin Zhao; Xin Wang

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

In Large Language Model (LLM) inference services, it is challenging to configure a parallelism strategy that efficiently processes requests with varying context lengths. Long-context requests require a high degree of parallelism to provide more memory for the Key-Value (KV) cache, while short-context requests prefer a low degree of parallelism to increase concurrency and thus improve throughput. To maintain high throughput while supporting large context lengths on demand, we propose Amoeba, a runtime Tensor Parallel (TP) transformation for online LLM inference services, which adaptively adjusts the TP degree of running instances to align with the dynamics of incoming requests. Evaluations using real-world traces show that Amoeba improves throughput by 1.75×-6.57× compared to state-of-the-art solutions.
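
The trade-off in this abstract between context length and tensor-parallel (TP) degree can be illustrated with a back-of-the-envelope memory check. The sketch below is not Amoeba's policy. It assumes weights and KV cache are sharded evenly across TP ranks, ignores activations and fragmentation, and uses hypothetical function names and model dimensions.

```python
GIB = 2 ** 30


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache bytes for one token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def min_tp_degree(context_len, weight_bytes, gpu_bytes, per_token_bytes,
                  candidates=(1, 2, 4, 8)):
    """Smallest TP degree whose aggregate GPU memory holds the weights plus
    the KV cache of one request, or None if no candidate is large enough."""
    need = weight_bytes + context_len * per_token_bytes
    for tp in sorted(candidates):
        if need <= tp * gpu_bytes:
            return tp
    return None


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, fp16 values.
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
```

Under these assumptions, a 4,096-token request fits on one 24 GiB GPU next to 16 GiB of weights, while a 131,072-token request needs two GPUs. A service with a fixed TP degree therefore either over-provisions short requests or cannot admit long ones.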

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH544

DRCA: Reliable Bitwise Logic in DRAM via Dual-Rail Compute and Access

Yuxuan Qin; Chuxiong Lin; Shuo Yan; Guoming Rao; Weiguang Sheng; Weifeng He

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

DRAM-based processing-in-memory (PIM) has emerged as a promising approach to alleviate the memory wall by executing massively parallel bitwise operations inside DRAM arrays. However, most prior designs operate on a single bitline and leave the dual-rail complementary signals naturally available on each bitline pair underutilized. We present DRCA, a novel dual-rail compute-and-access scheme that exploits both rails of a bitline pair for computation and data access. DRCA integrates two dual-contact compute cells, DRCA-OR and DRCA-XOR, which leverage full-swing dual-rail signaling on the bitline pair to perform bitwise logic with high reliability. Furthermore, the design enables concurrent operations on a single bitline pair, improving the efficiency of complex bitwise logic processing. Our evaluation shows that DRCA reduces the failure rate by 2.29x compared with the most robust prior PIM design, while delivering superior average performance on basic bitwise operations.

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH550

EPiCell: Electro-Physical Co-Modeling for Standard Cell PPA Prediction

Wenbo An; Kairong Guo; Haoyi Zhang; Yibo Lin

Date:Tuesday, July 28 Location:Mtg Room 101B Session:AI Graces the Cell Party: Transformers, LLMs and Zero Drama Extraction +1
Abstract

In advanced process nodes, the pursuit of extreme PPA optimization has driven an explosion in the demand for customized standard cells. To satisfy this demand, automated layout synthesis has been increasingly adopted to explore vast design spaces. However, this paradigm shifts the bottleneck from design creation to verification, as characterizing the massive volume of generated variants via SPICE is computationally prohibitive. Meanwhile, conventional geometric heuristics fail to proxy PPA at advanced nodes due to dominant layout effects. Existing learning-based surrogates often lack the fidelity to capture these complex dependencies. To bridge this gap, we propose EPiCell, an electro-physical co-modeling framework for rapid PPA estimation. EPiCell features a Heterogeneous Graph Transformer (HGT) that explicitly models transistors, routing metals, and supply rails as distinct entities, unifying circuit topology with fine-grained layout geometry. By employing relation-aware attention, it effectively captures the non-local electro-physical interactions governing cell performance. Validated on a dataset of over 18,000 auto-generated layouts based on ASAP7, EPiCell achieves high fidelity against SPICE simulations, with low average prediction errors of 1.82% for leakage power, 4.01% for internal power, 3.06% for delay, and 3.29% for transition. Crucially, it demonstrates superior ranking consistency with SPICE, attaining a median Spearman Rank Correlation Coefficient of 0.90 for internal power, 0.81 for delay, and 0.70 for transition. This offers a scalable surrogate model to enable efficient design space exploration.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH552

UCME: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

Jinpeng Ye; Chongxi Wang; Wenqing Li; Bin Yuan; Shiyi Wang; Fenglu Zhang; Junyu Yue; Jianan Xie; Yunhao Ye; Haoyu Deng; Yingkun Zhou; Xin Cheng; Fuxin Zhang; Jian Wang

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Matrix extensions are essential in modern CPUs, but existing implementations with tight pipeline coupling and fine-grained synchronous instructions hinder integration and high-performance kernel implementation. We propose an adaptable matrix extension that carefully decouples the matrix unit from the CPU pipeline for low-overhead integration, unified software support, and reuse of existing compute and memory resources. It supports asynchronous operations and enables efficient matrix–vector overlap. Integrated into four open-source RTL CPUs, the design achieves over 90% utilization and up to 2.31× speedup on ResNet50, BERT, and Llama3 over Intel AMX, with 30% gains from matrix–vector overlap, while a 4 TFLOPS implementation at 14 nm occupies 0.53 mm².

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH553

From Fluid Dynamics to Chip Design: PDE Foundation Model Address Data Bottleneck in 3D-ICs Thermal Simulation

Zhen Huang; Xin HaiYang; Dongming Jiang; Yangbo Wei; Hong Wang; Yang Wenkai; Yu Zhang; Wei Xing; Ting-Jung Lin; Lei He

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

Machine learning can reduce 3D-IC thermal analysis runtime from hours to seconds, yet most existing ML-based thermal models train each chip independently from scratch, demanding large datasets while failing to exploit the mathematical equivalence between heat conduction and diffusion-type PDEs. We demonstrate that foundation neural operators pretrained on diverse PDE families enable effective transfer to 3D-IC thermal simulation. Building on this insight, we develop PNO-Therm, a specialized model obtained through targeted fine-tuning, which surpasses the previous state-of-the-art method using less than 20% of the training data. At equal dataset sizes, our method achieves 6–10× lower MAE, 3.5× lower GPU memory consumption, and over 3× faster training while maintaining approximately 940× speedup versus FEM solvers. Validated across three representative 3D-IC designs, PNO-Therm establishes that pretrained neural operators provide a scalable pathway for high-accuracy thermal modeling under limited data.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH561

cuPilot: A Strategy-Coordinated Multi-Agent Framework for CUDA Kernel Evolution

Jinwu Chen; Qidie Wu; Bin Li; Lin Ma; Xin Si; Yang Hu; Jun Yang

Date:Monday, July 27 Location:Mtg Room 101A Session:Self-Driving EDA: Agents, Benchmarks, and Evolving Toolchains +1
Abstract

Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that the kernels generated by cuPilot achieve an average speedup of 3.09× over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at https://anonymous.4open.science/r/cuPilot-Kernels-1656.

AIAI2-I. AI/ML Algorithms and ModelsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH574

Multi‑GPU Tensor‑Level Fusion and Adaptive Memory Management for Private Inference

Homer Gamil; Michail Maniatakos

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Privacy-preserving machine learning via Fully Homomorphic Encryption (FHE) offers strong data confidentiality guarantees but suffers from prohibitive computational overhead. In this work, we improve the performance of private inference via fully homomorphic encryption on multi‑GPU environments. We achieve this through three key innovations: (1) an optimized tensor‑level FHE library based on the CKKS scheme that fuses low‑level encrypted primitives into high‑throughput GPU kernels, (2) an adaptive memory manager that dynamically orchestrates encrypted‑tensor placement to control peak memory usage, and (3) a multi‑GPU execution engine that partitions and balances workloads across devices to maximize utilization under constrained memory budgets. We evaluate our method on representative encrypted‑ML benchmarks and compare against state‑of‑the‑art CPU- and GPU‑based FHE systems, showing comparable single-GPU results since no existing work provides a multi-GPU implementation. Our eight-GPU configuration achieves strong scaling of 7.14x over the single-GPU execution, which in turn yields a 9.7x end-to-end cumulative improvement over the best prior work. We additionally implement all fused tensor-level operators within a full ResNet-20 inference pipeline on CIFAR-10, demonstrating that the method generalizes beyond MNIST-scale workloads. Results show efficient parallel scaling, with 8-GPU execution on ResNet-20 reaching 7.1x speedup over the state of the art under realistic memory constraints.
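
As a quick reading aid for the scaling figure in this abstract, parallel efficiency is the measured speedup divided by the number of devices. The helper below is a generic definition with a hypothetical function name, not code from the paper.

```python
def parallel_efficiency(speedup, num_devices):
    """Fraction of ideal linear scaling achieved by a parallel run."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return speedup / num_devices


# The reported 7.14x speedup on eight GPUs, taken from the abstract.
EIGHT_GPU_EFFICIENCY = parallel_efficiency(7.14, 8)
```

By this definition, the reported 7.14x on eight GPUs corresponds to roughly 89% of ideal linear scaling.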

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH577

Closing the Loop: Hybrid NAS Guided by Analytical and Hardware-Calibrated Quantum Cost Modeling

Muhammad Kashif; Alberto Marchisio; Muhammad Shafique

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Pioneering the Foundations of Quantum Circuit Synthesis, Compilation and Algorithmic Optimization +1
Abstract

Estimating the true hardware cost of quantum machine learning (QML) models is challenging due to repeated circuit evaluations affected by noise, decoherence, and routing delays. Conventional metrics like gate count overlook such hardware-dependent effects. We propose an analytical quantum cost model that estimates required quantum hardware resources using real device calibration data, incorporating gate durations, routing overheads, and noise-induced inefficiencies. Complementing this, a classical cost model converts FLOPs into equivalent units, providing a unified hardware-aware hybrid cost metric. Integrating both, we then propose the Hyb-HANAS framework, which employs multi-objective NAS (NSGA-II) to jointly optimize accuracy, execution time, and parameter count in hybrid quantum–classical networks.

DesignQuantumDES6. Quantum ComputingAI
RESEARCH588

Florella: Accelerating Graph Neural Networks by Sliding Reduction Convention and Flexible Architecture

Xiaobo Lu; Jianbin Fang; Lin Peng; Yang Liu; Yixiang Di; Chun Huang

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

The rapid evolution of graph neural networks (GNNs) has introduced diverse computational patterns beyond the message-passing paradigm. To support end-to-end inference, existing accelerators need to integrate heterogeneous, loosely-coupled components, incurring significant inter-accelerator communication and low performance density. This paper introduces Florella, an acceleration framework for unifying inconsistent computational patterns in end-to-end GNN inference. We first propose the sliding reduction convention, a declarative language that provides a flexible and hardware-friendly representation for diverse GNN operations. Building upon this, we design a versatile architecture that enables atomic mapping of macro-operations. This architecture is centered around a novel Jacobian-logarithm unit, which enables high hardware reuse across operators by leveraging logarithmic transformation and approximation. Evaluated across a range of GNN models, Florella achieves an average speedup of 2.2x and reduces memory traffic by 2x compared to four state-of-the-art accelerators, while improving performance density by 3.3x.
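
The Jacobian logarithm named in this abstract is a standard identity: log(e^a + e^b) = max(a, b) + log(1 + e^(-|a - b|)). The sketch below shows the identity and the common max-log approximation that drops the correction term. How Florella's hardware unit applies or approximates it is not stated in the abstract, so the function names and the reduction example are illustrative assumptions.

```python
import math


def jacobian_log(a, b):
    """Exact log(exp(a) + exp(b)), computed without overflow."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a, b):
    """Max-log approximation: keep the maximum, drop the correction term.
    The error is at most log(2), reached when a == b."""
    return max(a, b)


def log_domain_sum(values):
    """Sum positive numbers by accumulating in the log domain."""
    acc = math.log(values[0])
    for v in values[1:]:
        acc = jacobian_log(acc, math.log(v))
    return math.exp(acc)
```

Because the same max-plus-correction step can express a sum, and a maximum once the correction is dropped, a single datapath built around it can plausibly be reused across several reduction operators.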

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH607

COLA: Enabling Low-Latency Reads for Flash-Based SSDs via Code Length Adaptation

Zhengyao Ding; Dingxin Wang; Xiaolu Li; Patrick P. C. Lee; Lingling Song; Rui Lu; Yichen Zhang; Yuchong Hu; Dan Feng

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

Flash-based SSDs are prone to bit errors in flash cells. To ensure reliability, SSDs employ error correction codes (ECCs) that encode user data bits into codewords composed of data and redundant parity bits, enabling correction of a limited number of bit errors. In practice, however, the raw bit error rate (RBER) of SSDs fluctuates over time. We observe that for ECCs with a fixed code rate (the fraction of user data bits per codeword), longer codewords provide higher reliability, while shorter codewords offer lower read latency. To this end, we propose COLA, an adaptive coding framework that optimizes the code length (i.e., total number of user data and parity bits per codeword) for individual SSD pages to achieve low-latency reads while maintaining reliability guarantees and constant storage overhead. COLA adopts a failure-aware read mechanism that selectively transfers and decodes failed codewords, and integrates a failure-aware read-latency model to determine actual read operations based on current RBERs and select the optimal code length for each write operation. Evaluation using the MQSim simulator shows that COLA significantly reduces average and tail read latencies compared to the default fixed-code-length approach.
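
The abstract's claim that storage overhead stays constant while code length varies follows from fixing the code rate. The sketch below works through that arithmetic. The page size, the 8/9 code rate, and the function name are illustrative assumptions, not parameters from the paper.

```python
from fractions import Fraction


def codeword_layout(page_bits, code_length, code_rate):
    """Describe how one page divides into codewords of a given length.

    code_rate is the fraction of user data bits per codeword, so each
    codeword carries code_length * code_rate data bits and the rest is parity.
    """
    data_bits = Fraction(code_rate) * code_length
    if page_bits % code_length or data_bits.denominator != 1:
        raise ValueError("page and codeword sizes must divide evenly")
    data_bits = int(data_bits)
    parity_bits = code_length - data_bits
    count = page_bits // code_length
    return {
        "codewords": count,
        "data_bits": data_bits,
        "parity_bits": parity_bits,
        "total_parity": count * parity_bits,
    }


RATE = Fraction(8, 9)                              # hypothetical fixed code rate
SHORT = codeword_layout(36864, 4608, RATE)         # eight short codewords
LONG = codeword_layout(36864, 36864, RATE)         # one page-sized codeword
```

Both layouts spend 4,096 parity bits per page, so changing the code length does not change the storage overhead. With short codewords, a failure-aware read only has to re-transfer and decode the codewords that failed rather than the whole page.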

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH609

Knowledge-Driven Hybrid SSD Management Enhanced by Fine-Tuned LLMs

Qian Wei; Yi Li; Zehao Chen; Tianren Zhou; Zhaoyan Shen; Dongxiao Yu; Bingzhe Li

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

A hybrid solid-state drive (SSD) integrates different modes of flash cells (e.g., single-level cell (SLC) and quad-level cell (QLC)) and enables them to convert between each other, achieving both high performance and storage capacity. However, this hybrid design introduces a significantly larger design space than traditional SSDs with additional design factors such as flash conversion and data migration across different flash modes, leading to higher optimization complexity. Efficient management of such complexity requires deep hybrid SSD knowledge and dynamic adjustment mechanisms. Large language models (LLMs) offer a promising solution through their contextual reasoning and adaptive coordination capabilities. In this work, we explore the potential of using LLMs in understanding and efficiently managing the hybrid SSD design space. We find that leveraging LLMs for knowledge-guided optimization of management parameters enables substantial performance gains. Building on these insights, we propose LLM-hybridSSD, an integrated optimization framework that formulates hybrid SSD management as a parameter-tuning problem, employs an LLM-based tuner for adaptive configuration, and applies reinforcement learning-based fine-tuning to align local lightweight models with domain-specific knowledge. Experimental results show an average 58.92% increase in throughput and a 28.56% reduction in write amplification (WA) compared with state-of-the-art schemes under different real-world workloads.

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH616

Stateful Embedded Fuzzing Using Peripheral-Accurate SystemC Virtual Prototypes

Chiara Ghinami; Igor Tresolavy; Luis Seibt; Nils Bosbach; Rainer Leupers

Date:Monday, July 27 Location:Mtg Room 203C Session:Streamlining Silicon: From Real-Time Scheduling to Efficient GPU Execution +1
Abstract

The increasing complexity of embedded software has made comprehensive manual testing impractical, motivating the use of automated techniques such as fuzzing. Coverage-guided fuzzers like AFL++ have shown strong results for conventional software but remain challenging to apply effectively in embedded contexts, where peripheral behaviors play critical roles. Existing approaches either use fast user-mode simulators, sacrificing peripheral realism, or rely on full-system simulators with manual instrumentation, limiting applicability to large-scale software. In this work, we present a novel framework that integrates AFL++ with a stateful SystemC-TLM virtual prototype to enable realistic fuzzing of embedded software. Fuzzer-generated inputs are injected directly into peripheral models, allowing peripherals to trigger natural side effects such as interrupts and FIFO updates. By integrating fuzzing with full-system simulation, our framework advances the effectiveness of pre-silicon testing for embedded systems. Results on embedded workloads show that our approach eliminates false positives while maintaining code coverage and execution performance comparable to state-of-the-art tools.

SystemsSYS4. Embedded System Design Tools and MethodologiesChipletEDA
RESEARCH617

UniNL: Unifying Fragmented Non-Linear Operators for Efficient Edge LLM Inference

Zhengxuan Hu; Zhihua Fan; Shantian Qin; Yudong Mu; Wenming Li; Xiaochun Ye

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Deploying large language models (LLMs) on edge devices is critical for user privacy and low-latency inference. However, existing matrix-centric LLM accelerators suffer from limited throughput when handling fragmented non-linear operators due to tight edge-side resource constraints, and the absence of a general fusion mechanism frequently results in severe PE array underutilization during non-linear execution phases. To address these challenges, we propose UniNL, a unified execution framework designed to efficiently handle non-linear operators in LLMs. First, we introduce an abstraction model that decomposes non-linear operators into a sequence of primitives, leveraging polynomial approximation to enable point-wise operations to be executed on the PE array. Second, we implement lightweight microarchitectural extensions to the accelerator to efficiently support these primitives. Finally, we devise a primitive-aware fusion mechanism that effectively hides non-linear latency behind matrix operations. Experimental results demonstrate that UniNL achieves a 6.89× speedup and 7.26× energy-efficiency improvement on average compared to a Jetson AGX Orin baseline. Furthermore, UniNL outperforms state-of-the-art designs by 1.46× and 1.35× in average performance, respectively.

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH625

AttestLLM: Efficient Attestation Framework for Billion-Scale On-Device LLMs

Ruisi Zhang; Yifei Zhao; Neusha Javidnia; Mengxin Zheng; Farinaz Koushanfar

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Robust and Trusted AI Systems: Attacks, Defenses, and Hardware Reliability +1
Abstract

This paper presents AttestLLM, the first attestation framework to protect device vendors' hardware-level intellectual property by ensuring that only authorized large language models (LLMs) can execute on target platforms. To overcome the scalability and efficiency limitations of prior work, AttestLLM leverages an algorithm/software/hardware co-design approach to embed robust watermarks onto the activation of LLM layers. In addition, it optimizes the attestation protocol within the trusted execution environment, providing efficient ownership verification without compromising inference throughput. Evaluations on various on-device LLMs demonstrate AttestLLM's attestation reliability, fidelity preservation, and efficiency. Furthermore, AttestLLM exhibits resilience against forgery, replacement, and system attacks.

SecuritySEC1. AI/ML Security/PrivacyAIQuantum
RESEARCH628

Faster-MoA: Low-Latency Tree-Structured MoA Serving with Early Exit and Agent-Aware Prefill-Decode Overlap

Zijun Wang; Yijiahao Qi; Hanqiu Chen; Zishen Wan; Gongjin Sun; Dongyang Li; Shuyi Pei; Cong "Callie" Hao

Date:Monday, July 27 Location:Mtg Room 101B Session:Computing Without Waste: Lean and Scalable Neural Architectures +1
Abstract

Mixture-of-Agents (MoA) is a widely adopted multi-agent paradigm, but existing MoA systems face two major challenges: excessive agent-to-agent connectivity and poor hardware efficiency. To address these two issues, we propose Faster-MoA, a unified algorithm-system co-design for efficient MoA serving. Faster-MoA has three innovations. First, we replace the conventional all-to-all topology with a hierarchical tree structure that introduces structured sparsity in agent connections. Second, we develop a run-time dynamic agent early-exit mechanism that prunes unnecessary agent connections based on output semantic similarity and answer confidence. Third, we propose an agent-dependency-aware incremental prefilling mechanism that overlaps prefilling and decoding among agents with data dependencies to reduce inference latency. Together, these three innovations enable Faster-MoA to reduce end-to-end serving latency by up to 90% while achieving similar (only ±1% variation) or even higher task accuracy compared with MoA baselines using all-to-all agent connection.

AIAI3-I. AI/ML Application and InfrastructureChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH640

AMBER: A Unified Accelerator for Multi-Precision LLM Inference Exploiting Bit-Level Redundancy and Reconfigurability

Yubin Qin; Yang Wang; Zhiwei Lin; Yushu Zhao; Xiaolong Yang; Zhiheng Yue; Huiming Han; Yang Hu; Shouyi YIN

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

Modern LLMs use diverse integer, floating-point, and microscaling (MX) precisions, but most accelerators are optimized for only a few formats. We propose AMBER, a general-purpose LLM accelerator for plug-and-play multi-precision deployment and compression of weights and KV caches in cloud and edge. AMBER introduces bit-transposed encoding that exploits bit-level statistical concentration for lossless compression across all of INT, FP, and MX formats. A precision-agnostic, stage-pipelined bit-serial PE further reuses bit-level redundancy for efficient versatile precision computation. Evaluated on 8 LLMs and 10 formats, AMBER boosts memory efficiency 1.17× and compute 3.16×, surpassing Olive and Tender in throughput and energy.

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH641

TorR: Towards Brain-Inspired Task-Oriented Reasoning via Cache-Oriented Algorithm-Architecture Co-Design

Hyunwoo Oh; SungHeon Jeong; Suyeon Jang; Hanning Chen; Sanggeon Yun; Tamoghno Das; Mohsen Imani

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

Task-oriented object detection on CLIP enables open-vocabulary, prompt-driven semantics, but dense alignment and memory traffic block real-time edge use. We propose TorR, a brain-inspired algorithm–architecture co-design that replaces CLIP-style dense matching with hyperdimensional associative reasoning and exploits temporal coherence. TorR combines HDC similarity, graph reasoning with query caching and delta updates, and a lane-scalable, precision-gated item memory with RT-30/RT-60 control. A 28 nm TorR accelerator delivers real-time detection with millijoule energy and competitive AP at orders-of-magnitude lower energy.

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH648

Differentiable Fill Insertion with Explicit Delay Optimization

Jinoh Cho; Jinmo Ahn; Jakang Lee; Jaeseung Lee; Seonghyeon Park; Seokhyeong Kang

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Differentiable Dreams: GPU-Powered Litho Gets Its Gradient Groove Back +1
Abstract

Fill insertion needs to consider not only density uniformity but also delay. Existing algorithms reduce such delay implicitly by minimizing proxies (fill amounts, overlays between fills, etc.), but a misalignment remains between the reduction of these proxies and the improvement of signal delay. We propose DiffFill, a novel differentiable framework for fill insertion that explicitly optimizes both uniformity and delay. At the heart of DiffFill is the CapFormer, a Transformer-based capacitance extractor that estimates capacitance values used to construct a differentiable delay objective based on the Elmore delay formulation. DiffFill significantly outperforms the state-of-the-art methods.
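For readers unfamiliar with the Elmore delay formulation referenced above, the following is a minimal sketch for the simplest case, a single RC chain driven from one end: the delay at the far end is the sum over each resistor of its resistance times all capacitance downstream of it. This is the textbook model only, not DiffFill's objective; the function name `elmore_delay_chain` and the example values are hypothetical.

```python
def elmore_delay_chain(resistances, capacitances):
    """Elmore delay at the far end of an RC chain.

    resistances[k] is the series resistance of segment k (source to sink
    order) and capacitances[k] is the capacitance to ground at the node
    after segment k. Units are the caller's choice (e.g. ohms and farads).
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one capacitance per resistance")
    delay = 0.0
    downstream_cap = 0.0
    # Walk from the sink back to the source, accumulating the capacitance
    # that each resistor has to charge.
    for r, c in zip(reversed(resistances), reversed(capacitances)):
        downstream_cap += c
        delay += r * downstream_cap
    return delay


if __name__ == "__main__":
    base = elmore_delay_chain([1.0, 2.0], [3.0, 4.0])
    # Extra coupling capacitance at the sink node (e.g. from nearby fill).
    with_fill = elmore_delay_chain([1.0, 2.0], [3.0, 5.0])
    print(base, with_fill)
```

Because the expression is a sum of products of resistances and capacitances, it is differentiable with respect to each capacitance, which is what makes it usable as a gradient-based objective.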

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH649

Mask-Guided Diffusion with Process-Aware Metric Optimization for Long-Tailed Wafer Defect Analysis

Yumeng Liu; Qian Jin; Xiaotian Qiu; Qi Sun; Cheng Zhuo

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Leveraging AI for Silicon Health: Test, Yield & Fault Tolerance +1
Abstract

In integrated circuit manufacturing, the detection of nanoscale wafer defects is critical for yield improvement and root-cause analysis. However, scanning electron microscopy (SEM) defect datasets in modern production lines are typically long-tailed, with pronounced intra-class diversity and subtle inter-class differences. Most existing methods do not take these properties into account, leading to low recall on tail defects and frequent confusion between categories. To address this, we propose DiffTail, a framework that combines mask-guided diffusion with process-aware metric optimization. The mask-guided diffusion uses defect masks together with SEM-specific prompts that encode defect knowledge to synthesize tail defects at specified locations and of specified types, while latent-space fusion with normal SEM images preserves process-consistent background textures. The process-aware metric optimization module groups class prototypes based on image features that are correlated with process steps. It then applies inter-cluster separation and margin-based constraints to easily confused class pairs, making visually similar defects easier to distinguish. Extensive experiments show that DiffTail improves tail-class detection and segmentation over state-of-the-art methods. To facilitate reproducibility, a subset of the data is available at https://anonymous.4open.science/r/Dataset-107B, and the full dataset will be released upon acceptance.

EDAEDA9. Test, Validation and Silicon Lifecycle ManagementAIQuantum
RESEARCH651

Cache Where It Counts: Towards Workload-Aware I/O Coordination for NUMA Storage Systems

Wenda Tang; Yanwen Wang; Yiduo Wang; Jie Wu

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Scaling the Bedrock: OS-Hardware Co-Design for High-Reliability Storage +1
Abstract

Modern Non-Uniform Memory Access (NUMA) servers equipped with distributed Non-Volatile Memory Express (NVMe) storage present new challenges for I/O coordination. First, conventional page cache allocators ignore storage topology, resulting in inefficient cross-node cache placement and performance issues. Second, CPU schedulers overlook cache and storage locality, causing frequent cross-node cache accesses and further degrading performance. Third, the OS page cache employs a rigid eviction policy that fails to adapt to diverse application workloads, while even programmable alternatives often require complex manual tuning. To address these challenges, we propose Laelaps, a workload-aware I/O coordination framework for NUMA storage systems. Laelaps introduces three key techniques: (1) storage-topology-aware page cache placement, co-locating cache pages with their backing NVMe devices; (2) storage-topology-aware I/O thread scheduling that aligns thread placement with cache distribution to minimize cross-node accesses; and (3) workload-aware adaptive page caching that automatically selects eviction policies based on observed application access patterns. Evaluation shows that Laelaps achieves 1.41× geometric mean throughput improvement with 3.5% runtime overhead.

SystemsSYS5. Embedded Memory and Storage SystemsQuantumDES3. Emerging Models of Computation +1 more
RESEARCH656

ASG-MOENN: Adaptive Sparsely-Gated Mixture of Experts Neural Network for Digital Predistortion in Quadrature Switched-Capacitor Power Amplifiers

Jiayu Yang; Wending Zhao; Yicheng Li; Yinyin Lin; Yun Yin; Hongtao Xu

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Taming the Physical World: AI Strategies for Circuits and Cores to Quantum Machines +1
Abstract

For the first time, this paper introduces an adaptive sparsely-gated mixture of experts neural network (ASG-MOENN) as a novel digital predistortion (DPD) framework for wide dynamic power range quadrature switched-capacitor power amplifiers (SCPAs). State-of-the-art (SOTA) DPD models suffer from fixed, high computational complexity, leading to inefficiency at lower power levels. To overcome this limitation, the proposed ASG-MOENN employs a dual-stage adaptive mechanism. First, it embeds the underlying physical principles of SCPA power variation to condition the input signal. Second, it incorporates an adaptive sparse gating mechanism that dynamically determines both which experts to activate and how many are required based on a cumulative confidence criterion, allowing the model to flexibly scale its run-time complexity according to the PA's operating state. Experimental results demonstrate that, compared with the SOTA baseline, the proposed model achieves superior linearization performance while reducing the average computational load by up to 50% across a 30-dB dynamic power range.
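The cumulative confidence criterion described above can be illustrated with a small sketch: experts are ranked by gate probability and activated one by one until their combined probability reaches a threshold, so the number of active experts varies per input. This is an assumed, generic formulation (similar in spirit to top-p selection), not the paper's implementation; `select_experts` and the `confidence` parameter are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(gate_logits, confidence=0.9):
    """Pick the fewest top-ranked experts whose cumulative gate
    probability reaches `confidence`.

    Returns a dict mapping expert index to its renormalized weight.
    """
    probs = softmax(gate_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= confidence:
            break
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}


if __name__ == "__main__":
    print(select_experts([10.0, 0.0, 0.0, 0.0]))  # confident gate: one expert
    print(select_experts([0.0, 0.0, 0.0, 0.0]))   # uncertain gate: all experts
```

A peaked gate distribution activates a single expert, while a flat one activates many, which is one way run-time complexity could scale with the amplifier's operating state.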

AIAI3-II. AI/ML Application and InfrastructureQuantumDES3. Emerging Models of Computation +1 more
RESEARCH660

TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

Hyunwoo Oh; Hanning Chen; Sanggeon Yun; Yang Ni; Suyeon Jang; Behnam Khaleghi; Fei Wen; Mohsen Imani

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

Multimodal stacks mixing ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms due to heterogeneous compute and tight real-time constraints. We present TRINE, a single-bitstream FPGA accelerator and compiler that runs end-to-end multimodal inference without reconfiguration. It unifies layers as DDMM/SDDMM/SpMM on a mode-switchable PE array supporting weight/output-stationary systolic, 1×CS SIMD, and a routable adder tree with in-stream top-k token pruning. Dependency-aware layer offloading overlaps independent kernels across RPUs. On Alveo U50 and ZCU104, TRINE achieves up to 22.57× and 6.86× lower latency than RTX 4090 and Jetson Orin Nano at 20–21 W, with <2.5% accuracy loss and state-of-the-art efficiency.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH663

EmuDRop: A Minimalistic, Hardware-Based Emulation Detection Approach Through Intel Reserved Opcodes

Kaiyuan Rong; Ke Fang; Pengfei Qiu; Dapeng Ju; Dongsheng Wang

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

Emulation-based dynamic analysis detects malicious software by observing runtime behavior in a controlled environment. Malware increasingly adopts evasion techniques to recognize such environments and hide malicious activities. In this paper, we propose EmuDRop, a minimalistic emulation-detection attack that exploits Intel reserved opcodes. EmuDRop leverages microarchitectural differences between real hardware and emulators to identify emulated execution. We reverse-engineer reserved opcodes, characterize their microarchitectural effects, and evaluate EmuDRop on five Intel CPU cores and QEMU, a widely used and representative emulator. The results show that EmuDRop reliably identifies emulated environments.

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH670

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Zhenyu Ning; Guangda Liu; Qihao Jin; Chengwei Li; Wenchao Ding; Minyi Guo; Jieru Zhao

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Video LLMs have achieved remarkable performance in processing hour-long videos, yet they suffer from severe memory overhead and latency due to expanding KV caches—critical bottlenecks for real-world online applications. To tackle this, LiveVLM is proposed: a training-free, query-agnostic framework featuring two key mechanisms. The Vision Sink Bucketing (VSB) processes video streams in real time, retains long-term details, and eliminates redundant KVs, while the Position-agnostic KV Retrieval (PaR) decouples positional embeddings to reduce irrelevant context interference via efficient page-level retrieval. Extensive experiments confirm LiveVLM delivers state-of-the-art accuracy on LLaVA-OneVision, outperforming both training-free query-agnostic methods and training-based online models.

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH675

SAEM: Stage-Aware Expert Management for Memory-Efficient MoE Inference in Chain-of-Thought Reasoning

Yujie Zhang; Bin Gao; Tulika Mitra

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

Chain-of-thought (CoT) prompting enhances LLM reasoning by decomposing tasks into intermediate steps, but it increases decoding latency and memory usage. Mixture-of-Experts (MoE) models improve scalability via sparse expert activation, yet expert weights often exceed GPU capacity. Existing runtimes treat tokens uniformly, missing structural regularities where consecutive reasoning stages exhibit coherent expert activation patterns. This causes inefficient expert caching and excessive GPU-CPU data movement. We present SAEM, a stage-aware inference runtime that detects reasoning stage boundaries and exploits stage-level coherence in expert activation. SAEM applies stage-aware caching for GPU-CPU placement, combining token packing with in-situ CPU execution. SAEM achieves 1.54× speedup over state-of-the-art baselines under memory constraints.

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH676

CLUE-ECC: Low-Cost LPDDR6 Reliability Enhancement by Integrating System ECC with On-Die ECC

Youngbae Kong; Jae-Youn Hong; Jongwoo Jeon; Junki Lee; Young Jung Kang; Joon-Sung Yang

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

DRAM scaling increases vulnerability to peripheral faults, making symbol-level ECCs essential. LPDDR6 introduces a 12-bit data beat that misaligns with Reed-Solomon (RS) codes with 8- or 16-bit symbols. An 8-bit RS code meets on-die parity budgets but cannot guarantee correction. A 12-bit RS code guarantees correction but exceeds the budget. CLUE-ECC pools the 16-bit on-die and the system ECC parity for proper correction with less parity. Evaluation shows superior error handling with 33.3% on-die parity storage, 14.83% area and 24.18% power savings with the same bandwidth impact compared to a conventional RS code.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH680

SPADE: An Input-Adaptive Sparse Attention Engine for Fast Video Diffusion Models Inference

Shanghao Liu; Renze Chen; Size Zheng; Yuanqiang Liu; Yun (Eric) Liang; Hailong Yang

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Video diffusion transformers (vDiTs) generate high-quality video but pay quadratic self-attention cost, making inference prohibitive at video-token scales. The challenge is input-adaptive sparsity: selecting critical Q/K/V tokens with negligible overhead and executing them for end-to-end gains. We present SPADE, a training-free sparse-attention engine of three parts: (i) vDiT-SSR, a specification defining 3D blocking candidates and formalizing dynamic masks via Summarizer/Estimator expressions; (ii) runtime scheme generation using SICS and a head-wise policy; and (iii) an executor with low-overhead index search, flash block-sparse attention, and kernel grouping. Across Hunyuan-Video and Wan 2.1/2.2, SPADE raises sparsity and speed and preserves quality, accelerating attention 2.26×–3.40× and end-to-end 1.49×–1.80×.

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH681

MistleTunnel: Attacking KASLR on Apple M-Series Silicon with a Novel TLB Side Channel

Hanyin Liu; Jin Wu; Rihui Sun; Kaiyuan Rong; Hongpei Zheng; Yun Chen; Jian Dong; Dongsheng Wang

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

KASLR is a critical defense on macOS, whose implementation on M-series silicon and the latest macOS is further strengthened by hardware–software mitigations, with no practical bypass. We perform a systematic reverse-engineering of the TLB on M-series silicon and uncover a novel user–kernel TLB side-channel, TLBarge, that enables precise monitoring of kernel instruction execution. Leveraging TLBarge, we demonstrate MistleTunnel, the first attack that compromises macOS 26's 15-bit KASLR by recovering its lower 8 bits. Operating solely through normal syscalls, MistleTunnel collapses the effective KASLR entropy to 0.39%, significantly reducing its intended security.
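The 0.39% figure above is consistent with a simple counting argument, sketched below under the assumption that it refers to the fraction of the original KASLR candidate space left after the leak: with 15 bits of randomization and 8 bits recovered, 7 unknown bits remain, i.e. 2^7 = 128 of the original 2^15 = 32768 candidates. The helper name `remaining_fraction` is hypothetical.

```python
def remaining_fraction(total_bits: int, recovered_bits: int) -> float:
    """Fraction of the original candidate space still consistent with
    the leaked bits, assuming uniformly random, independent bits."""
    if not 0 <= recovered_bits <= total_bits:
        raise ValueError("recovered_bits must be between 0 and total_bits")
    remaining_bits = total_bits - recovered_bits
    return 2 ** remaining_bits / 2 ** total_bits


if __name__ == "__main__":
    fraction = remaining_fraction(total_bits=15, recovered_bits=8)
    print(f"{fraction:.2%} of candidates remain")
```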

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH684

ZipFloat: An Ultra-Fast and Lossless Floating-Point Compressor for LLM Inference Systems

Shihao Wang; Xiangyu Zou; Wen Xia; Hao Hu

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

LLM inference stresses GPU memory with large model weights and KV caches, which requires efficient compression techniques. However, existing approaches either sacrifice accuracy (e.g., quantization) or apply serial entropy coding (e.g., DFloat11) that limits throughput. Both constrain the overall inference efficiency. In this work, we propose ZipFloat, an LLM-oriented floating-point compressor that achieves massively parallel compression and decompression while preserving full precision. We observe that weights and KV-cache values in LLMs exhibit strong statistical redundancies, yet existing floating-point formats mask this redundancy and hinder effective compression. To address this, ZipFloat employs (1) Exponent Sparsification, which redefines the binary representation of floating-point values to restore compressibility, and (2) Bit-Matrix Packing, which leverages this restored structure with GPU-native parallelism to deliver extreme throughput. Evaluations show that ZipFloat delivers up to 700 GB/s compression and decompression throughput, outperforming SOTA methods (e.g., DFloat11) by several orders of magnitude while maintaining comparable compression ratios, thereby reducing TTFT by over 95% and improving inference throughput by over 4× in LLM systems.

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH690

IdealSim: Efficient Ideal Scenario Modeling for Processor Upper-Bound Performance Analysis

Yibo Zhang; Lei Gong; Cheng Li; Teng Wang; Wenqi Lou; Xuehai Zhou

Date:Monday, July 27 Location:Mtg Room 201B Session:Advances in 3D/2.5D Physical Design +1
Abstract

While microarchitectural exploration for well-established processors relies on cycle-accurate simulators (CAS) for accurate evaluation, exploring without knowing both a microarchitectural optimization's capability to eliminate critical-path latency and its theoretical upper bound often leads to blind iteration. To address this, we propose IdealSim, an efficient and general simulation framework for constructing ideal scenarios on CAS, enabling early-stage identification of promising designs. Applied to the RTL-validated CAS of the XiangShan processor to explore the ideal critical-path value predictor, IdealSim achieves 7.93% higher modeling accuracy than prior work and shows that the critical-path value predictor can deliver an average performance gain of 5.61% (and up to 37.92%) in ideal scenarios. Our results provide an accurate performance upper bound, guiding subsequent microarchitectural trade-offs and optimization of the XiangShan processor.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageSYS4. Embedded System Design Tools and Methodologies
RESEARCH692

DiffDEG: Diffusion-Enhanced Design Evolution Graph Representation Learning for Post-Layout Optimization

Jiajie Xu; Leilei Jin; Ziyue Han; Yanlong Mao; Liangji Wu; Chenpu Shi; Yunfan Zuo; Lizheng Ren; Hao Yan; Xingquan Li; Longxing Shi

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Taming the Physical World: AI Strategies for Circuits and Cores to Quantum Machines +1
Abstract

Post-layout optimization is a critical step in modern chip design. However, existing Machine Learning (ML)-assisted methods struggle to capture the cross-stage reasoning dependencies that underlie the optimization challenges. While large language models (LLMs) excel at semantic reasoning, the lack of structural and physical awareness limits their effectiveness in post-layout optimization. To address these limitations, we propose DiffDEG, a diffusion-enhanced, reasoning-aware foundation model that bridges LLM-based semantic reasoning with circuit-level structural and physical representations. DiffDEG reformulates the conventional timing graph into a Design Evolution Graph (DEG), enabling cross-stage reasoning through text-annotated netlists. By leveraging directional diffusion and self-supervised pretraining, DiffDEG jointly interprets semantic, timing, and physical information, forming a unified representation adaptable to diverse optimization tasks. Experimental results show that DiffDEG consistently enhances optimization outcomes across multiple optimization paradigms, achieving an average 12.5% performance improvement and 4x runtime speedup over commercial tools.

AIAI3-II. AI/ML Application and InfrastructureQuantumDES3. Emerging Models of Computation +1 more
RESEARCH693

MALIWAN: Performance and Energy Efficiency Co-Optimization of GEMM on Versal ACAP Architectures

Ilias Papalamprou; Dimosthenis Masouros; Ioannis Loudaros; Francky Catthoor; Dimitrios Soudris

Date:Monday, July 27 Location:Mtg Room 203C Session:Streamlining Silicon: From Real-Time Scheduling to Efficient GPU Execution +1
Abstract

General Matrix Multiplication (GEMM) is a fundamental kernel in scientific computing and deep learning and often dominates both performance and energy consumption, particularly in edge deployments with strict power and resource constraints. AMD's Versal ACAP offers heterogeneous components (AIEs, PL, PS) that can address these challenges, but identifying efficient mappings across these units is challenging, with prior work largely overlooking power-performance trade-offs. We introduce MALIWAN, an automated framework that generates Pareto-optimal GEMM mappings on Versal ACAP devices. MALIWAN combines fast analytical-model–based sampling with data-driven ML, using on-board measurements to train a surrogate that drives large-scale Design Space Exploration. Based on a collection of ≈6,000 on-board experiments, we first provide a comprehensive analysis of how different mapping configurations affect performance and power. We then evaluate MALIWAN on the Versal VCK190, demonstrating geomean improvements of 1.23× (up to 2.5×) in throughput and 1.25× (up to 2.7×) in energy efficiency over state-of-the-art works. Compared to NVIDIA GPUs, MALIWAN achieves up to 2.5× energy efficiency

SystemsSYS4. Embedded System Design Tools and MethodologiesChipletEDA
RESEARCH698

FreeZone: An On-Device Out-of-Order Writes Reordering Scheme for Consumer Zoned Storage

Dingcui Yu; Mengyang MA; Tianyu REN; Hongyu Zhu; Xinghui Duan; Yina Lv; Lei Qiao; Liang Shi

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Scaling the Bedrock: OS-Hardware Co-Design for High-Reliability Storage +1
Abstract

The zone interface has emerged as a promising standard in consumer devices to support high random read performance, as its design reduces read amplification and enables efficient read parallelism. However, to support the underlying zone interface, the kernel's I/O subsystem must enforce strictly sequential writes per zone by logically ordering requests, which results in reduced write parallelism at the kernel level. Although prior work mitigates this by offloading the request reordering process to the on-device write buffer, the limited buffer size in consumer devices cannot accommodate the growing volume of requests that require reordering. This paper proposes FreeZone, an on-device writes reordering solution that extends the write buffer size by using the high-performance SLC flash region of storage. FreeZone allows the kernel to submit I/O requests freely, improving parallel write performance. Evaluation demonstrates that FreeZone achieves up to a 2.3× IOPS improvement compared with existing zoned storage architectures.

SystemsSYS5. Embedded Memory and Storage SystemsQuantumDES3. Emerging Models of Computation +1 more
RESEARCH701

RAGNMP: Leveraging Elimination Tree to Accelerate RAG with Near-Memory Processing

Dan Chen; Huize Li; Zhaoying Li; Huiying Lan; Xiang Liu; Zerui Li; Dan Wu; Bin Gao; Tulika Mitra

Date:Monday, July 27 Location:Mtg Room 202C Session:Intelligent Fetch & Match Architectures +1
Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating information retrieval from knowledge databases, significantly improving the generated results' accuracy, relevance, and contextual richness. An in-depth analysis of RAG reveals that its diverse operations are primarily constrained by memory bottlenecks. The diversity and continual evolution of RAG algorithms further increase system design complexity. In this paper, we introduce RAGNMP, a general-purpose Near-Memory Processing (NMP) accelerator designed for RAG. Specifically, we first propose an enhanced and quantified elimination tree variant that simultaneously explores data placement, task parallelism, and pipelining to better support RAG workloads on NMP architectures. It also remains adaptable to algorithm changes in RAG. We further propose a general-purpose NMP architecture with a flexible processing unit that efficiently supports diverse memory-bound operations in RAG. Experimental results show that RAGNMP outperforms the state-of-the-art RAG system and accelerator.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH706

Early Design Stage Thermal Prediction and Optimization: Machine-Learning Driven Approach

Chanhee Jeon; Munwon Lee; Taewhan Kim

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

As semiconductor technologies continue to scale toward advanced nodes, the sharp increase in transistor density has led to a substantial rise in on-chip power density, making thermal effects a first-order design concern. Traditional thermal mitigation techniques (e.g., thermal-aware coarse-grained floorplanning, thermal-aware (post-route) cell adjustment, and structural cooling enhancements) suffer from inaccurate thermal prediction or incur high fabrication cost, limiting their practical applicability to modern SoC designs. To overcome these limitations, we present an ML-based early-stage thermal prediction and mitigation framework that enables proactive thermal management during the physical design process. Specifically, our approach (1) first uses machine learning models during the global placement stage to predict the power density map, which is the underlying source of thermodynamic behavior, and then (2) accurately estimates the steady-state thermal map through a physics-guided thermal interpolation. The predicted temperature is subsequently leveraged by (3) an ML-model-based optimization engine that adjusts the placement solution to minimize thermal hotspots without timing degradation. Experimental results demonstrate that our proposed model achieves 49.0% and 34.9% accuracy improvement in power density and thermal prediction, respectively, compared to tool estimation. When applied to thermal-aware placement optimization, the framework successfully reduces the maximum chip temperature by 9.97 °C while maintaining equivalent timing and area. These results confirm the effectiveness of our proposed early-stage thermal modeling and optimization framework in improving thermal reliability for modern power-hungry SoCs.

EDAEDA4. Power Analysis and OptimizationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH713

AlphaPlacer: Analog Placement Enhanced by Monte Carlo Tree Search

Yapeng Li; Peng Xu; Mingzi Wang; Tinghuan Chen

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Analog layout design remains heavily dependent on manual expertise, with placement being the most critical stage that requires significant development time and domain knowledge. Current automated placement techniques face challenges in capturing expert design practices and fall short of practical deployment. To address this limitation, we present AlphaPlacer, a novel MCTS-based analog placement framework that learns from historical layouts to provide expert-guided placement optimization. Our approach formulates analog placement as a hierarchical sequence-pair search with a two-level MCTS structure, embedding a pretraining framework that captures sequence-pair distributions from expert layouts to guide the search toward high-quality solutions. Experimental results demonstrate that our method significantly outperforms existing baselines across multiple key metrics including area, wirelength, and post-layout performance.
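For readers unfamiliar with the sequence-pair representation that the search operates on, the following sketch decodes a sequence pair into block coordinates using the standard rule: a block that precedes another in both sequences lies to its left, and a block that follows it in the first sequence but precedes it in the second lies below it. The block names and sizes are hypothetical, and analog-specific constraints such as symmetry are deliberately left out.

```python
def decode_sequence_pair(pos_seq, neg_seq, sizes):
    """Return the lower-left (x, y) of each block for a sequence pair.

    sizes maps block name -> (width, height).
    """
    pos_idx = {b: i for i, b in enumerate(pos_seq)}
    x, y = {}, {}
    # Walk neg_seq in order: every block that must sit left of or below b
    # precedes b in neg_seq, so its coordinates are already final.
    for i, b in enumerate(neg_seq):
        x[b] = y[b] = 0.0
        for a in neg_seq[:i]:
            if pos_idx[a] < pos_idx[b]:   # a before b in both: a is left of b
                x[b] = max(x[b], x[a] + sizes[a][0])
            else:                         # a after b in pos_seq: a is below b
                y[b] = max(y[b], y[a] + sizes[a][1])
    return {b: (x[b], y[b]) for b in neg_seq}

# Hypothetical blocks, for illustration only.
sizes = {"a": (2.0, 1.0), "b": (1.0, 2.0), "c": (3.0, 1.0)}
placement = decode_sequence_pair(["c", "a", "b"], ["a", "b", "c"], sizes)
print(placement)  # {'a': (0.0, 0.0), 'b': (2.0, 0.0), 'c': (0.0, 2.0)}
```

This quadratic decoder is the simplest form; faster longest-common-subsequence variants exist but are not needed to follow the idea.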

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH720

Provably Probabilistic Safe Controller Synthesis for Vision-Based Neural Network Control Systems

Qianlong Hu; Hanrui Zhao; Liqian Chen; Zhengfeng Yang; Banghu Yin; Ji Wang

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

The prevalence of nonlinear systems in safety-critical domains calls for controllers with safety guarantees, while the reliance of vision-based control on high-dimensional images complicates both decision-making and formal safety analysis under uncertainty. This paper proposes a provably safe controller synthesis method for vision-based neural network control systems. We first employ a conditional generative adversarial network (cGAN) to approximate the mapping from system states to visual observations and combine it with RL-based pretraining to build a verifiable closed-loop structure. A data-driven model quantifies uncertainties from environmental perturbations, while martingale theory guides the learning of a stochastic barrier certificate (SBC) to provide rigorous probabilistic safety bounds. Furthermore, counterexamples from verification are used to alternately refine both the controller and certificate networks, ultimately yielding a controller with formally provable probabilistic safety guarantees. Experimental results on widely studied benchmarks demonstrate the efficiency and effectiveness of our approach.
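The link between martingale theory and a probabilistic safety bound can be illustrated with the textbook supermartingale form of a barrier certificate. The symbols below ($B$, $\mathcal{X}_u$, $x_t$) are notation introduced here for illustration; the abstract does not state which exact certificate conditions the paper uses, and they may differ from this basic form.

```latex
% Assumed setting: a discrete-time stochastic closed loop x_0, x_1, ... with
% unsafe set X_u and a candidate certificate B.
\begin{align*}
  &B(x) \ge 0 \quad \text{for all states } x, \\
  &B(x) \ge 1 \quad \text{for all } x \in \mathcal{X}_u, \\
  &\mathbb{E}\!\left[ B(x_{t+1}) \mid x_t \right] \le B(x_t) \quad \text{for all } x_t .
\end{align*}
% Under these conditions B(x_t) is a nonnegative supermartingale, and Ville's
% inequality bounds the probability of ever reaching the unsafe set:
\[
  \Pr\!\left[ \exists\, t \ge 0 : x_t \in \mathcal{X}_u \;\middle|\; x_0 \right]
  \;\le\; \Pr\!\left[ \sup_{t \ge 0} B(x_t) \ge 1 \;\middle|\; x_0 \right]
  \;\le\; B(x_0).
\]
```

A small value of $B(x_0)$ therefore certifies a small probability of violating safety from the initial state $x_0$.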

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH727

A Correlation-Aware GKP Decoder for Fault-Tolerant Continuous-Variable Photonic Quantum Computers

Ryosuke Matsuo; Junichiro Kadomoto

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Charting the Architecture of Tomorrow’s Fault Tolerant Quantum Machines by Harmonizing Codes to Systems +1
Abstract

The Gottesman–Kitaev–Preskill (GKP) encoding is a promising approach for realizing fault-tolerant continuous-variable (CV) photonic quantum computers. Recent studies have investigated various GKP decoding algorithms, among which correlation-aware methods that exploit inter-qubit correlations achieve significantly improved error-correction performance. However, hardware implementation of GKP decoding remains largely unexplored. Error decoding in CV photonic quantum computers must operate within tens of nanoseconds to match the optical clock frequency. A straightforward hardware implementation of correlation-aware decoding introduces substantial computational latency due to its arithmetic complexity. To address this challenge, this paper proposes a high-accuracy and low-latency accelerator architecture optimized for correlation-aware GKP decoding. Logic synthesis using 7-nm FinFET technology demonstrates that our decoder can complete correlation-aware GKP decoding within 12.96 ns. In addition, comprehensive design space exploration is conducted to evaluate tradeoffs among decoding accuracy, latency, and circuit area, leading to design guidelines for future correlation-aware GKP decoder implementations.

DesignQuantumDES6. Quantum ComputingDES3. Emerging Models of Computation +1 more
RESEARCH732

Unified Rail-Signal Co-Design for Multilayer PCBs: Resource Allocation, IR-Drop, and Routability

Wei-Kai Huang; Jhih-Jie Lee; Yao-Wen Chang; Yang Lu; Jerry Bai; Bin-Chyi Tseng

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

The growing demand for high-performance yet energy-efficient systems has elevated the need for coordinated power and signal routing in multilayer printed circuit boards (PCBs). While prior work optimizes either the power-delivery network (for DC IR-drop compliance) or signal routing (for routability), these separated flows often induce spatial contention and inefficient layer utilization. We propose the first unified rail-signal co-design framework for multilayer PCBs that jointly optimizes routing quality and power integrity. Our approach (1) generates routing guides to profile and allocate resources, (2) performs iterative rail-signal co-optimization that explicitly models crossing, overflow, and current density, and (3) conducts detailed routing via a guided, crossing-aware A* search that aligns with globally optimized guides. Experiments demonstrate that our proposed framework substantially improves routability while satisfying DC IR-drop constraints, delivering robust solutions with efficient resource usage.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH733

PIANO: Physics-Informed Admittance Neural Operator for Fast Power Integrity Analysis

Xiangqiao Meng; Siyuan Miao; Yiming Zhang; Min Gao; Zhen Zhuang; Wei Xing; Chen Wu; Lei He

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

Design of large-scale integrated circuits requires rigorous power integrity (PI) analysis to ensure reliable and robust operation. In particular, Direct Current (DC) PI analysis entails solving massive symmetric positive definite (SPD) systems derived from Kirchhoff's current law. Traditional PI evaluation methods are computationally expensive, making it difficult to satisfy the stringent turnaround-time requirements of modern design iterations. To address these challenges, we introduce PIANO, a physics-informed admittance neural operator for fast and high-fidelity DC PI analysis in 3D IC. PIANO first extracts an equivalent resistive network from the PDN layout and represents it as a graph. A graph neural network then learns the equivalent port admittance matrix between voltage-source ports and current sinks. The resulting reduced SPD system, formed from the predicted admittance matrix, is efficiently solved by a lightweight numerical solver to obtain port voltages. Experiments on industrial-scale PDNs show that PIANO achieves a 12.6x speedup over commercial simulators with only 1.95% voltage error, and a 17.2x acceleration when integrated into a PI-constrained design space exploration flow.

EDAEDA4. Power Analysis and OptimizationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH734

sMoE: Elastic MoE-Based Inference with Serverless Computing via Fine-Grained Expert Scaling

Xiaofei Yue; Ziming Zhao; Jiongchi Yu; Huidong Ma; Zhaoxuan Li; Tingting Li; Jianwei Yin

Date:Monday, July 27 Location:Mtg Room 101B Session:Computing Without Waste: Lean and Scalable Neural Architectures +1
Abstract

Mixture-of-Experts (MoE) models are attractive for large-scale inference, but sparse activation and skewed, time-varying expert popularity lead to cost-inefficiency. Serverless computing has become an attractive option because of its fine-grained resource elasticity and billing. Even so, achieving cost-efficient and SLO-compliant scaling for each expert with intra- and inter-layer dependencies under concurrent requests remains challenging. In this paper, we present sMoE, a topology-aware elastic auto-scaler for serverless MoE inference. Our insight is to treat the deployment of an MoE inference pipeline as a DAG of experts and non-MoE segments. Building on this, sMoE is designed as a deep reinforcement learning-based solution. Specifically, it encodes expert- and layer-level runtime features with DAGNN by propagating cross-layer semantics. Coupled with a layer-wise pointer network, sMoE captures intra-layer semantics to jointly decide vertical resources, replica counts, and concurrency settings for every expert at runtime. Experimental results show that sMoE reduces serving cost by 21.4%–39.2% compared to state-of-the-art solutions, while meeting stringent end-to-end P95 latency SLOs.

AIAI3-I. AI/ML Application and InfrastructureChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH739

Stochastic Diffusion Prior: Knowledge Elicitation and Local Adaptation for Transistor Sizing

Wei Xing; Zhuohua Liu; Weilun Xie; Yuxuan Zhang; Yuan Yao; Yuanqi Hu

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Automatic transistor sizing remains a significant challenge in modern analog and mixed-signal circuit design. While recent AI-driven methods show promise, they remain sample-inefficient, largely because they struggle to effectively incorporate expert knowledge and generalize across different designs. We propose Stochastic Diffusion Prior (SDP), a framework that systematically bridges this gap. SDP elicits domain expertise through an intuitive interface based on the established gm/ID methodology, formally integrating this knowledge into the optimization. Critically, SDP is not a static prior; it enables dynamic adaptation through a novel mutual information-guided diffusion process, allowing the prior to be refined and corrected by new simulation data. SDP can be integrated with virtually any optimization method, enhancing its sample efficiency and exploration capabilities. Experimental validation on practical analog circuits demonstrates SDP's superiority over state-of-the-art approaches. When integrated with a simple Vanilla BO, SDP achieves up to 92.3× speedup (27.9× on average) while improving key performance metrics by up to 2.0× (1.5× on average). The approach maintains high effectiveness even with imperfect prior knowledge and successfully enables knowledge transfer between technology nodes, positioning it as a versatile and robust enhancement for circuit optimization.

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH752

CQ-CiM: Hardware-Aware Embedding Shaping for Robust CiM-Based Retrieval

Xinzhao Li; Alptekin Vardar; Franz Müller; Navya Goli; Umamaheswara Tida; Kai Ni; X. Sharon Hu; Thomas Kämpfe; Ruiyang Qin

Date:Monday, July 27 Location:Mtg Room 202C Session:Intelligent Fetch & Match Architectures +1
Abstract

Deploying Retrieval-Augmented Generation (RAG) on edge devices is in high demand, but is hindered by the latency of massive data movement and computation on traditional architectures. Compute-in-Memory (CiM) architectures address this bottleneck by performing vector search directly within their crossbar structure. However, CiM's adoption for RAG is limited by a fundamental "representation gap," as high-precision, high-dimension embeddings are incompatible with CiM's low-precision, low-dimension array constraints. This gap is compounded by the diversity of CiM implementations (e.g., SRAM, ReRAM, FeFET), each with unique designs (e.g., 2-bit cells, 512x512 arrays). Consequently, RAG data must be naively reshaped to fit each target implementation. Current data shaping methods handle dimension and precision disjointly, which degrades data fidelity. This not only negates the advantages of CiM for RAG but also confuses hardware designers, making it unclear if a failure is due to the circuit design or the degraded input data. As a result, CiM adoption remains limited. In this paper, we introduce CQ-CiM, a unified, hardware-aware data shaping framework to jointly perform Compression and Quantization, flexibly shaping data to fit various CiM designs. To the best of our knowledge, this is the first work to shape data for comprehensive CiM usage on RAG.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH754

Delay-Constrained Area Optimisation Using SAT-Based Local Improvement

Franz-Xaver Reichl; Christoph Scholl

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

This paper presents a new approach for minimising Boolean circuits subject to delay constraints. The proposed approach extends the efficient but purely area-based Boolean circuit minimiser eSLIM. The eSLIM minimiser reduces circuits by iteratively computing optimal replacements for small subcircuits on the fly. We extend the SAT encoding used for synthesising these replacements to take delay into account. While the additional constraints on delay restrict the set of possible candidates for the replacement, we can still harness the flexibility of making use of Boolean relations and multi-output circuits. We implemented the proposed method as part of the industrial-strength tool ABC. Surprisingly, in an experimental evaluation our implementation showed only a rather small deterioration in terms of area improvement, but substantial improvements in terms of delay compared to the purely area-based eSLIM approach on different benchmark sets.

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH760

SP2RINT: Spatially-Decoupled Physics-Constrained Progressive Inverse Optimization for Diffractive Optical Neural Network Training

Pingchuan Ma; Ziang Yin; Qi Jing; Zhengqi Gao; Nicholas Gangi; Boyang Zhang; Tsung-Wei Huang; Rena Huang; Duane Boning; Yu Yao; Jiaqi Gu

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Diffractive optical neural networks (DONNs) leverage light propagation for efficient analog AI and signal processing. Advances in nanophotonic fabrication and metasurface-based wavefront engineering have opened new pathways to realize high-capacity DONNs across various spectral regimes. Training such DONN systems to determine the metasurface structures remains challenging. Heuristic methods are fast but oversimplify metasurface modulation, often resulting in physically unrealizable designs and significant performance degradation. Simulation-in-the-loop training optimizes implementable metasurfaces via adjoint methods, but is computationally prohibitive and unscalable. To address these limitations, we propose SP2RINT, a spatially decoupled, progressive training framework that formulates DONN training as a PDE-constrained learning problem. Metasurface responses are first relaxed into freely trainable transfer matrices with a banded structure. We then progressively enforce physical constraints by alternating between transfer matrix training and adjoint-based inverse design, avoiding per-iteration PDE solves while ensuring final physical realizability. To further reduce runtime, we introduce a physics-inspired, spatially decoupled inverse design strategy based on the natural locality of field interactions. This approach partitions the metasurface into independently solvable patches, enabling scalable and parallel inverse design with system-level calibration. Evaluated across diverse DONN training tasks, SP2RINT achieves digital-comparable accuracy while being 1825 times faster than simulation-in-the-loop approaches. By bridging the gap between abstract DONN models and implementable photonic hardware, SP2RINT enables scalable, high-performance training of physically realizable meta-optical neural systems.

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH776

DiffSP: Differentiable Sequence Pair-Based Analog Placement

Peng Xu; Mingzi Wang; Yapeng Li; Yuyang Ye; Tinghuan Chen; Tsung-Yi Ho; Bei Yu

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Analog placement with compact representations is traditionally solved using heuristic or simulated annealing methods that are difficult to integrate with a differentiable optimization engine. This paper introduces DiffSP, a differentiable sequence-pair-based analog placement method that bridges discrete combinatorial representation with continuous gradient optimization. We derive a smooth relaxation of the sequence pair constraint graph via Gumbel–Sinkhorn relaxation, which allows area, wirelength, and symmetry objectives to be jointly optimized via automatic gradient calculation. A MILP-based legalization stage then enforces exact geometric and symmetry constraints. Experiments on industrial-level OTA benchmarks show that DiffSP achieves better placement quality and post-layout performance metrics than state-of-the-art analog placers with significantly reduced runtime.
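The Gumbel–Sinkhorn operator mentioned above is a generic way to relax a permutation into a differentiable, approximately doubly stochastic matrix. The sketch below shows that operator alone, in plain Python without automatic differentiation. The function name, the score matrix, and the parameter defaults are assumptions for illustration; how DiffSP parameterizes its scores and couples them to the constraint graph is not described in the abstract.

```python
import math
import random

def _logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gumbel_sinkhorn(scores, tau=1.0, n_iters=50, noise=1.0, seed=0):
    """Relax an n x n score matrix into an approximately doubly stochastic matrix."""
    rng = random.Random(seed)
    n = len(scores)

    def gumbel():
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        return -math.log(-math.log(u))

    # Perturb the scores with Gumbel noise and divide by the temperature.
    log_p = [[(scores[i][j] + noise * gumbel()) / tau for j in range(n)]
             for i in range(n)]
    for _ in range(n_iters):
        for i in range(n):  # row normalization in log space
            z = _logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        for j in range(n):  # column normalization in log space
            z = _logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]

# Hypothetical scores that favor the identity permutation.
scores = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
soft = gumbel_sinkhorn(scores, tau=0.1, noise=0.0)
print([[round(v, 3) for v in row] for row in soft])
```

Lowering `tau` pushes the result toward a hard permutation matrix, while a larger `tau` keeps it smooth, which is the usual trade-off between fidelity to the discrete problem and well-behaved gradients.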

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH791

SemCom: Semantic-Enhanced On-Board Satellite Image Compression at Ultra-Low Bitrates

Ziyuan Zhang; Jun Liu; Hewu Li; Tianwei Zhang; Han Qiu

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Although increasingly deployed for in-orbit earth observation (EO) imagery, existing neural image compression, which follows designs made for natural images, struggles because (1) there is a clear gap in semantic structure and (2) computational inconsistencies between in-orbit and ground hardware induce decoding errors. Observing that EO imagery is dominated by large, semantically coherent regions (e.g., forest, urban), we propose SemCom, which uses a semantically partitioned codebook for efficient, region-aware bit allocation. SemCom introduces semantic-adaptive entropy coding with offline probability estimation, ensuring a fixed, platform-agnostic probability model for bit-exact decoding. Experiments on real-world satellite payloads show SemCom outperforms SOTA baselines with low in-orbit complexity.

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH793

RTL-3D: Timing-Aware Tier Partitioning for 3D ICs Using Pre-Synthesis Timing Analysis

Haoyang Xu; Zheng Yang; Zhen Zhuang; Leilei Jin; Bei Yu; Sung Lim; Tsung-Yi Ho

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

RTL-3D enables early-stage cross-tier timing analysis and optimization in 3D ICs. The framework utilizes a Transformer-based pre-synthesis timing analysis model to directly steer its core partitioning method, which features differentiable register tier partitioning. This creates a tight feedback loop between timing prediction and physical implementation, enabling co-optimization. The experimental evaluation demonstrates that RTL-3D achieves significant improvements in sign-off timing metrics, with a 59.3% reduction in WNS and a 75.4% reduction in TNS compared to state-of-the-art methods.

EDAEDA3. Timing Analysis and OptimizationAIDES5. Emerging Device and Interconnect Technologies
RESEARCH795

GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph

Yuebo Luo; Shiyang Li; Yifei Feng; Vishal Kancharla; Shaoyi Huang; Caiwen Ding

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Graph neural networks (GNNs) show strong promise for circuit analysis, but scaling to modern large-scale circuit graphs is limited by GPU memory and training cost, especially for deep models. We revisit deep GNNs for circuit graphs and show that, when trainable, they significantly outperform shallow architectures, motivating an efficient, domain-specific training framework. We propose Grouped-Sparse-Reversible GNN (GSR-GNN), which enables training GNNs with up to hundreds of layers while reducing both compute and memory overhead. GSR-GNN integrates reversible residual modules with a group-wise sparse nonlinear operator that compresses node embeddings without sacrificing task-relevant information, and employs an optimized execution pipeline to eliminate fragmented activation storage and reduce data movement. On sampled circuit graphs, GSR-GNN achieves up to 87.2% peak memory reduction and over 30× training speedup with negligible degradation in correlation-based quality metrics, making deep GNNs practical for large-scale EDA workloads.
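Reversible residual modules save memory because a layer's inputs can be recomputed from its outputs instead of being stored for the backward pass. The sketch below shows the standard additive coupling used in reversible residual networks. It is an assumption that GSR-GNN's modules follow this basic form, and the sub-functions `f` and `g` here are arbitrary placeholders, not graph convolutions.

```python
import math

def rev_forward(x1, x2, f, g):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs; no stored activations needed."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

# Placeholder sub-functions; they do not need to be invertible themselves.
f = lambda v: [math.tanh(2.0 * t) for t in v]
g = lambda v: [0.5 * t * t for t in v]

x1, x2 = [0.1, -0.4, 0.7], [1.2, 0.0, -0.3]
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)
print(max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2)))  # close to 0
```

Because the inverse only re-evaluates `f` and `g`, activation memory stays roughly constant in depth at the cost of extra computation during the backward pass.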

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH796

Prism: Reducing Arithmetic for High-Recall Approximate Nearest Neighbor Search on Processing-in-Memory

Weihan Kong; Shengan Zheng; Yingxue Zhou; Yifan Hua; Yuheng Wen; Cong Zhou; Guifeng Wang; Linpeng Huang

Date:Monday, July 27 Location:Mtg Room 202C Session:Intelligent Fetch & Match Architectures +1
Abstract

Approximate Nearest Neighbor Search (ANNS) at scale is constrained by the memory wall. By moving compute to memory, Processing-in-Memory (PIM) offers ample internal bandwidth but limited on-die compute, making arithmetic reduction crucial. We propose Prism, a PIM-based ANNS system co-optimizing vector pruning, distance evaluation, and host-PIM orchestration. It employs a proximity-aware vector pruner to leverage high intra-PIM bandwidth and dual-cluster affiliations to filter out distant vectors. Prism then performs sensitivity-ordered distance computation, prioritizing high-impact dimension segments and terminating candidates early once exclusion criteria are met. A stall-free host-PIM pipeline overlaps query preparation, PIM execution, and global ranking to sustain high throughput. Experiments show that Prism achieves 2.7–19.8× higher throughput over state-of-the-art systems.
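Early termination in distance evaluation rests on a simple fact: a partial sum of squared differences is a lower bound on the full squared L2 distance, so a candidate can be dropped as soon as the partial sum exceeds the current threshold. The sketch below illustrates this with a hypothetical `segment_order` argument giving the order in which dimension segments are visited. How Prism ranks segments by sensitivity and which exclusion criteria it uses are not described in the abstract.

```python
def bounded_sq_distance(query, vector, segment_order, threshold):
    """Return the squared L2 distance, or None once it provably exceeds threshold.

    segment_order is a list of (start, end) index ranges covering all dimensions,
    visited in the given order.
    """
    partial = 0.0
    for start, end in segment_order:
        partial += sum((q - v) ** 2
                       for q, v in zip(query[start:end], vector[start:end]))
        if partial > threshold:
            return None  # remaining segments can only increase the distance
    return partial

query = [0.0, 0.0, 0.0, 0.0]
segments = [(0, 2), (2, 4)]  # hypothetical ordering, most informative first
print(bounded_sq_distance(query, [3.0, 0.0, 0.0, 1.0], segments, threshold=5.0))  # None
print(bounded_sq_distance(query, [1.0, 0.0, 0.0, 1.0], segments, threshold=5.0))  # 2.0
```

Visiting the segments that contribute most to the distance first makes rejection happen sooner, which is where the arithmetic savings come from.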

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH797

Targeted Bit-Flip Attacks on LLM-Based Agents

Jialai Wang; Wen Ya; Liu Zhongmou; Yuxiao Wu; Bingyi He; Zongpeng Li; Ee-Chien Chang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Robust and Trusted AI Systems: Attacks, Defenses, and Hardware Reliability +1
Abstract

Targeted bit-flip attacks (BFAs) exploit hardware faults to manipulate model parameters, posing a significant security threat. While prior work targets single-step inference models (e.g., image classifiers), LLM-based agents with multi-stage pipelines and external tools present new attack surfaces, which remain unexplored. This work introduces Flip-Agent, the first targeted BFA framework for LLM-based agents, manipulating both final outputs and tool invocations. Our experiments show that Flip-Agent significantly outperforms existing targeted BFAs on real-world agent tasks, revealing a critical vulnerability in LLM-based agent systems.

SecuritySEC1. AI/ML Security/PrivacyAIQuantum
RESEARCH799

Mixed-Structure Double-Sided Redistribution Layer Routing for Glass Interposer-Based 5.5D ICs

Haoyang Xu; Zhen Zhuang; Leilei Jin; Zheng Yang; Chen Wu; Lei He; Sung Lim; Tsung-Yi Ho

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

Glass interposers enable 5.5D chiplet stacking, offering an advance beyond silicon-based technology. However, routing in such dense, double-sided environments is challenging, and existing methods do not account for glass interposer design rules. To address this, we propose a 5.5D RDL routing algorithm. It begins by allocating routing resources, employs double-sided global routing for efficient guidance, and uses a novel pixel-based detailed routing to finalize results. Experiments show the proposed method achieves 100% routability and reduces wirelength by 37% compared to a 2.5D RDL router.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH800

E³-CODE: Embedded and Efficient Error-Correcting Code for Error-Resilient Neural Networks

Venkata Nithin Kamineni; Habibur Rahaman; Ovishake Sen; Baibhab Chatterjee; Swarup Bhunia; Rickard Ewetz

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

Neural networks are widely deployed at the edge to process high-dimensional sensor data, but they are susceptible to burst errors that can corrupt weights and degrade inference accuracy. Conventional error-correcting codes (ECC) mitigate errors but incur significant memory overhead. Recent ECC methods for neural networks overwrite the least significant bits of the model weights with parity bits, providing zero-overhead resilience at the expense of slightly reduced inference accuracy. In this paper, we propose a framework for Embedded and Efficient Error-Correcting Code for Error-Resilient Neural Networks called E3-CODE. The proposed method embeds multi-bit parity within the entire weight representation, which is different from only modifying the LSBs of the weights. To minimize the negative impact from the parity-embedded ECC, weight and parity assignments are jointly optimized via a mixed-integer linear programming (MILP) formulation. We also propose a hybrid ECC scheme that combines the embedded ECC with conventional ECC to trade minor memory overhead for significantly improved reliance. The experimental evaluation on the ImageNet and CIFAR-10 datasets using ResNet, MobileNetV2, and EfficientNet-B0 demonstrates that E3-CODE maintains software-level accuracy in the presence of burst errors. Compared with prior methods, the lifetime of the edge system is extended by 4.8× with no memory overhead and 10× with less than 2% memory overhead.
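The zero-overhead idea attributed above to prior work, overwriting a weight's least significant bit with parity, can be illustrated in its simplest form as follows. This toy version stores one even-parity bit per 8-bit weight and can only detect a single bit flip, not correct it. It is not the E3-CODE scheme, whose multi-bit embedding and MILP-based assignment are not detailed in the abstract, and the function names are hypothetical.

```python
def embed_parity(weight: int) -> int:
    """Replace the LSB of an 8-bit weight with even parity over the upper 7 bits."""
    upper = weight & 0xFE
    parity = bin(upper).count("1") & 1
    return upper | parity

def is_consistent(stored: int) -> bool:
    """A stored byte is consistent if it has an even number of set bits."""
    return bin(stored & 0xFF).count("1") % 2 == 0

w = 0b10110110                      # hypothetical quantized weight (182)
stored = embed_parity(w)
print(stored, abs(stored - w))      # 183 1: the value moves by at most one LSB
print(is_consistent(stored))        # True
print(is_consistent(stored ^ 0x40)) # False: a single bit flip is detected
```

The accuracy cost comes from the perturbed LSB, which is the trade-off the abstract describes for this family of methods.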

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH816

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference

Jinqi Wen; Tong Xie; Runsheng Wang; Meng Li

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

Diffusion model deployment has been suffering from high energy consumption and inference latency despite its superior performance in visual generation tasks. Dynamic voltage and frequency scaling (DVFS) offers a promising solution to exploit the potential of the underlying accelerators. However, existing approaches often lead to either limited efficiency gains or degraded output quality because they overlook the inherent fault tolerance of the diffusion model. Therefore, in this paper, we propose DRIFT, a novel algorithm-architecture co-optimization framework that harnesses the fault tolerance for efficient and reliable diffusion model inference. We first perform a comprehensive resilience analysis on representative diffusion models. Building on these observations, we introduce a fine-grained, resilience-aware DVFS strategy that selectively protects error-sensitive network blocks, and a rollback-ABFT mechanism that adaptively corrects only critical errors by reverting to previous timesteps. We further optimize offloading intervals and reorganize data layouts to reduce memory overhead. Experiments across diverse models and datasets show that DRIFT can achieve on average 36% energy savings through voltage underscaling or 1.7x speedup via overclocking while maintaining generation quality.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH821

Runtime-ADAR: Runtime Anomaly Detection and Attack Recovery in Encrypted DNNs

Jie Xiao; Jiajun Guo; Yuhao Huang; Hao Ying; Zhanhui Shi; Jianping Mei; Fan Zhang

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Fault injection attacks (FIAs) pose a severe threat to accelerators executing privacy-preserving homomorphic encryption (HE) neural networks. Defending against FIAs in the ciphertext domain is challenging due to opaque internal states and the difficulty of distinguishing inherent encryption noise from malicious perturbations. We propose Runtime-ADAR, a lightweight runtime framework enabling closed-loop detection and recovery fully within the ciphertext domain. By repurposing intrinsic HE noise as a high-fidelity diagnostic signal, Runtime-ADAR integrates a sparse-projection noise detector for statistical anomaly sensing and a gradient-field repair engine for in-situ weight compensation. This synergy transforms encryption noise from a vulnerability into a defensive asset. Evaluations on multiple encrypted models show that Runtime-ADAR mitigates over 90% of attack successes and restores up to 97% baseline accuracy with modest performance overhead on CNNs.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH827

DR. OPC: Differentiable Rasterization for Advanced Optical Proximity Correction

Su Zheng; Xinyun Zhang; Bei Yu; Martin Wong

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Differentiable Dreams: GPU-Powered Litho Gets Its Gradient Groove Back +1
Abstract

Optical proximity correction (OPC) is essential for mitigating lithographic distortions in semiconductor manufacturing. A standard OPC iteration involves the rasterization, lithography simulation, and correction of mask patterns. However, the non-differentiable nature of rasterization limits both optimization efficiency and flexibility. In this paper, we propose DR. OPC, a fully differentiable OPC pipeline enabled by differentiable rasterization. It natively supports advanced features like curvilinear patterns, multi-segment solving, process window improvement, and mask rule violation correction. Experiments demonstrate that DR. OPC achieves reductions of 52.1%, 16.0%, and 21.1% in L2 error, PVB, and EPE, respectively.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH837

CrediX: A Credit-Driven Distributed Buffer Management for Large-Scale Switching Chips

Xu Wang; Yang Zhang; Danfeng Shan; Fengyuan Ren

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

With the scaling of switching chips, modular architectures are becoming mainstream. As a result, buffers become physically distributed across tiles, leading to poor utilization under imbalanced traffic. To address this, we propose CrediX, a lightweight dynamic buffer management mechanism that aggregates distributed buffers into region-based shared pools and allocates usage on demand through a Regional Credit Allocator (RCA). The design leverages existing credit-based flow control and preserves packet ordering via a simple VC-to-path mapping. Evaluations show that CrediX reduces backpressure by 65.6% under synthetic Hotspot traffic and mitigates severe hotspot occupancy in a realistic workload.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH838

SePArate: Segmenting Patterns from Defects in Wafer Manufacturing Using Weak Supervision

Dain Kwon; Changmin Shin; Sunjong Park; Kanghyun Choi; Hyeyoon Lee; Jaewon Jang; Minseok Choi; Jinho Lee

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Taming the Physical World: AI Strategies for Circuits and Cores to Quantum Machines +1
Abstract

In semiconductor manufacturing, defect analysis is essential, but manual inspection cannot scale. However, existing automated inspection methods remain insufficient for root-cause analysis and process optimization. To this end, we present SePArate, a weakly supervised wafer defect segmentation method. SePArate enables pixel-level separation of patterns by leveraging only image-level annotations. It consists of a three-phase training: encoder pretraining, knowledge transfer to learn spatial cues, and training on synthetic mixed-defect data for accurate segmentation. Experiments demonstrate that SePArate significantly outperforms the baselines.

AIAI3-II. AI/ML Application and InfrastructureQuantumDES3. Emerging Models of Computation +1 more
RESEARCH843

Disentangled Differentiable Timing-Power Co-Optimization with Quad-Gradient Gate Sizing

Zixuan Pan; Zizheng Guo; Yufan Du; Runsheng Wang; Yibo Lin

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

Co-optimizing timing and power in modern VLSI designs remains challenging under realistic static timing analysis and standard-cell libraries. Classical gate sizing often scales poorly, while learning-based sizers behave as expensive black boxes with limited generality. Recent differentiable physical optimization enables gradient-based design flows, but existing approaches still struggle to stay aligned with library-based implementations and to provide controlled timing–power trade-offs. We propose a library-native quad-gradient gate sizing framework that leverages differentiable timing to derive structured guidance for timing and power, enabling more systematic and interpretable co-optimization in the standard-cell sizing space. On the ICCAD 2025 contest benchmarks, our framework achieves, on average, a 40.4 percentage-point (%pt) larger reduction in TNS and a 16.2 %pt better total power change than the 1st-place contest flow.

EDAEDA3. Timing Analysis and OptimizationAIDES5. Emerging Device and Interconnect Technologies
RESEARCH848

TT-SEAL: TTD-Aware Selective Encryption for Adversarially-Robust and Low-Latency Edge AI

Kyeongpil Min; Sangmin Jeon; Jae-Jin Lee; Woojoo Lee

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Cloud–edge AI deployments must balance compression and security on resource-limited devices. While Tensor-Train Decomposition (TTD) is widely used to compact models, encryption research targets dense networks, leaving the practicality of selective encryption under compression unclear. We introduce TT-SEAL, the first selective encryption scheme tailored to TT-decomposed networks. TT-SEAL ranks TT cores using a core-wise importance criterion and encrypts a minimal set of critical cores with AES, cutting decryption cost while retaining robustness to adversarial transfer comparable to full encryption. FPGA-based experiments show encrypting as little as 4.89% of parameters preserves robustness while reducing the share of AES decryption overhead in end-to-end inference to low single digits, enabling secure, low-latency inference.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH859

SoberDSE: Sample-Efficient Design Space Exploration via Learning-Based Algorithm Selection

Lei Xu; Shanshan Wang; Chenglong Xiao

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

High-Level Synthesis (HLS) is a pivotal electronic design automation (EDA) technology that enables the generation of hardware circuits from high-level language descriptions. A critical step in HLS is Design Space Exploration (DSE), which seeks to identify high-quality hardware architectures under given constraints. However, the enormous size of the design space makes DSE computationally prohibitive. Although numerous algorithms have been proposed to accelerate DSE, our extensive experimental studies reveal that no single algorithm consistently achieves Pareto dominance across all problem instances. Consequently, an automated selection mechanism is needed to identify the best-performing DSE algorithm for each specific case. To address this challenge, we propose the SoberDSE framework, which recommends a suitable algorithm based on benchmark characteristics. Experimental results demonstrate that our SoberDSE framework significantly outperforms state-of-the-art heuristic-based DSE algorithms by up to 5.7× and state-of-the-art learning-based DSE methods by up to 4.2×. Furthermore, compared to conventional classification models, SoberDSE delivers superior accuracy in small-sample learning scenarios, with an average enhancement of 35.57%. Code and models are available at https://anonymous.4open.science/r/Sober-4377.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH863

PRICE: An Operator Recomputation and Fission Framework for Computation Graphs via Attainable Arithmetic Intensity

Minghua Shen; Fuyu Wang; Zhaoyun Qin

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

Computation graphs are fundamental for GPUs to efficiently deploy workloads. They also bring a challenge of growing memory demands. Operator recomputation and operator fission are promising directions to mitigate peak memory usage. However, existing work relies on offline decisions in recomputation and fission, leading to an increase in computation latency for only a marginal reduction in memory usage. In this paper, we propose PRICE, which is a novel operator recomputation and fission framework for computation graphs. Central to this framework is the attainable arithmetic intensity that can effectively trade off between memory usage and computation latency. We design a compile-time modeling approach to parameterize arithmetic intensity with GPU workload characteristics. Then we develop a runtime instantiation approach to attain the arithmetic intensity for online decision-making. Experiments on NVIDIA RTX 4090 and A100 demonstrate that PRICE achieves 1.08× to 1.31× speedup while delivering a comparable reduction in peak memory usage, compared to the state-of-the-art work.
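
As background for the arithmetic-intensity trade-off described above, the following is a minimal sketch using the standard roofline model, in which arithmetic intensity is FLOPs per byte moved. The abstract does not define "attainable arithmetic intensity" precisely, so this is not PRICE's model; the function names and the hardware numbers are hypothetical.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to or from memory."""
    return flops / bytes_moved


def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bandwidth: float) -> float:
    """Lower bound on execution time: limited by compute or by memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# Hypothetical memory-bound operator: 1 GFLOP of work over 1 GB of traffic.
base = roofline_time(1e9, 1e9, PEAK_FLOPS, PEAK_BW)

# Recomputing the operator doubles its FLOPs without adding memory traffic.
recomputed = roofline_time(2e9, 1e9, PEAK_FLOPS, PEAK_BW)
```

Under these assumed numbers the operator stays memory-bound after recomputation, so the time bound does not change. This is the kind of case where recomputation can reduce memory usage at little latency cost.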

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH865

ParetoPilot: Global Optimization Reasoning on HLS Design Space Exploration with LLMs

Jia Xiong; Runkai Li; Haowen Fang; Cheng Ni; Ziran Zhu; Nan Guan; Zhe Jiang; Xi Wang

Date:Monday, July 27 Location:Mtg Room 101A Session:Self-Driving EDA: Agents, Benchmarks, and Evolving Toolchains +1
Abstract

The combinatorial explosion of directives in high-level synthesis (HLS) creates an expansive design space. Previous efforts to construct design space exploration (DSE) methods with heuristics struggle to identify Pareto-optimal designs under tight search budgets. Therefore, we introduce ParetoPilot, an automated DSE framework, which leverages a dedicated LLM to adopt global optimization reasoning to navigate exploration. Guided by optimization strategies, ParetoPilot integrates directive scheduling and quality awareness to rapidly shape broad and concave Pareto fronts. Compared to a SOTA heuristic-based DSE method, ParetoPilot improves DSE performance by 49.5% and achieves 1.8× efficiency gain, outperforming the auxiliary LLM DeepSeek-R1 by 81.4%.

AIAI2-I. AI/ML Algorithms and ModelsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH868

FSGen: Agile Fused and Sparse Accelerator Generator with Accurate Power Model for LLM Applications

Jay Zhe-An Mok; Qijun Zhang; Zhiyao Xie

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

To perform design space exploration (DSE) of spatial architectures for LLM workloads, we propose FSGen, an agile generator and estimation framework. FSGen targets the key optimization spaces of fused-operator dataflows and multi-level sparsity. Fused operators boost performance by more than 10x with minimal effect on other PPA metrics, and power is improved by 1.4x with similar performance. We propose early-stage models with less than 12.8% error, which drastically reduce DSE runtime. We identify optimal designs and compare their optimized performance against several hardware generators, achieving 58x better efficiency. The source code of FSGen is available online.

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH874

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

Jing Wang; Shang Liu; Hangan Zhou; Zhiyao Xie

Date:Monday, July 27 Location:Mtg Room 101A Session:Self-Driving EDA: Agents, Benchmarks, and Evolving Toolchains +1
Abstract

This paper introduces RTL-BenchMT, an agentic framework for dynamic maintenance of RTL generation benchmarks. Large Language Models (LLMs) for automated RTL generation are one of the most important directions in EDA research, yet current RTL benchmarks face two critical challenges: (1) flawed cases in the benchmarks and (2) overfitting to the benchmarks. Both challenges are difficult to resolve purely by manual engineering effort. To address these issues and systematically reduce human maintenance cost, we propose an automated agentic framework, RTL-BenchMT. RTL-BenchMT focuses on two key applications: (1) automatically identifying and revising flawed benchmark cases and (2) automatically detecting and updating overfitting cases. With the assistance of RTL-BenchMT, we conduct a thorough, in-depth analysis of flawed and overfitting cases and produce a refined benchmark suite that will be open-sourced to the community.

AIAI2-I. AI/ML Algorithms and ModelsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH875

JSPlace: A Shape-Controllable and Length-Matching Placement for Rapid Single-Flux-Quantum Circuits

Rongliang Fu; Minglei Zhou; Huilong Jiang; Da Wang; Fei Wang; Huawei Cao; Junying Huang

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

Superconducting Rapid Single Flux Quantum (RSFQ) logic is a promising candidate for high-performance computing due to its ultra-high speed and ultra-low power. Since most RSFQ logic gates require synchronization with a single clock pulse, RSFQ circuits have a clock-driven gate-level pipelined architecture in nature. However, the gate-level pipelined architecture leads to layouts with a high width-to-height ratio, especially in large circuits, because the layout width increases with the number of logic stages while the height is determined by the tallest column. To address this, we propose JSPlace, a shape-controllable and length-matching placement algorithm for RSFQ circuits. Our algorithm folds the multi-stage pipelined placement into three segments and merges columns via mixed-integer linear programming (MILP) to achieve the target width-to-height ratio. Subsequently, dynamic programming determines the vertical positions of nodes within each column, followed by connectivity-driven repositioning to minimize total vertical wirelength. Experimental results on ISCAS85 and EPFL benchmarks demonstrate that JSPlace achieves effective control of layout width-to-height ratio while significantly outperforming the state-of-the-art in layout area and wirelength.

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH876

TDSnap: Enabling Secure Function-as-a-Service with Trusted Domain Snapshots

Seong-Joong Kim; Seungwon Shin; Myoungsung You

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

Function-as-a-Service (FaaS) has become a de-facto paradigm for deploying cloud applications, yet it exposes sensitive data to untrusted infrastructures. Running functions on confidential virtual machines (CVMs) such as Intel TDX provides strong isolation for data but significantly increases cold-start latency due to CVM initialization. This paper presents TDSnap, the first TDX-aware function snapshot system that records a verified function state with its TD and restores it on demand. Using snapshot-bounded attestation and TD-aware working-set identification, TDSnap preserves TDX isolation while reducing cold-start latency. Evaluation shows that TDSnap reduces cold-start latency by up to 40x and matches non-TDX snapshot systems.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH883

RTL-Sequencer: Towards Scalable RTL Timing Prediction with the Sequence-Based Paradigm

Ziyan Guo; Wenji Fang; Wenkai Li; Yuchao Wu; Shang Liu; Zhiyao Xie

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

Accurate timing prediction at the register-transfer level (RTL) is a longstanding challenge in design automation. Existing graph-based methods struggle with limited receptive fields, high complexity, and a lack of signal directionality. We present RTL-Sequencer, a novel sequence-based paradigm that enables scalable RTL timing prediction via linearizing logic cones by breadth-first traversal and applying modern linear sequence models. Furthermore, sequence models are customized by four synergistic techniques, including sequence shuffling, bidirectional modeling, differentiable modeling, and a hybrid graph-sequence architecture. Extensive experiments demonstrate significant improvements of RTL-Sequencer over state-of-the-art baselines, advancing early-stage timing optimization.
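
The breadth-first linearization of a logic cone mentioned above can be sketched as follows. This is a minimal illustration that assumes the cone is given as a fan-in map from each signal to its drivers; the data layout, the function name, and the example netlist are hypothetical, and the paper's actual tokenization is not described in the abstract.

```python
from collections import deque


def linearize_cone(fanin: dict[str, list[str]], endpoint: str) -> list[str]:
    """Return the nodes of the fan-in cone of `endpoint` in breadth-first order."""
    order = []
    seen = {endpoint}
    pending = deque([endpoint])
    while pending:
        node = pending.popleft()
        order.append(node)
        # Visit the drivers of this node, skipping any already scheduled.
        for driver in fanin.get(node, []):
            if driver not in seen:
                seen.add(driver)
                pending.append(driver)
    return order


# Hypothetical cone: reg_q is driven by and1, which is driven by a and or1.
example_fanin = {
    "reg_q": ["and1"],
    "and1": ["a", "or1"],
    "or1": ["b", "c"],
}
sequence = linearize_cone(example_fanin, "reg_q")
# sequence == ["reg_q", "and1", "a", "or1", "b", "c"]
```

The resulting flat sequence is the kind of input a linear sequence model can consume; nodes shared by several paths appear only once.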

EDAEDA3. Timing Analysis and OptimizationAIDES5. Emerging Device and Interconnect Technologies
RESEARCH885

Chimera: A Unified FHE Accelerator with Enhanced Compatibility for TFHE

Tenghui Hua; Huawei Li; Jianan Mu; Jing Ye; Liang Kong; Mingzhe Zhang

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

Fully homomorphic encryption (FHE) allows direct computation on encrypted data and thus enables privacy-preserving computation in the cloud. Among FHE schemes, CKKS efficiently supports arithmetic computation, while TFHE excels in logic operations. Recent studies have attempted to develop unified FHE accelerators that support diverse computational tasks and leverage the complementary strengths of the two schemes. However, existing unified FHE accelerators are predominantly CKKS-centric. At the operator layer, TFHE is mapped onto CKKS-oriented operators, causing significant performance degradation. Furthermore, the invocation of functional units and memory is entirely dominated by CKKS, resulting in hardware underutilization for TFHE. In this paper, we propose three coordinated co-optimizations to address these challenges. First, we unify the computation of Fast Fourier Transform (TFHE core operation) and Number Theoretic Transform (CKKS core operation), and design a dual-mode computational flow to efficiently support both computations. Second, we propose an aggressive strategy of bootstrapping key unrolling to maximize TFHE hardware utilization. Third, we design hardware-oriented support for FFT Shrinking KeySwitch to mitigate the performance bottleneck of TFHE KeySwitch. Accordingly, we present Chimera, a unified accelerator that redesigns the functional units and memory organization for TFHE to efficiently support these optimizations. Lastly, we develop an error-, memory-, and bandwidth-constrained auto-tuning framework that derives optimal TFHE configurations to maximize TFHE hardware utilization. Compared with the state-of-the-art unified FHE designs, Chimera achieves an average performance improvement of 14.84x across TFHE workloads, while maintaining comparable performance on CKKS workloads, with only a 9.6% area increase.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH896

Parallelizing Complementary Approximate Reachability

Runxuan Fang; Geguang Pu; Jianwen Li; Hongtai Zhu; Yechuan Xia; Yi Gao

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

Model checking, especially symbolic algorithms such as IC3/PDR, is a powerful verification technique but is increasingly constrained by single-threaded performance. Existing parallel model checking approaches either rely on redundant portfolio executions or suffer from scalability bottlenecks caused by centralized scheduling. We propose a new parallel paradigm for model checking based on fine-grained task decomposition and a decentralized work-pulling architecture. A distributed concurrent task repository replaces the central coordinator, enabling worker threads to autonomously acquire tasks and naturally achieve dynamic load balancing. We present the framework design, analyze its parallel mechanisms, and evaluate its performance on representative benchmarks. Our approach achieves notable speedup over serial execution and demonstrates solid scalability.
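
The work-pulling pattern described above, in which workers acquire tasks on their own initiative and tasks can be decomposed further, can be sketched with Python threads. This is only an illustration of the scheduling pattern: a single thread-safe queue stands in for the distributed task repository, the toy task is a range sum rather than a model-checking query, and all names are hypothetical. In standard CPython the global interpreter lock also means this sketch shows the structure, not a speedup.

```python
import queue
import threading


def run_work_pulling(initial_tasks, process, num_workers=4):
    """Run tasks until none remain; `process(task)` returns (result, subtasks)."""
    repo = queue.Queue()              # shared task repository (stand-in, see lead-in)
    results = []
    results_lock = threading.Lock()
    for task in initial_tasks:
        repo.put(task)

    def worker():
        while True:
            task = repo.get()         # each worker pulls work on its own initiative
            if task is None:          # sentinel: no more work will arrive
                repo.task_done()
                return
            try:
                result, subtasks = process(task)
                for sub in subtasks:
                    repo.put(sub)     # publish finer-grained tasks before finishing
                with results_lock:
                    results.append(result)
            finally:
                repo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    repo.join()                       # returns once every task and subtask is done
    for _ in threads:
        repo.put(None)
    for t in threads:
        t.join()
    return results


def split_or_sum(task):
    """Toy task: sum a small range, or split a large range into two subtasks."""
    lo, hi = task
    if hi - lo <= 4:
        return sum(range(lo, hi)), []
    mid = (lo + hi) // 2
    return 0, [(lo, mid), (mid, hi)]


total = sum(run_work_pulling([(0, 100)], split_or_sum))
```

Because idle workers simply take the next available task, load balancing happens without a coordinator assigning work, which is the property the abstract highlights.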

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH897

MACH: Memory-Aware Configuration Generation for Homomorphically Encrypted Neural Networks

Honghui You; Linjie Xiao; Yuhang Fan; Wenzhe Wang; Tianxiang Sui; Zhuoran Ji; Lei Ju

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Fully homomorphic encryption (FHE) enables direct computation on encrypted data that ensures robust privacy protection for machine learning systems. However, the substantial computational overhead of FHE necessitates efficient acceleration techniques for practical deployments. In GPU-accelerated homomorphically encrypted neural network (HENN) inference, we observe that the primary performance bottleneck shifts from computation to memory transfer due to huge memory footprints w.r.t. the limited device memory. To address this, we propose a memory-aware design framework to fully utilize GPU memory by adaptively trading computation for reduced memory footprint. Our framework introduces a static model to estimate memory footprint and latency for HENN inference. We propose an automatic design space exploration framework to generate optimal cryptographic, bootstrapping, and encoding configurations, thereby effectively minimizing execution latency with the given GPU memory capacity. Experiments with homomorphically encrypted ResNet-20 and ResNet-18 across various GPU devices show up to 4.97× speedup with a 91.88% reduction in memory footprint. It also demonstrates the first deployment of full-fledged homomorphically encrypted ResNet-20/18 at 128-bit security level on an RTX 4060 Ti GPU with 16 GB of device memory.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH900

Integrated Timing-Driven Placement for Hybrid-Bonding-Based Face-to-Face 3D ICs

Yuhao Ji; Yunqi Shi; Tianshu Hou; Yuxuan Zhao; Chunyuan Zhao; Peiyu Liao; Zizheng Guo; Chao Qian; Yibo Lin; Bei Yu

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

With the deceleration of Moore's Law, three-dimensional (3D) integration via hybrid bonding enables higher performance. However, existing 3D placers exhibit limitations: pseudo-3D placers suffer from modeling inaccuracies, while true-3D placers neglect timing closure. Prior 3D static timing analysis (STA) lacks accurate parasitic modeling, impeding timing optimization. This paper presents the first timing-driven true-3D placement framework for hybrid-bonding-based integrated circuits. We develop a 3D STA engine with accurate parasitic modeling and integrate a sample-based 3D wirelength model with hybrid weighting for global placement, followed by Lagrangian-based detailed placement. Results demonstrate 30-1002x reduction in TNS and 3-14x reduction in WNS over state-of-the-art placers.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH906

LEAP: A Self-Supervised Per-Cycle Toggle Propagation Model Supports Fast, Transferable, and Early Analysis of Layout Power

Wenkai Li; Yuchao Wu; Ziyan Guo; Yao Lu; Wenji Fang; Mengming Li; Zhiyao Xie

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

Accurate power analysis is critical in VLSI design, as it directly impacts power optimization strategies. However, traditional approaches are often hindered by the substantial runtime required for per-cycle toggle propagation in the netlist, which propagates register toggle information through combinational logic. To address this, we propose LEAP, the first work to enable per-cycle toggle propagation prediction with both high accuracy and efficiency. This is achieved through a novel, linear-complexity graph transformer capable of simulating toggle propagation, along with specially designed self-supervised pre-training tasks that enable the model to capture circuit structure and functionality. LEAP achieves a 7.6x speedup over the EDA tool in toggle propagation, and attains a near-perfect area under the Precision-Recall curve (PR-AUC) of 0.99 for prediction results. Moreover, LEAP can be seamlessly integrated with other machine learning based power models into LEAP‑Power. This integration enables precise per‑cycle layout power prediction directly from post‑synthesis netlists, achieving a mean absolute percentage error (MAPE) of only 4.55%. By bypassing toggle propagation in the netlist, LEAP‑Power delivers substantial runtime gains, running 5.3x faster than the model without LEAP.

EDAEDA4. Power Analysis and OptimizationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH907

DRCGen: A Controllable DRC Violation Case Generator with Cross-PDK Transferability for Rule-Deck and Tool Validation

Shunjie Chang; Yuxuan Dong; Haodong Lu; Youran Wu; Jianli Chen; Jun Yu; Kun Wang

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

VLSI layout pattern generation plays a crucial role in design for manufacturability (DFM). Patterns that incorporate specific design rule checking (DRC) violations are valuable for applications such as foundry design-rule deck development, EDA tool validation, and generating training data for AI-based DRC research. However, existing research primarily focuses on producing DRC-clean layouts by enforcing rule compliance through constraints or filtering, while ignoring the generation of controllable violations. To address this gap, we propose DRCGen, a controllable framework for generating DRC violation patterns with cross-PDK transferability. DRCGen is based on a diffusion model with conditional control, generating structured control hints to encode spatial and semantic violation information for fine-tuning the model. The model can generate violation patterns of user-specified types within a target region based on a natural-language prompt, while maintaining DRC-compliance outside the region. Additionally, we incorporate few-shot learning to facilitate rapid transfer across different PDKs. Experimental results show that, compared to state-of-the-art methods, our approach achieves a 2.27× increase in topological diversity and a 1.02× increase in geometric diversity. Compared to Calibre LSG, we achieve 3.31× and 1.15× improvements in topological and geometric diversity, respectively.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH912

Hades: Harnessing Architecture Design Automation for Application-Specific FHE Accelerators

Silin Liu; Yinghao Yang; Fuping Li; Hang Lu; Xiaowei Li

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

With the advancement of fully homomorphic encryption (FHE), encrypted applications exhibit diverse encryption parameters and computational characteristics. However, existing FHE accelerator architectures are typically fixed and lack the flexibility to efficiently support this diversity. The diversity among FHE applications imposes urgent demands on the flexibility and efficiency of accelerator design. To address this challenge, we propose Hades, an automated framework for FHE accelerator generation. Hades analyzes the dataflow graph and encryption parameters of a given FHE application, establishes a mapping from the application to hardware architecture, and automatically searches for accelerator configurations optimized for the target workload. The automation capability of Hades allows for flexible hardware realization on FPGAs and also provides design guidance for ASIC-based accelerators. We evaluate Hades on a range of FHE applications with diverse characteristics. Experimental results demonstrate that Hades can effectively exploit application-specific features to automatically generate efficient hardware architectures. We highlight the following results: (1) compared with the state-of-the-art FPGA accelerators Poseidon and FAB, Hades achieves 1.99× to 6.58× speedup while reducing resource consumption by 50%; (2) compared with the state-of-the-art ASIC accelerators SHARP and CraterLake, Hades achieves more than 3× speedup; (3) Hades achieves 65%-94% hardware utilization on multiple FHE applications, more than a 2× improvement over manually designed accelerators.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH920

Adaptive Spiking Neural Networks for Real-Time Multi-Object Detection Tasks

Donghwa Kang; Woojin Shin; Cheol-Ho Hong; Byunghoon Kang; Jinkyu Lee; Hyeongboo Baek

Date:Monday, July 27 Location:Mtg Room 203AB Session:Advancing the Frontier of Neuromorphic Learning Systems +1
Abstract

Providing deterministic timing guarantees, beyond merely optimizing the accuracy-latency trade-off, is a mandatory yet unaddressed challenge for ultra-low-power spiking neural networks (SNNs) in resource-limited, safety-critical systems. In this paper, we propose RT-SNN, a novel adaptive SNN method that integrates a system-level scheduling framework for SNN-based multi-object detection and, for the first time, co-optimizes inference accuracy while providing these strict timing guarantees. RT-SNN orchestrates SNN inference at both frame and timestep levels, introducing flexible timestep control and a novel membrane potential reuse mechanism to enhance accuracy without increasing latency. Evaluations on the KITTI dataset show that RT-SNN significantly improves the accuracy and energy efficiency compared to both state-of-the-art SNNs and traditional ANNs. Furthermore, a case study on a ROS-based F1/10 autonomous vehicle testbed demonstrates its real-time efficacy, validating its practical deployment in safety-critical systems.

DesignDES3. Emerging Models of ComputationChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH931

WILL: Write Invalidation Skipping for QLC NAND by Leveraging Data Lifespan and Latency Awareness

Beichen Ning; Yina Lv; Tianyu Ren; Bin Gao; Xiangyu Yao; Jie Zhang; Qiao Li; Chun Jason Xue

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Scaling the Bedrock: OS-Hardware Co-Design for High-Reliability Storage +1
Abstract

3D QLC NAND flash suffers from a slow two-step programming process. The problem is exacerbated by write invalidation on wordlines, which occurs when pages are updated between the two steps, wasting programming effort and degrading performance. While prior work offers partial mitigation, it fails to fundamentally address the root cause: the misalignment of page invalidation time within a wordline. In this paper, we propose WILL, a proactive page mapping scheme that enables write invalidation skipping for QLC NAND flash. The basic idea is to group pages with similar invalidation time into the same wordline and turn partially-invalid wordlines into fully-valid or fully-invalid ones, thereby skipping the programming on fully-invalid wordlines. To achieve this, WILL first predicts page invalidation time, then organizes pages accordingly to maximize the opportunities of write invalidation skipping. Finally, WILL schedules short-lifespan data to wordlines requiring high-latency fine-step programming to increase the probability of skipping high programming latencies for further programming performance enhancement. In addition, read parallelism is maintained by identifying and excluding read-hot data from this mapping. Experimental results show that WILL reduces programming execution time by 10.0% compared with the state-of-the-art, along with skipping an additional 5.03% of fully-invalid wordlines and lowering the partially-invalid ratio by 15.9% on average, at a minimal read performance cost of 2.59%.
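
The basic idea above, grouping pages with similar invalidation times into the same wordline so that whole wordlines become skippable, can be sketched as follows. This is a simplified illustration, not WILL's mapping scheme: it assumes the predicted invalidation times are already available, assumes four pages per QLC wordline, and ignores the read-hot exclusion and fine-step scheduling. All names and values are hypothetical.

```python
def group_into_wordlines(predicted_invalidation: dict[str, int],
                         pages_per_wordline: int = 4) -> list[list[str]]:
    """Sort pages by predicted invalidation time and pack neighbours together."""
    ordered = sorted(predicted_invalidation, key=predicted_invalidation.get)
    return [ordered[i:i + pages_per_wordline]
            for i in range(0, len(ordered), pages_per_wordline)]


def skippable_wordlines(wordlines: list[list[str]],
                        invalidated: set[str]) -> list[list[str]]:
    """Wordlines whose pages are all invalid need no second programming step."""
    return [wl for wl in wordlines if all(page in invalidated for page in wl)]


# Hypothetical predicted invalidation times (arbitrary time units).
predictions = {"p0": 5, "p1": 90, "p2": 7, "p3": 95,
               "p4": 6, "p5": 88, "p6": 4, "p7": 92}
wordlines = group_into_wordlines(predictions)
# wordlines == [["p6", "p0", "p4", "p2"], ["p5", "p1", "p7", "p3"]]

# Suppose the short-lived pages are updated before the second programming step.
skipped = skippable_wordlines(wordlines, {"p0", "p2", "p4", "p6"})
```

With arrival-order placement the four short-lived pages would be spread over both wordlines, leaving each partially invalid; after grouping, one wordline is fully invalid and its second programming step can be skipped.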

SystemsSYS5. Embedded Memory and Storage SystemsQuantumDES3. Emerging Models of Computation +1 more
RESEARCH934

ATOM-3D: Analytical Three-Dimensional Orientation-Aware Mixed-Size Placer

Yutao Wang; Liang Xiao; Bangqi Fu; Lixin Liu; Evangeline Young

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

In this paper, we propose an analytical method for 3D mixed-size placement that incorporates macro-orientation awareness in face-to-face (F2F) terminal-bonded heterogeneous 3D ICs. This work introduces a novel one-pass placement framework that bypasses the conventional 2.5D co-optimization, effectively narrowing the gap between partitioning and placement and achieving significant time savings. Our method constructs a differentiable distributed macro-rotation system that unifies partitioning, placement, and macro-orientation within 3D global placement. To compensate for deviations arising from wirelength estimation in 3D placement, we further apply an FM-based post-partitioner. Experimental results on the ICCAD 2023 contest benchmark show that our approach achieves 8% higher quality than the first-place contest entry and a 2% improvement over the best published method with a 1.6× runtime speedup.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH935

SALUT: Efficient Sparse LU Factorization via Asynchronous Task Triggering on HBM FPGAs

Xin Xu; Zhenhua Wu; Dan Niu; Zhou Jin; Cheng Zhuo

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

Sparse LU factorization is a critical kernel in scientific computing. Given its irregular data dependencies and complex computation patterns arising from inherent high sparsity, its acceleration remains largely unexplored on FPGAs. The high concurrency capability offered by High Bandwidth Memory (HBM) has recently presented new opportunities. Nevertheless, accelerating sparse LU factorization on HBM-equipped FPGAs remains non-trivial due to rigid dependency synchronization and imbalanced load. In this paper, we propose SALUT, a high-performance sparse LU factorization accelerator on HBM FPGAs. An asynchronous task activation mechanism is developed to pre-compile complex runtime data dependency resolution into fine-grained dependency management, reducing synchronization overhead and maximizing computation parallelism. Additionally, we design a hardware architecture facilitating stream-aware parallel scheduling by decoupling burdensome dependency resolution from task activation, eliminating the serial bottleneck of a centralized scheduler. Finally, a locality-aware dual-queue load balancing strategy ensures data locality and high hardware utilization. Evaluations on 15 sparse matrices show that SALUT's geometric mean throughput and energy efficiency surpass the NVIDIA cuDSS solver by 4.0x and 6.5x on an RTX A6000 GPU, and 3.7x and 4.7x on a Tesla V100 GPU, respectively.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH946

UNICON: A Unified Reconfigurable Nonlinear Architecture for Efficient Neural Network Inference

Xiaofeng Zou; Cen Chen; Qinyu Wang; Zhao Liu; Huiping Zhuang; Jingcai Guo

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Nonlinear activation functions (NAFs) are critical to deep neural networks (DNNs), yet their diverse and complex computational forms are inherently hardware-unfriendly, incurring substantial latency, area, and energy overheads. This work presents UNICON, a unified reconfigurable hardware architecture that efficiently supports diverse NAFs through a logarithmic-domain computing paradigm. UNICON uncovers the intrinsic correlations among NAFs and decomposes complex nonlinear operations into lightweight shift–add operations. With modular design and dynamic reconfigurable dataflows, UNICON achieves high functional flexibility and resource efficiency without hardware duplication. As the first algorithm–architecture co-designed solution that unifies diverse NAFs within a logarithmic-domain framework, UNICON gains an average 1.55x speedup, 2.56x energy efficiency, and 2.74x area efficiency over state-of-the-art NAF architecture.

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH951

Exploiting Function-Family Structure in Analog Circuit Optimization

Zhuohua Liu; Kaiqi Huang; Qinxin Mei; Yuanqi Hu; Wei Xing

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Analog circuit optimization is typically framed as black-box search over arbitrary smooth functions, yet device physics constrains performance mappings to structured families: exponential device laws, rational transfer functions, and regime-dependent dynamics. Off-the-shelf Gaussian-process surrogates impose globally smooth, stationary priors that are misaligned with these regime-switching primitives and can severely misfit highly nonlinear circuits at realistic sample sizes (50–100 evaluations). We demonstrate that pretrained tabular models encoding these primitives enable reliable optimization without per-circuit engineering. Circuit Prior Network (CPN) combines a tabular foundation model (TabPFN v2) with Direct Expected Improvement (DEI), computing expected improvement exactly under discrete posteriors rather than Gaussian approximations. Across 6 circuits and 25 baselines, structure-matched priors achieve R² ≈ 0.99 in small-sample regimes where GP-Matérn attains only R² = 0.16 on Bandgap, deliver 1.05–3.81× higher FoM with 3.34–11.89× fewer iterations, and suggest a shift from handcrafting models as priors toward systematic physics-informed structure identification. Our code will be made publicly available upon paper acceptance.
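
As a rough illustration of computing expected improvement exactly under a discrete posterior, the Python sketch below sums the improvement over a finite set of predicted outcomes instead of integrating against a Gaussian. It is one plausible reading of the idea, not the paper's DEI implementation: the function name, the `(values, probs)` representation of the posterior, and the maximization convention are assumptions.

```python
from typing import Sequence


def expected_improvement_discrete(
    values: Sequence[float], probs: Sequence[float], best: float
) -> float:
    """Expected improvement for maximization under a discrete posterior.

    `values[i]` is a possible outcome (for example a bin center of the
    predicted figure of merit) and `probs[i]` its posterior mass. The sum
    is exact for this discrete distribution; no Gaussian fit is involved.
    """
    if len(values) != len(probs):
        raise ValueError("values and probs must have the same length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("posterior mass must be positive")
    improvement = sum(p * max(v - best, 0.0) for v, p in zip(values, probs))
    return improvement / total  # normalize in case probs do not sum to 1


if __name__ == "__main__":
    # Three candidate outcomes; only the last one beats the incumbent of 1.0.
    print(expected_improvement_discrete([0.0, 1.0, 2.0], [0.5, 0.25, 0.25], best=1.0))
```

A candidate with a multimodal or skewed posterior is scored by its actual mass above the incumbent, which a single Gaussian mean and variance cannot represent.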

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH952

DeltaSight: Architecting Efficient Video AIoT Systems via Cross-Stage Consistent Redundancy Elimination

Qinyu Wang; Cen Chen; Jingkai Huang; Yawen Qiu; Hongen Shao; Xiaofeng Zou; Ziqian Zeng

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

The high energy cost of video AIoT systems stems from redundant operations across sensing, transmission, and computation. While prior work optimizes individual stages, the lack of cross-stage coordination forces each stage to re-detect redundancy, limiting overall efficiency. We present DeltaSight, an algorithm–hardware co-design architecture that establishes a sensor-side unified redundancy criterion serving as the common basis for redundancy elimination throughout the pipeline. Algorithmically, we generate this criterion via sensor-side block-level redundancy detection whose output matches the granularity of downstream computation, complemented by a semantic-aware sampling strategy that adapts precision to task relevance. An architecture is designed to support this algorithm with minimal hardware additions. DeltaSight gains 2.4x sensor-side and 1.7x end-to-end energy efficiency, with slight accuracy improvements.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH954

Module-Level Placement Using Force-Based Deformation for Efficient Early Design Estimation

Mengen Chen; Xu He; Tong Shen; Yao Wang

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

In modern very large-scale design, cell-level placement is prohibitively slow and memory-intensive, hindering rapid design-space exploration for integrated circuits. Cell-level placement consistently shows that cells within the same functional module tend to group together, making module-level placement a practical abstraction for fast yet accurate design estimation. To this end, we introduce a module-level placement framework with a force-directed soft-module deformation scheme, where functional modules are modeled as deformable entities that adapt aspect ratios under area preservation. This reduces the size of the problem by several orders of magnitude, while retaining high fidelity to cell-level placement. On industrial-scale benchmarks, our approach achieves a 56x average speedup over conventional placers, incurs only about 3% inter-module HPWL error, and recovers 93% of the top-10% longest nets. These results establish a practical and scalable solution for early-stage placement estimation.
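
The area-preserving deformation of a soft module can be sketched in a few lines of Python. This only shows the geometric constraint (changing the aspect ratio while keeping width × height fixed); the force model that drives the deformation is not described in the abstract, so the function name, the clamping bounds, and the width/height aspect-ratio convention are assumptions.

```python
import math
from typing import Tuple


def reshape_soft_module(
    area: float,
    aspect_ratio: float,
    min_ratio: float = 0.25,
    max_ratio: float = 4.0,
) -> Tuple[float, float]:
    """Return (width, height) for a soft module with a fixed area.

    `aspect_ratio` is width / height and is clamped to [min_ratio, max_ratio]
    (hypothetical bounds). The product width * height always equals `area`.
    """
    if area <= 0 or aspect_ratio <= 0:
        raise ValueError("area and aspect_ratio must be positive")
    ratio = min(max(aspect_ratio, min_ratio), max_ratio)
    width = math.sqrt(area * ratio)
    height = math.sqrt(area / ratio)
    return width, height


if __name__ == "__main__":
    for ratio in (1.0, 4.0, 10.0):
        w, h = reshape_soft_module(100.0, ratio)
        print(f"ratio={ratio}: width={w:.2f}, height={h:.2f}, area={w * h:.2f}")
```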

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH959

Timing-Driven Scheduling and Placement Under Pressure-Path Interference for Fully Programmable Valve Array Biochips

Jingyi Wang; Yuhan Zhu; Youlin Pan; Zhisheng Chen; Zhiwen Yu; Genggeng Liu; Xing Huang

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Where Biology Meets Computation: Next‑Gen Bio‑Cyber‑Physical Innovations +1
Abstract

Fully Programmable Valve Array (FPVA) biochips offer high flexibility and reusability for executing complex biochemical assays. However, their reliability and performance are constrained by the time-sensitive nature of bioassays and the inherent complexity of pressure-driven fluid routing. Existing methods often overlook timing constraints and the effects of pressure paths during early design stages, leading to suboptimal scheduling and placement. To address these limitations, we propose a timing-driven scheduling and placement framework that explicitly incorporates pressure-path interference to optimize fluidic operations, thereby reducing delays and resource overhead. Experimental results demonstrate that the framework significantly decreases bioassay completion time, fluid path length, and delays, enhancing both operational efficiency and system robustness.

DesignDES3. Emerging Models of ComputationQuantumEDA
RESEARCH962

EnergyHDC: Intermittent Hyperdimensional Inference on Energy-Harvesting IoT

Yusheng Liu; Yong-Cheng Liaw; Shuo-Han Chen

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

Ambient energy harvesting technologies offer the promise of perpetual operation for batteryless Internet of Things (IoT) devices; however, their execution is frequently halted by unpredictable power failures. Traditional methods to ensure correctness, such as software checkpointing (e.g., Mementos) and newer systems for deep neural networks (e.g., DynBal), are burdened by substantial energy and latency overheads for progress preservation, limiting their use on ultra-low-power platforms. This paper introduces EnergyHDC, an energy-aware inference system designed for intermittent operation by integrating Hyperdimensional Computing (HDC). The system pairs HDC's algorithmic robustness with fine-grained energy adaptation using three key contributions: (1) an energy-proportional checkpointing that optimizes preservation granularity against available energy; (2) a dimension-first reordering that slashes per-checkpoint storage; and (3) a priority-based pruning that allows for compile-time energy-accuracy trade-offs. Evaluated on an MSP432P401R microcontroller platform under intermittent power, EnergyHDC demonstrates up to 214x speedup compared to Mementos. It also achieves 19.6x faster inference than DynBal under similar accuracy and realistic energy-harvesting conditions. These results validate that a co-design approach, coupling energy-aware execution with HDC's intrinsic robustness, can reframe intermittence from a system constraint into an opportunity for efficient edge intelligence.

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH976

GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving

Qunyou Liu; Darong Huang; Marina Zapater; David Atienza

Date:Monday, July 27 Location:Mtg Room 101B Session:Computing Without Waste: Lean and Scalable Neural Architectures +1
Abstract

Large Language Models (LLMs) are rapidly becoming the backbone of modern cloud services, yet their inference costs are dominated by energy consumption on GPUs. Unlike traditional GPU workloads, LLM inference consists of two distinct stages with different characteristics: the prefill phase, which is latency-sensitive and scales quadratically with prompt length, and the decode phase, which progresses token by token with undetermined length. Current GPU power governors (for example, the NVIDIA default) overlook this asymmetry, treating both phases uniformly. The result is suboptimal voltage/frequency configurations, head-of-line blocking, and excessive energy consumption. We introduce GreenLLM, a service-level objective (SLO)-aware serving framework that minimizes GPU energy by explicitly separating prefill and decode control. At ingress, requests are routed into length-based queues so short prompts avoid head-of-line blocking, tightening time to first token (TTFT). For prefill, GreenLLM collects short traces on a GPU node, fits compact latency–power models over SM frequency, and solves a queueing-aware optimization to pick energy-minimal clocks per class. During decode, a lightweight dual-loop controller tracks throughput (tokens per second) and adjusts frequency with hysteretic, fine-grained steps to hold tail time between tokens (TBT) within target bounds. Across Alibaba and Azure trace replays, GreenLLM achieves up to a 34% reduction in total energy consumption compared to the default DVFS baseline, with no loss of throughput and an increase in SLO violations of less than 3.5%, demonstrating its effectiveness for efficient LLM serving.
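
The hysteretic, fine-grained frequency stepping described for the decode phase can be sketched as a single control step in Python. This is a simplified single-loop illustration, not GreenLLM's dual-loop controller: the function name, the step size, the dead-band width, and the frequency limits are all hypothetical values chosen for the example.

```python
def next_frequency(
    freq_mhz: int,
    tail_tbt_ms: float,
    target_tbt_ms: float,
    step_mhz: int = 15,
    band: float = 0.1,
    f_min: int = 210,
    f_max: int = 1410,
) -> int:
    """One hysteretic control step for the GPU clock during decode.

    - Tail TBT above the target: step the clock up to recover latency.
    - Tail TBT below target * (1 - band): step the clock down to save energy.
    - Otherwise (inside the dead band): hold, which avoids oscillation.
    """
    if tail_tbt_ms > target_tbt_ms:
        return min(freq_mhz + step_mhz, f_max)
    if tail_tbt_ms < target_tbt_ms * (1.0 - band):
        return max(freq_mhz - step_mhz, f_min)
    return freq_mhz


if __name__ == "__main__":
    freq = 1000
    for observed in (55.0, 52.0, 48.0, 40.0, 40.0):
        freq = next_frequency(freq, observed, target_tbt_ms=50.0)
        print(f"tail TBT {observed} ms -> {freq} MHz")
```

The dead band is what makes the controller hysteretic: small fluctuations of tail TBT just under the target do not trigger a clock change.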

AIAI3-I. AI/ML Application and InfrastructureChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH983

ATTINA: Spatial Acceleration of Additive Number Theoretic Transform for Energy-Efficient zkSNARKs at the Edge

Yanze Wu; Devin Park; Md Tanvir Arafin

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

Modern zero-knowledge proof (ZKP) schemes are moving from costly elliptic curve-based systems towards algebraic proof systems. This transition is ushering in a new era of broad adoption of hardware-friendly ZKPs, implementable even on resource-constrained edge devices. Algebraic structures, such as towers of binary fields, are core components of such schemes, and they require hardware and algorithmic innovation to realize advanced ZKPs, such as Binius and Binius-FRI, efficiently. Hence, this paper introduces the ATTINA framework, a high-throughput additive number theoretic transform (ANTT) accelerator operating over extended binary fields. ANTT is the most computationally intensive task for realizing Binius. ATTINA partitions ANTT tasks into multiple subtasks and strategically allocates them to processing elements (PEs) and hardware programmable logic (PL) kernels in an interleaved computing pattern. ATTINA features a scalable architecture that enables dynamic parameter changes for different cryptographic applications. To understand its applicability, this paper implements ATTINA on an edge-deployable heterogeneous Versal adaptive system-on-chip (ASoC) platform. ATTINA on Versal ASoC maintains a throughput of 10 Gb/s. ATTINA outperforms a benchmark CPU implementation of ANTT on a high-end processor (i9-14900K) by 139× and a PE-only implementation by 38× in terms of latency while operating at ≤5 W for a 4096-point ANTT. ATTINA's code is published at https://anonymous.4open.science/r/ATTINA for evaluation and reproducible research.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH986

Procyon: Promoting Fine-Grained Multi-Tenancy to Optimize Sparse Streaming Accelerators

Ubaid Bakhtiar; Jeremy Xu Sha; Helya Hosseini; Bahar Asgari

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

Sparsity has become a defining characteristic of modern workloads, motivating the development of specialized accelerators to mitigate the challenges inherent to sparsity. Sparse streaming accelerators have proved to be an effective solution, yet current designs are restricted to single-workload execution. Additionally, they exhibit significant underutilization in processing elements (PEs) due to their underlying non-zero scheduling strategies. To address these limitations, we propose Procyon, a fine-grained multi-tenancy framework that fuses the PE instruction streams of multiple workloads into a unified execution schedule. This allows instructions from different workloads to execute on the same PE, thereby improving its utilization and enabling concurrent execution of multiple workloads on a single accelerator instance. We evaluate Procyon on an AMD Alveo U55C FPGA using workloads from the SuiteSparse dataset and show that it substantially reduces PE underutilization, resulting in a 3x speedup over state-of-the-art sparse streaming accelerators (Serpens and Chason) and a peak throughput of 61.2 GFLOP/s.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH989

TAMI-MPC: Trusted Acceleration of Minimal-Interaction MPC for Efficient Nonlinear Inference

Zhuoran Li; Hanieh TotonchiAsl; Yifei Cai; Ebrahim Nouri; Danella Zhao

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

Secure multi-party computation (MPC) offers a practical foundation for privacy-preserving machine learning at the edge. However, current MPC systems rely heavily on communication- and computation-intensive primitives, such as secure comparison, for nonlinear operations, which are often impractical on resource-constrained platforms. To enable real-time secure inference, we introduce a highly efficient, TEE-accelerated framework for secure comparison. Specifically, we reduce communication cost by redesigning the core primitives, leaf comparison and merge, so that each completes in a single round of interaction, reducing the round complexity from log(n) to just 1 per operation. Furthermore, unlike prior work that heavily relies on Oblivious Transfer (OT), a well-known computational bottleneck, we leverage synchronized seeds inside the TEE to eliminate OT for the vast majority of our designs, along with a correlated-randomness reuse technique that keeps new designs computationally lightweight. To fully realize the potential, we design a specialized accelerator that restructures the dataflow across stages to enable continuous, fine-grained streaming and high parallelism, reducing memory overhead. Our design achieves up to 4.86x speedup on ResNet-50 inference, compared with state-of-the-art CNN frameworks, and achieves up to 7.44x speedup on BERT-base inference, compared with state-of-the-art LLM frameworks.
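
For readers unfamiliar with the leaf-comparison and merge primitives, the plaintext Python sketch below shows the conventional tree structure that such comparison protocols build on. It is not secure and does not reproduce the paper's single-round construction: values are compared in the clear, and the function names, 32-bit width, and 4-bit leaf size are assumptions. In a tree-based protocol where every merge level needs its own interaction, the number of merge levels returned here is what makes the round count grow logarithmically.

```python
from typing import List, Tuple

Flags = Tuple[bool, bool]  # (less_than, equal) for one chunk or subtree


def leaf_compare(a_chunk: int, b_chunk: int) -> Flags:
    """Compare two small chunks directly."""
    return a_chunk < b_chunk, a_chunk == b_chunk


def merge(high: Flags, low: Flags) -> Flags:
    """Combine results of a more significant and a less significant part."""
    lt_high, eq_high = high
    lt_low, eq_low = low
    return lt_high or (eq_high and lt_low), eq_high and eq_low


def less_than(a: int, b: int, bits: int = 32, chunk_bits: int = 4) -> Tuple[bool, int]:
    """Return (a < b, number of merge levels) using a balanced merge tree."""
    if bits % chunk_bits != 0:
        raise ValueError("bits must be a multiple of chunk_bits")
    mask = (1 << chunk_bits) - 1
    # Leaves are ordered from the most significant chunk to the least.
    nodes: List[Flags] = [
        leaf_compare((a >> shift) & mask, (b >> shift) & mask)
        for shift in range(bits - chunk_bits, -1, -chunk_bits)
    ]
    levels = 0
    while len(nodes) > 1:
        merged = [merge(nodes[i], nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:
            merged.append(nodes[-1])  # odd node carries over unchanged
        nodes = merged
        levels += 1
    return nodes[0][0], levels


if __name__ == "__main__":
    print(less_than(5, 9))
    print(less_than(123456, 123455))
```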

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH999

GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

Zhiye Song; Steve Dai; Ben Keller; Brucek Khailany

Date:Monday, July 27 Location:Mtg Room 101B Session:Computing Without Waste: Lean and Scalable Neural Architectures +1
Abstract

Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve 1.87× and 2.37× speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).

AIAI3-I. AI/ML Application and InfrastructureChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1000

Can Asymmetric Tile Buffering Be Beneficial?

Chengyue Wang; Qiran Pang; Xinrui Wu; HyeGang Jun; Luis Romero; Endri Taka; Diana Marculescu; Tony Nowatzki; Pranathi Vasireddy; Joseph Melber; Deming Chen; Jason Cong

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

General matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input 𝐴 along the dimension 𝑀 matches the output tile size of 𝐶. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2™ AI Engine (AIE), achieving up to a 4.54× speedup, from 4.8 to 24.6 TFLOPS on mixed-precision BFP16–BF16 GEMM, establishing a new performance record for XDNA2™ AIE.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH1011

Towards Fast and Robust Split Federated Learning over Satellite-Based Computing Networks

Yuxin Zhang; Haoyu Chen; Zheng Lin; Wenjun Zhu; Ju Ren; Jin Zhao; Yue Gao; Zhe Chen

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

Training satellite-edge deep learning models remains constrained by limited onboard resources and high data download latency. Although split federated learning (SFL) offers a potential solution through model partitioning, its convergence and robustness are fundamentally compromised by the intermittent and asymmetric satellite–ground links. To address these issues, we introduce SatSFL as a novel SFL system for satellite-based computing networks. SatSFL employs an interpolated gradient approximation to emulate ground feedback during disconnections, markedly accelerating convergence while maintaining robustness under heterogeneous data. In addition, we design adaptive uplink compression under asymmetric bandwidth to ensure that balanced and critical gradients are reliably transmitted back to satellites. We implement and evaluate SatSFL on real-world LEO satellite systems and datasets, demonstrating superior accuracy and convergence speed compared to state-of-the-art methods.

AIAI5-I. AI/ML System and Platform Design
RESEARCH1016

A Fully Analog Continuous-Time CIM Neural ODE Solver for Flow-Matching-Based Fluid Dynamics Generation

Songqi Wang; Meng Xu; Jichang Yang; Zhexu Chen; Hegan Chen; Sishuo Liu; Xinyuan Zhang; Kwun Hang WONG; Ning Lin; Yi Li; Zhongrui Wang; Han Wang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

The generation of fluid-dynamics fields is essential for understanding complex nonlinear systems and enabling real-time scientific computing. Conventional computational fluid dynamics pipelines rely on finite-element or finite-volume solvers on von Neumann architectures, which discretize continuous physical evolution into many iterative updates, leading to prohibitive latency and energy consumption. Inspired by neural dynamical systems in the brain, we propose a biologically inspired continuous-time hardware–software co-design framework for flow-matching–based turbulent-flow generation. (1) The flow-matching model adopts an MLP-Mixer architecture that emulates cortical-style information integration and hierarchical signal mixing, providing a compact backbone that naturally aligns with closed-loop analog computation. (2) A fully analog continuous-time RRAM CIM neural ordinary differential equation (ODE) solver is developed to physically realize neural-like continuous-time latent dynamics, enabling high-speed and low-power flow generation. (3) Noise-aware training and decoder retraining are jointly introduced to ensure robust generation quality in the presence of RRAM read/write noise. Experiments on three turbulent-flow datasets show that the MLP-Mixer backbone matches convolutional and attention-based flow-matching models in velocity-field accuracy while mapping efficiently to CIM hardware, and that the proposed analog ODE solver reduces energy consumption by 98.24% and latency by 99.99% compared with an NVIDIA A100 GPU, while maintaining stable generation fidelity under realistic RRAM read/write noise. This work establishes a new paradigm for high-speed, energy-efficient physical process generation and scientific AI acceleration using neuromorphic continuous-time CIM computing.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH1028

CtxMem: OS-Hardware Co-Designed Tiered Memory with DRAM-SSD Hybrid CXL Devices

Longyi Zhou; Kan Zhong; Haorui He; Zhenhua Tan; Duo Liu

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Scaling the Bedrock: OS-Hardware Co-Design for High-Reliability Storage +1
Abstract

The "memory wall" problem continues to constrain modern applications, making scalable memory expansion via Compute Express Link (CXL) increasingly critical. While DRAM-Flash hybrid CXL systems offer a capacity solution, their performance remains hampered by inefficient access models. This paper presents CtxMem, an OS-hardware co-designed memory expansion system built on byte-addressable CXL interconnect and inexpensive flash storage. Unlike existing approaches, CtxMem introduces an asynchronous page fault-based mechanism that offloads the entire data plane of page fault handling to the CXL device. This is achieved through three key innovations: a dynamic direct-mapped cache that maximizes DRAM utilization, a lightweight hardware profiling unit enabling accurate and rapid hot/cold page identification with minimal overhead, and an OS-hardware co-designed page fault handling procedure that efficiently offloads data migration. Our gem5-based evaluation shows CtxMem outperforms conventional CXL-SSD by 1.4x in execution time and doubles system throughput, demonstrating efficient large-scale memory expansion.

SystemsSYS5. Embedded Memory and Storage SystemsQuantumDES3. Emerging Models of Computation +1 more
RESEARCH1036

NEAT: A Neutral-Atom Transpiler for Joint Mapping and Scheduling of Syndrome Extraction Circuits

Dingchao Gao; Kai Zhang; Sanjiang Li; Shenggang Ying; Fangming Liu; Jianxin Chen

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Charting the Architecture of Tomorrow’s Fault Tolerant Quantum Machines by Harmonizing Codes to Systems +1
Abstract

Quantum Error Correction (QEC) demands efficient implementation of syndrome extraction circuits. However, existing compilers for neutral-atom processors largely miss the opportunity to co-optimize these circuits by exploiting both the structural properties of Quantum Error-Correcting Codes (QECCs) and the physical constraints of neutral-atom architectures. In this work, we introduce NEAT, an SMT-based compiler that jointly optimizes qubit mapping and syndrome extraction scheduling for a broad class of stabilizer-based QECCs, achieving depth-optimal execution with minimal shuttling overhead on neutral-atom platforms. Across a wide range of QECCs, NEAT consistently achieves near-optimal circuit depth and reduces atom movement by 3×–30× compared with the baseline compiler Enola. Logical-level simulations further demonstrate 2×–20× lower logical error rates under realistic hardware noise. A hierarchical symmetry-breaking formulation and relaxed parallel-motion constraints substantially improve solver scalability, yielding up to 100× speedup in compilation time. Together, these results show that NEAT produces depth-optimal, movement-efficient, and logically robust syndrome extraction schedules, while scaling effectively to large QECCs on neutral-atom hardware.

DesignQuantumDES6. Quantum ComputingDES3. Emerging Models of Computation +1 more
RESEARCH1037

Seeing the Unseen: Reverse Engineering Hidden PMU Events by Differential Analysis and Simulation Validation

Yihao Yang; Pengfei Qiu; Xiaojie Zhang; Yihao Deng; Chunlu Wang; Gang Qu; Dongsheng Wang

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

The Performance Monitoring Unit (PMU) is a significant hardware module of modern processors. Previous studies have disclosed numerous hidden PMU events. However, their underlying microarchitectural semantics remain unexplored, which largely limits their utilization. This paper proposes a reverse engineering framework for deeply understanding the microarchitectural semantics of hidden PMU events. For demonstration, we successfully utilize the framework to reveal microarchitectural behaviors of branch-related events and discover a hidden pattern of UMask Combination in Intel CPUs. Based on the reverse-engineering results, we implement a PMU-based covert channel for Spectre-v1 in SGX and enable a high-precision website fingerprinting attack under resource-constrained conditions.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH1042

Gearls: Generalizable Reinforcement Learning Framework for Logic Optimization via Policy Similarity Metric

Yusen Qin; Jiaqi Lyu; Zhi Li; Peng Cao

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

Logic optimization is a crucial step in digital circuit synthesis, directly impacting the final power, performance, and area (PPA) of integrated circuits. While reinforcement learning (RL) has shown promise in generating high-quality optimization flows, its limited generalization ability necessitates retraining for each new circuit, hindering practical deployment. To address this, we propose a novel RL-based framework that achieves strong zero-shot generalization across unseen circuits. Our work introduces three key innovations: (1) logic cone extraction to reduce input complexity and enable efficient reward estimation; (2) a cut-weighting mechanism that models global timing effects from local subgraphs; and (3) the integration of Policy Similarity Metric (PSM) to enhance state representation and improve zero-shot transfer. Evaluated on a set of unseen benchmark circuits, our work outperforms state-of-the-art methods, achieving 31.7% lower worst negative slack (WNS) and 32.5% lower total negative slack (TNS), while running in only 13% of the time required by prior approaches. This work demonstrates that generalizable RL can enable fast, high-quality logic optimization without circuit-specific retraining.

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH1047

GenDSE: Generative CPU Design Space Exploration via Sequential Decision Making

Meng Wu; Mingyu Yan; Duo Wang; Shijun Zhou; Guangyu Sun; Xiaochun Ye

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Interconnect-Driven System Architecture Innovation +1
Abstract

The prevailing surrogate-guided paradigm for CPU design space exploration (DSE) suffers from a fundamental credit assignment problem, leading to high sample complexity. This work introduces GenDSE, which reformulates DSE as a sequential decision process. Its contributions are: (1) A Markov decision process that decomposes design configuration generation into context-dependent steps. (2) A GFlowNet-based configuration generator for implicit credit assignment, linking individual decisions to final outcomes. (3) Progressive Sketching, a training paradigm to overcome GFlowNet's data hunger. Experiments show GenDSE reduces simulation costs by 90% on average while achieving superior solution quality.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageQuantum +1 more
RESEARCH1052

Fast Template Matching for Quantum Circuits Using Hypergraph

Xin Hong; Jintao Yu; Runhong He; Shenggang Ying; Sanjiang Li; Mingsheng Ying

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Pioneering the Foundations of Quantum Circuit Synthesis, Compilation and Algorithmic Optimization +1
Abstract

Quantum circuit optimization is a key step toward the efficient execution of quantum algorithms. Template matching has emerged as one of the dominant approaches for simplifying quantum circuits, yet it confronts the intrinsic challenge of accommodating gate commutativity. The state-of-the-art template-matching algorithm relies on a directed acyclic graph (DAG) representation. While this DAG-based technique handles gate commutativity satisfactorily, it fails to preserve the local connectivity inherent to quantum circuits, thereby leading to relatively high matching complexity. In this paper, we introduce a hypergraph representation (HG), a commutativity-aware representation that collapses commuting gates into a single super-node while retaining all local connectivity. This enables matches to be extended locally without revisiting the rest of the circuit, with incremental updates limited to the immediate neighborhood. Experimental results demonstrate that our HG matcher achieves an 11–606x speedup over the DAG-based implementation in Qiskit (Iten et al., 2022) on both random and arithmetic benchmarks, while maintaining the same optimization quality. The acceleration increases with circuit size, confirming that preserving locality and connectivity is the key to scalable quantum circuit optimization.

DesignQuantumDES6. Quantum ComputingAI
RESEARCH1054

Icarus's Wings: Disabling MoE Offloading Acceleration via a Universal Hidden Prefix Attack

Qichao Ma; Zecheng Hao; Zihao Zheng; LING LIANG; Xing Hu; Zhaofei Yu; Tiejun Huang

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Deploying MoEs for multi-user inference demands low latency with tight cost. MoE acceleration strategies cache and prefetch experts, assuming temporal locality and predictable routing. When these fail, wrong experts inflate latency, enabling DoS. We expose this vulnerability in GPU-centric MoE offloading and present Icarus, a gradient-based universal attack injecting an adversarial prefix embedding to disable such acceleration. Icarus combines Temporal-Locality Minimization and Expert-Prefetch Misleading to perturb decoding, plus a scheduler balancing targets with active exploration. Across models/devices, Icarus raises cache replacements by 85.4%, cuts prefetch accuracy by 12.5%, slows decoding to 0.7×, and drives outputs to maximum length.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1061

OpenSUN: An Open Platform for Exploring Scale-Up Network Systems

Yiqi Chen; Bizhao Shi; Tao Qian; Ying Liu; Xiaotong Sun; Mingtao Han; Cheng Zhang; Guojie Luo; Guangyu Sun; Zhe Zhou

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

The deceleration of Moore's Law, combined with the rising computational demands of big-data applications and generative AI, is accelerating the development of scale-up systems. Unlike traditional scale-out architectures, scale-up systems typically extend shared memory across multiple nodes using high-bandwidth, load/store-based interconnects such as CXL, UALink, NVLink, and Unified-Bus. Designing and optimizing such interconnected systems poses a multi-faceted challenge spanning computer architecture, interconnect technology, and software systems. Although industry has introduced numerous commercial solutions, the academic community still lacks an open-source, hardware-based platform to support cross-stack research, such as exploring new protocols, topologies, I/O controller or switch architectures, and system-level software optimizations. To address this gap, we introduce OpenSUN, a novel FPGA-based multi-node system that supports customizable architecture and software systems. Its baseline architecture integrates a full-stack design, encompassing I/O controllers, switches, protocol stacks, and software systems. These components are all accessible through user-friendly interfaces at each layer to enable agile prototyping of novel hardware and software designs. By providing this open, community-accessible platform, OpenSUN aims to foster exploration and innovation in next-generation scale-up interconnect systems.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH1065

FLASH3D: A Fast Layered Analytical Solver for High-Accuracy Steady-State Thermal Simulation of 3D ICs

Hao Ai; Liang Chen; Jianhua Zhang; Wenxing Zhu

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

High-accuracy thermal simulation is essential for modern 3D integrated circuits (ICs), but its high computational cost often hinders early-stage, thermal-aware design. To address this, we propose FLASH3D, a fast and versatile analytical simulator for 3D steady-state thermal analysis. FLASH3D integrates spectral modal decomposition, the transfer matrix method, and an accelerated power decomposition algorithm to compute 3D temperature distributions efficiently and accurately. Compared with COMSOL, FLASH3D achieves over four orders of magnitude speedup, reducing computation time from minutes to milliseconds while maintaining a maximum absolute error below 0.5 K. Compared to the state-of-the-art machine learning (ML) method DeepOHeat, within a single inference time, FLASH3D can compute the temperature distribution of roughly 2000 slices and attains approximately 10× lower error. Furthermore, FLASH3D supports complex boundary conditions and fine-grained power maps, including curved-edge and standard-cell-level distributions, overcoming the limitations of conventional analytical methods. These features make FLASH3D an efficient, reliable, and scalable tool for early-stage thermal-aware design, providing a solid foundation for thermal optimization of large-scale 3D ICs.

EDAEDA4. Power Analysis and OptimizationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1074

PHAROS: Pipelined Heterogeneous Accelerators for Real-Time Safety-Critical Systems With Deadline Compliance

Shixin Ji; Jinming Zhuang; Sarah Schultz; Zhuoping Yang; Xingzhen Chen; Zheng Dong; Alex Jones; Yihui Ren; Peipei Zhou

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

Spatially partitioned heterogeneous accelerators (HAs) are increasingly adopted in embedded systems for their performance and flexibility. Yet most existing HA design frameworks optimize primarily for throughput or quality-of-service (QoS) metrics and often overlook safety-critical real-time requirements, including hardware support for predictable execution, real-time-aware design space exploration (DSE), and rigorous schedulability analysis. These requirements are essential in safety-critical applications such as smart transportation, where timing guarantees directly affect system safety. To address this gap, we present PHAROS, a real-time-centric HA design framework. PHAROS introduces preemption mechanisms and scheduler designs for spatially partitioned HAs under FIFO and earliest-deadline-first (EDF) policies. Leveraging modern real-time theory, we further develop a soft real-time (SRT) schedulability-oriented DSE with objectives and constraints tailored to timing correctness. Through comprehensive modeling, analysis, and evaluation across diverse applications, we show that PHAROS's schedulability-oriented DSE discovers more feasible configurations for a broader range of task sets than throughput-oriented DSE baselines, while delivering improved real-time performance. We also provide response-time analyses for the supported scheduling algorithms.
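
The two scheduling policies named in this abstract differ only in how the next ready job is chosen. The sketch below shows that difference in isolation; the `Job` fields, the job names, and the timing values are hypothetical, and the code does not model PHAROS's preemption hardware or its schedulability analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    name: str
    arrival: float   # release time of the job
    deadline: float  # absolute deadline of the job


def pick_fifo(ready: List[Job]) -> Job:
    """FIFO: run the job that was released first."""
    return min(ready, key=lambda job: job.arrival)


def pick_edf(ready: List[Job]) -> Job:
    """EDF: run the job whose absolute deadline is closest."""
    return min(ready, key=lambda job: job.deadline)


# Hypothetical ready queue: the later arrival has the tighter deadline.
ready_queue = [
    Job("camera_frame", arrival=0.0, deadline=30.0),
    Job("lidar_sweep", arrival=1.0, deadline=10.0),
]
print("FIFO picks:", pick_fifo(ready_queue).name)
print("EDF picks: ", pick_edf(ready_queue).name)
```

With this queue the two policies disagree: FIFO serves the earlier arrival, while EDF serves the job with the nearer deadline.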

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH1080

AccDRC: FPGA Acceleration for VLSI DRC with Cell-Aware Layout Partitioning

Ruping Zhou; Zexu Zhang; You Hu; Jiahao Wang; Haodong Lu; Genggeng Liu; Jianli Chen; Kun Wang

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Design Rule Checking (DRC) is a critical yet computation-intensive stage in modern very large scale integration design. As processes evolve, the high computational cost of DRC has become a significant bottleneck for design efficiency. Current acceleration approaches do not fully leverage the parallelism of DRC, resulting in limited performance gains. To address this challenge, we propose AccDRC, an FPGA-accelerated DRC based on software–hardware co-design. On the software side, we design a cell-aware partitioning strategy with a data preparation and task encapsulation mechanism, which reorganizes layouts into balanced task units tailored for FPGA processing. On the hardware side, we implement an acceleration architecture consisting of a locality-preserving data-loading module, a unified and reconfigurable check core, and a sparse result writeback module. This architecture exploits DRC's locality, structural commonality across rules, and sparse violation outcomes, enabling high-throughput dataflow execution with multi-level parallelism. Experimental results show that AccDRC achieves a 522.11x to 1071.03x speedup over the CPU-based DRC tool KLayout, and a 9.62x to 26.07x speedup over the state-of-the-art GPU-based DRC tool OpenDRC.

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1081

KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning

Zebin Yang; Tong Xie; Baotong Lu; Shaoshan Liu; Bo Yu; Meng Li

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Scaling Intelligence for the Physical World +1
Abstract

We propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. To reduce cache invalidation caused by memory updates, we propose Static-Dynamic Memory Construction, which groups memory by update frequency and manages memory at different granularities. To reduce the impact of cross-attention ignorance between different groups, we propose Multi-hop Memory Re-computation, which dynamically identifies and recovers critical memory interactions through iterative memory importance propagation. We also propose Layer-balanced Memory Loading, a scheduling strategy that eliminates unbalanced KV loading and computation overhead between different layers caused by KV recomputation.

SystemsSYS1. Autonomous Systems (Automotive, Robotics, Drones)AIQuantum
RESEARCH1085

DM-MARK: Software and Hardware Co-Design of Watermarking Accelerator for Authorized Diffusion Model Usage on Edge Devices

Shiyu Guo; Jie Gu

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

With the rapid growth of generative models like diffusion models (DMs), distinguishing authentic images from forged ones has become increasingly challenging, raising privacy, security, and ethical concerns. Watermarking offers an effective solution for authentication and traceability of generated images. Unlike traditional methods, emerging watermarking techniques for DMs enable marking AI-generated content with resilience to commonly used watermark weakening or erasing techniques. However, these methods demand high computational resources and latency, posing challenges for practical use, especially on edge devices. This work presents DM-MARK, a software-hardware co-optimized diffusion framework supporting efficient watermark generation and reverse detection. DM-MARK is implemented in 12nm FinFET technology, achieving robust watermarking with improved quality and reduced overhead. Evaluations show 11.33% higher detection accuracy, 18× latency speedup over GPU, 3.1× on-chip memory savings, and an average 4.56× EMA reduction. It also achieves 212.5× speedup over the baseline ASIC design with negligible accuracy loss. The proposed DM-MARK scheme offers a scalable and practical solution for protecting AI-generated content in real time on edge devices.

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH1095

STEM-CAM: Segmented Thermometer-Encoded Multibit CAM with Wildcard Matching for Efficient Arbitrary-Order Minkowski Distance Metrics

Zeyu Zhang; Weikai Xu; Changyue Hao; Qianqian Huang; Ru Huang

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

Minkowski distance metrics (Lₚ) are widely used in AI. For hardware, digital designs incur limited parallelism and high energy/latency. Alternatively, content addressable memory (CAM) supports large-scale parallel distance metrics. However, existing CAM-based designs mainly support L₀/L₁, with limited scalability/configurability. In this work, we propose a segmented thermometer-encoded multibit CAM (STEM-CAM) architecture, enabling arbitrary-order Lₚ metrics. Segmenting distance norms with wildcard-augmented thermometer encoding, STEM-CAM enables a structured, scalable encoding for Lₚ. Moreover, an ultra-compact FeFET-based encoding-aware multibit CAM cell is proposed, providing efficient, high-precision implementation. Experiments demonstrate high throughput and area/energy efficiency for Lₚ, showing great potential for distance-based AI.
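
As background on why thermometer codes pair naturally with CAM-style mismatch counting, the sketch below checks a standard property: the Hamming distance between two plain thermometer codes equals the L₁ distance between the encoded values. The function names and the 7-level range are illustrative assumptions. The paper's segmented, wildcard-augmented encoding for arbitrary-order Lₚ is not described in enough detail in the abstract to reproduce, so it is not attempted here.

```python
from typing import List


def thermometer(value: int, levels: int) -> List[int]:
    """Encode value in [0, levels] as `value` ones followed by zeros."""
    if not 0 <= value <= levels:
        raise ValueError("value must lie in [0, levels]")
    return [1] * value + [0] * (levels - value)


def hamming(a: List[int], b: List[int]) -> int:
    """Count mismatching bit positions, as a CAM row would when searched."""
    return sum(x != y for x, y in zip(a, b))


# Stored value 5 and query value 2 on a hypothetical 7-level scale.
stored = thermometer(5, levels=7)
query = thermometer(2, levels=7)
print(stored, query, hamming(stored, query))  # mismatch count equals |5 - 2|
```

The mismatch count is exactly the number of positions where one code has switched to zero and the other has not, which is the absolute difference of the two values.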

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH1096

Is 2D Gaussian Enough? A Sort-Free 3D Gaussian Splatting Inference Architecture with Geometry Guidance

Yuanfang Wang; Yu Li; Runzhou Zhang; Jun Yu; Kun Wang

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

3D Gaussian Splatting (3DGS) has emerged as the state-of-the-art (SOTA) technique for novel view synthesis. This paper introduces an algorithm that employs a two-stage pipeline. First, a 2D Gaussian (2DGS) model is rendered via the nearest Gaussian's color. Subsequently, this depth map is utilized for the second stage: a sort-free, weighted rendering of the complete 3DGS model. We further propose CODA, a unified hardware accelerator co-optimized to execute this 2D-3D hybrid pipeline. Experimental results demonstrate that CODA exhibits an 8.4× to 11.6× speedup over the NVIDIA RTX 3060M and a 2.7× to 3.9× speedup over the NVIDIA RTX 4090.

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH1101

HYPER-CIM: Hierarchical Predictive Exploration and Realizable Design Flow for High-Efficient Digital CIM

An Guo; Yuhui Shi; Yuhuai Zhang; Tianhui Jiao; Xiaoxue Zhong; Defa Wu; Arindam Basu; Xin Si; Jun Yang

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Computing-in-Memory (CIM) is a promising solution to the memory wall, yet most prior studies optimize only one level—macro, accelerator, or architecture—rather than the full stack. This work presents HYPER-CIM, a hierarchical predictive exploration and realizable design flow that integrates all three levels. We build a scalable, fully digital CIM template in 28 nm with tunable accumulation length, local storage length, precision, parallel channels, and pipeline depth. More than 8k silicon-consistent design points were used to train a multi-head hierarchical circuit-optimization (MHCO) surrogate model, which predicts power, performance, and area (PPA) across 297M configurations. The resulting CIM "white-box" model offers circuit-faithful visibility for architecture-level design-space exploration (DSE) and Pareto search. Guided by this flow, we fabricated and silicon-verified four processing-element (PE)-flow CIM macros and one cross-level-flow CIM (CF-CIM) macro in 28 nm CMOS technology. Based on chip test results, the best-performing macro achieves 90.8 TOPS/W and 1.23 TOPS/mm², yielding a figure-of-merit (FoM) improvement of 31.12×–3.8×10⁶× over prior CIM designs. Under identical specifications, the CF-CIM improves energy efficiency from 52.13 TOPS/W to 67.9 TOPS/W and area efficiency from 0.41 TOPS/mm² to 0.51 TOPS/mm².

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1110

LLM-Based Automatic Architecture Optimization for AI Model HLS Implementation on FPGA

Zhe Xiao; Mingyu Liu; Xu He; Haoying Wu; Jie Zhao; Yao Wang; Libo Huang

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

High-Level Synthesis (HLS) plays a pivotal role in AI accelerator design, but significant challenges remain in achieving full automation, particularly in areas like module partitioning, dataflow orchestration, and throughput maximization. Current solutions, such as compiler-based optimizations and large language models (LLMs), face limitations in adaptability, system-level optimization, and handling complex computational models. We propose a novel LLM-powered framework that automates the generation and optimization of system-level HLS architectures. The framework generates modular dataflow architectures, refines them through pattern-based optimizations, and produces synthesizable hardware implementations, demonstrating an average performance improvement of 11.78× over current SOTA approaches.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1113

Achieving Intra- and Inter-Operator Communication Lower Bounds in Near-Memory Processing

Lei Xu; Chen Yin; Zelong Yuan; Weiguang Sheng; Jianfei Jiang; Qin Wang; Naifeng Jing

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

Near-memory processing (NMP) is a promising way to overcome the memory wall in large language models (LLMs). However, dataflow optimization in NMP is fundamentally constrained, as existing analyses cannot efficiently handle the new distributed vault/channel organization. We propose the Remote-Access-Free (RAF) dataflow, which uses tensor rotation to abstract this distributed organization and eliminate all intra-operator remote access. On top of RAF, we apply analytical optimization to minimize intra-operator local access, thereby achieving the intra-operator communication lower bound. We then introduce data partitioning that removes inter-operator remote access and enable operator fusion to minimize inter-operator local access, so that the inter-operator communication lower bound is also reached. Experimental results show that RAF reduces energy by 50.4%, 39.0%, and 37.8%, and delivers speedups of 3.98×, 1.72×, and 1.57× over IANUS, H^2LLM, and OptiPIM, respectively.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH1138

ARTEMIS: Adaptive RL Test-Time Ensemble for Model Inference in SSD Testing

Kyunghwan Son; Hyukil Kwon; Nayeon Kim; Sohyun Kim; Hyelyun Kim; Sunghee Lee; Jinje Park; Hyungdal Kwon; Yoon Hyeok Lee

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Solid-State Drives (SSDs) now power data centers, high-performance computing, and artificial intelligence workloads, but the increasing complexity of modern SSD controllers has surpassed the capabilities of traditional rule-based verification, necessitating scalable, data-driven testing methods. In this paper, we present ARTEMIS, a reinforcement-learning (RL)-based framework for automatic SSD test case (TC) generation that operates directly on commercial devices. Our contributions are threefold. First, we formulate a novel RL problem, enabling seamless deployment across heterogeneous commercial products without requiring any device-specific customization. Second, to cope with the highly dynamic and non-stationary operating environments of SSDs, we introduce an ensemble-based inference mechanism that aggregates policies learned under diverse workload distributions, thereby improving generalization and robustness. Third, we validate the approach on two representative stress-testing tasks: Meta and User Garbage Collection. We show that the RL-driven tester discovers high-impact TCs that trigger critical blocking garbage collection events more frequently than conventional methods in diverse tasks.

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1139

From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU

Jie Zhang; Jiapeng Guan; Hao Zhou; Xiaomeng Han; Tinglue Wang; Ran Wei; Zhe Jiang

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

BFP is emerging as an attractive data format for edge NPUs, combining wide dynamic range with high hardware efficiency. However, its behavior under hardware faults and its suitability for safety-critical deployments remain largely underexplored. Here, we present the first in-depth empirical reliability study of BFP-based NPUs. Using RTL-level fault injection on NPUs, our bit- and path-level analysis reveals pronounced heterogeneous vulnerabilities and shows that the conventional end-to-end check becomes largely ineffective under nonlinear block scaling. Guided by these insights, we design a fault-tolerant BFP-based NPU microarchitecture that aligns the BFP computational semantics with reliability constraints. The design uses a row/column-wise blocking strategy to decouple the fixed-point mantissa computations from the scalar exponent path, and introduces ultra-lightweight protection mechanisms for each. Experimental results demonstrate that our design achieves near–dual modular redundancy reliability with only 3.55% geometric mean performance overhead and less than 2% hardware cost.
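
For context on the data format, here is a minimal sketch of block floating point (BFP) in its generic form: one shared exponent per block plus a signed fixed-point mantissa per element. The function names, the 8-bit mantissa width, and the rounding and clamping choices are assumptions made for illustration; this is not the paper's NPU datapath or its protection scheme.

```python
import math
from typing import List, Tuple


def bfp_quantize(block: List[float], mantissa_bits: int = 8) -> Tuple[int, List[int]]:
    """Quantize a block to (shared exponent, signed integer mantissas)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1; e is the shared exponent.
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (mantissa_bits - 1 - shared_exp)
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to nearest and clamp into the signed mantissa range.
    mantissas = [max(-limit, min(limit, round(x * scale))) for x in block]
    return shared_exp, mantissas


def bfp_dequantize(shared_exp: int, mantissas: List[int], mantissa_bits: int = 8) -> List[float]:
    """Reconstruct approximate values from a BFP block."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


exp, mants = bfp_quantize([1.0, 0.5, -0.25, 0.01])
print(exp, mants, bfp_dequantize(exp, mants))
# A corrupted shared exponent rescales every element of the block at once.
print(bfp_dequantize(exp + 1, mants))
```

The last line hints at why the exponent path and the mantissa path can have very different fault sensitivity: an error in the shared exponent affects the whole block, while an error in one mantissa affects a single element.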

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH1152

Noisexplore: A Device-Aware Hierarchical Noise Modeling and DTCO Framework for Analog In-Pixel Computing Architectures

Jinghan Xu; Zheng Zhou; Yi Xiao; Nan Tang; Shuhan Wang; Ligong Zhang; Xiaoyan Liu

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Analog in-pixel computing (IPC) enables energy-efficient edge AI but faces critical noise challenges limiting computing accuracy. Existing design flows lack comprehensive device-aware noise modeling and cross-layer optimization. We introduce NoiseXplore, the first end-to-end design-technology co-optimization framework integrating: (1) physics-grounded noise models (thermal, shot, RTN, device mismatch); (2) device-aware SNR estimation and neural network co-simulation; (3) circuit-level performance modeling integrated into NeuroSim; and (4) automated Bayesian DTCO for multidimensional design space exploration. Demonstrated on noise-sensitive 1T FDSOI-based IPC, optimized configurations achieve 50 fps with 86.8% CIFAR-10 classification accuracy on a three-class subset (airplane/car/ship).

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1156

FUSE-TCAD: A Diffusion-Based Multi-Field Surrogate for High-Fidelity TCAD Device Simulation

Xulin Zhang; Yupeng Hu; Zhuoran Song; Shucheng Huang; Qi Sun

Date:Monday, July 27 Location:Mtg Room 202AB Session:Generative AI and Graph-Based Learning for Next-Gen Circuit & Device Simulation +1
Abstract

Technology Computer-Aided Design (TCAD) solves coupled partial differential equations (PDEs) to obtain spatially resolved physical fields that are essential for comprehending device behavior. However, the computationally intensive numerical solution procedures in traditional TCAD make the iterative simulation-optimization loop prohibitively costly. Existing machine learning surrogates typically regress scalar metrics directly from device conditions, consequently discarding the field-level physical information crucial for further analysis. To address this challenge, we propose FUSE-TCAD, a diffusion-based surrogate that generates the joint distribution of physical fields while supporting continuous control over geometry and bias through conditional injection mechanisms. By leveraging the probabilistic generative formulation of diffusion models, FUSE-TCAD captures the underlying statistics of coupled physical fields and preserves cross-field consistency, thereby achieving highly accurate field predictions across diverse device conditions. A physics-aware Sobolev-edge regularization strategy enforces gradient consistency to ensure high fidelity in junction regions during sampling. On SOI-FET datasets, FUSE-TCAD demonstrates robust transferability and produces high-fidelity fields using only 20% target-domain data through efficient transfer learning. Extensive experiments demonstrate that FUSE-TCAD achieves more than 30× speedup over commercial TCAD tools and maintains relative error within 0.8%, thereby supporting scalable, near-real-time device design exploration.

EDAEDA6. Analog CAD, Simulation, Verification and TestChipletSYS4. Embedded System Design Tools and Methodologies
RESEARCH1180

ESCAPE: Elegant Scan Chain Activity Probability Establishment for Programmable Low-Power LBIST

Hairui Cai; Yumei Hu; Xiaohui Xue; Yaning Wang; Yu Huang; Zhouxing Su; Zhipeng Lv; Zezhong Wang; Xing Wang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Leveraging AI for Silicon Health: Test, Yield & Fault Tolerance +1
Abstract

This paper presents an Elegant Scan Chain Activity Probability Establishment (ESCAPE) architecture for programmable low-power LBIST. In the designed low-power control circuit, a programmable probability generator periodically outputs a user-configurable sequence of probabilities to drive hold registers. A 2D AND-gate array formed by two groups of hold registers then produces a rich set of low-power control signals. By modulating the enable probability of the lockups situated before the phase shifter, ESCAPE achieves precise per-chain power management. Experimental results on industrial-scale designs demonstrate that the proposed method achieves high coverage while reducing peak power consumption and incurring lower hardware overhead.

EDAEDA9. Test, Validation and Silicon Lifecycle ManagementAIQuantum
RESEARCH1182

Fusedot: A Multiplication-Fused Dot Product Accelerator for Efficient LLM Inference

Wenju Zhao; Jianhui Yue; Pengcheng Yao; Yujia Cui; Qinggang Wang; Yufei Sun; Jiaqi Zhai; Hai Jin; Xiaofei Liao

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

Weight quantization enables efficient LLM inference but requires mixed-precision computation (e.g., FP16xINT4), which is not efficient in general-purpose processors. Existing accelerators integrate custom mixed-precision arithmetic units but follow the dot-product paradigm, which suffers from numerous multiplications. This work exploits multiplication fusion in low-bit quantization and introduces MFDP, a novel dot-product paradigm that fuses identical-weight multiplications into a single operation. Based on MFDP, we propose FuseDot, an accelerator that optimizes the fusion process by incorporating index-driven generalized matrix multiplication and a hierarchical dual-phase reordering algorithm. Additionally, a cross-PE multiplier-sharing architecture with a half-multiplier scheme further amortizes multiplier costs across multiple PEs. This work achieves 1.51–1.98x speedup and 1.19–1.51x higher energy efficiency over state-of-the-art accelerators.
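
The core arithmetic idea, fusing identical-weight multiplications, can be sketched in a few lines: activations that share the same quantized weight are summed first, and each distinct weight value is multiplied only once. The function names and sample data below are illustrative assumptions. The sketch shows only the algebraic identity, not FuseDot's index-driven GEMM, reordering algorithm, or multiplier-sharing hardware.

```python
from collections import defaultdict
from typing import Dict, Sequence


def dot_plain(acts: Sequence[float], weights: Sequence[int]) -> float:
    """Conventional dot product: one multiplication per element."""
    return sum(a * w for a, w in zip(acts, weights))


def dot_fused(acts: Sequence[float], weights: Sequence[int]) -> float:
    """Group activations by weight value, then multiply once per group."""
    group_sums: Dict[int, float] = defaultdict(float)
    for a, w in zip(acts, weights):
        group_sums[w] += a  # additions only in this pass
    # One multiplication per distinct weight value; INT4 has at most 16.
    return sum(w * s for w, s in group_sums.items())


# Hypothetical FP activations and INT4-range weights.
acts = [1.0, 2.0, 3.0, 4.0]
weights = [3, -2, 3, 5]
print(dot_plain(acts, weights), dot_fused(acts, weights))
```

Here the plain version performs four multiplications and the fused version three; the gap widens as vectors grow, because the number of distinct low-bit weight values stays bounded.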

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH1183

Dynamic-Cost Area Recovery for Fracturable LUT-Based FPGAs

Xianfeng Cao; Jiangnan Li; Lingli Wang

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

Fracturable LUTs (FLUTs) create variable logic consumption for LUT implementations, since two LUTs can be merged into one FLUT under certain constraints. Traditional technology mapping algorithms fail to exploit this feature due to their static-cost area models. To bridge this gap, we introduce merging probability, a quantitative, mapping-stage metric that predicts the likelihood of LUT merging during the subsequent packing phase. Based on this, we present a dynamic-cost LUT area model, enabling area recovery better suited for FLUTs. Experimental results on EPFL benchmarks demonstrate that our method reduces the usage of FLUTs by up to 10.3% on mainstream commercial FPGAs, compared to the state-of-the-art technology mapping algorithm, without any performance degradation.
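
One hedged way to read the dynamic-cost idea: if a mapped LUT will share a FLUT with probability p, its expected FLUT consumption is 0.5·p + 1·(1 − p). The abstract does not state the paper's actual cost function, so the formula, the function names, and the probabilities below are assumptions chosen only to show how a probability-weighted cost can reverse a static-cost decision.

```python
from typing import Sequence


def expected_flut_cost(merge_probability: float) -> float:
    """Expected FLUTs used by one LUT that merges with the given probability.

    Assumption: a merged LUT occupies half a FLUT, an unmerged one a full FLUT.
    """
    if not 0.0 <= merge_probability <= 1.0:
        raise ValueError("merge_probability must be in [0, 1]")
    return 1.0 - 0.5 * merge_probability


def cover_cost(merge_probabilities: Sequence[float]) -> float:
    """Dynamic area cost of a cover: sum of per-LUT expected costs."""
    return sum(expected_flut_cost(p) for p in merge_probabilities)


# Hypothetical candidate covers for the same logic cone.
cover_a = [0.8, 0.8, 0.8]  # three LUTs, each likely to be merged
cover_b = [0.0, 0.0]       # two LUTs, none mergeable
print("static cost :", len(cover_a), "vs", len(cover_b))
print("dynamic cost:", cover_cost(cover_a), "vs", cover_cost(cover_b))
```

A static model counts LUTs and prefers the two-LUT cover, whereas the probability-weighted model prefers the three-LUT cover because its LUTs are expected to pack into fewer FLUTs.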

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH1185

CPA-BNN: A Secure and Efficient CIM and PUF Architecture for BNN Accelerator

Li Ni; Jinwei Pu; Xijun Huang; Wenhong Ma; Zhenyu Wang; Lei Liao; Wanli Chang; You Meng

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

Binary neural networks (BNNs) have emerged as promising models for lightweight intelligent inference by binarizing inputs and weights. Compute-in-memory (CIM) architectures, which reduce data movement through in-situ operations, have become a strong candidate for BNN accelerators. However, prior works often overlook the security of the model, leaving BNN weights exposed in memory cells and vulnerable to threats such as cloning, tampering, and reverse engineering. This work proposes CPA-BNN, a secure BNN CIM architecture based on resistive random access memory (RRAM). We propose a 4T2R RRAM cell design which implements in-situ encryption and ciphertext convolution operations to protect the weights. In addition, an in-memory batch normalization (BN) scheme is proposed to optimize area overhead and improve compute density. We also propose an intrinsic physical unclonable function (PUF) entropy extraction method and a current tilt-based masking strategy, enabling reliable key extraction within a unified array. The results show that CPA-BNN achieves ~100% key reliability and effectively prevents model attacks. Compared to state-of-the-art SRAM/NVM BNN schemes, CPA-BNN achieves >1.4× compute density and >1.6× storage density improvement.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH1186

Xsearch: Exploring Microarchitecture Design Space Through Bayesian Optimization and Cross Evaluation on Multi-Fidelity Simulators

Honghua Zhu; Chunjie Luo; Xinke Zhao; Jianfeng Zhan

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

Design space exploration for microarchitecture parameters is a critical aspect of processor design, requiring a balance between evaluation cost and accuracy. New designs can be evaluated using either coarse-grained simulation (high efficiency, low accuracy) or fine-grained simulation (low efficiency, high accuracy). To address this efficiency-accuracy tradeoff, we propose XSearch, which employs multi-fidelity Bayesian optimization by cross-evaluating across simulators with varying fidelity levels. This approach leverages the efficiency of low-fidelity simulations to explore more design points while ensuring accuracy through calibration using high-fidelity simulation data. XSearch has been successfully applied to explore the design space of the large-scale open-source high-performance processor core XiangShan.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH1187

FBS: Accelerating CNN Inference over RNS-CKKS with Fewer Bootstrapping Sparsity

Guiming Shi; Yuchen Wei; Zhanhong Tan; Muyang Li; Dapeng Cao; Jingwei Cai; Shuwen Deng; Kaisheng Ma

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

RNS-CKKS is a fully homomorphic encryption scheme supporting fixed-point arithmetic, widely used in privacy-preserving convolutional neural network (CNN) inference. However, its significant computational overhead, especially from bootstrapping—the most costly operation—raises deployment costs for CNN inference over RNS-CKKS. While sparsity has proven effective in reducing computational overhead for unencrypted CNN inference, its application to large datasets (e.g., ImageNet) with RNS-CKKS-based CNN inference remains under-explored, particularly in optimizing bootstrapping operations that dominate computation time. In this work, we observe that sparsity in CNN can be exploited to reduce the bootstrapping overhead in RNS-CKKS-based CNN inference. Based on this observation, we propose FBS, a framework that accelerates CNN inference over RNS-CKKS by leveraging Fewer Bootstrapping Sparsity to reduce bootstrapping costs. We propose two sparsity patterns: eliminate missing input sparsity pattern and channel sparsity pattern, to reduce the number of bootstrapping calls during CNN inference. An iterative latency optimization framework is then presented to identify the key layers for pruning and determine the sparsity patterns to achieve effective performance. Results show that FBS can accelerate CNN inference over RNS-CKKS by up to 1.91 times with negligible accuracy loss. FBS will be open-sourced.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1194

DIA-CIM: A Dynamic Iterative Aggregation Graph Neural Network and Its Hardware Co-Design for Energy-Efficient Graph Processing

Zhaoyang Zhang; Tianhao Zhao; Defa Wu; yingxuan zhou; Xiaodong Li; Yan Yan; Yang Jun; Xin Si

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Dynamic-iterative aggregation (DIA) in graph neural networks updates node states sequentially and asynchronously. This expands the receptive field without adding layers and improves accuracy-per-operation versus layer-synchronous models. However, DIA introduces fine-grained serial dependencies and highly irregular sparse traffic that undermine conventional SIMD/accelerator designs. We present DIA-CIM, a 28-nm compute-in-memory (CIM) macro co-designed for DIA. DIA-CIM employs a CSR-driven, output-stationary dataflow to keep partial sums local while streaming edges, and a sparsity-priority BF16 pipeline that exploits bit/value sparsity. Fabricated in 28-nm CMOS, DIA-CIM reaches 60.01 TFLOPS/W. On representative DIA workloads, it delivers >3.56× lower energy and >2.32× lower latency.
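The phrase "CSR-driven, output-stationary dataflow" can be pictured with a small software sketch. This is only an illustration of the general idea (one local accumulator per output node while that node's edges stream past), not the DIA-CIM macro; the graph, weights, and the name `csr_aggregate` are hypothetical.

```python
def csr_aggregate(indptr, indices, weights, features):
    """Weighted neighbor sum per destination node over a CSR adjacency."""
    out = []
    for row in range(len(indptr) - 1):
        acc = 0.0  # partial sum stays local to this output row
        for k in range(indptr[row], indptr[row + 1]):  # stream this row's edges
            acc += weights[k] * features[indices[k]]
        out.append(acc)  # written back once per output row
    return out


# Hypothetical 3-node graph: node 0 reads nodes 1 and 2, node 1 reads node 0,
# node 2 has no incoming edges.
indptr = [0, 2, 3, 3]
indices = [1, 2, 0]
weights = [0.5, 0.5, 1.0]
features = [1.0, 2.0, 4.0]
result = csr_aggregate(indptr, indices, weights, features)
```

Because the accumulator is written back once per row, the irregular traffic is confined to the reads of `features`, which is the part a sparsity-aware design has to handle.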

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1196

A Levelized Load-Balanced and Structure-Adaptive GPU LU Factorization Method for Circuit Simulation

Yunfan Zuo; Renjie Xia; Yongtian Ren; Jiajie Xu; Chenpu Shi; zhaohan wang; Hao Yan; Lixin Ge; Longxing Shi

Date:Monday, July 27 Location:Mtg Room 202AB Session:Generative AI and Graph-Based Learning for Next-Gen Circuit & Device Simulation +1
Abstract

As integrated circuit (IC) designs grow increasingly complex and transistor counts per chip exceed ten billion, post-layout SPICE simulations involve large-scale sparse linear systems, severely degrading simulation efficiency. Current GPU acceleration methods, despite their promise, struggle with efficient load balancing and resource utilization, which restricts their effectiveness in ultra-large-scale circuit simulations. In this paper, we propose a levelized load-balanced and structure-adaptive LU factorization framework for GPU-based circuit simulation. Our method improves resource utilization and parallel efficiency by introducing computation-balanced dependency level partitioning, adaptive resource allocation, and a hybrid matrix indexing mechanism. These strategies ensure that both the memory and computational resources of the GPU are fully leveraged. We demonstrate significant acceleration over existing methods, achieving a 2.1X speedup compared to GLU3.0 and a 5.2X speedup over 16-thread PARDISO on circuit sparse matrices ranging from thousands to millions of dimensions. Additionally, our framework has been successfully integrated into the open-source SPICE simulator Ngspice, accelerating circuit simulations with promising results and showcasing its potential for large-scale IC design verification.
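As background for "dependency level partitioning", here is a minimal sketch of plain levelization: columns with no unresolved dependencies share a level and can be factorized in parallel. It does not include the computation balancing the abstract describes; the dependency lists and the name `levelize` are hypothetical.

```python
def levelize(deps):
    """Group columns into levels; deps[k] lists columns (each < k) that k waits for."""
    level = [0] * len(deps)
    for k, parents in enumerate(deps):
        # A column sits one level above its deepest dependency (level 0 if none).
        level[k] = 1 + max((level[j] for j in parents), default=-1)
    buckets = {}
    for k, lv in enumerate(level):
        buckets.setdefault(lv, []).append(k)
    return [buckets[lv] for lv in sorted(buckets)]


# Hypothetical 5-column dependency pattern.
deps = [[], [], [0], [0, 1], [2, 3]]
levels = levelize(deps)
```

Levels can differ widely in size and per-column work, which is the imbalance a load-balanced partitioning would aim to even out.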

EDAEDA6. Analog CAD, Simulation, Verification and TestChipletSYS4. Embedded System Design Tools and Methodologies
RESEARCH1200

NIAQ: Adaptive Non-Ideality-Aware Qubit Readout for Long-Term Accuracy

Yuyang Du; Yi Sheng Chong; Wang Ling Goh; Benjamin Lienhard; Anh Tuan Do

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

High-accuracy qubit-state readout is critical for quantum computation. While existing hardware-based qubit-state discriminators assume stable readout statistics across measurement shots, realistic long-running tasks like chemistry simulation suffer from slow noise fluctuations and drift that severely degrade fidelity over time. We present NIAQ, an adaptive software–hardware co-design combining Kalman filtering for drift tracking with MLP discriminators for robustness against noise, together with degradation rate (%/hour) as a novel metric to quantify long-term stability. NIAQ achieves 88% qubit-state-readout accuracy on a multi-qubit dataset equivalent to a 1-day runtime, compared to 61% for the baseline qubit-state discriminator, and yields a 19× lower degradation rate. Optimized for low-power real-time operation on ASIC, post-synthesis results in 28-nm CMOS show 148 mW power and 5 ns delay, achieving 7.7× power and 6.4× latency improvements compared to the state-of-the-art.
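To make "Kalman filtering for drift tracking" concrete, below is a minimal scalar Kalman filter with a random-walk state model, the textbook form for following a slowly drifting quantity. It is a sketch under assumed noise parameters (`q`, `r`) and an assumed scalar state, not NIAQ's filter or its readout features.

```python
def kalman_track(measurements, q=1e-4, r=0.04, x0=0.0, p0=1.0):
    """Track a slowly drifting scalar from noisy measurements."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q            # predict: random-walk drift inflates the variance
        k = p / (p + r)      # Kalman gain balances prior against measurement
        x = x + k * (z - x)  # correct the estimate with the innovation
        p = (1.0 - k) * p    # posterior variance shrinks after the update
        estimates.append(x)
    return estimates


# Hypothetical usage: a readout center that sits at 1.0.
estimates = kalman_track([1.0] * 50)
```

A larger `q` lets the estimate follow faster drift at the cost of passing more measurement noise through.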

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH1206

DSPE: An Energy-Efficient Edge Processor for Deepseek Inference with Merkletree-Based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing

Yuhan Zhang; Zhou Wang; Zhou Shu; Jiuren Zhou; Yanqing Xu; Xiaonan Tang; shushan qiao; Tianchun Ye; Yang Liu; Anil Bharath; Emm Drakakis

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

In recent years, DeepSeek has achieved strong inference performance but remains hard to deploy on energy-constrained edge devices. This paper presents the DeepSeek Processing Element (DSPE), an edge-oriented architecture that alleviates the model's heavy computational and energy demands. DSPE introduces three techniques: the MerkleTree-based Incremental Pruning Scheme (MIPS) for secure redundant-vector reduction, the Multi-Stage Boothing Lookup Method (MBLM) for bit-flip–aware approximate multiplication, and the Dynamic Adaptive Posit Processing Mechanism (DAPPM), which introduces a new DA-Posit format and its corresponding hardware multiplication architecture. Implemented in TSMC 28nm CMOS, DSPE achieves 91.7 TFLOPS/W energy efficiency compared with state-of-the-art designs and offers a scalable foundation for edge deployment.

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH1211

DCTS: Differentiable Clock Tree Synthesis Based on Probabilistic Graphical Model

han zhang; Junming Jiao; Zhi Li; Qixun Tian; Peng Cao

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

Clock Tree Synthesis (CTS) constitutes a complex, discrete, and combinatorial multi-objective optimization (MOO) problem, which is typically fragmented into sequential steps, including clustering, topology generation, and buffering in traditional flows, leading to suboptimal results due to local optima. Despite significant potential in MOO, differentiable methods are inherently limited in representing dynamic topological adjustments during CTS. To solve this, we propose an end-to-end differentiable CTS framework, DCTS, based on a Probabilistic Graphical Model (PGM) to re-parameterize the discrete topological search into a continuous gradient-based problem, enabling co-optimization of clock tree topology and buffer sizing within a global design space. The proposed DCTS was evaluated on ISCAS'89 and OpenCores benchmark circuits under the ASAP7 technology node. Experimental results show that it achieves competitive power, performance, and area (PPA) metrics against a leading commercial tool, along with a 2.78× speedup on large-scale circuits. Furthermore, when compared to state-of-the-art academic solutions, DCTS guarantees minimum improvements of 20% in delay, 17% in skew, 1% in power, and 17% in area, while also achieving a minimum speedup of 1.98× on large-scale designs.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH1222

Attentioncap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction

Jiechen Huang; Hector Rodriguez; Dingcheng Yang; Zuochang Ye; Yibo Lin; Wenjian Yu

Date:Tuesday, July 28 Location:Mtg Room 101B Session:AI Graces the Cell Party: Transformers, LLMs and Zero Drama Extraction +1
Abstract

As capacitance extraction accuracy of rule-based pattern matching becomes difficult to sustain at advanced nodes, a growing trend emerges to develop deep-learning-based 2D capacitance models. However, existing MLP- and CNN-based methods constrain their input to fixed metal-layer combinations in a specific process node, limiting their usability in practice. Recognizing the inherent similarity between capacitance matrix and the prevailing attention mechanism, we propose AttentionCap, a customized Transformer for capacitance matrix learning, with a Gram representation framework, a physics-aligned symmetric-attention output layer, and a novel normalized Laplacian loss. We also introduce a process-node embedding to enable multi-node learning. Trained on synthetic data, AttentionCap attains 0.67%/3.99% self/coupling-capacitance error on unseen real designs under a multi-layer and multi-node setting, surpassing the CNN-Cap baseline with 4.6x/5.7x lower self/coupling error and 192x faster inference speed. A pretrained AttentionCap accurately transfers to an unseen node with only 5K samples and 4K finetuning steps. With sufficient accuracy on unseen real designs and strong transferability to new process nodes, AttentionCap offers highly practical value for modern EDA workflows. Code and data are available at https://anonymous.4open.science/r/AttentionCap-release-F698.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH1224

Capbench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction

Hector Rodriguez; Jiechen Huang; Wenjian Yu

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

We present CapBench, a fully reproducible, multi-PDK dataset for capacitance extraction. The dataset is derived from open-source designs, including single-core CPUs, Systems-on-Chip, and media accelerators. All designs are fully placed and routed using 14 independent OpenROAD flow runs spanning three technology nodes: ASAP7, NanGate45, and Sky130HD. From these layouts, we extract 61,855 3D windows across three size tiers to enable transfer learning and scalability studies. High-fidelity capacitance labels are generated using RWCap, a state-of-the-art random-walk solver, and validated against the industry-standard Raphael, achieving a mean absolute error of 0.64% for total capacitance. Each window is pre-processed into density maps, graph representations, and point clouds. We evaluate 10 machine learning architectures that illustrate dataset usage and serve as baselines, including convolutional neural networks (CNNs), point cloud transformers, and graph neural networks (GNNs). CNNs demonstrate the lowest errors (1.75%), while GNNs are up to 41.4 times faster but exhibit larger errors (10.2%), illustrating a clear accuracy–speed trade-off.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH1233

RM-CIMA: An Analog Computing-in-Memory Accelerator for a Robust UAV Trajectory Tracking Framework

Pingdan Xiao; Yiliu Gu; Bingqian Zhang; Sichun Du; Wanli Chang; Hong Qinghui

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Scaling Intelligence for the Physical World +1
Abstract

UAV trajectory tracking demands robustness and real-time performance, but traditional architectures suffer high latency, failing to counter sudden disturbances. We propose an algorithm-hardware co-design framework to address this. We integrate reinforcement learning (RL) with model predictive control (MPC) for real-time online learning. Hardware-wise, we designed a dedicated analog compute-in-memory (ACIM) accelerator, RM-CIMA, mapping both RL learning and MPC optimization to the analog domain. RM-CIMA enables UAVs to counter fast disturbances, improving recovery times from strong wind by 7.9× over traditional NMPC. Furthermore, it reduces computational latency and energy consumption by 2995.6× and 983.1×, respectively, compared to traditional architectures.

SystemsSYS1. Autonomous Systems (Automotive, Robotics, Drones)AIQuantum
RESEARCH1242

BSGCN: Taming Diverse Sparsities in GCNs via Adaptive Band-Segmentation and Tailored Caching

Shangtong Zhang; Xueyan Wang; Yuan Xu; Yier Jin; Weisheng Zhao

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

Graph Convolutional Networks (GCNs), a foundational technology for relational data applications, are often bottlenecked by irregular memory accesses caused by sparse graph structures. Existing accelerators face two fundamental limitations: their rigid dataflows cannot adapt to varying graph sparsities, and their narrow focus on high-degree nodes (HDNs) leads to systematic neglect of memory traffic from the vast number of low-degree nodes (LDNs). To address these issues, this paper introduces BSGCN, an accelerator built on a novel band-segmentation strategy. This approach partitions the graph such that the majority of LDNs are grouped into large bands with similar data reuse characteristics, while the remaining nodes form small bands containing only a few LDNs. This segregation enables two co-designed innovations: 1) a Band-Segmented Dataflow (BSD), which applies a tailored traversal strategy to each band type to handle diverse sparsity patterns, and 2) a Band-Segmented Caching (BSC) hierarchy, combining a specialized caching policy to exploit the reuse potential of LDNs with a dedicated pinning mechanism for HDNs. Evaluation results show that BSGCN achieves average speedups of 1.41× (up to 4.80×) over state-of-the-art accelerators, while reducing DRAM traffic by 24.24% and improving energy efficiency by 1.30×.

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH1256

DAP: A Software/hardware Co-Design for Accelerating Multi-Chip-Based Distributed Sparse-Dense Matrix Multiplication

Xiuhua Yang; Hao Jia; Chen Ding; Haoming Chu; Yufan He; Lirong Zheng; Ning Ma; Yuxiang Huan

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

Sparse-dense matrix multiplication (SpMM) is a core component in many critical applications such as deep learning and scientific computing. Existing SpMM accelerators employ the COO format to compress sparse matrices and rely on high-bandwidth off-chip memory for performance gains. However, four major challenges remain unaddressed: 1) Fixed 2-D partitioning strategies lead to workload imbalance among processing nodes due to irregular distribution of non-zero elements in sparse matrices. 2) Redundant metadata from the COO format results in a communication bottleneck that hinders the scalability of existing SpMM accelerators. 3) To accommodate irregular memory access patterns, the use of multiple data replicas significantly increases the pressure on on-chip storage resources. 4) Handling RAW hazards from floating-point adders in software incurs substantial pre-processing overhead. To address these challenges, we propose DAP, the first 2-D multi-chip-based architecture composed of dedicated accelerators, and SPU, a novel communication-friendly SpMM accelerator. To mitigate load imbalance caused by irregular non-zero distribution, we design a two-level matrix partitioning framework that effectively balances workloads across nodes in a 2-D computing array, achieving a performance improvement of 1.43x. Furthermore, the SPU adopts CSC over COO format to reduce communication traffic. SPU also minimizes redundant on-chip storage from data replication, with only a 0.067x performance penalty due to memory access conflicts. By employing a reservation buffer, the SPU resolves RAW dependencies without time-consuming pre-processing. Our simulation-based evaluation demonstrates DAP achieves geometric mean throughputs of 2.69x relative to NVIDIA A100 GPU at 500MHz frequency.
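The metadata argument for CSC over COO can be checked with a short sketch. COO stores a row and a column index per nonzero (2·nnz index words), while CSC stores one row index per nonzero plus a column-pointer array (nnz + n_cols + 1 words). The conversion below is a generic one, not the SPU's hardware format; the example matrix and the name `coo_to_csc` are hypothetical.

```python
def coo_to_csc(rows, cols, vals, n_cols):
    """Convert COO triplets to CSC arrays (col_ptr, row_idx, data)."""
    # Visit nonzeros column by column, rows ascending within a column.
    order = sorted(range(len(vals)), key=lambda i: (cols[i], rows[i]))
    col_ptr = [0] * (n_cols + 1)
    for c in cols:
        col_ptr[c + 1] += 1          # count nonzeros per column
    for c in range(n_cols):
        col_ptr[c + 1] += col_ptr[c]  # prefix sum gives column start offsets
    row_idx = [rows[i] for i in order]
    data = [vals[i] for i in order]
    return col_ptr, row_idx, data


# Hypothetical 4x4 matrix with 6 nonzeros, given in arbitrary order.
rows = [3, 0, 1, 2, 0, 3]
cols = [3, 0, 1, 0, 3, 2]
vals = [6, 1, 3, 2, 5, 4]
col_ptr, row_idx, data = coo_to_csc(rows, cols, vals, 4)

coo_index_words = len(rows) + len(cols)
csc_index_words = len(row_idx) + len(col_ptr)
```

The gap widens as the number of nonzeros per column grows, since the column-pointer array has a fixed size while COO pays for a column index on every nonzero.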

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH1267

MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

Enrico Russo; Mohamed Hamdi; Alessandro Ottaviano; Francesco Conti; Angelo Garofalo; Daniele Jahier Pagliari; Maurizio Palesi; Luca Benini; Alessio Burrello

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

Deploying DNNs on System-on-Chips (SoCs) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using an SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH1268

Bident: Optical Logic Synthesis via Optimal Configuration of Binary-Combiner Trees

Jun-Wei Liang; Iris Hui-Ru Jiang; Kai-Hsiang Chiu

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

Despite decades of success, CMOS technology is increasingly constrained by power dissipation and propagation delay, thus motivating the exploration of photonic integrated circuits as an alternative for high-speed, energy-efficient computing. However, existing optical logic synthesis frameworks rely on an oversimplified efficiency factor model that fails to accurately capture physical attenuation and the hierarchical nature of signal degradation. This work presents Bident, a comprehensive optical logic synthesis framework that provides a physically accurate loss model and achieves optimal combiner configuration. We introduce the binary aggregate model, which explicitly reflects the binary-tree topology and input ordering in multi-input combiners. Based on this model, the Huffman tree optimization guarantees optimal Y-branch combiner configuration without requiring additional hardware resources. Meanwhile, a greedy algorithm optimally employs directional couplers at the leaf layer of a binary-combiner tree to achieve harmonic-mean efficiency at minimal hardware cost. Experimental results demonstrate that Bident achieves superior signal-attenuation reduction and lower switch cost than state-of-the-art methods, using both the conventional efficiency factor model and our binary aggregate model. A schematic-level optical circuit simulation further validates the feasibility and effectiveness of our binary-combiner configurations.
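The role of a Huffman tree here can be illustrated with the classic construction: inputs that matter more end up closer to the root of the binary combiner tree and therefore pass through fewer combiner stages. The sketch below assumes a simple cost of weight times depth; the input weights, that cost, and the name `huffman_depths` are illustrative assumptions, not Bident's binary aggregate model.

```python
import heapq
from itertools import count


def huffman_depths(weights):
    """Return each input's depth in a Huffman-built binary combiner tree."""
    tie = count()  # unique tiebreaker so heap entries never compare lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    depth = [0] * len(weights)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)  # two lightest subtrees merge first
        w2, _, b = heapq.heappop(heap)
        for i in a + b:
            depth[i] += 1               # every leaf below gains one stage
        heapq.heappush(heap, (w1 + w2, next(tie), a + b))
    return depth


# Hypothetical input weights for a 4-input combiner.
weights = [0.5, 0.25, 0.125, 0.125]
depths = huffman_depths(weights)
weighted_depth = sum(w * d for w, d in zip(weights, depths))
balanced_depth = sum(w * 2 for w in weights)  # all four inputs at depth 2
```

For these weights the skewed tree gives a lower weighted depth than the balanced one, using the same three two-input combiners.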

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH1275

TRAC: Transparent Row Activation Counting for Efficient Rowhammer Monitoring

Sunggi Ahn; Saeid Gorgin; Jungrae Kim

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

RowHammer is one of the most critical security threats in modern DRAM. The industry has introduced Per-Row Activation Counting (PRAC) to monitor activations, but this approach degrades performance by lengthening critical DRAM timing parameters. We propose TRAC (Transparent Row Activation Counting), a lightweight monitoring architecture that eliminates these drawbacks. TRAC leverages ACT-triggered update, robust in-subarray processing, and a Linear Feedback Shift Register (LFSR)-based counter to track activations transparently, without introducing timing overhead. Circuit-level simulations and system-level evaluations demonstrate TRAC's correctness, stability, and negligible area cost, establishing it as a practical and scalable defense against RowHammer.
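An LFSR-based counter trades a binary adder for a shift and a single XOR per increment; the count is recovered by looking up the state in the LFSR's known sequence. The sketch below uses a 4-bit maximal-length LFSR purely for illustration; the width, feedback taps, seed, and decode table are assumptions, not TRAC's circuit.

```python
def lfsr_step(state):
    """Advance a 4-bit Fibonacci LFSR (feedback = bit0 XOR bit1) by one activation."""
    feedback = (state ^ (state >> 1)) & 1
    return (state >> 1) | (feedback << 3)


def build_decode_table(seed=1):
    """Map each LFSR state to the number of steps taken from the seed."""
    table, state = {}, seed
    for n in range(15):  # a 4-bit maximal-length LFSR visits 15 nonzero states
        table[state] = n
        state = lfsr_step(state)
    return table


table = build_decode_table()

# Hypothetical usage: seven activations of one row, then decode the count.
state = 1
for _ in range(7):
    state = lfsr_step(state)
activations = table[state]
```

A threshold check then only needs to compare the state against the single precomputed state that corresponds to the alarm count.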

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH1282

CPW-SMO: Generating Mask Sets and Common Source for Maximum Common Process Window

Hongquan He; Ziyang Yu; Jiaqi Liu; Bei Yu; Jingyi Yu; Hao Geng

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Differentiable Dreams: GPU-Powered Litho Gets Its Gradient Groove Back +1
Abstract

Source Mask Optimization (SMO), as a key enabler in Design-Technology Co-Optimization (DTCO), plays a vital role in enlarging the process window (PW) for advanced technology nodes and PDK development. Conventional SMO relies heavily on iterative optimization using lithography models to obtain a single source-mask pair for improved PW. In full-chip applications, however, a set of critical patterns are co-optimized under a common source, forming a one-source-to-many-masks optimization paradigm for maximum common process window (CPW). To overcome the limitations of existing methods, we introduce CPW-SMO, a novel single-shot generative flow that simultaneously generates an optimal source and a set of corresponding mask patterns using set-based attention mechanisms. Our approach formulates the source and mask patterns as a unified set and employs a highly parallelized generative simulator to enable efficient training. By transforming multi-objective, non-differentiable CPW into a single-objective CPW preference penalty and optimizing generators through an SMO-aware gradient response method, CPW-SMO achieves nearly 2x CPW compared to state-of-the-art methods, while delivering ~200x speedup in runtime. These improvements significantly boost the practicality and effectiveness of SMO for holistic lithography applications.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH1284

WFH-BFs: Dynamic Defense Against DNN Bit-Flips via a Weight Function Hierarchy

Jie Xiao; hao ying; Aizhu Liu; jiajun guo; Zhanhui Shi; Jungang Lou; Fuli Wu

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Robust and Trusted AI Systems: Attacks, Defenses, and Hardware Reliability +1
Abstract

Defending DNNs against bit-flip attacks (BFAs) typically incurs prohibitive computational overhead, a critical barrier for deployment in safety-critical systems. We challenge this "one-size-fits-all" defense paradigm by introducing the Weight Function Hierarchy (WFH), a framework that deconstructs DNNs into functionally distinct tiers: a compact Anchor Core for foundational representation, a Class-Sensitive layer for decision-making, and a Reservoir for redundancy. Based on WFH, our synergistic framework, WFH-BFs, couples offline preparation with online adaptation. Offline, we differentially harden the Anchor Core while embedding pruning tolerance elsewhere. Online, a lightweight integrity checker safeguards the core, while a dynamic pruning mechanism reconfigures the network to neutralize attacks. Experiments show WFH-BFs increases BFAs cost by 17.5x and maintains near-original accuracy (<1.4% drop) under high-rate random errors. Critically, our dynamic defense achieves this robust security while simultaneously reducing the overall computational load (GFLOPs), breaking the security-efficiency trade-off.

SecuritySEC1. AI/ML Security/PrivacyAIQuantum
RESEARCH1286

An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

Ruijia Yang; Zeyi Wen

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory-intensive nature exceeds the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) A lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O. (2) A highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage. (3) Optimized Triton kernels to solve key bottlenecks and integrated advanced I/O. This collaborative design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8× larger batch sizes and 6× larger models. In evaluations, SlideFormer achieves 1.40× to 6.27× higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining >95% peak performance on both NVIDIA and AMD GPUs.

AIAI5-I. AI/ML System and Platform Design
RESEARCH1287

From Fab to Mask Shop: A Differentiable MDP-ILT Framework for Co-Optimization of Wafer Pattern Fidelity and Mask Writability

Xu Jinchao; Xinyun Zhang; Kun Ren; Qi Sun; Cheng Zhuo

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Differentiable Dreams: GPU-Powered Litho Gets Its Gradient Groove Back +1
Abstract

Inverse lithography technology (ILT) generates curvilinear masks for optimal wafer patterns and process windows. In practice, fabs employ ILT based on optical lithography (OL) models to improve wafer pattern fidelity, while mask shops perform mask data preparation (MDP) to ensure mask writability. The MDP process applies geometry-level mask process correction (MPC) guided by electron-beam lithography (EBL) simulations, followed by shot fracturing. However, this disjointed workflow, both between fabs and mask shops, as well as within MDP between MPC and shot fracturing, often results in suboptimal wafer patterns and inefficient mask preparation. This paper presents a novel, unified, end-to-end differentiable framework for co-optimizing wafer pattern fidelity and mask manufacturability. We achieve differentiability by formulating a physics-aware EBL simulator using the inherently differentiable error function (erf), which precisely captures energy deposition at the VSB shot level, enabling exact gradient computation. With OL simulation embedded in the optimization loop, the framework enables refinement of shot parameters to achieve wafer-level optimality. The proposed method unifies the workflow from fab to mask shop, aligning wafer-side lithography objectives with mask-side writability constraints within an end-to-end optimization flow, thereby improving mask manufacturability while preserving wafer pattern fidelity.
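Why the error function makes a shot-level simulator differentiable can be seen in a small sketch: convolving a rectangular shot with a Gaussian blur has a closed form in erf, and its derivative with respect to a shot edge is a Gaussian. The single-Gaussian blur of the form exp(-r²/σ²), the parameter names, and both function names are assumptions for illustration; production EBL models typically add further proximity terms, and this is not the authors' simulator.

```python
import math


def shot_dose(x, y, x0, x1, y0, y1, sigma, dose=1.0):
    """Deposited energy at (x, y) from a rectangular shot blurred by a Gaussian."""
    fx = math.erf((x - x0) / sigma) - math.erf((x - x1) / sigma)
    fy = math.erf((y - y0) / sigma) - math.erf((y - y1) / sigma)
    return 0.25 * dose * fx * fy


def d_dose_d_x1(x, y, x0, x1, y0, y1, sigma, dose=1.0):
    """Exact derivative of shot_dose with respect to the right edge x1."""
    fy = math.erf((y - y0) / sigma) - math.erf((y - y1) / sigma)
    # d/dx1 of -erf((x - x1) / sigma) is a Gaussian centered on the edge.
    dfx = 2.0 / (math.sqrt(math.pi) * sigma) * math.exp(-((x - x1) / sigma) ** 2)
    return 0.25 * dose * dfx * fy


# Hypothetical 20 x 20 shot with unit blur, sampled at its center and right edge.
center = shot_dose(0.0, 0.0, -10.0, 10.0, -10.0, 10.0, 1.0)
edge = shot_dose(10.0, 0.0, -10.0, 10.0, -10.0, 10.0, 1.0)
edge_grad = d_dose_d_x1(10.0, 0.0, -10.0, 10.0, -10.0, 10.0, 1.0)
```

Because the gradient is available in closed form for every shot parameter, a loss defined on the simulated pattern can be pushed back to the shot edges and dose without finite differencing.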

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH1290

Analytically-Derived Hybrid Net–pin Weighting for Timing-Driven Global Placement

Junjin Li; Qiang Yang; Hao Ai; Wenxing Zhu

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

Timing-driven global placement is critical in modern VLSI physical design for achieving timing closure. Among existing approaches, weighting is a mainstream technique, but traditional weights are often manually designed based on heuristics, which can lead to suboptimal timing performance. To address this, we present an analytically-derived model of interconnect and cell-delay contributions of pin pairs along critical paths, from which we obtain explicit and differentiable formulations approximating Total Negative Slack (TNS) and Worst Negative Slack (WNS). The formulations include three wirelength components: net wirelength, linear pin-to-pin wirelength, and quadratic pin-to-pin wirelength, where net and linear pin-to-pin wirelength are smoothed via the weighted-average (WA) model. Based on these formulations, we derive a hybrid net-and-pin weighting scheme and propose a novel timing-driven global placement framework that directly optimizes TNS and WNS. The weighting scheme features dynamic and cumulative updates, ensuring that consistently critical paths are prioritized throughout optimization. Experimental results on the ICCAD'15 benchmark demonstrate that our method achieves average improvements of 39% in TNS and 6% in WNS compared with state-of-the-art timing-driven placers, while maintaining competitive wirelength and runtime, validating the effectiveness of the analytically-derived weighting framework.
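The weighted-average (WA) model mentioned above replaces the non-smooth max and min in half-perimeter wirelength with exponentially weighted averages. Below is a minimal one-dimensional sketch of that standard model in a numerically stable form; the pin coordinates, the smoothing parameter `gamma`, and the function name are illustrative, and the timing weights derived in the paper are not modeled.

```python
import math


def wa_wirelength(xs, gamma):
    """Weighted-average smooth approximation of max(xs) - min(xs)."""
    hi, lo = max(xs), min(xs)
    # Shift exponents by the extreme value so exp() never overflows.
    ep = [math.exp((x - hi) / gamma) for x in xs]
    en = [math.exp((lo - x) / gamma) for x in xs]
    smooth_max = sum(x * e for x, e in zip(xs, ep)) / sum(ep)
    smooth_min = sum(x * e for x, e in zip(xs, en)) / sum(en)
    return smooth_max - smooth_min


# Hypothetical x-coordinates of a 3-pin net; the exact span is 10.
pins = [0.0, 3.0, 10.0]
tight = wa_wirelength(pins, 0.1)
loose = wa_wirelength(pins, 5.0)
```

A small `gamma` tracks the exact span closely, while a larger one is smoother but underestimates it; placers commonly shrink `gamma` as optimization proceeds.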

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH1291

NASiC: 3D NAND-Based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference

Weikai Xu; Meng Li; Shuzhang Zhong; Tianyang Luo; Dongxue Zhao; LING LIANG; Zongwei Wang; Qianqian Huang; Yimao Cai; Ru Huang

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Mixture-of-Experts (MoE) models have emerged as the state-of-the-art paradigm for scaling up large language models (LLMs) without a proportional increase in computational cost. However, on-device deployment of MoE models still faces a critical challenge due to the large memory requirement for storing all expert parameters. In this work, we propose NASiC, a 3D NAND-based CAM-selected multibit CIM architecture through algorithm-hardware co-optimization, tailored to the high-density storage and sparse computation requirements of MoE models. The NASiC architecture achieves improved throughput and high area and energy efficiency, indicating its great potential for on-device MoE inference.

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1292

IMPart: Integration of Memetic Operations into Multi-Level Framework for Large-k-Way Hypergraph Partitioning

Yugao Zhu; Zhicheng Guo; Shang Liu; Mengming Li; Jing Wang; Zhiyao Xie

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

k-way hypergraph partitioning is a fundamental problem with significant applications in various fields, including VLSI design and scientific computing. State-of-the-art hypergraph partitioners commonly employ a multi-level framework encompassing coarsening, initial partitioning, uncoarsening, and refinement phases. However, many existing methods do not scale well to problems requiring a large number of partitions (i.e., large k). In pursuit of exceptionally high solution quality, existing memetic approaches often execute their two key operations, recombination and mutation, by invoking separate, standalone multi-level partitioners. This design choice, however, renders them significantly more time-consuming than standard multi-level partitioners. To make such memetic approaches more practical, we propose an advanced memetic framework, IMPart, which introduces novel recombination and mutation operators and integrates them directly into the uncoarsening phase of a single multi-level framework. This transforms the local searches of different granularities in the traditional multi-level framework into a sophisticated, collaborative search. Experimental results on multiple standard benchmarks demonstrate that our framework escapes local optima more effectively and explores the global solution space for higher-quality solutions, substantially outperforming all existing hypergraph partitioners for large-k-way hypergraph partitioning. Our framework highlights a new paradigm for the development of advanced hypergraph partitioners.
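
For readers unfamiliar with how the quality of a k-way hypergraph partition is scored, the sketch below computes two standard objectives, the cut-net count and the connectivity (lambda minus one) metric. The abstract does not say which objective the paper uses, so this is an assumption, and the function name and data layout are hypothetical.

```python
def partition_metrics(hyperedges, part_of):
    """Return (cut_nets, connectivity) for a partitioned hypergraph.

    hyperedges: list of vertex lists, one per hyperedge.
    part_of:    dict mapping each vertex to its block id.
    """
    cut_nets = 0
    connectivity = 0
    for edge in hyperedges:
        blocks = {part_of[v] for v in edge}  # distinct blocks this edge touches
        if len(blocks) > 1:
            cut_nets += 1                    # edge spans more than one block
        connectivity += len(blocks) - 1      # lambda - 1 contribution
    return cut_nets, connectivity

# Hypothetical 6-vertex hypergraph split into k = 3 blocks.
edges = [[0, 1, 2], [2, 3], [3, 4, 5], [0, 5], [1, 3, 5]]
blocks = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
cut, conn = partition_metrics(edges, blocks)
```

The two metrics coincide for k = 2 and diverge as k grows, because a hyperedge that spans many blocks is counted once by the cut-net metric but several times by connectivity.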

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH1298

FreqSEM: Frequency-Aware High-Precision Contour Extraction for Lithographic SEM Images

Xiaoyan Luo; Yumeng Liu; Qian Jin; Qi Sun; Cheng Zhuo

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Leveraging AI for Silicon Health: Test, Yield & Fault Tolerance +1
Abstract

In semiconductor manufacturing, contour extraction from scanning electron microscope (SEM) images is essential for accurate lithographic metrology and process optimization. However, the low contrast, high noise, and complex structures of SEM images lead to blurred contours, making precise contour extraction extremely challenging. Existing methods struggle to capture detailed geometric features and fail to meet the high-precision requirements. In this paper, we propose FreqSEM, a frequency-aware contour extraction framework that achieves sub-nanometer-level precision, incorporating Fast Fourier Transform (FFT) for edge feature enhancement and Segment Anything Model 2 (SAM2) for edge localization. To exploit the characteristics of different frequency bands, we first apply FFT to extract the low- and mid-frequency components of the input image. Subsequently, the low-frequency component, which preserves the overall brightness distribution and large-scale structures, is used to generate SEM-adapted prompts for SAM2. Meanwhile, the mid-frequency component, which retains edges, textures, and other fine-scale details, is injected into the image encoder to enhance edge representation. In terms of the training strategy, to further enhance SAM2's focus on edges, we design an edge-aware loss function with a weight map emphasizing boundaries and a Sobel gradient loss. With limited training data, our method achieves an EPE mean of only 0.334 nm and a standard deviation of 0.661 nm at the minimum line width of 81 nm, significantly outperforming existing approaches.

EDAEDA9. Test, Validation and Silicon Lifecycle ManagementAIQuantum
RESEARCH1300

ComPart: Community-Guided Post-Coarsening for High-Quality Hypergraph Partitioning

Yugao Zhu; Zhicheng Guo; Yuchao Wu; Mengming Li; Jing Wang; Zhiyao Xie

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

Hypergraph partitioning is a critical step in the design of complex embedded systems, essential for optimizing task mapping on heterogeneous MPSoCs and enabling multi-FPGA prototyping. Many existing methods rely on community detection to identify modules with dense internal and sparse external connections, typically utilizing them to constrain the coarsening phase—a widely adopted paradigm. In this work, we propose ComPart, a generalized framework that integrates diverse community detection methods to uncover high-quality clusterings throughout the post-coarsening stages (i.e., initial partitioning and uncoarsening). These discovered clusterings serve as distinct structural guides, enabling the refinement process to identify superior partitioning solutions. Our framework offers two key advantages: (1) it establishes a new paradigm that leverages community structures detected during uncoarsening to escape local optima and explore globally meaningful solution subspaces, transcending the limitations of standard local refinements; and (2) it flexibly accommodates both existing and future community detection methods. Furthermore, we theoretically generalize locally-dense decomposition—originally from graphs—to the hypergraph domain. We provide the formal extension and necessary proofs to apply this technique to hypergraphs, marking its first application in hypergraph partitioning. Specifically, we utilize this rigorously derived decomposition to guide the initial partitioning phase toward superior starting points. Experimental results on standard benchmarks demonstrate that our method consistently outperforms state-of-the-art methods in solution quality.

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH1302

Active Imitation Learning for Thermal- and Kernel-Aware LFM Inference on 3D S-NUCA Many-Cores

Yixian Shen; Chaoyao Shen; Jan Deen; George Floros; Andy Pimentel; Anuj Pathania

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Taming the Physical World: AI Strategies for Circuits and Cores to Quantum Machines +1
Abstract

Large Foundation Model (LFM) inference is both memory- and compute-intensive, traditionally relying on GPUs. However, their limited availability and high cost have driven growing interest in high-performance general-purpose CPUs, particularly emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. While these architectures improve bandwidth and data locality, they introduce severe thermal constraints and non-uniform cache latencies caused by 3D Networks-on-Chip (NoC). Efficient management of thread migration and V/f scaling remains challenging due to diverse LFM kernels and hardware heterogeneity. We propose AILFM, an Active Imitation Learning (AIL)–based scheduling framework that learns near-optimal thermal-aware policies from Oracle demonstrations with minimal runtime overhead. AILFM captures both core-level performance variations and kernel-specific behavior, maintaining thermal safety while maximizing inference efficiency. Extensive experiments demonstrate that AILFM outperforms state-of-the-art baselines and generalizes across diverse LFM workloads.

AIAI3-II. AI/ML Application and InfrastructureQuantumDES3. Emerging Models of Computation +1 more
RESEARCH1307

MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization

Seeyeon Kim; Jaehun Lee; Sungyeob Yoo; Joo-Young Kim

Date:Monday, July 27 Location:Mtg Room 101B +1 Session:Beyond GPUs: Next-Generation Architectures for Modern AI Workloads +1
Abstract

Masked diffusion enables region-specific image synthesis but suffers from computational redundancy, since the entire image is processed at each timestep even though only the masked region requires generation. To address this, we introduce MASQ, a hardware–software co-designed accelerator for masked diffusion. Our approach performs stage-wise MXINT8/4/2 precision assignment that dynamically reflects spatial and semantic importance, complemented by timestep-aware scheduling and optimized non-matrix operations. MASQ features a block-wise multi-precision compute engine and a mask management unit that together execute this scheme efficiently. It achieves up to 16.06x and 5.39x speedup and 4.18x and 4.93x energy-efficiency gain over A100 and Orin NX, respectively, while preserving quality.

AIAI4-I. AI/ML Architecture DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH1308

SDM: A Variation-Tolerant Soft Decision Machine with In-Memory Sigmoid Computation

Bo Wen; Zhicheng Xu; Pengyu Ren; Guoyun Gao; X. Sharon Hu; Can Li

Date:Monday, July 27 Location:Mtg Room 203AB Session:Advancing the Frontier of Neuromorphic Learning Systems +1
Abstract

Interpretable models like Decision Trees (DT) and Random Forests (RF) are ill-suited for edge hardware: they face von Neumann bottlenecks and are sensitive to device variations. We propose a hardware-algorithm co-design, the Soft Decision Machine (SDM). Algorithmically, SDM transforms a multivariate model into a hardware-friendly univariate form. In hardware, a compact 1-FeFET analog CAM natively computes the required sigmoid function. This "soft" probabilistic approach provides inherent variation tolerance. Under 40 mV device variation, SDM maintains 93.2% accuracy on MNIST (0.54% drop), outperforming DT (43.86% drop) and RF (16.06% drop), enabling robust and efficient interpretable computing for edge intelligence.
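
The intuition behind the variation tolerance of a "soft" decision can be seen in a small sketch that compares a hard threshold split with a sigmoid split when the threshold drifts slightly. This is a generic illustration rather than the SDM algorithm or its FeFET circuit; the function names and the `sharpness` parameter are assumptions.

```python
import math

def hard_split(x, threshold):
    """Classic decision-tree comparison: the output flips abruptly at the threshold."""
    return 1.0 if x > threshold else 0.0

def soft_split(x, threshold, sharpness=10.0):
    """Sigmoid comparison: the output is a probability that varies smoothly with x."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))

x = 0.50
# Hypothetical device variation: the stored threshold drifts from 0.49 to 0.51.
hard_change = abs(hard_split(x, 0.49) - hard_split(x, 0.51))
soft_change = abs(soft_split(x, 0.49) - soft_split(x, 0.51))
```

In this toy case the hard split flips its decision entirely, while the soft split shifts its output by only a few percent.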

DesignDES3. Emerging Models of ComputationChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1312

Agent-Per-Qubit: Automated Qubit Placement for Fault-Tolerant Quantum Computing

Tian Li; Yang Wang; Tan Li; Wan-Su Bao

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Taming the Physical World: AI Strategies for Circuits and Cores to Quantum Machines +1
Abstract

Quantum computing holds the potential to revolutionize numerous fields, yet the practical execution of quantum circuits depends on an efficient compilation process, including the placement of logical qubits on a quantum chip. This placement is a hard combinatorial problem that often defeats traditional heuristics and manual optimization. We present QAgent, a novel multi-agent reinforcement learning (RL) framework that autonomously optimizes logical qubit layouts on a quantum processor. In QAgent, one agent is assigned to each logical qubit, and agents jointly learn placement policies that minimize circuit execution cost. To address the challenges of sparse rewards and credit assignment, we propose the Breakthrough Return Bonus (BRB), a dynamic reward shaping mechanism that encourages meaningful layout improvements and accelerates convergence. Extensive experiments on diverse quantum circuit benchmarks show that QAgent reduces execution costs by up to 53.9% compared to leading approaches, and significantly enhances circuit success rates. Ablation studies confirm that BRB is essential for stable training and effective policy optimization. These results demonstrate the promise of AI-driven, workload-aware placement for advancing fault-tolerant quantum computation.

AIAI3-II. AI/ML Application and InfrastructureQuantumDES3. Emerging Models of Computation +1 more
RESEARCH1316

Search Smarter, Not Harder: A Scalable, High-Quality Zoned Neutral Atom Compiler

Yannick Stade; Lukas Burgholzer; Robert Wille

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Pioneering the Foundations of Quantum Circuit Synthesis, Compilation and Algorithmic Optimization +1
Abstract

Zoned neutral atom architectures are emerging as a promising platform for large-scale quantum computing. Their growing scale, however, creates a critical need for efficient and automated compilation solutions. Yet, existing methods fail to scale to the thousands of qubits these devices promise. State-of-the-art compilers, in particular, suffer from immense memory requirements that limit them to small-scale problems. This work proposes a scalable compilation strategy that "searches smarter, not harder". We introduce Iterative Diving Search (IDS), a goal-directed search algorithm that avoids the memory issues of previous methods, and relaxed routing, an optimization to mitigate atom rearrangement overhead. Our evaluation confirms that this approach compiles circuits with thousands of qubits and, in addition, reduces rearrangement overhead by 28.1% on average.

DesignQuantumDES6. Quantum ComputingAI
RESEARCH1317

HyPAS: A Hybrid Optimization Framework for Placement and Sizing Co-Optimization

Fangzhou Liu; Wuqian Tang; Bo-Ying Wang; An-Chieh Shen; Han-Wen Tsao; Yuyang Ye; Yun Shao; Chun-Yao Wang; Bei Yu

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

Incremental placement optimization improves power, performance, and area (PPA) by refining cell locations, gate sizing, and buffering under strict physical constraints. Conventional incremental flows, composed of discrete heuristic stages with isolated static timing analyses, often converge slowly and yield inconsistent PPA gains due to the lack of a unified formulation. Recent advances in differentiable placement enable gradient-based refinement on smooth surrogates of wirelength, density, and timing, offering high scalability but limited robustness near legality and timing-closure boundaries. To combine their strengths, we propose HyPAS, a hybrid optimization framework that integrates discrete sign-off optimization with differentiable gradient refinement for placement and sizing co-optimization. The discrete stage, implemented with OpenROAD, ensures legality, timing closure, and buffer-aware repair, while the differentiable stage, built on DREAMPlace, performs gradient-guided placement and sizing using a GNN-based timing-power surrogate. A Straight-Through Estimator bridges continuous gradients with discrete library parameters, enabling end-to-end physical co-optimization. Experiments show HyPAS delivers superior PPA gains with competitive runtime against 2025 ICCAD CAD Contest results and SOTA methods.
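
The straight-through estimator (STE) idea named in the abstract can be sketched on a toy problem: the forward pass snaps a continuous size to the nearest entry of a discrete library, while the backward pass treats that snapping as the identity so a gradient can still update the continuous variable. The quadratic loss, the library values, and all names below are hypothetical and stand in for the paper's GNN-based surrogate.

```python
def snap(value, library):
    """Forward pass: choose the nearest discrete library size."""
    return min(library, key=lambda size: abs(size - value))

def ste_descent(library, target, start, lr=0.2, steps=20):
    """Gradient descent on a continuous size using a straight-through gradient."""
    s = start
    best_q, best_loss = None, float("inf")
    for _ in range(steps):
        q = snap(s, library)            # discrete value actually evaluated
        loss = (q - target) ** 2        # toy loss on the discrete value
        if loss < best_loss:
            best_q, best_loss = q, loss
        grad_q = 2.0 * (q - target)     # d(loss)/dq
        s -= lr * grad_q                # STE: pretend dq/ds = 1
    return best_q

# Hypothetical gate-size library; the continuous optimum 3.2 is not a library entry.
best = ste_descent([1, 2, 4, 8], target=3.2, start=1.0)
```

Because the snapped value can oscillate between neighbouring library entries, the sketch keeps the best discrete choice seen so far rather than the final iterate.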

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH1320

Shooting Falcon with One Fault: A Practical Rowhammer-Based Fault Attack Against Falcon

Zihao Xin; Xin Zhang; Zhe Liu; Qianmei Wu; Qingni Shen; Ruyi Ding; Yixin Jiang; Wenqian Xu; Zhihong Liang; Lu Zhou; Fan Zhang

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

Falcon, a compact post-quantum digital signature scheme, has been selected by the National Institute of Standards and Technology (NIST) for the standardization of post-quantum cryptography. While it effectively mitigates emerging quantum threats, its susceptibility to fault attacks remains insufficiently understood. In this work, we present a novel Rowhammer-based fault attack against Falcon that needs only a single-bit flip and reduces the number of required faulty signatures by 800× compared to the state-of-the-art attack. Our solution identifies a new vulnerable address offset in Falcon and optimizes the classical Hidden Parallelepiped Problem (HPP) attack. Our end-to-end attack on the Falcon-512 reference implementation successfully recovers the secret key with only 250k faulty signatures.

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH1322

Keeping PIM Busy: Eliminating Execution Overheads for Full Throughput

Mun Seong Park; Jisung Pack; Junil Kim; Jun Sik Kim; Seok Young Kim; Seon Kim

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

Large language models (LLMs), such as GPT-3 and Llama-2, impose extreme memory-bandwidth demands, creating severe data movement bottlenecks. Processing-in-Memory (PIM) mitigates this by performing computations near memory; however, current PCIe-based PIM systems deliver only a small fraction of their theoretical throughput. On our multi-channel PIM emulation platform, PIM cores are active for only 6.55% of execution time, primarily due to (1) serialized host–device communication that limits channel-level parallelism, (2) insufficient request-generation capability in conventional DMA engines, and (3) per-bank microarchitectures that serialize batch execution. We address these bottlenecks with a co-designed memory system and microarchitectural solution: Channel-Level Burst (CL) for autonomous per-channel operand generation, PDMA for high-throughput PIM-oriented request scheduling, and an integrated PIM core for enabling true multi-batch parallelism. Across GEMM microbenchmarks and four LLMs, CL provides 9.10x speedup, PDMA adds 29.6x, and multi-batch execution contributes 102.4x. Together, they deliver a 145.8x improvement over a baseline multi-channel PIM system, enabling PIM to surpass CPU throughput in memory-bound LLM decode kernels and approach the efficiency of GPUs.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH1331

Marlin: I/O-Efficient Prefix KV Cache Retrieval for Long-Prefix LLM Serving

Guifeng Wang; Shengan Zheng; Ji Fang; Yucheng Li; Shi Shu; Weihan Kong; Cong Zhou; Kaijiang Deng; Linpeng Huang

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

As large language models (LLMs) are often deployed with long, context-rich prefixes, the prefix KV cache frequently exceeds GPU memory capacity. Although offloading the prefix KV cache to host memory or storage alleviates capacity pressure, it introduces severe I/O stalls that increase Time-to-First-Token (TTFT) latency. To address this challenge, we present Marlin, an I/O-efficient prefix KV cache retrieval system for long-prefix LLM inference. Marlin employs a dispersion-based token selector to precompute a compact, query-agnostic subset of important prefix tokens, and a sensitivity-guided head classifier that classifies heads as prefix-sensitive or query-sensitive and assigns each class a different KV retrieval policy. An overlap-optimized attention pipeline further hides offload latency by overlapping head-specific KV transfers with attention computation. Experimental results demonstrate that Marlin significantly reduces TTFT compared to state-of-the-art methods while maintaining comparable model accuracy.

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1334

PEL: A 4K@120FPS Palette-Based Enhancement Layer Validated with JPEG XS Achieving 18.13% BD-Rate Reduction

Hanyang Cui; Shijie Yao; Wei Li; Hetao Xu; Chenlong He; Minge Jing; Leilei Huang; Yibo Fan

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

The growing demand for low-latency, high-quality video encoding in wireless applications, such as online conferencing and cloud gaming, necessitates robust screen content capabilities in mezzanine codecs. However, some mezzanine codecs, such as JPEG XS, exhibit inefficiencies when handling screen content, particularly for text-rich scenes containing abundant high-frequency information. We therefore propose PEL, a low-complexity screen content enhancement layer that operates without modifying the core codec. The proposed enhancement layer is based on clustering and a palette algorithm, achieving an 18.13% BD-rate reduction on screen content datasets with 5H2V transform levels when validated with JPEG XS. The proposed method eliminates data dependencies and complex rate–distortion optimization processes, satisfying requirements for low complexity and low latency. To overcome hardware throughput bottlenecks, a scalable tri-core parallel architecture is developed for the palette algorithm. Implemented on the Xilinx ZCU106 evaluation board, the design achieves 4K@120FPS performance with only 19.7K LUTs and an additional latency of eight lines.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH1338

ZK-Tracer: A High-Performance Heterogeneous Accelerator for Zero-Knowledge VM Trace Generation

Jieran Cui; Zhengkai Wen; Haowen Fang; Yinan Zhu; Jia Xiong; Cheng Ni; Mingchi Zhang; Nan Guan; Xi Wang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

While ZKP hardware acceleration has focused on backend proving, we identify frontend trace generation as the new system bottleneck through system-level profiling. To address this, we propose ZK-Tracer, the first hardware accelerator architecture for the zkVM frontend. ZK-Tracer features a novel heterogeneous design that couples a Main Trace Unit with parallel Permutation Trace Units, all managed by a lightweight ISA extension for efficient offloading. Our ASIC implementation shows that ZK-Tracer accelerates trace generation by 1829x, delivering a remarkable 963x end-to-end system speedup. This work rebalances the ZKP system by eliminating the emerging frontend bottleneck.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH1348

TOB-Sched: Topological Order Balancing-Driven Static Scheduling for Processor-Based Emulation

Shunyang Bi; Qiwang Chen; Jing Tang; Zhengguang Tang; Haonan Wu; Hailong You; Cong Li

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

The efficiency of processor-based emulation (PBE) heavily depends on the scheduling of computations. In the scheduling process, accurately modeling the time step is challenging because instruction execution is complex and is influenced by the inter-processor communication latency, the execution sequence of the allocated nodes, and scheduling constraints. As a result, existing works cannot optimize the time step directly. In this paper, inspired by the inherent connection between the time step and the topological order balancing (TOB) of the netlist graph, we propose a topological order balancing-driven scheduling algorithm. Our approach introduces mobility-prioritized node selection and efficient forward and backward propagation. In addition, a theorem is presented that reduces the time complexity of gain calculation to O(1). Experiments demonstrate significant improvements over the SOTA scheduling algorithm, achieving a 72% better TOB metric, a 22.5% reduction in time steps, and 55% faster runtime on public and open-source chip benchmarks, establishing TOB as a critical metric for the scheduling process and improving the efficiency of PBE.
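
The abstract does not define mobility, so the sketch below assumes the usual scheduling notion: the gap between a node's latest (ALAP) and earliest (ASAP) topological level, obtained by one forward and one backward pass over the netlist DAG. The graph encoding and function names are hypothetical, and every node must appear as a key of `succ`.

```python
from collections import deque

def topo_order(succ):
    """Kahn's algorithm on a DAG given as {node: [successor, ...]}."""
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1
    queue = deque(v for v in succ if indeg[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order

def mobility(succ):
    """Return {node: ALAP level - ASAP level}; zero means the node is on a longest path."""
    order = topo_order(succ)
    asap = {v: 0 for v in succ}
    for v in order:                       # forward propagation
        for w in succ[v]:
            asap[w] = max(asap[w], asap[v] + 1)
    depth = max(asap.values())
    alap = {v: depth for v in succ}
    for v in reversed(order):             # backward propagation
        for w in succ[v]:
            alap[v] = min(alap[v], alap[w] - 1)
    return {v: alap[v] - asap[v] for v in succ}

# Hypothetical netlist fragment: a -> b -> c -> e and a -> d -> e.
slack = mobility({"a": ["b", "d"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []})
```

Nodes with zero mobility are pinned to a single level, while node `d` can be placed at either of two levels, which is the freedom a balancing scheduler can exploit.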

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1351

eLLM: Elastic Memory Management Framework for Efficient LLM Serving

Yi Xiong; Jiale Xu; Rui Zhang; Cong Guo; Zihan Liu; Yangjie Zhou; Weiming Hu; Hao Wu; Boyu Li; Junping Zhao; Minyi Guo; Jingwen Leng; Zongwei Zhu; Xuehai Zhou

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

GPU memory management is critical for efficient Large Language Model (LLM) serving. LLM memory usage primarily comprises weights, activations, and KV caches. While weights are static, activations and KV caches exhibit dynamic and unpredictable behavior, posing significant memory management challenges. Modern LLM serving systems address this through a dual-level approach: activations inherit static tensor abstractions from deep learning frameworks, while KV caches employ specialized page-table virtualization (i.e., PagedAttention). Although this reduces KV cache fragmentation, the fundamental isolation between activation and KV cache management prevents memory sharing across these spaces, leading to suboptimal utilization and 20% throughput degradation. To address these limitations, we propose eLLM, an elastic memory management framework. The core components of eLLM include: (1) Virtual Tensor Abstraction: Decouples the virtual address space of tensors from physical GPU memory, creating a unified and flexible memory pool; (2) Elastic Memory Mechanism: Dynamically adjusts memory allocation through runtime memory inflation and deflation, and leverages CPU memory as an extensible buffer; (3) Lightweight Scheduling Strategy: Employs Service-Level Objective (SLO)-aware policies to optimize memory utilization and effectively balance performance trade-offs under stringent SLO constraints. Comprehensive evaluations demonstrate that eLLM outperforms state-of-the-art systems, achieving up to 2.32× higher throughput.

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH1355

Debug Like a Human: Scaling LLM-Based Fault Localization to Processor Design via Block-Level Instruction-Oriented Slicing

Zizhen Liu; Xiaoguang Mao; Deheng Yang; Jiayu He; Yihao Qin; Guangda Zhang; Yan Lei; Jianjun Xu; Jiang Wu

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

Fault localization in modern processor design code is a critical yet time-consuming step during processor verification. While recent advances in LLM-based techniques for module-level hardware design have shown promising results, automatically localizing bugs in large-scale, project-level processor designs remains challenging. In this paper, we present BluesFL, a novel block-level LLM-based fault localization framework for processor designs. Inspired by the way engineers debug processors, we first propose a dataflow-based code blockization approach to guide LLMs to focus on critical local code context. We further propose a Block-level Instruction-Oriented Slicing (Blues) algorithm that enables LLMs to mimic human reasoning by analyzing instruction execution paths and processor states. We evaluate BluesFL on a real-world RISC-V processor core comprising 19K lines of SystemVerilog code. Experimental results demonstrate that BluesFL correctly localizes 24 bugs at Top-1, achieving 242.9% improvement over the existing state-of-the-art (7 bugs). Cost analysis shows that BluesFL requires an average of only $0.257 to localize a single bug.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1356

ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

Lik Tung Fu; Jie Zhou; Shaokai Ren; Mengli Zhang; Jia Xiong; Nan Guan; Hugo Jiang; Xi Wang; Jun Yang

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

Functional verification accounts for over 50% of the IC development lifecycle, making SystemVerilog Assertions (SVAs) indispensable for rigorous digital chip verification. However, manual SVA authoring is labor-intensive and error-prone. To address these challenges, we introduce ChatSVA, an end-to-end SVA generation system built upon a multi-agent framework. The AgentBridge platform facilitates systematic data generation, augmentation, and validation, decomposing complex verification processes into modular, verifiable subtasks. ChatSVA achieves syntax and function pass rates of 98.66% and 96.12%, averaging 139.5 SVAs per design with 82.50% function coverage. A ChatSVA web service is publicly available.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1357

3D-DuRA: Accelerating Next-Resolution Generation via 3D Near/in-Memory Architecture with Dual-Ring Sparse Attention

Zecheng Zhou; Yuxiang Zhao; Xinyu Qu; Ruohang Xu; Yufei Ma

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

The Visual Autoregressive (VAR) model, through its next-resolution prediction, demonstrates the significant potential of GPT-style AR models in image generation. However, due to its coarse-to-fine nature, the input token-map size grows dramatically with each step, resulting in excessive memory access and computational overhead. In this paper, we propose 3D-DuRA, an algorithm-architecture co-design based on a hybrid 3D near-memory and in-memory computing architecture equipped with dual-ring sparse attention, for efficient next-resolution visual-autoregressive generation. Experimental results demonstrate that 3D-DuRA achieves a 4.1× improvement in area efficiency compared with the RTX 6000 Ada GPU, along with 3.5× and 9.1× speedups and 10.1× and 13.1× improvements in energy efficiency on Infinity2B and VAR-d36, respectively.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH1365

Bringing Graphs Closer to Memory: Compiler-Driven Data Transfer Reduction in the PIM Accelerator

Jisung Pack; Junil Kim; Mun Seong Park; Seok Young Kim; Jaehan Park; Ilkon Kim; Seon Wook Kim

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

Processing-in-Memory (PIM) alleviates the memory bottleneck of modern AI workloads; however, its limited computational capability often necessitates hybrid architectures that integrate PIM with nearby processing units. In such heterogeneous systems, communication and host-side overheads remain dominant performance bottlenecks. Our profiling of SK hynix's AiMX reveals that CPU–AiMX communication and CPU-side data reordering account for 50.07% of end-to-end LLM prefill and 78.36% of decode execution time. Notably, 54.03% of reordering operations occur adjacent to AiMX-executable nodes, indicating significant untapped potential for near-memory execution to eliminate unnecessary data movement. We propose a graph compiler that jointly optimizes computation and data movement for AiMX-based acceleration. The compiler performs (1) graph simplification to maximize AiMX-kernel coverage, (2) DMA-assisted tensor reordering and layout redefinition to offload host-side preprocessing, and (3) operator fusion to suppress redundant memory transfers. Integrated into the ONNX Runtime, our approach achieves a 2.00x–2.73x speedup in the prefill phase and an 18.50x–38.29x speedup in the decode phase, demonstrating the substantial benefits of graph-level co-optimization for PIM-enabled LLM inference and achieving performance comparable to hand-tuned AiMX execution.

AIAI5-I. AI/ML System and Platform Design
RESEARCH1373

Slcross: Cross-Component System-Level Cache Side-Channel Attacks on Apple M2 SoC

Yakun Wu; Guanlong Wu; Yusi Feng; Yinqian Zhang

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

Cross-component cache side-channel attacks exploit shared cache resources across different processing units to infer sensitive information, posing a significant threat to modern heterogeneous computing systems. However, the effectiveness of fine-grained cache attacks across components on Apple silicon remains unexplored. In this paper, we demonstrate that cross-component cache attacks can leak sensitive GPU information from the CPU via the System-Level Cache (SLC) on the Apple M2 SoC. We first reverse-engineer the key SLC mechanisms required to enable cache attacks. Leveraging this knowledge, we introduce a GPU-to-CPU covert channel and an attack targeting the Large Language Model (LLM) embedding table during local LLM inference.

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH1374

SpArC: Sparse Tensor Accelerator Compilation with Scheduling and Mapping

Xingyan Chen; Zihan Wang; Lei Gong; Qianyu Cheng; Cheng Tang; Teng Wang; Wenqi Lou; Xuehai Zhou

Date:Monday, July 27 Location:Mtg Room 202C Session:Architectures for Sparse, Adaptive, and Scalable Acceleration +1
Abstract

Sparse tensor computation is widely used in deep learning and scientific computing, but its irregular computation and memory access patterns pose significant challenges for general-purpose accelerators such as GPUs and NPUs. FPGAs, owing to their reconfigurability, are inherently well-suited for handling sparse workloads. However, existing point-based manual design paradigms severely limit the performance potential of reconfigurable hardware across diverse sparse scenarios. To address this, we propose SpArC, an automated compilation framework for sparse tensor computation that automatically generates high-performance FPGA accelerators. SpArC consists of: (1) a primitive-based DSL serving as the design specification for sparse accelerators, providing a unified abstraction of data formats, iteration patterns, and hardware mapping strategies; (2) a two-stage mapping mechanism centered around sparse meta-operations, enabling arbitrary sparse dataflows to be efficiently mapped to hardware micro-architectures; and (3) a heuristic design space exploration method for joint software–hardware optimization. Experimental results show that SpArC can automatically generate FPGA accelerators for typical sparse workloads such as SpMSpV, Plus3, and SpMM. On the SuiteSparse benchmark dataset, for SpMSpV workloads, SpArC achieves 31.7×–130.4× performance improvement over automated frameworks such as ScaleHLS and Allo. On the same SuiteSparse SpMSpV workloads running on CPU platforms, SpArC further achieves 10.8×–34.7× speedup over TACO and 1.03×–1.89× speedup over MKL.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesAI2-I. AI/ML Algorithms and Models
RESEARCH1378

Leveraging Invariants for Scalable Verification of RISC-V Cryptography Extensions

Kim Fahrni; Katharina Ceesay-Seitz; Denis Zuppiger; Kaveh Razavi

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

RISC-V has recently ratified a vector cryptography extension. Exhaustive formal verification of hardware designs that implement this extension is crucial for security. However, scaling verification for designs with such large bit-widths is challenging. We present the first formal verification of Marian, an open-source implementation of the RISC-V vector cryptography extensions. We show that proof modularization enables us to obtain an unbounded proof for Marian. Together with our systematic invariant identification, we reach an additional speedup of 174%. Our evaluation shows that invariants that assert properties of counters, such as bounds or directions, or handshakes are particularly effective in improving the verification times. We show the generalizability and reusability of our invariant identification methodology by formally verifying another custom implementation of RISC-V vector cryptography extensions. During verification, we found a violation that turned out to be a flaw in the specification, which is now being updated.

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH1382

Accelerating AI+SAR Applications Through Enhanced Graph Optimization

Yang Shi; Yaohua Wang; Zhe Li

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Taming the Physical World: AI Strategies for Circuits and Cores to Quantum Machines +1
Abstract

Synthetic Aperture Radar (SAR), benefiting from its all-weather, all-time, and high-resolution characteristics, has become a vital tool in earth observation. A typical application of SAR first completes the imaging process of echo data and then conducts subsequent analysis. As AI excels in image classification and recognition, integrating AI with SAR has garnered significant interest. In order to leverage the acceleration capabilities of existing AI frameworks, particularly graph optimizations, while also reducing development complexity, developers strive to streamline the entire SAR application within these frameworks. However, a key performance challenge arises: SAR imaging differs greatly from typical AI tasks in both the requirements of data layouts and the composition of operators, causing graph optimizations to fail in effectively accelerating the SAR imaging process. The fundamental reason is the failure of the two key graph optimizations: layout transformation and operator splitting. In this paper, we first address the issue of significant transpose overhead introduced by the layout transformation strategy. To this end, we propose a novel layout transformation strategy based on pseudo-transposition operators, which can completely eliminate transpose overhead while maintaining memory access efficiency. Subsequently, we design a tailored splitting strategy based on movable reverse-order operators to compensate for existing frameworks' lack of capability in handling the core FFT operators. The proposed strategies were implemented in PyTorch and LiteRT, yielding a significant speedup of 3.45x for the SAR imaging process.

AIAI3-II. AI/ML Application and InfrastructureQuantumDES3. Emerging Models of Computation +1 more
RESEARCH1388

Parallel Combinational Equivalence Checking Through Factored Form Sharing

Nanjiang Qu; Cong Tian; Nan Zhang; Zhenhua Duan

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

Combinational equivalence checking (CEC) is a key and often time-critical step in the IC design flow for verifying functional equivalence after synthesis and technology mapping. As designs scale, improving CEC efficiency has become a major challenge with wide-ranging consequences for the entire flow. In this paper, we propose a novel parallel CEC approach through factored form sharing. In particular, the factored form literal count (FFLC) closely reflects both the amount of shareable literal/CNF-encoding and the branching complexity encountered during satisfiability (SAT)-based verification. Therefore, FFLC, together with a lightweight branching complexity proxy, guides a sharing-aware seed-and-grow partitioning algorithm that exploits literal and CNF-encoding reuse while balancing estimated SAT-solving complexity. Implemented in ABC and evaluated on large-scale benchmarks with 40 physical CPU cores, our implementation delivers geometric-mean speedups of 45.57× (max. 6,595×) over single-threaded ABC, 1.53× (max. 17×) over a state-of-the-art (SOTA) CPU-parallel method, and 5.52× (max. 382×) over a SOTA GPU-parallel method.

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH1390

ATLAS: Asynchronous Topological Learning for Accurate FIT Prediction via Decoupled Graph Neural Networks

Yicheng Liu; Mingjun Wang; Songwei Pei; Yuntao Lu; Zizhen Liu; Boyu Han; Huawei Li; Shangguang Wang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Safety-critical integrated circuits targeting ISO 26262 ASIL-D require a failure-in-time (FIT) below 10. Physics-based tools for FIT prediction, such as BFIT, are accurate but slow, while ML baselines collapse under long-tailed FIT distributions. We propose ATLAS, a decoupled GNN framework that combines a bidirectional asynchronous topological message passing backbone, aligned with BFIT's forward and backward propagation, with RankNet and CalibNet synergistic heads. RankNet uses a gap-aware pairwise ranking loss with log-gap weighting Δlog(t) and explicit emphasis on head samples to identify top-k% high-risk gates. CalibNet employs a dual-domain loss in log and linear spaces, incorporating importance weighting, for accurate calibration. The proposed design achieves O(|V|+|E|) complexity, delivering up to 220× speedup over BFIT while reducing selective hardening area overhead by 53% compared to DeepGate2 on ITC'99 benchmarks, thereby enabling the rapid reliability-aware design of large-scale circuits. We plan to open-source our data and model code after acceptance.
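
The gap-aware pairwise ranking idea can be illustrated with a minimal sketch. It assumes a logistic pairwise loss weighted by the log gap between two FIT labels; the function name, the normalization, and the omission of head-sample emphasis are assumptions made for illustration, not details taken from the abstract.

```python
import math

def gap_weighted_pairwise_loss(scores, fit):
    """Pairwise logistic ranking loss, weighted by the log gap of the labels.

    scores: predicted risk scores, one per gate (higher = riskier).
    fit:    positive reference FIT values, one per gate.
    Illustrative only: the exact loss used by RankNet in ATLAS is not
    given in the abstract.
    """
    total = 0.0
    weight_sum = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if fit[i] > fit[j]:
                # Larger log gaps make a mis-ordered pair more costly.
                w = math.log(fit[i]) - math.log(fit[j])
                margin = scores[i] - scores[j]
                total += w * math.log1p(math.exp(-margin))
                weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

Scores that order the gates the same way as the reference FIT values give a smaller loss than scores that reverse the order, which is the property a ranking head needs.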

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH1400

Actionflow: A Pipelined Action Acceleration for Vision Language Models on Edge

Yuntao Dai; Hang Gu; Teng Wang; Qianyu Cheng; Yifei Zheng; Zhiyong Qiu; Lei Gong; Wenqi Lou; Xuehai Zhou

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Scaling Intelligence for the Physical World +1
Abstract

Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20–30 Hz, current VLA models typically operate at only 3–5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55× improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.

SystemsSYS1. Autonomous Systems (Automotive, Robotics, Drones)AIQuantum
RESEARCH1404

Learning-Based Slack-Aware Timing-Driven Global Placement for Large-Scale Heterogeneous FPGAs

Yi Guo; Shikai Guo; Huijiang Liu; Ning Wang; Zhixiong Di; Xiaochen Li; He Jiang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

Timing closure is a key objective in FPGA placement. As designs scale, heterogeneous interconnections induce strong RC effects, weakening delay–wirelength correlation and limiting existing placers with inaccurate timing models and fixed weighting. We propose LS-Placer, a learning-based, slack-aware timing-driven placement framework that jointly models net and logic delays through a graph representation and integrates the learned timing model into a dynamic global placement flow with adaptive TNS-guided weighting. On average, LS-Placer improves WNS and TNS by ~11% and ~19% and achieves 3% shorter critical path delay than Vivado 2021.2.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH1407

COmPOSER: Circuit Optimization of mm-Wave/rf Circuits with Performance-Oriented Synthesis for Efficient Realizations

Subhadip Ghosh; Surya Srikar Peri; Ramprasath S; Sosina Berhan; Endalk Gebru; Ramesh Harjani; Sachin S. Sapatnekar

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

This work presents COmPOSER, an open-source, end-to-end framework for RF/mm-wave design automation that translates target specifications into optimized circuits with layouts. It unifies schematic synthesis, layout generation for actives and passives, and placement/routing, incorporating physics-based equations and machine-learning-driven electromagnetic models. Based on post-layout validation on multiple LNAs and PAs operating at up to 60 GHz in a commercial 65 nm process kit, COmPOSER meets performance targets, comparable to expert manual designs, while delivering a 100–300x productivity gain.

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH1409

CODA: A Computation-Data Decoupled Dataflow Paradigm for DNN Computing on NPUs

Xiuping Cui; Chengrui Zhang; Zihao Zheng; Xiang Chen; Yun (Eric) Liang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

Deep neural networks (DNNs) on Neural Processing Units (NPUs) require carefully optimized operator mappings to achieve high performance, yet the mapping space grows rapidly with increasingly complex on-chip memory hierarchies. Existing dataflow models rely on a computation–data coupled paradigm that forces all tensors to share identical storage structures and data transfer paths, severely limiting the expressible mapping space. We present CODA, a computation–data decoupled dataflow paradigm that enables tensor-wise independent modeling of on-chip storage and movement. CODA introduces the non-uniform loop space to jointly represent computation and per-tensor data mappings, together with an analytical performance model and a simulated-annealing–based optimizer. On single-operator and fused-operator workloads, CODA achieves 1.10x–1.11x and 1.14x–1.85x speedups over state-of-the-art methods, respectively.

AIAI5-I. AI/ML System and Platform Design
RESEARCH1412

Reconfigurable 3D-VRRAM LUT-CiM with Greedy G-Shuffle Quantization for End-to-End 4-Bit ViTs in Edge Dense Prediction

Yi Li; Zijian Ye; Zhaori Cong; Shengzhe Yan; Songqi Wang; Ning Lin; Jinshan Yue; Xiaojuan Qi; Zhongrui Wang; Han Wang

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

The rise of embodied intelligence is driving the deployment of Vision Transformers (ViTs) for dense prediction (DP) tasks on resource-limited edge devices. Look-up-table (LUT)-based digital compute-in-memory (CiM) has emerged as a promising paradigm, delivering high accuracy with superior energy and area efficiency. However, practical ViT deployment on LUT‑CiM for edge DP faces dual hardware and software challenges: (1) LUT cost scales exponentially with bit-width, while NLP‑oriented quantization struggles to push convolution-heavy ViTs below 8 bits, especially in compact on‑chip CiM models; and (2) fixed LUT structures cannot accommodate ViTs' heterogeneous static/dynamic workloads, while unstructured mixed‑precision weights strain memory density and utilization. To overcome these, we propose a software-hardware co-designed low-precision ViT accelerator. Software-wise, we propose Greedy G-shuffle, a lightweight, general quantization method, paired with a PCA‑based, channel‑wise weight assignment, tailored for lightweight ViTs in edge DP tasks. Hardware-wise, at the architectural level, we introduce a reconfigurable LUT-based CiM macro that accelerates both static and dynamic matrix operations with superior energy efficiency, complemented by a high-utilization pipeline- and tensor-parallel dataflow that serves heterogeneous ViT layers. At the circuit level, we employ high-density 1TnR VRRAM with flexible read paths and channel-wise hardware remapping to support mixed-precision INT4/2/1 weights, minimizing area overhead. We demonstrate the first fully 4-bit ViT with INT4 activations and structured INT4/2/1 weights on a reconfigurable 3D 1TnR VRRAM LUT-based CiM macro, achieving a 26.7% improvement in mIoU and a 30.1% reduction in RMSE on the ADE20K and NYUDv2 compared to vanilla round-to-nearest quantization, along with up to 8.9× area efficiency and 9.7× energy efficiency improvements over the state-of-the-art.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH1414

Expcheck: Dynamic Expert-Aware Checkpointing for Mixture-of-Experts Based Models

Guorui Xu; Xiang Liu; Keni Qiu; Qiang Su; Weiwei Wu; Guang Kou; Chenchen Fu

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

Large-scale Mixture-of-Experts (MoE) models are pivotal in modern AI, yet their massive parameter size creates a "storage wall" for fault tolerance, where limited bandwidth restricts checkpoint frequency and risks significant wasted computation. We present "EXPCheck", a dynamic expert-aware checkpointing system designed to resolve the conflict between massive MoE states and limited persistence bandwidth. Grounded in the observation that expert activation is highly imbalanced, EXPCheck employs a novel "Aging-then-Greedy Expert Selection (AGES)" policy. AGES first enforces an age-based refresh for overdue "cold" experts to prevent indefinite staleness, and then greedily allocates the remaining persistence budget to frequently updated "hot" experts. Implemented on a production-scale training stack, EXPCheck significantly reduces persistence traffic and increases checkpoint frequency by up to 5× compared to full checkpointing, while maintaining downstream model accuracy comparable to standard methods.
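
The two phases of an aging-then-greedy policy can be sketched as follows. The `Expert` fields, the `max_age` threshold, and the byte budget are hypothetical names chosen for illustration; the abstract does not specify how EXPCheck measures age, hotness, or budget.

```python
from typing import NamedTuple

class Expert(NamedTuple):
    name: str
    age: int      # checkpoints since this expert was last persisted (assumed metric)
    updates: int  # updates received since it was last persisted (assumed metric)
    size: int     # bytes needed to persist it

def ages_select(experts, budget, max_age):
    """Pick the experts to persist in this checkpoint under a byte budget."""
    chosen = []
    remaining = budget

    # Phase 1 (aging): refresh overdue experts first, oldest first,
    # so that rarely used experts never become indefinitely stale.
    overdue = sorted((e for e in experts if e.age >= max_age),
                     key=lambda e: e.age, reverse=True)
    for e in overdue:
        if e.size <= remaining:
            chosen.append(e.name)
            remaining -= e.size

    # Phase 2 (greedy): spend what is left on the most frequently updated experts.
    picked = set(chosen)
    hot = sorted((e for e in experts if e.name not in picked),
                 key=lambda e: e.updates, reverse=True)
    for e in hot:
        if e.size <= remaining:
            chosen.append(e.name)
            remaining -= e.size
    return chosen
```

With a budget of two experts, one overdue cold expert and several hot ones, the sketch persists the overdue expert first and then the hottest remaining one.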

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH1417

Scalable Reliability Assessment of DNNs Through Simultaneous Fault Injection

Rafael Billig Tonetto; Marcello Traiola; Fernando Fernandes dos Santos; Angeliki Kritikakou

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Fast, Furious, and Fault-Tolerant: Accelerating the Generative Grind +1
Abstract

Deep Neural Networks (DNNs) are being deployed in safety-critical applications, where resilience to transient faults is essential. Traditional fault injection methods often face challenges in scaling efficiently to larger models, whereas the majority of existing speed-up techniques is closely linked to specific hardware or architectural configurations. To speed up the assessment of single-fault effects, an approach based on simultaneous injection of faults has demonstrated promising results, where multiple non-interacting faults are injected concurrently during a single workload execution. Nevertheless, the applicability of this method to DNNs has not been explored. In this study, we investigate the use of simultaneous injection of faults in DNNs and observe that faults can easily interact with one another due to DNNs' densely connected structure. These fault interactions can create "artificial" masking effects, leading to the misclassification of faults as non-critical (called false negatives), ultimately compromising the accuracy of the reliability assessment. To overcome this phenomenon, we propose an approach to mitigate the effects of such fault interaction during simultaneous injection of faults in DNNs, ensuring accurate assessment. Furthermore, we propose a strategy to further accelerate the assessment by pruning non-critical inputs from the DNN input batch during fault injection, further improving the speedup with negligible accuracy loss. To our knowledge, this is the first approach to enable accurate and efficient simultaneous injection of faults into DNNs, supporting fast reliability assessment applicable to different abstraction levels. We experiment with nearly 42 million injections at both software (SW) and RTL, achieving very low false negatives (as low as 0%, avg 0.2%) and an average injection time gain of 3.82x (RTL) and 5.29x (SW) over existing DNN fault injection approaches.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH1419

Duet: Exploiting High-Order Lossless Sparsity for Bit-Serial Transformer Acceleration via Distribution-Aware Pruning

Heng Liao; Runzhou Zhang; Faxian Sun; Yifan Zhang; Kejia Shi; Haodong Lu; Kun Wang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

Transformer models achieve exceptional performance, but face high computational costs that hinder their deployment on resource-constrained devices. While bit-column sparsity (BCS) offers promising post-training acceleration, existing distribution-agnostic methods neglect natural sparsity in the most significant bits (MSBs) from Gaussian-like weight distributions, requiring aggressive accuracy-degrading modifications on the least significant bits (LSBs). This paper presents Duet, a bit-serial accelerator fully exploiting High-Order Lossless Sparsity in BCS for load-balanced Transformer acceleration. At the algorithm level, Distribution-Aware Pruning (DAP) partitions weights by a hyperparameter to maximize lossless MSB pruning opportunities, while Fixed Redundant Hierarchical Search (FRHS) optimally handles remaining compression, achieving 3.13/3.25 effective bits with negligible accuracy loss. At the architecture level, our Duet accelerator addresses four key challenges: (1) two-level shifter resolves bit significance mismatch in Duet encoding format; (2) parallel Metadata-Weight pipelines support variable bitwidths while completely hiding metadata processing overhead; (3) Activation Sum Generator supports time-multiplexed Metadata Pipeline; (4) dual mode operation handles both linear and attention layers in Transformer models. Duet accelerator ensures load-balanced execution for all Processing Elements (PEs). Experiments on BERT and ViT demonstrate 1.45× ~ 4.30× speedup and 1.32× ~ 2.94× energy improvement over SOTA accelerators.
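
The "natural sparsity in the most significant bits" can be made concrete with a small sketch that finds the all-zero bit columns in a group of quantized weights. It assumes sign-magnitude integers with a separate sign bit, and `zero_bit_columns` is a hypothetical helper; the Duet encoding format itself is not described in the abstract.

```python
def zero_bit_columns(weights, bits=8):
    """Return the magnitude bit positions that are zero for every weight.

    weights: a group of integers whose magnitudes fit in (bits - 1) bits
             (sign-magnitude representation is assumed here).
    A bit-serial engine can skip such columns without changing the result,
    which is why this form of sparsity is lossless.
    """
    magnitudes = [abs(w) for w in weights]
    return [b for b in range(bits - 1)
            if all((m >> b) & 1 == 0 for m in magnitudes)]
```

For a group of small weights such as `[3, -5, 2, 1]`, every magnitude fits in three bits, so the four upper magnitude columns (positions 3 to 6) are empty and can be skipped.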

AIAI5-I. AI/ML System and Platform Design
RESEARCH1420

IGNITE: An Inquisitive and Graph-Aware Neural Circuit Simulator with Two-Stage Enhanced In-Context Learning

Jing Kou; Jiayi Sun; Yifei Lyu; Liang Zhang; Yinpan Wang; Yuxin Xie; Yinbo Sun; Wei Xing; Wang Kang

Date:Monday, July 27 Location:Mtg Room 202AB Session:Generative AI and Graph-Based Learning for Next-Gen Circuit & Device Simulation +1
Abstract

Neural circuit simulators promise fast alternatives to SPICE but remain impractical due to three fundamental limitations: per-circuit hyperparameter tuning, cold-start problems from lacking cross-task knowledge sharing, and prohibitive online retraining costs that negate computational advantages. We introduce IGNITE, a pretrained neural simulator that addresses these challenges through in-context learning. Built on a self-supervised topology encoder and Prior-data Fitted Network decoder, IGNITE adapts to new circuits via a single forward pass conditioned on a small context set, eliminating online training overhead. An information-theoretic active learning strategy enables intelligent sample selection, maximizing data efficiency. IGNITE transfers knowledge across topologies and technology nodes, achieving R² > 0.9 with only 95 samples on average, a 15.8× improvement over SOTA. Integration with optimization frameworks yields 2–30× speedup for transistor sizing and more than 5× speedup for yield optimization, demonstrating practical viability for industrial analog circuit design.

EDAEDA6. Analog CAD, Simulation, Verification and TestChipletSYS4. Embedded System Design Tools and Methodologies
RESEARCH1430

Spike-PTSD: A Bio-Plausible Adversarial Example Attack on Spiking Neural Networks via PTSD-Inspired Spike Scaling

Lingxin Jin; Wei Jiang; Maregu Assefa; Letian Chen; Jinyu Zhan; Xingzhi Zhou; Lin Zuo; Naoufel Werghi

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

Spiking Neural Networks (SNNs) are energy-efficient and biologically plausible, ideal for embedded and security-critical systems, yet their adversarial robustness remains open. Existing adversarial attacks often overlook SNNs' bio-plausible dynamics. We propose Spike-PTSD, a biologically inspired adversarial attack framework modeled on abnormal neural firing in Post-Traumatic Stress Disorder (PTSD). It localizes decision-critical layers, selects neurons via hyper/hypoactivation signatures, and optimizes adversarial examples with dual objectives. Across six datasets, three encoding types, and four models, Spike-PTSD achieves over 99% success rates, systematically compromising SNN robustness. Code: https://anonymous.4open.science/r/Anonymous-testcode.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1431

G-Power: Architecture-Level GPU Power Modeling with Aggregated Knowledge Foundations from Known GPUs

Qijun Zhang; Yao Lu; Shang Liu; Mengming Li; Chen Zhang; Dongbo Wang; Zhiyao Xie

Date:Monday, July 27 Location:Mtg Room 203C Session:Streamlining Silicon: From Real-Time Scheduling to Efficient GPU Execution +1
Abstract

Graphics Processing Units (GPUs) have been serving as critical computation resources for large-scale parallel computations. With increasing chip complexity, power efficiency has become an important design objective for modern GPUs. GPU power optimization relies on fast power evaluation, which requires an architecture-level GPU power model. However, because of the time-consuming power label collection, only simple microbenchmarks are adopted for training. The limitation of microbenchmarks as training data incurs low accuracy for existing architecture-level GPU power models like AccelWattch. To address this limitation, we propose G-Power, an architecture-level GPU power modeling framework that utilizes additional known GPU chips to provide additional knowledge. G-Power utilizes the aggregated knowledge foundation from additional known GPU chips and then performs fine-tuning on our target GPU. To provide foundations with additional known GPU chips and capture the similarity to utilize these foundations for fine-tuning, G-Power adopts a three-phase algorithm consisting of 1) pre-training with additional known chips, 2) attention-inspired aggregation, and 3) fine-tuning on our target GPU. We evaluate G-Power on four modern NVIDIA GPUs, demonstrating high accuracy. G-Power can achieve a low MAPE of 14% and a high correlation coefficient R of 0.88 on average, which are 22% lower MAPE and 0.36 higher R than AccelWattch.

SystemsSYS4. Embedded System Design Tools and MethodologiesChipletEDA
RESEARCH1434

G-Matcher: GPU-Accelerated Multi-Level, Multi-Stage Framework for Large-Scale Layout Pattern Matching

Nan Wang; Hanbin Dong; Chengxuan Lv; Jiaqi Liu; Tao Wu; Jingyi Yu; Hao Geng

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

Pattern matching has become one of the most predominant industrial solutions in the IC design-to-manufacturing flow, with broad applications in layout verification, hotspot detection, and Optical Proximity Correction (OPC). However, conventional CPU-based approaches, which rely on rule-based extraction, suffer from excessive false matches and high runtime overhead. These inefficiencies create bottlenecks in both verification and manufacturing workflows at advanced nodes. To address these limitations, we introduce G-Matcher, the first fully GPU-accelerated pattern matching framework, with a novel multi-stage, multi-level architecture. Our GPU-oriented development exploits massive parallelism to process ultra-large-scale layout data. The multi-stage pipeline filters false matches, while the multi-level paradigm utilizes specialized pattern and layout representations to accelerate the search. On industrial datasets, our framework demonstrates a 407× speedup over the 64-thread commercial tool Calibre. On public benchmarks, it achieves 7.1× speedup over a state-of-the-art CPU implementation. Importantly, G-Matcher maintains 100% matching accuracy with zero false matches across all evaluations.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH1438

PiHG: A Redundancy-Free Heterogeneous Graph Neural Network Accelerator via a Pivot-Centric Approach

Qiyuan Niu; Yu Zhang; Lin Gu; Zhongtian Long; Ruida Xin; Yutao Fu; Jin Zhao; Hai Jin

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

Heterogeneous Graph Neural Networks (HGNNs) are widely used to capture structural and semantic information in heterogeneous graphs based on sequences of vertex and edge types (i.e., metapaths). To support HGNNs, several solutions have been proposed and adopt a source-centric approach that can concurrently process metapath instances sharing the same source vertex. However, the vertices and edges, which are shared by different metapath instances and beyond the first hop of the same source vertex, may be repeatedly processed, incurring irregular memory accesses and redundant edge computations. In this work, we observe that HGNN inference exhibits substantial vertex and edge overlaps across metapath instances, and the metapaths sharing a common pivot vertex (i.e., the center vertex bridging multiple metapaths) exhibit strong spatial similarity that effectively captures these overlaps. Based on these observations, we propose a pivot-centric accelerator named PiHG to effectively support HGNN inference. Specifically, PiHG introduces a novel pivot-centric execution model into accelerator design to concentrate feature accesses and computations around pivot vertices, which enables reusing the feature vectors of overlapped vertices and edges across metapaths and thus eliminating redundant computations and reducing irregular off-chip memory accesses. Experimental results show that, compared with the state-of-the-art software and hardware solutions, PiHG achieves 13.1×∼236.8×, 2.2×∼11.7× speedups and 15.5×∼216.8×, 2.4×∼12.7× energy savings, respectively.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH1440

TACo: Training-Free, Hardware-Aware ViT Architecture Search with a Hypervolume-Based Unified Zero-Cost Score

Eunji Kwon

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

We propose TACo, a training-free, hardware-aware architecture search framework for Vision Transformers (ViTs), powered by a Hypervolume-based Unified Zero-Cost Score (HV-score) that integrates four accuracy- and hardware-related dimensions: a newly proposed zero-cost accuracy score (ZAS), latency, energy, and activation memory. TACo rapidly estimates the potential of candidate ViTs without training: ZAS captures layer-wise gradient stability, weight–gradient coupling, and gradient-modulated activation stability, while the HV-score guides multi-objective selection using Pareto and hypervolume principles. Experiments on CIFAR-10 and ImageNet-1K show that TACo reduces search time from GPU-days to GPU-hours while identifying a well-balanced ViT architecture in terms of accuracy and computational cost.
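
Pareto and hypervolume principles can be illustrated in two dimensions. The sketch below assumes both objectives are minimized (for example latency and an error proxy) and uses a caller-chosen reference point; TACo's HV-score combines four dimensions, and its exact formulation is not given in the abstract.

```python
def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    front = []
    for p in points:
        dominated = any(
            q != p and all(q[i] <= p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

def hypervolume_2d(points, ref):
    """Area dominated by the front and bounded by the reference point `ref`.

    Assumes every point is a 2-tuple that is no worse than `ref`
    in both objectives.
    """
    front = sorted(set(pareto_front(points)))  # ascending x, hence descending y
    area = 0.0
    prev_y = ref[1]
    for x, y in front:
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area
```

A candidate set with a larger hypervolume covers more of the objective space below the reference point, which makes the hypervolume usable as a single score for comparing multi-objective trade-offs.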

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH1441

An Adaptive Prediction of Tokens Fluctuation for Decoupled Prefetching to Accelerate Long-Context LLM Decoding

Yuanhua Xiao; Jihe Wang; Zhiyu Sun; Linying Wu; Danghui Wang

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Orchestrating LLM Systems at Scale: Heterogeneity, Memory, and Throughput +1
Abstract

The prohibitively slow speed of long-context LLM decoding is a critical bottleneck, caused by massive Key-Value (KV) access that saturates memory bandwidth. Existing solutions fail due to a fundamental trade-off: static predictors cannot handle the Tokens Fluctuation, leading to permanent accuracy loss from irreversible eviction, while an on-the-fly scoring method creates a synchronous bottleneck that stalls the entire pipeline. This paper presents a co-designed system that breaks this trade-off by leveraging an Adaptive Prediction algorithm to enable a fully Decoupled Prefetching architecture. The proposed Adaptive Prediction model first mirrors the nature of Tokens Fluctuation, enabling a lightweight, reuse-based recovery mechanism that solves irreversible eviction. This high-fidelity prediction then unlocks the Decoupled Prefetching system, which completely hides data transfer latency and eliminates the synchronous bottleneck. On long-context LLM decoding tasks, the proposed system matches the accuracy of a full KV cache with just 256 tokens, while achieving an average 2.50x end-to-end speedup.

AIAI5-I. AI/ML System and Platform DesignEDA7-II. Physical Design and VerificationDesign
RESEARCH1444

Janus: Compiler-Based Defense Against Transient Execution Attacks Using ARM Hardware Primitives

Ciyan Ouyang; Peinan Li; Yubiao Huang; Dan Meng; Rui Hou

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

We present Janus, a compiler-based security framework that mitigates transient execution attacks like Spectre and control-flow hijacking on ARM64 platforms. Janus integrates speculative execution and control flow dependencies with PA modifiers, using PA and BTI microarchitectural features to prevent control-flow speculation attacks and secure both control flow and speculative execution through existing control-flow integrity mechanisms. To optimize performance, Janus minimizes overhead by merging defense operations across different defense layers (modifier fusion) and reusing registers of protected variables (carrier reuse), while maintaining strong security guarantees. Evaluation on SPEC CPU2017 shows an average performance overhead of 3.85%, with real-world applications exhibiting overheads ranging from 2.97% to 7.80%. Janus offers effective speculative execution security and low performance and code size overhead, making it a robust solution for ARM-based systems.

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH1447

STCG: Sparse Tensor Code Generation via Adaptive Tile Scheduling on GPUs with Tensor Cores

Haotian Wang; Yan Ding; Huilong Pi; Keqin Li; Kenli Li; Wangdong Yang

Date:Monday, July 27 Location:Mtg Room 203C Session:Streamlining Silicon: From Real-Time Scheduling to Efficient GPU Execution +1
Abstract

Sparse tensor operations underpin graph neural networks, scientific simulations, and data analytics, yet irregular sparsity leads to highly uneven tile densities and irregular memory access patterns, preventing GPUs from effectively utilizing Tensor Cores and CUDA cores. We present STCG, a compiler–runtime framework for sparse tensor code generation with adaptive tile scheduling on modern GPUs. STCG introduces a four-level IR that separates algebraic lowering, tiling, micro-kernel construction, and heterogeneous kernel assembly. The compiler emits a unified GPU kernel that embeds both WMMA-based Tensor Core micro-kernels and coordinate-indexed FMA kernels under a consistent memory and warp layout, ensuring stable register and shared-memory usage despite irregular sparsity. At runtime, lightweight tile descriptors enable constant-time, per-tile selection between the WMMA and FMA paths, providing data-aware load balancing and eliminating the need for kernel regeneration or manual tuning. On an NVIDIA A100, STCG achieves a 1.13× geometric-mean speedup over the per-dataset best existing method on SpMM and a 5.06× speedup on SpTTM, demonstrating the effectiveness and advancement of our approach across diverse sparsity patterns.

SystemsSYS4. Embedded System Design Tools and MethodologiesChipletEDA
RESEARCH1449

Maple: Efficient Data-Transfer-Aware Mapping Exploration for NDP-Enabled Edge LLM Inference

Xuefeng Hua; Cong Li; Xin Si; Qiang Wu; Guangyu Sun

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

Near-DRAM Processing (NDP) has demonstrated great potential for memory-bound operators in edge-side large language model (LLM) inference. Existing designs, which feature an xPU-NDP heterogeneous system, concentrate on optimizing the execution flow within processing units yet ignore the bottleneck arising from xPU-NDP data transfer. To tackle this challenge, we propose Maple, an efficient mapping exploration framework. Given a specific workload and NDP architecture, Maple adopts an address-mapping-based description method to construct a comprehensive search space that encompasses resource grouping, tensor partitioning, and transfer binding, thereby facilitating joint optimization of computation and data transfer. Experiments show that Maple achieves an improvement of up to 3.34× in performance and 1.49× in energy compared with existing approaches on mainstream NDP architectures.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH1464

Thrend: Mitigating Counting Thread-Based Fine-Grained Timing on Arm and Apple CPUs

Chang Liu; Feng Shuaihu; Jiaqi Cui; Hongpei Zheng; Aodong Cui; Jian Dong; Yuan Li; Trevor E. Carlson; Dongsheng Wang

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

This paper presents Thrend, the first defense that mitigates counting threads used in timing side-channel attacks on Arm and Apple CPUs through real-time monitoring and scheduling. Thrend leverages fuzzing to automatically generate high-quality counting threads for training, and detects them by sampling hardware performance counters. For suspected counting threads, Thrend prevents effective timing measurements by co-scheduling the attacker threads and the counting threads on the same core. We implement Thrend on four Arm and Apple devices. At a sampling rate of 100 Hz, Thrend achieves 99.71% detection accuracy and a 0% false-negative rate while incurring less than 1.42% runtime overhead.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1466

InferWeave: Efficient LLM Inference on the MT-3000 Processor

Xinxin Qi; Jianbin Fang; Peng Zhang; Yonggang Che

Date:Monday, July 27 Location:Mtg Room 101B Session:Computing Without Waste: Lean and Scalable Neural Architectures +1
Abstract

The deployment of large language models (LLMs) is increasingly targeting heterogeneous unified-memory architectures (HUMA) for edge and cost-sensitive computing. However, GPU-centric inference systems perform suboptimally on HUMA due to shared DRAM bandwidth contention and inefficient static operator placement. This paper introduces InferWeave, a bandwidth-aware inference framework that co-designs offline planning and runtime scheduling for HUMA. InferWeave's offline planner formulates operator placement and data staging as a unified optimization problem with DRAM bandwidth as a first-class constraint. At runtime, a bandwidth-aware asynchronous pipeline scheduler enforces these plans by regulating memory access rates, enabling non-blocking execution across host and device. Evaluated on LLaMA and GPT models using the MT-3000 processor, InferWeave achieves up to 2.3x higher throughput than FlexGen and llama2.c, demonstrating its effectiveness in enabling efficient LLM inference on integrated architectures.

AIAI3-I. AI/ML Application and InfrastructureChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1468

AdaHuff-GNN: A Convergency-Aware Adaptive Huffman Compression for Communication Reduction in Distributed GNN Training

Zhiyu Sun; Jihe Wang; Zhaohui Jia; Yuanhua Xiao; Linying Wu; Danghui Wang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Rethinking AI Compute: Cross-Layer Co-Design Beyond Conventional GPUs +1
Abstract

Distributed training of Graph Neural Networks (GNNs) is hindered by communication overhead from exchanging embeddings and gradients. Existing quantization methods mitigate this cost by using shared bit widths at the layer or group level, followed by node-level uniform quantization, but such coarse granularity cannot fully exploit redundancy in long-tailed communication distributions. Entropy compression is well suited to these distributions, but static entropy compression schemes become less effective when training induces untracked distribution shifts. To address this issue, AdaHuff-GNN, a convergency-aware adaptive Huffman compression framework, is proposed to apply entropy compression to reduce communication cost in distributed GNN training. Within AdaHuff-GNN, loss-triggered codebook reconstruction and coarse-to-fine adaptive selection of codebook sizes jointly produce stage-wise codebooks that track distribution shifts. Binning-assisted clustering and conflict-free decoding reduce codec overhead on the communication critical path. Experiments show that AdaHuff-GNN reduces communication volume by 2.87× and shortens training epoch time by 37.5% over state-of-the-art methods without degrading model accuracy.

AIAI5-I. AI/ML System and Platform Design
RESEARCH1473

GRACE: A Ground Plane Generation and Re-Routing Aware Co-Design Engine

Tongkai Wu; Beichen Li; Yapeng Li; Zuxun Duan; Yanshi Liang; Yihe Wang; Baohui Xie; Qunsong Ye; Tinghuan Chen

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

In modern high-density Printed Circuit Board (PCB) design, the conventional sequential workflow often leads to a fractured ground plane, compromising signal integrity and system performance. Existing Electronic Design Automation (EDA) tools passively fill unoccupied space with copper post-routing, lacking the ability to modify wire segments to improve ground plane continuity. In this paper, we propose GRACE (a Ground Plane Generation and Re-routing Aware Co-design Engine), a novel, topology-aware framework that fundamentally shifts the paradigm from passive filling to an integrated co-design process. GRACE formulates ground plane generation as an optimization problem tightly coupled with automated rerouting. It partitions the layout into copper-pourable regions and blockages, and abstracts this into a novel weighted graph model, the Ground Plane Graph (GPGraph). We then formulate an Integer Linear Programming (ILP) problem on the GPGraph to identify a globally optimal set of blockages to eliminate. By strategically rerouting a minimal number of power and signal wire segments, GRACE merges isolated copper islands to create a maximal, contiguous ground plane. Experimental results demonstrate that GRACE increases the area of the ground plane by an average of 33.06% and reduces the number of ground plane polygons by 54.34% compared to open-source EDA tools. Moreover, our method achieves a significant speed-up over manual refinement while attaining an equivalent ground plane area and number of ground plane polygons. To the best of our knowledge, this is the first work to automate ground plane unification through an integrated rerouting-aware co-design methodology.

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH1474

Compilation Tells Energy: Rethinking Power Modeling for DNN Accelerator Agile Design

Donger Luo; Yihong Yi; Xinheng Li; Tianyi Li; Jianwang Zhai; Qi Sun; Cheng Zhuo; Hao Geng

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

RTL generators enable agile design of DNN accelerators, but the lack of early-stage power feedback forces designers to discover energy inefficiencies only after costly synthesis. Existing architecture-level simulator-based approaches fall short for agile workflows: they require expertise and effort incompatible with rapid iteration. While machine learning struggles to capture software-hardware coupling, we reveal a key insight: compilation tells energy. Compiler toolchains in RTL generators already fuse workload and hardware characteristics, the very coupling that determines power. By extracting features from compiler IRs and applying multi-task learning, our methodology achieves practical accuracy through push-button workflows requiring no expertise. Validation on Gemmini and hls4ml demonstrates broad applicability, enabling true power-aware agile design.

EDAEDA4. Power Analysis and OptimizationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1480

ADAPT: An Algorithm-Hardware Co-Design for Spiking Transformer with Training-Free Adaptive Token Pruning and Hierarchical Sparsity

Pinfeng Jiang; Yilong Fang; Letian Wang; Yi Wang; Mingde Zhu; Xiangshui Miao; Xingsheng Wang

Date:Monday, July 27 Location:Mtg Room 203AB Session:Advancing the Frontier of Neuromorphic Learning Systems +1
Abstract

Spiking neural networks (SNNs) represent a promising solution for emerging architectures due to high sparsity and low power consumption. Spiking transformers extend these advantages to attention-based modeling and show strong potential for energy-efficient applications. However, their practical deployment remains difficult. Spiking transformers demand substantial computation and memory access because of long token sequences and heavy workloads in feed-forward networks. Moreover, existing sparse optimization methods provide limited benefit for spiking transformers since they either focus only on unstructured bit-level sparsity or require model retraining. This work presents ADAPT, an algorithm–hardware co-design that exploits the hierarchical sparsity of spiking transformers. At the algorithm level, we propose Adaptive Token Pruning (ATP), a training-free method that evaluates token diversity and spike activity to remove redundant tokens. At the hardware level, we design a hierarchical-sparse accelerator that introduces a block-sparsity compression format and a pattern processing unit to leverage repeated bit patterns inside non-zero blocks. Experiments on multiple spiking transformer models demonstrate that ATP prunes 50% of tokens with negligible accuracy loss. The ADAPT accelerator achieves average speedups of 3.1x and 4.2x over the state-of-the-art SNN accelerator Prosperity and a GPU, respectively, while reducing energy by 1.9x and 149.4x. These results show that exploiting hierarchical sparsity with algorithm–hardware co-design enables efficient deployment of spiking transformers.

DesignDES3. Emerging Models of ComputationChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1494

When Cloud TEEs Encounter Availability: A Lightweight Framework for Verifiable CPU Availability

Shangjie Pan; Haochuan Lei; Yinghao Yang; Dongrong Zhang; Dong Du; Hang Lu; Xiaowei Li

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

Trusted Execution Environments (TEEs) provide confidentiality and integrity, but their reliance on untrusted schedulers makes them vulnerable to CPU Denial-of-Service (DoS) attacks, compromising availability. Existing solutions either enlarge the Trusted Computing Base (TCB), rely on static closed-world workload assumptions that block dynamic enclave admission, or require hardware modifications. We propose AvaTEE, a lightweight framework for verifiable CPU availability in TEEs. AvaTEE uniquely integrates resource negotiation into remote attestation, providing a verifiable resource commitment pre-deployment. At runtime, a trusted scheduler performs dynamic monitoring and preemptive arbitration to mitigate DoS attacks. We evaluate AvaTEE on an FPGA. The results demonstrate that it (1) provides robust protection under contention, where native TEEs suffer a 94.8% performance loss and up to a 19.7× slowdown; (2) incurs negligible overhead (less than 2%) during enclave startup and normal runtime; and (3) maintains near-native performance, with only 1.70% average overhead compared to Keystone.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1495

PMU-Faker: The Apparently Harmless Performance Monitoring Unit Events Are Actually Harmful

Zeru Lan; Pengfei Qiu; Yaxuan Zhao; Ziyan Huang; Zhihao Zhang; Jiayi Guo; Chunlu Wang; Dongsheng Wang; Gang Qu

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

Previous studies have verified that some special Performance Monitoring Unit (PMU) events are insecure. In this work, we further disclose that even PMU events that are securely designed and implemented can be used to construct malicious attacks. We develop and implement a fuzz testing framework to automatically identify and verify numerous PMU events. With it, we discover 118 vulnerable PMU events. Leveraging these identified events, we introduce PMU-Faker, a precise timing mechanism that can implement most timing-based side-channel attacks without a timer. To accommodate variations in instruction-triggered cycle durations, PMU-Faker provides two timing mechanisms with varying levels of granularity, thereby balancing precision and adaptability. Moreover, we use PMU-Faker to implement four attacks: executing transient execution attacks, extracting an AES key, attacking Intel Software Guard Extensions (SGX), and achieving cross-virtual-machine information leakage.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH1497

Unveiling the Security Risks Driven by the Hardware Interrupts

Zhihao Zhang; Pengfei Qiu; Zhouyang Li; Yaxuan Zhao; Zeru Lan; Jiliang Zhang; Chunlu Wang; Gang Qu; Dongsheng Wang

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

The interrupt is a significant mechanism for computer systems to schedule hardware resources. We find that some hardware interrupts are not well managed and can expose low-level events to unprivileged software, which introduces unanticipated security risks. We first design an automated testing framework to identify the interrupts' working features and trigger conditions, which helps us discover six vulnerable interrupts. Next, we design a novel set of attack primitives for leaking secrets using them. Finally, we realize four attacks with the proposed attack primitives: leaking contents from a restricted directory, fingerprinting DNN model architectures, classifying processes, and enhancing Spectre attacks.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH1499

KCD-CAM: A K-Nearest-Neighbor Accelerator Based on Charge-Domain Content-Addressable-Memory for Point Cloud Processing

Tong Hu; Jiancong Li; Yingjie Yu; Zhiwei Zhou; Jia Chen; Yi Li; Xiangshui Miao

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Three-dimensional (3D) point cloud models have been widely employed in modern 3D perception tasks such as robotics, autonomous driving, and virtual reality. The k-nearest-neighbor (KNN) search serves as a cornerstone operation for point cloud models, providing the essential mechanism for defining and exploiting local spatial relationships within unstructured data. However, the massive scale of point cloud data and the high computational complexity of Euclidean distance-based top-k searches in KNN impose substantial computational overhead. Conventional edge computing platforms struggle to achieve real-time, high-efficiency point cloud processing. In this work, we propose KCD-CAM, a KNN accelerator using ReRAM-based charge-domain content-addressable memory (CAM) for efficient point cloud processing. The proposed KCD-CAM employs a 4T2R CAM cell capable of performing in-situ range search, effectively replacing complex Euclidean distance calculations with massively parallel operations. In addition, a corner-clipped (CC) iterative top-k search scheme and dual-granularity voxel hashing (DG-VH) are employed to enhance accuracy and parallelism. Performance benchmarks on real-world datasets demonstrate that KCD-CAM achieves 279.79× higher speed and 3282× greater energy efficiency than GPU implementations. Compared to the SOTA KNN accelerators, KCD-CAM also achieves 8.51× and 4.76× improvements in speed and energy efficiency, respectively.

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1519

AgenticDSE: A Multi-Agent Design Space Exploration Framework with Multi-Phase Bayesian Optimization for Chiplet Accelerators

Zhantong Zhu; Zhuolin Li; Kangbo Bai; Hongou Li; Tianyu Jia

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Interconnect-Driven System Architecture Innovation +1
Abstract

Chiplet accelerators offer a scalable solution for LLM inference with reduced manufacturing costs. However, the design space exploration (DSE) of chiplet accelerators is challenging due to the complex design space. Prior black-box Bayesian optimization (BO) solutions lack domain knowledge, limiting their effectiveness. In this work, we propose AgenticDSE, a multi-agent DSE framework that incorporates the collaboration of three LLM agents, i.e., exploration orchestrator, architecture analyst, and optimization engineer. This multi-agent framework enhances exploration efficiency through analysis-driven design space refinement and phase-wise design exploration using ensemble surrogate modeling. Over the same number of explorations, AgenticDSE achieves up to 36.9% reduction of average distance to the real Pareto front, and 66% increase in diversity of explored Pareto front compared to state-of-the-art DSE solutions. Additionally, it offers scalable performance with 46x and 18x reduction in input and output token consumption compared to prior LLM-based solutions.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageQuantum +1 more
RESEARCH1524

Exploiting Dependency and Parallelism: Real-Time Scheduling and Analysis for GPU Tasks

Yuanhai Zhang; Songyang He; Ruizhe Gou; Mingyue Cui; Boyang Li; Shuai Zhao; Kai Huang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

With the rapid advancement of Artificial Intelligence, the Graphics Processing Unit (GPU) has become increasingly essential across a growing number of safety-critical application domains. GPUs are indispensable for parallel computing; however, the complex data dependencies and resource contention across kernels within a GPU task may unpredictably prolong its execution time. To address these problems, this paper presents a scheduling and analysis method for Directed Acyclic Graph (DAG)-structured GPU tasks. Given a DAG representation, the proposed scheduling scales the kernel-level parallelism and establishes inter-kernel dependencies to provide a reduced and predictable DAG response time. The corresponding timing analysis yields a safe yet non-pessimistic makespan bound without any assumption on kernel priorities. The proposed method is implemented using the standard CUDA API, requiring no additional software or hardware support. Experimental results on synthetic and real-world benchmarks demonstrate that the proposed approach reduces the worst-case makespan and the measured task execution time by up to 32.8% and 21.3%, respectively, compared to existing methods.

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH1534

DRΛMA: Detailed Routing by a Versatile Maze Routing Algorithm

Jinwei Liu; Xingyu Cui; Liang Xiao; Bangqi Fu; Evangeline Young

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

Detailed routing, despite its long history of study, is considered one of the most challenging problems in Electronic Design Automation, due to complex design rules and enormous scale. In this work, we propose a versatile maze routing algorithm to deal with the various challenges in detailed routing. By introducing a hybrid grid graph, our maze routing algorithm can create both on-track and off-track wires during path search. By adopting a scalable track-based resource model and techniques like adaptive grid graph sparsification, it can handle both large guide-based and small region-based grid graphs. Moreover, we also propose a simple and effective rip-up and reroute strategy. As a result, we achieve design-rule-violation-free results on most designs (9/10) in the ISPD 2018 detailed routing benchmarks, with a 2.5% better score, 4.6% fewer vias, and 14.7% shorter runtime on average, and significantly lower non-preferred usage, compared with the state-of-the-art approaches.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH1552

Beyond Exact: Tight WCET Analysis of GPU Kernels with Branch Divergence

Shuai Zhao; Chen Jie; Yaowei Liang; Mingyue Cui; Xiaotian Dai; Nan Chen

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

The increasing adoption of GPUs in real-time systems necessitates precise timing analysis for GPU thread blocks to ensure overall system predictability. However, the uncertainties of branch executions within GPU warps impose significant barriers for predicting the worst-case execution time (WCET) of the thread block. Existing WCET analysis for thread blocks typically assumes deterministic warp execution paths, which is unrealistic given the dynamic control flows of threads within the warps. Moreover, as the warp scheduler operates as a black box, the analysis must rely on relaxed scheduling assumptions, resulting in overly-pessimistic bounds in order to cover edge scenarios. This paper first establishes the need for static analysis by showing how branch divergence can trigger timing anomalies and influence block WCETs. We then develop an exact WCET analysis for a GPU thread block under the same scheduling constraints as prior work while not imposing any warp execution path assumptions. Furthermore, by enforcing a practical constraint on warp executions, we present a tighter analysis that enhances system predictability. Experiments show that the proposed analysis under warp execution constraints significantly reduces the WCET estimations of GPU thread blocks (by 19.13% on average and up to 39.39%).

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH1560

LOHA: A Latency-Optimized CPU-Storage Hybrid Architecture for Billion-Scale Graph-Based Vector Similarity Search

Seongjoon Cho; Junyoung Park; Donghyun Kang; Moohyeon Nam; Hongchan Roh; Se-Hyun Yang; Moo-Kyoung Chung; Seungkyu Choi

Date:Monday, July 27 Location:Mtg Room 202C Session:Intelligent Fetch & Match Architectures +1
Abstract

Graph-based Approximate Nearest Neighbor Search (GANNS) has been widely adopted for vector retrieval owing to its high scalability and low latency. However, existing GANNS frameworks face severe latency bottlenecks on billion-scale datasets due to excessive SSD I/O. This work presents LOHA, a latency-optimized CPU–storage hybrid architecture for product quantization (PQ)-based billion-scale ANNS. LOHA decouples graph traversal and re-ranking workloads, executing the former on the CPU while offloading the latter to an in-storage accelerator that exploits SSD-level parallelism. Furthermore, a speculative re-ranking mechanism pipelines both stages to minimize idle time and reduce end-to-end latency. Experiments show that LOHA achieves throughput improvements of up to 9.6x, 6.2x, and 4.0x over CPU-, GPU-, and in-storage architectures, respectively.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1562

PG-GNN: Physics-Guided Graph Neural Network for Cell Timing Library Characterization

Jason Cheng; Jeffery Chen; Aaron Liang; Hung-Pin (Charles) Wen

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

Standard cell library characterization is a critical bottleneck for timing closure. Existing machine learning (ML) surrogates reduce SPICE costs but are constrained by labeled data requirements, limiting accuracy. This work introduces PG-GNN, a Physics-Guided Graph Neural Network (GNN) framework integrating residual learning with uncertainty-driven active learning. It simulates sparse anchor PVT corners, builds a physics-consistent reference, and trains a GNN on residual errors. Active learning then queries high-uncertainty points to minimize labeling. In experiments on TSMC 16 nm libraries, PG-GNN reduces SPICE effort by 98.7% with 1.67% MAPE. When applied to ISCAS'89 benchmarks, it achieves 0.74% critical path delay mismatch versus foundry libraries. PG-GNN offers orders-of-magnitude speedups in characterization runtime while maintaining signoff-level accuracy, presenting a scalable solution for next-generation IC design flows.

EDAEDA3. Timing Analysis and OptimizationAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1569

FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration

Xingzhen Chen; Jinming Zhuang; Zhuoping Yang; Shixin Ji; Sarah Schultz; Zheng Dong; Weisong Shi; Peipei Zhou

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

With the development of deep neural network (DNN) enabled applications, achieving high hardware resource efficiency on diverse workloads is non-trivial in heterogeneous computing platforms. Prior works discuss dedicated architectures to achieve maximal resource efficiency. However, a mismatch between hardware and workloads always exists across diverse workloads. Other works discuss overlay architectures that can dynamically switch dataflow for different workloads. However, these works are still limited by the granularity of their flexibility and incur considerable resource inefficiency. To solve this problem, we propose a flexible composing architecture, FILCO, that can efficiently match diverse workloads to achieve the optimal storage and computation resource efficiency. FILCO can be reconfigured in real time and flexibly composed into a unified accelerator or multiple independent accelerators. We also propose the FILCO framework, including an analytical model with a two-stage DSE that can achieve the optimal design point. We evaluate the FILCO framework on the 7nm AMD Versal VCK190 board. Compared with prior works, our design can achieve 1.3x to 5x throughput and hardware efficiency on diverse workloads.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH1570

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Mingbo Hao; Changwei Yan; Haoyu Cui; Zhihao Yan; Yizhi Ding; Zhangrui Qian; Weiwei Shan

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Memory-Busting Inference Engines +1
Abstract

The rapid growth of LLMs demands high-throughput, memory-capacity–intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPUs and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage-access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads FFN computation into the Flash array while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, ECC units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, operating directly on raw NAND reads with integrated ECC. Attention weights remain in DRAM, and a KV-cache–aware scheduler maintains throughput as context grows. Evaluated on OPT and LLaMA models up to 30B parameters, NVLLM achieves 16.7× speedup over A800 out-of-core inference and 1.1×–4.7× improvement over SSD-like designs, with only 2.7% CMOS area overhead.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsQuantumDES6. Quantum Computing +1 more
RESEARCH1582

SliceMoE: Bit-Sliced Expert Caching Under Miss-Rate Constraints for Efficient MoE Inference

Yuseon Choi; Sangjin Kim; Jungjun Oh; Gwangtae Park; Byeongcheol Kim; Hoi-Jun Yoo

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate–constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37× and 2.85×, respectively, and improves decode latency by up to 1.81× and 1.64×, while preserving near–high-bit accuracy. These results demonstrate that slice-level caching enables an efficient on-device MoE deployment.
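The idea of a truncation-based scheme in which a low-bit slice stays compatible with the high-bit code can be illustrated with a minimal sketch. This is not the paper's AMAT scheme: it assumes unsigned 8-bit weight codes, and the helper names `split_slices` and `reconstruct` are hypothetical. The point is only that the high slice alone is a usable low-precision weight, and adding the low slice restores full precision without storing a second copy.

```python
def split_slices(code8: int, high_bits: int = 4):
    """Split an unsigned 8-bit code into a high slice and a low (residual) slice."""
    low_bits = 8 - high_bits
    high = code8 >> low_bits               # truncation: keep the top bits
    low = code8 & ((1 << low_bits) - 1)    # remaining bottom bits
    return high, low


def reconstruct(high: int, low=None, high_bits: int = 4) -> int:
    """Rebuild the code from whichever slices are currently cached."""
    low_bits = 8 - high_bits
    if low is None:
        # Only the high slice is cached: low-precision approximation.
        return high << low_bits
    # Both slices are cached: exact 8-bit code, no duplicated storage.
    return (high << low_bits) | low


code = 0b10110110
high, low = split_slices(code)
approx = reconstruct(high)        # uses 4 bits
exact = reconstruct(high, low)    # uses all 8 bits
```

With truncation, the approximation error of the high slice alone is always smaller than `2 ** low_bits`.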

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH1585

DFA-DRIVE: A Cross-Layer Delay Fault Analysis and Optimization Framework for Robust Multi-Task Driving Perception

Nrusinga Charan Gantayat; Philip Jacobson; Matthew Marinella; Ben Feinberg; Jeff Zhang

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Scaling Intelligence for the Physical World +1
Abstract

Autonomous driving systems increasingly rely on deep neural network (DNN) based multi-task perception models for reliable, real-time scene understanding. At nanoscale technology nodes, these workloads are highly susceptible to timing errors arising from temperature fluctuations, voltage droop, and device aging. Among these, temperature poses a critical challenge: prolonged high thermal stress exacerbates delay faults, degrading perception accuracy and endangering safety-critical operation. We present DFA-DRIVE, a cross-layer Delay Fault Analysis Framework for Autonomous Driving that bridges circuit-level timing analysis with system-level resilience evaluation. DFA-DRIVE quantifies how temperature-induced timing failures propagate through object detection, drivable area segmentation, and lane line segmentation, exposing task-level reliability bottlenecks. Building on this foundation, we introduce DFA-OPT, an adaptive DNN hardware mapping algorithm that dynamically reassigns systolic-array resources based on DNN layer and application-level thermal sensitivity. Targeting the automotive reliability envelopes of AEC-Q100 Grade 0 (–40 °C to 150 °C) and Grade 1 (–40 °C to 125 °C), DFA-OPT restores the near-baseline accuracy of small, high-reliability systolic arrays (e.g., 4×4) even when large systolic arrays (e.g., 256×256) experience accuracy drops of up to 4% at 150 °C, achieving comparable accuracy with up to 92% fewer computation cycles.

SystemsSYS1. Autonomous Systems (Automotive, Robotics, Drones)AIQuantum
RESEARCH1587

CelestialEye: A Synergistic CPU-FPGA Engine for Accurate Frequency Measurement in High-Speed Data Streams

Zhongxian Liang; Ying Li; Zimo Dong; Han Wang; Wenjun Li; Qun Huang; Zhuo Li; Weizhe Zhang

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

Accurate frequency measurement in high-speed data streams is crucial. While FPGA-based solutions are promising, existing designs often trade off throughput against accuracy due to limited hardware resources. This paper introduces CelestialEye, a CPU-FPGA synergistic architecture that employs a CPU-based heuristic decoding algorithm for high accuracy and offloads high-throughput tasks to the FPGA. Its pipelined hardware integrates hot/cold item separation, advanced compression, and encoding to optimize on-chip SRAM efficiency and bandwidth utilization. Furthermore, the frequency bounds generated by the FPGA process can accelerate the CPU's decoding. Experimental results demonstrate that CelestialEye significantly outperforms the state-of-the-art BitMatcher in both performance and accuracy.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH1596

PIMGRAG: A Heterogeneous PIM Architecture for Graph-Based Retrieval-Augmented Generation

Zhaoyu Zhong; Yunhao Dong; Yaodong Zhang; Jiaxian Chen; Tianyu Wang; Chenlin Ma; Rui Mao; Kecheng Huang; Yi Wang

Date:Monday, July 27 Location:Mtg Room 202C Session:Intelligent Fetch & Match Architectures +1
Abstract

Graph-based retrieval-augmented generation (RAG) improves the interpretability and factual consistency of large language models (LLMs) through structured knowledge graphs. Despite these benefits, graph-based RAG suffers from inefficient retrieval. The retrieval stage causes massive movement of vector and graph data between memory and processors. This movement leads to low arithmetic intensity and heavy pressure on memory bandwidth. This work presents PIMGRAG, a heterogeneous architecture that accelerates graph-based RAG through hardware/software co-design. At the hardware level, PIMGRAG designs a PIM architecture for the retrieval stage, which reduces off-chip data transfer by executing bandwidth-intensive operations near memory. At the software level, PIMGRAG applies a lightweight scheduling method that orchestrates PIM and GPU execution and lowers idle time across stages. Evaluation results show improvements in throughput, latency, and energy efficiency over CPU–GPU and existing PIM-based baselines.

DesignDES2B-I. In-memory and Near-memory Computing Architectures, Applications and SystemsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1605

Foundry-SRAM-Compatible Compute-Near-Memory Macro for Microscaled Block-FP in 12-nm CMOS

Samir Rahman; Arun George; Shehab Naga; Thomas Summe; Yipin Guo; Siddharth Joshi

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

Trillion-parameter Transformers across vision, language, and action increasingly rely on reduced-precision floating-point (e.g., FP8) for dynamic range and efficiency. While their compute-intensive operations might make in-memory solutions attractive, the incompatibility of FP operations with analog computing renders many optimizations untenable for emerging compute-in/near-memory (CIM/CNM) platforms. To address this incompatibility, we present MANTIS, a foundry-SRAM-compatible, mixed-signal, CNM/CIM macro that performs vector integer multiply-accumulate operations in the analog charge domain, while applying per-vector FP8 scaling factors digitally via microscaling (MXFP). The proposed design, in a commercially available 12 nm node, reuses the digital-to-analog converter (DAC) in a successive approximation register (SAR) analog-to-digital converter (ADC) as both a bit-plane charge-domain accumulator and as part of the ADC. The design is implemented as a 4kb macro, operating at 0.8V. Our 8b ADC delivers 7.62 effective number of bits (ENOB), resulting in a precision-scalable efficiency of 22.5 TOPS/W (for MXINT3×MXINT3) to 6.43 TOPS/W (for MXINT8×MXINT3). In end-to-end evaluations on MMLU (5-shot) and HellaSwag (0-shot), our design only incurs a 0.15 percentage point accuracy degradation across multiple open-weight LLMs (vs. the MXINT8 activation × MXINT3 weight, quantized baseline). For reduced-precision activations (MXINT6 / MXINT3 + FP8 scalar), our results track within ±0.3% of the quantized models, with minimal impact from analog computation.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH1608

Compiler Framework for Directional Transport in Zoned Neutral-Atom Systems with AOD Assistance: A Hybrid Remote CZ Approach

Lingyi Kong; Chen Huang; Zhemin Zhang; Yidong Zhou; Xiangyu Ren; Shaochen Li; Zhiding Liang

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Charting the Architecture of Tomorrow’s Fault Tolerant Quantum Machines by Harmonizing Codes to Systems +1
Abstract

We present a directional-transport (DT)-based remote CZ gate and compiler for zoned neutral-atom arrays that overcomes movement-bound entanglement limitations. Current AOD-based shuttling faces row/column non-crossing constraints, device-speed limits, and FOV/NA-restricted range—bottlenecks for long-distance connectivity. Our approach reserves AODs for channel setup and micro-tuning while making DT the default for remote entanglement. Under antiblockade, a detuning-modulated pi-pulse sequence drives directional transport of a Rydberg excitation along a dynamic and resettable ancilla corridor, realizing a CZ gate between stationary, non-adjacent qubits. This cuts entangling-stage duration by approximately 50%-90% versus AOD-only baselines and enables long-distance connectivity beyond objective-limited shuttling.

DesignQuantumDES6. Quantum ComputingDES3. Emerging Models of Computation +1 more
RESEARCH1610

Algorithm and Hardware Co-Design for Efficient Complex-Valued Uncertainty Estimation

Zehuan Zhang; Hao (Mark) Chen; He Li; Wayne Luk

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Where Biology Meets Computation: Next‑Gen Bio‑Cyber‑Physical Innovations +1
Abstract

Complex-Valued Neural Networks (CVNNs) have significant advantages in handling tasks that involve complex numbers. However, existing CVNNs are unable to quantify predictive uncertainty. We propose, for the first time, dropout-based Bayesian Complex-Valued Neural Networks (BayesCVNNs) to enable uncertainty quantification for complex-valued applications, exhibiting broad applicability and efficiency for hardware implementation due to modularity. Furthermore, as the dual-part nature of complex values significantly broadens the design space and enables novel configurations based on layer-mixing and part-mixing, we introduce an automated search approach to effectively identify optimal configurations for both real and imaginary components. To facilitate deployment, we present a framework that generates customized FPGA-based accelerators for BayesCVNNs, leveraging a set of optimized building blocks. Experiments demonstrate the best configuration can be effectively found via the automated search, attaining higher performance with lower hardware costs compared with manually crafted models. The optimized accelerators achieve approximately 4.5× and 13× speedups on different models with less than 10% power consumption compared to GPU implementations, and outperform existing work in both algorithm and hardware aspects. The code will be open-source after acceptance.

DesignDES3. Emerging Models of ComputationQuantumEDA
RESEARCH1612

EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

Jiahe Shi; Zhengqi Gao; Ching-Yun Ko; Duane Boning

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

Recent advances in large language models (LLMs) have demonstrated significant potential in hardware design automation, particularly in using natural language to synthesize Register-Transfer Level (RTL) code. Despite this progress, a gap remains between model capability and the demands of real-world RTL design, including syntax errors, functional hallucinations, and weak alignment to designer intent. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising approach to bridge this gap, as hardware provides executable and formally checkable signals that can be used to further align model outputs with design intent. However, in long, structured RTL code sequences, not all tokens contribute equally to functional correctness, and naïvely spreading gradients across all tokens dilutes learning signals. A key insight from our entropy analysis in RTL generation is that only a small fraction of tokens (e.g., always, if, assign, posedge) exhibit high uncertainty and largely influence control flow and module structure. To address these challenges, we present EARL, an Entropy-Aware Reinforcement Learning framework for Verilog generation. EARL performs policy optimization using verifiable reward signals and introduces entropy-guided selective updates that gate policy gradients to high-entropy tokens. This approach preserves training stability and concentrates gradient updates on functionally important regions of code. Our experiments on VerilogEval and RTLLM show that EARL improves functional pass rates over prior LLM baselines by up to 14.7%, while reducing unnecessary updates and improving training stability. These results indicate that focusing RL on critical, high-uncertainty tokens enables more reliable and targeted policy improvement for structured RTL code generation. We will release the code upon acceptance. An anonymized repository for review is available at https://anonymous.4open.science/r/EARL-1C25.
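The entropy-guided gating described above can be sketched in a few lines. This is a minimal illustration, not EARL's implementation: the function names, the `keep_fraction` parameter, and the plain policy-gradient loss are all assumptions made for the example. It computes per-token entropy from the model's next-token distributions, keeps only the highest-entropy tokens, and averages the loss over those tokens alone.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gate(token_dists, keep_fraction=0.2):
    """Return a 0/1 mask selecting the highest-entropy token positions."""
    entropies = [token_entropy(d) for d in token_dists]
    k = max(1, math.ceil(keep_fraction * len(entropies)))
    threshold = sorted(entropies, reverse=True)[k - 1]
    return [1.0 if h >= threshold else 0.0 for h in entropies]


def gated_policy_loss(log_probs, advantages, mask):
    """Policy-gradient loss averaged over the gated tokens only."""
    total = sum(m * (-lp * a) for lp, a, m in zip(log_probs, advantages, mask))
    return total / max(1.0, sum(mask))


# Four positions: two near-deterministic, two uncertain.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.10, 0.00, 0.00],
    [0.40, 0.30, 0.20, 0.10],
]
mask = entropy_gate(dists, keep_fraction=0.5)
loss = gated_policy_loss([-0.1, -1.0, -0.2, -2.0], [1.0, 1.0, 1.0, 1.0], mask)
```

Low-entropy positions contribute nothing to the loss, so their gradients are zero and the update concentrates on the uncertain tokens.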

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1613

STRAP-ViT: Segregated Tokens with Randomized-Transformations for Defense Against Adversarial Patches in ViTs

Nandish Chattopadhyay; Anadi Goyal; Chandan Karfa; Anupam Chattopadhyay

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Robust and Trusted AI Systems: Attacks, Defenses, and Hardware Reliability +1
Abstract

Adversarial patches are physically realizable localized noise that can hijack Vision Transformer (ViT) self-attention, pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that tokens corresponding to image areas that contain adversarial noise have different statistical properties from tokens that do not overlap with the adversarial perturbations. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric to segregate anomalous tokens in the Detection Phase, and then applies randomized composite transformations to them during the Mitigation Phase to make the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter of the defense mechanism and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable plug-and-play block within ViT architectures, for inference purposes only, with minimal computational cost and no additional training cost or effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101), across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA and RP2), and is found to provide robust accuracies within 2-3% of the clean baselines and to outperform the state of the art.
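A minimal sketch of using Jensen-Shannon divergence to rank tokens by how anomalous they look follows. The abstract does not say how tokens are turned into distributions, so the sketch assumes each token has already been summarized as a normalized histogram and compared against a reference histogram; `flag_anomalous_tokens` and `top_k` are hypothetical names.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; q must be positive where p is."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded in [0, 1])."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def flag_anomalous_tokens(token_hists, reference_hist, top_k):
    """Return indices of the top_k tokens that diverge most from the reference."""
    scores = [js_divergence(h, reference_hist) for h in token_hists]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


reference = [0.25, 0.25, 0.25, 0.25]
tokens = [
    [0.25, 0.25, 0.25, 0.25],   # matches the reference
    [0.26, 0.24, 0.25, 0.25],   # small deviation
    [0.97, 0.01, 0.01, 0.01],   # sharply peaked, patch-like outlier
]
flagged = flag_anomalous_tokens(tokens, reference, top_k=1)
```

In this sketch `top_k` plays the role of the "minimum number of tokens to transform" hyper-parameter mentioned in the abstract.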

SecuritySEC1. AI/ML Security/PrivacyAIQuantum
RESEARCH1619

HDDB: Efficient In-Storage SQL Database Search Using Hyperdimensional Computing on Ferroelectric NAND Flash

Quanling Zhao; Yanru Chen; Runyang Tian; Sumukh Pinge; Weihong Xu; Augusto Vega; Steven Holmes; Saransh Gupta; Tajana Rosing

Date:Monday, July 27 Location:Mtg Room 203AB Session:Advancing the Frontier of Neuromorphic Learning Systems +1
Abstract

Hyperdimensional Computing (HDC) encodes information and data into high-dimensional distributed vectors that can be manipulated using simple bitwise operations and similarity searches, offering parallelism, low-precision hardware friendliness, and strong robustness to noise. These properties are a natural fit for SQL database workloads dominated by predicate evaluation and scans, which demand low energy and low latency over large fact tables. Notably, HDC's noise-tolerance maps well onto emerging ferroelectric NAND (FeNAND) memories, which provide ultra-high density and in-storage compute capability but suffer from elevated raw bit-error rates. In this work, we propose HDDB, a hardware–software co-design that combines HDC with FeNAND multi-level cells (MLC) to perform in-storage SQL predicate evaluation and analytics with massive parallelism and minimal data movement. Particularly, we introduce novel HDC encoding techniques for standard SQL data tables and formulate predicate-based filtering and aggregation as highly efficient HDC operations that can happen in-storage. By exploiting the intrinsic redundancy of HDC, HDDB maintains correct predicate and decode outcomes under substantial device noise (up to 10% randomly corrupted TLC cells) without explicit error-correction overheads. Experiments on TPC-DS fact tables show that HDDB achieves up to 80.6× lower latency and 12,636× lower energy consumption compared to conventional CPU/GPU SQL database engines, suggesting that HDDB provides a practical substrate for noise-robust, memory-centric database processing.
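The noise tolerance that HDC relies on can be shown with a small self-contained sketch. It is not HDDB's encoding: it assumes binary hypervectors, XOR binding of a column role to a value, and independent bit flips as a simplified stand-in for corrupted cells. All names and the dimension `DIM` are hypothetical.

```python
import random

DIM = 2048  # hypervector dimensionality (assumed for illustration)


def random_hv(rng):
    """Draw a random binary hypervector."""
    return [rng.randint(0, 1) for _ in range(DIM)]


def bind(a, b):
    """Bind two hypervectors with element-wise XOR."""
    return [x ^ y for x, y in zip(a, b)]


def similarity(a, b):
    """Fraction of matching bits (1.0 = identical, about 0.5 = unrelated)."""
    return sum(1 for x, y in zip(a, b) if x == y) / DIM


def corrupt(hv, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in hv]


rng = random.Random(0)
column = random_hv(rng)                                   # role vector for one column
values = {name: random_hv(rng) for name in ("red", "green", "blue")}

stored = bind(column, values["green"])                    # encoded cell: column = 'green'
noisy = corrupt(stored, 0.10, rng)                        # 10% of bits flipped

# Evaluate the predicate "column == v" for each candidate value v.
scores = {name: similarity(noisy, bind(column, hv)) for name, hv in values.items()}
best = max(scores, key=scores.get)
```

The matching value keeps a similarity near 0.9 while unrelated values stay near 0.5, so the predicate result survives the corruption without any error-correcting code.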

DesignDES3. Emerging Models of ComputationChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1622

N3Litho: A GPU-Native Numerical-Neural Framework for 3D Mask Modeling, 3D Propagation, and 3D Aerial Image Simulation

Yuyang Chen; Zhen Wang; Yinuo Zhu; Yiwen Wu; Wenxuan Dong; Jiaqi Liu; Jingyi Yu; Hao Geng

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Differentiable Dreams: GPU-Powered Litho Gets Its Gradient Groove Back +1
Abstract

Accurate and efficient lithography simulation is critical for virtual fabrication and shift-left manufacturing. While thin-mask models and single-plane aerial images suffice at older nodes, advanced nodes require 3D in-resist aerial images to capture depth-dependent effects from mask thickness, sidewalls, and multilayer stacks. Rigorous EM solvers provide this fidelity but are too slow for ILT or full-chip use, and thin-mask learning methods fail to model the underlying vector field needed for 3D propagation. This paper introduces a new paradigm that reconstructs the full 3D in-resist vector field from the image-plane vector field and 3D mask/film-stack parameters. Based on this insight, we develop N3Litho, a GPU-native numerical-neural framework that couples a GPU-accelerated vectorial Abbe solver with an NGP-based multi-resolution hash backbone for fast 3D field reconstruction. The method achieves near-EM fidelity with orders-of-magnitude speedup, and the resulting 3D aerial images can be directly used by resist models, OPC/ILT pipelines, and hotspot detection flows.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH1628

Attackonctf: Defending Hardware Security Competition Benchmarks in the Age of LLMs

Mohamadreza Rostami; Nikhilesh Singh; Stephen Muttathil; Lichao Wu; Chen Chen; Huimin Li; Jeyavijayan Rajendran; Ahmad-Reza Sadeghi

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

Hardware security competitions such as HackTheSilicon serve as benchmarking platforms for evaluating vulnerability detection methods and for training both humans and AI. However, our study reveals that LLMs threaten their validity. Instead of performing genuine security reasoning, LLM-based detectors exploit a diff-style syntactic comparison, achieving an 83% detection rate and undermining fair evaluation. To mitigate this, we propose the first LLM-oriented, semantics-preserving obfuscation framework for these benchmarks. Unlike IP-protection approaches, it applies human-readable transformations and controlled diff-noise while preserving functionality. On HackTheSilicon, the framework reduces LLM-based detection accuracy by 50% with only 10% obfuscation and by 78.6% under complete obfuscation, restoring benchmark reliability.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH1629

AQR-HNSW: Accelerating Approximate Nearest Neighbor Search via Density-Aware Quantization and Multi-Stage Re-Ranking

Ganap Tewary; Nrusinga Charan Gantayat; Jeff Zhang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Fast, Furious, and Fault-Tolerant: Accelerating the Generative Grind +1
Abstract

Approximate Nearest Neighbor (ANN) search has become fundamental to modern AI infrastructure, powering recommendation systems, search engines, and large language models across industry leaders from Google to OpenAI. Hierarchical Navigable Small World (HNSW) graphs have emerged as the dominant ANN algorithm, widely adopted in production systems due to their superior recall vs. latency balance. However, as vector databases scale to billions of embeddings, HNSW faces critical bottlenecks: memory consumption expands, distance computation overhead dominates query latency, and performance is suboptimal on heterogeneous data distributions. This paper presents Adaptive Quantization and Rerank HNSW (AQR-HNSW), a novel framework that synergistically integrates three strategies to enhance HNSW's scalability. AQR-HNSW introduces (1) density-aware adaptive quantization, achieving 4× compression while preserving distance relationships; (2) multi-stage re-ranking that reduces unnecessary computations by 35%; and (3) quantization-optimized SIMD implementations delivering 16-64 operations per cycle across architectures. Evaluation on standard benchmarks demonstrates 2.5-3.3× higher QPS than state-of-the-art HNSW implementations while maintaining 98%+ recall, with 75% memory reduction for the index graph and 5× faster index construction.
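The general quantize-then-re-rank pattern can be sketched as follows. This is not AQR-HNSW: uniform 8-bit scalar quantization stands in for the density-aware scheme, a brute-force scan stands in for HNSW graph traversal, and the names `quantize`, `search`, and `shortlist` are hypothetical. Mapping 32-bit floats to 8-bit codes is where a 4× compression factor comes from.

```python
def quantize(vec, lo, hi):
    """Map each float in [lo, hi] to an 8-bit code (0..255)."""
    scale = (hi - lo) / 255.0
    return [min(255, max(0, round((x - lo) / scale))) for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query, vectors, lo, hi, shortlist=3, k=1):
    """Stage 1: rank by cheap distances on codes. Stage 2: re-rank exactly."""
    q_codes = quantize(query, lo, hi)
    codes = [quantize(v, lo, hi) for v in vectors]
    coarse = sorted(range(len(vectors)),
                    key=lambda i: sq_dist(q_codes, codes[i]))[:shortlist]
    # Only the shortlist pays for full-precision distance computation.
    return sorted(coarse, key=lambda i: sq_dist(query, vectors[i]))[:k]


vectors = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.9, 0.1]]
nearest = search([0.52, 0.49], vectors, lo=0.0, hi=1.0)
```

Re-ranking restricts exact distance evaluation to a short candidate list, which is the mechanism by which such schemes trade a small recall risk for fewer full-precision computations.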

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH1632

ECPrefix: Erasure-Coded Prefix KV Cache for Optimizing Load Balancing in Large Language Model Serving

Feifan Liu; Yuchong Hu; Mingqi Li; Lin Wang; Chenxuan Yao; Zhenghao Yu; Dan Feng

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

Prefix KV cache is widely used to accelerate LLM serving by trading more storage for less computation, and state-of-the-art methods often replicate hotspot caches for load balancing. However, we observe that the few nodes that have cache replicas still lead to severe load imbalance. This paper presents ECPrefix: a new prefix KV cache framework based on erasure coding (instead of replication), which distributes encoded blocks of hot prefix caches (organized as profile-guided objects) across nodes, along with adaptive striping and pipelined reading optimizations. Evaluation shows that ECPrefix reduces TTFT by up to 52.3% over existing systems.
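To make the contrast between replication and erasure coding concrete, here is a minimal sketch using the simplest erasure code, a single XOR parity block over k data blocks. The abstract does not name the code ECPrefix uses, so this choice, the function names, and the sample payload are assumptions. Each block can live on a different node, and any one missing block can be rebuilt from the others.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Element-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int):
    """Split data into k equal blocks (zero-padded) and append one parity block."""
    block_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(block_len * k, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(k)]
    parity = blocks[0]
    for block in blocks[1:]:
        parity = xor_bytes(parity, block)
    return blocks + [parity]


def recover(blocks, missing: int) -> bytes:
    """Rebuild the block at index `missing` by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(blocks) if i != missing]
    rebuilt = survivors[0]
    for block in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, block)
    return rebuilt


stripe = encode(b"prefix-kv-cache!", k=4)   # 4 data blocks + 1 parity block
rebuilt = recover(stripe, missing=2)        # block 2 is unavailable
```

Here the storage overhead is 1/k instead of a full extra copy, and reads for a hot prefix are spread across k + 1 nodes rather than concentrated on the few nodes holding replicas.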

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH1638

Hierarchical Boundary Recovery: Overcoming Synthesis Obscurity via SAT Sweeping

Kuo-Wei Ho; Jie-Hong Roland Jiang; Alan Mishchenko; Sean Weaver; Yu-Wei Fan

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

In the integrated circuit design flow, a high-level specification is synthesized into a gate-level netlist. However, this synthesis often results in the loss of the association between the high-level and gate-level designs. Maintaining signal correspondences between these two levels is crucial for preserving the design hierarchy, which in turn facilitates design debugging, engineering change orders (ECOs), cross-abstraction-level model training, and other synthesis and verification tasks. In this work, we present an automated approach for recovering the hierarchical boundaries of high-level components in gate-level netlists, even in cases where the synthesis process has completely removed the original boundary signals. The method is implemented as an open-source tool, and the experimental results demonstrate its robustness in recovering hierarchical boundaries. Our results may benefit various potential applications in synthesis and verification.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1640

UniSpike: Accelerating Spiking Neural Networks on Neuromorphic Systems via Eliminating Address Redundancy

Qinghui Xing; Zhuo Chen; Xin Du; Ouwen Jin; Ming Zhang; Pan Lv; Ying Li; Shuiguang Deng; Gang Pan

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

Many-core neuromorphic systems serve as specialized accelerators for Spiking Neural Networks (SNNs). However, the communication mechanisms in existing neuromorphic systems incur significant traffic and energy overhead due to redundant address transmissions, a bottleneck exacerbated by the short payloads inherent to the spike-based communication nature of SNNs. Specifically, our characterization reveals that in current neuromorphic systems, duplicate address transmissions can account for up to 49% of the total traffic in representative workloads. This paper presents UniSpike, a hardware-software co-design that eliminates address redundancy by aggregating spikes destined for the same core into single, compact packets. UniSpike implements a spike transmission scheduling strategy to enable efficient spike merging, supported by a dedicated hardware architecture for runtime packet assembly and dispatch, as well as a destination-aware SNN partitioning algorithm that maximizes address sharing opportunities. Experimental results demonstrate that on average, UniSpike reduces traffic volume by 1.93x, achieving 1.77x speedup and 1.50x energy efficiency improvement over state-of-the-art designs.
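The address-redundancy argument can be made concrete with a small sketch. It is only an illustration of the aggregation idea, not UniSpike's packet format or scheduler: spikes are modeled as hypothetical `(destination_core, neuron_id)` pairs, and each packet is assumed to carry exactly one destination address.

```python
from collections import defaultdict


def per_spike_packets(spikes):
    """Baseline: every spike travels in its own packet with its own address."""
    return [(core, [neuron]) for core, neuron in spikes]


def merged_packets(spikes):
    """Aggregate spikes bound for the same core into one packet per core."""
    by_core = defaultdict(list)
    for core, neuron in spikes:
        by_core[core].append(neuron)
    return list(by_core.items())


def address_count(packets):
    """Number of destination addresses transmitted (one per packet)."""
    return len(packets)


spikes = [(0, 5), (1, 2), (0, 7), (0, 9), (1, 4)]
baseline = per_spike_packets(spikes)
merged = merged_packets(spikes)
```

In this example the baseline sends five addresses for five spikes, while merging sends two addresses for the same payload. The saving grows with the number of spikes that share a destination, which is why a destination-aware partitioning step matters.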

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH1646

Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors

Wei Xing; Kaiqi Huang; Jiazhan Liu; Hong Qiu; Shan Shen

Date:Tuesday, July 28 Location:Mtg Room 101B Session:AI Graces the Cell Party: Transformers, LLMs and Zero Drama Extraction +1
Abstract

Yield Multi-Corner Analysis validates circuits across 125+ Process-Voltage-Temperature corners, creating a combinatorial simulation cost of O(K × N), where K denotes the number of corners and N exceeds 10^5 samples per corner. Existing methods face a fundamental trade-off: simple models achieve automation but fail on nonlinear circuits, while advanced AI models capture complex behaviors but require hours of hyperparameter tuning per design iteration, forming the Tuning Barrier. We break this barrier by replacing engineered priors (i.e., model specifications) with learned priors from a foundation model pre-trained on millions of regression tasks. This model performs in-context learning, instantly adapting to each circuit without tuning or retraining. Its attention mechanism automatically transfers knowledge across corners by identifying shared circuit physics between operating conditions. Combined with an automated feature selector (1152D to 48D), our method matches state-of-the-art accuracy (mean MREs as low as 0.11%) with zero tuning, reducing total validation cost by over 10×.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH1664

Probabilistic Memory Design for Efficient Trustworthy Edge Intelligence

Likai Pei; Jiahao Zheng; Xueji Zhao; Emilie Ye; Jianbo Liu; Hanqing Tao; Ming-Yen Lee; Ruiyang Qin; Yiyu Shi; Shimeng Yu; X. Sharon Hu; Ningyuan Cao

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Probabilistic computation plays an important role in trustworthy edge intelligence to quantify uncertainty, enhance robustness, reconstruct data and protect privacy, but its adoption is limited by an orders-of-magnitude data throughput gap between Gaussian random number generation (GRNG) and computation, as well as instruction overhead. This paper introduces probabilistic memory (p-MEM), a unified memory primitive that stores distribution parameters and samples directly at native memory bandwidth, where deterministic data becomes the zero-variance special case. Using a layout-validated p-MEM simulator, we comprehensively explore device choices, memory specifications, and technology nodes, showing that p-MEM can achieve >1000 GSa/s/mm² GRNG throughput (including memory arrays). Integrated into CPU / GPU systems, p-MEM reduces instruction count by up to 2.19×/4.37×, sampling latency by 562×/3.45×, and energy by 295.5×/3.53× for BNN workloads, providing a scalable hardware substrate for trustworthy probabilistic AI.

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1665

CARAMEL: Boosting Device Utilization in Control Flow Auditing

Alexandra Lengert; Adam Caulfield; Ivan De Oliveira Nunes

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

Microcontroller Units (MCUs) are widely used in safety-critical systems, making them frequent attack targets. This demands lightweight defenses that remain reliable even after software compromise. Control Flow Auditing (CF-Aud) strengthens Control Flow Attestation by ensuring authenticated control-flow logs (CFLogs) are reliably delivered from a compromised prover (Prv) to a remote verifier (Vrf), enabling assessment of system behavior and support for remediation. However, existing CF-Aud designs rely on a costly busy-wait phase that limits Prv's utilization. In this work, we propose CARAMEL: a hybrid hardware-software root-of-trust architecture that reduces this bottleneck by enabling log transmission without halting execution. Its minimal communication interface and implementation are open-source.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1691

Towards Always-on Interaction: A Sub-mW Visual Wakeup Acceleration Subsystem for Smart Glasses

Arpan Suravi Prasad; Sergio Mazzola; Cyrill Durrer; Davide Rossi; Francesco Conti; Luca Benini

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

Always-on vision is essential for smart glasses, yet continuous visual processing under stringent power budgets remains a major challenge. This work presents an always-on visual wakeup acceleration engine for sub-milliwatt hand gesture recognition. A compact convolutional neural network (<13k parameters), trained on the HaGRID dataset, performs binary classification on 64×64 inputs, achieving over 92% accuracy at 3b precision while requiring less than 9kB of memory. When executed on a flexible accelerator engine implemented in 7nm technology, at 100MHz, it consumes only 64nJ per frame, translating to always-on power of 11μW at 30FPS, enabling energy-efficient, always-on interaction for next-generation smart glasses.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH1692

CORE: Compact, Optimized, and Resource-Efficient Compute Engine for FP-INT GEMM Acceleration with Binary Coding Quantization

Chaewon Park; Inseong Hwang; Beom Jin Kang; Hyun Kim

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

Recent advances in large language models (LLMs) have driven exponential growth in parameter counts, amplifying memory footprints and creating acute computational bottlenecks. Weight-only quantization is a common mitigation because weights dominate storage and quantizing them incurs less accuracy degradation than quantizing activations. However, it leaves activations in floating point (FP), forcing FP execution whose area and energy costs far exceed those of integer (INT) compute. Binary coding quantization (BCQ) lowers this cost by replacing many FP multiplies with FP additions, but it still requires FP multiplications for scaling factors and suffers from limited representational capacity, causing non-trivial accuracy loss. In this paper, we propose a sum-of-power-of-two scaling factor–based BCQ (SS-BCQ) that approximates FP scaling factors with a small set of power-of-two terms, eliminating FP multiplications (shift–add only) while preserving accuracy. To efficiently realize SS-BCQ in hardware, we introduce CORE, a fully pipelined FP–INT general matrix-matrix multiplication (GEMM) engine. CORE features (i) an adaptive data-mapping module that computes optimized block-wise activation scales to minimize scaling operations, and (ii) an extra processing unit that allocates a small subset of processing elements to high-importance weights to retain accuracy at low cost. Evaluated on the OPT-6.7B model, SS-BCQ reduces perplexity by 3.14 compared with prior BCQ methods. Implemented in CORE, our approach achieves up to 1.41x higher area efficiency (TOPS/mm^2) and 2.18x higher energy efficiency (TOPS/W) than state-of-the-art FP–INT accelerators, enabling a homogeneous low-precision execution path for on-device LLMs.
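The idea of replacing an FP scaling factor with a few power-of-two terms can be illustrated with a minimal Python sketch. It assumes a greedy selection of signed terms; the abstract does not say how SS-BCQ chooses its terms, and the names `spot_terms` and `scale_by_shifts` are hypothetical.

```python
import math

def spot_terms(alpha, num_terms=3):
    """Greedily pick signed power-of-two terms whose sum approximates alpha.

    Assumption: greedy residual fitting; the paper's selection rule may differ.
    """
    terms = []          # list of (sign, exponent) pairs
    residual = alpha
    for _ in range(num_terms):
        if residual == 0.0:
            break
        sign = 1 if residual > 0 else -1
        # Nearest power of two to the residual, measured in the log domain.
        exponent = round(math.log2(abs(residual)))
        terms.append((sign, exponent))
        residual -= sign * 2.0 ** exponent
    return terms

def scale_by_shifts(x, terms):
    """Scale x using only exponent shifts (ldexp) and additions, no multiplies by alpha."""
    return sum(sign * math.ldexp(x, exponent) for sign, exponent in terms)

terms = spot_terms(0.15625)            # 0.15625 = 2**-3 + 2**-5
scaled = scale_by_shifts(4.0, terms)   # equals 4.0 * 0.15625
```

With `num_terms` small, the approximation error is bounded by the last residual, which is the accuracy-versus-hardware-cost knob in a scheme of this kind.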

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH1709

Compilation Pipeline for Predicting Algorithmic Break-Even in an Early-Fault-Tolerant Surface Code Architecture

Tianyi Hao; Joseph Sullivan; Sivaprasad Omanakuttan; Michael Perlin; Ruslan Shaydulin

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Charting the Architecture of Tomorrow’s Fault Tolerant Quantum Machines by Harmonizing Codes to Systems +1
Abstract

Recent experimental progress in surface-code hardware, including demonstrations of break-even logical memory on devices with up to hundreds of physical qubits, has materially advanced the prospects for fault-tolerant quantum computation. This progress creates urgency for compilation workflows that directly target the forthcoming generation of devices with thousands of physical qubits, for which algorithm execution becomes practical. We develop a pipeline for compiling logical algorithms to physical circuits implementing lattice surgery on the surface code, and use this pipeline to identify the requirements for achieving algorithmic break-even (where quantum error correction improves the performance of a quantum algorithm) for the quantum approximate optimization algorithm (QAOA). Our pipeline integrates several open-source software tools, and leverages recent advances in error-aware unitary gate synthesis, high-fidelity magic-state production, and the calculation of correlation surfaces in the surface code. We apply our pipeline by performing classical simulations of physical Clifford proxy circuits produced by our pipeline, and find that 5-qubit QAOA can reach algorithmic break-even with 2517 physical qubits (surface code distance 𝑑 = 11) at physical error rates of 𝑝 = 10^-3, or 1737 physical qubits (𝑑 = 9) at 𝑝 = 5 × 10^-4. Our work thereby identifies conditions for achieving algorithmic break-even with near-term quantum hardware and paves the way towards an end-to-end compiler for early-fault-tolerant surface code architectures.

DesignQuantumDES6. Quantum ComputingDES3. Emerging Models of Computation +1 more
RESEARCH1712

RAM-CGRA: Reachability-Aware Mapping for CGRAs

Anh Nguyen; Linyi Li; Jason H. Anderson

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Interconnect-Driven System Architecture Innovation +1
Abstract

Coarse-grained reconfigurable arrays (CGRAs) are programmable hardware devices having large coarse-grained processing elements (PEs) and word-wide configurable interconnect. The interconnect comprises a considerable fraction of total CGRA area, making it desirable to restrict interconnect connectivity richness. However, reducing interconnect richness makes application mapping more difficult, and impossible in some cases. We propose techniques for mapping applications onto CGRAs with restricted routing architectures. For a target CGRA having restricted interconnect, we precompute the reachability and distance between all PE pairs, as well as an estimate of the number of available routing paths between PE pairs. The values are stored in tables, and used during the placement stage of the mapping flow to assess the potential routability of an intermediate placement. For benchmark applications mapped onto CGRA variants with restricted interconnect, results demonstrate higher mapping success, and lower mapping runtimes vs. recent state-of-the-art CGRA mappers.
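The precomputed reachability and distance tables can be sketched as follows. This is a minimal illustration that assumes the interconnect is modeled as a directed graph of PE-to-PE links and uses breadth-first search; it omits the path-count estimate, and the function names are hypothetical rather than taken from the paper.

```python
from collections import deque

def reachability_tables(num_pes, links):
    """BFS from every PE over directed links.

    dist[s][t] is the hop count from PE s to PE t, or None if t is unreachable.
    """
    adj = [[] for _ in range(num_pes)]
    for src, dst in links:
        adj[src].append(dst)
    dist = []
    for start in range(num_pes):
        row = [None] * num_pes
        row[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if row[v] is None:
                    row[v] = row[u] + 1
                    queue.append(v)
        dist.append(row)
    return dist

def placement_is_plausible(dist, placement, edges):
    """Reject a placement if any dataflow edge maps to an unreachable PE pair."""
    return all(dist[placement[a]][placement[b]] is not None for a, b in edges)

# Four PEs connected in one direction only: 0 -> 1 -> 2 -> 3.
dist = reachability_tables(4, [(0, 1), (1, 2), (2, 3)])
ok = placement_is_plausible(dist, {"a": 0, "b": 2}, [("a", "b")])
bad = placement_is_plausible(dist, {"a": 3, "b": 0}, [("a", "b")])
```

Because the tables depend only on the architecture, they are built once per CGRA variant and then consulted in constant time for each candidate placement.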

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageQuantum +1 more
RESEARCH1715

Semi-Virtual Addressing to Enhance Memory Safety on Microcontrollers

Unjang Yeo; Kyeongwoo Park; Yeongpil Cho

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

Microcontrollers lack support for virtual addressing, limiting their ability to employ efficient memory protection mechanisms that rely on a spacious virtual address space. While integrating an MMU could overcome this limitation, it is often impractical due to increased hardware cost and unpredictable latency. To address this, we propose semi-virtual addressing, a lightweight technique that extends the effective address space of microcontrollers without full virtual memory support. This technique introduces a Lightweight Address Translation Module (LATM) that performs deterministic, computation-based address translation without page tables, enabling semi-virtual addressing with minimal translation overhead. Implemented on the RISC-V CORE-V-MCU, LATM expands its 575 KB physical address space to 2 GB of semi-virtual address space. We demonstrate the practicality of semi-virtual addressing by enabling heap randomization and type-safe allocation on the extended address space, efficiently enhancing memory protection.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1718

RISC-Q: A Generator for Real-Time Quantum Control System-on-Chips Compatible with RISC-V

Junyi Liu; Yi Lee; Haowei Deng; Connor Clayton; Gengzhi Yang; Xiaodi Wu

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

Quantum computing imposes stringent requirements for the precise control of large-scale qubit systems, including, for example, microsecond-latency feedback and nanosecond-precision timing of gigahertz signals – demands that far exceed the capabilities of conventional real-time systems. The rapidly evolving and highly diverse nature of quantum control necessitates the development of specialized hardware accelerators. While a few custom real-time systems have been developed to meet the tight timing constraints of specific quantum platforms, they face major challenges in scaling and adapting to increasingly complex control demands – largely due to fragmented toolchains and limited support for design automation. To address these limitations, we present RISC-Q – an open-source flexible generator for Quantum Control System-on-Chip (QCSoC) designs, featuring a programming interface compatible with the RISC-V ecosystem. Developed using SpinalHDL, RISC-Q enables efficient automation of highly parameterized and modular QCSoC architectures, supporting agile and iterative development to meet the evolving demands of quantum control. We demonstrate that RISC-Q can surpass the performance of existing QCSoCs with significantly reduced development effort, facilitating efficient exploration of the hardware-software co-design space for rapid prototyping and customization.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH1732

Drop-In Masked Modular Reduction for ML-DSA: Cutting Side-Channel Cost in the Root-of-Trust

Merve Karabulut; Reza Azarderakhsh

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon’s Last Stand: Defending Against Quantum and Architectural Decay +1
Abstract

Masking is an effective defense against side-channel attacks, yet it remains costly under hardware constraints. The Caliptra Root-of-Trust is a representative case, where its masked ML-DSA implementation incurs about 6× area overhead. We propose a novel first-order masking solution that optimizes Caliptra, achieving significant improvements in area–delay efficiency. Compared to Caliptra's ML-DSA reduction, our design achieves a 12.1× speedup, reducing LUTs by 86.7% and FFs by 94.5%, while improving area–delay efficiency by 91×. The optimized architecture increases signing throughput by 1.32×. TVLA with over 1,000,000 traces shows no first-order leakage, so the design satisfies Caliptra's security requirements while significantly improving implementation efficiency.
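The reported figures are mutually consistent under one simple reading, worked through below. This sketch assumes that area is measured by LUT count alone (FFs ignored) and that area–delay efficiency is the product of the area ratio and the speedup; the abstract does not state the exact definition used.

```python
speedup = 12.1            # reported speedup of the reduction unit
lut_reduction = 0.867     # reported fraction of LUTs removed

# Assumption: area ratio = baseline LUTs / new LUTs.
area_ratio = 1.0 / (1.0 - lut_reduction)
# Assumption: area-delay gain = area ratio * speedup.
area_delay_gain = speedup * area_ratio

print(round(area_ratio, 2), round(area_delay_gain, 1))
```

Under these assumptions the product comes out at roughly 91, matching the stated 91× improvement.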

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestEDA7-II. Physical Design and VerificationDesign
RESEARCH1741

Two Teachers Better Than One: Hardware-Physics Co-Guided Distributed Scientific Machine Learning

Yuchen Yuan; Junhuan Yang; Hao Wan; Yipei Liu; Hanhan Wu; Youzuo Lin; Lei Yang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Fast, Furious, and Fault-Tolerant: Accelerating the Generative Grind +1
Abstract

Scientific machine learning (SciML) field deployment faces communication bottlenecks from centralized architectures, while distributed approaches violate physical principles. We introduce EPIC, a hardware-physics co-guided framework using full-waveform inversion as a representative task. EPIC performs lightweight edge encoding and physics-aware central decoding, transmitting compact latents instead of raw data. Cross-attention preserves inter-receiver wavefield coupling while reducing communication costs. Evaluated on five Raspberry Pi devices across 10 OpenFWI datasets, EPIC reduces latency by 8.9× and communication energy by 33.8×, while improving reconstruction fidelity on 8 out of 10 datasets compared to centralized approaches.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH1742

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Prashanth Vijayaraghavan; Apoorva Nitsure; Luyao Shi; David Beymer; Ehsan Degan; Vandana Mukherjee

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1764

CipherShield: Safeguarding DNN Inputs from Ciphertext Side-Channels in TEE

Hui Feng; Yuntao Liu; Qian Wang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Robust and Trusted AI Systems: Attacks, Defenses, and Hardware Reliability +1
Abstract

Confidential environments such as Trusted Execution Environments (TEEs) are increasingly used to protect the data and models of machine learning applications from adversarial attacks. Frameworks like TEE-protected Neural Networks (NN) have been deployed for secure cloud-based inference. However, recent studies have revealed a kind of ciphertext side channel that exploits the encryption process within TEEs, known as CipherSteal attacks. Specifically, when data is transferred into a TEE using encryption schemes such as AES, the resulting ciphertext patterns can be observed to infer input data. In this paper, we propose a series of methods named CipherShield to mitigate the ciphertext side-channel leakage and protect sensitive input data. It is a set of lightweight transformations that diffuse and decorrelate ciphertext without modifying the cryptography scheme. Our defenses, including block-based encryption, sparsity, and quantization, disrupt the per-address mapping patterns that ciphertext side-channels rely on to detect collisions while maintaining TEE compatibility and high throughput. The ciphertext hit rate, which reflects the amount of information leaked through ciphertext traces, drops from 50–80% to nearly zero across all evaluated datasets. On MNIST, CipherShield reduces classification accuracy from about 68% to as low as 2%. Evaluations on the Chest X-ray and CelebA datasets show similar reductions. Also, prototyping the block-based encryption in a TEE environment achieves markedly lower runtime than the original trace-based method, reducing encryption time by over 20x.

SecuritySEC1. AI/ML Security/PrivacyAIQuantum
RESEARCH1765

L2L: Logic to Layout Exploration of Standard Cell Library Design

Byeonggon Kang; Alan Mishchenko; Masahiro Fujita; Bill Lin; Chung-Kuan Cheng

Date:Tuesday, July 28 Location:Mtg Room 101B Session:AI Graces the Cell Party: Transformers, LLMs and Zero Drama Extraction +1
Abstract

We present L2L, exploring end-to-end design optimization from logic to layout. The L2L framework includes logic, circuit, and layout stages. Logic exploration introduces two-stage transistor network synthesis. At the circuit stage, we propose two SMT models for transistor network synthesis: Connected-Diffusion (arbitrary, stack-height-bounded, multi-solution) and Grid-Scaffold (series-parallel). At the layout stage, an ILP-based flow co-optimizes circuit topology and double-height layouts, producing two 3-input P-class libraries (area- and metal-optimized). Block-level experiments demonstrate that full inclusion of 3-input P-class functions achieves up to 1.3% lower power, 5.6% higher performance, and 4.1% smaller area (average: 0.8%, 0.6%, 2.6%) outperforming the best prior baseline.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH1784

Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC

Cunxi Yu; Haoxing Ren

Date:Monday, July 27 Location:Mtg Room 101A Session:Self-Driving EDA: Agents, Benchmarks, and Evolving Toolchains +1
Abstract

This paper presents the first self-evolved logic synthesis framework, which leverages Large Language Models (LLMs) in a multi-agent setup to autonomously improve the source code of ABC. Our approach builds upon recent work in LLM-driven code evolution and extends the idea to the substantially more complex monolithic ABC codebase. We bootstrap the process with established human-designed optimizations and external research code. In each iteration, the agents propose and implement code modifications, which are then validated for correctness and evaluated on standard benchmark circuits to provide quality-of-result (QoR) feedback. Over time, the framework organically discovers improvements beyond the initial heuristics and completes the code evolution autonomously across the whole ABC repository, effectively learning to progress toward a better synthesis tool.

AIAI2-I. AI/ML Algorithms and ModelsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1793

Stone-Skimming: Attacking Big-Data Analytics Applications with Page Table Walk

Jiajun Luo; Yu Jin; Yufeng Gu; Jinyi Deng; Yang Hu; Shuwen Deng

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Silicon Under Siege: Hardware-Rooted Attacks and Defenses in Modern SoCs +1
Abstract

This paper proposes the Stone-Skimming microarchitectural attack, the first page table walk (PTW)-based attack targeting large-scale sensitive memory applications. Stone-Skimming leverages the correlation among multiple signals to establish novel side channel attacks. We develop the first microarchitectural attack targeting DNA sequence reconstruction. Multiple PTW entries generated by the victim are cross-validated, enabling robust DNA reconstruction where traditional attacks fail. Moreover, we discover a new stateful side effect of PTW on prefetch instructions that can be exploited to break KASLR, even with KPTI on. We also introduce the early-instruction-fetch (early-IF) attack capable of bypassing LFENCE and CPUID defense.

SecuritySEC3-I. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH1804

PeRIFi: Physics-Infused RF Inverse Design with Parametric Data-Efficient Feasible-Region Sampling

Siwen Wang; Keren Zhu; Hyunsu Chae; Zhaori Bi; Zhiang Wang; Changhao Yan; Xuan Zeng

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Recent AI-driven inverse design approaches have shown promise in synthesizing complex electromagnetic (EM) structures for radio-frequency (RF) circuits. However, existing methods suffer from two fundamental limitations: random topology generation often produces physically infeasible layouts, and pixel-based representations encode only topology while ignoring geometric dimensions, forcing complete dataset regeneration and model retraining whenever layout scale changes. To address these issues, we propose PeRIFi, a physics-constrained and geometry-aware inverse design framework built on three key innovations. (1) Feasibility-aware parameterization integrates B-splines, which guarantee direct-current (DC) connectivity, with level-set representations that enable flexible geometric variation, ensuring 100% physically feasible structure generation. (2) Explicit geometric encoding decouples topology from geometric dimensions, allowing a single surrogate model to generalize across multiple layout scales without regenerating datasets. (3) High-dimensional optimization employs Particle Swarm Optimization tailored to the proposed 268-dimensional feasible design space. Experimental results demonstrate substantial data-efficiency gains: with only 5k training samples, PeRIFi attains 73% lower MSE and a 6.7% higher 𝑅^2 than a pixel-based baseline trained on 20k samples. Furthermore, PeRIFi reduces optimization cost by 22.19%–34.71% and lowers prediction error (MAE) by 58.2% compared with state-of-the-art methods, enabling more accurate and scalable RF inverse design.

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH1813

RL4HDL: Code Diversity Guided FPGA Logic Synthesis Compiler Testing Via Reinforcement Learning

Zhihao Xu; Hui Zeng; Hui Li; Qian Ma; Furui Zhan; Shikai Guo

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

Logic synthesis compilers play an indispensable role in FPGA design, translating register-transfer-level (RTL) descriptions into gate-level netlists. Defects in these compilers may compromise the security of the final hardware implementations. Existing testing methods often produce redundant test cases, which limits their ability to effectively explore compiler defects. To address this problem, we propose RL4HDL, a new method that uses reinforcement learning to guide logic synthesis compiler testing. Comprehensive experiments conducted over a three-month period demonstrate the practical effectiveness of our approach: we discovered 20 unique defects, including 12 previously unreported ones, all of which have been confirmed by the official developers.

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH1822

Leveraging AI-Inspired Hardware Architecture to Enhance LPN Acceleration in Post-Quantum Cryptography

Yingxue Gao; Jiliang Zhang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

Learning Parity with Noise (LPN) has emerged as a promising foundational primitive for post-quantum cryptography (PQC). However, existing LPN accelerators primarily optimize isolated computation while overlooking execution flow and suffering from data transfer bottlenecks. Recent advances in AI accelerators have demonstrated remarkable capabilities attributed to their novel architecture paradigms. Inspired by this insight, this work presents LHPA, the first AI-inspired hardware accelerator for LPN acceleration on FPGAs. LHPA implements a heterogeneous architecture comprising configurable hardware engines. Then, LHPA employs a stage-level scheduling strategy and establishes a bandwidth-oriented performance model. LHPA achieves a 52.80x performance gain over the state-of-the-art LPN architecture.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH1837

Factored Floating-Point MAC: Eliminating Redundant Normalization for Energy-Efficient Systolic Arrays

Hao Xiong; Zhichao Rui; Zhijie Yang; Rui Gong; Lei Wang

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

Floating-point multiply-accumulate (FPMAC) is crucial in scientific computing, machine learning, and graphics but remains a major performance and energy bottleneck. Conventional IEEE-754 FP-FMA schemes optimize single multiply-add fusion but overlook long-chain MAC patterns in GEMM, leading to redundant rounding and normalization, accumulated numerical error, and high delay and power overhead from wide carry-propagate adders (CPAs). We propose Factored-FPMAC (F-FPMAC), which employs 4:2 carry-save adders (CSAs) and CSA-based redundant representation inside systolic arrays to eliminate CPAs from processing elements and defer normalization to a unified post-processing stage. To prevent accumulated intermediate value overflow (AIVO) under deferred normalization, we introduce a lightweight hierarchical risk-aware boundary protection mechanism. To further reduce register overhead from redundant representation, we replace per-PE dual buffers with an array-shared buffer pool. Experimental results indicate that F-FPMAC reduces critical-path delay by 61.9%, lowers power consumption by 22.6%, improves energy efficiency by 3×, achieves nearly two orders of magnitude lower numerical error, and decreases overflow events by up to 44.5%.
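The benefit of keeping the accumulator in carry-save form can be seen in a small Python sketch. It is a simplification that works on non-negative integers standing in for aligned mantissas, so exponent alignment, rounding, and the paper's overflow protection are out of scope; the function names are hypothetical.

```python
def csa_3to2(a, b, c):
    """3:2 carry-save adder: three operands in, (sum, carry) out, no carry propagation."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def csa_4to2(a, b, c, d):
    """4:2 compressor built from two 3:2 stages."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)

def mac_chain(pairs):
    """Accumulate products in redundant (sum, carry) form; resolve carries once at the end."""
    acc_s, acc_c = 0, 0
    for x, w in pairs:
        # A hardware multiplier would emit the product as two redundant words;
        # here it is passed as (product, 0) for brevity.
        acc_s, acc_c = csa_4to2(acc_s, acc_c, x * w, 0)
    # Single carry-propagate addition, deferred to the end of the chain.
    return acc_s + acc_c

result = mac_chain([(3, 5), (7, 2), (4, 4)])   # 15 + 14 + 16
```

Each loop iteration uses only bitwise logic with constant depth, which is why removing the carry-propagate adder from the per-step path shortens the critical path in designs of this style.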

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH1839

Accelerating GPU Inference of Large Language Models with Moderately Unstructured Sparse Weight Matrices

Tao Lu; Haoyu Wang; Zonghui Wang; Keshen Xiang; Jiaheng Zhang; Wenzhi Chen

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

With the growing deployment of large language models (LLMs), LLM inference cost has become a key challenge. Pruning techniques that introduce sparsity into weight matrices can accelerate inference. However, maintaining model quality typically limits pruning to moderate unstructured sparsity (around 50%). At these sparsity levels, none of the existing GPU kernels for sparse matrix multiplication (SpMM) can outperform their dense counterparts. This paper proposes an efficient GPU inference method for LLMs with moderate sparsity. We propose a three-layer matrix storage format comprising: (i) a Sparse-TC layer enabling sparse tensor cores to accelerate SpMM; (ii) a Slot-Filling layer using parallel differential distance for matrix compression while supporting low-cost on-chip decoding; (iii) a lightweight Residual Layer ensuring correct SpMM computation. Building on this format, we design a SpMM kernel that jointly utilizes sparse tensor cores and CUDA cores. This design enables an efficient execution pipeline and overlaps on-chip computation with memory access. Evaluations show that our work is the first to outperform dense matrix multiplication on modern GPUs equipped with high-bandwidth memory (HBM). It achieves up to 1.64× kernel-level speedup over SpInfer (EuroSys'25, Best paper) and up to 1.41× end-to-end speedups over FlashLLM (VLDB'24). Our source code: https://anonymous.4open.science/r/spmm-32E1.

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH1841

SchemaCoder: Automatic Log Schema Extraction Coder with Residual Q-Tree Boosting

Jiaxin Wan; Chia-tung Ho; Rongjian Liang; Cunxi Yu; Deming Chen; Haoxing Ren

Date:Monday, July 27 Location:Mtg Room 101A Session:Self-Driving EDA: Agents, Benchmarks, and Evolving Toolchains +1
Abstract

Log schema extraction, the process of deriving human-readable templates that support clear log understanding from massive volumes of log data, is fundamental for automated debugging and performance analysis across modern Electronic Design Automation (EDA) tool chains. The challenge is amplified in advanced design flows, where logs interweave command scripts, multi-stage progress reports, timing updates, and deeply nested performance tables, rendering manual regular-expression design brittle and unscalable to evolving tool versions. We introduce SchemaCoder, a fully-automated LLM-driven schema extraction framework that constructs structured log schemas by synthesizing reusable parser code, enabling robust handling of arbitrary unstructured log files and complex EDA tool logs without manual regular-expression design. At its core, SchemaCoder uses a novel Question-Tree (Q-Tree) pattern code generation process to identify pattern codes and utilizes the extracted raw contents to drive a textual residual evolutionary optimizer in the inner loop without relying on a gold labeled dataset. In the outer loop, a residual Q-Tree boosting mechanism identifies additional pattern codes and iteratively refines the parser code. On EDA tool logs from OpenROAD and commercial tools, the structured log schema generated by SchemaCoder enables an average 11.7% improvement in agentic EDA tool log analysis QA tasks (pass@1) over a strong commercial baseline. SchemaCoder also achieves up to 7.3% improvement in the average scores and outperforms state-of-the-art baselines in 9 of 14 applications on LogHub-2.0. We will open-source the code and the EDA tool log QA benchmark for reproducibility upon acceptance.

AIAI2-I. AI/ML Algorithms and ModelsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1846

COOL: A Cooling-Aware Point Transformer Framework for Thermal Prediction in Advanced 3D/3.5D IC Packaging

Yao Lu; Zhicheng Guo; Qijun Zhang; Shang Liu; Wenji Fang; Wenkai Li; Zhiyao Xie

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

Advanced 3D and 3.5D IC packaging significantly improves integration density but elevates thermal management challenges due to cross-layer heat coupling and complex cooling structures. Traditional solvers deliver high fidelity but are too slow for iterative design flows, while existing learning-based methods either fail to capture inter-die thermal coupling or treat cooling structures as static components, limiting their applicability in real packaging co-design scenarios. In this work, we introduce COOL, a cooling-aware point transformer framework that represents heterogeneous assemblies (dies, interposers, TIMs, heat spreaders) as annotated 3D point clouds embedding geometric, material, and power attributes. COOL explicitly encodes geometric boundaries and cooling structures, and introduces a physics-informed boundary condition (PI-BC) loss to enforce thermal consistency at material interfaces and cooling boundaries. Extensive experiments demonstrate that COOL achieves a remarkable 2.4% NMAE on our constructed benchmark of multi-package thermal designs, substantially outperforming existing learning-based approaches while providing over 15.7× speedup compared to commercial FEM solvers.

EDAEDA4. Power Analysis and OptimizationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1853

Signing Crossbar Columns Instead of Neural Network Weights in Compute-in-Memory Accelerators

Jun Pei; Shaocong Wang; Yian Yang; Jiukai Fang; Jiuxian Zhang; Yuliang Shen; Yujian Du; Yan Wang; Yuchao Yang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Move Less, Compute More: Memory-Centric Architectures for AI +1
Abstract

Weights in neural networks (NNs) are signed. Resistive Random Access Memory (RRAM), though a promising compute-in-memory (CiM) technology for NN acceleration, is inherently unipolar because its conductance can only be positive, making it challenging to represent signed weights. Existing mainstream solutions utilize a differential pair of RRAM cells to represent a signed weight, which doubles the array size and significantly increases peripheral circuit cost, limiting the scalability and energy efficiency of CiM accelerators. In this work, instead of assigning a specific sign to each weight, we treat each column of the CiM crossbar as a unit and assign an identical sign to the entire column. To do this, we propose a Sign-Align Training and Mapping (SATM) framework that regularizes the network during training so that all weights mapped to the same column of a CiM crossbar share an identical sign. A per-column Sign Register (PSR) is added to the array's periphery to store the learned 1-bit sign configuration for each respective column. The experimental results show that the proposed SATM achieves up to 49.3% reduction in area overhead and up to 60.7% reduction in energy consumption for AlexNet, ResNet18, VGG16, ResNet50, and ViT-Base models compared to differential-pair implementations.
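The per-column sign idea can be illustrated with a short Python sketch. It assumes an idealized crossbar in which each cell stores only a non-negative magnitude and a 1-bit register flips the sign of the whole column's output; the training-time regularization that makes columns sign-aligned is not shown, and the function names are hypothetical.

```python
def split_sign_aligned(weights):
    """Split a sign-aligned weight matrix into cell magnitudes and per-column signs."""
    rows, cols = len(weights), len(weights[0])
    signs = []
    for j in range(cols):
        col = [weights[i][j] for i in range(rows)]
        if any(w > 0 for w in col) and any(w < 0 for w in col):
            raise ValueError(f"column {j} mixes signs")
        signs.append(-1 if any(w < 0 for w in col) else 1)
    mags = [[abs(w) for w in row] for row in weights]
    return mags, signs

def crossbar_mvm(x, magnitudes, column_signs):
    """y[j] = sign[j] * sum_i x[i] * |w[i][j]|, using one sign bit per column."""
    rows, cols = len(magnitudes), len(magnitudes[0])
    y = []
    for j in range(cols):
        column_current = sum(x[i] * magnitudes[i][j] for i in range(rows))
        y.append(column_signs[j] * column_current)
    return y

weights = [[1, -2],
           [3, -4]]          # column 0 all positive, column 1 all negative
mags, signs = split_sign_aligned(weights)
y = crossbar_mvm([1, 2], mags, signs)
```

Since every column needs only one cell per weight plus a single sign bit, this representation avoids the second cell that a differential pair requires.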

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIQuantum
RESEARCH1858

ReflectBench: An Agentic Framework for Generating System-Level Design Testbench via Consensus and Reflection

Junzhe Liu; Chao Li; Puyuan Zhang; Jinheng Wang; Xiaowei Chen; Zhuorui Zhao; Zhaoyan Shen; Mengying Zhao; Zheyu Yan; Zhenge Jia

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

Hardware design verification is essential for ensuring the functional correctness of electronic designs. Recent advances have explored using Large Language Models (LLMs) to automate hardware testbench synthesis. However, existing approaches primarily target single hardware modules and often fail to generalize to complex, system-level designs. As design complexity increases, both the LLM-generated testbench and the reference design are more likely to exhibit correlated or compensating errors, often producing false-positive outcomes that obscure true functional violations. To tackle these challenges, we propose ReflectBench, an agentic framework for automated system-level hardware design verification with logic analysis, consensus-based validation, feedback self-reflection, and testbench self-correction. To further minimize manual effort in constructing knowledge bases or preparing examples, we devise Self-Reflective Domain Knowledge Mining, an automated approach for constructing an effective retrieval database aligned with hardware semantics. We develop ReflectVDB, a benchmark consisting of 53 system-level hardware designs for evaluating LLM-based verification frameworks. Experimental results show that our approach outperforms the SOTA framework ConfiBench on both the ReflectVDB and AutoBench datasets, achieving 7.6% and 4.5% improvement in terms of testbench generation success rate, and 3.8% and 1.2% improvement in terms of coverage rate. The framework and ReflectVDB benchmark are open-sourced at https://anonymous.4open.science/r/ReflectBench/.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1859

iPCL-M: Pre-Training of Chip Layout for Metrics Evaluation and Optimization

Xinhua Lai; He Liu; Weiguo Li; Yihang Qiu; Miao Liu; Simin Tao; Xingquan Li; Jungang Xu

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

Physical design is a critical step in the electronic design automation (EDA) process, where the gate-level netlist from logic synthesis is converted into a GDS-II file. Recently, AI methods have been increasingly employed to address challenges in physical design. This paper proposes a new paradigm that integrates routing metric evaluation and optimization on chip layouts, forming a "Generate–Evaluate–Optimize–Generate" (GEOG) closed loop. In our experiments, the pre-routing generation model achieves a mean relative error (MRE) of only 1%, with a mean absolute error (MAE) below one movement unit. Post-processing reduces both MRE and MAE to zero while ensuring connectivity. The evaluation model performs comparably to iSTA on the test set, with a 336× speedup. Furthermore, the optimization strategy improves wirelength, delay, slew, R, and C by 10.49%, 13.03%, 14.85%, 10.51%, and 6.46%, respectively, and produces routing results superior to commercial tools in key metrics.

EDAEDA3. Timing Analysis and OptimizationAIDES5. Emerging Device and Interconnect Technologies
RESEARCH1862

Thermal Stability Strategy for a 12nm/28nm Hybrid-Bonding Wafer-on-Wafer 3D-Integrated eFlash In-Memory Computing SoC

Guangyao Wang; Saiya Wang; Yuexi Lv; Jenny Ma; Jing Kou; Yizhe Chen; Jinhao Guo; Yichen Song; Yingming Lu; Wang Kang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

3D In-memory Computing (IMC) chips, with high density and bandwidth, offer a promising solution for AI workloads but suffer from thermal instability, which degrades weight precision and system throughput. This work proposes a cross-layer thermal stability strategy (TSS) and validates it on the first 3D eFlash analog IMC prototype with hybrid bonding, integrating a 12nm logic die and a 28nm eFlash die (36MB). TSS consists of: (1) a physics-guided thermal model achieving a Kullback-Leibler divergence of 0.088; (2) a temperature-insensitive differential programming (TIDP) scheme that stabilizes weights from –30°C to 60°C; and (3) an adaptive temperature calibration (ATC) algorithm that reduces output error standard deviation by 60.7%. With TSS, ResNet-18 and DCCRN maintain robust inference on the 3D eFlash IMC SoC across –15°C to 75°C with less than 10% degradation in end-to-end task accuracy. This is the first demonstration of thermally resilient analog IMC on a commercial-grade 3D SoC, paving the way for reliable AI deployment under dynamic thermal conditions.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH1872

SSD Reforge: Dynamic Parity Management to Extend SSD Lifetime Under Aging

Ziyu Zeng; Yingdi Shan; Jinlei Jiang; Mingxing Zhang; Yongwei Wu

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Scaling the Bedrock: OS-Hardware Co-Design for High-Reliability Storage +1
Abstract

Flash wear raises the raw bit error rate (RBER), limiting SSD lifetime as measured by TBW or P/E cycles. We present SSD Reforge, which dynamically strengthens protection near end‑of‑life by adding intra‑device parity, extending the device's usable lifetime. To handle inter‑block reliability heterogeneity, we build an RBER‑based error‑probability model and formulate parity stripe grouping as a combinatorial optimization problem. We design Reliability‑Aware Striping (RAS) for efficient grouping and Stripe UPER Leveling (SUPERL) to sustain RAS during subsequent writes. Implemented in FEMU and evaluated with real I/O traces, SSD Reforge increases user TBW by up to 40% on average versus traditional mode at the lifetime limit, significantly extending SSD lifespan.

SystemsSYS5. Embedded Memory and Storage SystemsQuantumDES3. Emerging Models of Computation +1 more
RESEARCH1878

Hardware-Aware Neural Architecture Search for Real-Time BEV-Based 3D Object Detection

Jisup Lee; Soonhoi Ha; Jeongwon Her

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Bird's-eye-view (BEV) fusion of multi-modal sensor data is a leading approach for high-performance 3D object detection. However, state-of-the-art models rarely exceed 10 FPS even on powerful GPUs, limiting their use in real-time applications. In this paper, we present a neural architecture search (NAS) framework that identifies models achieving both high accuracy and real-time inference. The search space is explicitly designed with hardware constraints to balance accuracy and computational efficiency on resource-limited platforms. To better align LiDAR and camera features in the BEV conversion module, we introduce a lightweight depth estimation network and a LiDAR–camera cross-attention mechanism that enhances detection accuracy with minimal overhead. On the challenging nuScenes benchmark, our model achieves 70.1% mAP, a 2.3 percentage-point improvement over the baseline model, while running at 35.6 FPS on the NVIDIA Jetson AGX Orin, demonstrating its suitability for real-time autonomous driving.

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH1885

PEACE: Algorithm–Hardware Co-Design for PE-Granular Energy-Aware Compressed GEMM Acceleration

Yueting Li; Terry Ye; He Xiao; Zheng Liu; Ngai Wong; Weisheng Zhao

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

General matrix–matrix multiplication (GEMM) remains the dominant compute kernel in transformer-based large language model (LLM) inference. However, large-scale matrices exhibiting irregular sparsity fundamentally limit throughput and energy efficiency in existing systems. To address these challenges, we propose PEACE, an energy-aware matrix compression framework for accelerating GEMM operations at the processing element (PE)-array granularity. PEACE comprises a novel matrix compression algorithm, PEACE-Alg, which reorganizes large-scale matrices into hardware-friendly compressed formats by exploiting the column-wise energy distribution. To support PEACE-Alg, we design a hardware microarchitecture, PEACE-Hw, that integrates RISC-V ISA extensions with a table-metadata fetcher and a metadata-driven operand loader to feed a reconfigurable systolic PE array, along with a dedicated partial-sum adder for efficiently merging intermediate results. Experimental results show that PEACE occupies 2.705 mm² in a 14 nm ASIC and delivers a peak INT8 throughput of 88 GOPS/W. PEACE achieves a 1.60×–1.67× speedup over a RISC-V core baseline across 14 transformer-based LLMs. Compared with state-of-the-art designs, PEACE provides 4.4× higher PE density, 1.79× average speedup, and 2.53× higher energy efficiency.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH1899

ACE-NAS: A Zero-Cost NAS Framework for Co-Design of Architecture and Encoding in Spiking Neural Networks

Mingyu Qin; Liangming Fang; Wei Liu; Jinghai Wang; Shanlin Xiao

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Spiking Neural Networks (SNNs) promise exceptional energy efficiency for neuromorphic computing through event-driven processing. However, unlocking their full potential requires navigating a complex, strongly coupled design space of network topology and temporal neural encoding. Existing SNN Neural Architecture Search (NAS) frameworks typically decouple these dimensions, focusing exclusively on topology search while relying on manual, fixed encoding schemes. This limitation leaves a vast portion of the design space unexplored, resulting in sub-optimal energy-accuracy trade-offs. To bridge this gap, we present ACE-NAS, the first zero-cost NAS framework that automates the co-design of SNN architecture and encoding mechanisms. Addressing the challenge of "encoding-blind" proxies, we introduce Jacob_cov, a novel Jacobian-based spectral metric that efficiently quantifies the temporal discriminability of different encoding schemes without training. ACE-NAS integrates Jacob_cov with structural proxies (ZiCo) into a hardware-aware, multi-objective evolutionary search strategy. On CIFAR-10, ACE-NAS achieves 92.95% accuracy, effectively identifying Pareto-optimal designs that balance high performance with minimal spike activity. By automating the joint optimization of structure and dynamics, ACE-NAS delivers an orders-of-magnitude reduction compared to standard training-based NAS and a 7× speedup over state-of-the-art efficient SNN-NAS methods (e.g., AutoSNN).

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH1900

KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Zihao Zheng; Zhihao Mao; Maoliang Li; Jiayu Chen; Xinhao Sun; Zhaobo Zhang; Donggang Cao; Hong Mei; Xiang Chen

Date:Wednesday, July 29 Location:Mtg Room 101A Session:The Edge-Lord Chronicles: Tiny Gestures, Big LLMs, and Cluster Chaos +1
Abstract

Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low inference speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address these two issues effectively. Meanwhile, although embodied intelligence is the bridge between AI and the physical world, existing work has overlooked the application of robotic kinematics. To address these issues, we combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%–37% acceleration with nearly no loss in success rate.
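
As a rough illustration of the Kalman-filter idea mentioned in the abstract, the sketch below tracks one action coordinate with a generic constant-velocity model and uses the filter's prediction as a stand-in value. The class name `ConstantVelocityKF`, the noise variances, and the one-dimensional state are assumptions made for illustration; the abstract does not specify KERV's actual kinematic model.

```python
import numpy as np

class ConstantVelocityKF:
    """Hypothetical 1-D constant-velocity Kalman filter (state: position, velocity)."""

    def __init__(self, dt=1.0, process_var=1e-4, meas_var=1e-2):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
        self.H = np.array([[1.0, 0.0]])              # only position is observed
        self.Q = process_var * np.eye(2)             # process noise (assumed)
        self.R = np.array([[meas_var]])              # measurement noise (assumed)
        self.x = np.zeros(2)                         # initial state estimate
        self.P = np.eye(2)                           # initial covariance

    def predict(self):
        """Advance one step and return the predicted position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return float(self.x[0])

    def update(self, z):
        """Correct the state with an observed position z."""
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R      # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P


# Feed accepted actions; when a drafted action is rejected, the prediction
# could serve as a cheap substitute instead of re-running the large model.
kf = ConstantVelocityKF()
for k in range(30):
    kf.predict()
    kf.update(0.1 * (k + 1))     # synthetic, steadily increasing joint position
fallback_action = kf.predict()
```

The filter only needs a few small matrix products per step, which is why such a predictor is cheap relative to a model forward pass.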

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH1907

Seeing Through Silicon: Multimodal Thermal Reconstruction via Physics-Informed Diffusion

Jing Li; Yuquan Sun; Wei Xing; Yuanqing Cheng

Date:Monday, July 27 Location:Mtg Room 201B Session:Advances in 3D/2.5D Physical Design +1
Abstract

Real-time thermal monitoring of 3D integrated circuits confronts a fundamental observability crisis: reconstructing high-resolution volumetric thermal fields from sparse, heterogeneous on-chip observations. We demonstrate that existing single-modality approaches are physically insufficient: performance counters capture heat generation but are blind to thermal transport, while sparse sensors capture local states but lack spatial coverage. To bridge this gap, we introduce a physics-informed conditional diffusion framework that synergistically fuses these complementary modalities, treating thermal reconstruction as a constrained inference problem over the heat equation. Our 3D architecture explicitly models vertical inter-layer coupling, reducing estimation error by over 94% (0.48°C vs. 13.08°C RMSE) compared to single-source baselines. Furthermore, by leveraging deterministic sampling, we accelerate inference 155× over finite-element simulation (0.95s vs. 147s) while maintaining high physical fidelity (R² > 0.99). This work establishes a new, observability-driven paradigm for thermal monitoring, enabling practical closed-loop dynamic thermal management for next-generation 3D systems.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageSYS4. Embedded System Design Tools and Methodologies
RESEARCH1909

PIMony: A DRAM-PIM Design for Harmonizing PIM and Memory Accesses with Minimal Interference

Gyeonghwan Park; Inseong Choi; Byungkuk Yoon; Sanghyeok Han; Jae-Joon Kim

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Move Less, Compute More: Memory-Centric Architectures for AI +1
Abstract

Edge-side LLM inference is gaining importance, yet its single-query nature shifts the performance bottleneck from computation to memory traffic. Processing-in-Memory (PIM) can exploit high internal bandwidth, but on resource-constrained edge devices, the same DRAM must also serve host memory requests, causing frequent bank conflicts and global stalls. We present PIMony, a DRAM-PIM architecture that enables seamless co-execution of PIM and conventional memory operations. PIMony introduces (i) interrupt-based asynchronous PIM, which allows preemptible MAC execution to remove channel-wide stalls, and (ii) Dual-Path Subarray Access (DPSA), which permits concurrent PIM and memory access within a single bank through dual-row activation. These mechanisms jointly enable higher co-execution efficiency and stable LLM decoding latency under concurrent PIM and memory accesses.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIQuantum
RESEARCH1920

Graph-Based Attack-Chain Traceback Across Multiple In-Vehicle Network Domains

Abla Smahi; Xiaohang Wang; Yu Xin; Chang Zhu; Amit Kumar Singh; Kaiwei Wu; Yingtao Jiang; Kui Ren

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

The expanding connectivity of modern vehicle networks has widened their attack surface, creating critical vulnerabilities to multi-stage attacks in which an initial compromise can lead to control over core vehicle functions. Addressing these vulnerabilities requires precise attack traceback to avoid costly security responses. While traceback is well established in traditional networks, its application to automotive systems faces fundamental barriers. International homologation standards (UN R155, ISO/SAE 21434) require network architectures to maintain certification compliance, rendering traditional traceback methods that modify communication stacks (e.g., packet marking) legally and technically infeasible. The limited resources of production vehicles further exacerbate this challenge. We present the first practical solution to this impasse: a gateway-resident traceback system operating entirely within certification boundaries. Our key insight is to reconstruct attack chains solely from existing intrusion detection system (IDS) alerts across host, Ethernet, and Controller Area Network (CAN) domains by modeling these alerts as nodes in a temporal graph. Edges are admitted only after four plausibility checks (topological, temporal, semantic, and kinematic), ensuring all reconstructed chains are physically and functionally plausible. Our system achieves a 0.90 edge-level F1 score (64% higher than baselines) with a 0.086 false discovery rate, representing a 72% reduction, while using only 1.2% CPU and 26 MB of memory on a production Leapmotor C10. In addition, we release the dataset for traceback.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH1922

Pegasus: Personalized and Communication-Efficient Mixture-of-Experts System for Distributed Edge LLM Inference

Zhixuan Liao; Kongyange Zhao; Tao Ouyang; Liekang Zeng; Siqi Luo; Xu Chen

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Large Language Models (LLMs) are increasingly pushed to the edge to reduce latency, yet limited memory and compute make full-model deployment infeasible. Mixture-of-Experts (MoE) offers a lightweight alternative, but edge nodes can host only a few experts, creating persistent mismatches between dynamic query semantics and locally available expert models under resource constraints. We present Pegasus, a collaborative MoE inference system for multi-edge deployments that leverages the spatiotemporal correlations inherent in distributed query semantics. Pegasus integrates three innovations: (i) a similarity-based expert deployment mechanism using an efficient heuristic metric to assess the semantic relevance between queries and experts across nodes and over time; (ii) a personalized gating design that selectively fine-tunes a subset of gating parameters on each node to balance expert accuracy and communication latency; (iii) an intra-node online scheduling algorithm with adaptive batching for efficient memory utilization. Extensive performance evaluations corroborate that Pegasus achieves 11.8x lower inference latency than state-of-the-art distributed MoE frameworks, demonstrating high-throughput and communication-efficient edge LLM inference.

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH1936

RTFL: Energy-Aware Federated Learning for AIoT Design Via Adaptive Quantization-Based Multi-Agent Scheduling

Jun Xia; Junqi Zhang; Zhaorong Zhu; Wenjie Chen; Mingsong Chen

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

Although Federated Learning (FL) is becoming increasingly popular in designing Artificial Intelligence of Things (AIoT) applications, due to the varying computing and communication capabilities of resource-constrained devices, it suffers from the problems of slow convergence and poor training performance, especially when devices are powered by batteries. To address these issues, this paper introduces RTFL, a novel energy-aware Real-Time Federated Learning framework based on Multi-Agent Reinforcement Learning (MARL), aiming to enhance the knowledge sharing across AIoT devices within a specified training time constraint. Specifically, RTFL employs an Adaptive Quantization-based Multi-Agent Scheduling (AQMAS) strategy, enabling a team of agents to intelligently select devices with specific model quantization levels for each round of local training, taking into account the resource constraints (e.g., remaining battery power, computing capability, and communication bandwidth) of the current devices. By facilitating collaboration among agents through reinforcement learning, our approach enables devices to maximize their contributions to forming an optimal global model, while balancing the trade-off between the accuracy of quantized models and the limited resources available on each device. Comprehensive experiments show that RTFL not only accelerates the convergence of FL training, but also encourages devices to participate in more rounds of knowledge aggregation, thereby significantly improving overall training performance.

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH1938

Phantom Walk: Reducing the Post-Migration Translation Path via Proactive Translation Dissemination in Multi-GPU Systems

Jihun Choi; Jaesang Moh; Tae Hee Han

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

Multi-GPU systems rely on unified virtual memory to provide a global address space, which inevitably triggers page migrations. Our analysis shows that reducing migrations further is difficult even with advanced policies. These migrations invalidate page table entries (PTEs) across GPUs and generate migration-induced misses (M-misses), which constitute a large portion of all translation requests and significantly degrade performance. This paper introduces Phantom Walk, which accelerates M-miss address translation by reducing GPU-local page faults and page table walks through a shorter post-migration translation path. Our evaluation shows that Phantom Walk improves overall performance by 44.37% on average.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH1942

PD-Swap: Prefill–Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration

Yifan Zhang; Zhiheng Chen; Ye Qiao; Zhenyu Tang; Sitao Huang

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

Aggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. As prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making context length a first-order systems concern. Recent work with table-lookup-based ternary matrix multiplication on edge FPGAs exposes a fundamental prefill–decode asymmetry: prefill is compute-bound and dominated by dense matrix–matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context. We propose a prefill–decode disaggregated LLM accelerator that uses Dynamic Partial Reconfiguration (DPR) to time-multiplex the attention module on edge FPGAs. The core table-lookup ternary matrix multiplication and weight-buffering engines remain static, while the attention subsystem is a reconfigurable partition with two phase-specialized dataflows: a compute-heavy, token-parallel prefill engine and a bandwidth-optimized, KV-cache-centric decoding engine. A roofline-inspired model and design space exploration jointly optimize reconfigurable-region size and parallelism under reconfiguration and routability constraints, and partial reconfiguration is overlapped with ongoing computation. Our design achieves state-of-the-art performance, with up to 27 tokens/s decoding throughput. Compared with the prior static design, PD-Swap achieves a 1.3×–2.1× speedup, with larger gains at longer context lengths.
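
The compute-bound versus memory-bound distinction above follows the classic roofline relation, attainable throughput = min(peak compute, bandwidth × arithmetic intensity). The sketch below is a generic illustration of that relation only; the function names and all numeric values are hypothetical and are not taken from the paper's model or hardware.

```python
def attainable_throughput(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Classic roofline bound: limited by compute or by memory traffic."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def is_compute_bound(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """True when memory can feed the compute units fast enough."""
    return bandwidth_bytes_per_s * ops_per_byte >= peak_ops_per_s


# Hypothetical device: 1 TOPS peak compute, 20 GB/s memory bandwidth.
PEAK = 1e12
BANDWIDTH = 2e10

# Prefill reuses each weight across many tokens (high ops per byte);
# decoding streams the KV cache for a single token (low ops per byte).
prefill = attainable_throughput(PEAK, BANDWIDTH, ops_per_byte=100.0)
decode = attainable_throughput(PEAK, BANDWIDTH, ops_per_byte=2.0)
```

Under these made-up numbers the prefill phase sits on the flat compute roof while decoding sits on the sloped bandwidth roof, which is the kind of asymmetry that motivates phase-specialized dataflows.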

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH1945

SafeGen: LLM-Driven Assertion Generation and Fault Criticality Evaluation for Functional Safety

Xuanyi Tan; Arjun Chaudhuri; Rubin Parekhji; Krishnendu Chakrabarty

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Leveraging AI for Silicon Health: Test, Yield & Fault Tolerance +1
Abstract

Traditional simulation-based fault analysis tends to be overly conservative and fails to reflect true fault criticality. This paper presents SafeGen, an LLM-driven, formal-verification-assisted framework for functional-safety-oriented fault criticality assessment. SafeGen employs large language models with a Hyper Knowledge Graph (HyperKG) to extract verifiable specifications and to evaluate their importance for overall system safety. The HyperKG is then extended with register-transfer level information to guide the generation of Functional Safety Assertions. A gate-to-RTL fault-mapping mechanism supporting both stuck-at and bridging faults, combined with formal property verification, enables semantic-level fault criticality grading. A digital–physical co-simulation platform for a field-oriented control system validates SafeGen.

EDAEDA9. Test, Validation and Silicon Lifecycle ManagementAIQuantum
RESEARCH1949

Athena: Analytical Timing-Aware Pre-Silicon Estimation of Side-Channel Leakage

C.Rohin Menon; Annapurna Valiveti; Janakiraman Viraraghavan; Chester Rebeiro

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

We present Athena, a scalable analytical framework for pre-silicon side-channel leakage evaluation that avoids simulation entirely. Athena integrates static timing information with signal-probability propagation to model time-dependent circuit behavior, capturing leakage arising from input-arrival skew, glitches, and other timing-induced effects. The resulting fine-grained leakage estimates precisely identify vulnerable signals and time intervals and support evaluation of diverse countermeasures, including masking, gate resizing, and differential logic. We evaluate Athena across a broad set of countermeasures on S-boxes and full ciphers, comparing both correctness and performance against state-of-the-art tools. Across all benchmarks, Athena delivers over 400× speedup over existing methods.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH1960

HyperLiDAR: Adaptive Post-Deployment LiDAR Segmentation via Hyperdimensional Computing

Ivannia Gomez Moreno; Yi Yao; Ye Tian; Xiaofan Yu; Flavio Ponzina; Michael Sullivan; Jingyi Zhang; Mingyu Yang; Hun Seok Kim; Tajana Rosing

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

LiDAR semantic segmentation plays a pivotal role in scene understanding for edge applications such as autonomous driving. Once deployed in the real world, the model must adapt to its surrounding environment through rapid adaptive training and updates, even with limited compute and energy resources on an edge device. Existing segmentation models rely on large neural networks, which need significant memory and computing resources for post-deployment adaptation. To address the above challenges, we introduce HyperLiDAR, the first lightweight LiDAR segmentation framework based on Hyperdimensional Computing (HDC) for adapting to streaming point cloud scans after deployment. HDC is a brain-inspired approach well-suited for efficient on-device learning. HyperLiDAR combines a pretrained feature extractor with HDC training to support lightweight adaptation on edge devices. We further design a buffer selection strategy to mitigate the high data volume in each scan. Extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and three representative devices demonstrate that HyperLiDAR outperforms state-of-the-art segmentation methods and accelerates retraining speed by up to 13.8×.

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH1963

Optimized Time-Dependent Hamiltonian Evolution Quantum Solver for Power System Transient Computations

Jiyuan Liu; Yongming Tang; Baoping Wang; Mengdi Sun; He Li

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Pioneering the Foundations of Quantum Circuit Synthesis, Compilation and Algorithmic Optimization +1
Abstract

Quantum computing has been demonstrated as a promising solution for computation-intensive power system electromagnetic transient (EMT) computations. However, existing work has focused on time-independent Hamiltonian evolution to solve ordinary differential equations (ODEs) with heavy classical interaction. This work proposes OHQS, an optimized time-dependent Hamiltonian evolution quantum solver, to enable high-accuracy computation for ODE-dominated EMT problems without classical iterations. The spatiotemporal mapping transformation is discretized to construct iterative quantum circuit evolution, leveraging an automatic optimizer based on spatiotemporal fixed points (SFPs) to reduce discretization errors. Experimental results demonstrate that OHQS can support time-dependent Hamiltonian evolution in practical applications with up to 99.91% accuracy and achieve up to 9.56× error reduction compared to classical methods. The observed exponential decay of the relative error implies that the number of system qubits scales as O(log(1/ξ)), where ξ is the target relative error tolerance.

DesignQuantumDES6. Quantum ComputingAI
RESEARCH1966

A Hamiltonian-Guided Pre-Trainer for Variational Quantum Algorithms

Taotao Zhao; He Li

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Pioneering the Foundations of Quantum Circuit Synthesis, Compilation and Algorithmic Optimization +1
Abstract

The optimization of variational quantum algorithms (VQAs) is notoriously challenging due to poor parameter initialization, which often traps optimizers in suboptimal local minima. Existing methods rely on a static guessing paradigm that is fundamentally limited. This paper presents a Hamiltonian-guided pre-trainer (HGP), a new approach that dynamically constructs a better starting point. HGP iteratively refines parameters by performing exact global optimization within low-dimensional subspaces. These subspaces are identified using a Hamiltonian-guided parameter blocking strategy, and the optimization is achieved by reconstructing the analytic landscape from a few quantum measurements via a Fast Fourier Transform. We evaluated HGP on canonical spin models, where it consistently produced superior starting points for standard optimizers. Ablation studies reveal that Hamiltonian-guided parameter blocking reduces the initial energy error by nearly 30-fold versus the next-best benchmark. These results highlight the importance of Hamiltonian-guided pre-training for enhancing VQA performance.
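
To make the "reconstruct the landscape, then optimize exactly" step concrete, here is a minimal sketch for the simplest possible subspace: a single parameter. It assumes the parameter enters through a Pauli-rotation gate, in which case the energy has the known form E(θ) = a + b·cos θ + c·sin θ, so three equispaced evaluations and one FFT recover it exactly. The function names and the toy energy function are hypothetical; the paper's multi-parameter blocking strategy is not reproduced here.

```python
import numpy as np

def reconstruct_1d(energy, n_samples=3):
    """Recover (a, b, c) of E(theta) = a + b*cos(theta) + c*sin(theta)
    from n_samples >= 3 equispaced evaluations using an FFT."""
    thetas = 2 * np.pi * np.arange(n_samples) / n_samples
    samples = np.array([energy(t) for t in thetas])
    coeffs = np.fft.fft(samples) / n_samples
    a = coeffs[0].real            # constant term
    b = 2 * coeffs[1].real        # cosine coefficient
    c = -2 * coeffs[1].imag       # sine coefficient
    return a, b, c


def global_minimum(a, b, c):
    """Exact minimizer of a + b*cos(theta) + c*sin(theta) over [0, 2*pi)."""
    amplitude = np.hypot(b, c)
    phase = np.arctan2(c, b)      # E(theta) = a + amplitude*cos(theta - phase)
    theta_star = (phase + np.pi) % (2 * np.pi)
    return theta_star, a - amplitude


# Toy stand-in for a measured expectation value (assumption, not real data).
def toy_energy(theta):
    return 0.5 + 1.2 * np.cos(theta) - 0.7 * np.sin(theta)

a, b, c = reconstruct_1d(toy_energy)
theta_star, e_min = global_minimum(a, b, c)
```

Because the reconstructed function is known in closed form, the minimum within the subspace is found without any gradient steps, so it cannot get stuck in a local minimum along that direction.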

DesignQuantumDES6. Quantum ComputingAI
RESEARCH1971

Test Point Insertion with ATPG-Free Self-Supervised Learning

Tae-Min Park; Sung-Hyuk Cho; Jeongyeol Lee; Jiwhan Kim; Nur Touba; Joon-Sung Yang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Leveraging AI for Silicon Health: Test, Yield & Fault Tolerance +1
Abstract

Optimal Test Point Insertion (TPI) to reduce pattern count is NP-hard. Traditional testability metrics fail in reconvergent fanout structures. Machine learning solutions suffer from costly ATPG labels (SL) or high runtime overhead and instability (RL). We propose an ATPG-free Test Point Insertion approach using self-supervised learning (SSL). We train an autoencoder on bit sequences to capture functional embeddings. A propagation model, trained via SSL, reliably propagates embeddings across reconvergent regions. Controllability is inferred from the embeddings; observability is estimated via saliency maps. Our pipeline eliminates ATPG dependency and RL overhead for effective TPI.

EDAEDA9. Test, Validation and Silicon Lifecycle ManagementAIQuantum
RESEARCH1976

CacheParallel: Comprehensively Exploring Multi-Level Acceleration for Diffusion Models

Qianru Lyu; Di Niu; Jinwei Xu; Jingfei Jiang; Xiangrui Yang; Sheng Ma; Dongsheng Li

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Fast, Furious, and Fault-Tolerant: Accelerating the Generative Grind +1
Abstract

Diffusion models, despite their exceptional generative capabilities, are bottlenecked by slow inference owing to their complex architectures and multi-step iterative processes. Although caching is a promising acceleration method, previous approaches suffer from tight coupling with specific network structures and low computational density. Moreover, caching is difficult to combine with other methods because the techniques are not orthogonal, and naive integration often only degrades model quality. To address these challenges, we propose CacheParallel, a novel training-free scheme that optimizes both caching and parallelism at the intra- and inter-step levels to reduce the computational load and increase the computational density simultaneously. We first introduce the fusion node to abstract network structures into generic pre-fusion and post-fusion computational streams. For the pre-fusion stream, we apply intra-step caching optimized by the proposed CPS algorithm. For the post-fusion stream, we exploit inter-step output correlations to enable parallel computing and boost computational density. To mitigate accumulated errors and non-orthogonality issues, we design an error correction mechanism based on our observation of inherent linear correlations in the models. Comparative experiments show significant performance improvements across various models, with the peak improvement surpassing 100%.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH1986

REFLEX: Rewrite-Free Row-Aligned Sparse Attention for Efficient LLM Execution on PIM

Juhong Park; Yintao He; Sangheum Yeon; Yiran Chen; Jong Hwan Ko

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Move Less, Compute More: Memory-Centric Architectures for AI +1
Abstract

Large language models (LLMs) face decoding bottlenecks as attention repeatedly accesses the key-value (KV) cache. Sparse attention and processing-in-memory (PIM) each reduce data movement, but their naive integration produces irregular KV accesses that span multiple DRAM rows, leading to unnecessary activations and cache rewrites. We present REFLEX, a rewrite-free sparse attention framework that colocates required KV entries in a single DRAM row and applies activation-aware scheduling for PIM execution. REFLEX preserves accuracy without hardware changes, achieving up to 1.64× throughput and 1.36× energy efficiency on PIM, and 1.37× throughput in GPU-PIM systems.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIQuantum
RESEARCH1991

PCBgen3D: A Self-Correcting Graph-Based MLLM Framework for Automated PCB 3D Models Generation

Zhaohai Di; Jindong Tu; Zhiyuan He; Yuan Pu; Tinghuan Chen

Date:Monday, July 27 Location:Mtg Room 101A Session:Self-Driving EDA: Agents, Benchmarks, and Evolving Toolchains +1
Abstract

Generating three-dimensional (3D) models for Printed Circuit Board (PCB) components from manufacturer datasheets is a foundational yet inefficient step in Electronic Design Automation (EDA) workflows. Current processes remain heavily manual, relying on skilled engineers to interpret 2D views, textual specs, and dimension tables. This manual dependence leads to high error rates and limited scalability. To tackle these challenges, we propose PCBgen3D, a self-correcting graph-based multimodal Large Language Model (MLLM) framework. It integrates a suite of coordinated modules, each enhanced by advanced MLLMs, to automate high-precision 3D modeling. These modules follow a structured pipeline of data extraction, task planning, and process refinement, overseen by an iterative self-correction loop. A core innovation is its dynamic task graph, which decomposes complex PCB modeling into adaptive sub-tasks and incorporates a self-correction mechanism to fix errors. Evaluated on a dataset containing components of various package types from different manufacturers, PCBgen3D outperforms state-of-the-art general-purpose MLLMs on industrial-grade PCB 3D modeling tasks.

AIAI2-I. AI/ML Algorithms and ModelsChipletSYS4. Embedded System Design Tools and Methodologies +1 more
RESEARCH1999

HAP: Efficient Quantization Harnessing Adaptive Precision for DNN Hardware Acceleration

Erjing Luo; Xinkuang Geng; Honglan Jiang; Leibo Liu; Jie Han

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

Post-training quantization (PTQ) avoids an expensive training process and is effective in accelerating Deep Neural Networks (DNNs). Since the representation capacity of a quantizer decreases exponentially with precision, PTQ encounters fundamental limitations in the low-precision domain. As an alternative, adaptive-precision quantization (APQ) leverages the sparsity in data distribution to reduce precision without sacrificing resolution. This is especially suitable for activations, as they tend to be sparser and more vulnerable than weights. However, existing APQ approaches require data to cluster heavily near zero, which is not guaranteed in asymmetric quantization, the scheme preferred for activations. Furthermore, the varying precision can lead to an unbalanced computational workload, making it difficult to effectively harvest the theoretical performance gain. To solve these problems, we propose a novel quantization and accelerator design, harnessing adaptive precision (HAP). It leverages a per-group dynamic zero-point to generalize APQ. Moreover, its intra-channel grouping strategy makes it possible to balance the variable-precision workload via reordering. Leveraging a channel-level dual-precision weight quantization scheme, it achieves superior accuracy compared with existing PTQ solutions at the level of 4 or 5 bits for a variety of DNN families. On hardware, a novel bit-serial accelerator featuring a lightweight reorder engine is developed. Results show it achieves a 2.65x speedup and a 55% energy reduction on average compared to existing accelerators.

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2006

Cell-Structure-Independent Transistor-Level Placement and MOL-Aware Routing Framework Beyond the Standard-Cell Paradigm

Keyu Peng; Hao Wu; Yinuo Wu; Qingsheng Qiu; Hao Gu; Ziran Zhu; Chao Wang; Yang Jun

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

Standard-cell-based design has long been the dominant approach in VLSI design, providing scalability and compatibility with established EDA tools. However, as design technology co-optimization (DTCO) becomes increasingly important at advanced process nodes, the fixed structure of standard cells limits the ability to optimize wirelength and area effectively, imposing a critical bottleneck at the Middle-of-Line (MOL) layers. In this context, dense pin access and M0 routing resources, critical for alleviating congestion on higher metal layers, are rigidly constrained by the cell boundary. A more flexible approach is to directly place transistors on the design canvas and perform routing at the transistor level, enabling physical-level optimization beyond the logic-level constraints imposed by standard-cell design. This work introduces a novel framework featuring cell-structure-independent transistor-level placement coupled with an MOL-aware routing engine. Experimental results demonstrate that the proposed framework achieves considerable improvements in wirelength and design area compared to the commercial standard-cell-based design tool and state-of-the-art transistor-level work.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH2007

TransSyn: Scalable, Area-Oriented Transistor Network Synthesis via Cut-Based Mapper with SAT-Based Refinement

Jiun-Cheng Tsai; Wei-Min Hsu; Hsuan-Ming Huang; Jen-Hang Yang; Heng-Liang Huang; Yen-Ju Su; Hung-Pin (Charles) Wen

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

SAT-based transistor network synthesis delivers optimal, minimum-transistor designs but faces exponential complexity, making it impractical for large circuits. This paper introduces TransSyn, a scalable transistor network synthesizer that brings the benefits of SAT-based synthesis to the macro-cell scale through a novel two-stage framework. First, a cut-based global synthesis stage reformulates the task as a cut-based mapping problem, rapidly decomposing the design into smaller, manageable sub-problems. This ensures scalability and produces a high-quality network in seconds. Then, a SAT-based detailed synthesis stage refines this network by systematically identifying and re-optimizing merged subcircuits across cut boundaries using a SAT-based approach, recapturing lost optimization opportunities. Experiments demonstrate that TransSyn significantly outperforms previous methods, achieving 11.2% and 10% reductions in transistor count compared to graph-based and hybrid-based approaches, respectively. Furthermore, when integrated with a transistor-level placer, TransSyn breaks the library and cell boundaries, yielding up to a 48.8% area reduction compared to conventional cell-based designs. TransSyn demonstrates its capability for scalable, high-quality transistor-network synthesis, successfully bridging the gap between the optimality of SAT-based synthesis and the practical demands of large-scale cell design.

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH2011

Dart: Towards Redundancy-Free RTL Simulation via DAG-Driven Execution

Jingyi Zhou; Yi Zhang; Yu Huang; Yang Wu; Chunyan Xu; Shaofei Li; Xueqi Li; Long Zheng; Xiaofei Liao; Hai Jin

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Register Transfer Level (RTL) simulation is key to pre-silicon verification, yet its performance has stagnated in the face of rapidly escalating SoC complexity. Although recent CPU- and GPU-based simulators improve throughput through per-stimulus optimization or batch parallelism, they execute each stimulus in isolation and therefore fail to exploit the substantial computation locality that emerges across stimuli. As many stimuli converge to identical internal states, large portions of the circuit are redundantly re-evaluated, forming a principal bottleneck to scalable RTL simulation. This paper introduces Dart, a Directed Acyclic Graph (DAG)-driven RTL simulation framework that systematically eliminates cross-stimulus redundancy. Dart constructs a DAG-based intermediate representation that makes structural commonality and shared subexpressions across stimuli explicit, enabling principled redundancy elimination through systematic sub-DAG merging. A computation-centric execution engine evaluates shared logic once and amortizes its results across all stimuli that traverse the corresponding state, while a lightweight state-reconstruction mechanism preserves per-stimulus correctness with negligible overhead. Across a suite of industrial RTL designs, Dart delivers speedups of up to 136.7x over Verilator and 4.1x over RTLflow, respectively.
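
The idea of evaluating shared logic once and reusing the result across stimuli can be illustrated with a minimal sketch. This is plain memoization keyed on (state, input), not Dart's DAG-based intermediate representation or its state-reconstruction mechanism; the names `simulate_batch` and `counter_step`, and the toy 4-bit counter, are hypothetical and chosen only for illustration.

```python
def counter_step(state, inp):
    """Toy design (an assumption): a 4-bit counter with a synchronous reset."""
    return 0 if inp == "rst" else (state + 1) % 16


def simulate_batch(step, initial_state, stimuli):
    """Run every stimulus, evaluating each distinct (state, input) pair once."""
    cache = {}        # (state, input) -> next state, shared by all stimuli
    evaluations = 0   # number of times the design logic was really evaluated
    finals = []
    for stimulus in stimuli:
        state = initial_state
        for inp in stimulus:
            key = (state, inp)
            if key not in cache:
                cache[key] = step(state, inp)
                evaluations += 1
            state = cache[key]
        finals.append(state)
    return finals, evaluations


stimuli = [
    ("inc", "inc", "rst", "inc"),
    ("inc", "rst", "inc", "inc"),
]
finals, evaluations = simulate_batch(counter_step, 0, stimuli)
```

In this toy run the two stimuli take eight steps in total, but they pass through the same states often enough that the design logic is evaluated only four times.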

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2014

OpenACMv2: An Accuracy-Constrained Co-Optimization Framework for Approximate DCiM

Yiqi Zhou; Yue Yuan; Yikai Wang; Bohao Liu; Qinxin Mei; Zhuohua Liu; Shan Shen; Wei Xing; Daying Sun; Li Li; Guozhu Liu

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Autonomous Synthesis and Intelligent Optimization for Analog and RF Circuits +1
Abstract

Digital Compute-in-Memory (DCiM) accelerates neural networks by reducing data movement. Approximate DCiM can further improve power–performance–area (PPA), but demands accuracy-constrained co-optimization across coupled architecture and transistor-level choices. Building on OpenYield, we introduce Accuracy-Constrained Co-Optimization (ACCO) and present OpenACMv2, an open framework that operationalizes ACCO via two-level optimization: (1) accuracy-constrained architecture search of compressor combinations and SRAM macro parameters, driven by a fast GNN-based surrogate for PPA and error; and (2) variation- and PVT-aware transistor sizing for standard cells and SRAM bitcells using Monte Carlo. By decoupling ACCO into architecture-level exploration and circuit-level sizing, OpenACMv2 integrates classic single- and multi-objective optimizers to deliver strong PPA–accuracy tradeoffs and robust convergence. The workflow is compatible with FreePDK45 and OpenROAD, supporting reproducible evaluation and easy adoption. Experiments demonstrate significant PPA improvements under controlled accuracy budgets, enabling rapid "what-if" exploration for approximate DCiM.

EDAEDA6. Analog CAD, Simulation, Verification and TestSystems
RESEARCH2024

Rectilinear Soft Module Floorplanning via Differentiable Polygon Shaping Model and Flow-Based Legalization

Fuxing Huang; Qiyuan Chen; Hao Wu; Hao Gu; Wenxing Zhu; Xinning Liu; Ziran Zhu

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

Fixed-outline floorplanning remains a critical challenge in VLSI physical design, particularly for modern System-on-Chips (SoCs) that employ rectilinear soft modules to maximize area utilization and reduce wirelength. Existing methods adopt an incremental optimization strategy, introducing rectilinear shape considerations only after obtaining an initial rectangular solution. These approaches restrict module flexibility in early stages, limit the solution space, and degrade the final Quality of Results (QoR). We propose a unified floorplanning framework that, for the first time, integrates the optimization of rectilinear soft modules into both the global floorplanning and legalization stages. To address the computational challenges posed by rectilinear polygons, we introduce a differentiable polygon shaping model that represents and dynamically optimizes module shapes using a structured set of continuous variables during global floorplanning. We further design an augmented Lagrangian method to efficiently manage complex constraints with theoretical convergence guarantees, enhancing solution quality and stability. During legalization, we model the problem as a minimum-cost flow formulation whose objective is to minimize the total displacement, thereby ensuring a legal and high-quality final floorplan. Experimental results on GSRC and MCNC benchmarks demonstrate that our method achieves better solution quality, reducing half-perimeter wirelength by an average of 4% and up to 14% compared with state-of-the-art floorplanners, while maintaining competitive runtime performance.

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH2027

Relay-GS: Reusing Temporal Sort Information for 4D Gaussian Splatting Acceleration

Yeonggeon Kim; Seongmin Ki; Sangkyu Jeon; Sungju Ryu

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

This paper presents Relay-GS, an algorithm-hardware co-design that accelerates 4D Gaussian splatting (4DGS) rendering. Conventional methods process each frame independently and repeatedly execute the heavy sorting stage. This results in performance bottlenecks during real-time rendering. To mitigate this inefficiency, our approach introduces optimizations at both the algorithmic and architectural levels. At the algorithmic level, we propose selective Gaussian sorting (SGS), which reuses the sorted results of previous frames, and periodic correction to maintain rendering quality. At the architectural level, we design a fine-grained pipeline to hide sorting latency and a parallel rasterization unit that reduces operations through parameter sharing. Our experimental results show that the proposed accelerator achieves a speedup of 1.55x to 1.64x and up to 16.5% in energy savings over a baseline design, while maintaining a rendering quality with negligible degradation compared to that of full sorting.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2030

On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices

Lianming Huang; Haibo Hu; Qiao Li; Nan Guan; Chun Jason Xue

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Sparsity is essential for deploying large models on resource-constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework specifically designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method effectively mitigates the cold-start latency inherent in traditional monolithic approaches. Experiments on a real-world autonomous driving platform demonstrate that our framework achieves superior switching efficiency, accelerating task switching by over 6.6× on average compared to existing sparsity methods.
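
The differential loading step described above can be sketched with set arithmetic. This is a minimal illustration under stated assumptions: each task's sparse model is represented as a set of hypothetical integer block IDs, and `switch_task` is an invented helper name, not the authors' interface.

```python
def switch_task(resident_blocks, next_task_blocks):
    """Compute which weight blocks to load and evict when switching tasks.

    Both arguments are sets of block IDs (hypothetical integers here).
    Blocks shared by the two tasks stay resident and incur no I/O.
    """
    to_load = next_task_blocks - resident_blocks    # the differential set
    to_evict = resident_blocks - next_task_blocks   # no longer needed
    new_resident = (resident_blocks - to_evict) | to_load
    return new_resident, to_load, to_evict


# Task A currently resident; task B shares blocks 2 and 3 with it.
task_a = {0, 1, 2, 3}
task_b = {2, 3, 4}
resident, loaded, evicted = switch_task(task_a, task_b)
```

The more the sparse structures of two tasks overlap, the smaller `to_load` becomes, which is why aligning sparsity patterns across tasks reduces switching cost.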

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH2036

PipeGS: Unlocking 3DGS Pipeline Parallelism via Hierarchical Reuse and Dynamic Partitioning

Xueling Wang; Yuzhou Chen; Zhican Wang; Pan Zhao; Renda Jian; Yitian Chen; Chen Zhang; Yang Hu; Guanghui He

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

3D Gaussian Splatting (3DGS) delivers high quality and speed for 3D scene reconstruction. However, achieving real-time performance on edge platforms remains challenging due to pipeline inefficiencies. Our profiling reveals two major bottlenecks: (1) insufficient pipeline overlap due to the latency gap between sorting and rasterization, and (2) pipeline parallelism constrained by globally serialized preprocessing. To address these challenges, we present PipeGS, a 3DGS accelerator with algorithm–architecture co-design. At the algorithm level, we propose hierarchical Gaussian reuse, which shortens rasterization latency by eliminating redundant computation and reduces pipeline bubbles by 76.2%. To further unlock full pipeline overlap, we introduce dynamic orthogonal partitioning, which breaks global serial dependencies and hides 63% of preprocessing latency. At the hardware level, PipeGS employs a customized layer-wise pipelined architecture that supports concurrent execution across stages. Implemented in a 28 nm technology, PipeGS achieves a 1.96× to 3.34× improvement in area efficiency and a 1.20× to 2.77× improvement in energy efficiency compared with SOTA 3DGS accelerators.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2037

LRTA: Routability-Driven 3D-Aware Track Assignment with Layer Reassignment

Pengcheng Huang; Zepeng Li; Zhenkun Lin; Guangyong Chen; Genggeng Liu; Wen-Hao Liu; Xing Huang

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

As a critical stage in chip design, routing comprises global routing and detailed routing, with track assignment (TA) serving as an intermediate step that bridges the two phases. However, existing methods consider only 2D mismatches and overlook 3D coordination, resulting in degraded routing quality. This paper proposes LRTA, a routability-driven 3D-aware track assignment framework with layer reassignment to coordinate the resources across layers while preserving connectivity. In addition, a GPU-accelerated TA scheme is proposed to construct candidates and dynamically assign tracks, enabling efficient concurrent evaluation. Experimental results demonstrate that, compared with existing works, the proposed LRTA achieves significant improvements in routability estimation with shorter runtime.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH2048

Efficient HLS Accelerator Floorplan on Multi-Die FPGA Aided by Graph Neural Networks

Haoran Xue; Teng Wang; Qianyu Cheng; Zhendong Zheng; Wenqi Lou; Lei Gong; Xi Li; Xuehai Zhou

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

High-level synthesis (HLS) and multi-die FPGAs have been widely applied in large-scale accelerator design. To address cross-die boundary delays and local congestion issues in HLS designs on multi-die FPGAs, prior work proposed a coarse-grained floorplanning and pipelining method to improve frequency performance. However, achieving higher frequency still requires multiple iterative implementations in Vitis to tune parameters, incurring significant time overhead. To accelerate this process, we propose a graph neural network (GNN) based floorplan quality predictor at the FPGA slot level, achieving an accuracy of 84.13% and an F1-score of 0.86 in the congestion prediction task, with an average inference latency of only 0.58 ms. Compared with the traditional tool flow that requires tens of hours, our method enables millisecond-level floorplan parameter fine-tuning, improving an unroutable case and a 318.2 MHz case to 329.2 MHz and 330.3 MHz, respectively. Furthermore, we integrate our method into the HLS design-space exploration framework, achieving an average frequency improvement of 11.2%, with the maximum improvement reaching 23.8%. The execution time of all benchmarks is reduced by 30.9%.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2056

GPU-Fuzz: Finding Memory Errors in Deep Learning Frameworks

Zihao Li; Hongyi Lu; Yanan Guo; Zhenkai Zhang; Shuai Wang; Fengwei Zhang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Robust and Trusted AI Systems: Attacks, Defenses, and Hardware Reliability +1
Abstract

GPU memory errors are a critical threat to deep learning (DL) frameworks, leading to crashes or even security issues. We introduce GPU-Fuzz, a fuzzer that addresses this issue by modeling operator parameters as formal constraints. GPU-Fuzz uses a constraint solver to generate test cases that systematically probe error-prone boundary conditions in GPU kernels. Applied to PyTorch, TensorFlow, and PaddlePaddle, GPU-Fuzz uncovered 13 previously unknown bugs, demonstrating its effectiveness in finding memory errors.

SecuritySEC1. AI/ML Security/PrivacyAIQuantum
RESEARCH2057

Accshield: Operator-Aware Robust Scheduling for Fault-Tolerant AI Accelerators

Chenyang Wang; Hengshan Yue; Qianhao Wang; Tianle Li; Yongzhe Ma; Zeyu Guan; Jingweijia Tan; Xiaohui Wei

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Leveraging AI for Silicon Health: Test, Yield & Fault Tolerance +1
Abstract

To meet the increasing computational demands of large language models (LLMs), multi-systolic-array architectures are widely adopted by AI accelerators. However, compared with single-array designs, hardware faults in multi-array accelerators propagate in more complex ways and cause more severe reliability degradation. Directly deploying single-array fault-tolerant mechanisms on multi-array architectures introduces significant overhead. To address this limitation, we propose a Reliability-Aware Scheduling framework that jointly considers hardware-level reliability variations, operator-level error sensitivity, and inter-operator data dependencies. The scheduler maps insensitive operators to faulty arrays and reserves reliable arrays for critical computations, significantly improving reliability with minimal performance loss.

EDAEDA9. Test, Validation and Silicon Lifecycle ManagementAIQuantum
RESEARCH2058

SharedKD: Gradient-Guided Dynamic Student Discovery Inside a Single 3D Object Detector

Hyunjoon Cho; Sangho An; Hyunsung Park; Jangho Kim

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

While state-of-the-art 3D object detectors achieve high accuracy, their computational cost hinders deployment. Knowledge Distillation (KD) offers compression, but existing methods use separate teacher-student networks, incurring memory overhead with static teachers that cannot adapt to student progress. We introduce SharedKD, unifying pruning and distillation within a single network. The full network acts as a dynamic teacher while a pruned sub-network serves as student, eliminating separate teachers, reducing memory, and enabling co-evolution for effective knowledge transfer. On nuScenes, SharedKD achieves 2.55% higher NDS than prior state-of-the-art at 75% pruning ratio, demonstrating exceptional accuracy-efficiency balance.

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH2071

SpRDA: An Efficient Sparse Accelerator for Robotics Diffusion Model

Renda Jian; Yuzhou Chen; Pan Zhao; Yitian Chen; Zhican Wang; Xueling Wang; Chen Zhang; Guanghui He

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

Diffusion models have recently demonstrated strong performance in robotic manipulation, standing out for their ability to handle multi-modal action distributions and high-dimensional action spaces. However, their iterative computation limits real-time control on resource-constrained robot platforms. Some existing diffusion accelerators exploit sparsity based on the similarity of adjacent steps, but they suffer from low sparsity across the full network and from hardware inefficiency caused by under-utilization of processing elements (PEs) and neglect of the non-matrix-multiplication operators (NMOs), which become more significant in sparse models. To address these issues, we propose SpRDA, an algorithm-architecture co-design accelerator that achieves high-speed and energy-efficient inference for robotics diffusion models. A fine-grained, difference-aware cache algorithm combining differential computing and caching is proposed to identify more redundant computation in robotics diffusion, achieving over 90% sparsity with no distinct accuracy degradation. Furthermore, we apply two-level group sparsity adjustment and design a dynamic sparsity-aware matrix processing array to fully leverage sparsity in hardware. We also propose a graph-based vector unit to process various NMOs with higher throughput and lower area. Compared to state-of-the-art diffusion accelerators, SpRDA achieves 1.56-2.60× speedup and 6.45-7.85× energy efficiency improvement.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2074

Z-ParaSwap: Co-Optimizing ZNS Parallelism and Swapping for Heterogeneous Graph Neural Network

Shun-Ying Hsieh; Tseng-Yi Chen; Yuan-Hao Chang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

Zoned Namespace (ZNS) solid-state drives (SSDs) have been widely adopted in database systems, file systems, and large-scale data centers because they expose part of the physical storage layout to the host system. By leveraging this visibility, host software can access correlated data in parallel and eliminate valid-data copying during garbage collection, thereby improving I/O efficiency. To further integrate ZNS SSDs into memory management, prior work developed a Linux-based ZNS swapping mechanism that reserves multiple zones as swap space, extending virtual memory capacity. However, when deploying Heterogeneous Graph Neural Networks (HetGNNs), this mechanism fails to fully exploit ZNS parallelism due to HetGNN's highly irregular and non-sequential access patterns, which lead to chip-level congestion and long-tail latency. To address this issue, we propose Z-ParaSwap, a ZNS-based Parallelism-Aware Swapping Management framework for HetGNN applications. Z-ParaSwap identifies access correlations in graph data and distributes highly correlated pages across different chips to enhance parallelism and mitigate I/O contention. Experimental results show that Z-ParaSwap reduces average swapping latency by 34% and tail latency by 44%, significantly improving overall HetGNN execution efficiency on ZNS-based systems.
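
The placement idea, spreading pages that tend to be accessed together across different chips so they can be read in parallel, can be sketched as a round-robin assignment. This is an illustrative simplification: the correlation groups are assumed to be given, and `place_pages` is a hypothetical helper, not the Z-ParaSwap mechanism or any Linux swap interface.

```python
def place_pages(correlated_groups, n_chips):
    """Assign each page to a chip so pages in one group land on distinct chips.

    correlated_groups: list of lists of page IDs accessed together (assumed
    to come from some prior correlation analysis). Distinctness within a
    group holds whenever the group has at most n_chips pages.
    """
    placement = {}
    offset = 0  # rotate the starting chip so load spreads across groups
    for group in correlated_groups:
        for i, page in enumerate(group):
            placement[page] = (offset + i) % n_chips
        offset = (offset + len(group)) % n_chips
    return placement


groups = [[1, 2, 3], [4, 5]]
placement = place_pages(groups, n_chips=4)
```

With this layout, swapping in pages 1, 2, and 3 together touches three different chips instead of queuing on one.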

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH2079

Towards Practical Live Migration for Heterogeneous Confidential Virtual Machines

Jianquan Cai; Xiaoxi Ren; Shuang Liu; Shoumeng Yan; Kehuan Zhang; Yinqian Zhang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

The security properties and practical efficiency of Trusted Execution Environments (TEEs) have given them an increasingly critical role in cloud computing, e.g., in Confidential Virtual Machines (CVMs). However, TEE implementations, which are closely tied to specific hardware, introduce compatibility challenges for upper-layer cloud services. In this paper, we focus on live migration, which Cloud Service Providers (CSPs) widely use with Normal Virtual Machines (NVMs) to manage computing resources, e.g., upgrading the host system without taking services down. Currently, TEE vendor solutions support CVM migration only within homogeneous TEE stacks and do not consider heterogeneous environments, rendering heterogeneous migration difficult or even impossible. To narrow the gap, we propose a generic framework applicable to various x86 TEEs, providing insights into achieving heterogeneous migration. In addition, we implement a migration system based on this framework that achieves migration between AMD SEV and Hygon CSV by introducing a trusted migration agent with a specific design. The key idea is to emulate the required migration commands of the TEE firmware and resolve compatibility issues through a helper agent. Our prototype offers security guarantees comparable to homogeneous CVM migration, with experiments demonstrating acceptable performance overhead.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH2083

CHiRM-DSE: CodeLLM-Based Hierarchical and Rule-Mining Guided DSE for FPGA Accelerators

Shi Chao; Teng Wang; Zhu Qinggang; Qianyu Cheng; Jing Wu; Wenqi Lou; Lei Gong; Zhou Xuehai

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

High-Level Synthesis (HLS) lowers the barrier to FPGA development by allowing a wider range of programmers to design hardware accelerators. However, determining the appropriate synthesis directives (pragmas) remains a major challenge, particularly for developers without hardware expertise. As designs grow more complex and the pragma search space expands, choosing the right pragmas becomes essential for achieving low resource usage and high performance. Automated design space exploration (DSE) provides an effective solution to this challenge, but the enormous search space and time-consuming design-point evaluation call for efficient search strategies. Existing search strategies mostly rely on inefficient exhaustive search, hyperparameter-sensitive metaheuristic methods, or dedicated methods that are difficult to port and generalize. To address these issues, we propose a rule-mining-based search strategy that efficiently guides exploration toward the most promising regions of the design space. In addition, we introduce a design space decomposition method to prune the search space, as well as a CodeLLM-based design-point evaluation method that is both faster than directly invoking HLS tools and more accurate than prior GNN-based approaches. Experimental results on four widely used HLS benchmarks demonstrate that, under the same time budget, our DSE framework achieves better Quality of Results (QoR) than the state of the art. Our demo code is released at https://github.com/ScopeHLS/CHiRM-DSE

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2085

Prelude: Priming-Guided State Reconstruction for Efficient FPGA Processor Debugging

Jialin Sun; Yuchen Hu; Dean You; Hui Wang; Yushu Du; Xinwei Fang; Zhe Jiang

Date:Monday, July 27 Location:Mtg Room 203C Session:Streamlining Silicon: From Real-Time Scheduling to Efficient GPU Execution +1
Abstract

Debugging complex FPGA prototypes of modern processors in embedded systems is challenging due to limited signal visibility and significant tracing overhead. Existing approaches struggle to balance execution efficiency with debugging capability, often requiring either expensive continuous tracing or heavyweight snapshots. We propose Prelude, a lightweight snapshot-based debugging framework that records only essential architectural states and memory footprint of the processor on FPGA. During replay, a short visibility warm-up reconstructs internal micro-architectural states, enabling cycle-accurate analysis. Implemented on BOOM and Rocket, Prelude provides comparable visibility to prior work while significantly improving debugging efficiency: 32.88× / 2191.2× speedup over DESSERT / ENCORE on BOOM, and 18.09× / 896.4× on Rocket.

SystemsSYS4. Embedded System Design Tools and MethodologiesChipletEDA
RESEARCH2094

DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

Maoyang Xiang; Bo Wang

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also significantly affect system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16x and decreases DSP utilization by 16x while maintaining comparable or better performance across vision Transformers and GPT-2 models.
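
The idea of placing finer segments where pre-activations are most likely can be illustrated with a small sketch. It is not the DAPA algorithm: it assumes the pre-activations follow a standard normal distribution, places breakpoints at equal-probability quantiles, and compares the distribution-weighted squared error against uniformly spaced breakpoints. All function names, the range [-4, 4], and the segment count are hypothetical choices for illustration.

```python
import math
from bisect import bisect_right
from statistics import NormalDist

# Assumption: pre-activations are modeled as a standard normal distribution.
STD_NORMAL = NormalDist()

def gelu(x):
    """Exact GELU, x * Phi(x), used as the reference function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def uniform_breaks(lo, hi, segments):
    """Equally spaced breakpoints over [lo, hi]."""
    step = (hi - lo) / segments
    return [lo + i * step for i in range(segments + 1)]

def quantile_breaks(lo, hi, segments):
    """Breakpoints at equal-probability quantiles, so dense regions get finer segments."""
    inner = [STD_NORMAL.inv_cdf(i / segments) for i in range(1, segments)]
    return [lo] + inner + [hi]

def pwl_gelu(breaks, x):
    """Piecewise-linear GELU through the breakpoints; 0 below the range, x above it."""
    if x <= breaks[0]:
        return 0.0
    if x >= breaks[-1]:
        return x
    k = bisect_right(breaks, x) - 1
    x0, x1 = breaks[k], breaks[k + 1]
    y0, y1 = gelu(x0), gelu(x1)
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def weighted_mse(breaks, lo=-4.0, hi=4.0, samples=801):
    """Riemann-sum estimate of the squared error weighted by the assumed density."""
    step = (hi - lo) / (samples - 1)
    total = 0.0
    for i in range(samples):
        x = lo + i * step
        err = pwl_gelu(breaks, x) - gelu(x)
        total += STD_NORMAL.pdf(x) * err * err * step
    return total

if __name__ == "__main__":
    print("uniform :", weighted_mse(uniform_breaks(-4.0, 4.0, 8)))
    print("quantile:", weighted_mse(quantile_breaks(-4.0, 4.0, 8)))
```

With the same number of segments, the quantile-spaced breakpoints give a lower weighted error under the assumed distribution, which is the intuition the abstract describes.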

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2098

PiMM-NoC: Process-in-Memristor-Memory NoC with RL Mapping Framework for Versatile AI Models

Ziang Xie; Yaoyu Tao; Zelun Pan; Haojun Chen; Zhiyuan Li; Qinghao Wang; Zhiming Pan; Yihang Zhu; Zixiang Luo; Yian Yang; Guang Mo; Kaiwen Long; Yaodong Yang; Yuchao Yang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

Process-in-memristor-memory (PiMM) offers promising in-/near-memory compute capability. However, existing PiMM designs support only a narrow set of model types and depend on coarse, static mapping strategies that overlook the vast design space, resulting in limited adaptability, scalability, and overall performance. In this work, we propose a Process-in-Memristor-Memory Network on Chip (PiMM-NoC) architecture combined with a reinforcement learning (RL) based mapping framework to optimize on-chip latency for versatile AI workloads. PiMM-NoC integrates two types of tiles: PiMM tiles for weight-stationary operations and PU tiles for non-weight-stationary and nonlinear computations. A cycle-accurate architecture simulator is incorporated into an end-to-end hardware–software co-design framework that uses Monte Carlo Tree Search (MCTS) to automatically search efficient mapping strategies. Experiments show that PiMM-NoC with the RL-based mapping framework achieves up to 3.45× speedup on DNNs and 3.85× on LLMs over existing mapping strategies, and up to 71.6× higher performance and 6.7× better energy efficiency compared to state-of-the-art AI accelerators.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH2105

Acc-SpMV: Accelerating General-Purpose Sparse Matrix-Vector Multiplication with GPU CUDA Cores

Tang Lei; Xin Zhikuang; Wang Zijian; Zhou Chunbao; Wang Jue; Wang Yangang

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Fast, Furious, and Fault-Tolerant: Accelerating the Generative Grind +1
Abstract

General Sparse Matrix-Vector Multiplication (SpMV) is a fundamental kernel in scientific computing, graph analysis and deep learning. However, fully exploiting the performance of CUDA cores requires systematic optimization of SpMV. In this paper, we propose OmniSpMV, a high-performance SpMV library on CUDA cores, with multiple optimizations, including data-locality-aware reordering, memory-efficient tiling, sparsity-aware load balancing and a highly optimized SpMV kernel. Extensive experimental results on various NVIDIA GPU architectures with 2715 matrices show that OmniSpMV achieves significant average performance improvements: 2.58x (up to 6.39x) speedup on RTX 4090, 1.91x (up to 4.92x) speedup on A800, and 1.78x (up to 8.21x) speedup on H100 over cuSPARSE designed for CUDA cores, and 1.93x (up to 15.67x) speedup on RTX 4090, 1.70x (up to 15.02x) speedup on A800, and 1.75x (up to 14.53x) speedup on H100 over DASP designed for Tensor cores.
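
For readers unfamiliar with the kernel, the following is a minimal reference sketch of SpMV over the CSR format, plus a simple way to split rows so that each partition holds a similar number of nonzeros. It is a CPU illustration only and does not reproduce the library's reordering, tiling, or GPU kernels; the function names and the nonzero-count partitioning rule are assumptions made for this example.

```python
from bisect import bisect_left

def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

def balance_rows(row_ptr, parts):
    """Split rows into contiguous ranges holding roughly equal nonzero counts.

    Hypothetical stand-in for sparsity-aware load balancing: returns
    parts + 1 row boundaries.
    """
    n_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, parts):
        bounds.append(bisect_left(row_ptr, p * nnz / parts))
    bounds.append(n_rows)
    return bounds

if __name__ == "__main__":
    # A = [[10, 0, 0, 0], [0, 20, 30, 0], [0, 0, 0, 0], [40, 0, 50, 60]]
    row_ptr = [0, 1, 3, 3, 6]
    col_idx = [0, 1, 2, 0, 2, 3]
    values = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0, 4.0]))
    print(balance_rows(row_ptr, 2))
```

Splitting by nonzero count rather than by row count keeps a few very long rows from dominating one partition, which is the imbalance that irregular matrices tend to cause.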

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH2107

ODBP: Obfuscation Design Approach for Biochemical Protocols in Continuous-Flow Microfluidic Biochips

Bowen Liu; Yuhan Zhu; Youlin Pan; Ruping Zhou; Zhisheng Chen; Genggeng Liu; Xing Huang

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Where Biology Meets Computation: Next‑Gen Bio‑Cyber‑Physical Innovations +1
Abstract

The intellectual property (IP) of continuous-flow microfluidic biochips (CFMBs) faces the risk of reverse engineering (RE) attacks during the IP transfer process from designers to manufacturers. This IP comprises not only hardware layouts but also biochemical protocols that specify bioassay procedures, whose theft can lead to technology leakage, trust crises, and substantial economic losses. To address this risk, this paper presents ODBP, the first obfuscation design approach for biochemical protocols in CFMBs. This approach contains four complementary techniques: an anti-RE strength measurement, a dependency-preserving and function-covering strategy, an executable dummy scheme, and a position selection strategy. By incorporating these techniques into the position selection of dummy operations and dummy edges, ODBP generates obfuscated biochemical protocols that can effectively resist RE attacks. Experimental results on multiple benchmarks show that the proposed approach generates obfuscated biochemical protocols that reach the target anti-RE lifetime with low protocol-level cost. Moreover, architecture prediction based on these obfuscated protocols shows that the corresponding CFMB architectures can be realized at a lower overall cost.

DesignDES3. Emerging Models of ComputationQuantumEDA
RESEARCH2123

DyPamear: Efficient and Scalable Dynamic Graph Pattern Mining on Practical Processing-in-Memory Architecture

Yi Zhang; Yu Huang; Deting Chen; Chaoqiang Liu; Haifeng Liu; Xueqi Li; Xiaofei Liao; Hai Jin

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

Dynamic Graph Pattern Mining (DGPM) has been widely applied in various domains. However, existing solutions still suffer from severe memory access bottlenecks due to the irregular and data-intensive nature of DGPM workloads. In this paper, we propose DyPamear, the first full-stack hardware-software co-designed system for accelerating DGPM on practical Processing-in-Memory (PIM) hardware. DyPamear is built atop UPMEM, an emerging commercially available PIM platform. To fully exploit UPMEM's bandwidth and parallelism, DyPamear introduces a cross-layer design that integrates load-aware task distribution, data-driven asynchronous execution, and a degree-adaptive set intersection kernel to balance load and alleviate architectural constraints. Evaluations on real UPMEM hardware show that DyPamear achieves average speedups of 267.38x, 82.52x, and 8.78x over Cheetah, PimPam, and PSMiner, respectively, and scales nearly linearly to 20,480 DPUs. The source code is available at https://github.com/DyPamear-AE/DyPamear-AE.
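
Set intersection of neighbor lists is the core operation in pattern mining, and its best strategy depends on how different the two vertex degrees are. The sketch below shows one common way to make that choice adaptive: a linear merge when the lists are of similar length and a binary-search probe when one list is much shorter. It is an illustration of the general technique, not the paper's kernel; the function names and the `skew_threshold` value are hypothetical, and both inputs are assumed to be sorted lists without duplicates.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Two-pointer merge; cost grows with len(a) + len(b)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_search(small, large):
    """Probe each element of the short list into the long list by binary search."""
    out = []
    lo = 0
    for v in small:
        lo = bisect_left(large, v, lo)  # resume from the last position found
        if lo == len(large):
            break
        if large[lo] == v:
            out.append(v)
    return out

def intersect_adaptive(a, b, skew_threshold=8):
    """Pick the strategy from the degree ratio (threshold is an assumed tuning knob)."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if len(small) * skew_threshold < len(large):
        return intersect_search(small, large)
    return intersect_merge(small, large)

if __name__ == "__main__":
    print(intersect_adaptive([1, 3, 5, 7], [3, 4, 5, 6, 7, 8]))
    print(intersect_adaptive([5, 50, 500], list(range(100))))
```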

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH2125

Overmind NSA: A Unified Neuro-Symbolic Computing Architecture with Approximate Nonlinear Activations and Preemptive Memory Bypass

Weilun Wang; Zirui Wang; Wantong Li

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Neuro-symbolic AI is gaining traction in domains such as large language models, scientific discovery, and autonomous systems due to its ability to combine perception with structured reasoning. However, its deployment is often constrained by high memory demands, diverse computation patterns, and complex hardware requirements. Existing hardware platforms struggle with large on-chip memory overheads, frequent pipeline stalls, limited I/O bandwidth, and inefficient handling of nonlinear operations. To address these key computational bottlenecks, we propose Overmind, a unified neuro-symbolic architecture with cross-layer optimizations. Overmind tackles these core bottlenecks through Padé approximations for universal nonlinear functions, preemptive memory bypass that eliminates costly on-chip caches, and a complete software stack that optimizes model deployment. By reconfiguring the Padé orders for approximating nonlinear functions, we also demonstrate adaptive accuracy-performance scaling. Overmind achieves an energy efficiency of 8.1 TOPS/W and a throughput of 410 GOPS for mixed neuro-symbolic workloads with minimal model accuracy loss. Compared to existing solutions, Overmind improves performance and efficiency with significantly fewer hardware resources.
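
The accuracy-performance trade-off of changing the Padé order can be seen on a single function. The sketch below uses tanh as an example (the abstract does not say which functions or orders Overmind uses, so this choice is an assumption) and compares the standard [1/2] and [3/2] Padé approximants around zero. Function names are hypothetical.

```python
import math

def tanh_pade_1_2(x):
    """[1/2] Pade approximant of tanh: 3x / (3 + x^2)."""
    return 3.0 * x / (3.0 + x * x)

def tanh_pade_3_2(x):
    """[3/2] Pade approximant of tanh: x(15 + x^2) / (15 + 6x^2)."""
    return x * (15.0 + x * x) / (15.0 + 6.0 * x * x)

def max_abs_error(approx, lo=-1.0, hi=1.0, samples=201):
    """Largest deviation from math.tanh on an evenly spaced grid over [lo, hi]."""
    step = (hi - lo) / (samples - 1)
    return max(abs(approx(lo + i * step) - math.tanh(lo + i * step))
               for i in range(samples))

if __name__ == "__main__":
    print("order [1/2]:", max_abs_error(tanh_pade_1_2))
    print("order [3/2]:", max_abs_error(tanh_pade_3_2))
```

The higher-order form costs one extra multiply-add in the numerator and is markedly more accurate on this interval, which is the kind of knob a reconfigurable rational-function unit could expose.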

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2128

Enabling Robust On-Device Personalization via Pulse-Based Mapping and Noise Suppression for Non-Ideal CIM Architecture

Yiyang Shi; Xinyu Liu; Zhenge Jia; Shaocong Wang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

Computing-in-memory (CIM) architectures alleviate the data movement bottleneck by performing neural computations directly within memory arrays, which significantly addresses the resource constraints for edge AI applications. As edge applications increasingly demand user-specific adaptation, on-device deep model personalization has become essential for supporting evolving environments, privacy-sensitive data, and low-latency intelligence. However, in CIM-based systems, deep model parameters cannot be efficiently extracted for external fine-tuning due to the high cost of reading, digitizing, and transferring analog device states. Consequently, personalization is expected to be executed directly on the CIM hardware through in-situ weight updates. Prior works have focused primarily on improving inference robustness under non-ideal CIM conditions, leaving the problem of robust on-device fine-tuning fundamentally unaddressed. To fill the gap, we propose CIM-MP, a hardware–software co-optimized framework enabling stable and accurate on-device personalization under noisy CIM conditions. CIM-MP introduces a pulse-based mapping strategy that ensures convergence during in-memory weight updates. To further enhance robustness, we propose a Feature Variation Elimination (FVE) mechanism to mitigate feature-map noise in forward propagation, and a Gradient Adaptive Purification (GAP) mechanism to refine gradients during backpropagation. Experiments show that CIM-MP achieves up to 35.1% accuracy improvement over state-of-the-art approaches, demonstrating the feasibility of efficient and robust on-device learning directly on CIM platforms.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH2131

EastPCG: An Efficient and Self-Tuning Graph Sparsification Based PCG Solver for Circuit Simulation

Baiyu Chen; Wenjian Yu

Date:Monday, July 27 Location:Mtg Room 202AB Session:Closing Integrity Faster: Physics- and Learning-Based Methods for Power, EM/IR, and Thermal +1
Abstract

Graph spectral sparsification plays an important role in many EDA applications. For preconditioned conjugate gradient (PCG) solvers, graph spectral sparsification is a promising preconditioning technique in both theory and practice. In this paper, a highly efficient and stable graph sparsification algorithm based on spectral probability is proposed. Meanwhile, targeting the minimum total solution time of linear equations with multiple right-hand sides, an efficient self-tuning PCG framework powered by neural networks is proposed. Combining the proposed techniques, an efficient and self-tuning graph sparsification based PCG solver, named EastPCG, is finally developed. Extensive experiments on various benchmarks have demonstrated the advantages of the proposed algorithms over existing counterparts.
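
As background, the PCG iteration itself is short. The sketch below is a textbook PCG loop on a small dense symmetric positive definite system, with a Jacobi (diagonal) preconditioner swapped in for simplicity. It does not implement graph sparsification or the self-tuning framework, and all names are hypothetical. The preconditioner is passed as a function so that the same one can be reused across several right-hand sides.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def jacobi_preconditioner(A):
    """Return a function applying M^-1 with M = diag(A) (a simple stand-in)."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, diag)]

def pcg(A, b, apply_precond, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient for SPD A; returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x with x = 0
    z = apply_precond(r)
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol:
            return x, it
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    precond = jacobi_preconditioner(A)   # built once, reused per right-hand side
    for b in ([6.0, 10.0, 8.0], [4.0, 1.0, 0.0]):
        print(pcg(A, b, precond))
```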

EDAEDA4. Power Analysis and OptimizationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2132

HyperWafer: Communication-Aware Sparse Matrix Multiplication on Wafer-Scale Chips

Zhongcheng Du; Yu Huang; Yi Zhang; Qihang Qiu; Chencheng Ye; Huanyu Wu; Qie Wang; Xueqi Li; Long Zheng; Xiaofei Liao; Hai Jin

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

Sparse-matrix sparse-matrix multiplication (SpGEMM) is a key kernel across many domains, yet its performance on modern architectures is dominated by highly irregular data movement. Wafer-scale chips (WSCs) offer unprecedented bandwidth and massive parallelism, making them a compelling platform for SpGEMM. However, sparsity-induced irregularity generates substantial communication overheads, resulting in severe underutilization. In this work, we present HyperWafer, the first communication-aware SpGEMM framework for WSCs, combining a hypergraph-guided execution model with dedicated architectural support. HyperWafer captures SpGEMM's inherently high-order dataflow dependencies, which arise from multiway row sharing and overlapping reduction scopes, using a weighted hypergraph abstraction that enables communication-centric partitioning and topology-aware mapping aligned with the wafer's physical bandwidth distribution. To efficiently realize this model, HyperWafer integrates a sparsity-aware SpGEMM execution engine that sustains high per-PE throughput under irregular workloads together with a lightweight, congestion-sensitive runtime routing substrate that preserves effective on-wafer communication along hypergraph-optimized paths. Across diverse workloads, HyperWafer delivers average speedups of 979.97x, 125.23x, 12.14x, and 5.14x over state-of-the-art CPU, GPU, FPGA, and WSC SpGEMM implementations.
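
To make the data-movement pattern concrete, the sketch below shows a plain row-wise (Gustavson-style) SpGEMM and a helper that lists, for each row of B, which output rows consume it. Treating each such group as one hyperedge is only one plausible reading of the "multiway row sharing" mentioned in the abstract, not the paper's formulation. The sparse rows are stored as dictionaries mapping column index to value, and all names are hypothetical.

```python
def spgemm_rowwise(a_rows, b_rows):
    """Row-wise SpGEMM: C[i, :] = sum over k of A[i, k] * B[k, :]."""
    c_rows = []
    for a_row in a_rows:
        acc = {}  # sparse accumulator for one output row
        for k, a_val in a_row.items():
            for j, b_val in b_rows[k].items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        c_rows.append(acc)
    return c_rows

def row_sharing(a_rows):
    """Map each B row k to the set of output rows that read it.

    Assumption for illustration: each entry could serve as one hyperedge
    connecting all consumers of the same B row.
    """
    consumers = {}
    for i, a_row in enumerate(a_rows):
        for k in a_row:
            consumers.setdefault(k, set()).add(i)
    return consumers

if __name__ == "__main__":
    a_rows = [{0: 1.0}, {0: 2.0, 1: 3.0}]   # A = [[1, 0], [2, 3]]
    b_rows = [{1: 4.0}, {0: 5.0}]           # B = [[0, 4], [5, 0]]
    print(spgemm_rowwise(a_rows, b_rows))
    print(row_sharing(a_rows))
```

A B row read by many output rows must reach every processing element that owns one of those rows, so the size of each consumer set gives a rough picture of the communication a partitioning has to manage.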

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH2133

Toward True-3D Timing-Driven Analytical Global Placement for Mixed-Size Face-to-Face 3D ICs

Qiyuan Chen; Fuxing Huang; Lixin Chen; Hao Wu; Hao Gu; Xiqiong Bai; Xi Wang; Ziran Zhu

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

Face-to-face (F2F) three-dimensional integrated circuit (3D IC) design leverages advanced vertical interconnect technologies, such as hybrid bonding and micro-bump soldering, to vertically stack dies and preserve transistor-density scaling beyond Moore's law. Despite significant progress in 3D physical design methodologies, explicit timing optimization within 3D analytical global placement (GP) remains largely unexplored. In this work, we present the first timing-driven analytical GP framework that moves toward true-3D optimization for mixed-size F2F 3D ICs. We propose a comprehensive timing-driven net weighting formulation that integrates drive-strength-based cell delay, L-shaped RC-based net delay, and static timing analysis (STA)-based incremental timing criticality into the analytical GP model. This formulation serves as a backbone for timing optimization and flexibly adapts to both true-3D and multi-die 2D GP environments. To steer macros away from central congestion and provide an effective 3D initialization, we introduce a macro-boundary-aware true-3D initial placement approach that models macro-to-boundary interactions using a differentiable function. Then, we develop the first timing-driven mixed-size true-3D GP algorithm that jointly optimizes standard cells and macros within a unified 3D design space, enabling cross-die timing refinement and improving 3D timing closure. After die partitioning based on true-3D GP results, we further introduce a timing-driven multi-die 2D GP guided by 3D-aware STA, in which cross-die RC-trees are reconstructed to enable realistic 3D parasitic estimation for STA. Experimental results on OpenROAD benchmark suites demonstrate that, compared with existing open-source placement flows and a wirelength-driven 3D GP baseline, our timing-driven 3D GP framework achieves at least 33.2% and 43.2% average improvements in total negative slack (TNS) and worst negative slack (WNS), respectively.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH2139

Flip-FET-Based VLSI Design Framework with Congestion-Aware Dual-Side Routing

Sojung Park; Junghyun Yoon; Jooyeon Jeong; Heechun Park

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:From Analog to 3D Packaging: Enabling Routing and Clocking for Next-Generation Systems +1
Abstract

This paper presents a comprehensive Flip-FET-based VLSI design framework that leverages both front-side and back-side metal layers for congestion-aware routing optimization. Starting from the placement result, we formulate an integer linear programming (ILP) model to determine standard cells' pin assignments that minimize dual-side nets and alleviate regional congestion simultaneously. Routing is then performed with optimal nTSV insertion for dual-side nets using an RSMT construction approach. Experimental results demonstrate that the proposed framework achieves 78.8% reduction in maximum congestion and 68.6% reduction in DRVs over a commercial design tool.

EDAEDA7-I. Physical Design and VerificationEDA7-II. Physical Design and VerificationDesign
RESEARCH2145

drcAgent: A Design Rule Aware LLM Agent for DRC Auto-Fix with Adversarial Reinforcement Learning

Zihang Ma; Zhuoheng Wan; Haodong Lu; Kun Wang

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

As semiconductor manufacturing advances toward leading-edge nodes, such as 7 nm and below, Design Rule Checking (DRC) and violation correction have emerged as critical bottlenecks in achieving design closure. This escalation arises from intricate geometric constraints and complex rule dependencies, which traditional template-based or heuristic approaches are inadequate to resolve efficiently. These methods offer limited adaptability when migrating to new process nodes. To overcome these challenges, we propose drcAgent, a novel framework for automated DRC violation correction. The framework integrates a multimodal Large Language Model (LLM) agent with a Retrieval-Augmented Generation (RAG) mechanism grounded in a Design Rule Knowledge Graph (DRKG). We develop a self-play adversarial multi-turn reinforcement learning framework where a Generator agent and a Fixer agent co-evolve, enabling the agent to iteratively improve its correction policy through real interactions with commercial EDA tools. Experimental results on real industrial-scale design cases demonstrate that the proposed framework can be effective at fixing violations, offering a promising new learning-based solution for chip design.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2157

RFIPSA: A Reconfigurable FIP-Based Systolic Array for Various Matrix-Intensive Workloads

Zongfan Wu; Dong Jiang; Weijie Chen; Junyi Mai; Longyuan Kang; Enyi Yao

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

The growing demands of scientific computing and artificial intelligence require high-throughput, flexible and efficient matrix computation for dense, sparse, and convolutional workloads. This paper introduces a reconfigurable accelerator based on the fast inner product (FIP) algorithm, which reduces hardware resource consumption while supporting dense and sparse computation modes. The accelerator is further enhanced by a novel sparse encoding scheme that skips zero inputs with minimal overhead, and a convolution module that reduces data redundancy. The prototype is implemented in 28nm technology, achieving a 21.3% area reduction and up to 4.4× and 5.1× speedup for SpGEMM and SpMM over GEMM, respectively.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH2159

PMTPA: An Ising Model-Based Parallel-Momentum Tempering Processing Architecture for Combinatorial Optimization

Jinsa Hu; Xiangrui Wang; Dong Jiang; Junyi Mai; Zhanhong Huang; Enyi Yao

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

This work proposes a Parallel-Momentum Tempering Processing Architecture (PMTPA) featuring an elastic replica framework with time-division multiplexing and a system-level parallel pipeline. Key innovations include a similarity-driven local-field computation that halves local-field storage and simplifies arithmetic, an adaptive processing scheme supporting spin networks of varying scales, and a random-number-controlled shift mechanism to implement nonlinear functions, achieving a balance between computational accuracy and hardware cost. A prototype FPGA implementation supports up to 2,048 fully connected spins and 2–32 configurable replicas. Experimental results on large-scale Max-Cut and image-segmentation benchmarks demonstrate accuracy exceeding 99% and a 17,000× speedup compared with the CPU-based implementation.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH2169

Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

Linye Wei; Wenjue Chen; Pingzhi Tang; Xiaotian Guo; Le Ye; Runsheng Wang; Meng Li

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, their bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both contributing substantial inference cost and constraining achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which affects efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method to enhance efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162× and 2.63-6.30× speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation in existing acceleration frameworks.

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH2177

GTX: Graph Transformer for Parasitic Extraction on Analog Circuits

Youngchang Choi; Myungjun Kook; Seokhyeong Kang

Date:Monday, July 27 Location:Mtg Room 202AB Session:Generative AI and Graph-Based Learning for Next-Gen Circuit & Device Simulation +1
Abstract

We present GTX, a graph-transformer–based framework for accurate parasitic prediction in analog circuits. GTX combines a customized parasitic modeling scheme with HGNNs and global attention to capture both local device interactions and long-range interconnect dependencies. Experiments show GTX achieves higher prediction accuracy than prior ML-based methods and generates predicted-parasitics netlists whose post-layout simulations closely match commercial parasitic-extracted netlists. Moreover, parasitic inference is 232.7x faster than commercial PEX, and simulations using predicted-parasitics netlists are up to 8.4x faster while closely matching the results of commercial parasitic-extracted netlists, demonstrating GTX's effectiveness in accelerating layout-aware analog design.

EDAEDA6. Analog CAD, Simulation, Verification and TestChipletSYS4. Embedded System Design Tools and Methodologies
RESEARCH2179

DeTrojan: Detecting Trojan Circuits Using a Purpose-Built Graph Neural Network with Community-Enhanced Classification

Chien-Tung (Cherie) Kuo; Chen-Ching Nieh; Jie-Hong Roland Jiang; Chung-Yang (Ric) Huang

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

Gate-level Hardware Trojan (HT) detection using Graph Neural Networks (GNNs) often suffers from limited accuracy due to the reliance on structure-only features and fragmented node-level predictions. We propose a GNN-based framework, DeTrojan, to overcome these challenges through three key innovations: (1) functionality-aware feature engineering, (2) an edge-aware GNN architecture with Jumping Knowledge and global context aggregation, and (3) accurate Trojan circuit localization with community-aware classification refinement. Evaluated on the 2025 ICCAD CAD Contest Hardware Trojan benchmarks, DeTrojan achieves a 16.7% improvement in circuit-level prediction accuracy and a 47.3% relative increase in gate-level F1-score over the state-of-the-art machine learning (ML)-based methods. Furthermore, incorporating our proposed features and refinement modules into existing approaches yields up to a 16.6% gain in accuracy and a 2× improvement in the F1-score, demonstrating the effectiveness and generality of our framework.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH2184

Adana: Accelerating Large Language Models via Adaptive Nonuniform Asymmetric Quantization

Xinkuang Geng; Siting Liu; Hui Wang; Leibo Liu; Jie Han; Honglan Jiang

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

We propose Adana, a hardware-software co-design that enables efficient low-bit group-wise quantization for LLMs based on the adaptive nonuniform asymmetric numeric type. First, Adana introduces a novel numeric type that precisely captures the nonuniformity and asymmetry of data within small groups. In addition, an approximate metric for quantization error is proposed to facilitate efficient implementation of online adaptive activation quantization. Finally, a dedicated LLM acceleration microarchitecture is developed for Adana. Compared to state-of-the-art designs, Adana achieves 1.42×–2.10× speedups and 18.9%–48.5% power savings on LLMs, while maintaining superior accuracy.
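
For context, the sketch below shows the conventional baseline that group-wise schemes build on: uniform asymmetric quantization, where each small group stores its own scale and offset. It is not the nonuniform numeric type proposed in the paper, and the function names, group size, and bit width are assumptions chosen for illustration.

```python
def quantize_group(values, bits=4):
    """Uniform asymmetric quantization of one group; returns (codes, scale, offset)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid dividing by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, offset):
    """Map integer codes back to real values."""
    return [offset + c * scale for c in codes]

def quantize_groupwise(values, group_size=4, bits=4):
    """Quantize a flat list group by group, each with its own scale and offset."""
    groups = []
    for start in range(0, len(values), group_size):
        groups.append(quantize_group(values[start:start + group_size], bits))
    return groups

if __name__ == "__main__":
    data = [-1.0, 0.0, 0.5, 2.0, 10.0, 10.2, 10.4, 11.5]
    for codes, scale, offset in quantize_groupwise(data):
        print(codes, dequantize_group(codes, scale, offset))
```

Because the second group gets its own offset near 10, its values use the full 4-bit code range instead of being squeezed into the top of a shared range. The uniform grid still wastes codes when values cluster unevenly inside a group, which is the kind of nonuniformity the abstract refers to.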

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH2199

ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching and Token Scheduling

Xin He; Shunkang Zhang; Kaijie Tang; Shaohuai Shi; Yuxin Wang; Zihao Zeng; Zhenheng Tang; Xiaowen Chu; Haiyan Yin; Ivor Tsang; Yew Soon Ong

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

Sparse Mixture-of-Experts (MoE) models can outperform dense LLMs at similar computation, but their large expert parameters create high memory demand, making single-GPU deployment difficult. Offloading addresses this by storing inactive experts on CPU, yet static caches ignore dynamic routing and existing predictors for expert usage are often inaccurate or costly. We present ExpertFlow, a lightweight MoE inference system with a routing path predictor, a routing-aware token scheduler, and a predictive expert cache. Together, these components enable efficient expert loading and execution, reducing GPU memory by 93.72% and improving throughput by up to 10× on a single GPU.
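
The interaction between a routing predictor and an expert cache can be sketched in a few lines. The class below is a simplified illustration, not the ExpertFlow implementation: it is a least-recently-used cache with a `prefetch` hook that loads the experts a predictor expects to be used next. The class name, the `load_fn` callback (standing in for a CPU-to-GPU weight copy), and the eviction policy are all assumptions.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weights with a prefetch hook for predicted experts."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader, e.g. a host-to-device copy
        self.slots = OrderedDict()      # expert_id -> weights, oldest first
        self.misses = 0

    def _insert(self, expert_id):
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)   # mark as most recently used
            return
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)      # evict the least recently used
        self.slots[expert_id] = self.load_fn(expert_id)

    def prefetch(self, predicted_ids):
        """Load predicted experts ahead of time; these loads are not counted as misses."""
        for expert_id in predicted_ids[:self.capacity]:
            self._insert(expert_id)

    def get(self, expert_id):
        """Return expert weights, counting a miss if they had to be loaded on demand."""
        if expert_id not in self.slots:
            self.misses += 1
        self._insert(expert_id)
        return self.slots[expert_id]

if __name__ == "__main__":
    cache = ExpertCache(capacity=2, load_fn=lambda e: f"weights-{e}")
    cache.prefetch([3, 7])      # the predictor expects experts 3 and 7
    cache.get(3)                # hit, thanks to the prefetch
    cache.get(5)                # miss, evicts expert 7
    print(list(cache.slots), cache.misses)
```

An accurate predictor turns on-demand loads into prefetches that can overlap with computation, while an inaccurate one evicts experts that are still needed, which is why the abstract stresses predictor quality.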

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH2207

PRISM: Priority Aware Shared Scale Microscaling Format for 4-Bit Quantization

Dongyoung Lee; Seokjin Kim; Yiran Chen; Ik Joon Chang

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Microscaling formats have emerged as prominent candidates for 4-bit quantization of modern AI models due to their fine-grained group-wise quantization granularity. However, such formats still exhibit fundamental accuracy degradation. In particular, MXFP4 and NVFP4 are limited by fixed shared-scale precision, and NVFP4 further cannot cover the full value range without an added FP32 scale. This paper presents PRISM, a microscaling format with a single 8-bit encoded group-level shared scale that adaptively allocates the shared scale's representation based on the relative importance of values. In our evaluation, PRISM surpasses the accuracy of conventional microscaling formats with only 0.62% area and 0.86% energy overhead.

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2216

Safe-IoT: A Memory-Efficient HW/SW Co-Designed ML-DSA Accelerator for IoT Edge Devices

Yan Xu; Jingqi Zhang; Mengquan Li; Xinghua Wang; An Wang; Liehuang Zhu

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

The development of IoT devices imposes stringent requirements on data security. Existing CPU-based cryptographic schemes suffer from high latency, while dedicated hardware accelerators lack scalability and fail to support agile development. To address these challenges, we propose Safe-IoT, a memory-efficient HW/SW co-designed ML-DSA accelerator for IoT edge devices. Its core components include a memory-efficient IoT-oriented number theoretic transform (MI-NTT) circuit and a low-cost LUT-based modular multiplier. Experimental results show that, compared with state-of-the-art CPU-based designs, Safe-IoT achieves up to a 3.0× improvement in throughput. Compared with similar HW/SW co-design approaches, on-chip RAM usage is reduced by 2.6-8.3x, and overall performance, measured by area-time product (ATP), is improved by 4.28-176.53x.

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH2230

WSC-Cost: An Accurate Cost Modeling Framework for Wafer-Scale Chips

Dingcheng Jiang; Guanghong Wu; Jingyao Liu; Xinru Tang; Jiamu Fu; Taiquan Wei; Jinyi Deng; Yang Hu; Shouyi Yin

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Interconnect-Driven System Architecture Innovation +1
Abstract

As Moore's Law nears its physical limits, wafer-scale chips (WSCs) are emerging as a promising platform for LLM workloads, offering extreme performance and favorable cost. However, prior work seldom quantifies WSC cost end-to-end, hindering principled design-space exploration. We introduce WSC-Cost, a unified WSC cost model with two orthogonal dimensions. Horizontally, WSC-Cost quantifies mask-stitching costs across multiple metal layers. Vertically, it accounts for interposer fabrication and assembly costs. We validate WSC-Cost using open-source cost data and an in-house WSC prototype. WSC-Cost enables the co-optimization of wafer-scale systems and further provides cost-driven insights to unleash the full potential of wafer-scale integration.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageQuantum +1 more
RESEARCH2235

CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution

Mu-Young Son; Yi Chen; Seungjae Yoo; Soongyu Choi; Joo-Young Kim

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to a significant parameter size and intermediate data. Prior works attempt to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)–enabled CPU–GPU collaborative system that comprehensively optimizes MoE inference by combining coalesced expert execution with strategic workload orchestration for higher throughput. CoX-MoE introduces (i) a coalescing-aware orchestration policy to jointly optimize resource allocation by adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading, and (ii) a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU, mitigating PCIe transfer overhead and balancing workload for the CPU and GPU during inference. Compared to state-of-the-art frameworks, CoX-MoE delivers significant gains, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.
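
The static pre-assignment of frequently activated experts can be pictured with a small sketch. This is not CoX-MoE's actual policy: the function name `place_experts`, the `activation_counts` profile, and the fixed `gpu_slots` budget are all hypothetical, and the sketch simply ranks experts by profiled activation count.

```python
def place_experts(activation_counts, gpu_slots):
    """Split experts into GPU-resident and CPU-resident sets.

    activation_counts: dict mapping expert id -> profiled activation count
    (hypothetical offline profile). gpu_slots: number of experts that fit
    in GPU memory (hypothetical budget).
    """
    # Rank experts from most to least frequently activated.
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    # The hottest experts stay on the GPU; the rest are served from the CPU.
    return set(ranked[:gpu_slots]), set(ranked[gpu_slots:])
```

Keeping the hottest experts resident on the GPU means fewer weights cross PCIe at inference time, which is the effect the abstract attributes to its stratification scheme.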

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH2243

MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition

Maoliang Li; Ke Li; Yaoyang Liu; Jiayu Chen; Zihao Zheng; Chenchen Liu; Guojie Luo; Yinjun Wu; Xiang Chen

Date:Wednesday, July 29 Location:Mtg Room 101B Session:Fast, Furious, and Fault-Tolerant: Accelerating the Generative Grind +1
Abstract

To leverage user-specific data, retrieval-augmented generation (RAG) is used in multimodal large language model (MLLM) applications. Conventional retrieval has low accuracy, and advanced multi-vector retrieval (MVR) still lacks optimal accuracy and efficiency due to ignored query-image alignment and redundant image segments. We propose MIRAGE, an efficient image retrieval framework. It uses a hierarchical paradigm for better alignment, cuts redundancy by leveraging cross-hierarchy ranking consistency and hierarchy sparsity, and auto-configures dataset parameters. Experiments show it boosts accuracy and reduces computation by up to 3.5 times versus existing MVR systems.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH2245

TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

Chang Meng; Hanyu Wang; Yuyang Ye; Mingfei Yu; Wayne Burleson; Giovanni De Micheli

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

Reducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-consuming in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that rely on manual or automatic synthesis without considering AI model contexts, we present TRAM, which jointly optimizes the AxM structure and AI model to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by 27.09% on vision transformers with ImageNet.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2254

Instruction-Level Side-Channel Disassembly on Modern Complex CPUs via Voltage Waveform Analysis

Yucheng Ji; Ruibin Xia; Hanyuan Zou; Yingqi Zhang; Weidong Liu; Jiaxing Song; Dongsheng Wang

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

Power side-channel attacks allow adversaries to recover executed instructions and reconstruct program flow from power traces. However, existing approaches have primarily focused on low-frequency embedded platforms with relatively simple processor designs. When applied to modern commercial CPUs, complex microarchitectural behaviors, such as out-of-order execution and deep pipelining, introduce substantial execution variability and noise, severely hindering reliable instruction recovery from power measurements. In this work, we present the first systematic demonstration of instruction-level side-channel disassembly on a commercial x86 processor operating at 3.1 GHz. We construct a large-scale dataset containing 645 instructions—including AVX/SSE extensions—and more than 50,000 voltage waveform samples collected under realistic multi-threaded operating conditions. To capture cross-cycle dependencies, we propose a novel end-to-end sequence-to-sequence framework that directly reconstructs instruction sequences from voltage waveforms, integrating CNN-Conformer encoders for feature learning and Transformer decoders for sequence generation. Experiments show that our method achieves 34.70% sequence prediction accuracy, outperforming static segmentation and classifier-based baselines.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2256

ALOHA: Area-Efficient and Low-Power Stochastic Computing with Hammersley Sequences

Zhaojun Ni; Yutao Gong; Honglan Jiang; Siting Liu

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

Counter-based stochastic computing (CBSC) utilizes simple counting logic to realize multiplication. Inspired by CBSC, this paper proposes the ALOHA architecture that significantly reduces the computing latency and stochastic number generation (SNG) overhead. Specifically, ALOHA incorporates three key innovations, including input scaling, bit-reversal counter-based SNG, and sequence-aware optimization schemes. Experimental results show that the ALOHA multiplier significantly outperforms state-of-the-art CBSC designs in terms of accuracy, area and power consumption with reduced sequence length. In CNN and Transformer inference, ALOHA matches the accuracy of prior FSM-based multipliers while reducing area and energy.
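
One plausible reading of counter-based multiplication with a bit-reversal counter is sketched below. It is an illustration under stated assumptions, not the ALOHA design: the function names are hypothetical, and input scaling and the sequence-aware optimizations are left out. Pairing a counter `k` with its bit-reversed value yields a 2-D Hammersley point set, so counting the cycles where both comparator outputs are 1 approximates the product.

```python
def bit_reverse(k: int, n_bits: int) -> int:
    """Reverse the n_bits-wide binary representation of k."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def sc_multiply(a: int, b: int, n_bits: int) -> int:
    """Count cycles where both comparator-style bitstreams are 1.

    Stream A is 1 while counter < a; stream B is 1 while
    bit_reverse(counter) < b. The count approximates a * b / 2**n_bits.
    """
    length = 1 << n_bits
    return sum(
        1
        for k in range(length)
        if k < a and bit_reverse(k, n_bits) < b
    )
```

For example, `sc_multiply(128, 64, 8)` returns 32, which equals 128 * 64 / 256. For general inputs the count deviates from the exact scaled product by at most `n_bits`.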

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH2265

SparseE: Unlocking Sparsities in Encrypted Sparse Matrix Multiplication via Hardware-Software Co-Design

Yuntao Wei; Xueyan Wang; Song Bian; Yier Jin; Weisheng Zhao

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

Sparse Matrix Multiplication (SpMM) incurs prohibitive performance overheads in Privacy-Preserving computing based on Fully Homomorphic Encryption (FHE), since the encryption of the sparse matrix hides its inherent sparsity pattern, leading to dense matrix computations. To overcome this, we propose SparseE, a hardware-software co-design framework that enables efficient, sparsity-aware SpMM under FHE. Our novel algorithm recasts SpMM into a secure Scatter-Gather-Apply paradigm, using a homomorphic permutation network to perform the critical data gathering based on encrypted indices. This approach ties the computational cost directly to the number of non-zero elements while protecting the sparsity pattern. To further bridge this performance gap, we co-design a dedicated hardware accelerator. Its Homomorphic Permutation Engine adapts the network to a hardware-friendly Benes topology, enabling a deeply pipelined Radix-k MDC architecture that resolves the on-chip bandwidth bottleneck. Concurrently, its Homomorphic Expansion Engine performs on-the-fly decompression of compressed selector ciphertexts, mitigating the massive storage bottleneck. Experimental results demonstrate that SparseE achieves an average speedup of 401.8× and an average energy reduction of 2594.3× compared to state-of-the-art solutions.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH2268

TierANNS: Scalable Graph-Based ANNS with CXL-Enabled Tiered Data Placement

Ying Zhang; Chao Li; Xiaowei Chen; Tianyu Wang; Chenlin Ma; Zhaoyan Shen; Jiaxian Chen; Kecheng Huang

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

Performing Approximate Nearest Neighbor Search (ANNS) on large-scale vector datasets is essential in the context of AI applications. Graph-based ANNS demonstrates superior performance and accuracy, positioning itself as a leading approach within the ANNS landscape. However, its inherent dependency on random data access mandates a memory-centric deployment strategy, which in turn presents significant scalability challenges. Recent advancements in Compute Express Link (CXL) technologies, known for their high-bandwidth memory extension capabilities, offer a critical opportunity to enhance the scalability of graph-based ANNS. Nonetheless, the implications of the CXL-extended memory architecture when applied to graph-based ANNS remain largely unexplored. In this paper, we present a CXL-Oriented Graph-based ANNS (COGA) system to achieve scalable, high-speed ANNS for extensive datasets. Our key observation is that the search pipeline can be enhanced through a CXL-tailored priority queue that maximizes CXL bandwidth utilization while bridging the latency gap. Furthermore, we propose a tiered data layout and placement strategy that leverages queue hints to facilitate speculative access to index data.

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH2271

PeFoo-L: A General Framework for Preconditioned Forward-Only Optimizer Enabling LLM Fine-Tuning on the Edge

Yuanfang Wang; Yu Li; Zhinan Qin

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Fine-tuning Large Language Models (LLMs) on resource-constrained edge devices is a critical but challenging task, primarily due to the prohibitive memory and computational costs of backpropagation. While forward-only optimizers like MeZO mitigate these costs by eliminating the backward pass, they often suffer from slow and unstable convergence. To address this limitation, we introduce PeFoo, which integrates a carefully designed preconditioning strategy into the forward-only paradigm. Furthermore, we propose PeFoo-L to counteract the memory overhead introduced by the preconditioner. This approach constrains preconditioner storage and weight updates to a single layer per iteration, reducing the overall memory footprint and data traffic.
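
For context, the forward-only baseline that the abstract builds on can be sketched as a zeroth-order (SPSA-style) step in the spirit of MeZO. This is a minimal sketch on a plain list of floats, not PeFoo or PeFoo-L: no preconditioner is applied, and `mezo_step`, `loss_fn`, and the default hyperparameters are hypothetical.

```python
import random


def mezo_step(theta, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One in-place zeroth-order update using two forward passes."""

    def perturb(scale):
        # Regenerate the same Gaussian direction z from the seed instead of
        # storing it, so no second copy of the parameters is needed.
        rng = random.Random(seed)
        for i in range(len(theta)):
            theta[i] += scale * eps * rng.gauss(0.0, 1.0)

    perturb(+1.0)
    loss_plus = loss_fn(theta)       # L(theta + eps * z)
    perturb(-2.0)
    loss_minus = loss_fn(theta)      # L(theta - eps * z)
    perturb(+1.0)                    # restore theta

    # Finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # Replay z once more and step along it.
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] -= lr * projected_grad * rng.gauss(0.0, 1.0)
    return theta
```

Because every update moves along a single random direction, progress per step is noisy, which is consistent with the slow and unstable convergence the abstract attributes to forward-only optimizers.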

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH2282

In-Memory ADC-Based Nonlinear Activation Quantization for Efficient In-Memory Computing

Shuai Dong; Junyi Yang; Biyan Zhou; Hongyang Shang; Gourav Datta; Arindam Basu

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Move Less, Compute More: Memory-Centric Architectures for AI +1
Abstract

This paper presents Boundary Suppressed K-Means Quantization (BS-KMQ), a nonlinear (NL) quantization method to reduce analog-to-digital converter (ADC) resolution in in-memory computing (IMC) systems. ReLU and clamping often cause value accumulation near distribution edges, leading to biased clustering and suboptimal quantization. BS-KMQ mitigates this by removing such outliers before clustering, yielding more informative quantization levels. It achieves at least 3X lower quantization error compared to linear, Lloyd–Max, CDF and K-means methods. The resulting NL references are implemented using a reconfigurable in-memory NL-ADC with 7X area improvement compared to two previous works. Evaluated on ResNet-18, VGG-16, Inception-V3, and DistilBERT, BS-KMQ improves post-training quantization accuracy by up to 66.8%, 25.4%, 66.6%, and 67.7%, respectively, compared to linear quantization. After low-bit finetuning, it maintains competitive accuracy with significantly fewer ADC levels (3/3/4/4b). System-level simulation on ResNet-18 (6/2/3b) shows up to 4X speedup and 24X energy efficiency over existing IMC accelerators.
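
The idea of removing boundary-pinned samples before clustering can be sketched as follows. This is an assumption-laden illustration rather than the BS-KMQ algorithm: it uses plain 1-D Lloyd iterations, treats the clamp bounds `lo` and `hi` as known, and does not model how the boundary values themselves receive reference levels. All function names are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd iterations on scalars; returns sorted centroids."""
    data = sorted(values)
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        # Keep the old centroid when a bucket ends up empty.
        centroids = [
            sum(b) / len(b) if b else centroids[j]
            for j, b in enumerate(buckets)
        ]
    return sorted(centroids)


def boundary_suppressed_levels(activations, k, lo, hi):
    """Drop samples pinned at the clamp bounds, then cluster the rest."""
    interior = [v for v in activations if lo < v < hi]
    if not interior:
        raise ValueError("no interior samples left to cluster")
    return kmeans_1d(interior, k)
```

When many samples sit exactly at 0 and at the clamp ceiling, plain k-means spends its centroids on those two spikes, whereas the suppressed variant places its levels inside the range where the remaining values actually vary.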

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIQuantum
RESEARCH2290

Design Technology Co-Optimization for Network on Chip in High Performance CPUs at 3nm Node Using Monolithic 3D and Backside Interconnect Technologies

Tianchi Liu; Hu Zhou; Zexu Leng; Zhiyong Zhang; Lianmao Peng; Rongmei Chen

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

With the rapidly growing demand for cloud computing and large-scale AI models, many-core systems face the challenge of longer global interconnect distances in the Network-on-Chip (NoC). Though conventional 2D NoCs can use relatively high metal layers for global routing, the extensive repeater insertion needed for long-distance transmission causes a significant amount of via stacking, leading to performance degradation. Targeting high-performance CPU clusters at the 3nm node, this work adopts a design–technology co-optimization (DTCO) framework to evaluate long-distance NoC interconnects across four implementation schemes: conventional frontside 2D (F2D), frontside 3D (F3D) with M3D integration, F2D with backside power delivery network (BSPDN), and backside 2D (B2D) leveraging wafer backside signal routing and PDN. Based on post-layout extraction of the ARM Neoverse CSS N2 computing tile, we incorporate realistic PDN characteristics, technology-dependent RC modeling, and IR-drop-aware circuit simulation. Results show that F3D and B2D reduce delay by 53% and 68%, and energy–delay product (EDP) by 32% and 63%, respectively, compared with F2D. F2D-BSPDN achieves performance comparable to F3D. System-level NoC evaluations further demonstrate that F3D/B2D enable feasible link frequencies 2.1×/3.1× those of F2D, and lower the average NoC latency relative to F2D by 23%/35%. The DTCO analysis indicates that while F2D remains adequate for small cores (Cortex-A76), B2D is the optimal choice for mid-core (Cortex-A720, X4) and large-core (Cortex-X925, CSS N2) clusters, with F3D providing secondary benefits through repeater relocation. These findings identify backside interconnect as

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH2293

PhotoMT: Accelerating Zero-Knowledge Proofs with a Photonic-Electronic Merkle Tree Engine

Yan Xu; Mengquan Li; Shu Li; Zhaoyuan Zhang; Kenli Li

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

The Merkle tree, a fundamental cryptographic primitive for ensuring user privacy, plays a critical role in zero-knowledge proof systems (ZKPs). However, its construction involves numerous computationally intensive Poseidon hashes, creating the primary computational bottleneck in ZKP systems. To address this challenge, we propose PhotoMT, the first photonic-electronic collaborative Merkle tree engine. PhotoMT leverages the photonic microring resonator (MR) array for intensive matrix-vector multiplications to boost throughput and energy efficiency, while digital circuits handle control-intensive tasks. Furthermore, PhotoMT incorporates a multi-subtree interleaved execution strategy and an S-BOX bypass computation queue, which together improve hardware utilization and reduce memory overhead. The experiments reveal that PhotoMT boosts throughput by 18.8-20.5x over state-of-the-art ASIC-based designs, achieving an energy efficiency gain of 3-5 orders of magnitude against the CPU baseline.
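
To show why tree construction is hash-bound, here is a minimal bottom-up Merkle root computation. It is a sketch under assumptions: SHA-256 from the standard library stands in for Poseidon, which ZKP systems would use instead, and duplicating the last node on odd-sized levels is just one common convention. The function names are hypothetical.

```python
import hashlib


def hash_pair(left: bytes, right: bytes) -> bytes:
    # Stand-in two-to-one compression; a ZKP system would use Poseidon here.
    return hashlib.sha256(left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Build the tree bottom-up and return the root digest."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hash_pair(level[i], level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

With a power-of-two number of leaves N, the loop performs N - 1 pair hashes on top of the N leaf hashes, so hash throughput dominates the cost of construction.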

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH2294

PIMA-SecDB: Processing-in-Memory Acceleration for Fully Homomorphic Encryption Databases

Lin Ding; Pengao He; Jiliang Zhang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

Databases using fully homomorphic encryption (FHE) protect outsourced data but suffer slow query speeds due to full scans of encrypted entries. To address this, we propose PIMA-SecDB, a processing-in-memory (PIM) architecture that accelerates FHE database operations. It adopts an integrated co-design of hardware and algorithms, featuring a multi-level, multi-channel structure for parallel processing and high bandwidth. Moreover, we remove costly circuit bootstrapping from the structured query language based on FHE and propose a quantitative method to select PIM-friendly encryption parameters, further reducing computational and data load. Experiments show PIMA-SecDB is 1.98 to 7.93 times faster than existing FHE accelerators.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH2297

IR-Drop Aware Global Placement with PDN Optimization and GPU-Accelerated Analysis for Mitigating Global and Local IR-Drop

Jai-Ming Lin; Pin-Yu Chen; Zhao-Xian Huang; Chen-Fa Tsai; De-Shiun Fu; Shih-Cheng Huang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:Placing the Future Beyond Cells: From Transistors to 3D Integration +1
Abstract

This paper presents an analytical global placement framework that concurrently integrates PDN optimization, effectively addressing both global and local IR-drop challenges. To enable accurate and efficient voltage drop estimation, we implement a fast, GPU-accelerated IR-drop analysis engine based on Modified Nodal Analysis (MNA). For PDN optimization, we introduce an iterative strategy that enhances power delivery robustness by inserting additional power delivery paths to reduce global IR-drop. To further address localized IR-drop issues, we propose a novel density control mechanism that constrains cell placement based on the maximum tolerable power load at each PDN node, as determined by IR-drop severity. Experimental results show that the proposed methodology significantly reduces IR-drop violations without incurring additional wirelength, runtime, or routability overhead compared to state-of-the-art IR-drop-aware placement techniques.

EDAEDA7-I. Physical Design and VerificationSystems
RESEARCH2298

VeRA+: Vector-Based Lightweight Compensation for Drift-Resilient RRAM In-Memory Computing

Weirong Dong; Kai Zhou; Zhen Kong; Zhengke Yang; Quan Cheng; Haoyuan Li; Junkai Huang; Jun Lan; Yida Li; Masanori Hashimoto; Longyang Lin

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Leveraging AI for Silicon Health: Test, Yield & Fault Tolerance +1
Abstract

RRAM-based in-memory computing (IMC) offers high energy efficiency but suffers from conductance drift that severely degrades long-term accuracy. Existing approaches including retraining, noise-aware training, and Batch Normalization-based calibration either require RRAM rewriting, demand large storage overhead, or rely on online correction. We propose VeRA+, a lightweight drift compensation framework that reuses shared projection matrices and introduces only two compact drift-specific vectors per drift level. A drift-aware scheduling algorithm offline-trains a small set of VeRA+ parameters and selects the appropriate set over time without any on-chip retraining or data replay. VeRA+ preserves up to 99.77% of the drift-free accuracy after ten years of simulated drift and reduces storage overhead by more than three orders of magnitude compared with BN-based calibration. To validate VeRA+ under realistic device behavior, we extract one-week drift statistics from measurements on our fabricated 1T1R RRAM devices and use them to simulate realistic drifted weights. Under these measured drift conditions, VeRA+ achieves accuracy close to the drift-free baseline, providing an efficient and practical solution for long-term drift resilience in RRAM-IMC.

EDAEDA9. Test, Validation and Silicon Lifecycle ManagementAIQuantum
RESEARCH2300

HeRo: Adaptive Orchestration of Retrieval-Augmented Generation on Heterogeneous Mobile SoC

Maoliang Li; Jiayu Chen; Zihao Zheng; Ziqian Li; Xinhao Sun; Guojie Luo; Chenchen Liu; Xiang Chen

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

HeRo is a framework for efficiently running agentic RAG on resource-limited, heterogeneous mobile devices. It builds a performance model that accounts for workload, accelerator features, and memory contention, then uses a lightweight, dependency-guided scheduler with shape-aware decomposition, criticality estimation, and concurrency control. This approach adapts to dynamic workloads and significantly reduces latency, achieving up to 10.94× speedup over prior methods.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2306

Are They All Safe? Practical Fault Injection Attacks on FPGA Logic Synthesis Tools

Jiaxin Li; Shikai Guo; Zhihao Xu; Qian Ma; Xiaochen Li; He Jiang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

FPGAs are widely deployed in critical systems, but flaws in EDA toolchains can enable malicious HDL injection, leading to severe hardware vulnerabilities. We propose DefVul-Risk, an automated framework for assessing fault-injection risks in FPGA synthesis tools. It constructs a defect knowledge base, generates triggerable vulnerability samples via large language models, and fine-tunes risk-assessment models to prioritize high-risk defects. DefVul-Risk enhances toolchain security evaluation and vulnerability remediation efficiency. We submitted 26 CVE reports, with 9 officially confirmed.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH2307

STAC: Spatial-Temporal Activation Contextualization for Resilient LLM Inference

Rengang Zhang; Leyan Wang; Zizhen Liu; Bin Sun; Naixing Wang; Cheng Liu; Jing Ye; Huawei Li

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

LLM reliability is compromised by soft errors, yet existing static defenses fail due to the "Outlier Dichotomy": non-uniform activation patterns between Prefill and Decode phases. We propose STAC, a phase-aware framework that resolves this asymmetry. STAC employs a Spatial Contextualizer to preserve massive functional outliers in Prefill and a token-predictive Temporal Contextualizer to track dynamic drift in Decode. By aligning protection with these intrinsic structures, STAC achieves high fault coverage with negligible overhead. Experiments confirm STAC significantly outperforms static baselines, simultaneously preserving model accuracy and ensuring robustness.

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH2313

Weaver: Stratified Expert Scheduling for Memory-Constrained MoE Inference

Han Li; Jingwei Sun; Junqing Lin; Guangzhong Sun

Date:Wednesday, July 29 Location:Mtg Room 101A Session:The Edge-Lord Chronicles: Tiny Gestures, Big LLMs, and Cluster Chaos +1
Abstract

Mixture-of-Experts (MoE) models achieve remarkable performance through sparse expert activation, yet resource-constrained deployment encounters challenges as parameter sizes exceed device capacity, necessitating CPU offloading. Existing systems leave CPU underutilized during GPU attention computation. We present Weaver, a stratified scheduling framework exploiting this idle window through score-stratified expert treatment. Our insight reveals asymmetric expert importance: low-scoring experts tolerate input approximation while high-scoring ones remain accuracy-critical. Weaver proactively executes low-score experts on CPU during attention using prior-layer inputs, initiates stratified prefetching, and reactively balances remaining workload. Evaluations on three MoE models demonstrate 1.47-3.58x speedup over state-of-the-art offloading systems while maintaining model quality.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH2317

UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators

Sunan Zou; Xueting Sun; Ziyun Zhang; Guojie Luo

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Large language models (LLMs) demand ever more GPU memory, necessitating efficient and extreme weight compression methods. Existing compression methods are either theoretically limited to 1 bit per weight or suffer severe performance degradation and inefficiency. To deploy LLMs in resource-constrained scenarios, we introduce UltraSketchLLM, which compresses LLMs with data sketches. It reduces the peak GPU memory footprint, reaching compression rates as low as 0.5 bit per weight. Combined with a hardware-friendly implementation, UltraSketchLLM keeps performance degradation tolerable and latency overhead extremely low, with a 14.9x speedup over a naive sketch solution.

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH2319

ChipModeler: LLM-Aided Reference Model Design for Agile Hardware Verification

Jianmin Ye; Yifan Zhang; Tianyang Liu; Qi Tian; Shengchu Su; Lik Tung Fu; Nan Guan; Zhe Jiang; Xi Wang

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

As integrated circuit designs grow in complexity, reference model development for functional verification faces increasing challenges. We propose ChipModeler, an LLM-assisted platform that streamlines reference model generation and verification through design standardization and hierarchical agile modeling. By employing a building-block generation strategy, ChipModeler significantly enhances both efficiency and quality. Evaluation on 300 diverse designs shows up to 58.99% improvement in performance, a 9.18× increase in generation capacity, and a 7.11× acceleration in design and validation cycles compared to manual methods, demonstrating ChipModeler's effectiveness in automating reference model development.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2329

ShortCircuit: AlphaZero-Driven Generative Circuit Design

Dimitrios Tsaras; Antoine Grosnit; Lei Chen; Zhiyao Xie; Haitham Bou Ammar; Mingxuan Yuan

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

Traditional logic synthesis heuristics have plateaued. We introduce ShortCircuit, a novel transformer-based architecture generating AND-Inverter Graphs (AIGs) directly from truth tables. Unlike prior deep learning efforts, we employ a two-phase training process combining supervised learning and an AlphaZero variant to navigate the doubly exponential state space. Evaluated on 500 8-input truth tables from real-world circuits, ShortCircuit guarantees functional correctness and outperforms the state-of-the-art tool ABC in circuit size. Furthermore, our greedy rollout mechanism achieves a 31× speedup, demonstrating significant efficiency gains.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH2330

HP-CIM: A Computing-in-Memory Transformer Accelerator with ReRAM-Based Hash Predictor for Attention Sparsity Exploitation

Tong Li; Zhiwei Zhou; Jiancong Li; Yuyang Fu; Yingjie Yu; Tong Hu; Jia Chen; Yi Li; Xiangshui Miao

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Move Less, Compute More: Memory-Centric Architectures for AI +1
Abstract

Transformers excel at sequential modeling, but attention and large matrix computations incur high latency and energy costs. The potential of the hybrid ReRAM-SRAM computing-in-memory (CIM) accelerator is constrained by the redundant attention sparsity. In prior works, exploiting sparsity either introduces significant overhead in Top-K query-key identification or causes accuracy degradation. We present HP-CIM, a hybrid accelerator with a ReRAM-based hash predictor (ReHP) that exploits device variability for low-cost projections and couples ReRAM CAM with a K-winner-take-all (K-WTA) circuit for Top-K selection. Furthermore, an optimizable bias-softmax mechanism compensates for information loss. Across diverse tasks, HP-CIM delivers 9.05–310.04× energy efficiency and 2.48–16.93× speedups over state-of-the-art CIM-based transformer accelerators.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIQuantum
RESEARCH2333

PBA-rGAT-Edge: Arc-Level Path-Based Timing Prediction with Scalable Residual Edge-Aware Graph Attention

Fangjian Liu; Chenyang Lv; Chunyang Feng

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Path-Based Analysis (PBA) is accurate but computationally expensive, while Graph-Based Analysis (GBA) is fast but overly pessimistic. We propose PBA-rGAT-Edge, a residual edge-aware graph attention model that performs arc-level delay and slew prediction on a pin-level timing graph. The model uses a compact residual attention backbone with a lightweight fusion and dual-task head for stable and efficient learning. Experiments on million-scale industrial benchmarks show state-of-the-art accuracy (R²: 0.965 slew, 0.997 delay) and faster convergence compared to prior GNN-based methods, including DeepEdgeGAT (ASPDAC'23).

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH2336

A Physics-Prior Intelligent Compact Modeling Framework for BEOL-Compatible DTCO: EKAN-Based Distillation from Neural to Symbolic Models

Xufan Li; Yi Li; Ning Lin; Xiaoyi Zhang; Yue Zhao; Jiaye Shen; Xiaojuan Qi; Zhongrui Wang; Han Wang; Zhenjie Yao; Lingfei Wang; Ling Li; Ming Liu

Date:Tuesday, July 28 Location:Mtg Room 101B Session:AI Graces the Cell Party: Transformers, LLMs and Zero Drama Extraction +1
Abstract

Neural compact models are increasingly explored for design–technology co-optimization (DTCO), yet their black-box nature hinders physical interpretability and seamless SPICE deployment. We introduce a physics-prior neural-to-symbolic compact modeling framework based on Efficient Kolmogorov–Arnold Networks (EKAN) trained on multidimensional oxide-FET data. EKAN first learns a smooth, bias-aware log-current surrogate; its spline activations are then distilled into a closed-form current expression via KAN-derived one-dimensional atoms, physics-guided feature libraries, and weighted sparse regression with monotonicity regularization. The resulting Verilog-A model is SPICE-ready, preserves key device trends across bias and process, and attains accuracy comparable to neural compact models while remaining interpretable.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH2345

ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-Computation for LLM Inference

Qiuyang Zhang; Kai Zhou; Ding Tang; Kai Lu; Cheng Li; Zhenyu Yang; Peng Xu; Jiguang Wan

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Large language models encounter critical GPU memory capacity constraints during long-context inference, where KV cache memory consumption severely limits decode batch sizes. While existing research has explored offloading KV cache to DRAM, these approaches either demand frequent GPU-CPU data transfers or impose extensive CPU computation requirements, resulting in poor GPU utilization as the system waits for I/O operations or CPU processing to complete. We propose ScoutAttention, a novel KV cache offloading framework that accelerates LLM inference through collaborative GPU-CPU attention computation. To prevent CPU computation from bottlenecking the system, ScoutAttention introduces GPU-CPU collaborative block-wise sparse attention that significantly reduces CPU load. Unlike conventional parallel computing approaches, our framework features a novel layer-ahead CPU pre-computation algorithm, enabling the CPU to initiate attention computation one layer in advance, complemented by asynchronous periodic recall mechanisms to maintain minimal CPU compute load. Experimental results demonstrate that ScoutAttention maintains accuracy within 2.4% of baseline while achieving 2.1x speedup compared to existing offloading methods.

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH2346

Weak-Form Physics-Informed Neural Network for Self-Supervised Learning in Semiconductor Device Simulation

Hanshi Yang; Tianrun Yang; Quan Zhao; Yadong Wei; Ying Ma

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

Technology Computer-Aided Design (TCAD) simulation of semiconductor devices is time-consuming. Existing machine learning surrogate models offer acceleration but require extensive labeled data from these slow TCAD simulations. To address this data dependency issue, we propose the Weak-Form Physics-Informed Neural Network (WF-PINN), a self-supervised method driven solely by physical laws and Dirichlet boundary conditions. WF-PINN utilizes an integral-based weak-form loss to eliminate the need for internal labeled solution data. Experimental results show that WF-PINN effectively simulates the physical characteristics of Fin Field-Effect Transistors, with all solution errors less than 5.36×10^(−2). Furthermore, WF-PINN's inference speed is 2.16×10^(4) times that of TCAD simulation.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH2347

METRIC: An MDS-Analog ECC with Minimum Two-Column Redundancy for Fault-Tolerant, Layer-Aware IMC Architecture

Jiajun Gao; Chengping He; Haozheng Wan; Zhanzhi Liu; Mingrui Jiang; Can Li

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

Neural networks from CNNs to LLMs achieve remarkable success, yet analog computing-in-memory (CIM) using memristor crossbars suffers from computational inaccuracies caused by device defects and circuit noise. While small errors can be tolerated, significant outliers must be corrected. We propose MECA-CIM, a Maximum Distance Separable (MDS) error correction framework requiring only two redundancy columns. Leveraging Hessian trace-based sensitivity quantification, we introduce layer-wise adaptive protection allocation. Our analog-domain syndrome decoder enables efficient error detection and correction with minimal latency. Experimental results demonstrate robust outlier resistance across CNNs and Transformers, achieving large area-energy-delay product (AEDP) reduction compared to conventional ECC systems.

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH2348

vTen: Tensor-Centric Verification Framework for Domain-Specific Accelerators

Chanmin Baek; Keehyuk Lee; Mincheol Cha; Somi Hong; Xuan Truong Nguyen; Hyuk-Jae Lee

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

The semantic gap between tensor-centric software models and signal-level hardware testbenches creates significant productivity bottlenecks in verifying Domain-Specific Accelerators (DSAs). Existing frameworks like UVM and Cocotb suffer from prohibitive synchronization overheads due to fine-grained interactions. To address this, we propose vTen, a data-centric framework that strictly decouples verification intent from execution mechanics. By leveraging a declarative DSL and kernel-granular batching, vTen minimizes host-simulator interaction frequency. Evaluation on a production-scale 3D U-Net accelerator demonstrates that vTen achieves a 2x performance improvement in simulation latency and a 60.3% reduction in code complexity compared to Cocotb.

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH2358

LightningFNO: Ultra-Fast Modeling and Inverse Design for Photonic Matrix-Vector Multiplication Computing Units

Mingyuan Li; Qisheng Yang; Yichi Zhang; Xiaoxuan Wu; Guochang Lin; Tian-ling Ren

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

Metasurface-based photonic computing delivers ultra-high computing power density and enables ultrafast, low-power MVM. However, inverse design and system-level simulation remain prohibitively expensive due to large-scale Maxwell solvers. We present LightningFNO, a dual Fourier Neural Operator (FNO) framework that unifies inverse design and fast forward modeling for photonic MVM units. Within this framework, an inverse FNO generates candidate topologies, and a convolutional FNO subsequently predicts their optical responses, thereby facilitating deployment at the neural network level. LightningFNO achieved a speedup of 1,245,550 times compared to adjoint-based optimization for 4x4 designs while maintaining comparable accuracy. Furthermore, fabricated 3x3 prototypes demonstrated a root mean square error (RMSE) of 0.043, and the system attained 97.69% accuracy on the MNIST dataset using metasurface-deployed kernels.

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH2360

Hierarchical Pareto-Prompted Optimization for Probabilistic DAG Scheduling on Real-Time Systems

Yili Guo; Weijie Wang; Xingchen Liu; Zuoyan Qin; Liangliang He; Weichen Liu; Wanli Chang

Date:Monday, July 27 Location:Mtg Room 203C Session:Streamlining Silicon: From Real-Time Scheduling to Efficient GPU Execution +1
Abstract

Real-time automotive workloads modeled as directed acyclic graphs (DAGs) with probabilistic execution times face severe latency bottlenecks, where shared tasks couple multiple sensor-to-actuator paths. Minimizing end-to-end latency under such variability requires navigating a complex joint space of task mappings and offsets. To address this, we propose a Pareto-prompted metaheuristic with hierarchical mapping-then-offset refinement, utilizing Pareto dominance to search for non-dominated schedules. Evaluated on production navigation-on-autopilot workloads running on NVIDIA Orin-N–based vehicles, our method satisfies all constraints and significantly outperforms an industrial static baseline, reducing mean latency by up to 12% and 99th-percentile latency by up to 15%.

SystemsSYS4. Embedded System Design Tools and MethodologiesChipletEDA
RESEARCH2361

SVACoder: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

Yutong Wu; Chenrui Cao; Pengwei Jin; Di Huang; Rui Zhang; Xishan Zhang; Zidong Du; Qi Guo; Xing Hu

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of real-world SVA corpora and the lack of methods to promote the NL-SVA semantic equivalence. For the former, large-scale RTL code is used to guide LLMs to generate real-world SVAs; for the latter, bidirectional NL-SVA translation maintains semantic consistency. With the synthesized data, we train SVACoder, a series of SVA generation models. Notably, SVACoder-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2364

Adaptive Multiply-Accumulate Unit with Scalable Critical Path for Aggressive Voltage Underscaling DNN Accelerators

Tongjing Wu; Siting Liu; Tong Li; Hui Wang; Zhigang Mao; Jie Han; Honglan Jiang

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

Deep neural network (DNN) accelerators have been investigated for efficient inference. Underscaling the supply voltage of MACs can effectively reduce power dissipation. In this paper, we propose a bit-width adjustment circuit (denoted as ADA) for arbitrary MAC units under aggressive voltage underscaling. Incurring 16% area overhead, the proposed ADA enables a MAC array to achieve zero accuracy loss. While preserving accuracy, the MAC array equipped with ADA achieves up to 48% power reduction compared to that without ADA. Furthermore, we propose ADA-Plus to optimize the MACs in output stationary systolic arrays, which reduces the area of ADA by 23%.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH2369

ReConnaissance: Correctly Tracking Multiple Secrets Using Reveal/Conceal Paradigms

Amund Bergland Kvalsvik; Magnus Sjalander

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

Secure speculation schemes are critical for defeating speculative side-channel attacks, but often incur notable performance penalties, encouraging research into optimization strategies. One such optimization strategy, ReCon, improves performance by exploiting previous non-speculative data leakage to remove protections for already leaked data. However, we find that ReCon can leak secrets when an instruction depends on multiple sources of potentially secret data. We show how interactions between Speculative Taint Tracking (STT) and ReCon result in ReCon undermining STT's security guarantees. We address this oversight through a low-impact multi-source tracking mechanism that we call ReConnaissance. Together with a detailed microarchitecture, we show that, on average, ReConnaissance only lowers the overhead reduction of ReCon from 23.5% to 18.3% on SPEC2017.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH2372

TsetlinWiSARD: On-Chip Training of Weightless Neural Networks Using Tsetlin Automata on FPGAs

Shengyu Duan; Marcos Sartori; Rishad Shafik; Alex Yakovlev

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

The growing need for environmental adaptability, privacy, and security in edge applications drives the search for alternatives to deep learning that are both computationally efficient and hardware-friendly. WiSARD Weightless Neural Networks (WNNs) meet this need through simple table lookups, offering low latency and minimal computation. In this work, we propose TsetlinWiSARD, an FPGA-based on-chip training architecture for WiSARD WNNs that leverages Tsetlin Automata (TAs) to enable probabilistic, feedback-driven learning. By mapping logical LUTs and TAs onto physical LUTs on FPGAs, TsetlinWiSARD achieves state-of-the-art accuracy, over 1000× faster training, and reduced resource use (22%), latency (93.3%), and power (64.2%).

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH2374

MXP: A Posit-Inspired Microscaling Format for Vision-Language Tasks

Dohyun Kim; Byungkuk Yoon; Jehun Lee; Juheun Lee; Jae-Joon Kim

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

Group-wise quantization is an effective strategy for low-bit inference, where a shared scale is assigned to each smaller group of tensor values. However, existing formats struggle to capture the diverse value distributions found across modern workloads. In this work, we introduce MXP, a posit-inspired microscaling format that incorporates tapered precision to dynamically adjust exponent and fraction bits at the element level. MXP achieves an optimal balance between precision and dynamic range, thereby reducing quantization errors. We further develop a hardware accelerator tailored to MXP. Experimental results demonstrate that MXP incurs less performance degradation, while reducing area and power consumption.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2378

UBFP-IMC: Ultralarge Bit-Width Digital Floating-Point In-Memory Computing Accelerator for Scientific Computing

Yongjian Xu; Biwei Liu; Tianbo Ming; Zhenyu Wu; Yuanxi Peng

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

SRAM-based digital in-memory computing excels in low-precision AI, but high-precision floating-point for scientific computing faces challenges: (1) ultralarge mantissa increases area and reduces MAC speed; (2) alignment and accumulation impose excessive peripheral circuit overhead. We propose an ultralarge bit-width bit-serial mantissa MAC with 5-signal Booth encoding and two-step carry-save adder, and exponent-aligned accumulation module with exponent difference generator and FIFO adder tree. In 28 nm, UBFP-IMC with 168 MAC units and one accumulation module achieves 4.43×, 4.46×, 4.30×, and 8.53× EEF improvement, 27× lower accumulation latency, and outstanding area advantage, which provide excellent scalability, demonstrating potential for accelerating scientific computing.

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH2385

UNCHARTED: A Universal Circuit Design Hierarchical Model Based on Auto-Regressive Spatio-Temporal Graph Neural Network with Exogenous Driving Inputs

Xinyu Chen; Wanyeong Jung

Date:Monday, July 27 Location:Mtg Room 202AB Session:Generative AI and Graph-Based Learning for Next-Gen Circuit & Device Simulation +1
Abstract

Analog circuit design is time-consuming and heavily dependent on the expertise of human designers. This paper presents a universal circuit design hierarchical model (UNCHARTED). It is the first machine learning-based framework that emulates transient simulation by modeling physical behaviors of individual circuit components through the graph neural network. The sizing of circuit elements is reflected through low-rank adaptation, and inverse design is achieved by the back-propagation mechanism. UNCHARTED can accelerate simulation by up to 241.9x, with high similarity (>95%) to SPICE simulation in more than 70% of tested circuits. Inverse design examples of a delay chain and an amplifier also demonstrate its powerful capabilities in automatic design optimization.

EDAEDA6. Analog CAD, Simulation, Verification and TestChipletSYS4. Embedded System Design Tools and Methodologies
RESEARCH2394

B-Flex: Exploration of Broader Flip-Flop Design Space Based on FSM Exhaustive Search

Hyunsung Jeong; Kyounghun Kang; Wanyeong Jung; Jongeun Lee

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

Flip-flops (FFs) are a critical component affecting system-level power, performance, and area (PPA). Many logic-based FF design methods have been introduced to expand the scope of topology exploration beyond intuition. However, their exploration scope is still limited to 2-bit finite state machines (FSMs) because of inefficient search space representation. We present B-Flex, an automated FF FSM search that integrates a complete, graph-based behavioral equivalence check into a pruning-based generative search. This efficient and scalable approach vastly expands the design space and allows exhaustive exploration of 3-bit-state FSMs. B-Flex has identified over 568 million valid FF mechanisms, including many novel 3-bit-state FF designs. Topology synthesis on 20 sampled FSMs has produced several high-performance FF circuits that outperform conventional FFs. For instance, FF1 achieves a 2.53× speedup over a transmission-gate FF (TGFF) and an improved metastability window at 0.9 V.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2395

Hestia: Hyperthread-Level Scheduling for Cloud Microservices with Interference-Aware Attention

Dingyu Yang; Fanyong Kong; Jie Dai; Shiyou Qian; Shuangwei Li; Jian Cao; Guangtao Xue; Gang Chen

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

Modern cloud servers routinely co-locate multiple latency-sensitive microservice instances to improve resource efficiency. However, the diversity of microservice behaviors, coupled with mutual performance interference under simultaneous multithreading (SMT), makes large-scale placement increasingly complex. Existing interference-aware schedulers and isolation techniques rely on coarse core-level profiling or static resource partitioning, leaving asymmetric hyperthread-level heterogeneity and SMT contention dynamics largely unmodeled. We present Hestia, a hyperthread-level, interference-aware scheduling framework powered by self-attention. Through an extensive analysis of production traces encompassing 32,408 instances across 3,132 servers, we identify two dominant contention patterns, sharing-core (SC) and sharing-socket (SS), and reveal strong asymmetry in their impact. Guided by these insights, Hestia incorporates (1) a self-attention-based CPU usage predictor that models SC/SS contention and hardware heterogeneity, and (2) an interference scoring model that estimates pairwise contention risks to guide scheduling decisions. We evaluate Hestia through large-scale simulation and a real production deployment. Hestia reduces the 95th-percentile service latency by up to 80%, lowers overall CPU consumption by 2.3% under the same workload, and surpasses five state-of-the-art schedulers by up to 30.65% across diverse contention scenarios.

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH2410

CAPA: Convertibility-Aware Mixed-Precision Accelerator for Asymmetric GEMM in Weight-Only Quantized LLMs

Jihyun Moon; Insu Choi; Seungchan Lee; Joon-Sung Yang

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

FP-INT GEMM accelerators for weight-only quantized large language models suffer from large partial sum (psum) overhead and a lack of precision-convertibility, limiting scalability. To mitigate this, we propose CAPA, an accelerator achieving high efficiency and versatility. (1) CAPA introduces Hybrid Delta Block Floating Point (HDBFP), a novel INT-based format that reduces psum overhead, preserves accuracy, and integrates BF16/FP16 arithmetic. (2) CAPA adopts Weight Decomposition to unify INT8 and INT4 arithmetic, enabling parallel processing. We demonstrate CAPA attains identical GPU accuracy, improving area efficiency by 3.98x and power efficiency by 4.38x over the FP-FP baseline.

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2416

Area-Optimal and Routability-Driven Layout Synthesis for Multi-Row Complementary-FET Standard Cells

Hao Wu; Keyu Peng; Fuxing Huang; Qiyuan Chen; Zhengzhe Zheng; Jianli Chen; Ziran Zhu

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

As conventional FinFET architectures encounter severe scaling limitations, Complementary-FET (CFET) technology with vertically stacked PMOS and NMOS transistors has emerged as a promising solution for continued standard cell density scaling. However, aggressive area compaction in CFET standard cells drastically limits intra-cell routing resources, leading to routing congestion and design rule challenges. To mitigate this issue, multi-row CFET standard cell architectures have been introduced to improve intra-cell routability and alleviate block-level congestion. Nevertheless, these multi-row configurations introduce new placement-routing coupling and design rule complexities, making it challenging to achieve compact, DRC-clean, and routable layouts. Therefore, this work proposes an area-optimal and routability-driven layout synthesis framework for multi-row CFET cells, which effectively addresses the challenges of area efficiency, constrained pin accessibility, and DRC compliance under multi-row CFET architectures. The framework's third component is a two-stage Satisfiability Modulo Theories (SMT)-based routing flow consisting of a Multi-Commodity Flow (MCF)-based routability-guaranteed pin-access selection and an Integer Linear Programming (ILP)-enhanced hierarchical routing to ensure DRC/LVS closure. Compared with state-of-the-art multi-row CFET cell generators, experimental results show that our algorithm consistently achieves the optimal layout area, while delivering significant improvements in solution quality and efficiency.

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH2417

HBSpec: A Hybrid-Bonding-Based Heterogeneous Accelerator for Efficient Tree-Structured Speculative Decoding

Runze Wang; Qinggang Wang; Haifeng Liu; Xinyu Zhu; Chenggang Duan; Long Zheng; Xiaofei Liao; Hai Jin; Jingling Xue

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

We propose HBSpec, a hybrid-bonding (HB)-based heterogeneous accelerator for tree-based speculative LLM inference. Unlike existing near-memory processing (NMP)-enabled heterogeneous accelerators that place processing elements (PEs) on DRAM dies with limited computation capability, HBSpec customizes PEs on the incorporated logic die to enhance NMP computation capacity. HBSpec is also equipped with a communication-aware data mapping policy to reduce cross-bank access overhead, an arithmetic intensity-aware scheduler to dynamically assign operators to the most suitable hardware units, and a Branch-KV pool to eliminate memory fragmentation. Experimental results show that HBSpec outperforms NPU-only, NPU+LPDDR, and NPU+HB baselines by 15.56x, 2.36x, and 2.13x, respectively.

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2424

XRel-Graph: Graph Learning-Driven Cross-Layer Reliability Management in Embedded Mixed-Criticality Systems

Behnaz Ranjbar; Siva Satyendra Sahoo; Rohit Yalavarthy; Akash Kumar

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

Ensuring real-time and safety requirements is essential in mixed-criticality systems. While some studies have explored applying fault-tolerance techniques in multiple abstraction layers separately, such isolated, layer-specific fault mitigation incurs high overheads in peak power, energy, and timing. In this regard, we propose a novel scheme to address cross-layer reliability in mixed-criticality systems at design time, by providing application-specific, low-cost fault tolerance. This approach distributes fault mitigation across multiple layers to minimize it on the hardware layer, ultimately resulting in more cost-effective system designs. To achieve this, we introduce a machine-learning-based approach to select efficient fault-tolerance methods across multiple layers, reducing overheads while meeting real-time and reliability requirements.

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH2425

COM-BNN: A Configurable Low Power MRAM-Based Computing-in-Memory Accelerator for Binary Convolutional Computations

Keni Qiu; Can Gao; Xuejin Li; Chuting Xu; Li Su; Kaiwei Zou; Yongpan Liu

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

In resource-constrained tiny edge devices like MCUs, deploying data/compute-intensive AI like CNNs is challenging. Compute-in-Memory (CIM) and Binary Neural Networks (BNN) offer promising solutions by reducing memory access and simplifying computations. We propose RES-BNN, a lightweight, reconfigurable accelerator integrating MRAM-based CIM with serial BNN computation. It introduces a temporal reconfigurable input mechanism and an Integrate-and-Fire adder for serial MAC operations, plus an Exact Result Inferring method for efficient binary convolutional computations. RES-BNN achieves up to 85% energy and 83% power reduction over ADC-based CIM baselines, and enables dynamic latency-power tradeoffs, dropping power to 0.03% in serial mode for edge adaptability.

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH2432

Hybrid PIM-Oriented Architecture-Dataflow Co-Design for Heterogeneous LLM Inference

Hai Huang; Feng Qiu; Xiang Chen; Weisheng Zhao; Chenchen Liu

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

Recent studies show that incorporating Processing-in-Memory (PIM) into neural processing units (NPUs) can significantly improve the efficiency of large language model (LLM) inference on edge devices. However, the limited compute capability of DRAM-PIM and the use of fixed dataflow strategies still constrain overall performance gains. To address these limitations, we propose a hybrid PIM architecture that integrates non-volatile spin-orbit torque magnetic random-access memory (SOT-MRAM)-based PIM into existing NPU-PIM systems, fully exploiting the high compute capability and bandwidth of SOT-MRAM together with the large storage capacity of DRAM. On top of this architecture, we further design an adaptive compute scheduling and dataflow optimization framework. Using NSGA-II-based multi-objective dataflow space exploration, our framework identifies Pareto-optimal hardware resource allocations and dataflow configurations under different deployment objectives. Experimental results show that our approach reduces inference latency by up to 8.72× and power consumption by 11.74× compared to NPUs, and further achieves 6.24× lower latency and 7.76× higher power efficiency than NPU-PIM baselines.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2443

Stitch: Assertion-Guided Patching of On-Chip Protocol Implementations Using LLMs

Melisande Zonta-Roudes; Nora Hinderling; Shweta Shinde

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Verification flows use Verification IPs (VIPs) to identify assertion violations. This is well-suited for on-chip protocols (e.g., AXI, AHB) with standard specifications (e.g., AMBA) for developers to write assertions that detect violations. Patching these violations, however, still requires manual effort. A good patch must not only fix the violation but also preserve functionality (e.g., pass the testsuite) without causing new violations. We propose Stitch, based on the intuition that, given an implementation, a violated assertion, and a counterexample, an LLM can try to synthesize patches. To evaluate Stitch, we develop a new dataset that comprises 100 violations in 11 implementations of 5 protocols (AXI, AHB, APB, Wishbone, and TileLink). Our first experiment reports a 19% success rate across 4 LLMs (GPT4, GPT5, Gemini, and Claude). By analyzing the failed cases, we devise 3 improvement strategies: patch localization, violation-specific context (cone of influence, counterexample), and iterative feedback from model checker outputs over candidate patches. Our experiments show these strategies help Stitch achieve 61% patch success, with GPT5 dominating with 56%. We validate Stitch patches through simulation and on real hardware. We compare Stitch to 3 state-of-the-art non-LLM tools and existing patches.

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2468

PillarSparse: Rethinking Unstructured Sparse Formats for Tensor Cores

Junyu Gu; Jue Wang; Zhikuang Xin; Zhiqiang Liang; Zongguo Wang; Hongyu Gao; Peng Di; Yangang Wang

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

Accelerating SpMM and SDDMM on Tensor Cores is vital for GNNs and HPC. Current methods suffer from severe bottlenecks, including multi-level pointer chasing, lack of metadata coalescing for vectorization, and high format conversion overhead. We propose PillarSparse with the Pillar CSR format. It organizes non-zeros into "Pillar" (non-zero column) units and uses metadata coalescing to pack all metadata into two integers. This design removes a pointer-chasing level, enables a single vectorized load, and allows a high-speed conversion algorithm using Thrust primitives. The co-designed SpMM and SDDMM kernels use warp specialization producer-consumer pipeline and tensor memory accelerator for efficient execution. Experiments show competitive performance.

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH2476

FlowPlace: Flow Matching for Chip Placement

Peng Xie; Ke Xue; Yunqi Shi; Ruo-Tong Chen; Chengrui Gao; Siyuan Xu; Chenjian Ding; Mingxuan Yuan; Chao Qian

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Chip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based solutions, current methods have the following limitations: they use random synthetic data for pre-training, require long sampling times, and often result in overlaps due to their dependence on gradient-based solvers during the sampling process. To overcome these issues, we propose FlowPlace, which features mask-guided synthetic data generation, flow-based efficient training with flexible prior injection, and hard constraint sampling for overlap-free layouts. Experiments on OpenROAD and ICCAD 2015 benchmarks show FlowPlace achieves better PPA metrics, 10-50x faster sampling efficiency, and zero overlaps.

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH2481

SCOUT: Thermal-Aware SRAM Allocation for Real-Time DNN Tasks on Edge TPU

Changhun Han; Chihun Choi; Namseok Lee; Kilho Lee; Hoon Sung Chwa; Sangeun Oh

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Predictable Precision: Resilience and Timing in the Era of AI +1
Abstract

Specialized hardware accelerators facilitate timing guarantees for real-time DNNs, but high computational demands cause severe thermal stress. We present SCOUT, a real-time framework for Google Edge TPU that jointly satisfies timing/thermal constraints. The key idea couples task-level dynamic frequency scaling (DFS) with runtime on-chip SRAM allocation to offset DFS-induced performance loss. To this end, SCOUT provides i) a thermal model that reflects SRAM's thermal impact and ii) an SRAM reallocation technique that safely adjusts allocations while mitigating write overheads. Our experiments show SCOUT schedules up to 46% more DNN task sets than prior approaches, while adhering to thermal constraints.

SystemsSYS6. Time-Critical and Fault-Tolerant System DesignAIAI5-I. AI/ML System and Platform Design
RESEARCH2489

Nash: A Neighbor-Aware Shared Memory Design on GPU for Accelerating AI Workloads

Hanqing Li; Tiejun Li; Sheng Ma; Jianmin Zhang; Hanzhi Xun; Lizhou Wu

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

Driven by the dominance of AI workloads and the resulting pressure on in-core SRAM, shared memory serves as a critical intermediate buffer in GPUs. We observe substantial data redundancy, with up to 60.2% of subsequent memory requests reaccessing data already loaded into shared memory. This redundancy remains largely underexplored, increasing bandwidth demands. We propose Nash, a lightweight mechanism that tracks data provenance via a Source Location Table (SLT) to detect and redirect redundant requests within or across neighboring SMs. Cycle-accurate simulation shows Nash achieves up to 18.3% performance improvement, 9.8% energy savings, and 25.3% reduction in global memory traffic.

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH2495

ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

Hangyeol Lee; Joo-Young Kim

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal practical acceleration. To overcome these limitations, we propose ORBIS, an SW–HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2504

EASY-ZKP: An End-to-End FPGA-Accelerated System for Zero Knowledge Proofs

Yuanming Song; Shiyao Li; Lianghao Chu; Shiqing Li; Wei Zhang; Lei Ju

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

Zero-knowledge proofs (ZKPs) enable parties to prove possession of information without disclosure, strengthening privacy and security. However, ZKP proof generation suffers from high computational and memory overheads. We present EASY-ZKP, an end-to-end FPGA-accelerated ZKP system with multi-scalar multiplication (MSM) and number theoretic transform (NTT) architectures that improve performance and balance resources for efficient FPGA co-deployment. We further develop an automated design space exploration framework that minimizes latency under resource constraints. Prototyped on a Xilinx Alveo U280, EASY-ZKP achieves up to 19.5× speedup over a CPU implementation and up to 7.7× better energy efficiency than a GPU implementation.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH2512

Dropping Multiple Literals per SAT Call in IC3 Model Checking

Tianjun Bu; Zhichao Lyu; Qiusong Yang

Date:Wednesday, July 29 Location:Mtg Room 202AB Session:From SAT to IC3: Modern Proof Techniques for Hardware Correctness +1
Abstract

IC3 is the state-of-the-art model checking algorithm, and its most computationally expensive step is generalizing cubes by dropping literals one by one. We propose multi-literal drop strategies that eliminate two or more literals simultaneously. A successful n-drop saves n−1 SAT invocations. To mitigate performance losses from failures, we introduce deduction mechanisms that analyze counterexamples to generalization and identify non-droppable literals early. With these, failed multi-drop attempts are sometimes as useful as conventional single-drops. Additionally, we analyze the diminishing returns of higher-order drops. Our implementation on ABC solves 28 unique cases and 16 more cases than vanilla ABC, and our implementation on rIC3 runs 6% faster.
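
Illustrative sketch (not from the paper): the call-count saving of a multi-literal drop can be seen in a toy generalization loop. The function name `generalize`, the oracle `still_blocked` (a stand-in for the SAT query), and the fallback to single drops after a failed group are assumptions made for illustration; the deduction mechanisms described in the abstract are not modeled.

```python
from typing import Callable, FrozenSet, Tuple

Cube = FrozenSet[int]


def generalize(cube: Cube,
               still_blocked: Callable[[Cube], bool],
               n: int = 2) -> Tuple[Cube, int]:
    """Drop literals from `cube`, trying `n` literals per oracle call.

    `still_blocked` is a hypothetical stand-in for the SAT call that checks
    whether the reduced cube is still blocked. Returns the generalized cube
    and the number of oracle calls issued.
    """
    calls = 0
    current = set(cube)
    pending = sorted(cube)
    while pending:
        group, pending = pending[:n], pending[n:]
        calls += 1
        if still_blocked(frozenset(current - set(group))):
            current -= set(group)      # whole group dropped with one call
            continue
        if len(group) == 1:
            continue                   # this single literal cannot be dropped
        # Failed multi-drop: fall back to one call per literal in the group.
        for lit in group:
            calls += 1
            if still_blocked(frozenset(current - {lit})):
                current.discard(lit)
    return frozenset(current), calls
```

With a toy oracle in which the cube stays blocked as long as literals 1 and 2 remain, an 8-literal cube needs 8 calls with n=1 and 6 calls with n=2 in this sketch; the failed group costs extra calls, which is the loss the abstract's deduction mechanisms aim to reduce.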

EDAEDA2. Design Verification and ValidationAIAI5-I. AI/ML System and Platform Design
RESEARCH2525

Coordinated Clip-Wise Gradient Scheduling for Full-Chip ILT via Policy Learning

Ziyang Yu; Shuo Yin; Su Zheng; Xiaoxiao Liang; Yuzhe Ma; Bei Yu

Date:Tuesday, July 28 Location:Mtg Room 202AB Session:Differentiable Dreams: GPU-Powered Litho Gets Its Gradient Groove Back +1
Abstract

Full-chip inverse lithography technology (ILT) is critical for semiconductor manufacturing, but maintaining global solution integrity remains difficult. While gradient fusion flow mitigates mask stitching artifacts, it suffers from a uniform, unweighted update policy. We identify heterogeneous smoothness and inter-clip coupling as key factors that create conflicts between local clip convergence and global integrity. We propose a coordinated clip-wise gradient scheduling framework trained via policy learning to resolve these conflicts. The method constructs a full-chip state by combining per-clip static geometric descriptors and dynamic optimization signals, and aggregates them with a graph neural network to encode inter-clip relations. From this state, a scheduler generates continuous, correlated gradient weights learned with flow-matching policy gradients, capturing cross-clip dependencies. On industry-scale layouts, the approach outperforms state-of-the-art full-chip ILT baselines.

EDAEDA8. Design for Manufacturability and ReliabilityQuantumDES3. Emerging Models of Computation
RESEARCH2531

PRISM: Dynamic Primitive-Based Forecasting for Large-Scale GPU Cluster Workloads

Xin Wu; Fei Teng; XingWang Li; Bin Zheng; Qiang Duan

Date:Wednesday, July 29 Location:Mtg Room 101A Session:The Edge-Lord Chronicles: Tiny Gestures, Big LLMs, and Cluster Chaos +1
Abstract

Accurately forecasting GPU workloads is essential for AI infrastructure, enabling efficient scheduling, resource allocation, and power management. Modern workloads are highly volatile, multi-periodic, and heterogeneous, making them challenging for traditional predictors. We propose PRISM, a primitive-based compositional forecasting framework combining dictionary-driven temporal decomposition with adaptive spectral refinement. This dual representation extracts stable, interpretable workload signatures across diverse GPU jobs. Evaluated on large-scale production traces, PRISM achieves state-of-the-art results. It significantly reduces burst-phase errors, providing a robust, architecture-aware foundation for dynamic resource management in GPU-powered AI platforms.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH2533

Beyond Fuzzer Islands: CPU Fuzzing via Smart Coordination

Yuchen Hu; Jialin Sun; Yushu Du; Renshuang Jiang; Ning Wang; Weiwei Shan; Xinwei Fang; Xi Wang; Nan Guan; Zhe Jiang

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Despite recent progress, every CPU fuzzer explores only a narrow slice of the vast micro-architectural state space due to its fixed mutation and feedback biases. While different fuzzers thus excel in disjoint regions, naïve combination fails because of conflicting strategies, seed pollution, and early saturation. LiFU introduces micro-architecture-aware orchestration that dynamically profiles heterogeneous fuzzers, detects complementary strengths, reduces harmful interactions, and steers each fuzzer in real time using coverage and bug feedback, augmented by semantic seed triage. Evaluated on the BOOM core, LiFU achieves 93.4% line, 43.6% FSM, and 95.7% condition coverage (90.1%, 75.0%, 78.3% on Rocket) with around 40% fewer tests than the best standalone fuzzer, consistently closing long-standing verification gaps in modern CPUs.

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2536

NetDeTox: Adversarial and Efficient Evasion of Hardware-Security GNNs via RL-LLM Orchestration

Zeng Wang; Minghao Shao; Akashdeep Saha; Ramesh Karri; Johann Knechtel; Muhammad Shafique; Ozgur Sinanoglu

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

Graph neural networks (GNNs) are widely used in hardware security but vulnerable to adversarial netlist rewrites. Existing adversarial approaches incur high overheads. We present NetDeTox, an automated framework combining large language models (LLMs) with reinforcement learning (RL) for efficient rewriting. The RL agent identifies GNN-critical components while the LLM devises functionality-preserving rewrites that diversify motifs. Iterative feedback minimizes overheads. Compared to state-of-the-art AttackGNN, NetDeTox degrades all security schemes with fewer rewrites and substantially lower area overheads (54.50%, 25.44%, and 41.04% reductions for GNN-RE, GNN4IP, and OMLA). For larger circuits, NetDeTox even reduces original area, demonstrating practical scalability.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH2537

Crosstalk Timing Prediction via Graph Prompt Learning in Aggressor-Victim RC Networks

Yuren Zhou; Fan Yang; Zihao Zeng; Yifan Wang; Yuyang Ye; Qing He

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

With continued technology scaling, crosstalk effects have become critical for timing closure. Existing graph learning-based timing models fail to accurately predict crosstalk timing due to limited modeling of coupling behaviors and multi-net interactions. We introduce a graph prompt learning framework featuring a dual-level prompt architecture. Node-level prompts capture fine-grained coupling effects while subgraph-level prompts model holistic multi-net interactions, effectively adapting pre-trained models. Experimental results demonstrate precise crosstalk timing prediction, consistently outperforming existing methods across benchmark designs.

EDAEDA3. Timing Analysis and OptimizationAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2540

Highly-Parallel Atom-Detection Accelerator for Tweezer-Based Neutral Atom Quantum Computers

Jonas Winklmann; Yian Yu; Xiaorang Guo; Korbinian Staudacher; Martin Schulz

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

Neutral atom quantum computers (NAQCs) are among the most promising computational platforms for quantum computing. Controlling and measuring individual atoms and their states, which often requires multiple imaging and image-analysis procedures, is typically the most time-consuming task during computation and contributes significantly to overall cycle times. To resolve this challenge, we propose a highly-parallel atom-detection accelerator for tweezer-based NAQCs. Our design builds on an existing state-reconstruction method and combines an algorithm-level optimization with a Field Programmable Gate Array (FPGA) implementation to maximize parallelism and reduce the run time of the image-analysis process. We identify and overcome several challenges for an FPGA implementation, such as introducing a prefetching mechanism to improve scalability and customizing bus transfers to support large bandwidths. Tested on a Xilinx UltraScale+ FPGA, our design can analyze a 256×256-pixel fluorescence image in just 115 μs, achieving 34.9× and 6.3× speedups over the original and optimized CPU baseline, respectively. Moreover, our accelerator can maintain consistent resource utilization across various atom array sizes, contributing to the ongoing efforts toward scalable and fully integrated FPGA-based control systems for NAQCs.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH2545

Integrating LLMs into a Two-Tier Traditional Methodology for Automated Verilog Code Generation

Alex Lu; Bilal El Jamal; Alex Doboli

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

Devising a design specification in a Hardware Description Language, like Verilog or VHDL, is the first step in the automated design of hardware circuits and systems. Traditionally, design specification has been a manual process, thus cumbersome and error-prone. To address these limitations, this paper proposes a methodology to automatically generate Verilog code for a problem description in natural language. It combines the capabilities of Large Language Models, Retrieval Augmented Generation, classifiers, heuristics, simulation, and reasoning to understand, adaptively prompt, and elaborate, in addition to traditional learning, as a means to bridge the semantic gaps in code generation. The methodology integrates the top-down code generation part with the bottom-up functional correctness and performance evaluation, e.g., code reviews, design parallelism, critical path, and hardware resource sharing. Experiments compare the proposed methodology with state-of-the-art Verilog/VHDL code generation methods.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2550

MemFlow: Characterizing Memory Traffic Contention in Disaggregated Architectures

Fujun He; Tuo Fang; Huaxiang Cai; Baolong Cui; Zhihe Zhang; Yishan Yao; Xiabing Li; Linyi Du; Chuyue Ye; Pingyi Huo; Bin Gao; Yiming Zhang; Pengfei Zheng; Yunfei Du

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

Scale-up networks support memory semantics between pooled, disaggregated processors and memories; meanwhile, they trigger complex any-processor-to-any-memory accesses and create intricate contention among micro-architectural components. Nevertheless, existing simulators fail to adequately characterize the interference due to oversimplified memory access paths. This paper presents the MemFlow simulator. MemFlow features a modular, fine-grained hardware abstraction of disaggregated memory systems, provides a flow-centric simulation scheme to trace the complete memory access path from load-store queues through cache hierarchies to memory controllers, evaluates micro-architectural contention with detail and fidelity, and enables automatic design space exploration to mitigate contention-induced performance degradation.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH2560

How Can Reinforcement Learning Achieve Expert-Level Placement?

Ruo-Tong Chen; Ke Xue; Chengrui Gao; Yunqi Shi; Tian Xu; Peng Xie; Siyuan Xu; Mingxuan Yuan; Chao Qian; Zhi-Hua Zhou

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and they therefore often fail to achieve expert-quality layouts. We identify the reward design as the primary cause of the performance gap with experts, and instead of formalizing intricate processes, we circumvent this by directly learning from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH2566

Path-Based Timing Analysis Acceleration via Segment-Level Timing Arc Reuse

Yuyang Ye; Xiangfei Hu; Leyun Tian; Chang Meng; Qing He; Longxing Shi

Date:Monday, July 27 Location:Mtg Room 202AB +1 Session:Pioneering AI and GPU-Powered Techniques for Advanced Timing Analysis and Optimization +1
Abstract

Path-Based Static Timing Analysis (PBA) offers high accuracy but incurs high runtime due to redundant computations. We present a segment-level reuse framework that accelerates PBA by reducing its inherent complexity. Timing paths are decomposed into multi-fanin-bounded segments, enabling delay reuse across structurally identical subpaths. A dual-indexed SegHashMap and a slew-sensitive model with Sobolev supervision ensure safe, pessimism-preserving reuse. Our segment-centric engine reconstructs end-to-end delays with minimal overhead. Experiments show 195× average speedup over parallel CPU PBA and 2–3× gains over GPU methods—highlighting complexity reduction as a promising new direction for scalable timing sign-off.
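
Illustrative sketch (not from the paper): the core idea of reusing delays across structurally identical subpaths can be shown with a plain hash-map cache keyed by segment. The names `SegmentCache` and `path_delay`, the per-cell delay table, and the additive delay model are all hypothetical; the paper's dual-indexed SegHashMap and slew-sensitive model are not reproduced here.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, ...]          # a segment is a tuple of cell names

# Made-up per-cell delays, used only to give the sketch something to compute.
CELL_DELAY: Dict[str, float] = {"INV": 1.0, "NAND": 1.5, "BUF": 0.8}


class SegmentCache:
    """Caches one delay per distinct segment so shared segments are evaluated once."""

    def __init__(self) -> None:
        self.delays: Dict[Segment, float] = {}
        self.evaluations = 0       # number of actual (non-cached) evaluations

    def delay(self, seg: Segment) -> float:
        if seg not in self.delays:
            self.evaluations += 1
            self.delays[seg] = sum(CELL_DELAY[cell] for cell in seg)
        return self.delays[seg]


def path_delay(path: List[Segment], cache: SegmentCache) -> float:
    """Reconstruct an end-to-end path delay from cached segment delays."""
    return sum(cache.delay(seg) for seg in path)
```

Two paths that share a leading segment trigger only three segment evaluations instead of four in this sketch, which is the kind of redundancy the abstract refers to.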

EDAEDA3. Timing Analysis and OptimizationAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2596

Exploiting Movable Logical Qubits for Lattice Surgery Compilation

Laura Herzog; Lucas Berent; Aleksander Kubica; Robert Wille

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Charting the Architecture of Tomorrow’s Fault Tolerant Quantum Machines by Harmonizing Codes to Systems +1
Abstract

Lattice surgery is among the leading schemes for fault tolerant quantum computing motivated by superconducting hardware. Conventional lattice surgery compilation schemes follow a place-and-route paradigm, where logical qubits remain statically fixed in space throughout the computation. In this work, we introduce a paradigm shift by exploiting movable logical qubits via teleportation during the lattice surgery CNOT gate. We propose a proof-of-concept compilation scheme leveraging these movements, which can substantially reduce the routed circuit depth. This demonstrates that movable logical qubits can be used even on hardware with static physical qubits. An open-source implementation will be made available on GitHub.

DesignQuantumDES6. Quantum ComputingDES3. Emerging Models of Computation +1 more
RESEARCH2620

DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models

You-Liang Huang; Xinhao Huang; Chengxi Liao; Zeyi Wen

Date:Wednesday, July 29 Location:Mtg Room 101A Session:The Edge-Lord Chronicles: Tiny Gestures, Big LLMs, and Cluster Chaos +1
Abstract

Existing work on large language model (LLM) decomposition mainly focuses on improving performance on downstream tasks but ignores the poor parallel inference performance that arises when scaling up the model size. To mitigate this performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It combines multiple optimizations to maximize performance while remaining compatible with state-of-the-art optimization techniques. Extensive experiments evaluate DeInfer's performance, and the results demonstrate its superiority, suggesting it can greatly facilitate the parallel inference of decomposed LLMs.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH2621

Minimizing the Number of Code Switching Operations in Fault-Tolerant Quantum Circuits

Erik Weilandt; Tom Peham; Robert Wille

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Charting the Architecture of Tomorrow’s Fault Tolerant Quantum Machines by Harmonizing Codes to Systems +1
Abstract

The fundamental challenge of noise in quantum computing necessitates encoding quantum information into logical spaces protected by quantum error-correcting codes (QECCs). However, no single code supports a fully transversal, and thus fault-tolerant, implementation of all gates required for universality. A promising approach is code switching, where logical information is transferred between QECCs that together enable a universal gate set. Since switching introduces overhead and increases error rates, minimizing such operations is crucial. This work presents an efficient min-cut–based algorithm for compiling circuits with the minimal number of code switches, providing the first automated approach to code-switching compilation.
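
Illustrative sketch (not from the paper): one simplified way to phrase switch minimization as a minimum s-t cut. The model is an assumption made for illustration: two codes "A" and "B", one node per gate, gates that are transversal in only one code pinned to the source or sink, and a unit-capacity edge between consecutive gates on the same logical qubit, so that each cut edge corresponds to one switch. The function names and the Edmonds-Karp max-flow routine are choices of this sketch, not the paper's algorithm.

```python
from collections import defaultdict, deque

BIG = 10**9  # effectively infinite capacity for pinned gates


def min_cut(cap, s, t):
    """Edmonds-Karp max-flow; by max-flow/min-cut duality this is the min-cut value."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:      # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        bottleneck, v = BIG, t
        while parent[v] is not None:          # find the path bottleneck
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:          # push flow, update residual capacities
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


def min_switches(gates, qubit_order):
    """gates: name -> "A", "B", or None (transversal in both, hypothetical).
    qubit_order: for each logical qubit, its gate names in time order."""
    cap = defaultdict(lambda: defaultdict(int))
    for name, code in gates.items():
        if code == "A":
            cap["S"][name] += BIG
        elif code == "B":
            cap[name]["T"] += BIG
    for seq in qubit_order:
        for a, b in zip(seq, seq[1:]):
            cap[a][b] += 1                    # cutting this edge = one code switch
            cap[b][a] += 1
    return min_cut(cap, "S", "T")
```

In this toy model, a gate that works in either code is placed on whichever side of the cut yields fewer switches.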

DesignQuantumDES6. Quantum ComputingDES3. Emerging Models of Computation +1 more
RESEARCH2631

ARC-SRAM: A Memory Subarray with Local Addressing and Reduced Access Energy

Ismail Bourhaeil; Geerten Verweij; Mahmood Naderan-Tahan; Dawit Abdi; Prashant Dubey; Dwaipayan Biswas; Georgi Gaydadjiev; Francky Catthoor; Said Hamdioui

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

Conventional SRAMs are designed for random access, resulting in significant energy waste under regular access patterns common in image processing, computer vision, deep learning, and dense linear algebra, to name a few. We propose an energy-efficient ARC-SRAM subarray architecture that integrates three synergetic energy reduction schemes: (1) address decoding energy reduction by relocating the address generation to the memory periphery; (2) read energy reduction by reducing wordline activity to one pulse per row and minimizing precharge activity to suppress half-selection energy; (3) write energy reduction by reusing the bitline charge across consecutive rows, and by minimizing wordline activation. We implement a 64KB instance of our ARC-SRAM using an advanced gate-all-around nanosheet 1.4 nm research process design kit. At the cell-array level, the design achieves a 63–76% energy reduction for reads and 68–72% for writes in loop-nest dominated applications, with only 4% area overhead.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH2637

From Block to Range: Rethinking the Fundamental Unit of Log-Structured File Systems for ZNS SSDs

Zhenhua Tan; Kan Zhong; Zhiwang Yu; Haorui He; Linbo Long; Duo Liu; Jingcheng Shen

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Scaling the Bedrock: OS-Hardware Co-Design for High-Reliability Storage +1
Abstract

Log-structured file systems (LFSs) have shown promising potential on emerging Zoned Namespace (ZNS) SSDs. However, current ZNS-enabled LFSs still follow the block-based abstraction inherited from traditional SSDs. This abstraction enforces block-aligned writes and triggers costly read-modify-write operations for unaligned writes. Since ZNS SSDs natively support arbitrary-sized writes, such block alignment is unnecessary and instead results in significant performance degradation. To address this issue, we present RangeLFS, an LFS that replaces fixed-size blocks with flexible ranges to directly write arbitrary-sized file segments without alignment. First, we propose a range-based file structure, where each range directly maps an arbitrary-sized file segment to its device address. These ranges in a file are maintained in an enhanced B+ tree, enabling efficient lookups while reducing cascading address updates and structural modifications. Second, we propose a range-augmented page cache to support partial-page writes and perform range-based writeback, eliminating the traditional page-granularity constraint that would otherwise prevent arbitrary-sized writes. Evaluated on a production ZNS SSD, RangeLFS improves write bandwidth by up to 6.95x compared with block-based LFSs, while maintaining comparable read performance across diverse workloads.
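
Illustrative sketch (not from the paper): a range that maps an arbitrary-sized file segment to a device address can be modeled as a (file offset, length, device address) triple. The class `RangeMap` below is hypothetical and uses a sorted list with binary search as a stand-in for the enhanced B+ tree named in the abstract; it assumes inserted ranges do not overlap.

```python
import bisect
from typing import List, Optional, Tuple


class RangeMap:
    """Maps byte offsets in a file to device addresses via variable-length ranges."""

    def __init__(self) -> None:
        self.starts: List[int] = []                   # sorted file offsets
        self.ranges: List[Tuple[int, int, int]] = []  # (file_off, length, dev_addr)

    def insert(self, file_off: int, length: int, dev_addr: int) -> None:
        # Assumes the new range does not overlap an existing one.
        i = bisect.bisect_left(self.starts, file_off)
        self.starts.insert(i, file_off)
        self.ranges.insert(i, (file_off, length, dev_addr))

    def lookup(self, file_off: int) -> Optional[int]:
        # Find the last range starting at or before file_off.
        i = bisect.bisect_right(self.starts, file_off) - 1
        if i < 0:
            return None
        start, length, dev_addr = self.ranges[i]
        if file_off < start + length:
            return dev_addr + (file_off - start)
        return None                                   # offset falls in a hole
```

Because each range carries its own length, a 37-byte write needs no padding to a block boundary in this model.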

SystemsSYS5. Embedded Memory and Storage SystemsQuantumDES3. Emerging Models of Computation +1 more
RESEARCH2642

LLM4SDC: Leveraging Multi-Agent System for Automated SDC Generation and Benchmarking

Peiyi Han; Yuntao Lu; Haiyang Liu; Fangzhou Liu; Xufeng Yao; Yuyang Ye

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

High-quality timing constraints decisively influence the results of logic synthesis, yet their creation still relies on time-consuming, expert-level manual work. Although large language models (LLMs) and multi-agent systems have recently delivered strong performance on related EDA automation tasks, most notably script generation, their capability for constraint authoring remains largely untapped. We propose an LLM-centric multi-agent framework that automatically produces Synopsys Design Constraint (SDC) files from natural-language specifications and the corresponding RTL. First, we introduce a rigorously aligned data-construction flow that couples specifications, RTL, and expert-validated SDCs. The resulting corpus simultaneously serves as a reproducible benchmark. Second, we instantiate three cooperative agents: (i) a specification-parser agent that extracts timing and design constraints from textual documents; (ii) a false-path discovery agent that identifies and annotates false and multi-cycle paths; and (iii) an SDC-generation agent that emits complete, tool-ready constraint files. Evaluated on open-source designs, the framework improves constraint correctness, completeness, and coverage over both rule-based baselines and state-of-the-art general-purpose LLMs by 9.8% on average, while reducing human effort from hours to minutes. These results substantiate the viability of LLM-powered constraint generation and represent a further step toward fully autonomous, AI-driven EDA toolchains.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2646

TOPCELL: Topology Optimization of Standard Cell via LLMs

Zhan Song; Yu-Tung Liu; Chen Chen; Guoheng Sun; Jiaqi Yin; Chia-tung Ho; Ang Li; Haoxing Ren; Cunxi Yu

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

Transistor topology optimization is a critical step in standard cell design, directly dictating diffusion sharing efficiency and downstream routability. However, identifying optimal topologies remains a persistent bottleneck, as conventional exhaustive search methods become computationally intractable with increasing circuit complexity in advanced nodes. This paper introduces TOPCELL, a novel and scalable framework that reformulates high-dimensional topology exploration as a generative task using Large Language Models (LLMs). We employ Group Relative Policy Optimization (GRPO) to fine-tune the model, aligning its topology optimization strategy with logical (circuit) and spatial (layout) constraints. Experimental results within an industrial flow targeting an advanced 2nm technology node demonstrate that TOPCELL significantly outperforms foundation models in discovering routable, physically-aware topologies. When integrated into a state-of-the-art (SOTA) automation flow for a 7nm library generation task, TOPCELL exhibits robust zero-shot generalization and matches the layout quality of exhaustive solvers while achieving an 85.91X speedup.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH2648

PD-Net: Learning Device-Invariant Representations for Heterogeneous Cross-Device Side-Channel Attacks

Dalin He; Wei Cheng; Yuejun Liu; Jingdian Ming; Yongbin Zhou

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

Heterogeneous cross-device side-channel attacks remain a critical yet underexplored challenge, as models trained on one device often fail to generalize across architectures. This paper presents PD-Net, a domain generalization framework that learns device-invariant features by disentangling algorithmic content from device-specific style and aligning feature distributions using prototypical and Maximum Mean Discrepancy (MMD) losses. PD-Net is trained on nine heterogeneous source domains spanning ARM/AVR/FPGA and power/electromagnetic leakage modalities, including 32-bit ARM Cortex-M0/M1/M3/M4, 8-bit AVR ATmega (three series), and 128-bit Xilinx Virtex-5 FPGA, and evaluated in a zero-shot setting without target-specific adaptation. Experimental results demonstrate robust zero-shot cross-architecture transfers between 8-bit and 32-bit devices, with consistent gains over existing generalization and transfer-learning approaches. In particular, PD-Net delivers 29 successful attacks with only 10 divergences across 70 settings, markedly outperforming the state of the art, which succeeds in only 4 cases and diverges 19 times. To the best of our knowledge, this is the first domain generalization (DG)-based deep learning framework to systematically demonstrate practical zero-shot heterogeneous cross-device side-channel attacks.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH2650

Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Ziyan Wang; Enmao Diao; Qi Le; Pu Wang; Guanchu Wang; Minwoo Lee; Shu-ping Yeh; Li Yang

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient & Edge-Ready AI Systems +1
Abstract

Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20–30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.

AIAI2-II. AI/ML Algorithms and ModelsEDA7-II. Physical Design and VerificationDesign
RESEARCH2670

HIP-MaN: Hippocampus-Inspired Periodic Mapping and Navigation for Autonomous Mobile Robot

Daeyoung Kim; Seonghan Kwon; Seongsik Park; Hyunjae Jang; Jaewook Kim; YeonJoo Jeong; Inho Kim; Jong-Kook Kim; Jongkil Park

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Where Biology Meets Computation: Next‑Gen Bio‑Cyber‑Physical Innovations +1
Abstract

Real-time path planning is essential for autonomous mobile robots operating in complex and dynamic real-world environments. Conventional path planning algorithms are limited by the need to store the entire environment map in memory and to maintain either the cost values of explored nodes or their connectivity relationships. Consequently, both memory and computational loads increase rapidly with the scale and complexity of the environment. This overhead is particularly exacerbated when replanning is frequent due to unpredictable dynamic obstacles. In this paper, we propose the hippocampus-inspired periodic mapping and navigation (HIP-MaN) algorithm. HIP-MaN employs multi-periodic grid modules to encode unique spatial locations as phase combinations, effectively compressing the entire environment into a compact periodic representation. It directly computes the goal direction based solely on the phase differences between grid modules and generates detours only when a collision is predicted, minimizing replanning costs. In a 200×200 m² environment, HIP-MaN demonstrates near-optimal path planning quality, showing only a 5-22% increase over the optimal path length while achieving 3-50x and 5-346x faster path generation in static and dynamic environments, respectively.

DesignDES3. Emerging Models of ComputationQuantumEDA
RESEARCH2686

Adaptive Read Retry Table Extension for Enhanced Data Reliability in 3D NAND Flash Memory

Han-Yu Liao; Chen-Xing Chang; Jen-Wei Hsieh; Hung-Pin Chen; Yuan-Hao Chang

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Scaling the Bedrock: OS-Hardware Co-Design for High-Reliability Storage +1
Abstract

3D NAND flash is widely used from PCs to data centers, but its major weakness is data reliability. As data is retained, charge leakage causes read errors. Vendors mitigate this using the read retry mechanism, which iteratively applies predefined read reference voltage combinations, called read retry parameters (RRPs), stored in a read retry table (RRT). However, the limited number of RRPs can lead to RRP burnout, a condition where all read retry parameter entries have been exhausted without successful data recovery. In such cases, the data becomes undecodable and the corresponding block is marked as bad, degrading both the capacity and the lifetime of the NAND flash. Experiments showed that even a small portion of recovery failures may lead to significant capacity loss. We propose an Adaptive RRT Extension framework, integrating two key mechanisms: WL-Aware Interpolated Retry (WIR) and Multi-Dimensional Guided Retry (MDGR), to expand recovery capability. Experiments show that our method recovers up to 99% of failed pages and preserves up to 99% and 84% of usable capacity under 12- and 30-month retention, respectively, whereas the baseline retains only 90% or reaches end-of-life.
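
The read retry loop and the burnout condition described above can be sketched as follows. This is a toy model under stated assumptions: `make_page`, the table values, and the midpoint extension in `extend_by_midpoints` are hypothetical and are not the paper's WIR or MDGR mechanisms, whose details the abstract does not give.

```python
def read_with_retry(read_page, retry_table):
    """Try each RRP in order; return (data, index), or (None, None) on burnout."""
    for i, rrp in enumerate(retry_table):
        data = read_page(rrp)
        if data is not None:  # None models an ECC decode failure
            return data, i
    return None, None  # every entry exhausted: RRP burnout


def extend_by_midpoints(retry_table):
    """Hypothetical extension: append the midpoint of each adjacent RRP pair."""
    extra = [
        tuple((a + b) / 2 for a, b in zip(lo, hi))
        for lo, hi in zip(retry_table, retry_table[1:])
    ]
    return list(retry_table) + extra


def make_page(optimal, tolerance):
    """Toy page: decodes only if every offset is within tolerance of optimal."""
    def read_page(rrp):
        ok = all(abs(v - o) <= tolerance for v, o in zip(rrp, optimal))
        return "payload" if ok else None
    return read_page


RRT = [(0, 0), (-4, -4), (-8, -8)]  # made-up read reference voltage offsets
page = make_page(optimal=(-6, -6), tolerance=1)

print(read_with_retry(page, RRT))                       # (None, None)
print(read_with_retry(page, extend_by_midpoints(RRT)))  # ('payload', 4)
```

In this toy case the fixed table burns out because the page's best offsets fall between two entries, and one extra interpolated entry recovers the data.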

SystemsSYS5. Embedded Memory and Storage SystemsQuantumDES3. Emerging Models of Computation +1 more
RESEARCH2694

HySpecPro: Scalable Hypergraph Partitioning via Spectral Projection Optimization

Rongjian Liang; Zhuo Feng; Haoxing Ren

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

Modern VLSI designs contain tens of billions of components, making scalable hypergraph partitioning essential for parallel or hierarchical optimization. Although the multilevel partitioning paradigm remains effective, coarsening can distort structural information—especially in hypergraphs with many high-degree hyperedges—leading to substantial refinement overhead and limited scalability. Recent works incorporate spectral information, but only heuristically and without directly targeting the partitioning objective or enforcing constraints, leaving refinement to recover quality. We introduce HySpecPro, a single-level hypergraph partitioner that performs end-to-end optimization in a spectral embedding space. HySpecPro constructs embeddings from a bipartite Laplacian and performs efficient projection-based search, supported by a fully GPU-accelerated implementation. Experiments show that HySpecPro delivers cut quality comparable to state-of-the-art multilevel methods while scaling linearly with the total hyperedge degree.

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH2708

M²CAM: A Multi-Level Memristor-Based Self-Adaptive CAM Architecture for Genome Processing Acceleration

Yang Han; Lianfeng Yu; Teng Zhang; Bowen Wang; Yihang Zhu; Lei Cai; Yaoyu Tao; Yuchao Yang

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Are We There Yet? That is the Question ... for Computing near and in Memory. +1
Abstract

Genomic analysis workflows, such as single-cell RNA sequencing (scRNA-seq), demand massive sequence matching and classifications, yet are fundamentally limited by the data movement overhead and bandwidth bottleneck of von Neumann architectures. Although in-memory computing (IMC) has emerged as a promising solution, most existing IMC-based genome accelerators rely on digital encoding schemes, which introduce excessive hardware overhead and limit processing efficiency. This work presents a Multi-level Memristor-based Self-adaptive CAM (content addressable memory) Architecture (M²CAM) for genome processing acceleration. Unlike conventional digital approaches, the proposed design leverages analog CAM encoding based on multi-level memristor conductance states, enabling compact representation of nucleotides with significantly reduced device count and improved parallelism. Furthermore, a self-adaptive error correction mechanism dynamically adjusts matching precision through hierarchical operation modes, ensuring robust sequence matching under device variations. A parallelized genome classification framework is developed to demonstrate the system's efficiency, using single-cell RNA sequencing as a representative application. Experimental results show that the proposed architecture achieves a 50%~75% reduction in device usage, while on real scRNA-seq datasets, M²CAM delivers 131.2×~3088.9× and 161.2×~269.6× improvements in energy consumption and latency, compared with CPU/GPU and traditional bioinformatics tools (e.g., STAR).

DesignDES2B-II. In-memory and Near-memory Computing Architectures, Applications and SystemsAIAI5-I. AI/ML System and Platform Design
RESEARCH2713

Raise the Shields: A Modular RISC-V Extension for Post Quantum Cryptography

Alessandra Dolmeta; Valeria Piscopo; Maurizio Martina; Guido Masera; Michael Hutter

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

This work presents a unified RISC-V extension for Post-Quantum Cryptography (PQC) that emphasizes versatility and openness. The design exposes a compact set of custom scalar instructions via the Core-V eXtension Interface (CV-X-IF) and a modular PQ-ALU that accelerates the dominant kernels across hash-, lattice-, and code-based schemes: Keccak processing, randomness sampling, modular and polynomial arithmetic, finite-field operations, and coefficient compression. The system supports standardized algorithms—ML-KEM, ML-DSA, SLH-DSA, and HQC—as well as candidates under evaluation. The full hardware and software stack is released as open source to enable reproducibility and community reuse. We provide ASIC results from post-synthesis characterization in 65 nm CMOS, reporting instruction-cycle counts, along with power and energy estimates for each custom operation. Overall, the proposed extension delivers a compact (∼48 kGE) and practical path to crypto-agile, energy-efficient PQC on RISC-V while preserving software compatibility and a clean integration into existing cores.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH2717

Operator-Level Acceleration for Sparse 3D Object Detection

Hongjin Zhong; Jingyu Guo; Mingyue Cui; Shuai Liu; Yuanhai Zhang; Kai Huang

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

3D object detection is critical for robotics and autonomous driving; yet, the inference latency remains a bottleneck for efficient deployment on resource-constrained platforms. Existing acceleration strategies for sparse 3D detectors fail to fully exploit the intrinsic sparsity of voxels. To address this challenge, we propose a systematic operator-level acceleration pattern that exploits structural properties of sparse voxels through three synergistic techniques: (1) cache-aware attention fusion that redesigns slot-based aggregation via tiled tensor decomposition, achieving 10.0–33.5× speedup on an RTX 4090 GPU and 12.5–50.8× speedup on an A100 GPU while enabling previously infeasible high-dimensional attention heads (d = 512–1024); (2) sparsity-driven convolution approximation that leverages attention-induced feature redundancy to replace 3D axial convolutions with lightweight 1D channel operations, delivering 1.2–2× acceleration with negligible accuracy loss (<4% mAP); and (3) structure-aware precision reduction employing profiling-guided mixed-precision assignment that selectively applies FP16 to high-cost operators with high tolerance to precision loss, while preserving FP32 for geometric encoders, achieving 1.93–2.19× speedup. All optimizations preserve network architecture and require no retraining. Experimental results show that our method achieves a 2.1–2.5× end-to-end speedup over state-of-the-art baselines while maintaining accuracy within 4% mAP and 2.5% NDS, establishing a novel pattern for efficient sparse 3D perception.

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH2721

DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation

Zebin Yang; Yijiahao Qi; Tong Xie; Bo Yu; Shaoshan Liu; Meng Li

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

We propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset.

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH2737

Insights from Verification: Training a Verilog Generation LLM Using Reinforcement Learning with Testbench Feedback

Ning Wang; Bingkun Yao; Jie Zhou; Yuchen Hu; Xi Wang; Zhe Jiang; Nan Guan

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

Large language models (LLMs) have achieved strong results in software programming tasks, which motivates their application to hardware design, especially in generating Verilog code from natural language specifications. Existing methods mainly use instruction tuning, which optimizes token-level probability and does not match the need for functional correctness in Verilog generation. To address this, we use reinforcement learning (RL) with feedback from verification tools, so that the training objective directly reflects functional correctness. RL training with verification feedback is limited by the lack of functional verification code, or testbenches. To solve this, we propose an automatic testbench generation framework that addresses the problems of hallucination and low coverage in LLM-generated testbenches by decomposing the task and using additional information from electronic design automation tools. We then train Verilog generation LLMs using reinforcement learning with testbench feedback, achieving state-of-the-art performance. Our method gives consistent gains across different base models and open-source Verilog generation LLMs, showing its generalizability. All datasets, code, and models are released to support further research in this area.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2753

An Energy-Efficient Dataflow Architecture for Efficient MoE Model Inference

Kexin Li; Qinggang Wang; Chuanhui Qi; Shaoxian Xu; Wenkan Huang; Xiangzheng Yang; Ruoshi Li; Bo Liu; Long Zheng; Xiaofei Liao; Hai Jin

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

The Mixture-of-Experts (MoE) model sparsely activates a subset of experts for each token, causing each expert to process a varying number of tokens during inference. However, existing GPU-based MoE inference frameworks adopt a fixed tensor size for the inputs to each expert, requiring padding when an expert receives fewer tokens, which leads to substantial redundant computation. In response, we propose a token-stationary dataflow that uniformly abstracts the multiplication between an expert's parameters and input tensors with varying token counts. Based on this dataflow, we design a reconfigurable systolic array that eliminates padding-incurred redundant computation. Evaluation demonstrates that our design outperforms the state-of-the-art GPU-based MoE inference frameworks significantly.
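
To make the padding overhead concrete, the following sketch counts useful versus padded token rows for one batch. It assumes, purely for illustration, that every expert's input is padded to the largest per-expert token count; real frameworks may choose the fixed tensor size differently, and the function name and token counts are made up.

```python
def padding_waste(tokens_per_expert):
    """Return (useful_rows, padded_rows, wasted_fraction) for one batch.

    Assumes each expert's input is padded up to the largest token count.
    """
    capacity = max(tokens_per_expert)
    useful = sum(tokens_per_expert)
    padded = capacity * len(tokens_per_expert)
    return useful, padded, 1 - useful / padded


# Four experts receiving a skewed share of 16 routed tokens.
useful, padded, wasted = padding_waste([12, 3, 0, 1])
print(useful, padded, round(wasted, 3))  # 16 48 0.667
```

With this skewed routing, two thirds of the rows fed to the experts are padding, which is the redundant computation a padding-free design avoids.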

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2756

Lin-Search: Scaling Exact Synthesis of CNOT Circuits via Hybrid Iterative Deepening Search

Chenjian Li; Xiangzhen Zhou; Ji Guan; Fanxu Meng; Pengcheng Zhu; Yu Luo

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Pioneering the Foundations of Quantum Circuit Synthesis, Compilation and Algorithmic Optimization +1
Abstract

Exact synthesis of CNOT circuits is a core primitive in quantum compilers, but existing SAT-based and database-based methods suffer from encoding overhead, memory bottlenecks, and poor scalability. We propose Lin-search, a framework for exact CNOT synthesis based on iterative deepening search. Lin-search explores circuits in non-decreasing gate count while pruning partial solutions using variable-mismatch constraints, matrix-based lower bounds, and canonicalization under local circuit equivalences. In random benchmarks, Lin-search finds optimal circuits with up to 15 qubits and 15 CNOT gates in 600 seconds on a laptop, achieving speedups of up to 100–1000× and scaling to 3 more qubits than a SAT-based baseline incorporated in Qiskit. When integrated as a rewriting kernel in a Clifford+T optimization flow, Lin-search further reduces CNOT count by 16% on average compared to the original circuits and decreases timeout counts by 30% compared to the SAT-based method, demonstrating its effectiveness as a practical exact synthesis engine for quantum compilers.
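
A minimal sketch of iterative deepening for exact CNOT synthesis, assuming the standard view of a CNOT circuit as an invertible matrix over GF(2) with one bitmask per row. It omits the pruning named in the abstract (variable-mismatch constraints, matrix-based lower bounds, canonicalization); the only pruning here is skipping an immediate undo, and all names are hypothetical.

```python
from itertools import permutations


def apply_cnot(rows, c, t):
    """CNOT(control=c, target=t) adds row c into row t over GF(2)."""
    out = list(rows)
    out[t] ^= out[c]
    return tuple(out)


def synthesize(target, n, max_gates=8):
    """Return a minimum-length list of (control, target) pairs, or None."""
    identity = tuple(1 << i for i in range(n))
    gates = list(permutations(range(n), 2))

    def dfs(state, budget, path):
        if state == target:
            return path
        if budget == 0:
            return None
        for c, t in gates:
            if path and path[-1] == (c, t):  # CNOT is self-inverse: skip undo
                continue
            found = dfs(apply_cnot(state, c, t), budget - 1, path + [(c, t)])
            if found is not None:
                return found
        return None

    for depth in range(max_gates + 1):  # non-decreasing gate count
        result = dfs(identity, depth, [])
        if result is not None:
            return result
    return None


# SWAP on two qubits: output 0 carries input 1 and vice versa.
print(synthesize((0b10, 0b01), n=2))  # [(0, 1), (1, 0), (0, 1)]
```

Because depths are tried in increasing order, the first circuit found uses the fewest gates; for SWAP that is the familiar three-CNOT construction.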

DesignQuantumDES6. Quantum ComputingAI
RESEARCH2760

NotSoTiny: A Large, Living Benchmark for RTL Code Generation

Razine Ghorab; Emanuele Parisi; Cristian Gutierrez; Miquel Alberti-Binimelis; Miquel Moreto; Dario Garcia-Gasulla; Gokcen Kestor

Date:Monday, July 27 Location:Mtg Room 101A Session:Silicon Whisperers: When LLMs Learn to Speak RTL +1
Abstract

LLMs have shown early promise in generating RTL code, yet evaluating their capabilities in realistic setups remains a challenge. So far, RTL benchmarks have been limited in scale, skewed toward trivial designs, minimal in verification rigor, and vulnerable to data contamination. In this paper, we introduce NotSoTiny, a benchmark that assesses LLMs on structurally rich and context-aware RTL generation tasks, while being resilient to contamination. Built from hundreds of representative hardware designs produced by the TinyTapeout community, our automated pipeline removes duplicates, verifies correctness using simulation and formal equivalence checking, and continuously incorporates new designs to mitigate leakage. Evaluation results show that these tasks are significantly more challenging than prior benchmarks, emphasizing NotSoTiny's effectiveness in revealing the current limitations of LLMs applied to hardware design and in guiding the refinement of this promising technology.

AIAI1. AI/ML Frontiers for Hardware DesignAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2762

Pangaea: A Unified Memory-Efficient Accelerator for Pangenome Chaining and Alignment

Shilin Tian; Fangzhou Ye; Amir Ahsaei; Wei Zhang; Hao Zheng

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Where Biology Meets Computation: Next‑Gen Bio‑Cyber‑Physical Innovations +1
Abstract

Pangenome graphs have emerged as the new standard for genomic reference, replacing linear references to capture population-level genetic variation. However, sequence-to-graph mapping tools exhibit severe memory bottlenecks, with intermediate data movement between computation stages consuming up to 86% of total DRAM bandwidth. Through systematic profiling of state-of-the-art tools, we identify that no single computational stage dominates across different implementations, necessitating cross-stage optimization approaches rather than accelerating individual kernels. We present PANGAEA, a co-design framework that formulates cross-stage loop fusion as an optimization problem to minimize DRAM traffic under on-chip buffer constraints. Our approach fuses the three memory-intensive stages (i.e., linear chaining, graph chaining, and wavefront alignment). The framework automatically generates tiling parameters and scheduling schemes that maximize data reuse across genome analysis kernels with different dependency patterns. We design the PANGAEA accelerator with a unified tri-mode processing element array supporting sparse chaining and dense alignment operations. Compared to SOTA ASIC implementations, PANGAEA achieves an average of 1.47× higher throughput (bp/s), 1.73× better energy efficiency (bp/μJ), and 1.62× better area efficiency (bp/s/mm²).

DesignDES3. Emerging Models of ComputationQuantumEDA
RESEARCH2771

PLONK-Hammer: Breaking Input Privacy of PLONK Proving Systems via Rowhammer

Zhiwen Zhang; Junkai Liang; Xin Zhang; Qingni Shen; Cong Li; Yuejian Fang

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Zero-knowledge proofs, particularly succinct non-interactive variants (zk-SNARKs), are instrumental in verifiable computation without revealing private information. Zk-SNARKs are provably secure against cryptographic attacks; however, whether these systems can resist fault injection attacks requires further investigation. Previous related work focused on QAP-based zk-SNARKs such as Groth16. PLONK improves upon Groth16 with a universal and updatable trusted setup. In this work, we present PLONK-Hammer, which breaks the input privacy of PLONK via Rowhammer. We inject faults into the secret inputs. Then we devise an algorithm to recover the secret using faulty proofs and polynomial commitment techniques. We evaluate PLONK-Hammer on gnark and successfully leak secrets with our attack.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2773

FlashHD: A Flash-Based In-Storage Hyperdimensional Computing Framework for Hierarchical Sequence Matching

Wen-Hsin Liu; Chieh-Lin Tsai; Wen Sheng Lim; Ray-Ting Huang; Han-Wen Hu; Tei-Wei Kuo; Yuan-Hao Chang

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Active Memory: Breaking the Data Movement Wall with Cognitive Storage +1
Abstract

The rapid growth of DNA data makes exact sequence matching a central task in bioinformatics. These applications are both compute- and memory-intensive due to exhaustive database scans. However, conventional systems suffer from high I/O overhead caused by frequent page faults when memory capacity is limited. We propose FlashHD, an in-storage sequence matching framework that executes exact matching directly inside 3D NAND flash. FlashHD employs a hierarchical hyperdimensional computing architecture that filters dissimilar sequences across multiple stages while preserving full recall. FlashHD introduces a hierarchy-aware hyperdimensional architecture search that automatically tunes HDC hyperparameters for low latency and energy to optimize the hierarchical searching. Our evaluation reveals that FlashHD significantly outperforms other state-of-the-art systems.

SystemsSYS5. Embedded Memory and Storage Systems
RESEARCH2778

Excavating Consistency Across Editing Steps for Effective Multi-Step Image Editing

Chunyu Qi; Zhuoran Song; Cheng Gu; Zhihui Fu; Jianglong Chang; Jun Wang; Xiaoyao Liang; Haibing Guan

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Multi-step image editing with diffusion models typically requires repeatedly executing the inversion–denoising paradigm, which leads to severe challenges in both image quality and computational efficiency. Repeated inversion introduces errors that accumulate across editing steps, degrading image quality, while regeneration of unchanged background regions incurs substantial computational overhead. In this paper, we present ExCave, a training-free multi-step editing framework that improves both image quality and computational efficiency by excavating consistency across editing steps. ExCave introduces an inversion sharing mechanism that performs inversion once and reuses its consistent features across subsequent edits, thereby significantly reducing errors. To eliminate redundant computation, we propose the CacheDiff method that regenerates only the edited regions while reusing consistent features from unchanged background regions. Finally, we design GPU-oriented optimizations to translate theoretical gains into practical reductions in end-to-end latency. Extensive experiments demonstrate that ExCave achieves superior image quality and dramatically reduces inference latency, establishing a new paradigm for accurate and efficient multi-step editing.

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH2780

Real-Time, Lightweight, High Coverage Error Detection and Correction for CRYSTALS with Small-Modulus Paths

Ziying Ni; Ayesha Khalid; Yijun Cui; Weiqiang Liu; Maire O'Neill

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

The Number Theoretic Transform (NTT) dominates the hardware latency of lattice-based post-quantum cryptography (PQC) and is vulnerable to soft errors and fault injection. Existing protection mechanisms either require re-computing, assume a Residue Number System (RNS) datapath, or provide error detection only without real-time correction. This work presents two lightweight architectures for real-time fault detection and correction for NTT modular multipliers targeting the non-RNS primes used in the CRYSTALS lattice-based PQC schemes. Method 1 exploits the pseudo-Mersenne structure of the CRYSTALS moduli to build ultra-lightweight dual-modulus residue paths for error detection. Method 2 introduces cross-basis single-modulus checks together with lookup-table-based single-bit correction and a mixed linear CRT scheme for multi-bit recovery. On Artix-7, the detection configurations incur 55–78% area overhead; single-bit detection achieves 100%, and multi-bit detection at least 99.01% or nearly 100% under the tested settings. The design requires no re-computation, has a shorter critical path delay, and surpasses state-of-the-art and Triple Modular Redundancy (TMR) baselines in area, error coverage, and latency.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH2793

CircuitDiff: Bridging Netlist Knowledge with RTL Based on Graph Denoising Diffusion

Ning Wang; Zichong Deng; Bingkun Yao; Jie Zhou; Yuchen Hu; Xi Wang; Zhe Jiang; Nan Guan

Date:Monday, July 27 Location:Mtg Room 101A +1 Session:Hot Chips & Cool Models: AI Turns Up the Heat on Silicon Design +1
Abstract

In integrated circuit (IC) workflows, EDA tools provide accurate circuit metrics by simulating and analyzing physical characteristics, but their computational overhead makes them impractical for rapid design iterations. Thus, accurate and efficient circuit metric prediction at the register-transfer-level (RTL) stage has become a hot topic. However, current methods still struggle to efficiently bridge the gap between RTL and physical characteristics. We propose CircuitDiff, a generative pre-training framework designed to learn a unified representation space between RTL and netlist. This is achieved by encoding the RTL graph as a condition for training a netlist graph denoising diffusion model. During fine-tuning, learnable queries are used to incentivize knowledge from the pre-trained model via cross-attention mechanisms. Experimental results show that CircuitDiff achieves superior performance in the prediction of early-stage circuit metrics compared to state-of-the-art models, supporting the "left-shift" paradigm while maintaining computational efficiency. Our code and data will be publicly available at https://github.com/CatIIIIIIII/CircuitDiff.

AIAI1. AI/ML Frontiers for Hardware DesignDES5. Emerging Device and Interconnect Technologies
RESEARCH2794

Scalable Multi-Task Learning Through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents

Rachmad Vidya Wicaksana Putra; Avaneesh Devkota; Muhammad Shafique

Date:Wednesday, July 29 Location:Mtg Room 203C Session:Scaling Intelligence for the Physical World +1
Abstract

Training autonomous agents on multiple tasks is crucial for adapting to diverse real-world environments. However, state-of-the-art methods rely on fixed task-switching intervals during training, limiting their performance and scalability. To address this, we propose SwitchMT, a novel methodology that employs adaptive task-switching for effective and scalable multi-task learning. It leverages a Deep Spiking Q-Network with active dendrites and a dueling structure that utilizes task-specific context signals to create specialized sub-networks, and it devises an adaptive task-switching policy that leverages rewards and internal dynamics of the network parameters. Experimental results show that SwitchMT achieves superior performance in multi-task learning compared with state-of-the-art methods.

SystemsSYS1. Autonomous Systems (Automotive, Robotics, Drones)AIQuantum
RESEARCH2804

Automated Discovery, Classification, and Systematic Security Verification Methodology for Undocumented Instructions

Yooseok Lee; Junho Jo; Hanjoon Kim; Wonjun Song

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

Undocumented instructions pose a growing security threat, yet current research is limited. Existing work focuses on discovery with constrained techniques, while semantic analysis is manual and ad-hoc. Moreover, a significant gap exists in methodologies for systematic security verification. This paper presents a comprehensive framework for the automated discovery, systematic classification, and security verification of undocumented instructions. Our evaluation discovered 23 new instructions and classified 29 using a classifier with 99.8357% accuracy. Most importantly, we verified that 5 instructions have tangible security implications, demonstrating our approach's efficacy in addressing threats at the processor instruction level.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH2805

Time-Aware Active Calibration and Scheduling for N-Qubit Systems via Bayesian Optimization and Reinforcement Learning

Ziming Zhao; Tingting Li; Xiaofei Yue; Zhaoxuan Li; Jianwei Yin

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

Frequent recalibration of multi-qubit systems is essential to maintain hardware fidelity, but calibration itself is costly and subject to rapid parameter drift. This work presents a novel framework for time-aware active calibration that combines Bayesian Optimization (BO) and Reinforcement Learning (RL) to adaptively schedule and prioritize calibration experiments under uncertainty. We model the calibration landscape as a directed acyclic graph (DAG), where nodes represent hardware parameters (e.g., frequency, amplitude ratio, DRAG, or phase) and edges capture inter-parameter dependencies extracted automatically from Qiskit Experiments metadata. Each parameter's temporal drift is modeled as an Ornstein-Uhlenbeck (OU) or exponential decay process with uncertainty bounds. The BO agent selects the next calibration experiment to maximize expected information gain per unit time while respecting system-level constraints. A parallel RL/ILP scheduler coordinates cross-qubit calibration and readout tasks to minimize idle time and avoid conflicts. Our system further incorporates self-triggered recalibration based on online RB/XEB health metrics, enabling autonomous recovery from drift without human intervention. Experimental evaluation on multi-qubit testbeds demonstrates improved calibration efficiency, longer mean time between recalibrations (MTBC), and reduced overall calibration overhead compared to fixed or purely heuristic scheduling baselines. We provide an open-source implementation, Cal Orchestrator, which integrates DAG construction, BO-based experiment selection, and RL-based calibration scheduling for scalable quantum system maintenance.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH2806

LHGstore: An In-Memory Learned Graph Storage for Fast Updates and Analytics

Pengpeng Qiao; Zhiwei Zhang; Xinzhou Wang; Zhetao Li; Xiaochun Cao; Yang Cao

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

Various real-world applications rely on in-memory dynamic graphs that must efficiently handle frequent updates while supporting low-latency analytics on evolving structures. Achieving both objectives remains challenging due to the trade-off between update efficiency and traversal locality, particularly under highly skewed degree distributions. This motivates the design of graph indexing schemes optimized for in-memory graph management on modern multi-core CPUs. We present LHGstore, a degree-aware Learned Hierarchical Graph storage that, for the first time, integrates learned indexing into graph management. LHGstore designs a two-level hierarchy that decouples vertex and edge access and further organizes each vertex's edges using data structures adaptive to its degree. Lightweight arrays are used for low-degree vertices to maximize traversal locality, while learned indexes are applied to high-degree vertices to improve update throughput. Extensive experiments show that LHGstore achieves 5.9-28.2× higher throughput and significantly faster analytics than SOTA in-memory graph storage systems.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2817

EHPC: Efficient Heterogeneous Probabilistic Computing Architecture for Combinatorial Optimization Acceleration

Weican Chen; Chenhao Xia; Haoxuan Wang; Guanwen Yao; Yuxiang Zhang; Yunyi Fu; Fei Liu

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

In this work, we propose an efficient heterogeneous probabilistic computing (EHPC) architecture based on volatile RRAM to accelerate combinatorial optimization. We fabricated the OxRAM-multiplexed EHPC circuit and successfully used it to solve the max-cut problem. In hardware simulations of max-cut problems, EHPC achieves superior solution quality to existing works in under 1 s, using the same computational resources. The results on floorplanning problems demonstrate that our EHPC architecture significantly boosts computation speed, by 20× to 1,500×, with area expansion <1.7% compared to the best-performing conventional methods, highlighting its advantages in speed, efficiency, and scalability for solving combinatorial optimization problems (COPs).

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH2824

ISA-PM: A Unified ISA Extension and Microarchitecture Enabling Efficient ECPM in IoT Devices

Zhaoyuan Li; Kun Yang; Kui Ren

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

Elliptic curve cryptography (ECC) has become a preferred choice for securing resource-constrained Internet-of-Things (IoT) devices, offering strong security with compact key sizes. As the computational core of ECC, elliptic curve point multiplication (ECPM) dominates both the performance and silicon area of ECC coprocessors. However, implementing ECPM in area-constrained applications remains challenging due to its significant arithmetic complexity and the diversity of field parameters. Most existing hardware accelerators are tailored to a specific elliptic curve modulus, limiting their scalability and forcing costly redesigns to support multiple curves. To overcome this inflexibility, we propose ISA-PM, a unified instruction set architecture (ISA) extension and microarchitecture that supports flexible ECPM computation across various elliptic curves, including widely used Montgomery and short Weierstrass forms. To effectively reduce the design area of ISA-PM, we propose a three-phase computation scheme that decomposes complex ECPM operations into iterative execution of small bit-width modular operations. Furthermore, we develop a design space exploration method based on bit-width partitioning and operation parallelism to balance resource utilization and computational latency, ensuring efficient implementation under area constraints. Experimental results based on AMD Virtex-7 FPGA demonstrate that ISA-PM delivers up to 8.37x improvement in area–time product (ATP) over the state-of-the-art lightweight ECPM hardware accelerators.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH2829

LLM-Enhanced Two-Stage Signal Temporal Logic Automatic Transformation with Structured Intermediate Representation

Hongjing Qing; Jie An; Fanjiang Xu

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

Signal Temporal Logic (STL) is a formal specification language for describing real-time and real-valued properties of Cyber-Physical Systems (CPS). Accurate and automatic translation of CPS specifications in natural language (NL) into formal STL formulas is crucial. Traditional methods relying on manual templates or deep-learning models are limited and lack flexibility. Recently, methods based on large language models (LLMs) have partially addressed these issues, but they do not exploit the inherent structure of STL, which is useful both for the translation itself and for evaluating the result. To address these issues, we propose STLGen, a novel LLM-enhanced automatic transformation framework from NL to STL, which introduces a two-stage generation process with a structured natural language, named NL2, as the intermediate representation. In Stage 1, NL is converted into well-defined NL2 through structured prompt engineering. In Stage 2, NL2 is converted into STL formulas using a converter. We leverage LLM-aided generation and closed-loop verification with matching algorithms, and fine-tune models in two stages with instruction fine-tuning and LoRA. Additionally, we introduce two evaluation metrics: the structure accuracy, to assess the impact of STL syntax on logic, and the STL-SCOTES, to evaluate semantic consistency via STL trajectories. Experimental results demonstrate that our method outperforms state-of-the-art methods across the classic evaluation metrics and our proposed metrics.

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH2832

Nothing Left to Learn: Isomorphic Graph Transformations for Hardware IP Protection

Shaza Elsharief; Lilas Alrahis; Johann Knechtel; Ozgur Sinanoglu

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

Logic locking protects hardware designs across the global semiconductor supply chain. However, recent machine learning (ML)-based attacks, especially graph learning-based ones, have undermined its security by learning circuit structures and gate compositions to reveal locked connections/gates. Although recent locking methods use explainability tools and adversarial perturbations, they remain vulnerable to ML attacks under robust training. Therefore, there is still a need for truly learning-resilient locking solutions. We propose IsoLock, a locking scheme that performs interconnect obfuscation to secure the circuit's structure and functionality. IsoLock models the gate-level netlist as a graph and introduces MUXes for locking (through rewiring selected interconnects), generating isomorphic structures. These structures are indistinguishable to ML models and other structural attacks, effectively blocking all learnable/predictable features and preventing key deciphering. For evaluation, first, we prove IsoLock's security against modeling attacks. Second, we lock standard ISCAS-85 and ITC-99 benchmarks and evaluate them against state-of-the-art ML attacks (SCOPE and MuxLink) and structural attacks (Redundancy and SAAM), all of which can decipher only 0–6% of the correct key on average, confirming IsoLock's resilience in practice as well.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH2838

Floquet-Based Subspace Projection Method for Time-Domain PPV Computation

Yehao Zhang; Yuncheng Xu; Chenyi Tan; Jiamei Mi; Wenjie Yan; Yangfeng Su

Date:Monday, July 27 Location:Mtg Room 202AB Session:Generative AI and Graph-Based Learning for Next-Gen Circuit & Device Simulation +1
Abstract

Accurate computation of the Perturbation Projection Vector (PPV) is essential for analyzing oscillator phase noise. The PPV corresponds to the steady-state solution of the adjoint system in small-signal oscillator analysis, but in large time-constant oscillators, it is often obscured by numerous slow-decaying modes. To address this, we propose a Floquet-based subspace projection (FSP) method, which confines PPV computation to a low-dimensional subspace spanned by the PPV and slow-decaying solutions. Utilizing the biorthogonality condition, we construct a linear system within this subspace to lock the PPV, since slow-decaying solutions are orthogonal to the large-signal derivative. This linear system is small-scale and can be solved directly, which results in high-accuracy PPVs. Numerical results demonstrate that FSP significantly enhances both accuracy and computational efficiency.

EDAEDA6. Analog CAD, Simulation, Verification and TestChipletSYS4. Embedded System Design Tools and Methodologies
RESEARCH2845

AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs

Joongho Jo; Hyerin Lim; Hanjun Choi; Jongsun Park

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Reducing the number of Gaussian-tile pairs is one of the most promising approaches to improving 3D Gaussian Splatting (3D-GS) rendering speed on GPUs. However, previous works have never considered the differences in importance among Gaussian-tile pairs. In this paper, we propose AdaGScale, a novel viewpoint-adaptive Gaussian scaling technique for reducing the number of Gaussian-tile pairs. AdaGScale is based on the observation that peripheral tiles located far from the Gaussian center contribute negligibly to pixel color accumulation. This suggests an opportunity to reduce the number of Gaussian-tile pairs based on color contribution. AdaGScale efficiently estimates the color contribution in the peripheral region of each Gaussian during a preprocessing stage and adaptively scales its size based on the peripheral score. As a result, Gaussians with lower importance intersect with fewer tiles during the intersection test, which improves rendering speed while maintaining image quality. The adjusted size is used only for the tile intersection test, and the original size is retained during color accumulation to preserve visual fidelity. Experimental results show that AdaGScale achieves up to 9.47× speedup over the original 3D-GS on a GPU, with only about 0.2 dB degradation in PSNR.

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH2851

An Agile and Accurate Energy Estimation Methodology for CGRA-Mapped Algorithms: The Synchoros Approach

Ritika Ratnu; Dimitrios Stathis; Ahmed Hemani

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Interconnect-Driven System Architecture Innovation +1
Abstract

Coarse-Grained Reconfigurable Arrays (CGRAs) are promising as accelerators for energy and performance benefits. Agile and accurate energy estimates are essential to explore the CGRA design space and guide algorithm mapping decisions. Existing estimations are agile but inaccurate since they do not factor in the impact of wires in the design. This paper presents a synchoros energy estimation methodology for post-route accurate estimations. The methodology leverages the properties of the synchoros VLSI design style to incorporate the impact of interconnects in the design. Overall, our methodology achieves 98% estimation accuracy relative to post-route results for algorithms irrespective of variations in functionality, dimensions, and parallelism.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageQuantum +1 more
RESEARCH2871

AkiraRust: Re-Thinking LLM-Aided Rust Repair Using a Feedback-Guided Thinking Switch

Renshuang Jiang; Yichong Wang; Pan Dong; Xiaoxiang Fang; Zhenling Duan; Tinglue Wang; Yuchen Hu; Jie Yu; Zhe Jiang

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Intelligence-Aware Software: Efficiency, Reliability, and Security in the LLM Era +1
Abstract

Eliminating undefined behaviors (UBs) in Rust programs requires a deep semantic understanding to enable accurate and reliable repair. While existing studies have demonstrated the potential of LLMs to support Rust code analysis and repair, most frameworks remain constrained by inflexible templates or lack grounding in executable semantics, resulting in limited contextual awareness and semantic incorrectness. Here, we present AkiraRust, an LLM-driven repair and verification framework that incorporates a finite-state machine to dynamically adapt its detection and repair flow to runtime semantic conditions. AkiraRust introduces a dual-mode reasoning strategy that coordinates fast and slow thinking across multiple agents. Each agent is mapped to an FSM state, and a waveform-driven transition controller manages state switching, rollback decisions, and semantic checkpointing, enabling context-aware and runtime-adaptive repair. Experimental results show that AkiraRust achieves about 92% semantic correctness and delivers a 2.15× average speedup compared to SOTA.

SystemsSYS3. Embedded SoftwareAIAI5-I. AI/ML System and Platform Design
RESEARCH2880

UVMarvel: An Automated LLM-Aided UVM Machine for Subsystem-Level RTL Verification

Junhao Ye; Dingrong Pan; Hanyuan Liu; Yuchen Hu; Jie Zhou; Ke Xu; Xinwei Fang; Xi Wang; Nan Guan; Zhe Jiang

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Verification presents a major bottleneck in Integrated Circuit (IC) development, consuming nearly 70% of total effort. While the Universal Verification Methodology (UVM) improves reuse through structured verification environments, constructing subsystem-level UVM testbenches and generating high-quality stimuli still require extensive manual coding, repeated EDA tool runs, and deep protocol and micro-architectural expertise. We present UVMarvel, an automated verification framework that leverages Large Language Models (LLMs) to build UVM testbenches for subsystem-level RTL. UVMarvel introduces an Intermediate Representation (IR) and a Bus Protocol Library to translate heterogeneous specifications into protocol-correct subsystem-level UVM testbenches, and employs a Signal Tracker and a Verilog Patching Library to guide LLM-based stimuli refinement. UVMarvel is the first framework capable of automatically constructing subsystem-level UVM testbenches across mainstream bus protocols, and it achieves an average code coverage of 95.65%, while reducing verification time from several human working days to a 4.5-hour automated execution.

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2884

Voyager: An End-to-End Framework for Design-Space Exploration and Generation of DNN Accelerators

Kartik Prabhu; Jeffrey Yu; Xinyuan Pan; Zhouhua Xie; Abigail Aleshire; Zihan Chen; Ammar Ratnani; Priyanka Raina

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Accelerators and Design Methods for AI Workloads +1
Abstract

While deep neural networks (DNNs) have achieved state-of-the-art performance in fields from computer vision to natural language processing, efficiently running these computationally demanding models requires specialized hardware accelerators. However, designing these accelerators is a time-consuming, labor-intensive process that does not scale well across multiple design points. While prior efforts have sought to automate DNN accelerator generation, they typically offer limited parameterization, cannot produce high-performance, tapeout-ready designs, provide limited support for multiple datatypes and quantization schemes, and lack an integrated, end-to-end software compiler. This work proposes Voyager, a high-level synthesis (HLS)-based framework for rapid design space exploration and generation of DNN accelerators. Voyager overcomes the limitations of prior work by offering extensive configurability across technology nodes, clock frequencies, and scales, with customizable parameters such as number of processing elements, on-chip buffer sizes, and external memory bandwidth. Voyager supports a much wider variety of datatypes and quantization schemes than prior work, including built-in arbitrary-length floating-point, posit, and integer formats as well as user-defined custom formats, with both per-tensor scaling and microscaling quantization. Voyager's PyTorch-based compiler efficiently maps neural networks end-to-end on the generated hardware, with support for quantization, operation fusion, and tiling. We evaluate Voyager on state-of-the-art vision and language models. Voyager enables fast design-space exploration with full-dataset accuracy evaluation for different datatypes and quantization schemes. Generated designs achieve a high utilization across models and scales, up to 99.8%, and outperform prior generators with up to 61% lower latency and 56% lower area. Compared to hand-crafted accelerators, Voyager achieves comparable performance, while offering much greater automation in design and workload mapping.

AIAI4-II. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH2886

ATLAS: AI-Assisted Threat-to-Assertion Learning for System-on-Chip Security Verification

Ishraq Tashdid; Kimia Tasnia; Alexander Garcia; Jonathan Valamehr; Sazadur Rahman

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

This work presents ATLAS, an LLM-driven framework that bridges standardized threat modeling and property-based formal verification for System-on-Chip (SoC) security. Starting from vulnerability knowledge bases such as Common Weakness Enumeration (CWE), the framework identifies SoC-specific assets, maps relevant weaknesses, and generates assertion-based security properties and JasperGold scripts for verification. By combining asset-centric analysis with standardized threat model templates, it automates the transformation from vulnerability reasoning to formal proof. Evaluated on three HACK@DAC benchmarks, ATLAS detected 39 of 48 CWEs and generated correct properties for 33 of those 39 bugs, advancing automated, knowledge-driven SoC security verification toward a secure-by-design paradigm.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH2887

HyperEFs: Adaptive Clique Sampling for Scalable Effective Resistance Estimation in Large Hypergraphs

Hamed Sajadinia; Yihang Yuan; Zhuo Feng

Date:Tuesday, July 28 Location:Mtg Room 203AB Session:Structuring the Blueprint: Scalable Partitioning and Early-Stage Floorplanning +1
Abstract

We propose HyperEFs, a scalable framework for effective resistance estimation in large hypergraphs based on adaptive clique sampling. The proposed sampling strategy enables smooth transitions between star and clique expansions of hyperedges, substantially improving the quality of Krylov subspace embeddings. HyperEFs estimates pairwise effective-resistance distances in large hypergraphs in nearly linear time and integrates seamlessly into state-of-the-art multilevel hypergraph partitioning frameworks, yielding significantly improved solution quality. Extensive experiments on VLSI benchmarks demonstrate up to 100× speedup over HyperEF 2.0 with greatly reduced memory usage. On large-scale hypergraph datasets such as Titan23, our framework achieves superior cut sizes, demonstrating exceptional runtime scalability and efficiency.

EDAEDA7-II. Physical Design and VerificationDesign
RESEARCH2888

Enabling AI ASICs for Zero Knowledge Proof

Jianming Tong; Jingtian Dang; Simon Langowski; Tianhao Huang; Asra Ali; Jeremy Kun; Jevin Jiang; Srinivas Devadas; Tushar Krishna

Date:Monday, July 27 Location:Mtg Room 203C Session:Securing the Stack: Embedded and Cross-Layer Defenses from Silicon to Systems +1
Abstract

Zero-knowledge proof (ZKP) provers remain costly because the computationally intensive multi-scalar multiplication (MSM) and number-theoretic transforms (NTTs) dominate runtime. AI ASICs such as TPUs provide massive matrix throughput and SotA energy efficiency. We present MORPH, the first framework that reformulates ZKP kernels to match AI-ASIC execution. We introduce Big-T complexity, a hardware-aware complexity model that exposes heterogeneous bottlenecks and layout-transformation costs ignored by Big-O. Guided by this analysis, (1) at the arithmetic level, MORPH develops an MXU-centric extended-RNS lazy reduction that converts high-precision modular arithmetic into dense low-precision GEMMs, eliminating all carry chains, and (2) at the dataflow level, MORPH constructs a layout-stationary CPU–TPU Pippenger MSM and optimized 3/5-step NTT that avoid on-TPU shuffles and maintain full matrix-unit utilization. Implemented in JAX/XLA, MORPH enables TPUv5p to achieve better energy efficiency than, and performance comparable to, SotA GPU implementations of MSM and NTT.

SecuritySEC4. Embedded and Cross-Layer SecurityAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH2897

EXRA: Efficient Execution of Data-Dominated Applications via ISA Extensions for Expandable Register Files

Heshan Dissanayake; Darshana Jayasinghe; Sri Parameswaran

Date:Monday, July 27 Location:Mtg Room 203C Session:Streamlining Silicon: From Real-Time Scheduling to Efficient GPU Execution +1
Abstract

Modern processor architectures typically employ a fixed, small register file, which is well-suited for most computations due to its simplicity, energy efficiency, and ease of implementation. However, data-intensive applications often suffer from limited register availability; simply enlarging the register file increases code size, pressures the instruction cache, complicates decoding, and raises power consumption. To address these challenges, we propose an extension to the legacy RISC-V Instruction Set Architecture (ISA) that supports an expandable register file. Our design partitions the register file into multiple logical banks, each mirroring the standard 32-register configuration, allowing operands and destination registers to reside in different banks concurrently. We introduce instruction extensions, overhead reduction mechanisms, and exception-handling infrastructure to fully exploit the expanded register space on a scalar processor. The approach is implemented on the CVA6 CPU, a 6-stage RISC-V processor, and deployed on an FPGA with only 27% hardware overhead. Experimental results demonstrate substantial performance improvements: matrix multiplication achieves 60% speed-up with 17% energy reduction, convolutions improve by 48% with 22% energy reduction, and convolutional neural networks such as ResNet-50 achieve 83.5% speed-up with 45% energy reduction.

SystemsSYS4. Embedded System Design Tools and MethodologiesChipletEDA
RESEARCH2914

LUNA: LUT-Based Neural Architecture for Fast and Low-Cost Qubit Readout

Muhammad Ali Farooq; Giuseppe Di Guglielmo; Abhi Rajagopala; Nhan Tran; Vidya A. Chhabria; Aman Arora

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

Qubit readout is a critical operation in quantum computing systems, which maps the analog response of qubits into discrete classical states. Deep neural networks (DNNs) have recently emerged as a promising solution to improve readout accuracy. Prior hardware implementations of DNN-based readout are resource-intensive and suffer from high inference latency, limiting their practical use in low-latency decoding and quantum error correction (QEC) loops. This paper proposes LUNA, a fast and efficient superconducting qubit readout accelerator that combines low-cost integrator-based preprocessing with Look-Up Table (LUT) based neural networks for classification. The architecture uses simple integrators for dimensionality reduction with minimal hardware overhead, and employs LogicNets (DNNs synthesized into LUT logic) to drastically reduce resource usage while enabling ultra-low-latency inference. We integrate this with a differential evolution based exploration and optimization framework to identify high-quality design points. Our results show up to a 10.95× reduction in area and 30% lower latency with little to no loss in fidelity compared to the state-of-the-art. LUNA enables scalable, low-footprint, and high-speed qubit readout, supporting the development of larger and more reliable quantum computing systems.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH2945

Rememtier: Rethinking Memory Tiering for CXL-Based SSD via Large Granularity Memory Access

Yi-Sia Gao; Wen Sheng Lim; Tei-Wei Kuo; Yuan-Hao Chang

Date:Monday, July 27 Location:Mtg Room 202C +1 Session:The Dawn of the Computational Frontier: Pioneering Devices and Interconnects Empowering Memory-Centric, AI-Scale Systems +1
Abstract

The growing prevalence of mixed workloads in datacenter servers has intensified memory contention, where auxiliary tasks interfere with main applications, leading to excessive page swapping and degraded performance. To address this, we propose a CXL SSD–based Workload-aware Reversed Memory Tiering System (WA-RMT) that redefines memory hierarchy by dedicating CXL DRAM as the fast tier for main applications while isolating auxiliary tasks in host memory. Complementing this design, a Sparse-burst Resistant Page Replacement Policy (SBR-PRP) distinguishes short-lived bursty pages from truly hot ones to prevent memory pollution. Our implementation introduces cacheable and streamed CXL.io memory accesses with data reordering for reduced latency. Evaluations using mixed workloads such as K-means, Radiosity, and ResNet50 show that the proposed system reduces execution time by up to 35% and swap-in latency by over 60% compared to conventional tiering. Together, WA-RMT and SBR-PRP offer a robust, workload-aware memory management solution for CXL-based systems.

DesignDES5. Emerging Device and Interconnect TechnologiesAI
RESEARCH2947

REACT: Rapid Error-Tolerant Activation Compressor for Efficient Transformer Training

Seungyong Lee; Geonu Yun; Xuan Truong Nguyen; Hyuk-Jae Lee

Date:Tuesday, July 28 Location:Mtg Room 101A Session:Efficient LLM and MoE Inference on Specialized Hardware +1
Abstract

Training large language models (LLMs) poses considerable challenges due to long training times, extensive memory requirements, and substantial bandwidth demands. Activation compression is a promising approach to reduce the memory footprint since activation memory accounts for the majority of memory usage in LLM training with large batch sizes and long context lengths. However, because reducing data size increases the amount of information per bit, compressed data may become more sensitive to faults, which could adversely affect model convergence and accuracy. In this work, we demonstrate that compression can enhance robustness to bit errors and present REACT, a Rapid Error-tolerant Activation Compressor for efficient Transformer training, which not only reduces memory usage but also enhances error tolerance. Using base-delta encoding, which stores only the differences between data, it minimizes data size while significantly reducing the number of error-vulnerable bits. Additionally, an error-limiting encoding scheme is incorporated to address the remaining vulnerable bits. Experimental results show that REACT maintains accuracy under a bit error rate 1000 times higher while achieving a 4× reduction in activation memory, with negligible performance overhead.
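The base-delta idea mentioned in this abstract can be illustrated with a minimal Python sketch. This is not REACT's encoder: the abstract does not specify the data format, block size, or the error-limiting scheme, so the integer values, the choice of the first element as the base, and the helper names `base_delta_encode`, `base_delta_decode`, and `delta_bits` are all assumptions made for illustration.

```python
from typing import List, Tuple


def base_delta_encode(values: List[int]) -> Tuple[int, List[int]]:
    """Keep the first value as the base; store the rest as differences from it.

    Assumption: integer data and "first element as base" are illustrative
    choices, not details taken from the abstract.
    """
    if not values:
        raise ValueError("values must be non-empty")
    base = values[0]
    deltas = [v - base for v in values[1:]]
    return base, deltas


def base_delta_decode(base: int, deltas: List[int]) -> List[int]:
    """Rebuild the original values from the base and the stored differences."""
    return [base] + [base + d for d in deltas]


def delta_bits(deltas: List[int]) -> int:
    """Bits needed to store every delta as a two's-complement integer."""
    if not deltas:
        return 0
    # A non-negative d needs d.bit_length() + 1 bits; a negative d needs
    # (-d - 1).bit_length() + 1 bits. Take the widest case.
    widest = max(max(deltas), -min(deltas) - 1, 0)
    return widest.bit_length() + 1


block = [1000, 1003, 998, 1001]
base, deltas = base_delta_encode(block)
restored = base_delta_decode(base, deltas)
```

When neighboring values are close, each delta fits in far fewer bits than a full value (3 bits per delta for the block above), so fewer stored bits are exposed to faults. That is the intuition the abstract appeals to, under the assumptions stated.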

AIAI4-II. AI/ML Architecture DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2959

ENACT: Ensemble Neural Fields for Reactive Robot Control

Hammad Omar; Ismail Mohamed; Mohamed Ibrahim

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Intelligent Systems for Physical & Structured Worlds +1
Abstract

Tracking moving targets through clutter and occlusions demands real-time sensorimotor control that adapts to dynamic uncertainty. Model predictive control (MPC) incurs high computational cost, while reinforcement learning requires task-specific retraining. We introduce ENACT (Ensemble Neural Attractor Components for Tracking), a modular framework decomposing reactive control into coordinated Dynamic Neural Fields (DNFs). Each DNF module (attention, gating, memory, context) addresses a distinct sensorimotor primitive through continuous attractor dynamics, enabling runtime reconfiguration without retraining. We benchmark ENACT both in simulation and on a Cortex-M7 microcontroller, showing up to 86% lower tracking RMS error, ~2.8x faster disturbance-recovery times, predictable sub-millisecond control latency, and ~9x lower SRAM footprint compared to an MPC baseline.

AIAI2-II. AI/ML Algorithms and ModelsSystems
RESEARCH2970

GREEN: Towards Scalable Energy-Efficient Workload Scheduling and Placement in the Cloud

Jinghua Wang; Asser Tantawi; Olivier Tardieu; Alaa Youssef; Tamar Eilam; Chen Wang; Pradip Bose; Jiaxin Wan; Klara Nahrstedt; Deming Chen

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

The workload scheduling and placement problem has been at the core of resource management in various distributed systems, including modern cloud computing systems. As modern cloud systems become larger, more heterogeneous, dynamic, complicated, and energy-hungry, developing a scalable, flexible, adaptive, and effective energy-efficient resource manager for such a cloud system is becoming very challenging. Specifically, existing machine learning-based resource managers cannot scale well, and existing heuristic algorithm-based scalable resource managers do not generate the most energy-efficient solutions. To address the limitations in existing methods and solve the problem more effectively, we propose GREEN, a graph neural network-reinforcement learning (GNN-RL) based cloud resource manager that generates highly energy-efficient workload scheduling and placement solutions and scales up to large cloud systems with hundreds to thousands of servers. GREEN reduces this challenging problem to a graph optimization problem and then uses our novel RL formulation and GNN architecture to generate scheduling and placement solutions for cloud systems at various scales. Through extensive cloud simulations using COSCO and real-world experiments using CloudLab, we found that GREEN's solutions achieve up to 2.17x energy savings relative to those generated by the best previous state-of-the-art (SOTA) resource managers without compromising Service Level Objective (SLO) metrics. Most importantly, GREEN can generate similarly high-quality scheduling and placement solutions on systems with 100 to 1000 servers in COSCO cloud simulations.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH2975

iADS: In-Situ Adaptive Diffusion Sampling Architecture Based on Voltage-Tunable P-Bits

Chenhao Xia; Haoxuan Wang; Weican Chen; Hongjie Zeng; Guanwen Yao; Yuxiang Zhang; Yunyi Fu; Fei Liu

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

This work proposes a self-compensating p-bit unit and constructs an in-situ adaptive diffusion sampling (iADS) architecture. To the best of our knowledge, this is the first demonstration of in-situ Gaussian noise generation and scaling in hardware. An iADS circuit composed of 256 p-bit units was fabricated and generated MNIST images in experiments. Furthermore, a hardware-calibrated 1,048,576 p-bit iADS architecture achieved training and generation for CIFAR10 (FID=9.96) and CelebA-HQ by combining hardware adaptive scaling. This approach achieves approximately 8× reduction in energy consumption and roughly 143× reduction in area relative to PRNGs, underscoring its potential for highly efficient acceleration of DMs.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH2976

REPAIR: Reliability Enhancement of NVM FPGAs Through Lifetime-Aware Differential Partial Reconfiguration

Miran Tobar; Hassan Nassar; Joerg Henkel

Date:Monday, July 27 Location:Mtg Room 203AB +1 Session:Memory That Thinks: Scalable, Reliable, and Secure Compute-in-Memory Systems +1
Abstract

Non-Volatile Memory (NVM)-based Field Programmable Gate Arrays (FPGAs) have attracted attention due to lower power consumption, higher logic density, and instant-on capability from retained configuration data. However, NVM cells have limited write endurance and higher latency than SRAM, which can affect long-term reliability, especially under frequent partial reconfiguration (PR). Although PR improves design efficiency by updating only part of the logic, it imposes lifetime limitations, particularly on NVM-based Look Up Tables (LUTs). In this paper, we introduce REPAIR, a framework that minimizes LUT reconfiguration through PR without changing the normal CAD flow. REPAIR uses a lifetime-aware fixed placement technique in the preprocessing stage to reduce LUT reconfigurations. Moreover, it includes a postprocessing tool that generates differential partial bitstreams, writing only the changed LUT configuration bits between consecutive configurations. Finally, at runtime it incorporates a low-overhead scheduler to optimize LUT updates within the reconfiguration path. Together, these mechanisms reduce write stress on NVM cells and enhance the lifetime and reliability of Non-Volatile FPGAs (NVFPGAs) while preserving the performance and flexibility benefits of PR. Our experimental results show that REPAIR extends the operational lifetime of NVFPGAs by at least 7× compared to a baseline PR using commercial tools.
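The differential-write idea in this abstract (writing only the configuration bits that changed) can be sketched in a few lines of Python. This is an illustrative toy, not REPAIR's tool or any vendor bitstream format: modeling a LUT configuration as a plain integer, the 8-bit width, and the helper names `changed_bits` and `apply_diff` are all assumptions.

```python
from typing import List


def changed_bits(old: int, new: int, width: int) -> List[int]:
    """Return the bit positions where two LUT configurations differ.

    Assumption: a LUT configuration is modeled as a `width`-bit integer.
    """
    diff = (old ^ new) & ((1 << width) - 1)  # XOR marks every differing bit
    return [i for i in range(width) if (diff >> i) & 1]


def apply_diff(old: int, new: int, positions: List[int]) -> int:
    """Write only the listed bit positions of `new` into `old`."""
    result = old
    for i in positions:
        bit = (new >> i) & 1
        result = (result & ~(1 << i)) | (bit << i)
    return result


old_cfg = 0b10101100
new_cfg = 0b10100101
positions = changed_bits(old_cfg, new_cfg, width=8)
updated = apply_diff(old_cfg, new_cfg, positions)
```

In this toy case only 2 of the 8 cells are written instead of all 8, which is the kind of write-stress reduction the abstract describes. Real savings depend on placement and on how similar consecutive configurations are.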

DesignDES2A. In-memory and Near-memory Computing CircuitsAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2996

HoloCode: Hybrid Optical-Electronic Edge Encoding for Privacy-Preserving Cloud Training

Ruofan Xing; Arman Akbari; Weikai Lin; Adith Boloor; Alexander McNeil; Michael Moebius; Yongmin Liu; Yuhao Zhu; Xuan Zhang

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Privacy-preserving machine learning aims to defend against adversaries without sacrificing task accuracy. In latency-critical and resource-constrained settings, existing cryptographic and encoding approaches incur heavy overheads that translate into intolerable delays and energy costs. We present HoloCode, a hybrid optical–electronic encoding pipeline that delivers strong privacy with sub-5ms latency at a fraction of the energy of prior state-of-the-art. HoloCode encodes only task-relevant signals while shielding sensitive features, resists inversion attacks, and locks models with a private key to prevent misuse. HoloCode builds on an edge–cloud collaboration framework, where inference is pushed to the edge to cut latency, at the cost of higher edge energy. To break this trade-off, we adopt an optical–digital hybrid pipeline that leverages zero-energy optical processing to reduce latency and edge energy simultaneously. Against strong privacy-preserving baselines, HoloCode achieves 10x faster inference and 50% lower edge energy, while preserving accuracy and resisting privacy feature leakage and reconstruction attacks.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH2997

WINNER: A Wireless In-Vivo/Ex-Vivo Runtime Analyzer for Intermittent Computing

Justin Feng; Arman Roohi; Nader Sehatbakhsh

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Design for the Unpredictable: Resilience and Adaptation in AIoT +1
Abstract

Intermittent computing systems suffer from limited observability and lack effective runtime analysis mechanisms. Unlike conventional systems, they are vulnerable not only to typical programming errors but also to a distinct class of failures known as intermittence bugs, which emerge due to their fragmented, non-linear execution model caused by frequent power loss. Traditional debugging tools are ill-suited for this context, as they assume persistent power and continuous connectivity. To address these challenges, we present a new Wireless In-Vivo/Ex-Vivo runtime analyzer called WINNER. It is a lightweight task management layer that records fine-grained execution traces (such as interrupts and memory/register states) into non-volatile memory during regular operation (in-vivo), and exposes them only during debug sessions (ex-vivo) triggered automatically by a user-defined property violation, such as exceeding a limit on repeated task executions. Building on this, WINNER introduces a new software-based communication paradigm that leverages controllable electromagnetic emanations (side-channel) for frequent but low-rate uplink communication, enabling software-controlled ultra-low-energy data transmission (up to 4x lower than BLE). We evaluate our solution across six applications and five platforms, demonstrating its effectiveness in detecting various bugs. The introduced overhead is minimal, with a runtime increase of less than 1% and an average binary size increase of 6%. Our tool is open-sourced.

SystemsSYS2. Design of Cyber-Physical Systems and IoTQuantumDES6. Quantum Computing
RESEARCH3019

Decoding the Past: Recovering Sensitive Data from SRAM Aging Imprints

Zakia Tamanna Tisha; Gaines Odom; Biswajit Ray; Ujjwal Guin

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

Long-term data remanence in SRAMs can pose serious security risks when discarded ICs retain sensitive information. Unlike DRAM and Flash memories, SRAMs have been largely overlooked in this field due to their very short retention periods. Prior works demonstrated that aging-induced imprints enable partial data recovery in SRAMs, but only when recorded initial power-up states are available. In this paper, we propose a recovery approach that eliminates this requirement by reconstructing initial states through controlled aging. By analyzing the reconstructed and aged power-up states, we demonstrate near-complete data recovery from SRAM chips after 12 hours of controlled aging using only 32 copies.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH3027

Whitespace-Free Rectilinear 3D Floorplanning with Pressure-Driven Orthogonal Laguerre-Voronoi Tessellation (POLT)

Tse-Han Pan; Paul Franzon

Date:Monday, July 27 Location:Mtg Room 201B Session:Advances in 3D/2.5D Physical Design +1
Abstract

A major challenge in 3DIC design is determining the top-level floorplan of a chip stack while accounting for die-to-die (D2D) interfaces. Existing approaches often model modules as simple rectangles and optimize area and wirelength under limited constraints, diverging from real scenarios where hard modules have fixed outlines and some have alignment constraints, while soft modules have undefined shapes. To address these issues, we propose a two-stage, whitespace-free rectilinear 3D floorplanning process. The first stage places hard modules and determines the spatial relationship of soft modules using a force-directed algorithm; the second tessellates each soft module into a rectilinear shape with minimal area error using POLT. The method integrates constrained hard modules and flexible soft modules to generate rectilinear, whitespace-free floorplans that meet target areas and alignment constraints. Experimental results show that our method achieves an average 5.6% reduction in HPWL on 3D floorplans with alignment constraints and reduces area error by 98% compared to existing rectilinear floorplanners, while maintaining less than 0.16% area error across all 2D and 3D stacking scenarios.

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageSYS4. Embedded System Design Tools and Methodologies
RESEARCH3029

CO-MAC: A Center-Out Ordered Stochastic MAC for Low-Latency Inference

Hwangyu Cho; Tong Chen; Younghyun Kim

Date:Wednesday, July 29 Location:Mtg Room 203AB Session:Harnessing the Noise: p-Bits, ZKPs, and Resilient AI +1
Abstract

Stochastic computing offers efficient approximate arithmetic that aligns well with error-tolerant machine learning workloads, but its deployment is limited by long bitstream latency in stochastic multiply-accumulate (MAC) units. Prior work reduces MAC latency through deterministic bitstream generation and differential accumulation, but these methods do not fully exploit the statistical property of convolution weights. This work presents a novel stochastic MAC architecture named CO-MAC, which employs center-out weight ordering and an enhanced convolution engine design to reduce effective computation cycles while maintaining high accuracy. The method sorts weights by magnitude, reuses the incremental differences in magnitudes, and applies sign handling after accumulation. This shortens counter activity, maintains accuracy with long effective bitstreams, and simplifies the MAC hardware by avoiding bidirectional counters. Across convolutional neural network workloads, CO-MAC decreases MAC latency by up to 54.8% compared to prior stochastic MAC architectures, while preserving accuracy and hardware simplicity.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesSystems
RESEARCH3042

BREW-RC: Bit-Exact Recovery of ECCs from Write Timing in ReRAM Crossbars

Tyler Sheaves; Alric Althoff; Dustin Richmond

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

We present BREW-RC, a non-invasive technique that uses a timing side channel to reverse engineer error-correcting codes (ECCs) in ReRAM crossbar memories. ECCs use a parity submatrix to extend data words with parity bits, forming codewords. These parity bits cannot be read directly, but BREW-RC reveals them by using write latency variability to determine the aggregate polarity of codeword transitions. Timing perturbations caused by ECC parity bit updates are encoded as a SAT instance whose solution yields the bit-exact parity matrix. Knowledge of the parity matrix enables: (1) improved timing models for side-channel attacks/mitigations and latency-sensitive applications, and (2) ECC-aware reliability characterization of devices whose error correction capabilities are unadvertised. We demonstrate BREW-RC on two commercial ReRAM products, recovering the complete ECC from multiple samples of each.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH3050

Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs

Jun-Liang Lin; Kamesh Madduri; Mahmut Kandemir

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited to single-GPU systems, leading to long training times or out-of-memory issues on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, as efficiency depends heavily on both the graph structure and system characteristics, such as bandwidth and memory capacity. In this work, we introduce a distributed training framework for graph transformers, which automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks, our proposed framework achieves up to 6x speedup with system scaling up to 8 GPUs. These results demonstrate that the proposed framework improves the scalability of graph transformers, bringing them closer to serving as practical graph foundation models.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more
RESEARCH3059

EPAR: Electromagnetic Pathways to Architectural Reliability in Quantum Processors

Navnil Choudhury; Yizhuo Tan; Jiaqi Yu; Jakub Szefer; Kanad Basu

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Visionary Advances in Quantum Engine Through Orchestrating Hardware, Cryogenic Systems, and Real-Time Control +1
Abstract

As superconducting processors scale, understanding how physical layout shapes qubit interactions is essential for architectural reliability. Existing methods offer limited insight into how electromagnetic design choices translate into execution-level behavior. We present EPAR, an electromagnetic-to-architecture framework that predicts robustness early, directly from the physical design, by reconstructing how design distortion modifies the effective Hamiltonian, reroutes mediated connectivity, and influences control-pulse response. Across all tested layouts, EPAR's structural scores show 100% agreement with two-qubit error trends yet reveal over 10x robustness differences among edges with identical calibrated error rates, going beyond conventional metrics to provide improved and actionable compiler guidance.

DesignQuantumDES6. Quantum ComputingSystems
RESEARCH3067

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Abdul Basit; Saim Rehman; Muhammad Shafique

Date:Wednesday, July 29 Location:Mtg Room 101A Session:The Edge-Lord Chronicles: Tiny Gestures, Big LLMs, and Cluster Chaos +1
Abstract

Realizing on-device ML-based gesture detection under tight real-time performance, energy, and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel runtime-adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A runtime controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced-complexity detection. To evaluate the performance of our system in real-world car driving scenarios, we introduce a temporally annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by ≈4× (from ≈6.9 mJ to ≈1.6 mJ) while maintaining high gesture-detection performance (event-level F1 ≈ 0.8–0.9) and low mean latency (≈6 ms).

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH3082

UGLMS-uSMILE: Live Measurement Sequencing for Segmented Linearity Identification in ADCs

Thorben Schey; Khaled Karoonlatifi; Michael Weyrich; Andrey Morozov

Date:Monday, July 27 Location:Mtg Room 202AB Session:Generative AI and Graph-Based Learning for Next-Gen Circuit & Device Simulation +1
Abstract

Conventional Analog-to-Digital Converter (ADC) linearity testing separates data acquisition from model extraction, resulting in long post-processing times. This work unifies two complementary strategies, Uncertainty-Guided Live Measurement Sequencing (UGLMS) for adaptive test stimulus control and ultrafast Segmented Model Identification of Linearity Error (uSMILE) for error parameter extraction, into a single closed-loop test framework. The combined UGLMS–uSMILE method performs live estimation of MSB, ISB, and LSB-level nonlinearity during measurement, guided by covariance-based uncertainty metrics. This eliminates offline processing while retaining high accuracy and stability, enabling rapid, measurement-driven characterization of ADC linearity directly within production test environments. Experimental results on Successive Approximation Register (SAR) ADCs demonstrate sub-0.2 LSB accuracy and test times below 150 ms for 16-bit resolution.

EDAEDA6. Analog CAD, Simulation, Verification and TestChipletSYS4. Embedded System Design Tools and Methodologies
RESEARCH3087

Train Once, Calibrate Always: Machine-Learning-Assisted Blind Calibration for Analog-to-Digital Converters

Shamma Nasrin; Matt Kinsinger; Anoop Bengaluru; Jia-Ching Chuang; Sumukh Bhanushali; Arindam Sanyal

Date:Monday, July 27 Location:Mtg Room 203AB Session:Architecting the Future: Cross-Layer Breakthroughs in Compute and Memory +1
Abstract

This work introduces a machine-learning (ML)-based calibration framework for analog-to-digital converters (ADCs) that is demonstrated on an over-sampled, time-interleaved band-pass delta-sigma ADC (TI-BPADC) test chip in 65 nm and a Nyquist successive approximation register (SAR) ADC test chip in 28 nm. The proposed framework employs two models, a hybrid convolutional-recurrent network (ConvRec) and a residual convolutional network (ResConv), for suppressing both static and dynamic errors in ADCs, and presents trade-offs between the two models in terms of calibration accuracy and hardware cost. The proposed models improve signal-to-noise-and-distortion ratio (SNDR) and spurious-free dynamic range (SFDR) by more than 20 dB on the ADC test chips without requiring prior knowledge of circuit architecture, error mechanism, or input statistics. This is in contrast to existing calibration techniques, which are either algorithmic and require prior knowledge of errors for calibration, or leverage ML for calibration but need knowledge of input statistics. The input- and error-agnostic property of the proposed ML framework is the key differentiator of this work and allows correction of errors, including errors that are unforeseen at design time.

DesignDES4. Digital and Analog CircuitsAI2-I. AI/ML Algorithms and Models
RESEARCH3091

Shared Logic Unleashed: Multiple-Node Boolean Optimization for Next-Gen Synthesis

Alessandro Tempia Calvino; Anubhaw Xess; Anika Prasad; Jacob Minz; Eleonora Testa; Walter Lau Neto; Alan Mishchenko; Giovanni De Micheli; Patrick Vuillod; Luca Amaru

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Revolutionizing Synthesis Flows +1
Abstract

We present a high-effort Boolean optimization framework that restructures multiple nodes in combinational logic simultaneously, enabling area reductions beyond the reach of conventional single-output transformations. Our method generalizes Boolean resubstitution on both nodes and edges to operate across shared logic while leveraging don't-care conditions. Our approach supports coordinated removal and resynthesis of shared logic regions and expands the optimization scope to multiple-output windows. When deployed in an industrial standard-cell design flow, our framework achieves up to 6.79% area reduction and 7.85% switching power reduction post-synthesis. Additionally, it establishes 19 new best results in the EPFL synthesis competition on LUT networks.

EDAEDA5. RTL/Logic Level and High-level SynthesisEDA7-II. Physical Design and VerificationDesign
RESEARCH3092

SecureRouter: Encrypted Routing for Efficient Secure Inference

Yukuan Zhang; Mengxin Zheng; Qian Lou

Date:Monday, July 27 Location:Mtg Room 203C +1 Session:Hidden in Plain Sight: Designing Systems for Private Intelligence +1
Abstract

Cryptographically secure neural network inference typically relies on secure computing techniques such as Secure Multi-Party Computation (MPC), enabling cloud servers to process client inputs without decrypting them. Although prior privacy-preserving inference systems co-design network optimizations with MPC, they remain slow and costly, limiting real-world deployment. A major bottleneck is their use of a single, fixed transformer model for all encrypted inputs, ignoring that different inputs require different model sizes to balance efficiency and accuracy. We present SecureRouter, an end-to-end encrypted routing and inference framework that accelerates secure transformer inference through input-adaptive model selection under encryption. SecureRouter establishes a unified encrypted pipeline that integrates a secure router with an MPC-optimized model pool, enabling coordinated routing, inference, and protocol execution while preserving full data and model confidentiality. The framework includes training-phase and inference-phase components: an MPC-cost-aware secure router that predicts per-model utility and cost from encrypted features, and an MPC-optimized model pool whose architectures and quantization schemes are co-trained to minimize MPC communication and computation overhead. Compared to prior work, SecureRouter reduces latency by 1.95× with negligible accuracy loss, offering a practical path toward scalable and efficient secure AI inference.

SecuritySEC1. AI/ML Security/PrivacyAIDES5. Emerging Device and Interconnect Technologies
RESEARCH3100

Escaping Flatland: A Placement Flow for Enabling 3D FPGAs

Cong "Callie" Hao; Andrew Kahng; Bodhisatta Pramanik; Ismael Youssef

Date:Tuesday, July 28 Location:Mtg Room 202C Session:Composable Systems- The Catalyst for Memory, Acceleration, and Design Tools +1
Abstract

3D field-programmable gate arrays (FPGAs) promise higher performance through vertical integration. However, existing placement tools, largely inherited from 2D frameworks, fail to capture the unique delay characteristics and optimization dynamics of 3D fabrics. We introduce a 3D FPGA placement flow that integrates partitioning-based initialization, adaptive cost scheduling, refined delay estimation, and a simulated annealing move set — all targeted at 3D FPGA architecture. Together, these enhancements improve timing estimates and the exploration of layer assignments during placement. Compared to Verilog-To-Routing (VTR), our experiments show geometric-mean (max) critical-path delay reductions of ∼3% (∼7%), ∼2% (∼4%), ∼3% (∼8%), and ∼6% (∼18%) for four 3D architectures: 3D CB, 3D CB-O, 3D CB-I, and 3D SB, respectively. We also achieve geometric-mean (max) routed wirelength reductions of ∼1% (∼3%), ∼2% (∼8%), < 1% (∼5%), and ∼5% (∼10%), respectively. Our work will be permissively open-sourced on GitHub.

DesignDES1-I. SoC, Heterogeneous, and Reconfigurable ArchitecturesEDA7-II. Physical Design and Verification
RESEARCH3120

DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages

Alish Kanani; Sangwan Lee; Han Lyu; Jiahao Lin; Jaehyun Park; Umit Ogras

Date:Wednesday, July 29 Location:Mtg Room 101A Session:Architectures for AI of Many Shapes +1
Abstract

Large language models operate in two distinct phases: a compute-bound prefill followed by a memory-bandwidth-bound decode. Hybrid Mamba–Transformer models inherit this asymmetry while adding state-space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package utilizes systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package utilizes vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector–matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens over the B200 GPU.

AIAI4-I. AI/ML Architecture DesignAI5-I. AI/ML System and Platform Design
RESEARCH3121

LEXI: Lossless Exponent Coding for Efficient Inter-Chiplet Communication in Hybrid LLMs

Miao Sun; Alish Kanani; Kaushik SHROFF; Umit Ogras

Date:Tuesday, July 28 Location:Mtg Room 201B Session:Interconnect-Driven System Architecture Innovation +1
Abstract

State-of-the-art large language models rely on the bfloat16 (BF16) format for stable training, effectively avoiding underflow and overflow. However, inference is heavily bottlenecked by data movement across chiplet systems. While floating-point standards allocate fixed bits, our profiling reveals that BF16 exponent streams exhibit surprisingly low Shannon entropy (less than 3 bits), indicating high inherent compressibility. To exploit this, we propose LEXI, a novel lossless, exponent-only scheme based on Huffman coding. LEXI enables on-the-fly compression of activations and caches and stores weights in compressed form for just-in-time decompression near compute, without degrading overall system throughput. Integrated into a GF 22 nm Simba architecture, LEXI reduces inter-chiplet communication time by 33–45% and end-to-end LLM inference latency by 30–35% on modern models (Jamba, Zamba, Qwen). With a minimal 0.09% area overhead, LEXI marks a critical step toward efficient LLM deployment on modular chiplet systems.
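
The entropy observation above can be illustrated with a minimal Python sketch. It is not LEXI's implementation: the function names are hypothetical, BF16 is approximated by truncating a float32 to its top 16 bits (so the 8-bit exponent field is unchanged), and the Huffman coder itself is not shown. The sketch only measures how many bits per symbol an exponent stream carries, which is the quantity the abstract reports as less than 3 bits.

```python
import math
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit biased exponent of x as stored in BF16.

    Assumption: BF16 is obtained by truncating float32 to its upper 16 bits,
    so the exponent field (float32 bits 23..30) is identical. Round-to-nearest
    conversion could occasionally bump the exponent; that case is ignored here.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF


def shannon_entropy(symbols) -> float:
    """Shannon entropy in bits per symbol of a non-empty symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def exponent_entropy(values) -> float:
    """Entropy of the BF16 exponent stream of a sequence of floats."""
    return shannon_entropy(bf16_exponent(v) for v in values)


if __name__ == "__main__":
    import random

    # Hypothetical stand-in for activations: zero-mean Gaussian samples.
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]
    print(f"exponent entropy: {exponent_entropy(sample):.2f} bits of 8")
```

Because a Huffman code's average length stays within one bit of the source entropy, an exponent stream measured this way at under 3 bits would compress well below its fixed 8-bit field, while the sign and mantissa bits are left untouched.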

EDAChipletEDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-PackageQuantum +1 more
RESEARCH3129

HW-Router: Hardware-Aware Routing for Scalable Multi-LLM Serving

Ahasan Kabir; Jiaqi Xue; Mengxin Zheng; Qian Lou

Date:Wednesday, July 29 Location:Mtg Room 201B Session:Serving Intelligence at Scale: Architectural and Runtime Innovations for Large Models +1
Abstract

Modern large language model (LLM) serving platforms deploy multiple models across different GPUs, requiring routers to direct incoming queries to appropriate LLMs. However, existing routing approaches primarily rely on static model attributes such as size or FLOPs to estimate serving costs. This static cost modeling fails to capture the dynamic behavior of real deployments, where the same model can exhibit vastly different inference latencies depending on hardware type (e.g., H100 vs. V100), current system load (e.g., running and waiting queue lengths), and resource contention (e.g., KV-cache usage and GPU utilization). Such hardware-agnostic routing leads to suboptimal decisions, resulting in SLO violations, queue buildup, and underutilized GPUs. To address these challenges, we present HW-Router, a dynamic routing framework that integrates real-time hardware signals into model selection to enable accurate latency prediction and intelligent, SLO-aware routing decisions. Our approach incorporates model-specific features (architecture, size, input length) alongside hardware metrics including queue lengths, KV-cache utilization, and recent TTFT/TPOT performance, and uses a lightweight latency predictor to estimate per-model-per-GPU serving time. Evaluations across diverse workloads show that HW-Router achieves 3.4–3.9× lower end-to-end latency, 46–48 percentage points higher SLO attainment, 6–8× lower GPU load skew, and a 3.1–3.4× reduction in waiting-queue fraction compared to state-of-the-art router baselines, CARROT and IRT, with only ~200 µs of additional routing overhead and no loss in output quality. These results highlight the importance of real-time hardware feedback for scalable, predictable, and well-balanced multi-LLM serving.

AIAI5-II. AI/ML System and Platform DesignSystems
RESEARCH3145

STAMP: Semiconductor Tool-Chain Attestation Through Wafer-Level Machine Learning-Based Process Monitoring

Suraag Sunil Tellakula; Ching-Yi Chang; Matthew Nigh; John Carulli; Yiorgos Makris

Date:Tuesday, July 28 Location:Mtg Room 203C Session:Silicon Integrity 2.0: Detection, Tracking, and Attestation +1
Abstract

We introduce a machine learning approach for distinguishing between wafers fabricated using a sanctioned/ratified chain of tools on a semiconductor manufacturing floor and unsanctioned/unratified versions of the tool-chain, based on metrology or wafer acceptance test data collected during manufacturing and testing. Our method exploits the systematic nature of process variation and captures the subtle causal relationships between tool exchanges and the resulting changes in physical dimensions or electrical characteristics of the silicon, which can then be used for enabling wafer-level tool-chain attestation. The effectiveness of our solution is demonstrated on a dataset of inline and e-test measurements from 7000 wafers fabricated with multiple tool-chain variants in the GlobalFoundries 12LP FinFET technology node.

SecuritySEC3-II. Hardware Security: Attack and DefenseQuantumDES6. Quantum Computing +1 more
RESEARCH3153

A Boolean Processor-Based Hardware Emulation System with Reconfigurable Interconnects

Ruiyao Pu; Muhan Li; Li Shang; Fan Yang

Date:Monday, July 27 Location:Mtg Room 201B Session:Fast, Smart, and Agentic: Accelerated Verification with Fuzzing, RL, and LLMs +1
Abstract

Existing Boolean processor-based (BP-based) hardware emulation systems typically rely on fixed interconnects, which often suffer from severe bandwidth underutilization under highly imbalanced traffic. We propose a BP-based hardware emulation system with hierarchical reconfigurable interconnects. At both the inter-chip and intra-chip levels, our system dynamically reallocates idle interconnect lanes from low-traffic pairs to high-demand pairs, alleviating communication bottlenecks and improving interconnect utilization without modifying the underlying ASIC fabric. The hardware is co-designed with the compiler, which profiles communication, generates per-lane configuration tables and scheduling strategies, and programs the interconnect via registers. Experimental results on industrial digital designs show up to 64% emulation performance improvement and an average 27% reduction in compilation time, with only 7.4% area overhead compared to a fixed-interconnect baseline system.

EDAEDA2. Design Verification and ValidationAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH3154

Unlocking Automated Datapath Gating via Machine Learning Power Prediction

Felipe Marranghello; Giulia Meuli; Barkha Gupta; Alessandro Tempia Calvino; Eleonora Testa; Walter Lau Neto; Patrick Vuillod; Luca Amaru

Date:Monday, July 27 Location:Mtg Room 201B +1 Session:Foundations and Frontiers: Bridging Traditional EDA and Generative AI +1
Abstract

Clock gating and data gating are established techniques for reducing dynamic power. However, performing automated data gating during logic synthesis remains challenging due to the difficulty of accurately estimating power savings and identifying common Observability Don't Care (ODC) conditions for the enable logic. This work presents a methodology that addresses both challenges: it introduces a novel SAT-based technique to compute valid ODC conditions and employs a machine learning-based power model to predict power improvements with high accuracy. The approach is integrated into an industrial synthesis flow and achieves efficient, fully automated data gating. Experimental results demonstrate an average dynamic power change of -1% post place & route, with negligible area and runtime overhead.

EDAEDA5. RTL/Logic Level and High-level SynthesisAIDES5. Emerging Device and Interconnect Technologies
RESEARCH3200

STELLAR: Structure-Guided LLM Assertion Retrieval and Generation for Formal Verification

Saeid Rajabi; Chengmo Yang; Satwik Patnaik

Date:Wednesday, July 29 Location:Mtg Room 202C Session:Shields Up: Post-Quantum Hardware, Confidential Computing, and Silicon-Level Resilience +1
Abstract

Formal Verification (FV) relies on high-quality SystemVerilog Assertions (SVAs), but the manual writing process is slow and error-prone. Existing LLM-based approaches either generate assertions from scratch or ignore structural patterns in hardware designs and expert-crafted assertions. This paper presents STELLAR, the first framework that guides LLM-based SVA generation with structural similarity. STELLAR represents RTL blocks as AST structural fingerprints, retrieves structurally relevant (RTL, SVA) pairs from a knowledge base, and integrates them into structure-guided prompts. Experiments show that STELLAR achieves superior syntax correctness, stylistic alignment, and functional correctness, highlighting structure-aware retrieval as a promising direction for industrial FV.

SecuritySEC2. Hardware Security: Primitives, Architecture, Design & TestSystems
RESEARCH3208

Making Sense of Job Preemption for Distributed Deep Learning Acceleration

Younghun Go; Changyong Shin; Minchul Kang; Jaehyun Hwang; Chuck Yoo; Gyeongsik Yang

Date:Wednesday, July 29 Location:Mtg Room 101A Session:The Edge-Lord Chronicles: Tiny Gestures, Big LLMs, and Cluster Chaos +1
Abstract

Preemptive scheduling is gaining attention in GPU scheduling for distributed training because it reduces job completion times (JCT). However, our analysis of production traces reveals that, surprisingly, it can increase JCT in practice. We identify two key contributors to this inefficiency. First, preemptions become "futile" when a job is preempted right after being loaded, before it begins execution. We find that this futile preemption wastes the load time and thus inflates JCT by ∼1.6×. Second, existing GPU schedulers run at fixed intervals (e.g., 360 s) rather than upon new job arrivals or when a job completes. As a result, new jobs must wait until the scheduler kicks in, which, we find, increases JCT by ∼2.6×. To address these problems, we introduce Lazer, a novel job scheduler that predicts when to preempt jobs based on job-specific and cluster conditions. Lazer is designed to efficiently explore the scheduling space and adapt to diverse job and GPU cluster characteristics based on Bayesian optimization. Our extensive evaluation shows that Lazer significantly outperforms state-of-the-art schedulers, reducing JCT by 1.2×–233.3×, waiting time by 2×–11690×, and futile preemptions by 23×–67×.

AIAI3-II. AI/ML Application and InfrastructureQuantum
RESEARCH3214

TopGQ: Fast GNN Post-Training Quantization Leveraging Topology Information

Dain Kwon; Kanghyun Choi; Hyeyoon Lee; Sunjong Park; Seoyong Lee; Sukjin Kim; Jinho Lee

Date:Monday, July 27 Location:Mtg Room 101B Session:Winning the Token-Time Wars: Low-Bit LLMs, Faster MoE, and KV-Cache Systems +1
Abstract

Existing GNN quantization methods suffer from considerable quantization overhead, which severely limits their practical usage in real-world scenarios. To this end, we present TopGQ, an accurate post-training quantization framework tailored for GNNs, alleviating the burden of redundant quantization overhead. We propose TopPIN, a proxy for nodes' local structure, and use it to group nodes with similar topology during quantization. On top of that, we introduce Dual-axis scale absorption, which enables activation quantization along both the outer and inner dimensions by merging one into the adjacency matrix. Experimental results show that TopGQ reduces quantization time by an order of magnitude while preserving accuracy.

AIAI2-I. AI/ML Algorithms and ModelsDesign
RESEARCH3237

Light-Bound Transformers: Hardware-Anchored Robustness for Silicon-Photonic Computer Vision Systems

Xuming Chen; Deniz Najafi; Chengwei Zhou; Pietro Mercati; Arman Roohi; Mohsen Imani; Mahdi Nikdast; Shaahin Angizi; Gourav Datta

Date:Tuesday, July 28 Location:Mtg Room 101B Session:Energy-Efficient AI Systems - Cross-Stack Co-Design from Edge to Cloud +1
Abstract

Deploying Vision Transformers (ViTs) on near-sensor analog accelerators demands training pipelines that are explicitly aligned with device-level noise and energy constraints. We introduce a compact framework for silicon-photonic execution of ViTs that integrates measured hardware noise, robust attention training, and an energy-aware processing flow. We first characterize bank-level noise in microring-resonator (MR) arrays (including fabrication variation, thermal drift, and amplitude noise) and convert these measurements into closed-form, activation-dependent variance proxies for attention logits and feed-forward activations. Using these proxies, we develop Chance-Constrained Training (CCT), which enforces variance-normalized logit margins to bound attention rank flips, and a noise-aware LayerNorm that stabilizes feature statistics without changing the optical schedule. These components yield a practical "measure → model → train → run" pipeline that optimizes accuracy under noise while respecting system energy limits. Hardware-in-the-loop experiments with MR photonic banks show that our approach restores near-clean accuracy under realistic noise budgets, with no in-situ learning or additional optical MACs.

AIAI5-II. AI/ML System and Platform DesignQuantumDES6. Quantum Computing +1 more