In this work, we investigate a skill-based framework for humanoid box rearrangement that enables long-horizon execution by sequencing reusable skills at the task level.
In our architecture, all skills execute through a shared, task-agnostic whole-body controller (WBC), providing a consistent closed-loop interface for skill composition,
in contrast to non-shared designs that use separate low-level controllers per skill. Additionally, we introduce Humanoid Hanoi,
a Tower-of-Hanoi box rearrangement benchmark for evaluating long-horizon performance, and report results in simulation and on the Digit V3 humanoid robot, demonstrating fully autonomous
rearrangement over extended horizons and quantifying the benefits of the shared-WBC approach over non-shared baselines.
Under Review
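
To make the shared-controller design above concrete, the following minimal Python sketch shows reusable skills sequenced at the task level through a single, task-agnostic WBC interface. All names here (Skill, SharedWBC, WalkTo, PickBox, TaskSpaceTarget) are illustrative placeholders assumed for exposition, not the interfaces used in the actual system.

from dataclasses import dataclass
from typing import Iterator, Optional, Tuple

@dataclass
class TaskSpaceTarget:
    """References handed to the shared WBC for one motion segment."""
    base_pose: Optional[Tuple[float, float, float]]  # (x, y, yaw) target for the floating base
    hand_pose: Optional[Tuple[float, float, float]]  # optional end-effector position target
    duration: float                                  # seconds allotted to reach the target

class Skill:
    """A reusable skill is just a generator of task-space targets."""
    def targets(self) -> Iterator[TaskSpaceTarget]:
        raise NotImplementedError

class WalkTo(Skill):
    def __init__(self, xy_yaw):
        self.xy_yaw = xy_yaw
    def targets(self):
        yield TaskSpaceTarget(base_pose=self.xy_yaw, hand_pose=None, duration=5.0)

class PickBox(Skill):
    def __init__(self, grasp_pos):
        self.grasp_pos = grasp_pos
    def targets(self):
        yield TaskSpaceTarget(base_pose=None, hand_pose=self.grasp_pos, duration=3.0)

class SharedWBC:
    """Task-agnostic low-level controller: every skill goes through track()."""
    def track(self, target: TaskSpaceTarget) -> bool:
        # A real WBC would solve a whole-body control problem in closed loop
        # here; this stub only reports the commanded target.
        print(f"WBC tracking {target}")
        return True

def execute_plan(skills, wbc):
    """Sequence skills at the task level; all of them share one WBC."""
    for skill in skills:
        for target in skill.targets():
            if not wbc.track(target):
                return False          # a tracking failure would trigger replanning
    return True

if __name__ == "__main__":
    plan = [WalkTo((1.0, 0.0, 0.0)), PickBox((1.2, 0.0, 0.8)), WalkTo((0.0, 1.0, 1.57))]
    execute_plan(plan, SharedWBC())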
We introduce SAGE, a unified multi-view transformer that bridges the gap between geometric scene reconstruction and semantic object understanding.
SAGE utilizes a cross-view attention mechanism to enforce 3D geometric consistency across multiple camera angles.
By integrating CLIP embeddings, SAGE enables open-vocabulary object
grounding from text queries without task-specific retraining. The architecture incorporates a differentiable Direct Linear
Transform (DLT) for camera estimation and a differentiable Non-Maximum Suppression (NMS) layer to resolve geometric symmetries.
On an NVIDIA GeForce RTX 3090 Ti, SAGE achieves up to 18 Hz for single-view and over 10 Hz for two-view 256×256 inputs,
providing an efficient and modular perception framework for robotic manipulation.
Under Review
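
As a rough illustration of the cross-view attention idea in SAGE, the PyTorch sketch below fuses patch tokens from multiple camera views in a single attention block so that features can be shared across viewpoints. The shapes, module names, and block structure are assumptions for exposition, not SAGE's actual architecture.

import torch
import torch.nn as nn

class CrossViewBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens):
        # tokens: (batch, views, tokens_per_view, dim). Flattening the view
        # axis lets every patch token attend to tokens from all other views.
        b, v, n, d = tokens.shape
        x = tokens.reshape(b, v * n, d)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x.reshape(b, v, n, d)

if __name__ == "__main__":
    views = torch.randn(1, 2, 196, 256)   # two views, 196 patch tokens each (illustrative)
    fused = CrossViewBlock()(views)
    print(fused.shape)                    # torch.Size([1, 2, 196, 256])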
We present ASM-6D, a framework for real-time category-level 6D pose, shape, and scale estimation from partial RGB-D observations.
We formulate a unified optimization problem over the SIM(3) manifold and introduce a high-throughput Matrix ADMM solver.
Our approach concurrently estimates an object’s pose and shape at frequencies exceeding 100 Hz, with sub-linear
scaling with respect to batch size and active shape model complexity. For learning-based keypoint detector frontends,
we provide a pipeline for generating geometrically consistent keypoints, enabling the training of robust detection models
without manual labeling. We demonstrate the system's reliability through global optimality certification via SDP duality
and show resilience to both stochastic noise and occlusions. ASM-6D provides a mathematically rigorous and computationally
efficient foundation for real-time, closed-loop robotic manipulation.
Under Review
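
For intuition about ADMM over the SIM(3) manifold, the NumPy sketch below solves a SIM(3)-constrained point registration problem with a generic consensus-style splitting: an unconstrained quadratic update, a projection onto scaled rotations, and a dual ascent step. It is a textbook-style illustration, not the Matrix ADMM solver or active-shape formulation used in ASM-6D.

import numpy as np

def project_sim3(M):
    """Project a 3x4 matrix [B | t] onto {[sR | t] : R in SO(3), s > 0}."""
    B, t = M[:, :3], M[:, 3]
    U, S, Vt = np.linalg.svd(B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    s = max(np.trace(D @ np.diag(S)) / 3.0, 1e-9)    # Frobenius-optimal scale
    return np.hstack([s * R, t[:, None]])

def admm_sim3_registration(P, Q, rho=1.0, iters=100):
    """P, Q: (N, 3) source/target points. Returns a 3x4 transform [sR | t]."""
    Pt = np.hstack([P, np.ones((len(P), 1))]).T      # homogeneous source, 4 x N
    Qt = Q.T                                         # target, 3 x N
    A = 2.0 * Qt @ Pt.T                              # data term, 3 x 4
    H = 2.0 * Pt @ Pt.T + rho * np.eye(4)            # regularized normal matrix
    Z = np.hstack([np.eye(3), np.zeros((3, 1))])     # consensus variable on SIM(3)
    U = np.zeros((3, 4))                             # scaled dual variable
    for _ in range(iters):
        X = (A + rho * (Z - U)) @ np.linalg.inv(H)   # unconstrained quadratic update
        Z = project_sim3(X + U)                      # project onto scaled rotations
        U = U + X - Z                                # dual ascent on the consensus gap
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.normal(size=(200, 3))
    R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    R_true *= np.sign(np.linalg.det(R_true))         # ensure a proper rotation
    Q = 1.5 * P @ R_true.T + np.array([0.2, -0.1, 0.3])
    T = admm_sim3_registration(P, Q)
    print("recovered scale:", round(float(np.cbrt(np.linalg.det(T[:, :3]))), 3))  # ~1.5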
This research introduces a real-time framework for simultaneous 6D object pose tracking and shape estimation in dynamic, real-world environments.
The system combines Active Shape Models with ADMM-based point cloud registration to handle occlusions, deformations, and shape variability without requiring CAD models.
A novel Stein Variational Gradient Descent (SVGD)-based multi-hypothesis tracker over SIM(3) ensures robustness under symmetry and noise. Running at 100–200 Hz on a GPU,
this framework enables reliable object-centric perception and closed-loop manipulation in fast-changing, human-centric tasks.
Accepted to
Equivariant Systems: Theory and Applications in State Estimation, Artificial Intelligence and Control Workshop at RSS 2025
Accepted to
TC Virtual Poster Session and Networking Event 2025
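
The NumPy sketch below illustrates the SVGD update that underlies the multi-hypothesis tracker above: particles follow a kernel-weighted score with a repulsive term, so several pose hypotheses survive under ambiguity. The flat parameter space and the stand-in bimodal target are simplifications assumed for illustration; the actual tracker operates on SIM(3) with a registration-based likelihood.

import numpy as np

def rbf_kernel(X, h=None):
    """RBF kernel matrix K and its gradient with respect to the first argument."""
    diffs = X[:, None, :] - X[None, :, :]             # (n, n, d), diffs[j, i] = x_j - x_i
    sq = np.sum(diffs ** 2, axis=-1)                  # squared pairwise distances
    if h is None:                                     # median bandwidth heuristic
        h = np.median(sq) / np.log(len(X) + 1.0) + 1e-8
    K = np.exp(-sq / h)
    grad_K = (-2.0 / h) * diffs * K[:, :, None]       # d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_step(X, score_fn, step=0.1):
    """One SVGD update: kernel-weighted scores plus a repulsive term."""
    K, grad_K = rbf_kernel(X)
    scores = score_fn(X)                              # (n, d) gradients of log-density
    phi = (K @ scores + grad_K.sum(axis=0)) / len(X)
    return X + step * phi

if __name__ == "__main__":
    # Stand-in bimodal target: two unit Gaussians, mimicking a two-fold pose
    # ambiguity caused by object symmetry.
    mus = np.array([[-2.0, 0.0], [2.0, 0.0]])
    def score(X):
        d = X[:, None, :] - mus[None, :, :]           # (n, modes, d)
        w = np.exp(-0.5 * np.sum(d ** 2, axis=-1))
        w = w / w.sum(axis=1, keepdims=True)
        return -(w[:, :, None] * d).sum(axis=1)       # grad log of the mixture
    particles = np.random.default_rng(0).normal(size=(64, 2))
    for _ in range(300):
        particles = svgd_step(particles, score)
    left = int((particles[:, 0] < 0).sum())
    print(f"hypotheses per mode: {left} left, {len(particles) - left} right")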
This research presents a vision-based hierarchical control framework to enhance locomotion in unstructured terrains.
The framework integrates a reinforcement learning (RL) footstep planner, which generates adaptive footsteps using a robot-centric
elevation map and the state of the Angular Momentum Linear Inverted Pendulum (ALIP) model as policy inputs,
with a low-level Operational Space Controller (OSC) that tracks the planned trajectories.
We validate our approach on the underactuated bipedal robot Cassie and evaluate it across diverse terrain conditions
in both simulation and hardware experiments.
Accepted to
2025 IEEE-RAS 24th International Conference on Humanoid Robots [Oral Presentation]
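
To illustrate the hierarchical interface in this locomotion framework, the PyTorch sketch below shows a footstep policy that consumes a robot-centric elevation map together with an ALIP state vector and emits a footstep command handed to a low-level OSC stub. Network sizes, observation layout, and the OSC placeholder are assumptions for exposition, not the trained policy or controller from the paper.

import torch
import torch.nn as nn

class FootstepPolicy(nn.Module):
    def __init__(self, map_size=32, alip_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(                  # elevation-map encoder
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        enc_dim = 32 * (map_size // 4) ** 2
        self.head = nn.Sequential(                     # fuse map features with ALIP state
            nn.Linear(enc_dim + alip_dim, 128), nn.ReLU(),
            nn.Linear(128, 3),                         # (dx, dy, dyaw) footstep command
        )

    def forward(self, elevation_map, alip_state):
        feat = self.encoder(elevation_map)
        return self.head(torch.cat([feat, alip_state], dim=-1))

def osc_track(footstep, swing_time=0.4):
    """Placeholder for the low-level OSC that generates and tracks the
    swing-foot and center-of-mass trajectories for the commanded step."""
    print(f"OSC tracking footstep {footstep.tolist()} over {swing_time}s")

if __name__ == "__main__":
    policy = FootstepPolicy()
    elev = torch.randn(1, 1, 32, 32)                   # robot-centric height map
    alip = torch.randn(1, 4)                           # e.g., CoM position and angular momentum
    step_cmd = policy(elev, alip)[0]
    osc_track(step_cmd.detach())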