@@ -5,7 +5,6 @@

## Task
20-DOF Wato hand, in-hand cube reorientation (Isaac-Repose-Cube-WatoHand-v0).
Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.

---

@@ -29,33 +28,27 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.

---

### 3. Stagnation termination fix (WORKS — kept)
**Problem:** `orientation_error_threshold=0.12` in stagnation but success threshold=0.4 — good episodes ended prematurely.
**Fix:** Stagnation threshold raised to 0.5 rad, `stagnant_steps` 90 → 150.

---

### 4. PPO config tuning (WORKS — kept)
### 3. PPO config tuning (WORKS — kept)
- `entropy_coef: 0.002 → 0.0001` — stopped rewarding randomness.
- `num_steps_per_env: 24 → 48` — better return estimates.
- `success_bonus weight: 250 → 50` — reduced VF loss spikes.

---

### 5. Angular velocity toward goal reward (WEAK — kept at low weight)
### 4. Angular velocity toward goal reward (WEAK — kept at low weight)
**Idea:** Reward angular velocity component aligned with goal direction.
**Problem:** First attempt gave negative reward (random spin anti-aligned on average) — suppressed all rotation. Clamped to 0, re-enabled at weight=0.2 once holding stabilized.
**Outcome:** Stays flat at ~0.016 regardless of policy quality. Goal resampling on success resets the angular velocity alignment, pinning the average near zero. Kept as a weak directional signal only.
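
A minimal sketch of the clamped term described above, assuming the reward projects the cube's world-frame angular velocity onto the rotation axis that carries the current orientation to the goal. The function names and the `(w, x, y, z)` quaternion convention are illustrative assumptions, not the repository's `object_ang_vel_toward_goal` implementation.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two (w, x, y, z) quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def ang_vel_toward_goal(q_cur, q_goal, omega_w, eps=1e-8):
    """Angular speed along the shortest-path axis to the goal, clamped at 0.

    Hypothetical helper: q_cur and q_goal are unit quaternions, omega_w is
    the world-frame angular velocity in rad/s.
    """
    # Rotation that takes the current orientation to the goal (world frame).
    w, x, y, z = quat_mul(q_goal, quat_conj(q_cur))
    if w < 0.0:  # pick the shorter of the two equivalent rotations
        x, y, z = -x, -y, -z
    norm = math.sqrt(x * x + y * y + z * z)
    if norm < eps:  # already at the goal, so no direction is defined
        return 0.0
    axis = (x / norm, y / norm, z / norm)
    aligned = sum(o * a for o, a in zip(omega_w, axis))
    # Clamp so anti-aligned spin is ignored rather than penalised.
    return max(0.0, aligned)
```

With the clamp, random spin that happens to be anti-aligned contributes zero instead of a negative value, which is the change that stopped the term from suppressing all rotation.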

---

### 6. EMA alpha reduction — 0.95 → 0.8 (WORKS)
### 5. EMA alpha reduction — 0.95 → 0.8 (WORKS)
**alpha=0.5:** Policy didn't use extra bandwidth. Reverted.
**alpha=0.8:** Bandwidth ~1 Hz (vs ~0.25 Hz at 0.95). `action_rate_l2` rose -0.13 → -0.30. **Key unlock for rotation.** Orientation error broke below 1.5 consistently.
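
The bandwidth figures above can be approximated with a first-order time-constant estimate. The sketch below assumes the filter has the form `target = alpha * previous_target + (1 - alpha) * action` and that the control rate is 30 Hz (120 Hz physics with decimation 4, as listed in the README's sim settings); the helper name `ema_cutoff_hz` is hypothetical.

```python
import math


def ema_cutoff_hz(alpha, control_rate_hz):
    """Approximate cutoff of an EMA low-pass filter.

    Uses the time-constant match f_c = -ln(alpha) * f_s / (2 * pi), which is
    a reasonable approximation while f_c is well below the control rate.
    """
    return -math.log(alpha) * control_rate_hz / (2.0 * math.pi)


# Assumption: 120 Hz physics, decimation 4, so the policy acts at 30 Hz.
CONTROL_RATE_HZ = 120.0 / 4

for alpha in (0.95, 0.85, 0.8, 0.5):
    print(f"alpha={alpha:.2f} -> ~{ema_cutoff_hz(alpha, CONTROL_RATE_HZ):.2f} Hz")
```

Under these assumptions the estimate gives roughly 0.24 Hz at `alpha=0.95` and roughly 1.07 Hz at `alpha=0.8`, in line with the values quoted in this entry.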

---

### 7. Z-axis curriculum (WORKS — converged)
### 6. Z-axis curriculum (WORKS — converged)
**Problem:** Full 3D goals too hard to explore. Policy never discovered rotation despite holding.
**Fix:** `rotation_axes = ["z"]` — goals restricted to palm-normal spin only.
**Result (5000 iters, 245M steps, 1024 envs):**
@@ -69,7 +62,7 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.

---

### 8. Instanceable USD + scaling to 1024 envs (WORKS)
### 7. Instanceable USD + scaling to 1024 envs (WORKS)
`replicate_physics = True` was already set. The existing `hand_urdf.usd` has no companion `_meshes.usd` but runs at 1024 envs without OOM — effectively sufficient.

**To make fully instanceable (not yet done):** Isaac Sim GUI → URDF Importer → "Create Instanceable Asset" → produces `hand_urdf.usd` + `instanceable_meshes.usd`. Update `_HAND_USD_PATH` in `wato_hand_cfg.py`.
@@ -81,7 +74,7 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.

---

### 9. Reward rebalancing to unblock rotation (WORKS)
### 8. Reward rebalancing to unblock rotation (WORKS)
**Problem:** Policy held well (~43%) but didn't rotate — `object_held_bonus` dominated.
**Changes:**
- `track_orientation_inv_l2`: weight 5.0 → **10.0**
@@ -92,7 +85,7 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.

---

### 10. Full 3D expansion attempt (FAILED — geometric limitation)
### 9. Full 3D expansion attempt (FAILED — geometric limitation)
**Setup:** Resumed from z-axis checkpoint (iter 5000) with `rotation_axes = ["x", "y", "z"]`.
**Observed:** orientation_error jumped 1.0 → 2.18. `action_rate_l2` dropped -0.30 → -0.08. After 500+ iters, no improvement.

@@ -105,7 +98,7 @@ Shadow/Allegro hands mount the thumb on the *opposite side* of the palm, giving

---

### 11. Smoothness fixes — action_rate ×5, joint_vel ×4, alpha 0.85 (WORKS)
### 10. Smoothness fixes — action_rate ×5, joint_vel ×4, alpha 0.85 (WORKS)
**Problem:** After goal match, fingers still jerked. Jerking resets `consecutive_success` counter, pinning `max_consecutive_success` at 0.
**Changes:**
- `action_rate_l2` weight: -0.01 → **-0.05** (5×)
@@ -116,8 +109,8 @@ Shadow/Allegro hands mount the thumb on the *opposite side* of the palm, giving

---

### 12. Scale to 2048 envs (MARGINAL IMPROVEMENT — stagnation)
**Throughput:** 1024 envs ~36k steps/s → 2048 envs ~53k steps/s (1.5× not 2× — GPU near saturation).
### 11. Scale to 2048 envs (MARGINAL IMPROVEMENT — stagnation)
**Throughput:** 1024 envs ~36k steps/s → 2048 envs ~53k steps/s
**VF loss:** Improved from 150-250 range down to 82-157 — bigger batch gives better return estimates.
**Orientation error:** Oscillates 0.94-1.19, best mean batch 0.946. No sustained improvement beyond the ~1.0 floor.

@@ -157,10 +150,7 @@ Shadow/Allegro hands mount the thumb on the *opposite side* of the palm, giving

1. **Break the stagnation floor (~1.0 rad)** — options:
- Much stronger smoothness: `action_rate_l2` weight -0.05 → -0.15 to finally kill jerking. Risk: may suppress rotation bandwidth.
- Fresh training run with all current hyperparams — policy may have converged to a poor attractor that a new random init escapes.

2. **Full 3D reorientation** requires hardware change: reposition thumb to opposite side of palm (true opposition), or mount hand palm-sideways so z-axis becomes a tilt direction. More compute will not overcome the geometric limitation.

3. **Fully instanceable USD** — create proper `instanceable_meshes.usd` via Isaac Sim GUI for cleaner multi-env physics sharing.

4. **Scale compute** — 4096+ envs (multi-GPU) is the next throughput step if hardware is available.
3. **Scale compute** — 4096+ envs (multi-GPU) is the next throughput step if hardware is available.
@@ -7,7 +7,7 @@
    entry_point="isaaclab.envs:ManagerBasedRLEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeEnvCfg",
        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeEnvCfg",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubePPORunnerCfg",
    },
)
@@ -17,7 +17,7 @@
    entry_point="isaaclab.envs:ManagerBasedRLEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeEnvCfg_PLAY",
        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeEnvCfg_PLAY",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubePPORunnerCfg",
    },
)
@@ -27,7 +27,7 @@
    entry_point="isaaclab.envs:ManagerBasedRLEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeNoVelObsEnvCfg",
        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeNoVelObsEnvCfg",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubeNoVelObsPPORunnerCfg",
    },
)
@@ -37,7 +37,7 @@
    entry_point="isaaclab.envs:ManagerBasedRLEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeNoVelObsEnvCfg_PLAY",
        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeNoVelObsEnvCfg_PLAY",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubeNoVelObsPPORunnerCfg",
    },
)
@@ -3,11 +3,7 @@

import HumanoidRLPackage.HumanoidRLSetup.tasks.inhand.inhand_env_cfg as inhand_env_cfg
import HumanoidRLPackage.HumanoidRLSetup.tasks.inhand.mdp as inhand_mdp
from HumanoidRLPackage.HumanoidRLSetup.modelCfg.wato_hand import (
    INHAND_WATO_HAND_CFG,
    INHAND_CUBE_POS,
    INHAND_SPREAD_RAD,
)
from HumanoidRLPackage.HumanoidRLSetup.modelCfg.wato_hand import INHAND_SPREAD_RAD


@configclass
@@ -17,22 +13,15 @@ class WatoHandCubeEnvCfg(inhand_env_cfg.InHandObjectEnvCfg):
    def __post_init__(self):
        super().__post_init__()

        # Share physics across envs — lower RAM per env, scales to more parallel rollouts.
        self.scene.replicate_physics = True
        self.scene.num_envs = 2048

        self.scene.robot = INHAND_WATO_HAND_CFG.replace(prim_path="{ENV_REGEX_NS}/Robot")
        # Push expanded MCP_A limits (±27 deg) into PhysX, overriding the ±8.6 deg baked in the USD.
        self.events.expand_abduction_limits = EventTerm(
            func=inhand_mdp.apply_wato_hand_joint_limits,
            mode="startup",
        )

        # Change the properties of the cube on the palm.
        self.scene.object.spawn.scale = (0.8, 0.8, 0.8)
        self.scene.object.init_state.pos = INHAND_CUBE_POS
        self.scene.object.init_state.rot = (1.0, 0.0, 0.0, 0.0)

        _grasp_scale = [0.2, 0.2]
        _splay_scale = [1.0, 1.0]
        self.events.reset_robot_joints.params["position_range"] = {
@@ -62,7 +51,7 @@ def __post_init__(self):
        # Full 3D random orientation is too hard to explore from scratch; z-axis
        # rotation (spinning in the palm plane) is the most natural motion for this hand.
        # Once orientation_error shows a downward trend, expand back to ["x", "y"].
        self.commands.object_pose.rotation_axes = ["z"]
        self.commands.cube_pose.rotation_axes = ["z"]

        # Small bonus for MCP_A velocity + spread deflection.
        self.rewards.spread_activity = RewTerm(
@@ -81,12 +70,13 @@ def __post_init__(self):
        super().__post_init__()
        self.scene.num_envs = 16
        self.observations.policy.enable_corruption = False
        # Keep time_out so play episodes reset instead of clamping forever.
        # Keep time_out, with a shorter episode, so play resets sooner and shows the next set of goals.
        self.episode_length_s = 10.0
        # Nudge goal marker aside so it does not cover the physical cube.
        self.commands.object_pose.marker_pos_offset = (-0.10, 0.0, 0.12)
        self.commands.cube_pose.marker_pos_offset = (-0.10, 0.0, 0.12)


# Variant without velocity observations: in a real deployment, velocity measurements are noisy or unavailable.
@configclass
class WatoHandCubeNoVelObsEnvCfg(WatoHandCubeEnvCfg):
    def __post_init__(self):
@@ -100,7 +90,5 @@ def __post_init__(self):
        super().__post_init__()
        self.scene.num_envs = 16
        self.observations.policy.enable_corruption = False
        # Keep time_out so play episodes reset instead of clamping forever.
        self.episode_length_s = 10.0
        # Nudge goal marker aside so it does not cover the physical cube.
        self.commands.object_pose.marker_pos_offset = (-0.10, 0.0, 0.12)
        self.commands.cube_pose.marker_pos_offset = (-0.10, 0.0, 0.12)
@@ -0,0 +1,88 @@
# In-Hand: Wato hand cube reorientation

In-hand dexterous manipulation for the 20-DOF Wato hand (`hand_urdf.usd`) in Isaac Lab. The policy reorients a DexCube held in the palm toward a commanded goal orientation. **Play** shows a ghost goal cube marker (enabled via `WatoHandCubeEnvCfg_PLAY`).

Task setup and MDP code are adapted from [Isaac Lab](https://github.com/isaac-sim/IsaacLab) in-hand manipulation examples. Experimentation history: `TRAINING_LOG.md`.

**Environments**

| Task ID | Scene | Mode |
| :--- | :--- | :--- |
| `Isaac-Repose-Cube-WatoHand-v0` | Palm + DexCube | Train |
| `Isaac-Repose-Cube-WatoHand-Play-v0` | Palm + DexCube | Play |
| `Isaac-Repose-Cube-WatoHand-NoVelObs-v0` | Palm + DexCube | Train (no velocity obs) |
| `Isaac-Repose-Cube-WatoHand-NoVelObs-Play-v0` | Palm + DexCube | Play (no velocity obs) |

## Train & play

Run from `HumanoidRL/` (the directory that contains `HumanoidRLPackage/`):

```bash
# Train (default 2048 envs; try 1024 or 512 if OOM)
PYTHONPATH=$(pwd) /home/hy/IsaacLab/isaaclab.sh -p HumanoidRLPackage/rsl_rl_scripts/train.py \
--task=Isaac-Repose-Cube-WatoHand-v0 --headless

# Play — omit --headless to see goal-orientation marker
PYTHONPATH=$(pwd) /home/hy/IsaacLab/isaaclab.sh -p HumanoidRLPackage/rsl_rl_scripts/play.py \
--task=Isaac-Repose-Cube-WatoHand-Play-v0 --num_envs=1

# Play — specific checkpoint
PYTHONPATH=$(pwd) /home/hy/IsaacLab/isaaclab.sh -p HumanoidRLPackage/rsl_rl_scripts/play.py \
--task=Isaac-Repose-Cube-WatoHand-Play-v0 --num_envs=1 \
--checkpoint logs/rsl_rl/wato_hand_cube/<run>/model_<iter>.pt
```

Checkpoints: `logs/rsl_rl/wato_hand_cube/`. PPO defaults: `max_iterations=5000`, `experiment_name=wato_hand_cube` (`config/wato_hand/agents/rsl_rl_ppo_cfg.py`).

## Scene & command

| Item | Value |
| :--- | :--- |
| Robot | `INHAND_WATO_HAND_CFG` (`modelCfg/wato_hand.py`) — palm-up, 20 DOF |
| Cube | Isaac Nucleus DexCube (instanceable), scale `(0.8, 0.8, 0.8)`, spawn `INHAND_CUBE_POS` |
| Goal command | `InHandReOrientationCommand` — resampled **on success** (not on a timer) |
| Goal position | Default cube spawn + `init_pos_offset = (0, 0, -0.04)` m (hold-in-palm target) |
| Goal orientation | Sampled on allowed `rotation_axes`; success when error **< 0.4 rad** |
| Curriculum (current) | `rotation_axes = ["z"]` — palm-normal spin only (`wato_hand_env_cfg.py`) |
| Startup event | `expand_abduction_limits` — overrides USD-baked MCP_A limits to ±27° |

`replicate_physics=True` on Wato hand (instanceable-friendly).

## Reward

All base reward terms are defined in `RewardsCfg` (`inhand_env_cfg.py`). Implementations live in `mdp/rewards.py` (task-specific) and Isaac Lab `mdp` (generic penalties). `WatoHandCubeEnvCfg` adds one extra term — `spread_activity` — in `config/wato_hand/wato_hand_env_cfg.py`.

| Category | Reward Function | Weight | Description |
| :--- | :--- | :--- | :--- |
| **Task** | Position tracking (`track_pos_l2`) | −3.0 | L2 distance to goal position (encourages holding cube in palm). |
| | Orientation tracking (`track_orientation_inv_l2`) | 10.0 | $1 / (\text{orientation\_error} + 0.1)$ — dense rotation signal. |
| | Success bonus (`success_bonus`) | 50.0 | Binary reward when orientation error **< 0.4 rad**. |
| | Object held bonus (`object_held_bonus`) | 0.5 | +1/step when cube within **0.10 m** of goal position. |
| | Angular velocity toward goal (`object_ang_vel_toward_goal`) | 0.2 | Positive component of cube spin aligned with goal (clamped ≥ 0). |
| | Spread activity (`mcp_a_spread_activity`) | 0.03 | Encourages MCP_A abduction velocity + deflection (Wato only). |
| **Penalties** | Object away (`object_away_penalty`) | −5.0 | Terminal penalty when `object_out_of_reach` fires. |
| | Action rate L2 | −0.05 | Penalises jerky finger commands. |
| | Joint velocity L2 | $-1.0 \times 10^{-4}$ | Penalises high joint speeds. |
| | Action L2 | $-1.0 \times 10^{-4}$ | Penalises large action magnitudes. |
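
To give a feel for how the two orientation terms in the table scale, here is a small sketch that evaluates them at a few error values. It uses only the formula, weights, and threshold listed above; the function names are hypothetical, and any per-step time scaling the framework applies on top of the weight is deliberately left out.

```python
def orientation_tracking_reward(error_rad, weight=10.0, eps=0.1):
    """Weighted dense term: weight * 1 / (orientation_error + eps)."""
    return weight / (error_rad + eps)


def success_bonus(error_rad, weight=50.0, threshold_rad=0.4):
    """Weighted binary term, paid only while the error is under the threshold."""
    return weight if error_rad < threshold_rad else 0.0


for err in (2.0, 1.0, 0.4, 0.1):
    dense = orientation_tracking_reward(err)
    bonus = success_bonus(err)
    print(f"error={err:.1f} rad  dense={dense:6.2f}  bonus={bonus:5.1f}")
```

The dense term grows steeply only at small errors: halving the error from 2.0 to 1.0 rad adds far less than closing the last few tenths of a radian, so most of the gradient sits near the success threshold.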

## Terminations

| Termination | Condition |
| :--- | :--- |
| Time out | Episode length exceeds **20 s** (10 s in Play). |
| Max consecutive success | Goal reached **50 times** in a row within one episode. |
| Object out of reach | Cube drifts **> 0.3 m** from robot root. |
| Orientation stagnation | Orientation error stays **> 0.5 rad** for **150** consecutive steps. |
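
The stagnation rule in the last row can be pictured as a per-environment counter. The class below is an illustrative sketch of that rule as worded in the table, with a hypothetical name (`StagnationCounter`); the repository's termination function may be vectorised over environments and may differ in detail.

```python
class StagnationCounter:
    """Counts consecutive steps with orientation error above a threshold."""

    def __init__(self, threshold_rad=0.5, max_steps=150):
        self.threshold_rad = threshold_rad
        self.max_steps = max_steps
        self.count = 0

    def reset(self):
        """Call at episode reset."""
        self.count = 0

    def step(self, error_rad):
        """Update with the current error; return True when the episode should end."""
        if error_rad > self.threshold_rad:
            self.count += 1
        else:
            # Any step at or below the threshold restarts the count.
            self.count = 0
        return self.count >= self.max_steps


counter = StagnationCounter()
fired_at = next(i for i in range(1, 1000) if counter.step(1.0))
print(f"terminated after {fired_at} steps")
```

At a 30 Hz control rate (120 Hz physics with decimation 4), 150 steps correspond to 5 s without getting under 0.5 rad.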

## Sim settings

| Setting | Value |
| :--- | :--- |
| `decimation` | 4 |
| `sim.dt` | 1/120 s (120 Hz physics; 30 Hz control with `decimation = 4`) |
| `episode_length_s` | 20.0 |
| Default `num_envs` | 2048 |
| Action smoothing | EMA joint-position targets, `alpha = 0.85` |
| PPO `num_steps_per_env` | 48 |
| PPO `gamma` | 0.998 |
| PPO `entropy_coef` | 0.0001 |
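
Several useful quantities follow directly from the settings in this table. The sketch below derives them with plain arithmetic; the variable names are illustrative, and the discount horizon uses the common `1 / (1 - gamma)` rule of thumb rather than anything defined in the repository.

```python
# Values copied from the table above.
SIM_DT = 1.0 / 120.0
DECIMATION = 4
EPISODE_LENGTH_S = 20.0
NUM_ENVS = 2048
NUM_STEPS_PER_ENV = 48
GAMMA = 0.998

# One policy step spans `decimation` physics steps.
control_dt = SIM_DT * DECIMATION
control_rate_hz = 1.0 / control_dt

# Policy steps in a full-length training episode.
steps_per_episode = round(EPISODE_LENGTH_S / control_dt)

# Transitions collected per PPO iteration across all environments.
samples_per_iteration = NUM_ENVS * NUM_STEPS_PER_ENV

# Rule-of-thumb effective horizon of the discount factor.
discount_horizon_steps = 1.0 / (1.0 - GAMMA)
discount_horizon_s = discount_horizon_steps * control_dt

print(f"control rate:          {control_rate_hz:.0f} Hz")
print(f"steps per episode:     {steps_per_episode}")
print(f"samples per iteration: {samples_per_iteration}")
print(f"discount horizon:      ~{discount_horizon_steps:.0f} steps (~{discount_horizon_s:.1f} s)")
```

The same product explains the step count in the training log: 5000 iterations at 1024 envs and 48 steps per env is about 245.8M transitions.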