WATonomous · wilsonchenghy · Jun 16, 2026 · Jun 16, 2026
diff --git a/..._Wato/HumanoidRL/HumanoidRLPackage/HumanoidRLSetup/tasks/inhand/TRAINING_LOG.md b/..._Wato/HumanoidRL/HumanoidRLPackage/HumanoidRLSetup/tasks/inhand/TRAINING_LOG.md
@@ -5,7 +5,6 @@
 
 ## Task
 20-DOF Wato hand, in-hand cube reorientation (Isaac-Repose-Cube-WatoHand-v0).
-Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.
 
 ---
 
@@ -29,33 +28,27 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.
 
 ---
 
-### 3. Stagnation termination fix (WORKS — kept)
-**Problem:** `orientation_error_threshold=0.12` in stagnation but success threshold=0.4 — good episodes ended prematurely.
-**Fix:** Stagnation threshold raised to 0.5 rad, `stagnant_steps` 90 → 150.
-
----
-
-### 4. PPO config tuning (WORKS — kept)
+### 3. PPO config tuning (WORKS — kept)
 - `entropy_coef: 0.002 → 0.0001` — stopped rewarding randomness.
 - `num_steps_per_env: 24 → 48` — better return estimates.
 - `success_bonus weight: 250 → 50` — reduced VF loss spikes.
 
 ---
 
-### 5. Angular velocity toward goal reward (WEAK — kept at low weight)
+### 4. Angular velocity toward goal reward (WEAK — kept at low weight)
 **Idea:** Reward angular velocity component aligned with goal direction.
 **Problem:** First attempt gave negative reward (random spin anti-aligned on average) — suppressed all rotation. Clamped to 0, re-enabled at weight=0.2 once holding stabilized.
 **Outcome:** Stays flat at ~0.016 regardless of policy quality. Goal resampling on success resets the angular velocity alignment, pinning the average near zero. Kept as a weak directional signal only.
 
 ---
 
-### 6. EMA alpha reduction — 0.95 → 0.8 (WORKS)
+### 5. EMA alpha reduction — 0.95 → 0.8 (WORKS)
 **alpha=0.5:** Policy didn't use extra bandwidth. Reverted.
 **alpha=0.8:** Bandwidth ~1 Hz (vs ~0.25 Hz at 0.95). `action_rate_l2` rose -0.13 → -0.30. **Key unlock for rotation.** Orientation error broke below 1.5 consistently.
 
 ---
 
-### 7. Z-axis curriculum (WORKS — converged)
+### 6. Z-axis curriculum (WORKS — converged)
 **Problem:** Full 3D goals too hard to explore. Policy never discovered rotation despite holding.
 **Fix:** `rotation_axes = ["z"]` — goals restricted to palm-normal spin only.
 **Result (5000 iters, 245M steps, 1024 envs):**
@@ -69,7 +62,7 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.
 
 ---
 
-### 8. Instanceable USD + scaling to 1024 envs (WORKS)
+### 7. Instanceable USD + scaling to 1024 envs (WORKS)
 `replicate_physics = True` was already set. The existing `hand_urdf.usd` has no companion `_meshes.usd` but runs at 1024 envs without OOM — effectively sufficient.
 
 **To make fully instanceable (not yet done):** Isaac Sim GUI → URDF Importer → "Create Instanceable Asset" → produces `hand_urdf.usd` + `instanceable_meshes.usd`. Update `_HAND_USD_PATH` in `wato_hand_cfg.py`.
@@ -81,7 +74,7 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.
 
 ---
 
-### 9. Reward rebalancing to unblock rotation (WORKS)
+### 8. Reward rebalancing to unblock rotation (WORKS)
 **Problem:** Policy held well (~43%) but didn't rotate — `object_held_bonus` dominated.
 **Changes:**
 - `track_orientation_inv_l2`: weight 5.0 → **10.0**
@@ -92,7 +85,7 @@ Palm-up orientation, cube spawned at `(-0.01, 0.09, 0.5)` in the palm.
 
 ---
 
-### 10. Full 3D expansion attempt (FAILED — geometric limitation)
+### 9. Full 3D expansion attempt (FAILED — geometric limitation)
 **Setup:** Resumed from z-axis checkpoint (iter 5000) with `rotation_axes = ["x", "y", "z"]`.
 **Observed:** orientation_error jumped 1.0 → 2.18. `action_rate_l2` dropped -0.30 → -0.08. After 500+ iters, no improvement.
 
@@ -105,7 +98,7 @@ Shadow/Allegro hands mount the thumb on the *opposite side* of the palm, giving
 
 ---
 
-### 11. Smoothness fixes — action_rate ×5, joint_vel ×4, alpha 0.85 (WORKS)
+### 10. Smoothness fixes — action_rate ×5, joint_vel ×4, alpha 0.85 (WORKS)
 **Problem:** After goal match, fingers still jerked. Jerking resets `consecutive_success` counter, pinning `max_consecutive_success` at 0.
 **Changes:**
 - `action_rate_l2` weight: -0.01 → **-0.05** (5×)
@@ -116,8 +109,8 @@ Shadow/Allegro hands mount the thumb on the *opposite side* of the palm, giving
 
 ---
 
-### 12. Scale to 2048 envs (MARGINAL IMPROVEMENT — stagnation)
-**Throughput:** 1024 envs ~36k steps/s → 2048 envs ~53k steps/s (1.5× not 2× — GPU near saturation).
+### 11. Scale to 2048 envs (MARGINAL IMPROVEMENT — stagnation)
+**Throughput:** 1024 envs ~36k steps/s → 2048 envs ~53k steps/s
 **VF loss:** Improved from 150-250 range down to 82-157 — bigger batch gives better return estimates.
 **Orientation error:** Oscillates 0.94-1.19, best mean batch 0.946. No sustained improvement beyond the ~1.0 floor.
 
@@ -157,10 +150,7 @@ Shadow/Allegro hands mount the thumb on the *opposite side* of the palm, giving
 
 1. **Break the stagnation floor (~1.0 rad)** — options:
    - Much stronger smoothness: `action_rate_l2` weight -0.05 → -0.15 to finally kill jerking. Risk: may suppress rotation bandwidth.
-   - Fresh training run with all current hyperparams — policy may have converged to a poor attractor that a new random init escapes.
 
 2. **Full 3D reorientation** requires hardware change: reposition thumb to opposite side of palm (true opposition), or mount hand palm-sideways so z-axis becomes a tilt direction. More compute will not overcome the geometric limitation.
 
-3. **Fully instanceable USD** — create proper `instanceable_meshes.usd` via Isaac Sim GUI for cleaner multi-env physics sharing.
-
-4. **Scale compute** — 4096+ envs (multi-GPU) is the next throughput step if hardware is available.
+3. **Scale compute** — 4096+ envs (multi-GPU) is the next throughput step if hardware is available.
diff --git a/...to/HumanoidRL/HumanoidRLPackage/HumanoidRLSetup/tasks/inhand/config/wato_hand/__init__.py b/...to/HumanoidRL/HumanoidRLPackage/HumanoidRLSetup/tasks/inhand/config/wato_hand/__init__.py
@@ -7,7 +7,7 @@
     entry_point="isaaclab.envs:ManagerBasedRLEnv",
     disable_env_checker=True,
     kwargs={
-        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeEnvCfg",
+        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeEnvCfg",
         "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubePPORunnerCfg",
     },
 )
@@ -17,7 +17,7 @@
     entry_point="isaaclab.envs:ManagerBasedRLEnv",
     disable_env_checker=True,
     kwargs={
-        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeEnvCfg_PLAY",
+        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeEnvCfg_PLAY",
         "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubePPORunnerCfg",
     },
 )
@@ -27,7 +27,7 @@
     entry_point="isaaclab.envs:ManagerBasedRLEnv",
     disable_env_checker=True,
     kwargs={
-        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeNoVelObsEnvCfg",
+        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeNoVelObsEnvCfg",
         "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubeNoVelObsPPORunnerCfg",
     },
 )
@@ -37,7 +37,7 @@
     entry_point="isaaclab.envs:ManagerBasedRLEnv",
     disable_env_checker=True,
     kwargs={
-        "env_cfg_entry_point": f"{__name__}.wato_env_cfg:WatoHandCubeNoVelObsEnvCfg_PLAY",
+        "env_cfg_entry_point": f"{__name__}.wato_hand_env_cfg:WatoHandCubeNoVelObsEnvCfg_PLAY",
         "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:WatoHandCubeNoVelObsPPORunnerCfg",
     },
 )
diff --git a/...s/inhand/config/wato_hand/wato_env_cfg.py → ...and/config/wato_hand/wato_hand_env_cfg.py b/...s/inhand/config/wato_hand/wato_env_cfg.py → ...and/config/wato_hand/wato_hand_env_cfg.py
@@ -3,11 +3,7 @@
 
 import HumanoidRLPackage.HumanoidRLSetup.tasks.inhand.inhand_env_cfg as inhand_env_cfg
 import HumanoidRLPackage.HumanoidRLSetup.tasks.inhand.mdp as inhand_mdp
-from HumanoidRLPackage.HumanoidRLSetup.modelCfg.wato_hand import (
-    INHAND_WATO_HAND_CFG,
-    INHAND_CUBE_POS,
-    INHAND_SPREAD_RAD,
-)
+from HumanoidRLPackage.HumanoidRLSetup.modelCfg.wato_hand import INHAND_SPREAD_RAD
 
 
 @configclass
@@ -17,22 +13,15 @@ class WatoHandCubeEnvCfg(inhand_env_cfg.InHandObjectEnvCfg):
     def __post_init__(self):
         super().__post_init__()
 
-        # Share physics across envs — lower RAM per env, scales to more parallel rollouts.
         self.scene.replicate_physics = True
         self.scene.num_envs = 2048
 
-        self.scene.robot = INHAND_WATO_HAND_CFG.replace(prim_path="{ENV_REGEX_NS}/Robot")
         # Push expanded MCP_A limits (±27 deg) into PhysX, overriding the ±8.6 deg baked in the USD.
         self.events.expand_abduction_limits = EventTerm(
             func=inhand_mdp.apply_wato_hand_joint_limits,
             mode="startup",
         )
 
-        # change the cube on plam's property
-        self.scene.object.spawn.scale = (0.8, 0.8, 0.8)
-        self.scene.object.init_state.pos = INHAND_CUBE_POS
-        self.scene.object.init_state.rot = (1.0, 0.0, 0.0, 0.0)
-
         _grasp_scale = [0.2, 0.2]
         _splay_scale = [1.0, 1.0]
         self.events.reset_robot_joints.params["position_range"] = {
@@ -62,7 +51,7 @@ def __post_init__(self):
         # Full 3D random orientation is too hard to explore from scratch; z-axis
         # rotation (spinning in the palm plane) is the most natural motion for this hand.
         # Once orientation_error shows a downward trend, expand back to ["x", "y"].
-        self.commands.object_pose.rotation_axes = ["z"]
+        self.commands.cube_pose.rotation_axes = ["z"]
 
         # Small bonus for MCP_A velocity + spread deflection.
         self.rewards.spread_activity = RewTerm(
@@ -81,12 +70,13 @@ def __post_init__(self):
         super().__post_init__()
         self.scene.num_envs = 16
         self.observations.policy.enable_corruption = False
-        # Keep time_out so play episodes reset instead of clamping forever.
+        # Keep time_out, with a shorter episode, so play episodes reset sooner and show the next set.
         self.episode_length_s = 10.0
         # Nudge goal marker aside so it does not cover the physical cube.
-        self.commands.object_pose.marker_pos_offset = (-0.10, 0.0, 0.12)
+        self.commands.cube_pose.marker_pos_offset = (-0.10, 0.0, 0.12)
 
 
+# Variant without velocity observations: in a real deployment, velocity measurements are noisy or unavailable.
 @configclass
 class WatoHandCubeNoVelObsEnvCfg(WatoHandCubeEnvCfg):
     def __post_init__(self):
@@ -100,7 +90,5 @@ def __post_init__(self):
         super().__post_init__()
         self.scene.num_envs = 16
         self.observations.policy.enable_corruption = False
-        # Keep time_out so play episodes reset instead of clamping forever.
         self.episode_length_s = 10.0
-        # Nudge goal marker aside so it does not cover the physical cube.
-        self.commands.object_pose.marker_pos_offset = (-0.10, 0.0, 0.12)
+        self.commands.cube_pose.marker_pos_offset = (-0.10, 0.0, 0.12)
diff --git a/...manoid_Wato/HumanoidRL/HumanoidRLPackage/HumanoidRLSetup/tasks/inhand/inhand.md b/...manoid_Wato/HumanoidRL/HumanoidRLPackage/HumanoidRLSetup/tasks/inhand/inhand.md
@@ -0,0 +1,88 @@
+# In-Hand: Wato hand cube reorientation
+
+In-hand dexterous manipulation for the 20-DOF Wato hand (`hand_urdf.usd`) in Isaac Lab. The policy reorients a DexCube held in the palm toward a commanded goal orientation. **Play** shows a ghost goal cube marker (enabled via `WatoHandCubeEnvCfg_PLAY`).
+
+Task setup and MDP code are adapted from [Isaac Lab](https://github.com/isaac-sim/IsaacLab) in-hand manipulation examples. Experimentation history: `TRAINING_LOG.md`.
+
+**Environments**
+
+| Task ID | Scene | Mode |
+| :--- | :--- | :--- |
+| `Isaac-Repose-Cube-WatoHand-v0` | Palm + DexCube | Train |
+| `Isaac-Repose-Cube-WatoHand-Play-v0` | Palm + DexCube | Play |
+| `Isaac-Repose-Cube-WatoHand-NoVelObs-v0` | Palm + DexCube | Train (no velocity obs) |
+| `Isaac-Repose-Cube-WatoHand-NoVelObs-Play-v0` | Palm + DexCube | Play (no velocity obs) |
+
+## Train & play
+
+Run from `HumanoidRL/` (the directory that contains `HumanoidRLPackage/`), replacing `/home/hy/IsaacLab` with the path to your own Isaac Lab checkout:
+
+```bash
+# Train (default 2048 envs; if OOM, try --num_envs=1024 or --num_envs=512)
+PYTHONPATH=$(pwd) /home/hy/IsaacLab/isaaclab.sh -p HumanoidRLPackage/rsl_rl_scripts/train.py \
+  --task=Isaac-Repose-Cube-WatoHand-v0 --headless
+
+# Play — omit --headless to see goal-orientation marker
+PYTHONPATH=$(pwd) /home/hy/IsaacLab/isaaclab.sh -p HumanoidRLPackage/rsl_rl_scripts/play.py \
+  --task=Isaac-Repose-Cube-WatoHand-Play-v0 --num_envs=1
+
+# Play — specific checkpoint
+PYTHONPATH=$(pwd) /home/hy/IsaacLab/isaaclab.sh -p HumanoidRLPackage/rsl_rl_scripts/play.py \
+  --task=Isaac-Repose-Cube-WatoHand-Play-v0 --num_envs=1 \
+  --checkpoint logs/rsl_rl/wato_hand_cube/<run>/model_<iter>.pt
+```
+
+Checkpoints: `logs/rsl_rl/wato_hand_cube/`. PPO defaults: `max_iterations=5000`, `experiment_name=wato_hand_cube` (`config/wato_hand/agents/rsl_rl_ppo_cfg.py`).
+
+## Scene & command
+
+| Item | Value |
+| :--- | :--- |
+| Robot | `INHAND_WATO_HAND_CFG` (`modelCfg/wato_hand.py`) — palm-up, 20 DOF |
+| Cube | Isaac Nucleus DexCube (instanceable), scale `(0.8, 0.8, 0.8)`, spawn `INHAND_CUBE_POS` |
+| Goal command | `InHandReOrientationCommand` — resampled **on success** (not on a timer) |
+| Goal position | Default cube spawn + `init_pos_offset = (0, 0, -0.04)` m (hold-in-palm target) |
+| Goal orientation | Sampled on allowed `rotation_axes`; success when error **< 0.4 rad** |
+| Curriculum (current) | `rotation_axes = ["z"]` — palm-normal spin only (`wato_hand_env_cfg.py`) |
+| Startup event | `expand_abduction_limits` — overrides USD-baked MCP_A limits to ±27° |
+
+`replicate_physics=True` is set in `WatoHandCubeEnvCfg` (instanceable-friendly).
+
+## Reward
+
+All base reward terms are defined in `RewardsCfg` (`inhand_env_cfg.py`). Implementations live in `mdp/rewards.py` (task-specific) and Isaac Lab `mdp` (generic penalties). `WatoHandCubeEnvCfg` adds one extra term — `spread_activity` — in `config/wato_hand/wato_hand_env_cfg.py`.
+
+| Category | Reward Function | Weight | Description |
+| :--- | :--- | :--- | :--- |
+| **Task** | Position tracking (`track_pos_l2`) | −3.0 | L2 distance to goal position (encourages holding cube in palm). |
+| | Orientation tracking (`track_orientation_inv_l2`) | 10.0 | $1 / (\text{orientation\_error} + 0.1)$ — dense rotation signal. |
+| | Success bonus (`success_bonus`) | 50.0 | Binary reward when orientation error **< 0.4 rad**. |
+| | Object held bonus (`object_held_bonus`) | 0.5 | +1/step when cube within **0.10 m** of goal position. |
+| | Angular velocity toward goal (`object_ang_vel_toward_goal`) | 0.2 | Positive component of cube spin aligned with goal (clamped ≥ 0). |
+| | Spread activity (`mcp_a_spread_activity`) | 0.03 | Encourages MCP_A abduction velocity + deflection (Wato only). |
+| **Penalties** | Object away (`object_away_penalty`) | −5.0 | Terminal penalty when `object_out_of_reach` fires. |
+| | Action rate L2 | −0.05 | Penalises jerky finger commands. |
+| | Joint velocity L2 | $-1.0 \times 10^{-4}$ | Penalises high joint speeds. |
+| | Action L2 | $-1.0 \times 10^{-4}$ | Penalises large action magnitudes. |
+
+## Terminations
+
+| Termination | Condition |
+| :--- | :--- |
+| Time out | Episode length exceeds **20 s** (10 s in Play). |
+| Max consecutive success | Goal reached **50 times** in a row within one episode. |
+| Object out of reach | Cube drifts **> 0.3 m** from robot root. |
+| Orientation stagnation | Orientation error stays **> 0.5 rad** for **150** consecutive steps. |
+
+## Sim settings
+
+| Setting | Value |
+| :--- | :--- |
+| `decimation` | 4 |
+| `sim.dt` | 1/120 s (120 Hz physics; 30 Hz control with `decimation = 4`) |
+| `episode_length_s` | 20.0 |
+| Default `num_envs` | 2048 |
+| Action smoothing | EMA joint-position targets, `alpha = 0.85` |
+| PPO `num_steps_per_env` | 48 |
+| PPO `gamma` | 0.998 |
+| PPO `entropy_coef` | 0.0001 |
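
The EMA bandwidth figures quoted in `TRAINING_LOG.md` (about 0.25 Hz at `alpha = 0.95`, about 1 Hz at `alpha = 0.8`) can be sanity-checked against the sim settings above. The sketch below is illustrative only: it assumes the smoothing has the form `target = alpha * previous_target + (1 - alpha) * action` and runs once per control step (`decimation * sim.dt = 1/30 s`). The filter implementation is not shown in this diff, and `ema_cutoff_hz` is a hypothetical helper, not part of the repository.

```python
import math

# Control step implied by the sim settings: decimation (4) * sim.dt (1/120 s).
CONTROL_DT = 4 * (1.0 / 120.0)


def ema_cutoff_hz(alpha: float, control_dt: float) -> float:
    """Approximate -3 dB cutoff of a first-order EMA low-pass filter.

    Assumes y[t] = alpha * y[t-1] + (1 - alpha) * x[t], so alpha is the
    weight on the previous output (an assumption about the convention).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # Equivalent continuous-time constant of the discrete filter.
    tau = -control_dt / math.log(alpha)
    return 1.0 / (2.0 * math.pi * tau)


if __name__ == "__main__":
    for alpha in (0.95, 0.85, 0.8):
        print(f"alpha={alpha}: ~{ema_cutoff_hz(alpha, CONTROL_DT):.2f} Hz")
```

Under these assumptions the cutoff comes out at roughly 0.24 Hz for `alpha = 0.95`, 0.78 Hz for `alpha = 0.85`, and 1.07 Hz for `alpha = 0.8`, which agrees with the values reported in the log.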