WorldString is a neural interactive digital twin for skinned, articulable, and soft objects. It takes a keypoint-based state as input and produces 3D point clouds as output, and it can model the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams.
Links: Paper (arXiv) · Project Page · Data Generation · Code (this repo)
| Status | Milestone |
|---|---|
| ✅ | Project page |
| ✅ | Data generation code (sim and real world) |
| ✅ | Checkpoints and visualization demo (XHand, Unitree Go2, Unitree H1) |
| ⬜ | Open-source training data |
| ⬜ | Training code and model |
Local checkpoint visualization (python demo_worldstring.py): drag joint sliders and compare simulator ground truth (left) vs. neural prediction (right) for XHand, Go2, and H1 in one page.
Auto-looping preview of the unified Gradio demo. Full recording (MP4)
The instructions below set up a Conda environment for checkpoint-based interactive visualization. No training data download is required for the demo.
| Robot | Simulator | Checkpoint | Demo tab |
|---|---|---|---|
| XHand Right | PyBullet | ckpts/xhand_right/xhand_best.pth | XHand Right |
| Unitree Go2 | MuJoCo | ckpts/go2/go2_best.pth | Unitree Go2 |
| Unitree H1 | MuJoCo | ckpts/h1/h1_best.pth | Unitree H1 |
- OS: Linux recommended (Ubuntu 20.04+). macOS may work for the CPU-only demo; Windows is not tested.
- GPU (optional): NVIDIA GPU with CUDA for faster inference. CPU-only works, but the first render is slow.
- Conda: Miniconda or Anaconda.
- Git: to clone this repository.
git clone git@github.com:MaureenZOU/worldstring.git
cd worldstring

We recommend Python 3.11:

conda create -n worldstring python=3.11 -y
conda activate worldstring

With NVIDIA GPU (CUDA 12.x, recommended):

pip install torch --index-url https://download.pytorch.org/whl/cu130

Verify:

python -c "import torch; print(torch.__version__, 'cuda:', torch.cuda.is_available())"

pip install \
  gradio \
  numpy \
  scipy \
  pyyaml \
  plotly \
  open3d \
  pybullet \
  mujoco \
  trimesh

Versions tested locally:
| Package | Version |
|---|---|
| Python | 3.11 |
| torch | 2.12 |
| gradio | 6.16 |
| mujoco | 3.9 |
| pybullet | 3.2 |
| open3d | 0.19 |
| plotly | 6.8 |
| trimesh | 4.12 |
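
To compare a local environment against the versions in the table above, a small check along these lines may help. It is a sketch that uses only the standard library; the `TESTED` mapping copies the table, and the helper name `report_versions` is hypothetical rather than part of this repository.

```python
from importlib import metadata

# Versions copied from the "tested locally" table.
# Treat them as a reference point, not a hard requirement.
TESTED = {
    "torch": "2.12",
    "gradio": "6.16",
    "mujoco": "3.9",
    "pybullet": "3.2",
    "open3d": "0.19",
    "plotly": "6.8",
    "trimesh": "4.12",
}


def report_versions(tested=TESTED):
    """Return {package: (installed_version_or_None, tested_version)}."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted) in report_versions().items():
        status = "missing" if installed is None else installed
        print(f"{name:10s} installed: {status:12s} tested: {wanted}")
```

A package reported as missing points back to the `pip install` step; a version that differs from the table is not necessarily a problem, only untested.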
Note: flash_attn is optional. If it is not installed, the model automatically falls back to PyTorch scaled dot-product attention.
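
To see which attention path a given environment would take, an availability check like the following may be enough. This is a sketch, not the repository's code: `pick_attention_backend` is a hypothetical helper, and it only reports whether the optional module can be found, without importing it or PyTorch.

```python
import importlib.util


def pick_attention_backend(module_name="flash_attn"):
    """Return "flash_attn" if the optional module is importable, else "sdpa".

    "sdpa" stands for PyTorch scaled dot-product attention
    (torch.nn.functional.scaled_dot_product_attention).
    """
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attn"
    return "sdpa"


if __name__ == "__main__":
    print("attention backend:", pick_attention_backend())
```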
Ensure the following layout exists under ckpts/:
ckpts/
├── xhand_right/
│   ├── xhand_best.pth
│   ├── config.yaml
│   └── demo/current_pose/    # created at runtime; pose_current.da written here
├── go2/
│   ├── go2_best.pth
│   ├── config.yaml
│   └── demo/current_pose/
│       ├── go2_init_state.da    # reference init frame for keypoint binding
│       ├── init_world_min_max.json    # normalization bounds for keypoints / inference
│       └── pose_joint_state_init.json
└── h1/
    ├── h1_best.pth
    ├── config.yaml
    └── demo/current_pose/
        ├── h1_init_state.da
        ├── init_world_min_max.json
        └── pose_joint_state_init.json
Robot meshes / URDFs are already under assets/:
assets/
├── xhand_right/urdf/xhand_right.urdf
├── go2/go2.xml
└── unitree_h1/h1.xml
If checkpoint files are distributed separately (e.g. Google Drive / Hugging Face), download them into the paths above before launching the demo.
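
Before launching the demo, it can be useful to confirm that the checkpoint layout above is complete. The sketch below lists any expected file that is absent. The `EXPECTED` mapping is transcribed from the layout in this README, and `missing_files` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path

# Files from the ckpts/ layout above, relative to each robot directory.
# demo/current_pose/ for xhand_right is created at runtime, so it is not checked.
EXPECTED = {
    "xhand_right": ["xhand_best.pth", "config.yaml"],
    "go2": [
        "go2_best.pth",
        "config.yaml",
        "demo/current_pose/go2_init_state.da",
        "demo/current_pose/init_world_min_max.json",
        "demo/current_pose/pose_joint_state_init.json",
    ],
    "h1": [
        "h1_best.pth",
        "config.yaml",
        "demo/current_pose/h1_init_state.da",
        "demo/current_pose/init_world_min_max.json",
        "demo/current_pose/pose_joint_state_init.json",
    ],
}


def missing_files(ckpt_root="ckpts", expected=EXPECTED):
    """Return the expected files (as 'robot/relative/path') that do not exist."""
    root = Path(ckpt_root)
    return [
        (Path(robot) / rel).as_posix()
        for robot, files in expected.items()
        for rel in files
        if not (root / robot / rel).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing checkpoint files:")
        for path in missing:
            print("  ckpts/" + path)
    else:
        print("Checkpoint layout looks complete.")
```

Run it from the repository root so that the default `ckpts` path resolves correctly.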
From the repository root:
conda activate worldstring
python demo_worldstring.py

Open in your browser:

http://127.0.0.1:6040
Use the tabs XHand Right, Unitree Go2, and Unitree H1 to switch robots. Each tab has joint sliders on the left and two point-cloud views on the right:
- Left plot: ground truth from the simulator mesh (PyBullet or MuJoCo)
- Right plot: neural prediction from the loaded checkpoint
Click Submit after moving sliders, or Reset to Initial Pose to restore the default configuration.
First load: the page runs inference for all three robots once at startup and may take 1–3 minutes depending on GPU/CPU. Subsequent updates per tab are faster.
| Script | Robot | URL |
|---|---|---|
| python demo_xhand.py | XHand | http://127.0.0.1:6037 |
| python demo_go2_mujoco.py | Go2 | http://127.0.0.1:6038 |
| python demo_h1_mujoco.py | H1 | http://127.0.0.1:6039 |
| Robot | GT & neural panels |
|---|---|
| XHand | Y-up normalized training frame |
| Go2 / H1 | MuJoCo Z-up world coordinates (inference output is denormalized with init_world_min_max.json) |
Model inputs are always normalized keypoints written to pose_current.da at runtime; only the displayed point clouds are transformed for consistent viewing.
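
The denormalization mentioned in the table can be pictured as a per-axis min-max rescale. Everything about the file format here is an assumption: the keys `min` and `max`, the `[0, 1]` normalized range, and the helper names are hypothetical, so check `init_world_min_max.json` and the repository code before relying on this sketch.

```python
import json


def load_bounds(path):
    """Read per-axis bounds.

    Assumes a hypothetical schema: {"min": [x, y, z], "max": [x, y, z]}.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["min"], data["max"]


def denormalize(points, lo, hi):
    """Map points from an assumed [0, 1] range back to world coordinates."""
    return [
        [p * (h - l) + l for p, l, h in zip(point, lo, hi)]
        for point in points
    ]


def normalize(points, lo, hi):
    """Inverse of denormalize; requires hi > lo on every axis."""
    return [
        [(p - l) / (h - l) for p, l, h in zip(point, lo, hi)]
        for point in points
    ]


if __name__ == "__main__":
    lo, hi = [-1.0, -2.0, 0.0], [1.0, 2.0, 4.0]  # illustrative bounds only
    world = denormalize([[0.5, 0.5, 0.5]], lo, hi)
    print(world)
```

Plain lists keep the sketch dependency-free; with NumPy arrays the same rescale is a single broadcast expression.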
worldstring/
├── assets/ # Robot URDF / MJCF and meshes
├── ckpts/ # Pretrained weights, configs, demo init data
├── config/ # YAML config loader
├── dataset/ # Inference-time keypoint / voxel dataset
├── modeling/ # AWR model (PolytopeModel)
├── robot_backends/ # PyBullet / MuJoCo robots, keypoint trackers, demo sessions
├── demo_worldstring.py # Unified Gradio demo (recommended)
├── demo_xhand.py
├── demo_go2_mujoco.py
└── demo_h1_mujoco.py
Training data generation (sim + real world) lives in the separate repo WorldString_data_gen.
If you use WorldString in your research, please cite:
@misc{xu2026worldstringactionableworldrepresentation,
title={WorldString: Actionable World Representation},
author={Kunqi Xu and Jitao Li and Jianglong Ye and Tianshu Tang and Isabella Liu and Sifei Liu and Xueyan Zou},
year={2026},
eprint={2605.18743},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.18743},
}