# CARD

**Repository Path**: ByteDance/CARD

## Basic Information

- **Project Name**: CARD
- **Description**: The official repository for "CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation" published in ICML 2026.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-30
- **Last Updated**: 2026-06-05

## README

# CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation

![workflow](./media/workflow.png)

## 🧬 Introduction

This is the official repository for our paper "CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation" published at ICML 2026. By leveraging a novel radix-based decomposition technique, we bring autoregressive modeling to exact free energy estimation and, for the first time, achieve generalization across both systems and dimensionalities for this task. CARD has demonstrated promising performance on small molecules and short peptides. In future work, we plan to further optimize the framework at the engineering level to extend its applicability to larger molecular systems, such as protein-ligand complexes.
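
To give a feel for what a radix-based, coarse-to-fine decomposition can look like, the sketch below quantizes a scalar coordinate inside a box and splits the bin index into base-`radix` digits, coarsest digit first. It is a generic illustration only, not CARD's implementation: the function names, the uniform quantization, and the default values (borrowed from the `--radix 4 --depth 3 --box_size 30` training flags) are assumptions. Please refer to the paper for the actual formulation.

```python
def to_digits(x, box_size=30.0, radix=4, depth=3):
    """Quantize x in [0, box_size) into radix**depth bins and return the
    bin index as `depth` base-`radix` digits, coarsest digit first."""
    n_bins = radix ** depth
    idx = int(x / box_size * n_bins)
    idx = min(max(idx, 0), n_bins - 1)  # clamp values on or outside the box edge
    digits = []
    for level in range(depth - 1, -1, -1):
        digits.append((idx // radix ** level) % radix)
    return digits


def from_digits(digits, box_size=30.0, radix=4):
    """Return the centre of the bin addressed by a (possibly partial) digit prefix."""
    idx = 0
    for d in digits:
        idx = idx * radix + d
    n_bins = radix ** len(digits)
    return (idx + 0.5) * box_size / n_bins
```

With these defaults, `to_digits(15.0)` gives `[2, 0, 0]`. The first digit alone places the coordinate in one of four coarse intervals (`from_digits([2])` is `18.75`, the centre of `[15, 22.5)`), and each further digit narrows the interval by a factor of `radix`, which is the sense in which an autoregressive model over such digits would proceed from coarse to fine.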

## 🚀 Setup

### Dependencies

We provide conda dependencies for `cuda>=12.0`. Create and activate a new conda environment with:
```bash
mamba env create -f env.yaml
conda activate card
```

### Datasets

Due to compliance restrictions, we do not directly release the raw MD data used for training and evaluation. Nevertheless, under the `datas` directory, we provide the dataset partitions for the three tasks in the paper, along with the corresponding data processing scripts and MD simulation code, so that users can reproduce the datasets on their own.

**Solvation Free Energy**

The dataset splits `{train,val,test}.csv` and preprocessing scripts are provided under `datas/zinc`. You may first run the following script to generate topology files from the corresponding SMILES using GAFF:

```bash
python3 datas/zinc/prepare.py
```

The generated topology files for each `{zinc_id}` are saved to `dataset/md/{zinc_id}/` by default.

Next, you can run `run_md.py` to perform MD simulations under the force fields corresponding to vacuum, toluene, and water:

```bash
python3 datas/zinc/run_md.py
```

Finally, run `pack.py` to package the generated DCD trajectories into `.pt` files suitable for model input:

```bash
python3 datas/zinc/pack.py
```
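
The three steps above can also be chained from a small driver so that a failure in one step stops the pipeline. The following is a minimal sketch, assuming the scripts are run from the repository root and take no required arguments (as in the commands above); `pipeline_commands` and `run_pipeline` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys

# Script paths as listed in this README, in the order they should run.
ZINC_STEPS = [
    "datas/zinc/prepare.py",  # SMILES -> GAFF topology files
    "datas/zinc/run_md.py",   # MD in vacuum, toluene, and water
    "datas/zinc/pack.py",     # DCD trajectories -> .pt files
]


def pipeline_commands(steps=ZINC_STEPS, python=sys.executable):
    """Return one argv list per step, in execution order."""
    return [[python, step] for step in steps]


def run_pipeline(steps=ZINC_STEPS, dry_run=False):
    """Run each step in order; stop at the first non-zero exit code."""
    for cmd in pipeline_commands(steps):
        print("+", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` only prints the commands. The same helper should work for other datasets by passing a different list of steps, for example `["datas/hipen/run_md.py", "datas/hipen/pack.py"]`.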

**Endstate Correction**

The dataset splits and preprocessing scripts are provided under `datas/hipen`. Similarly, please run `datas/hipen/run_md.py` and `datas/hipen/pack.py` sequentially to prepare the dataset.

**Tautomer Free Energy**

Because the original data is released under the GPL 2.0 license, we cannot distribute the dataset splits directly. Please obtain the data from the [sPhysNet-Taut repository](https://github.com/xiaolinpan/sPhysNet-Taut/tree/main) and follow the preprocessing procedure described in the paper.


### Multistate Free Energy Simulation

The reference values of the relative free energies are computed using Multistate Free Energy Simulation (MFES). Following the protocol provided in [endstate_correction](https://github.com/wiederm/endstate_correction/tree/main/endstate_correction), we provide MFES scripts for the solvation free energy task and the endstate correction task: `fep/run_fep_sfe.py` and `fep/run_fep_hipen.py`, respectively. The reference values used in the paper's experiments are already provided in `datas/hipen/test.csv`, `datas/zinc/sfe_vacuum_toluene.csv`, and `datas/zinc/sfe_vacuum_water.csv`, so they do not need to be recomputed.

### Model Weights

For reproducibility, we provide the model checkpoints for the main experiments reported in the paper, which are stored in the `ckpt/` directory.

## 👀 Usage

### Training

‼️ Before running the training scripts, make sure all datasets have been prepared.

We provide the `run.sh` script for multi-GPU training. The general usage is as follows:

```bash
GPU=0,1,2,3 bash train.sh pretrain/train.py [arguments]
```

The arguments should be specified according to the training stage. Specifically:

- Training stage I

```bash
GPU=0,1,2,3 bash train.sh pretrain/train.py --data [zinc/mm/ani2x] \
                                            --solvent [vacuum/toluene/water] \
                                            --steps 10000 --batch_size 200 \
                                            --samples 1000 \
                                            --radix 4 --depth 3 \
                                            --box_size 30 \
                                            --dim 512 --nhead 8 --K 16 \
                                            --layers1 8 --layers2 8 \
                                            --lr 1e-3
```

- Training stage II

```bash
GPU=0,1,2,3 bash train.sh pretrain/train.py --data [zinc/mm/ani2x] \
                                            --solvent [vacuum/toluene/water] \
                                            --steps 10000 --batch_size 200 \
                                            --samples 1000 \
                                            --radix 4 --depth 3 \
                                            --box_size 30 \
                                            --dim 512 --nhead 8 --K 16 \
                                            --layers1 8 --layers2 8 \
                                            --lr 2e-4 \
                                            --dU 0.01
```

`--data` accepts three options: `zinc`, `mm`, and `ani2x`. Here, `zinc` denotes the ZINC-derived dataset for the solvation free energy task under the GAFF force field, whereas `mm` and `ani2x` denote the datasets for the endstate correction task under the OpenFF and ANI-2x force fields, respectively. The `--solvent` argument applies only when `--data zinc` is set, where it selects the MD simulation data for a specific solvent.
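
The two stages share every flag except `--lr` and `--dU`, so the argument lists can be generated rather than copied. The helper below is a sketch built only from the values shown in the commands above; `train_args` is a hypothetical name, and dropping `--solvent` for non-`zinc` data is an assumption based on the statement that the flag applies only to `zinc`.

```python
# Flags common to both stages, copied from the commands in this README.
COMMON = {
    "steps": 10000, "batch_size": 200, "samples": 1000,
    "radix": 4, "depth": 3, "box_size": 30,
    "dim": 512, "nhead": 8, "K": 16, "layers1": 8, "layers2": 8,
}
# Stage-specific flags: stage II lowers the learning rate and adds --dU.
STAGE_OVERRIDES = {
    1: {"lr": "1e-3"},
    2: {"lr": "2e-4", "dU": "0.01"},
}


def train_args(stage, data, solvent=None):
    """Build the argument list for pretrain/train.py for a given stage."""
    if stage not in STAGE_OVERRIDES:
        raise ValueError(f"unknown stage: {stage!r}")
    if data not in ("zinc", "mm", "ani2x"):
        raise ValueError(f"unknown dataset: {data!r}")
    args = ["--data", data]
    if data == "zinc":
        if solvent not in ("vacuum", "toluene", "water"):
            raise ValueError(f"unknown solvent: {solvent!r}")
        args += ["--solvent", solvent]  # assumption: omitted for mm/ani2x
    for key, value in {**COMMON, **STAGE_OVERRIDES[stage]}.items():
        args += [f"--{key}", str(value)]
    return args
```

For example, `" ".join(train_args(2, "zinc", "water"))` yields the stage II flags for the water dataset, ready to append to the launcher command shown above.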

You can distinguish training runs by setting the environment variable `TASK_NAME`. Logs and model checkpoints are saved under `trials/{TASK_NAME}`. If `TASK_NAME` is not set, the script automatically generates a random name for the run.
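
When launching runs from Python, the `GPU` and `TASK_NAME` variables can be passed through the child environment. The sketch below only prepares that environment and reports where the outputs are expected. `launch_env` and `expected_output_dir` are hypothetical helpers, and the run name in the usage note is a made-up example.

```python
import os


def launch_env(task_name=None, gpus="0,1,2,3", base=None):
    """Copy the environment and set GPU and, optionally, TASK_NAME."""
    env = dict(os.environ if base is None else base)
    env["GPU"] = gpus
    if task_name is not None:
        env["TASK_NAME"] = task_name
    return env


def expected_output_dir(env):
    """Return trials/{TASK_NAME}, or None when the name will be random."""
    name = env.get("TASK_NAME")
    return os.path.join("trials", name) if name else None
```

The returned dictionary can be handed to `subprocess.run(..., env=env)`. For instance, `expected_output_dir(launch_env("zinc_vacuum_stage1"))` points at the `trials/zinc_vacuum_stage1` directory, whereas without a task name the location cannot be known in advance.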

### Inference

We provide scripts for free energy estimation of the three tasks in the paper. Specifically:

- Solvation free energy

```bash
GPU=0 bash run.sh pretrain/test_sfe.py --trial xxx \
                                       --solvent vacuum \
                                       --solvent_tgt toluene \
                                       --ckpt_src path/to/zinc_vacuum/checkpoint \
                                       --ckpt_tgt path/to/zinc_toluene/checkpoint \
                                       --samples 2000
```

The arguments `--solvent` and `--solvent_tgt` specify the source and target solvents, respectively. The argument `--trial` specifies the name of the inference run, and the results will be saved under `trials/{trial}`.

- Endstate correction

```bash
GPU=0 bash run.sh pretrain/test_hipen.py --trial xxx \
                                         --ckpt_src path/to/mm_vacuum/checkpoint \
                                         --ckpt_tgt path/to/ani2x_vacuum/checkpoint \
                                         --samples 2000
```

- Tautomer free energy

```bash
GPU=0 bash run.sh pretrain/test_tautomer.py --trial xxx \
                                            --ckpt_src path/to/zinc_vacuum/checkpoint \
                                            --ckpt_tgt path/to/zinc_water/checkpoint \
                                            --ckpt_nnp path/to/ani2x_vacuum/checkpoint \
                                            --samples 2000
```
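
Since reference values are provided for both the vacuum-to-toluene and vacuum-to-water pairs, the solvation free energy command above may need to be issued once per target solvent. The following sketch builds those commands from a mapping of solvent to checkpoint path. The helper names and the `sfe_{src}_{tgt}` trial naming are assumptions for illustration, and the checkpoint paths are left as placeholders.

```python
def sfe_command(trial, src, tgt, ckpt_src, ckpt_tgt, samples=2000):
    """Argument list mirroring the solvation free energy command above."""
    return [
        "bash", "run.sh", "pretrain/test_sfe.py",
        "--trial", trial,
        "--solvent", src,
        "--solvent_tgt", tgt,
        "--ckpt_src", ckpt_src,
        "--ckpt_tgt", ckpt_tgt,
        "--samples", str(samples),
    ]


def sfe_commands(ckpts, src="vacuum", targets=("toluene", "water")):
    """One command per target solvent; ckpts maps solvent -> checkpoint path."""
    return {
        f"sfe_{src}_{tgt}": sfe_command(
            f"sfe_{src}_{tgt}", src, tgt, ckpts[src], ckpts[tgt]
        )
        for tgt in targets
    }


def result_dir(trial):
    """Results of an inference run are saved under trials/{trial}."""
    return f"trials/{trial}"
```

Each value of the returned dictionary can be passed to `subprocess.run` with `GPU` set in the environment, and each key doubles as the `--trial` name to reuse at evaluation time.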

### Evaluation

Evaluation scripts for the three tasks in the paper are provided in the `scripts` directory. For example, to evaluate the solvation free energy task, please run:

```bash
python3 scripts/plot_sfe.py --trial xxx --solvent vacuum --solvent_tgt toluene
```

where `--trial` should be the trial name for the corresponding inference run. The evaluation results and plots will be saved under `analysis/sfe/{trial}`.

## 💡 Contact

Please feel free to contact us by opening an issue in the GitHub repo or by emailing yu-zy24@mails.tsinghua.edu.cn or heyi@bytedance.com with any questions about our project. We thank you for your interest in our work and your contribution to making it better!

## Reference

```
@inproceedings{
  yu2026card,
  title={CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation},
  author={Yu, Ziyang and He, Yi and Huang, Wenbing and Yan, Wen and Liu, Yang},
  booktitle={The Forty-Third International Conference on Machine Learning},
  year={2026}
}
```

## License

The source code in this project is licensed under the Apache License 2.0, whereas the model weights are licensed under CC BY-NC 4.0.