# dlexperiment-skill

**Repository Path**: BlueRocket/dlexperiment-skill

## Basic Information

- **Project Name**: dlexperiment-skill
- **Description**: dlexperiment-skill
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-29
- **Last Updated**: 2026-06-01

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

`dlexperiment-skill` is a reusable Claude skill for managing deep learning experiments as a disciplined **Experiment OS**.

It helps Claude do more than run commands:

```text
define experiment
→ create a standardized directory
→ record goal and hypothesis
→ snapshot code/environment
→ record every change with rationale
→ run or prepare the experiment
→ save logs/metrics/figures
→ analyze results
→ update an experiment index
→ recommend the next iteration
```

The skill is designed for deep learning research workflows such as anomaly detection, transfer learning, ablation studies, architecture/config changes, threshold tuning, and evaluation campaigns.

## What this skill provides

- Standard experiment directory protocol
- `daily` and `full` operating modes
- Structured experiment logs
- Guidance for keeping the experiment index up to date automatically
- Result analysis templates
- Next-experiment recommendation workflow
- Optional project-local slash command templates
- Evaluation prompts for testing the skill behavior

## Repository layout

```text
dlexperiment-skill/
├── SKILL.md                     # Required Claude skill file
├── README.md                    # Project documentation
├── INSTALL.md                   # Installation and smoke-test guide
├── LICENSE                      # MIT license
├── .gitignore                   # Local artifacts and experiment outputs to ignore
├── references/
│   └── experiment-os.md          # Additional reference notes
├── templates/
│   └── commands/                 # Optional Claude Code slash commands
│       ├── README.md
│       ├── exp-start.md
│       ├── exp-log.md
│       ├── exp-analyze.md
│       ├── exp-next.md
│       ├── exp-wrap.md
│       └── exp-auto.md
└── evals/
    └── evals.json                # Test prompts and assertions
```

## Install as a Claude skill

Copy or clone this directory into your Claude skills directory:

```text
~/.claude/skills/dlexperiment-skill/
```

On Windows, the user-level skills directory is typically:

```text
C:\Users\<you>\.claude\skills\dlexperiment-skill\
```

After installation, Claude can trigger the skill from natural language prompts such as:

```text
Start a new anomaly detection experiment for VisA transfer.
Record this ablation change in the current experiment.
Analyze the result of exp_20260529_001_visa_transfer.
Run an automatic 5-round threshold tuning experiment.
```

## Optional Claude Code slash commands

This repository includes command templates under `templates/commands/`.

To use them inside a project, copy the files into that project's `.claude/commands/` directory:

```text
<your-project>/
└── .claude/
    └── commands/
        ├── exp-start.md
        ├── exp-log.md
        ├── exp-analyze.md
        ├── exp-next.md
        ├── exp-wrap.md
        └── exp-auto.md
```
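If you prefer to script the copy step, a minimal Python sketch might look like the following. The function name `install_commands` and its arguments are hypothetical and not part of the skill; the sketch only assumes the `templates/commands/` layout shown earlier and copies the `exp-*.md` files, leaving the templates' own `README.md` behind.

```python
import shutil
from pathlib import Path


def install_commands(skill_dir: Path, project_dir: Path) -> list[str]:
    """Copy templates/commands/exp-*.md into <project>/.claude/commands/.

    Hypothetical helper: the skill itself does not ship this function.
    """
    src = skill_dir / "templates" / "commands"
    dst = project_dir / ".claude" / "commands"
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in sorted(src.glob("exp-*.md")):
        # copy2 also preserves file timestamps
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied
```

For example, `install_commands(Path.home() / ".claude/skills/dlexperiment-skill", Path("."))` would install the commands into the current project, assuming the skill was installed at the user-level path described above.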

Then call them in Claude Code:

```text
/exp-start daily Goal: verify whether response-gap improves VisA image AUROC
/exp-log daily Change threshold from 0.45 to 0.55
/exp-analyze full Analyze exp_20260529_001_visa_transfer
/exp-auto daily Automatically tune the selector threshold, at most 5 rounds
/exp-wrap full Wrap up this paper-grade experiment
```

## Operating modes

The skill supports two explicit modes.

### `daily`

Default mode. Use it for fast research iteration.

```text
/exp-start daily ...
/exp-log daily ...
/exp-analyze daily ...
/exp-next daily ...
/exp-wrap daily ...
/exp-auto daily ...
```

Characteristics:

- lightweight records
- fast startup
- less documentation overhead
- good for frequent ablations and exploratory runs

### `full`

Use it for important experiments that may need reproduction, paper tables, or collaborator review.

```text
/exp-start full ...
/exp-auto full ...
/exp-wrap full ...
```

Characteristics:

- full experiment archive
- code/environment snapshots
- git diff patches
- logs/metrics/figures structure
- stronger conclusion and reproducibility records

If no mode is specified, the skill assumes `daily`.
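To make the `full`-mode snapshot step concrete, here is a rough sketch of how the code and environment state could be captured. It is an illustration under assumptions, not the skill's actual implementation: the helper `capture_snapshots` is hypothetical, the output file names follow the `snapshots/` folder in the standard experiment layout, and the choice of `git diff HEAD` for the patch is a guess at what "git diff patches" means here.

```python
import subprocess
import sys
from pathlib import Path

# Output file -> command. File names mirror the snapshots/ folder;
# the exact commands are assumptions made for this sketch.
SNAPSHOT_COMMANDS = {
    "git_commit.txt": ["git", "rev-parse", "HEAD"],
    "git_diff.patch": ["git", "diff", "HEAD"],
    "pip_freeze.txt": [sys.executable, "-m", "pip", "freeze"],
    "python_version.txt": [sys.executable, "--version"],
    "gpu_info.txt": ["nvidia-smi"],
}


def capture_snapshots(snapshot_dir: Path) -> dict[str, bool]:
    """Run each command and save its output; return per-file success flags."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    status = {}
    for filename, command in SNAPSHOT_COMMANDS.items():
        try:
            proc = subprocess.run(
                command, capture_output=True, text=True, timeout=60
            )
            text = proc.stdout or proc.stderr
            ok = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired) as exc:
            # e.g. nvidia-smi is missing on a CPU-only machine
            text, ok = f"unavailable: {exc}\n", False
        (snapshot_dir / filename).write_text(text, encoding="utf-8")
        status[filename] = ok
    return status
```

A failed command still produces a file, so the archive records that the information was unavailable rather than silently omitting it.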

## Standard experiment layout

In `full` mode, experiments follow this structure:

```text
experiments/
├── index.md
└── exp_YYYYMMDD_NNN_slug/
    ├── README.md
    ├── goal.md
    ├── hypothesis.md
    ├── todo.md
    ├── progress.md
    ├── result.md
    ├── conclusion.md
    ├── configs/
    ├── patches/
    ├── checkpoints/
    ├── outputs/
    ├── logs/
    │   └── claude.log
    ├── metrics/
    ├── figures/
    ├── scripts/
    │   └── run.sh
    └── snapshots/
        ├── git_commit.txt
        ├── git_diff.patch
        ├── pip_freeze.txt
        ├── python_version.txt
        └── gpu_info.txt
```
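The layout above can be scaffolded with a few lines of Python. This is only a sketch: `scaffold_experiment` is a hypothetical helper, and the placeholder contents written into each file (a heading per document, a stub `run.sh`) are assumptions rather than the skill's real templates.

```python
from pathlib import Path

DOCS = ["README.md", "goal.md", "hypothesis.md", "todo.md",
        "progress.md", "result.md", "conclusion.md"]
DIRS = ["configs", "patches", "checkpoints", "outputs", "logs",
        "metrics", "figures", "scripts", "snapshots"]


def scaffold_experiment(root: Path, name: str) -> Path:
    """Create experiments/<name>/ with the standard folders and documents."""
    exp_dir = root / name
    for folder in DIRS:
        (exp_dir / folder).mkdir(parents=True, exist_ok=True)

    for doc in DOCS:
        path = exp_dir / doc
        if not path.exists():  # never overwrite existing notes
            path.write_text(f"# {doc[:-3]}\n", encoding="utf-8")

    (exp_dir / "logs" / "claude.log").touch()

    run_sh = exp_dir / "scripts" / "run.sh"
    if not run_sh.exists():
        # Placeholder script body; the real command is project-specific.
        run_sh.write_text(
            "#!/usr/bin/env bash\nset -euo pipefail\n"
            "# TODO: add the training/evaluation command\n",
            encoding="utf-8",
        )

    index = root / "index.md"
    if not index.exists():
        index.write_text("# Experiment index\n", encoding="utf-8")
    return exp_dir
```

The function is idempotent: calling it again on an existing experiment leaves files that are already there untouched.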

Experiment names follow:

```text
exp_YYYYMMDD_NNN_slug
```

Example:

```text
exp_20260529_001_visa_transfer
```
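One way to generate names in this format is sketched below. The README does not say whether `NNN` restarts each day or counts globally; this sketch assumes a per-day counter. The helpers `slugify` and `next_experiment_name` are hypothetical names introduced for illustration.

```python
import re
from datetime import date
from pathlib import Path

NAME_PATTERN = re.compile(r"^exp_(\d{8})_(\d{3})_([a-z0-9_]+)$")


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumerics into underscores."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return slug or "experiment"


def next_experiment_name(root: Path, title: str, today: date) -> str:
    """Return exp_YYYYMMDD_NNN_slug with the next free NNN for that day.

    Assumption: NNN is a per-day counter starting at 001.
    """
    day = today.strftime("%Y%m%d")
    used = []
    if root.is_dir():
        for child in root.iterdir():
            match = NAME_PATTERN.match(child.name)
            if match and match.group(1) == day:
                used.append(int(match.group(2)))
    return f"exp_{day}_{max(used, default=0) + 1:03d}_{slugify(title)}"
```

With an empty `experiments/` folder, `next_experiment_name(Path("experiments"), "VisA transfer", date(2026, 5, 29))` yields `exp_20260529_001_visa_transfer`, matching the example above.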

## Main workflows

### `/exp-start`

Create a new experiment directory, initialize experiment documents, record the goal/hypothesis, prepare `scripts/run.sh`, and update `experiments/index.md`.

### `/exp-log`

Record a change, run event, failure, metric observation, or decision.

Standard change record:

```md
## Modification record - YYYY-MM-DD HH:MM

Purpose:

Changed files:

Core change:

Expected impact:

Risk:

Related command/log/metric:

Next step:
```
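As an illustration, the record above could be rendered and appended programmatically. The sketch assumes that values are written on the same line as each label and that records are appended to a file such as `progress.md`; neither detail is specified by the template. `render_change_record` and `append_record` are hypothetical helpers.

```python
from datetime import datetime
from pathlib import Path

# (label in the template, keyword argument name) in template order
FIELDS = [
    ("Purpose", "purpose"),
    ("Changed files", "changed_files"),
    ("Core change", "core_change"),
    ("Expected impact", "expected_impact"),
    ("Risk", "risk"),
    ("Related command/log/metric", "related"),
    ("Next step", "next_step"),
]


def render_change_record(when: datetime, **values: str) -> str:
    """Fill the standard change record; missing fields stay blank."""
    lines = [f"## Modification record - {when:%Y-%m-%d %H:%M}", ""]
    for label, key in FIELDS:
        lines.append(f"{label}: {values.get(key, '')}".rstrip())
        lines.append("")
    return "\n".join(lines)


def append_record(log_path: Path, record: str) -> None:
    """Append a rendered record to the log file (assumed: progress.md)."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(record + "\n")
```

Leaving a field blank keeps its label in the output, so an incomplete record is visibly incomplete.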

### `/exp-analyze`

Extract metrics from logs/metrics files, compare against baseline, update `result.md`, update `conclusion.md`, and refresh `experiments/index.md`.
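The baseline comparison can be pictured as a simple per-metric delta table. The sketch below is an assumption about how such a comparison might be computed and formatted for `result.md`; `compare_metrics` and `to_markdown` are hypothetical names, and the metric dictionaries stand in for whatever the logs or metrics files actually contain.

```python
def compare_metrics(
    baseline: dict[str, float], current: dict[str, float]
) -> list[dict]:
    """Return one row per metric present in both runs, with the delta."""
    rows = []
    for name in sorted(baseline.keys() & current.keys()):
        rows.append({
            "metric": name,
            "baseline": baseline[name],
            "current": current[name],
            "delta": current[name] - baseline[name],
        })
    return rows


def to_markdown(rows: list[dict]) -> str:
    """Format comparison rows as a Markdown table."""
    lines = [
        "| Metric | Baseline | Current | Delta |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        lines.append(
            f"| {row['metric']} | {row['baseline']:.3f} "
            f"| {row['current']:.3f} | {row['delta']:+.3f} |"
        )
    return "\n".join(lines)
```

For a baseline `image_auc` of 0.884 and a current value of 0.912, the table row reads `| image_auc | 0.884 | 0.912 | +0.028 |`. Whether a positive delta is an improvement depends on the metric, which this sketch does not try to decide.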

### `/exp-next`

Recommend the next controlled experiment. Prefer one primary variable per experiment.

### `/exp-wrap`

Finalize an experiment as `done`, `abandoned`, or `superseded`; update the index and final conclusion.

### `/exp-auto` (Fully Autonomous Optimization)

Run an autonomous, end-to-end experiment loop. The loop starts either when the command is invoked explicitly or when Claude recognizes a natural-language goal such as "optimize this algorithm" or "run grid search".

```text
plan search space and target
→ prepare local background job OR cluster sbatch script
→ persist state in progress.md (to survive disconnects)
→ submit batch of jobs
→ monitor via polling/timers
→ auto-recover from crashes (e.g., OOM, syntax errors)
→ extract metrics and summarize
→ automatically spawn the next iteration
→ stop when target is met or budget exhausted
```

Use `/exp-auto` when you want Claude to act as an autonomous researcher, running optimizations and recovering from errors without waiting for human permission.

## Example prompts

```text
/exp-start daily Goal: verify whether response-gap outperforms the defect-only baseline; dataset: VisA; metrics: image AUROC, pixel AUROC, AUPRO
```

```text
/exp-log daily Purpose: verify whether selector_fix reduces normal false positives; modified models/selector.py and configs/train.yaml; threshold changed from 0.45 to 0.55; risk: defect recall may drop
```

```text
/exp-analyze full Analyze exp_20260529_001_visa_transfer; baseline image_auc=0.884, current image_auc=0.912
```

```text
/exp-auto daily Automatically test normal branch threshold: 0.45, 0.50, 0.55, 0.60; stop condition: image AUROC stops improving or at most 4 rounds
```

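**Using Claude Code's native `/goal` loop:**

```text
/goal "Use dlexperiment-skill to run a fully automatic grid search over threshold from 0.4 to 0.6. On any error, fix it automatically and retry. The goal is complete only after every parameter value has been tested and a comparison table has been generated in result.md."
```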

## Testing the skill

Evaluation prompts live in:

```text
evals/evals.json
```

They cover:

- starting a standardized experiment
- logging a model/config modification
- analyzing metrics and recommending the next step

If you have the `skill-creator` skill installed, you can use its evaluation workflow to compare behavior with and without this skill.

## Design philosophy

This skill intentionally favors protocol over free-form experimentation.

Deep learning research becomes hard to trust when experiments are scattered across terminal history, unnamed output folders, and undocumented config changes. This skill keeps the experiment state structured so Claude and the researcher can reliably answer:

- What was tested?
- Why was it changed?
- How was it run?
- What changed in the metrics?
- Was the hypothesis supported?
- What should be tried next?

## License

MIT License. See [LICENSE](LICENSE).