# skill **Repository Path**: mstitop/skill ## Basic Information - **Project Name**: skill - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-04-17 - **Last Updated**: 2026-04-17 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # 🦀 PinchBench **Real-world benchmarks for AI coding agents** [![Leaderboard](https://img.shields.io/badge/leaderboard-pinchbench.com-blue)](https://pinchbench.com) [![License](https://img.shields.io/badge/license-MIT-green)](LICENSE) ![Tasks](https://img.shields.io/badge/tasks-53-orange) > **Note:** This repository contains the benchmark skill/tasks. It is NOT the source of official leaderboard results. To add models to the official results, modify [pinchbench/scripts/default-models.yml](https://github.com/pinchbench/scripts/blob/main/default-models.yml). PinchBench measures how well LLM models perform as the brain of an [OpenClaw](https://github.com/openclaw/openclaw) agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files. Results are collected on a public leaderboard at **[pinchbench.com](https://pinchbench.com)**. ![PinchBench](pinchbench.png) ## Why PinchBench? Most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents: - **Tool usage** — Can the model call the right tools with the right parameters? - **Multi-step reasoning** — Can it chain together actions to complete complex tasks? - **Real-world messiness** — Can it handle ambiguous instructions and incomplete information? - **Practical outcomes** — Did it actually create the file, send the email, or schedule the meeting? ## Quick Start ```bash # Clone the skill git clone https://github.com/pinchbench/skill.git cd skill # Run benchmarks with your model of choice ./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4 # Or run specific tasks ./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_calendar,task_stock ``` > **Note:** Model IDs must include their provider prefix (e.g. `openrouter/`, `anthropic/`). [OpenRouter](https://openrouter.ai) is the default provider used for routing. **Requirements:** - Python 3.10+ - [uv](https://docs.astral.sh/uv/) package manager - A running OpenClaw instance ## What Gets Tested PinchBench includes 53 tasks across real-world categories: | Category | Tasks | What's tested | | ---------------- | --------------------------------------- | ---------------------------------------- | | **Productivity** | Calendar, daily summaries | Event creation, time parsing, scheduling | | **Research** | Stock prices, conferences, markets | Web search, data extraction, synthesis | | **Writing** | Blog posts, emails, humanization | Content generation, tone, formatting | | **Coding** | Weather scripts, file structures | Code generation, file operations | | **Analysis** | Spreadsheets, PDFs, documents | Data processing, summarization | | **Email** | Triage, search | Inbox management, filtering | | **Memory** | Context retrieval, knowledge management | Long-term memory, recall | | **Skills** | ClawHub, skill discovery | OpenClaw ecosystem integration | Each task is graded automatically, by an LLM judge, or both — ensuring both objective and nuanced evaluation. ## Submitting Results To get your results on the leaderboard: ```bash # Register for an API token (one-time) ./scripts/run.sh --register # Run benchmark — results auto-upload with your token ./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4 ``` Skip uploading with `--no-upload` if you just want local results. ### Official Results To submit an official run (marked on the leaderboard): ```bash # Using environment variable export PINCHBENCH_OFFICIAL_KEY=your_official_key ./scripts/run.sh --model anthropic/claude-sonnet-4 # Using command line flag ./scripts/run.sh --model anthropic/claude-sonnet-4 --official-key your_official_key ``` ## Command Reference | Flag | Description | | ------------------------ | ----------------------------------------------------------------------------- | | `--model MODEL` | Model to test (e.g., `openrouter/anthropic/claude-sonnet-4`) | | `--judge MODEL` | Judge model for LLM grading; uses direct API when set (see below) | | `--suite SUITE` | `all`, `automated-only`, or comma-separated task IDs | | `--runs N` | Number of runs per task for averaging | | `--timeout-multiplier N` | Scale timeouts for slower models | | `--output-dir DIR` | Where to save results (default: `results/`) | | `--no-upload` | Skip uploading to leaderboard | | `--register` | Request an API token for submissions | | `--upload FILE` | Upload a previous results JSON | | `--official-key KEY` | Mark submission as official (or use `PINCHBENCH_OFFICIAL_KEY` env var) | ### Judge By default (no `--judge` flag), the LLM judge runs as an OpenClaw agent session. When `--judge` is specified, it calls the model API directly instead, bypassing OpenClaw personality injection. ```bash # Default: OpenClaw agent session (no --judge needed) ./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4 # Direct API via OpenRouter ./scripts/run.sh --model openai/gpt-4o --judge openrouter/anthropic/claude-sonnet-4-5 # Direct API via Anthropic ./scripts/run.sh --model openai/gpt-4o --judge anthropic/claude-sonnet-4-5-20250514 # Direct API via OpenAI ./scripts/run.sh --model openai/gpt-4o --judge openai/gpt-4o # Headless Claude CLI ./scripts/run.sh --model openai/gpt-4o --judge claude ``` Required env vars: `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY` depending on the judge model prefix. ## Contributing Tasks We welcome new tasks! Check out [`tasks/TASK_TEMPLATE.md`](tasks/TASK_TEMPLATE.md) for the format. Good tasks are: - **Real-world** — Something an actual user would ask an agent to do - **Measurable** — Clear success criteria that can be graded - **Reproducible** — Same task should produce consistent grading - **Challenging** — Tests agent capabilities, not just LLM knowledge ### Transcript Archive Session transcripts are automatically saved to `results/{run_id}_transcripts/` alongside the results JSON. Each task's full agent conversation is preserved as a JSONL file (e.g. `task_calendar.jsonl`) for post-run analysis. ## Links - **Leaderboard:** [pinchbench.com](https://pinchbench.com) - **OpenClaw:** [github.com/openclaw/openclaw](https://github.com/openclaw/openclaw) - **Issues:** [github.com/pinchbench/skill/issues](https://github.com/pinchbench/skill/issues) ## Star History Star History Chart ## License MIT — see [LICENSE](LICENSE) for details. --- _Claw-some AI agent testing_ 🦞