没养龙虾（OpenClaw），先养个马（Hermes）来做生物信息学

简佐义的博客

308人浏览 · 2026-05-12 20:42:04

简佐义的博客 · 2026-05-12 20:42:04 发布

AI 发展日新月异，还没来得及养龙虾，马（Hermes）又来了。“弃龙虾（OpenClaw）、选爱马仕（Hermes）” ，似乎正在形成共识。真是应了那句话：只要学得慢，就不用学。

既然如此，那么我们就先不管龙虾，今天先来安装一个马试试。

安装

我们先看一下 Hermes 的 GitHub 主页：https://github.com/nousresearch/hermes-agent

今天（2026-05-12），hermes 在 GitHub 上有 14.5 万颗星，这对于才火起来 1 个月左右的项目来说，已经非常成功了。

进入文档页面：https://hermes-agent.nousresearch.com/

我们复制安装命令：

curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash

运行安装——我这里是 Linux 系统。在运行这条命令之前，先要确保系统安装了Python 3.11。

配置

运行hermes setup

我们选择第一项：Quick setup。

设置大模型提供商

我们这里选择使用 DeepSeek。

我们先到 DeepSeek 官网的开放平台申请好 API Key：

取一个名字，比如：hermes

然后回到终端，在这里输入刚才创建的 API key：

接着 Base URL 填写这个：https://api.deepseek.com

选择模型：

选择终端后端

保持默认选项就好了。

设置消息平台

选择 QQ：

如果终端没有出现二维码，那么就复制提示的链接，到 QQ 开放平台创建一个 QQ 机器人，我们给它取一个名字：Hermes

配置好之后：

最后我们希望小马能常驻服务器，于是同意安装程序将其设为系统服务，这样重启服务器 hermes 也自动启动了。

但是出现了一条错误，原始是我们的当前账号galaxy没有添加系统服务的权限。

在 root 下执行：

# 1. 赋予 galaxy 用户常驻权限（确保退出登录后服务依然运行）
loginctl enable-linger galaxy

# 2. 切换到 galaxy 用户并进入对应目录
su - galaxy
cd /home/galaxy/.hermes/hermes-agent

# 3. 运行安装命令
hermes gateway install

打声招呼

要在 QQ 上跟 hermes 对话，首先必须要创建好机器人并且跟 hermes 配对好，就像蓝牙配对一样。其次hermes它只是一个智能体，它能工作要依赖于大模型，因此必须为其配置好大模型，我们这里用的是 DeepSeek v4。

可以看到，前面跟机器人对话它没反应，就是没有配置好。

单细胞学习路线图

不得不说，hermes给的单细胞学习建议还是非常科学客观的。先是打语言基础，然后是核心工具的掌握：Seurat，Scanpy。接着是标准分析，进阶分析。

单细胞标准分析

大家接下来看hermes的实战表演吧。

下面让它补一个celltypist分析：

最终 UMAP 图：

最后总结

有意思的是，hermes会自动将分析过程提取成skill，这或许就是它跟OpenClaw很大的不同之处。后者更依赖人工编写skill。

比如我们刚才的分析就自动保存在

/home/galaxy/.hermes/skills/data-science/single-cell-scanpy目录下：

.
├── references
│   ├── celltypist_pitfalls.md
│   └── rank_genes_groups_structured_array.md
├── SKILL.md
└── templates
    ├── celltypist_annotation.py
    └── pbmc3k_pipeline.py

2 directories, 5 files

SKILL.md

---
name: single-cell-scanpy
description: Single-cell RNA-seq analysis with Scanpy — QC, normalization, HVG, PCA, UMAP, clustering, marker detection, and cell-type annotation (manual marker + CellTypist automated).
---

# Single-Cell RNA-seq Analysis with Scanpy

Standard scRNA-seq analysis pipeline using Scanpy (Python). Covers loading data through cell-type annotation, with tool-specific pitfalls documented below.

## Triggers

- User asks to run single-cell analysis, scRNA-seq, Scanpy, PBMC analysis
- User mentions UMAP, Leiden clustering, marker genes, Seurat/Scanpy
- User wants to analyze `.h5ad` files or 10x Genomics data

## Standard Pipeline (10 Steps)

1. **Load data** — `sc.datasets.pbmc3k()` or `sc.read_h5ad()`
2. **QC filtering** — mark MT/ribo genes, `sc.pp.calculate_qc_metrics`, filter cells by n_genes and %MT
3. **Normalization** — `sc.pp.normalize_total(target_sum=1e4)` + `sc.pp.log1p()`
4. **HVG selection** — `sc.pp.highly_variable_genes(n_top_genes=2000)` (use default `seurat` flavor unless `scikit-misc` installed)
5. **Regress + scale** — `sc.pp.regress_out(['total_counts', 'pct_counts_mt'])` then `sc.pp.scale(max_value=10)`
6. **PCA** — `sc.tl.pca(svd_solver='arpack', n_comps=50)`
7. **Neighbors + UMAP** — `sc.pp.neighbors(n_pcs=15)`, `sc.tl.umap()`
8. **Clustering** — `sc.tl.leiden(resolution=1.0)`
9. **Marker genes** — `sc.tl.rank_genes_groups(groupby, method='wilcoxon', use_raw=True)`
10. **Cell-type annotation** — manual marker-based or automated via **CellTypist** (recommended: majority voting per cluster, far more accurate than manual markers)

## CellTypist Automated Annotation (Step 10-b)

CellTypist (Oxford Teichmann Lab) uses a pre-trained model with 98 immune cell types and 4164 gene features. **Always prefer this over manual marker-based annotation** — manual markers routinely misclassify MAIT cells, CD8+ Tem/Trm, and DCs (all of which express NKG7/CCL5/GZMB and get confused with NK cells).

```python
from celltypist import models
model = models.Model.load(model='Immune_All_Low.pkl')  # auto-downloads

# CellTypist needs log-norm data; create clean copy
adata_ct = adata.raw.to_adata()
adata_ct.obs = adata.obs.copy()
sc.pp.normalize_total(adata_ct, target_sum=1e4)
sc.pp.log1p(adata_ct)

predictions = celltypist.annotate(
    adata_ct, model='Immune_All_Low.pkl',
    majority_voting=True, over_clustering='leiden'
)
adata.obs['celltypist_label'] = predictions.predicted_labels['majority_voting'].values
```


See `references/celltypist_pitfalls.md` for model output columns, data prep, and manual-vs-automated comparison.
Full working script: `templates/celltypist_annotation.py`.

## Critical Pitfalls

### P1: `rank_genes_groups` yields structured numpy arrays

- `adata.uns['rank_genes_groups']['names']` is a 1D structured recarray
- Shape = `(n_genes,)` — NOT `(n_genes, n_clusters)`
- Number of clusters = `len(names.dtype.names)` — NOT `shape[1]`
- Access pattern: `names[rank][cluster_index]` — rank is gene rank (0=top), cluster_index is integer
- Always cast genes to `str()` before passing to plotting functions

```python
names = adata.uns['rank_genes_groups']['names']
n_clusters = len(names.dtype.names)
top_gene_cluster0 = str(names[0][0])  # top gene for cluster 0
```

### P2: `sc.pl.pca_variance_ratio` does not accept `ax` parameter

- Use separate `plt.figure()` calls; save and close each individually
- Same for `sc.pl.pca()` — it manages its own figure

### P3: Dotplot `var_names` must be plain Python strings

- Passing numpy record types (from structured arrays) causes `TypeError: unhashable type`
- Always convert: `var_names=[str(g) for g in gene_list]`

### P4: `sc.tl.score_genes` needs a flat gene list

- Passing `list(marker_dict.values())` gives list-of-lists — wrong
- Flatten: `[g for genes in marker_dict.values() for g in genes]`

### P5: `sc.tl.rank_genes_groups` should use log-normalized data

- Despite `use_raw=True`, log-normalize before calling or expect a warning
- The pipeline above normalizes + log1p before HVG, so data is ready

### P6: Vanilla `seurat` HVG flavor works out of the box

- `flavor='seurat_v3'` requires `scikit-misc` (not commonly pre-installed)
- Default `flavor='seurat'` needs no extra packages

### P7: CellTypist — majority voting output has no `conf_score` column

- `predictions.predicted_labels` columns: `['predicted_labels', 'over_clustering', 'majority_voting']`
- Use `majority_voting` for cluster-level consensus; no separate confidence column
- Must pass log-normalized data (not regressed/scaled) — create fresh `adata_ct` from `.raw`
- Model auto-downloads on first use; wrap in `stdbuf -oL -eL` to avoid buffered-hang appearance

## Running with real-time output

Always use `stdbuf -oL -eL python3 -u script.py` or `PYTHONUNBUFFERED=1 python3 -u script.py` to avoid buffered stdout in long analyses.

## Files

| Path                                               | Purpose                                                            |
| -------------------------------------------------- | ------------------------------------------------------------------ |
| `templates/pbmc3k_pipeline.py`                     | Complete 10-step pipeline — copy and modify for new datasets       |
| `templates/celltypist_annotation.py`               | CellTypist automated annotation with UMAP cluster labeling         |
| `references/rank_genes_groups_structured_array.md` | Deep-dive on the structured array access pattern (hardest pitfall) |
| `references/celltypist_pitfalls.md`                | CellTypist model output, data prep, manual-vs-automated comparison |

pbmc3k_pipeline.py

分析代码保存成了模板：

#!/usr/bin/env python3
"""
PBMC 3k scRNA-seq Analysis Pipeline (Scanpy)
==============================================
Validated template for single-cell analysis. Replace dataset loading
in Step 1 to use your own .h5ad file.

Generated by / updated by: see SKILL.md for full documentation.
"""

import scanpy as sc
import matplotlib.pyplot as plt
import os, sys
import warnings
warnings.filterwarnings('ignore')

sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=100, facecolor='white', frameon=True)

# --- Configuration ---
OUT_DIR = "./results/"
os.makedirs(OUT_DIR, exist_ok=True)

# ============================================================
# Step 1: Load data
# ============================================================
print("\n" + "="*60 + "\n  Step 1: Load data\n" + "="*60)
adata = sc.datasets.pbmc3k()  # Replace with sc.read_h5ad('your_data.h5ad')
print(f"  Dimensions: {adata.shape[0]} cells x {adata.shape[1]} genes")

# ============================================================
# Step 2: Quality Control
# ============================================================
print("\n" + "="*60 + "\n  Step 2: QC\n" + "="*60)

# Mark mitochondrial and ribosomal genes (adjust prefix for your species)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['ribo'] = adata.var_names.str.startswith(('RPS', 'RPL'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], inplace=True)

# --- QC plots (before filtering) ---
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=False, ax=axes[0], show=False)
axes[0].set_title('QC (before)')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='pct_counts_mt',
              ax=axes[1], show=False)
axes[1].set_title('Genes vs UMI')
axes[2].hist(adata.obs['n_genes_by_counts'], bins=100, alpha=0.7)
axes[2].axvline(200, color='red', linestyle='--', label='min=200')
axes[2].axvline(2500, color='darkred', linestyle='--', label='max=2500')
axes[2].set_xlabel('Number of genes')
axes[2].set_title('Gene count distribution')
axes[2].legend()
plt.tight_layout()
plt.savefig(f"{OUT_DIR}01_QC_before.png", dpi=150, bbox_inches='tight')
plt.close()

# --- Filtering ---
print(f"  Before filter: {adata.n_obs} cells")
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=2500)
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
sc.pp.filter_genes(adata, min_cells=3)
print(f"  After filter:  {adata.n_obs} cells, {adata.n_vars} genes")

# QC after filtering
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=False, ax=axes[0], show=False)
axes[0].set_title('QC (after)')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='pct_counts_mt',
              ax=axes[1], show=False)
axes[1].set_title('Genes vs UMI (clean)')
top20 = adata.var_names[adata.var['n_cells_by_counts'].argsort()[::-1][:20]]
sc.pl.highest_expr_genes(adata, n_top=20, ax=axes[2], show=False)
plt.tight_layout()
plt.savefig(f"{OUT_DIR}02_QC_after.png", dpi=150, bbox_inches='tight')
plt.close()

# ============================================================
# Step 3: Normalization
# ============================================================
print("\n" + "="*60 + "\n  Step 3: Normalization\n" + "="*60)
adata.raw = adata.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# ============================================================
# Step 4: Highly Variable Genes
# ============================================================
print("\n" + "="*60 + "\n  Step 4: HVG selection\n" + "="*60)
# Use default 'seurat' flavor (no extra deps). 'seurat_v3' needs scikit-misc.
sc.pp.highly_variable_genes(adata, n_top_genes=2000, min_mean=0.0125, max_mean=3, min_disp=0.5)
n_hvg = adata.var.highly_variable.sum()
print(f"  HVGs: {n_hvg}")

sc.pl.highly_variable_genes(adata, show=False)
plt.savefig(f"{OUT_DIR}03_HVG.png", dpi=150, bbox_inches='tight')
plt.close()

adata = adata[:, adata.var.highly_variable].copy()

# ============================================================
# Step 5: Regress + Scale
# ============================================================
print("\n" + "="*60 + "\n  Step 5: Regress + Scale\n" + "="*60)
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)

# ============================================================
# Step 6: PCA
# ============================================================
print("\n" + "="*60 + "\n  Step 6: PCA\n" + "="*60)
sc.tl.pca(adata, svd_solver='arpack', n_comps=50)

# PCA plots — note: pca_variance_ratio does NOT accept ax=
sc.pl.pca_variance_ratio(adata, n_pcs=50, show=False)
plt.title('Elbow plot')
plt.savefig(f"{OUT_DIR}04a_elbow.png", dpi=150, bbox_inches='tight')
plt.close()

sc.pl.pca(adata, color=['n_genes_by_counts', 'pct_counts_mt'], show=False)
plt.suptitle('PCA colored by QC metrics')
plt.savefig(f"{OUT_DIR}04b_pca_qc.png", dpi=150, bbox_inches='tight')
plt.close()

# ============================================================
# Step 7: Neighbors + UMAP
# ============================================================
print("\n" + "="*60 + "\n  Step 7: Neighbors + UMAP\n" + "="*60)
n_pcs = 15
sc.pp.neighbors(adata, n_pcs=n_pcs, n_neighbors=15)
sc.tl.umap(adata, min_dist=0.3, spread=1.0)

# ============================================================
# Step 8: Clustering
# ============================================================
print("\n" + "="*60 + "\n  Step 8: Clustering\n" + "="*60)
for res in [0.5, 0.8, 1.0, 1.2]:
    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_r{res}')
print("  Clusters per resolution:")
for res in [0.5, 0.8, 1.0, 1.2]:
    print(f"    res={res}: {adata.obs[f'leiden_r{res}'].nunique()}")

adata.obs['leiden'] = adata.obs['leiden_r1.0'].astype(str)

# UMAP clustering visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 14))
sc.pl.umap(adata, color='leiden', legend_loc='right margin', ax=axes[0,0],
           title=f'Leiden (n={adata.obs.leiden.nunique()})', show=False)
sc.pl.umap(adata, color='n_genes_by_counts', ax=axes[0,1], title='n_genes', show=False)
sc.pl.umap(adata, color='pct_counts_mt', ax=axes[1,0], title='%MT', show=False)
sc.pl.umap(adata, color='total_counts', ax=axes[1,1], title='total UMI', show=False)
plt.tight_layout()
plt.savefig(f"{OUT_DIR}05_UMAP_clusters.png", dpi=150, bbox_inches='tight')
plt.close()

# ============================================================
# Step 9: Marker Genes
# ============================================================
print("\n" + "="*60 + "\n  Step 9: Marker Genes\n" + "="*60)
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon', use_raw=True)

# CRITICAL: rank_genes_groups['names'] is a 1D structured recarray
#   Shape = (n_genes,), clusters accessed via dtype.names
#   Access: names[rank][cluster_index]
names_struct = adata.uns['rank_genes_groups']['names']
scores_struct = adata.uns['rank_genes_groups']['scores']
n_clusters = len(names_struct.dtype.names)

print("\n  Top5 markers per cluster:")
for ci in range(n_clusters):
    clabel = names_struct.dtype.names[ci]
    markers = [str(names_struct[j][ci]) for j in range(min(5, len(names_struct)))]
    scores  = [scores_struct[j][ci] for j in range(min(5, len(names_struct)))]
    print(f"    Cluster {clabel:>2}: " + ', '.join(f"{g}({s:.1f})" for g,s in zip(markers, scores)))

# Dotplot with top-3 per cluster (cast to plain strings!)
top3_genes = []
for ci in range(n_clusters):
    for rank in range(min(3, len(names_struct))):
        top3_genes.append(str(names_struct[rank][ci]))
top3_genes = list(set(top3_genes))
sc.pl.dotplot(adata, var_names=top3_genes, groupby='leiden', show=False)
plt.savefig(f"{OUT_DIR}07_Dotplot_top3.png", dpi=150, bbox_inches='tight')
plt.close()

# ============================================================
# Step 10: Cell-Type Annotation
# ============================================================
print("\n" + "="*60 + "\n  Step 10: Cell-Type Annotation\n" + "="*60)

# Adjust marker genes for your dataset/species
marker_dict = {
    'CD14+ Monocytes':   ['CD14', 'LYZ', 'S100A9'],
    'FCGR3A+ Monocytes': ['FCGR3A', 'MS4A7', 'LST1'],
    'CD4+ T cells':      ['CD3D', 'CD3E', 'IL7R', 'CD4'],
    'CD8+ T cells':      ['CD3D', 'CD3E', 'CD8A', 'CD8B'],
    'NK cells':          ['NKG7', 'GNLY', 'KLRD1'],
    'B cells':           ['CD79A', 'MS4A1', 'CD19'],
    'Dendritic cells':   ['FCER1A', 'CST3'],
    'Megakaryocytes':    ['PPBP', 'PF4'],
}

# Average expression per cluster for each cell type
cell_type_anno = {}
for cluster in sorted(adata.obs['leiden'].unique(), key=int):
    mask = adata.obs['leiden'] == cluster
    ct_scores = {}
    for ct, genes in marker_dict.items():
        valid_genes = [g for g in genes if g in adata.raw.var_names]
        if not valid_genes:
            ct_scores[ct] = 0
            continue
        avg_expr = adata.raw[mask, valid_genes].X.mean()
        try:
            avg_expr = float(avg_expr)
        except:
            avg_expr = 0
        ct_scores[ct] = avg_expr

    best_ct = max(ct_scores, key=ct_scores.get)
    cell_type_anno[cluster] = best_ct
    sorted_by_score = sorted(ct_scores.items(), key=lambda x: x[1], reverse=True)
    candidates = ' / '.join(f"{ct}({s:.2f})" for ct, s in sorted_by_score[:2])
    print(f"  Cluster {cluster:>2}: \u2192 {best_ct} ({candidates})")

adata.obs['cell_type'] = adata.obs['leiden'].map(cell_type_anno)

# Cell-type UMAP
fig, axes = plt.subplots(1, 2, figsize=(20, 8))
sc.pl.umap(adata, color='leiden', legend_loc='right margin', ax=axes[0],
           title='Leiden Clusters', show=False)
sc.pl.umap(adata, color='cell_type', legend_loc='right margin', ax=axes[1],
           title='Predicted Cell Types', show=False)
plt.tight_layout()
plt.savefig(f"{OUT_DIR}09_UMAP_celltypes.png", dpi=150, bbox_inches='tight')
plt.close()

# Marker dotplot by cell type
marker_genes_all = list(dict.fromkeys(
    g for genes in marker_dict.values() for g in genes if g in adata.raw.var_names
))
sc.pl.dotplot(adata, var_names=marker_genes_all, groupby='cell_type',
              use_raw=True, dendrogram=True, show=False)
plt.savefig(f"{OUT_DIR}10_Dotplot_markers.png", dpi=150, bbox_inches='tight')
plt.close()

# ============================================================
# Summary
# ============================================================
print(f"""
{'='*60}
  Analysis Complete!
{'='*60}
Cells after QC:  {adata.n_obs}
HVGs:            {n_hvg}
PCs used:        {n_pcs}
Leiden clusters: {adata.obs['leiden'].nunique()}

Cell types:
""")
for ct in sorted(set(cell_type_anno.values())):
    count = (adata.obs['cell_type'] == ct).sum()
    pct = count / adata.n_obs * 100
    print(f"  {ct}: {count} cells ({pct:.1f}%)")

print(f"\nResults saved to: {OUT_DIR}")

大家可以看看它这个代码写得到底对不对，质量如何？

好了，今天养马的尝试就先到这里。大家有什么好的想法欢迎到群里交流：

中国银河生信云平台精品课程

中国银河生信云平台（UseGalaxy.cn）致力于生信平权。海量云端算力、8000+生信工具结合AI，推动生信进入3.0时代：数据分析从本地到云端，从手工到 AI。加入交流群，免费领取学习资料。

特色生信培训，助你丝滑发顶刊：

单细胞数据分析培训班（Python/Galaxy可选），不怕学不会

转录组数据分析实战（Galaxy）| 直播回放

咨询小助手：usegalaxy

DAMO开发者矩阵

DAMO开发者矩阵，由阿里巴巴达摩院和中国互联网协会联合发起，致力于探讨最前沿的技术趋势与应用成果，搭建高质量的交流与分享平台，推动技术创新与产业应用链接，围绕“人工智能与新型计算”构建开放共享的开发者生态。

更多推荐

机器人项目方案前置评审清单：定方案前先问清 5 个问题

接口文档不要只写正常流程，至少要补充异常状态、恢复动作、责任归属和记录证据。

DAMO开发者矩阵

20_项目实战六_企业文档智能检索平台_LangSmith监控_权限管控_数据看板

前面我们已经做过个人知识库问答助手、客服机器人、SQL 查询助手、代码审查 Agent 和多 Agent 日报系统。这些项目解决的是单点能力。多个部门都要上传文档。不同用户只能看自己有权限的文档。支持 PDF、Word、Markdown、网页、内部 Wiki。文档更新后要增量入库。用户提问时要返回答案和引用来源。管理员要看到 QPS、延迟、Token 消耗、用户活跃度。出问题时要能追踪每一次检索、