第二篇：核心架构深度解析

三层架构、AIAgent核心循环、Provider路由

1. 整体架构图

Hermes Agent 采用三层架构设计，从上到下依次为 CLI 层、Gateway 层和 Agent 层：

┌─────────────────────────────────────────────────────────┐
│                     用户交互层                            │
│  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────────┐  ┌───────┐ │
│  │ CLI  │  │ API  │  │ 飞书  │  │ Telegram │  │ ...   │ │
│  │终端   │  │Server│  │      │  │          │  │19+平台│ │
│  └──┬───┘  └──┬───┘  └──┬───┘  └────┬─────┘  └───┬───┘ │
│     │         │         │           │             │     │
├─────┼─────────┼─────────┼───────────┼─────────────┼─────┤
│     │    Gateway 层 (gateway/)      │             │     │
│     │  ┌────┴────────┴──────────────┴─────────────┴┐   │
│     │  │         GatewayRunner (run.py:510)         │   │
│     │  │  ┌──────────┐  ┌──────────┐  ┌─────────┐  │   │
│     │  │  │ Session  │  │ Delivery │  │ Platform│  │   │
│     │  │  │ Store    │  │ Router   │  │ Adapter │  │   │
│     │  │  └──────────┘  └──────────┘  └─────────┘  │   │
│     │  └────────────────┬──────────────────────────┘   │
│     │                   │                               │
├─────┼───────────────────┼───────────────────────────────┤
│     │           Agent 层 (run_agent.py)                 │
│     │  ┌────────────────┴──────────────────────────┐   │
│     │  │           AIAgent (run_agent.py:492)       │   │
│     │  │                                             │   │
│     │  │  ┌──────────┐  ┌──────────┐  ┌─────────┐  │   │
│     │  │  │ Provider │  │ Tool     │  │ Context │  │   │
│     │  │  │ Router   │  │ Registry │  │Compress │  │   │
│     │  │  └──────────┘  └──────────┘  └─────────┘  │   │
│     │  │                                             │   │
│     │  │  ┌──────────┐  ┌──────────┐  ┌─────────┐  │   │
│     │  │  │ Memory   │  │ SubAgent │  │ Display │  │   │
│     │  │  │ Manager  │  │Delegate  │  │ System  │  │   │
│     │  │  └──────────┘  └──────────┘  └─────────┘  │   │
│     │  └────────────────────────────────────────────┘   │
│     │                                                   │
│     │  ┌────────────────────────────────────────────┐   │
│     │  │        工具层 (tools/)                       │   │
│     │  │  terminal / file / web / browser / vision   │   │
│     │  │  delegate / skills / memory / cronjob / ... │   │
│     │  └────────────────────────────────────────────┘   │
│     │                                                   │
└─────┴───────────────────────────────────────────────────┘

2. 消息的完整旅程

以用户在飞书发送 "帮我部署 todo-app" 为例，追踪一条消息从接收到响应的全过程。

sequenceDiagram
    participant U as 用户(飞书)
    participant FA as FeishuAdapter
    participant GR as GatewayRunner
    participant HR as HookRegistry
    participant AG as AIAgent
    participant PB as PromptBuilder
    participant LLM as LLM API
    participant TF as handle_function_call
    participant TE as TerminalTool
    participant SS as SessionStore

    U->>FA: WebSocket 事件到达
    FA->>FA: _on_message_event() 解析为 MessageEvent
    FA->>GR: callback(MessageEvent)
    GR->>GR: _handle_message() 查找/创建 SessionContext
    GR->>HR: emit("agent:start", context)
    GR->>AG: _run_agent(message, context_prompt, history)
    AG->>AG: __init__() 加载工具、记忆、技能
    AG->>PB: build_context_files_prompt() + build_skills_system_prompt()
    PB-->>AG: 完整 system_prompt
    AG->>LLM: messages → chat_completions API
    LLM-->>AG: tool_calls=[terminal_tool("git clone ...")]
    AG->>TF: handle_function_call("terminal", args)
    TF->>TE: TerminalTool 在 local/docker 后端执行
    TE-->>TF: 命令输出
    TF-->>AG: 工具结果字符串
    AG->>LLM: messages + tool_result 再次调用
    LLM-->>AG: 纯文本响应 "已完成部署..."
    AG-->>GR: agent_result
    GR->>HR: emit("agent:end", context)
    GR->>FA: adapter.send(chat_id, response)
    FA->>U: 飞书消息卡片
    GR->>SS: session_store._save() 持久化对话

步骤详解

① 飞书收到消息

源文件: gateway/platforms/feishu.py _on_message_event() (第 1744 行)

关键代码:

# feishu.py 第 1744-1760 行
def _on_message_event(self, data: Any) -> None:
    """Normalize Feishu inbound events into MessageEvent."""
    # 解析飞书事件数据，提取 text/mention/文件 等字段
    event = MessageEvent(platform=Platform.FEISHU, ...)

调试提示: 如果消息无响应，首先检查 WebSocket 连接是否存活（日志中搜索 [Feishu] Dropping inbound message），以及 App 权限是否包含 im:message:receive_v1。

② Gateway 路由

源文件: gateway/run.py _handle_message() (第 2313 行)

关键代码:

# run.py 第 2313 行
async def _handle_message(self, event: MessageEvent) -> Optional[str]:
    # 根据 chat_id 查找已有 SessionEntry，或创建新会话
    source = SessionSource(platform=event.platform, chat_id=event.chat_id, ...)

调试提示: 会话未正确关联时，检查 SessionResetPolicy 是否过早重置了会话，或 chat_id 提取逻辑是否被飞书的事件格式变化破坏。

③ Hook 触发 (agent:start)

源文件: gateway/hooks.py HookRegistry.emit() (第 138 行) + gateway/run.py (第 3487 行)

关键代码:

# run.py 第 3487 行
await self.hooks.emit("agent:start", {
    "platform": source.platform.value,
    "user_id": source.user_id,
    "session_id": session_entry.session_id,
    "message": message_text[:500],
})

调试提示: Hook 抛出的异常不会中断主流程（emit 内部 try/except），但会打印日志。如果某个 Hook 的副作用（如通知）未触发，检查 [hooks] Error loading hook 日志。

④ Agent 初始化

源文件: run_agent.py AIAgent.__init__() (第 516 行起)

关键代码:

# run_agent.py 第 976-1003 行 — 工具加载
self._tool_defs, self._tool_callables = get_tool_definitions(...)
# 第 1089-1112 行 — 记忆初始化
self.memory_manager = MemoryManager(...)

调试提示: 初始化耗时过长通常是因为工具探测（如检查 Docker 是否可用）。如果某个工具加载失败，Agent 仍会继续启动，但该工具不可用。

⑤ 系统提示词构建

源文件: agent/prompt_builder.py build_context_files_prompt() (第 948 行)、build_skills_system_prompt() (第 533 行)

关键代码:

# prompt_builder.py 第 948-960 行
def build_context_files_prompt(cwd=None, skip_soul=False):
    """加载 SOUL.md → CLAUDE.md → .cursorrules 等上下文文件"""
    # 按 priority 顺序查找，首个匹配即返回

调试提示: 如果 Agent 行为不符合预期（如缺少技能描述），在 verbose 模式下检查 system_prompt 的实际内容，确认上下文文件是否被正确发现和加载。

⑥ 第一次 LLM 调用

源文件: run_agent.py run_conversation() 内部（第 7506 行起）

关键代码:

# run_agent.py — 构建请求并发送到 LLM API
response = self.client.chat.completions.create(
    model=self.model,
    messages=messages,
    tools=tool_definitions,
    ...
)

调试提示: 调用超时或 429 限速时，Agent 会自动重试并可能切换到 Fallback Provider。检查日志中 Rate limited 或 ConnectionError 关键字。

⑦ LLM 返回 tool_calls

源文件: run_agent.py run_conversation() 主循环中解析响应

关键代码:

# LLM 返回格式示例：
# tool_calls=[{"function": {"name": "terminal", "arguments": '{"command":"git clone ..."}'}}]

调试提示: 如果模型返回了不存在的工具名，handle_function_call 会返回错误信息并被注入 messages，LLM 通常会自行修正。

⑧ 工具执行

源文件: model_tools.py handle_function_call() (第 459 行)

关键代码:

# model_tools.py 第 459-548 行
def handle_function_call(function_name, function_args, ...):
    # 路由到对应的工具函数
    # terminal_tool → TerminalTool(backend="local"|"docker")

调试提示: 命令执行失败时，工具返回值包含完整 stderr。如果命令被安全策略拦截，检查 security 配置中的命令白名单。

⑨ 结果注入

源文件: run_agent.py run_conversation() 主循环

关键代码:

# 将工具输出追加到 messages 并再次调用 LLM
messages.append({"role": "tool", "content": tool_output, "tool_call_id": ...})

调试提示: 工具输出过长时会被截断（由 log_prefix_chars 控制）。如果 LLM 似乎"忘记"了工具结果，检查上下文压缩是否过于激进。

⑩ 迭代循环

源文件: run_agent.py IterationBudget (第 170 行)

关键代码:

# run_agent.py 第 170-212 行
class IterationBudget:
    def consume(self) -> bool:
        """消耗一次迭代，耗尽返回 False"""

调试提示: 预算耗尽时 Agent 会收到一条警告提示并获得最后一次机会总结。如果任务未完成就停止了，考虑增大 agent.max_turns 配置。

⑪ Hook 触发 (agent:end)

源文件: gateway/run.py (第 3576 行)

关键代码:

# run.py 第 3576 行
await self.hooks.emit("agent:end", {
    **hook_ctx,
    "response": (response or "")[:500],
})

调试提示: agent:end Hook 可用于记录日志或触发后续通知。如果 Hook 未执行，确认 Agent 不是因异常崩溃退出（异常路径可能跳过此 emit）。

⑫ 响应发送

源文件: gateway/platforms/feishu.py FeishuAdapter.send() (第 1327 行)

关键代码:

# feishu.py 第 1327 行
async def send(self, chat_id: str, content: str, ...):
    # 如果内容超长则分段发送
    # 支持 Markdown → 飞书消息卡片渲染

调试提示: 飞书有单条消息 8000 字符限制（MAX_MESSAGE_LENGTH），超长内容会自动拆分。如果消息未到达用户，检查 Bot 是否被移出群聊或被管理员禁用。

⑬ 会话持久化

源文件: gateway/session.py SessionStore._save() + gateway/run.py 中多处调用

关键代码:

# run.py — 会话状态保存
session_entry.session_id = agent.session_id
self.session_store._save()
# session_store.rewrite_transcript() 保存完整对话历史

调试提示: 会话数据保存在 ~/.hermes/sessions/ 目录下。如果 Gateway 重启后会话丢失，确认 SQLite 后端是否正常初始化（降级到 JSONL 时功能不变但性能下降）。

3. 三层架构详解

3.1 CLI 层 — `hermes_cli/main.py`

CLI 层是 Hermes Agent 与用户交互的前端，入口函数在 hermes_cli/main.py。该文件的执行流程遵循严格顺序：

第一步：Profile 覆盖（第 83 行 _apply_profile_override()）

这是整个 CLI 流程中最早执行的逻辑，必须发生在所有 Hermes 模块导入之前。原因在于许多模块在导入时缓存 HERMES_HOME 路径（作为模块级常量），如果 Profile 覆盖延迟，这些模块会使用错误的 home 目录。

# hermes_cli/main.py 第 83-137 行
def _apply_profile_override() -> None:
    """Pre-parse --profile/-p and set HERMES_HOME before module imports."""
    argv = sys.argv[1:]
    profile_name = None
    # 1. 检查显式 -p / --profile 参数
    # 2. 检查 ~/.hermes/active_profile 文件（sticky default）
    # 3. 解析并设置 HERMES_HOME 环境变量

Profile 的解析优先级为：命令行参数 > active_profile 文件 > 默认 ~/.hermes/。

第二步：环境加载

# hermes_cli/main.py 第 140-144 行
from hermes_cli.config import get_hermes_home
from hermes_cli.env_loader import load_hermes_dotenv
load_hermes_dotenv(project_env=PROJECT_ROOT / '.env')

从 ~/.hermes/.env 加载环境变量，优先级高于系统环境变量。

第三步：子命令分发

CLI 支持以下主要子命令：

子命令	功能	关键实现
`hermes` / `hermes chat`	交互式对话	TUI 循环
`hermes gateway`	启动 Gateway 服务	调用 `gateway/run.py`
`hermes setup`	交互式配置向导	`hermes_cli/config.py`
`hermes doctor`	诊断配置问题	检查依赖和配置完整性
`hermes sessions browse`	会话浏览器	curses TUI
`hermes cron`	定时任务管理	`cron/` 模块

3.2 Gateway 层 — `gateway/run.py`

Gateway 层是 Hermes Agent 作为服务运行时的核心，负责管理多平台消息接入和路由。

GatewayRunner 类（第 510 行）

GatewayRunner 是 Gateway 层的主控制器，管理着整个服务的生命周期：

# gateway/run.py 第 510-603 行
class GatewayRunner:
    def __init__(self, config: Optional[GatewayConfig] = None):
        self.config = config or load_gateway_config()
        self.adapters: Dict[Platform, BasePlatformAdapter] = {}
        self.session_store = SessionStore(...)
        self.delivery_router = DeliveryRouter(self.config)
        # Agent 缓存（维持 Prompt Caching 状态，避免每条消息重建 AIAgent 导致缓存失效）
        self._agent_cache: Dict[str, tuple] = {}
        # 运行中的 Agent 实例（用于中断支持）
        self._running_agents: Dict[str, Any] = {}
        # 会话级模型覆盖（/model 命令）
        self._session_model_overrides: Dict[str, Dict[str, str]] = {}

SSL 自动检测（第 35-71 行）

Gateway 启动时最先执行的是 SSL 证书自动检测。这在 NixOS 等非标准系统上尤为重要，Python 可能找不到系统 CA 证书：

# gateway/run.py 第 35-71 行
def _ensure_ssl_certs() -> None:
    """Set SSL_CERT_FILE if the system doesn't expose CA certs to Python."""
    # 1. Python 编译时默认路径
    # 2. certifi 包自带的 Mozilla CA Bundle
    # 3. 各发行版的常见证书路径（Debian/RHEL/SUSE/Alpine/macOS）

配置桥接机制（第 89-207 行）

Gateway 启动时将 config.yaml 的值桥接到环境变量，使 os.getenv() 统一读取：

# gateway/run.py 第 89-136 行
_config_path = _hermes_home / 'config.yaml'
if _config_path.exists():
    # 1. 顶层简单值桥接到环境变量（不覆盖 .env）
    for _key, _val in _cfg.items():
        if isinstance(_val, (str, int, float, bool)) and _key not in os.environ:
            os.environ[_key] = str(_val)
    # 2. terminal 配置桥接到 TERMINAL_* 环境变量
    # 3. auxiliary 配置桥接到 AUXILIARY_* 环境变量
    # 4. agent 配置桥接到 HERMES_* 环境变量

配置优先级为：.env 文件 > config.yaml > 代码默认值。

平台 Adapter 管理

每个平台由一个独立的 adapter 类实现，继承自 BasePlatformAdapter（gateway/platforms/base.py）。所有 adapter 在 GatewayRunner 启动时初始化，支持：

自动重连失败的平台（_failed_platforms 字典）
优雅关闭（_shutdown_event）
热重启（_restart_requested 标志）

3.3 Agent 层 — `run_agent.py`

Agent 层是 Hermes Agent 的引擎，核心类是 AIAgent。

AIAgent 类（第 492 行）

AIAgent 是一个重量级的类，其 __init__ 方法从第 516 行起约 800 行，完成了以下初始化工作：

初始化阶段	代码位置	说明
安全 stdio 包装	第 613 行	`_install_safe_stdio()` 防止管道断裂崩溃
API 模式检测	第 645-661 行	自动选择 chat_completions / codex_responses / anthropic_messages
Provider 路由	第 825-949 行	初始化 LLM 客户端（OpenAI/Anthropic/Codex）
Fallback 链	第 950-973 行	配置备用 Provider 链
工具加载	第 976-1003 行	调用 `get_tool_definitions()` 加载可用工具
记忆系统	第 1089-1112 行	初始化 MEMORY.md / USER.md 持久记忆
检查点管理	第 1042-1047 行	文件系统快照（可选）
上下文压缩	第 1190 行附近	初始化 ContextCompressor

4. AIAgent 核心循环

4.1 run_conversation() 流程

run_conversation() 是 AIAgent 的核心方法（第 7506 行），每次用户发送消息时调用。其完整流程如下：

用户消息输入
    │
    ▼
[1] 初始化（第 7534-7597 行）
    ├── 安全 stdio 安装
    ├── Surrogate 字符清理
    ├── 重试计数器重置
    ├── 连接健康检查（清理死连接）
    └── IterationBudget 重建
    │
    ▼
[2] 系统提示构建
    ├── 加载 SOUL.md / AGENTS.md
    ├── 注入平台上下文
    ├── 注入记忆上下文
    └── 注入技能提示
    │
    ▼
[3] 主循环（迭代直到完成）
    │
    ├──[3.1] 调用 LLM API
    │   ├── 构建请求（messages + tools + parameters）
    │   ├── 处理 streaming 响应
    │   └── 解析 tool_calls 或 text content
    │
    ├──[3.2] 判断响应类型
    │   ├── 有 tool_calls → 进入工具执行
    │   └── 仅有 text → 任务完成，返回结果
    │
    ├──[3.3] 工具执行
    │   ├── 预处理（参数强制转换、安全检查）
    │   ├── 路由到 handle_function_call()
    │   ├── 并行执行判断（_should_parallelize_tool_batch）
    │   └── 将工具结果追加到 messages
    │
    ├──[3.4] 预算检查
    │   ├── iteration_budget.consume()
    │   └── 如果耗尽 → 注入预算警告，给模型最后一次机会
    │
    └──[3.5] 循环继续
    │
    ▼
[4] 后处理
    ├── 记忆 flush（满足条件时）
    ├── 会话日志写入
    └── 返回最终结果

4.2 IterationBudget 迭代预算

IterationBudget（第 170 行）是控制 Agent 行为的关键机制，防止无限循环：

# run_agent.py 第 170-212 行
class IterationBudget:
    """Thread-safe iteration counter for an agent."""

    def __init__(self, max_total: int):
        self.max_total = max_total
        self._used = 0
        self._lock = threading.Lock()  # 线程安全

    def consume(self) -> bool:
        """Try to consume one iteration. Returns True if allowed."""
        with self._lock:
            if self._used >= self.max_total:
                return False
            self._used += 1
            return True

    def refund(self) -> None:
        """Give back one iteration (e.g. for execute_code turns)."""
        with self._lock:
            if self._used > 0:
                self._used -= 1

设计要点：

线程安全：使用 threading.Lock() 保护计数器，因为子 Agent 的并行执行会从不同线程访问
独立预算：父 Agent 默认 90 次迭代，子 Agent 默认 50 次（通过 delegate_tool.py 的 DEFAULT_MAX_ITERATIONS）
退款机制：execute_code 的迭代可以退款，不计入预算消耗
优雅耗尽：预算耗尽时不会硬中断，而是注入一条提示消息，给模型最后一次机会总结

4.3 工具调用循环的并行优化

在主循环中，当模型返回多个 tool_calls 时，Hermes Agent 会判断是否可以并行执行：

# run_agent.py 第 219-231 行
_PARALLEL_SAFE_TOOLS = frozenset({
    "ha_get_state", "ha_list_entities", "ha_list_services",
    "read_file", "search_files", "session_search",
    "skill_view", "skills_list", "vision_analyze",
    "web_extract", "web_search",
})

并行执行的判断逻辑（_should_parallelize_tool_batch，第 267 行）：

如果只有 1 个工具调用，直接顺序执行
如果包含 _NEVER_PARALLEL_TOOLS（如 clarify），退化为顺序
路径作用域工具（read_file、write_file、patch）检查路径是否重叠
其余工具必须在 _PARALLEL_SAFE_TOOLS 白名单中
最多 8 个并发线程（_MAX_TOOL_WORKERS = 8）

5. Provider 路由机制

5.1 三种 API 模式

Hermes Agent 支持三种 LLM API 模式，在 AIAgent.__init__ 中自动检测：

# run_agent.py 第 645-661 行
if api_mode in {"chat_completions", "codex_responses", "anthropic_messages"}:
    self.api_mode = api_mode
elif self.provider == "openai-codex":
    self.api_mode = "codex_responses"
elif self.provider == "anthropic" or "api.anthropic.com" in self._base_url_lower:
    self.api_mode = "anthropic_messages"
elif self._base_url_lower.rstrip("/").endswith("/anthropic"):
    self.api_mode = "anthropic_messages"
else:
    self.api_mode = "chat_completions"

API 模式	协议	适用场景
`chat_completions`	OpenAI `/v1/chat/completions`	大多数兼容 API
`codex_responses`	OpenAI Responses API (`/v1/responses`)	GPT-5.x、Direct OpenAI
`anthropic_messages`	Anthropic Messages API	原生 Claude 接口

5.2 Provider 检测链

Provider 的自动检测遵循以下优先级：

1. 显式 api_mode 参数
2. provider == "openai-codex" → codex_responses
3. provider == "anthropic" 或 URL 包含 api.anthropic.com → anthropic_messages
4. URL 以 /anthropic 结尾（第三方兼容端点）→ anthropic_messages
5. 默认 → chat_completions

之后还有额外的自动升级逻辑：

# run_agent.py 第 678-682 行
# GPT-5.x 模型必须使用 Responses API
if self.api_mode == "chat_completions" and (
    self._is_direct_openai_url()
    or self._model_requires_responses_api(self.model)
):
    self.api_mode = "codex_responses"

5.3 Fallback 机制

当主 Provider 不可用时（限速、过载、连接失败），Hermes Agent 会自动切换到备用 Provider：

# run_agent.py 第 950-973 行
if isinstance(fallback_model, list):
    self._fallback_chain = [
        f for f in fallback_model
        if isinstance(f, dict) and f.get("provider") and f.get("model")
    ]
elif isinstance(fallback_model, dict):
    self._fallback_chain = [fallback_model]

Fallback 链支持列表形式，可以配置多个备用 Provider，按顺序尝试。在 run_conversation() 开始时会尝试恢复到主 Provider（_restore_primary_runtime()），给首选模型一次新的机会。

5.4 Prompt Caching 优化

对于 Anthropic Claude 模型（通过 OpenRouter 或原生 API），Hermes Agent 自动启用 Prompt Caching：

# run_agent.py 第 744-748 行
is_openrouter = self._is_openrouter_url()
is_claude = "claude" in self.model.lower()
is_native_anthropic = self.api_mode == "anthropic_messages" and self.provider == "anthropic"
self._use_prompt_caching = (is_openrouter and is_claude) or is_native_anthropic

Prompt Caching 可以将多轮对话的输入成本降低约 75%，通过缓存会话前缀实现。Gateway 模式下通过 _agent_cache 缓存 AIAgent 实例来维持缓存有效性。

6. 配置系统

6.1 配置文件层次

Hermes Agent 的配置系统采用分层设计：

优先级（从高到低）：
┌─────────────────────────────┐
│ 命令行参数（--model 等）       │  最高优先级
├─────────────────────────────┤
│ .env 文件                    │  API Key 和敏感信息
│   (~/.hermes/.env)           │
├─────────────────────────────┤
│ config.yaml                  │  结构化配置
│   (~/.hermes/config.yaml)    │
├─────────────────────────────┤
│ 代码默认值                    │  最低优先级
└─────────────────────────────┘

6.2 config.yaml 结构

hermes_cli/config.py 负责 config.yaml 的加载和管理。主要配置节：

# 模型配置
model:
  provider: openrouter          # 提供者
  default: anthropic/claude-sonnet-4-20250514  # 默认模型
  # 或使用字典形式支持多配置
  # provider: custom
  # base_url: http://localhost:8000/v1
  # api_key: sk-xxx

# 终端配置
terminal:
  backend: local                # local / docker / modal / ssh / singularity / daytona
  timeout: 120                  # 命令超时（秒）
  cwd: /home/user/projects     # 工作目录
  docker_image: python:3.12    # Docker 镜像
  persistent_shell: true       # 持久 shell

# 记忆配置
memory:
  memory_enabled: true          # 启用 MEMORY.md
  user_profile_enabled: true    # 启用 USER.md
  provider: honcho              # 外部记忆提供者（可选）

# 压缩配置
compression:
  enabled: true
  threshold: 0.85               # 触发压缩的 context 占比

# 子 Agent 配置
delegation:
  max_concurrent_children: 3    # 最大并行子 Agent
  max_iterations: 50            # 子 Agent 迭代上限

# Agent 行为
agent:
  max_turns: 90                 # 最大迭代次数
  gateway_timeout: 300          # Gateway 超时

# 安全配置
security:
  redact_secrets: true          # 日志中脱敏

# 时区
timezone: Asia/Shanghai

6.3 环境变量展开

config.yaml 支持 ${ENV_VAR} 语法引用环境变量：

# hermes_cli/config.py 中的 _expand_env_vars()
# 将 ${OPENAI_API_KEY} 展开为实际值

在 Gateway 启动时，config.yaml 的值会被桥接到环境变量（gateway/run.py 第 89-207 行），实现统一的 os.getenv() 访问。

6.4 Profile 系统

Profile 允许用户维护多套独立配置。每个 Profile 有自己的 ~/.hermes-<profile>/ 目录：

~/.hermes/              # 默认 Profile
  ├── config.yaml
  ├── .env
  ├── sessions/
  └── memory/
~/.hermes-work/          # work Profile
  ├── config.yaml
  ├── .env
  └── ...

Profile 切换通过 _apply_profile_override()（hermes_cli/main.py 第 83 行）在进程启动最早期完成。

7. 会话管理

7.1 SessionStore

SessionStore（gateway/session.py 第 495 行）管理所有 Gateway 会话的持久化：

class SessionStore:
    def __init__(self, sessions_dir: Path, config: GatewayConfig,
                 has_active_processes_fn=None):
        self.sessions_dir = sessions_dir
        self._entries: Dict[str, SessionEntry] = {}
        self._loaded = False
        self._lock = threading.Lock()
        self._db = None  # SQLite SessionDB

会话存储支持两种后端：

SQLite（hermes_state.SessionDB）：主要的会话元数据和消息存储
JSONL 文件：SQLite 不可用时的降级方案

7.2 SessionSource 与 SessionContext

SessionSource（第 67 行）描述消息的来源信息：

@dataclass
class SessionSource:
    platform: Platform           # 平台类型
    chat_id: str                 # 聊天 ID
    chat_name: Optional[str]     # 群组/频道名称
    chat_type: str               # "dm" / "group" / "channel" / "thread"
    user_id: Optional[str]       # 用户 ID
    user_name: Optional[str]     # 用户名
    thread_id: Optional[str]     # 论坛话题 / Discord 线程
    chat_topic: Optional[str]    # 频道主题

SessionContext（第 143 行）在此基础上增加了连接平台和 Home Channel 信息：

@dataclass
class SessionContext:
    source: SessionSource
    connected_platforms: List[Platform]
    home_channels: Dict[Platform, HomeChannel]
    session_key: str
    session_id: str

这些数据结构用于构建动态系统提示（build_session_context_prompt()，第 187 行），让 Agent 知道自己正在哪个平台、与谁对话、有哪些可用通道。

7.3 PII 脱敏

对于 WhatsApp、Signal、Telegram 等平台，用户 ID 可能包含敏感的个人信息（如手机号）。Hermes Agent 实现了 PII 脱敏机制：

# gateway/session.py 第 35-55 行
def _hash_id(value: str) -> str:
    """Deterministic 12-char hex hash of an identifier."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def _hash_sender_id(value: str) -> str:
    return f"user_{_hash_id(value)}"

脱敏规则：

_PII_SAFE_PLATFORMS（第 176 行）中的平台自动脱敏
Discord 排除在外（因为需要真实 ID 来 @mention 用户）
路由仍使用原始值，脱敏仅影响发送给 LLM 的系统提示

7.4 会话重置策略

GatewayRunner 会根据配置的重置策略（SessionResetPolicy）决定何时创建新会话：

per_message：每条消息创建新会话（无上下文延续）
per_conversation：基于超时的会话延续（默认）
never：永远不重置

当有活跃的后台进程时（通过 has_active_processes_fn 检查），会话不会被重置，以避免丢失正在运行的任务状态。

7.5 Agent 缓存与 Prompt Caching

Gateway 模式下，如果每条消息都创建新的 AIAgent 实例，会导致系统提示完全重建，破坏 Anthropic 的 Prompt Caching。为了优化这一点：

# gateway/run.py 第 578-582 行
self._agent_cache: Dict[str, tuple] = {}
self._agent_cache_lock = _threading.Lock()
# Key: session_key, Value: (AIAgent, config_signature_str)

缓存的键是 session_key，值包含 AIAgent 实例和配置签名。当配置发生变化时（如用户切换模型），缓存失效并重建。

调试指南

本节列出核心架构层面最常遇到的问题及排查方法。

Agent 无响应

症状：用户发送消息后长时间没有回复，Gateway 日志无新条目。

排查步骤：

# 查看 Gateway 最近 50 条日志
journalctl --user -u hermes-gateway -n 50

# 过滤错误级别日志
journalctl --user -u hermes-gateway -p err --since "10 minutes ago"

# 检查进程是否存活
systemctl --user status hermes-gateway

常见原因：LLM API 请求挂起（网络不通）、IterationBudget 已耗尽、审批队列阻塞。先确认进程状态，再看日志中最后一条活动的 agent:step 时间戳。

Provider 连接失败

症状：日志中出现 ConnectionError、401 Unauthorized 或 404 Not Found。

排查步骤：

# 检查 API Key 是否配置且有效
grep -E "API_KEY|api_key" ~/.hermes/.env

# 测试网络连通性（以 OpenAI 为例）
curl -s -o /dev/null -w "%{http_code}" \
     -H "Authorization: Bearer $OPENAI_API_KEY" \
     https://api.openai.com/v1/models
# 期望输出：200

# 检查 config.yaml 中的 base_url 配置
grep -A 3 "model:" ~/.hermes/config.yaml

常见原因：base_url 拼写错误、API Key 过期、代理/防火墙阻断、IPv6 连接问题（可设置 network.force_ipv4: true）。使用 hermes doctor 可自动检测部分问题。

IterationBudget 耗尽

症状：Agent 在多轮工具调用后突然停止，日志中出现 "iteration budget exhausted" 或类似提示。

排查步骤：

# 查看当前 max_turns 配置
grep -A 2 "agent:" ~/.hermes/config.yaml
# 或
grep "max_turns" ~/.hermes/config.yaml

# 搜索迭代日志（观察 "iteration X/90" 模式）
grep -E "iteration|budget|consume" ~/.hermes/logs/agent.log | tail -20

调优方法：在 config.yaml 的 agent.max_turns 中增大上限（默认 90）。如果是子 Agent 频繁耗尽预算，检查 delegation.max_iterations（默认 50）。同时审视 Agent 是否存在重复调用同一工具的死循环倾向。

配置桥接问题

症状：config.yaml 中的值没有生效，行为不符合预期。

排查步骤：

# 确认配置优先级：.env > config.yaml > 代码默认值
# 检查 .env 中是否覆盖了 config.yaml 的值
diff <(grep -E "^[A-Z_]+=" ~/.hermes/.env) <(echo "")

# 查看 Gateway 启动日志中的桥接信息
journalctl --user -u hermes-gateway | grep -i "bridge\|config\|override"

# 使用 hermes doctor 检查配置冲突
hermes doctor

常见原因：.env 文件中的值优先级高于 config.yaml，如果两边都设置了同一配置项（如 HERMES_MODEL），.env 的值会覆盖 config.yaml。确保没有拼写错误或多余的空格。修改 config.yaml 后需要重启 Gateway（或等待新会话生效）。

思考题

_apply_profile_override() 为什么必须在 import 语句之前执行？请列举至少两个会受影响的模块级常量。如果 Profile 覆盖失败（如 Profile 不存在），系统应该如何优雅降级？
Gateway 的配置桥接机制（gateway/run.py 第 89-207 行）为什么不直接让代码读取 config.yaml，而是桥接到环境变量？这种设计的优势和劣势是什么？
AIAgent.__init__ 约 800 行，这种"上帝构造函数"的设计有哪些潜在问题？如果要重构，你会如何拆分？
当 IterationBudget 耗尽时，Agent 不会硬中断，而是给模型"最后一次机会"。这种设计的考量是什么？在什么场景下可能导致问题？
分析 _agent_cache 的缓存策略。当 Gateway 长时间运行，大量不同会话的 Agent 被缓存时，会有什么内存问题？如何改进？
PII 脱敏使用确定性哈希（SHA-256 前 12 位）。这种方案有什么安全风险？如果攻击者获取了哈希值，能否还原原始 ID？

1. 整体架构图

2. 消息的完整旅程

步骤详解

① 飞书收到消息

② Gateway 路由

③ Hook 触发 (agent:start)

④ Agent 初始化

⑤ 系统提示词构建

⑥ 第一次 LLM 调用

⑦ LLM 返回 tool_calls

⑧ 工具执行

⑨ 结果注入

⑩ 迭代循环

⑪ Hook 触发 (agent:end)

⑫ 响应发送

⑬ 会话持久化

3. 三层架构详解

3.1 CLI 层 — hermes_cli/main.py

3.2 Gateway 层 — gateway/run.py

3.3 Agent 层 — run_agent.py

4. AIAgent 核心循环

4.1 run_conversation() 流程

4.2 IterationBudget 迭代预算

4.3 工具调用循环的并行优化

5. Provider 路由机制

5.1 三种 API 模式

5.2 Provider 检测链

5.3 Fallback 机制

5.4 Prompt Caching 优化

6. 配置系统

6.1 配置文件层次

6.2 config.yaml 结构

6.3 环境变量展开

6.4 Profile 系统

7. 会话管理

7.1 SessionStore

7.2 SessionSource 与 SessionContext

7.3 PII 脱敏

7.4 会话重置策略

7.5 Agent 缓存与 Prompt Caching

调试指南

Agent 无响应

Provider 连接失败

IterationBudget 耗尽

配置桥接问题

思考题

3.1 CLI 层 — `hermes_cli/main.py`

3.2 Gateway 层 — `gateway/run.py`

3.3 Agent 层 — `run_agent.py`