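[{"data":1,"prerenderedAt":277},["ShallowReactive",2],{"navigation":3,"post-\u002Fposts\u002F2025\u002Fhomelab-disaster-postmortem":20,"surroundPosts-\u002Fposts\u002F2025\u002Fhomelab-disaster-postmortem":264},[4,8,12,16],{"title":5,"path":6,"stem":7},"Home","\u002F","00.index",{"title":9,"path":10,"stem":11},"Posts","\u002Fposts","01.posts",{"title":13,"path":14,"stem":15},"Moments","\u002Fmoments","02.moments",{"title":17,"path":18,"stem":19},"About","\u002Fabout","09.about",{"id":21,"title":22,"body":23,"class":242,"cover":243,"coverSize":242,"date":244,"description":235,"draft":245,"extension":246,"hideComments":245,"location":242,"meta":247,"navigation":248,"path":249,"readingTime":250,"seo":255,"sitemap":256,"stem":257,"tags":258,"time":242,"weather":242,"__hash__":263},"posts\u002Fposts\u002F2025\u002F20250707.homelab-disaster-postmortem.md","Postmortem of a Disaster-Level HomeLab Incident",{"type":24,"value":25,"toc":234},"minimark",[26,30,148,151,197,200,220,223],[27,28,29],"h2",{"id":29},"Timeline",[31,32,33,41,47,53,59,65,71,77,83,89,95,100,106,112,118,124,130,136,142],"ul",{},[34,35,36,40],"li",{},[37,38,39],"strong",{},"2025-07-07 09:33",": Alert that the TP-LINK main router had come back online (reason for the last offline period: device reboot)",[34,42,43,46],{},[37,44,45],{},"2025-07-07 09:34",": Received the Synology email notification about an improper shutdown (receiving it means the Synology had already rebooted, so the actual reboot happened slightly earlier)",[34,48,49,52],{},[37,50,51],{},"2025-07-07 09:36",": Tried to log in to Synology DSM, found that the domain was not resolving correctly, and could not log in; tried to connect to the home PC remotely via ToDesk and found it offline (not powered on)",[34,54,55,58],{},[37,56,57],{},"2025-07-07 09:40",": Received a series of alerts from the Uptime Kuma monitoring service; multiple services were unavailable",[34,60,61,64],{},[37,62,63],{},"2025-07-07 09:52",": Checked the main router remotely through the TP-LINK commercial cloud platform and could connect, but because it had earlier been reconfigured for IPTV as a sub-router behind the optical modem (not bridged), the public IP was not visible; tried to look up the public IP through China Telecom's 小翼管家 (Xiaoyi Guanjia) app and found no place to view it",[34,66,67,70],{},[37,68,69],{},"2025-07-07 09:54",": Tried remote access through Synology QuickConnect and found that I had disabled it earlier",[34,72,73,76],{},[37,74,75],{},"2025-07-07 10:30",": Reviewed the code of my self-written bots service (which includes the DDNS feature): when a request fails it applies a backoff strategy, sleeping 1 minute after the first failure, 10 minutes after the next, and 1 hour after the one after that; decided to wait another hour and see",[34,78,79,82],{},[37,80,81],{},"2025-07-07 11:00",": Tried to wake the home PC with the Wake-on-LAN feature on the TP-LINK main router's admin page; it would not wake (I later found that the NIC MAC address I had on record was wrong)",[34,84,85,88],{},[37,86,87],{},"2025-07-07 11:30",": Power-cycled the desk smart plug through Mi Home to try to wake the PC, without success; considered power-cycling the rack's smart plug to restart every service, but decided to wait a bit longer to see whether the bots DDNS would take effect",[34,90,91,94],{},[37,92,93],{},"2025-07-07 11:44",": After waiting more than 2 hours, it seemed the bots service was probably no longer running and further waiting was pointless; after careful deliberation, decided to power-cycle the entire rack",[34,96,97,99],{},[37,98,93],{},": Turned off the smart plug through Mi Home and saw that its state did not update; tapped again and the operation failed. At that point the smart plug was already offline, and I realized that once the rack loses power, none of the Mi Home devices can be controlled either, so the plug could no longer be turned back on",[34,101,102,105],{},[37,103,104],{},"2025-07-07 11:50",": Set off for home to restart the rack power manually",[34,107,108,111],{},[37,109,110],{},"2025-07-07 12:49",": Arrived home and manually switched the rack's smart plug back on",[34,113,114,117],{},[37,115,116],{},"2025-07-07 12:50",": Powered on the PC and found that PCI-E device wake-up was Enabled in the motherboard settings",[34,119,120,123],{},[37,121,122],{},"2025-07-07 12:51",": Booted into the PC's OS and found that the NIC's option to let the device wake the computer was also enabled, but the NIC's MAC address differed from the one configured earlier (explained below)",[34,125,126,129],{},[37,127,128],{},"2025-07-07 12:53",": Logged in to portainer over the LAN from the PC and found the bots container in the stopped state (Stopped for 3 hours with exit code 127), with a finished time of 09:33:52",[34,131,132,135],{},[37,133,134],{},"2025-07-07 12:54",": Manually restarted the bots container; it started normally",[34,137,138,141],{},[37,139,140],{},"2025-07-07 12:55",": The bots service updated the DNS record as expected; switched my phone to cellular data to test, and access worked again",[34,143,144,147],{},[37,145,146],{},"2025-07-07 13:01",": Left to hurry back to the office",[27,149,150],{"id":150},"Causes",[31,152,153,159,165,171,191],{},[34,154,155,158],{},[37,156,157],{},"Trigger",": An unexpected power outage at home (the TP-LINK router and the Synology are both in the rack and rebooted at the same time, so the rack must have lost power; the optical modem sits in the low-voltage box, and its boot time shows it rebooted at the same moment, which indicates the whole home lost power)",[34,160,161,164],{},[37,162,163],{},"Direct cause",": After the optical modem rebooted and the public IP changed, the self-hosted DDNS service did not update the DNS record, so no service could be reached remotely",[34,166,167,170],{},[37,168,169],{},"Root cause",": The bots container, which includes the DDNS service, failed to restart after the host rebooted. Analysis showed that the container mounts a directory on the Synology at startup, used for updating the clash configuration file, but the Synology boots more slowly than the host running the bots container, which may have caused the startup failure",[34,172,173,176,177],{},[37,174,175],{},"Why recovery was slow (multiple fallback measures failed)",":\n",[31,178,179,182,185,188],{},[34,180,181],{},"The home PC was powered off, so I could not connect remotely via ToDesk to fix things (several similar problems in the past were fixed remotely via ToDesk)",[34,183,184],{},"Nobody was home to power on the PC by hand",[34,186,187],{},"The PC's remote wake-up failed because the recorded NIC MAC address was wrong: what I had recorded was the MAC of a virtual NIC; I later removed the virtual NIC and used the physical NIC directly, but forgot to record its MAC address",[34,189,190],{},"Synology QuickConnect remote access was unavailable, because I had manually disabled it earlier thinking I would never need it",[34,192,193,196],{},[37,194,195],{},"Why the incident escalated",": Because several fallback options had failed, I tried to recover by power-cycling the rack; that cut power to every device and eliminated any chance of remote recovery",[27,198,199],{"id":199},"Improvements",[31,201,202,205,208,211,214,217],{},[34,203,204],{},"✅ Buy a UPS so the rack keeps running through short outages and unexpected power loss no longer interrupts services (07-08 update: purchased a SANTAK TG-BOX850 UPS)",[34,206,207],{},"✅ Treat DDNS as a core service: split it out of the bots project and reduce its other dependencies (07-08 update: done)",[34,209,210],{},"✅ Enable Synology QuickConnect, so that when DDNS fails I can still connect to the Synology and take action",[34,212,213],{},"✅ Make sure the PC's Wake-on-LAN works, so I can connect to the PC remotely to fix problems",[34,215,216],{},"✅ Deploy a Cloudflare Tunnel container as a fallback for when DDNS fails",[34,218,219],{},"✅ Remove the rack's smart plug from the Mi Home app home screen to avoid switching it off by accident; lesson learned: never cut power to the rack again",[27,221,222],{"id":222},"Lessons",[31,224,225,228,231],{},[34,226,227],{},"There was an earlier incident in which DDNS became unavailable after a rack power loss and nothing could be reached. Back then I connected to the PC via ToDesk and restarted the bots service over the LAN, which solved it, but I should have gone a step further and found out why the bots service did not restart automatically; that would have prevented this incident",[34,229,230],{},"Core services need high availability. For public access, for example, the self-hosted DDNS should be backed up by other means such as QuickConnect and Cloudflare Tunnel",[34,232,233],{},"Never power off the entire rack under any circumstances; try other recovery measures first",{"title":235,"searchDepth":236,"depth":236,"links":237},"",2,[238,239,240,241],{"id":29,"depth":236,"text":29},{"id":150,"depth":236,"text":150},{"id":199,"depth":236,"text":199},{"id":222,"depth":236,"text":222},null,"jpg","2025-07-07",false,"md",{},true,"\u002Fposts\u002F2025\u002Fhomelab-disaster-postmortem",{"text":251,"minutes":252,"time":253,"words":254},"8 min read",7.38,442800,1476,{"title":22,"description":235},{"loc":249,"lastmod":244},"posts\u002F2025\u002F20250707.homelab-disaster-postmortem",[259,260,261,262],"Technology","HomeLab","Operations","Postmortem","ICyfDes8hfks9c7nlNUEMy8kuWktGhB5Iu1TriECazU",[265,271],{"title":266,"path":267,"stem":268,"date":269,"description":270,"children":-1},"Deleting a Custom Service from the Synology Certificate Settings","\u002Fposts\u002F2025\u002Fdelete-service-of-synology","posts\u002F2025\u002F20250708.delete-service-of-synology","2025-07-08","Today I was adding one of the Synology's built-in DDNS services so that there is still a fallback when the self-hosted DDNS goes down.",{"title":272,"path":273,"stem":274,"date":275,"description":276,"children":-1},"Thoughts on When to Liberate Taiwan","\u002Fposts\u002F2025\u002Fthoughts-on-when-to-liberate-taiwan","posts\u002F2025\u002F20250401.thoughts-on-when-to-liberate-taiwan","2025-04-01","First thing this morning I saw the news that the Eastern Theater Command is again holding military exercises around Taiwan; people have grown used to these 'boiling frog' style drills.",1777580264167]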