Sunday, March 15, 2015

supervisord on CentOS shuts down without warning (crash with system time change - fixed)

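I have a Linux program of my own running on CentOS 6.6 x86_64.
It is not well written, and it may crash when it hits an error at runtime.
For certain reasons, though, it has to come back up again.
So I installed supervisord 3.1.3 on CentOS 6.6 x86_64 to monitor the daemon
and to respawn it automatically whenever necessary.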

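Articles online say that supervisord is billed as the only tool of this kind that does not need to seize top-priority control of the system.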

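The installation was simple: I only wrote my program's path and name into the config file.
With almost everything left at its default value, supervisord started to run as expected.
I thought it really was going to be that simple and pleasant.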

On the second day I logged in to check on things....

My program was still hanging on. Not bad.

Huh~ supervisord was missing. It...... had actually died!
It had crashed with no hint at all. It was just gone!


After combing through the major forums, I finally found a thread where someone reported this symptom:
their Linux clock drifted often, and after ntpdate adjusted the system time, supervisord died without warning.

I tested this on my own host, using the 'date' command to manually turn the system time back by some hours. Sure enough, supervisord quit right after the change.
Oops! Same here: crash... crash... and crash.....

After more determined searching and three more days of testing, I confirmed that this fix works.
Luckily, the solution came from the honorable "Mook-as", and his/her solution tested fine on my system:

 Use monotonic time for process state tracking by Mook-as

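The cause is roughly that the way supervisord handles time breaks once the system time moves backward
(for example, a routine checking whether it is time yet to run some task finds that it has traveled back into the past).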
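
To make that failure mode concrete, here is a minimal sketch of a "is it time yet?" check driven by wall-clock readings. The function names and the numbers are hypothetical and are not taken from supervisord's source; the clock readings are passed in explicitly so that the backward jump can be simulated without touching the real system time.

```python
def seconds_waited(started_at, now):
    """Return the apparent waiting time between two clock readings (seconds)."""
    return now - started_at


def is_due(started_at, now, delay):
    """True once at least `delay` seconds separate the two readings."""
    return seconds_waited(started_at, now) >= delay


# Simulated wall-clock readings in seconds since the epoch (hypothetical values).
started_at = 1426400000.0          # a job is scheduled here, due 10 seconds later
after_15s = started_at + 15.0      # normal case: 15 seconds have passed
after_reset = after_15s - 3600.0   # same moment, but the clock was set back 1 hour

print(is_due(started_at, after_15s, 10))         # the job is due
print(seconds_waited(started_at, after_reset))   # a negative "elapsed" time
print(is_due(started_at, after_reset, 10))       # the job no longer looks due
```

Any code that assumes the waiting time can never be negative is in trouble at that point, which is the general shape of the problem described above.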

Replacing every time.time() with monotonic_time() takes care of it.
For those having the same problem as I did: apply the patch (I typed the changes in by hand) to replace time.time(), and it should work.
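
As a rough illustration of the idea behind the patch (not the patch itself), the sketch below measures elapsed time with a monotonic clock instead of time.time(). It assumes Python 3.3 or later, where the standard library provides time.monotonic(); the patch's own monotonic_time() helper is not reproduced here, and the class name Stopwatch is hypothetical.

```python
import time


class Stopwatch(object):
    """Tracks elapsed time with a clock that is not affected by system time changes."""

    def __init__(self, clock=time.monotonic):
        # time.monotonic() cannot go backward, unlike time.time().
        self._clock = clock
        self._started_at = clock()

    def elapsed(self):
        """Seconds since the stopwatch was created; never negative."""
        return self._clock() - self._started_at


watch = Stopwatch()
time.sleep(0.05)
print(watch.elapsed() >= 0.0)
```

Because the reading only ever moves forward, running 'date' or ntpdate while the stopwatch is active should not produce a negative elapsed value.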

The link is the title above.

If you have the same need, please go ahead and patch your .py files yourself.
Thank you, Mook-as!

cheers~