Watchdog
Last updated September 23, 2018 by olivernyc
Wireless networks have a bit of a reputation for instability. Modern routers have fixed most of the hardware problems, but the firmware still needs some work to be reliable. You can do this with “watchdog” scripts. I haven’t had to reboot a router that is running our watchdog script.
Our firmware image (based on qMp) comes with a “bmx6health” script that checks whether the mesh software is running correctly and restarts it if necessary. By default this script runs once per day, but I’ve found it better to run it every 5 minutes. You can change this by editing the crontab.
ssh into the router and, in the terminal, run:
crontab -e
This opens a vi editor and you can change or add different scripts to run at different times. (The vi commands you need are “i” to insert, “esc” to stop editing, and “:x” to save and eXit.)
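If you would rather not edit the schedule by hand, the change can be sketched as a small filter. This is only an illustration: the function name reschedule_bmx6health is made up here, and it assumes the existing crontab entry mentions "bmx6health" somewhere in its command. Check crontab -l on your router for the real entry first.

```shell
#!/bin/sh
# Sketch: read crontab text on stdin and print it back with every active
# (non-comment) line that mentions bmx6health rescheduled to run every
# 5 minutes. The function name is hypothetical, not part of qMp.
reschedule_bmx6health() {
    awk '
        /bmx6health/ && $1 !~ /^#/ {
            # Overwrite the five schedule fields. The command is kept,
            # but runs of spaces in the line collapse to single spaces.
            $1 = "*/5"; $2 = "*"; $3 = "*"; $4 = "*"; $5 = "*"
        }
        { print }
    '
}
```

To preview the result, run crontab -l | reschedule_bmx6health. If your crontab accepts "-" for standard input, piping that output into crontab - would install it.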
The main purpose of some nodes is to be an internet gateway. To ensure that they always try to be online, you can add a watchdog script that pings a known website and calls “network restart” if it fails. These kinds of scripts often ping 8.8.8.8, which is one of Google’s public DNS servers.
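A minimal sketch of such a gateway check follows. PING_CMD and RESTART_CMD are hypothetical hooks added for illustration, so the logic can be tried without touching a live router. The full watchdog further down does considerably more.

```shell
#!/bin/sh
# Sketch of a minimal gateway watchdog. PING_CMD and RESTART_CMD are
# hypothetical override hooks, not part of qMp or OpenWrt.
PING_CMD="${PING_CMD:-ping}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/network restart}"

# Succeeds if a single ping to the given host gets a reply.
wan_ok() {
    $PING_CMD -c 1 "$1" >/dev/null 2>&1
}

# Restart the network only when the ping fails.
# The host defaults to 8.8.8.8 when no argument is given.
gateway_watchdog() {
    if wan_ok "${1:-8.8.8.8}"; then
        echo "wan:ok"
    else
        echo "wan:down, restarting network"
        $RESTART_CMD
    fi
}
```

Calling gateway_watchdog with no argument checks 8.8.8.8. Passing a host name instead also exercises DNS.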
I’ve discovered 3 ways to recover a qMp mesh router that has functioning wifi but has lost internet: /etc/init.d/network restart, /etc/init.d/bmx6 restart, and restarting dnsmasq with killall dnsmasq; /etc/init.d/dnsmasq start. Sometimes the DNS forwarder, dnsmasq, will stop working correctly, letting you ping some things and not others. dnsmasq will then forward bad DNS info to the other routers too, so it needs to be fixed quickly! killall dnsmasq; /etc/init.d/dnsmasq start will fix it.
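One way to tell the dnsmasq case apart from a lost uplink is to ping once by IP address and once by name. The sketch below works under that assumption. The names wan_state and PING_CMD are invented here, and a failed name lookup only suggests a DNS problem; it does not prove dnsmasq is at fault.

```shell
#!/bin/sh
# Sketch: classify the uplink by pinging one address by IP and one by
# name. PING_CMD is a hypothetical hook so the logic can be tested.
PING_CMD="${PING_CMD:-ping}"

# Prints "down" (no reply by IP), "dns" (IP reachable but the name
# fails) or "ok" (both succeed).
wan_state() {
    if ! $PING_CMD -c 1 8.8.8.8 >/dev/null 2>&1; then
        echo "down"
    elif ! $PING_CMD -c 1 google.com >/dev/null 2>&1; then
        echo "dns"
    else
        echo "ok"
    fi
}

# Possible use on a router (left commented out on purpose):
# case "$(wan_state)" in
#     dns)  killall dnsmasq; /etc/init.d/dnsmasq start ;;
#     down) /etc/init.d/network restart ;;
# esac
```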
gwck is a qMp utility. The watchdog restarts it after a network restart, if it is enabled.
Another problem I’ve had occasionally is that the wifi will lose connections. Even though the radio is on and the router lights are normal you can’t connect. I’ve written a simple script to restart wifi if both the ad-hoc and access point interfaces have no connections. It is a bit of a hack since the interface may be ok, but since nothing is connected via wifi it doesn’t hurt too much to restart it. I’ve also found that a network restart is necessary to make the wifi stable.
By default wlan0 is the ad-hoc interface that is used to mesh the routers and wlan0ap is the access point. The script counts the wireless interfaces, so it works with dual-band routers and with routers that are ad-hoc only or AP only.
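The counting logic can be sketched on its own as a filter over iwinfo output. Like the watchdog, it assumes one ESSID line per wireless interface and a "Signal: unknown" line for each interface with no connection. The name wifi_state is hypothetical.

```shell
#!/bin/sh
# Sketch: decide from `iwinfo` output (read on stdin) whether every
# wireless interface is without a signal.
# Prints "disabled", "restart" or "ok".
wifi_state() {
    awk '
        BEGIN             { wlans = 0; nosignal = 0 }
        /ESSID/           { wlans++ }     # one ESSID line per interface
        /Signal: unknown/ { nosignal++ }  # interface with no connection
        END {
            if (wlans == 0)             print "disabled"
            else if (wlans == nosignal) print "restart"
            else                        print "ok"
        }
    '
}
```

On a router this would be run as iwinfo | wifi_state. A result of "restart" means every interface reported no signal.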
I’m using “Signal: unknown” to show there is no connection. It seems to work reliably. You could also try iwinfo wlan0 assoclist.
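For the assoclist alternative, a rough sketch might look like this. It assumes that each connected station appears on a line starting with its MAC address, which you should verify against the output on your own router. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: succeed if `iwinfo <iface> assoclist` output (read on stdin)
# lists at least one station. Assumes each station line begins with a
# MAC address such as AA:BB:CC:DD:EE:FF (an assumption, not verified
# against every iwinfo build).
assoclist_has_stations() {
    grep -Eq '^[[:space:]]*([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}'
}
```

For example: iwinfo wlan0 assoclist | assoclist_has_stations || echo "wlan0: no stations"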
“sleep 5” is usual between “wifi down” and “wifi up”. I’ve found it not necessary when there are no connections, but I’ll leave it there in case.
You can download the watchdog here
In the terminal:
vi /root/mesh-watchdog.sh
and paste this:
#!/bin/sh
# mesh-watchdog v1.1.1, NYC Mesh, Brian Hall
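# bring wifi down, pause, then bring it back up
restartWifi()
{
    wifi down
    sleep 5
    wifi up
}
# restart the network, then the services that depend on it
restartNetwork()
{
    /etc/init.d/network restart
    # restart gwck only if its init script is enabled
    if /etc/init.d/gwck enabled; then
        /etc/init.d/gwck restart
    fi
    /etc/init.d/bmx6 restart
    sleep 4
    killall dnsmasq
    /etc/init.d/dnsmasq start
}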
#gets date-time from log and exit if recently run. date-time is first two words of last line
exitIfRecentRestart()
{
    if [ -e $LOG ]; then
        set -- `tail -1 $LOG`
        LASTRUN=`date --date="$1 $2" +%s`
        if [ "$?" = "0" ]; then
            #don't run for 1200s (20 minutes)
            NEXTRUN=$(($LASTRUN + 1200))
            NOW=`date +%s`
            es=$(($NOW - $LASTRUN))
            printf "time since last restartNetwork: "
            printf '%dd %dh:%dm:%ds\n' $(($es/86400)) $(($es%86400/3600)) $(($es%3600/60)) $(($es%60))
            if [ $NOW -lt $NEXTRUN ]; then
                echo "waiting $(($NEXTRUN - $NOW)) seconds, use option -f to force"
                exit 1
            else
                echo "run tests-"
            fi
        else
            echo "invalid date from log, run tests-"
        fi
    else
        echo "no log, run tests-"
    fi
}
LOG="/tmp/log/mesh-watchdog.log"
FORCE=0
if [ "$1" = "-n" ]; then
    echo "restartNetwork"
    restartNetwork
    exit 1
elif [ "$1" = "-f" ]; then
    echo "force tests-"
    FORCE=1
elif [ "$1" = "-w" ]; then
    echo "restartWifi"
    restartWifi
    exit 1
elif [ "$1" = "-b" ]; then
    echo "restart wifi, wait, restart network"
    restartWifi; sleep 60; restartNetwork
    exit 1
elif [ "$1" != "" ]; then
    echo -e "Usage: `basename $0` [OPTION]\n\nTests wifi and internet connections and restarts if necessary (default)\n\n\t-f\tforce test\n\t-n\trestart network\n\t-w\trestart wifi\n\t-b\trestart both wifi and network\n\t-h\toptions\n"
    exit 1
fi
if [ $FORCE != 1 ]; then
    exitIfRecentRestart
fi
DATE=`date +%Y-%m-%d\ %H:%M:%S`
IWINFO=`iwinfo`
# find lines containing "ESSID"|get interface name (first word)|replace newlines with ", "
WI=`echo "$IWINFO" | grep ESSID | grep -Eo '^[^ ]+' | sed ':a;N;$!ba;s/\n/, /g'`
# count the number of wlan interfaces, and number of wlans with 'no signal'
WLAN=`echo "$WI" | wc -w`
NOSIGNAL=`echo "$IWINFO" | grep 'Signal: unknown' | wc -l`
if [ $WLAN -eq 0 ]; then
    echo "no wlan interfaces, wifi is probably disabled"
elif [ $WLAN -eq $NOSIGNAL ]; then
    # all wlan interfaces are down, so restart wifi
    echo "$DATE restart wifi- wlans:$WLAN no-signal:$NOSIGNAL interfaces:$WI" | tee -a $LOG
    restartWifi
    sleep 60
    restartNetwork
    exit 1
else
    echo "wifi:ok wlans:$WLAN no-signal:$NOSIGNAL interfaces:$WI"
fi
# restart network if pinging google.com and 8.8.8.8 fails 4 times in a row
count=1
while [ "$count" -le 4 ]
do
    if /bin/ping -c 1 google.com >/dev/null && /bin/ping -c 1 8.8.8.8 >/dev/null; then
        echo "wan:ok ping-count:$count"
        exit 0
    fi
    count=$((count + 1))
done
echo "$DATE network restart" | tee -a $LOG
restartNetwork
Make it executable:
chmod +x /root/mesh-watchdog.sh
Afterwards, add the following entry with crontab -e:
* * * * * /root/mesh-watchdog.sh
It can run once a minute because it checks the log for a recent restart and waits 20 minutes before restarting again. I added the 20-minute delay so that a router without an internet gateway stays functional instead of restarting over and over.
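The 20-minute wait comes down to simple arithmetic on epoch seconds, sketched here in isolation. The names COOLDOWN and seconds_to_wait are illustrative and do not appear in the watchdog itself.

```shell
#!/bin/sh
# Sketch of the cooldown arithmetic. Both arguments are epoch seconds
# (as printed by `date +%s`); the names here are hypothetical.
COOLDOWN=1200   # 20 minutes, the same delay the watchdog uses

# Prints how many seconds remain before another restart is allowed,
# or 0 when the cooldown has already passed.
seconds_to_wait() {
    last=$1
    now=$2
    remaining=$((last + COOLDOWN - now))
    [ "$remaining" -gt 0 ] || remaining=0
    echo "$remaining"
}
```

For example, seconds_to_wait 1000 1300 prints 900: the last restart was 300 seconds ago, so 900 of the 1200 seconds remain.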
Thanks to Nitin for help with the wifi problem and Zach for help with dnsmasq.
Email me if you have any questions or suggestions.