Station offline troubleshooting

Hi guys,

I've recently left the testing phase, as everything was working well. I've set up the auto-scheduler and it ran just fine. This morning I received an email that my station is offline. Once I rebooted it, everything went back online; however, the logs seem a bit concerning, so I would appreciate some feedback.

Long post incoming! Here’s what daemon.log was showing before I rebooted my station:

Nov 29 08:53:41 satnogs satnogs-client[483]: OperationalError: (sqlite3.OperationalError) database or disk is full [SQL: u'UPDATE apscheduler_jobs SET next_run_time=?, job_state=? WHERE apscheduler_jobs.id = ?'] [parameters: (1575014081.323102, <read-only buffer for 0x71da8640, size -1, offset 0 at 0x6c44db00>, '7fcca4cba1da419b8d8f7a6ce1c1b117')] (Background on this error at: Error Messages — SQLAlchemy 2.0 Documentation)
Nov 29 08:53:41 satnogs satnogs-client[483]: 2019-11-29 08:53:41,667 - apscheduler.executors.default - ERROR - Job "get_jobs (trigger: interval[0:01:00], next run at: 2019-11-29 07:54:41 UTC)" raised an exception
Nov 29 08:53:41 satnogs satnogs-client[483]: Traceback (most recent call last):
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/apscheduler/executors/base.py", line 125, in run_job
Nov 29 08:53:41 satnogs satnogs-client[483]: retval = job.func(*job.args, **job.kwargs)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/scheduler/tasks.py", line 207, in get_jobs
Nov 29 08:53:41 satnogs satnogs-client[483]: replace_existing=True)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/apscheduler/schedulers/base.py", line 439, in add_job
Nov 29 08:53:41 satnogs satnogs-client[483]: self._real_add_job(job, jobstore, replace_existing)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/apscheduler/schedulers/base.py", line 867, in _real_add_job
Nov 29 08:53:41 satnogs satnogs-client[483]: store.update_job(job)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/apscheduler/jobstores/sqlalchemy.py", line 107, in update_job
Nov 29 08:53:41 satnogs satnogs-client[483]: result = self.engine.execute(update)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2144, in execute
Nov 29 08:53:41 satnogs satnogs-client[483]: return connection.execute(statement, *multiparams, **params)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 980, in execute
Nov 29 08:53:41 satnogs satnogs-client[483]: return meth(self, multiparams, params)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 273, in _execute_on_connection
Nov 29 08:53:41 satnogs satnogs-client[483]: return connection._execute_clauseelement(self, multiparams, params)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1099, in _execute_clauseelement
Nov 29 08:53:41 satnogs satnogs-client[483]: distilled_params,
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1240, in _execute_context
Nov 29 08:53:41 satnogs satnogs-client[483]: e, statement, parameters, cursor, context
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1458, in _handle_dbapi_exception
Nov 29 08:53:41 satnogs satnogs-client[483]: util.raise_from_cause(sqlalchemy_exception, exc_info)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 296, in raise_from_cause
Nov 29 08:53:41 satnogs satnogs-client[483]: reraise(type(exception), exception, tb=exc_tb, cause=cause)
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
Nov 29 08:53:41 satnogs satnogs-client[483]: cursor, statement, parameters, context
Nov 29 08:53:41 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 536, in do_execute
Nov 29 08:53:41 satnogs satnogs-client[483]: cursor.execute(statement, parameters)
Nov 29 08:53:41 satnogs satnogs-client[483]: OperationalError: (sqlite3.OperationalError) database or disk is full [SQL: u'UPDATE apscheduler_jobs SET next_run_time=?, job_state=? WHERE apscheduler_jobs.id = ?'] [parameters: (1575123236.0, <read-only buffer for 0x6dc04ec8, size -1, offset 0 at 0x6c466e20>, '1281378')] (Background on this error at: Error Messages — SQLAlchemy 2.0 Documentation)
Nov 29 08:54:04 satnogs satnogs-client[483]: 2019-11-29 08:54:04,185 - satnogsclient - INFO - Closing socket: <socket._socketobject object at 0x6e888a40>
Nov 29 08:54:22 satnogs satnogs-client[483]: Found 1 device(s):
Nov 29 08:54:22 satnogs satnogs-client[483]: 0: Realtek, RTL2838UHIDIR, SN: 00000001
Nov 29 08:54:22 satnogs satnogs-client[483]: Using device 0: Generic RTL2832U OEM
Nov 29 08:54:22 satnogs satnogs-client[483]: usb_claim_interface error -6
Nov 29 08:54:22 satnogs satnogs-client[483]: 2019-11-29 08:54:22,751 - apscheduler.executors.default - ERROR - Job "spawn_observer (trigger: date[2019-11-29 07:43:52 UTC], next run at: 2019-11-29 07:43:52 UTC)" raised an exception
Nov 29 08:54:22 satnogs satnogs-client[483]: Traceback (most recent call last):
Nov 29 08:54:22 satnogs satnogs-client[483]: File "/usr/lib/python2.7/dist-packages/apscheduler/executors/base.py", line 125, in run_job
Nov 29 08:54:22 satnogs satnogs-client[483]: retval = job.func(*job.args, **job.kwargs)
Nov 29 08:54:22 satnogs satnogs-client[483]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/scheduler/tasks.py", line 110, in spawn_observer
Nov 29 08:54:22 satnogs satnogs-client[483]: observer.observe()
Nov 29 08:54:22 satnogs satnogs-client[483]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/observer/observer.py", line 301, in observe
Nov 29 08:54:22 satnogs satnogs-client[483]: self.plot_waterfall()
Nov 29 08:54:22 satnogs satnogs-client[483]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/observer/observer.py", line 390, in plot_waterfall
Nov 29 08:54:22 satnogs satnogs-client[483]: data = np.fromfile(wf_file, dtype='float32').reshape(-1, nchan+1)

Notice the first line: "database or disk is full". There's around 11 GB of free space, so I doubt it's actually a disk space problem.

There's also an error from the SDR; I'm not sure why that happens all of a sudden:

usb_claim_interface error -6

Then there's this, which keeps repeating throughout the log every time the scheduler runs:

Nov 29 20:49:05 satnogs satnogs-client[6348]: 2019-11-29 20:49:05,103 - satnogsclient - ERROR - Cannot connect to socket 127.0.0.1:4533
Nov 29 20:49:05 satnogs satnogs-client[6348]: Exception in thread Thread-57:
Nov 29 20:49:05 satnogs satnogs-client[6348]: Traceback (most recent call last):
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
Nov 29 20:49:05 satnogs satnogs-client[6348]: self.run()
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/usr/lib/python2.7/threading.py", line 754, in run
Nov 29 20:49:05 satnogs satnogs-client[6348]: self.__target(*self.__args, **self.__kwargs)
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/observer/worker.py", line 142, in _status_interface
Nov 29 20:49:05 satnogs satnogs-client[6348]: sock.bind()
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/observer/commsocket.py", line 104, in bind
Nov 29 20:49:05 satnogs satnogs-client[6348]: self.s.bind((self._TCP_IP, self._TCP_PORT))
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/usr/lib/python2.7/socket.py", line 228, in meth
Nov 29 20:49:05 satnogs satnogs-client[6348]: return getattr(self._sock,name)(*args)
Nov 29 20:49:05 satnogs satnogs-client[6348]: error: [Errno 98] Address already in use
Nov 29 20:49:05 satnogs satnogs-client[6348]: 2019-11-29 20:49:05,130 - satnogsclient - ERROR - Cannot connect to socket 127.0.0.1:4533
Nov 29 20:49:05 satnogs satnogs-client[6348]: Exception in thread Thread-56:
Nov 29 20:49:05 satnogs satnogs-client[6348]: Traceback (most recent call last):
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
Nov 29 20:49:05 satnogs satnogs-client[6348]: self.run()
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/usr/lib/python2.7/threading.py", line 754, in run
Nov 29 20:49:05 satnogs satnogs-client[6348]: self.__target(*self.__args, **self.__kwargs)
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/observer/worker.py", line 134, in _communicate_tracking_info
Nov 29 20:49:05 satnogs satnogs-client[6348]: self.send_to_socket(p, sock)
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/observer/worker.py", line 240, in send_to_socket
Nov 29 20:49:05 satnogs satnogs-client[6348]: position = sock.send("p\n").split('\n')
Nov 29 20:49:05 satnogs satnogs-client[6348]: File "/var/lib/satnogs/local/lib/python2.7/site-packages/satnogsclient/observer/commsocket.py", line 76, in send
Nov 29 20:49:05 satnogs satnogs-client[6348]: self.s.send(message)
Nov 29 20:49:05 satnogs satnogs-client[6348]: error: [Errno 32] Broken pipe

This also keeps showing up in logs:

satnogsclient - ERROR - Cannot connect to socket 127.0.0.1:4533

According to the troubleshooting page, it's related to the rotator, which I don't use. I have the rotator set to NO in satnogs-setup.

For now, I've disabled the auto-scheduler, as this seems to be related to observation scheduling. I can send the complete logs to the devs on request.

Any help would be greatly appreciated.

Thanks!

The "database or disk is full" error is the thing you need to sort out. Perhaps the client is writing to a partition that is filling up. The output location defaults to /tmp on the Pi, so maybe you're filling that partition?

As for this part:
satnogsclient - ERROR - Cannot connect to socket 127.0.0.1:4533

You can ignore that. Even if you have the rotator set to NO in the config, the client still tries to connect to it, and that's what a lot of those Python errors are about.
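
If you are curious whether anything is actually listening on 127.0.0.1:4533 (rotctld normally would be), something like this will show it (a generic Linux command, not SatNOGS-specific):

sudo ss -tlnp | grep 4533

But again, those rotator errors are not what is taking your station offline.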

I’d suggest trying to figure out what is up with the disk first. If it is still broken after that, needinfo…

By default the SQLite jobs database is here: /tmp/.satnogs/jobs.sqlite
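
If you want to see what is actually taking up the space in there, something like this should do it (paths assume the default client layout):

du -sh /tmp/.satnogs/*
ls -lh /tmp/.satnogs/jobs.sqlite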

Just to be accurate, /tmp is mounted in the RPi's RAM (as tmpfs). We use RAM instead of the SD card because there are many reads/writes, which can corrupt the SD card after a couple of weeks or months of use. Anyway, the place where the client saves the generated data can be changed in satnogs-setup.
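
You can confirm how /tmp is mounted and how big the tmpfs is with, for example:

findmnt /tmp

(a generic util-linux command, not part of the SatNOGS tooling).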

@ivor, if that still happens, please check the directory /tmp/.satnogs/data and see if there are any files in there.

Thanks for the response! Since starting this thread I've re-enabled the auto-scheduler and it happened again, so it appears to happen a few days after I enable the scheduler.

/tmp/.satnogs/data has two folders, files and incomplete. It also contains two files:

  • receiving_satnogs_1322401_2019-12-05T23-51-41.out
  • receiving_waterfall_1322401_2019-12-05T23-51-41.dat

I've disabled the auto-scheduler again for now. Maybe there's an issue with the cron job. I didn't use the priority list because I assumed it would just schedule everything it can; maybe that's the culprit? Here's the line I've put in crontab:

0 */1 * * * ~/satnogs-auto-scheduler/schedule_single_station.py -s 521 -d 1.2 -z
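
If it would help, next time I re-enable it I can also capture the scheduler's output by appending a redirect to that entry, so we can see whether the hourly run itself fails (the log path is just an example, kept on the SD card rather than in /tmp):

0 */1 * * * ~/satnogs-auto-scheduler/schedule_single_station.py -s 521 -d 1.2 -z >> /home/pi/auto-scheduler.log 2>&1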

Thanks for your help!

If you see it again, run:

df -h /tmp
df -hi /tmp
ls -l /tmp/.satnogs/jobs.sqlite

Sure, I’ll give it a try. Thanks

Okay, so it happened again and I believe I've tracked down the core issue:

satnogs satnogs-client[483]: usb_claim_interface error -6

Once this occurs, the station doesn't go offline right away, but it stops producing any useful observations; there's no waterfall.

After several observations, especially CPU-intensive ones, the unit goes offline while the decoder is still running (in my case it was the NOAA decoder). Perhaps all of this is just too much for an RPi 3B?
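
Next time it happens I can also check whether the Pi is throttling or under-volting (to test the "too much for the Pi" theory), e.g. with:

vcgencmd get_throttled
vcgencmd measure_temp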

A simple reboot fixes the issue until it happens again. Quick googling suggests the above error might be related to libusb, but I honestly don't know where to look for a fix.

Any advice is welcome. I plan to re-install this unit over the weekend, even though I used the SatNOGS image for it.

Thanks!

It looks like that error means something else is trying to access the device.
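
One way to check which process is holding the dongle (generic Linux commands, nothing SatNOGS-specific): look up the bus/device numbers and see what has the device node open, e.g.:

lsusb | grep -i rtl
sudo lsof /dev/bus/usb/001/005   # substitute the Bus/Device numbers lsusb printed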

I had the exact same issue with a computer running Ubuntu. Every 2 or 3 weeks the USB interface gets locked (with error -6). Last time I only reset the USB controller and it worked (instead of rebooting the whole computer).
I first found out which PCI address the USB controller is at with
lspci | grep USB | awk '{ print $1 }'
-> put the result in a variable, $dev

then (assuming it is a USB 3 / xHCI controller under Linux), I did the following (the sysfs entries expect the full PCI address, hence the 0000: prefix):
echo "0000:$dev" > /sys/bus/pci/drivers/xhci_hcd/unbind
echo "0000:$dev" > /sys/bus/pci/drivers/xhci_hcd/bind

If it is USB 2, the driver directory is ehci-pci instead.

It is easy to script…
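
Something like this, for example (untested sketch; run it as root, and use ehci-pci instead of xhci_hcd for a USB 2 controller):

#!/bin/sh
# Reset the USB controller by unbinding and rebinding its PCI driver.
dev=$(lspci | grep USB | awk '{ print $1 }' | head -n 1)   # first USB controller only
echo "0000:$dev" > /sys/bus/pci/drivers/xhci_hcd/unbind
sleep 2
echo "0000:$dev" > /sys/bus/pci/drivers/xhci_hcd/bind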

Thanks for your feedback! I was considering scripting something similar if I don’t manage to figure out the root cause. I’m glad to see I’m not the only one having this issue.

Thanks!

If you figure out the root cause, I'm interested.

It happened again, but I have more info which may point to the culprit.

While the USB interface error -6 still occurs, it doesn't knock the station offline by itself. At that point there is either no waterfall or only a blank waterfall. What makes the station go offline is a scheduled observation, or rather a decoder that starts and then doesn't stop for a long time (particularly if it's CPU-intensive). In my case, today satnogs_bpsk_ax was still running 45 minutes after the pass of the ZHOU ENLAI satellite. I've observed this behavior several times, with different decoders running long after the satellite pass.

I'll rebuild this install today or tomorrow, but this time I won't add any additional software and its dependencies, like the auto-scheduler and satnogs-monitor. I'll follow up with the results.

Did you configure any long-running post-observation scripts? That could trigger the issue as follows (this was the case on my station):

  1. Observation 1 ends, and a long-running post-observation script starts
  2. Observation 2 starts while the post-observation script from Obs 1 is still running
  3. Obs 2 is written to tmpfs, but Obs 1 is not cleaned up yet
  4. tmpfs fills up
  5. The jobs database on tmpfs cannot be written to any more
  6. Obs 2 is not stopped because the jobs database cannot be accessed, and it runs indefinitely
  7. Obs 3 is started, but cannot claim the USB device since Obs 2 is still running

A lot of steps, but this is what happened on my system. If you defer the long-running post-observation script to the background by daemonizing or forking it, the problem is fixed. Would you by any chance be running my meteor processing script? Do a git pull and it will do just that. :)
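
In case you wrote your own post-observation script, a minimal sketch of the idea (the wrapper and the process_obs.sh name are just examples, not SatNOGS files): have the configured hook return immediately and push the heavy work into the background:

#!/bin/sh
# Post-observation hook sketch: hand the slow processing off to the background
# so the client can clean up /tmp and start the next observation on time.
nohup /home/pi/process_obs.sh "$@" >/dev/null 2>&1 &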

Hi @Rico, apologies for answering late… it was a busy week :)

Nope, just this to enable the bias tee:

/home/pi/rtl_biast/build/src/rtl_biast -b 0

I haven't yet gotten around to trying your meteor processing script, but I will soon!

Thank you!