SatNOGS client: lots of failed observations

My two currently active stations have been failing observations at a much higher rate than the historical trend (~never apart from my own mistakes).

#3150 is the SatNOGS Kit demo hardware from Hamvention. We re-imaged the uSD card, mostly as a student training opportunity, but the hardware was otherwise reassembled to its as-received condition.

I’m leaving troubleshooting notes on the things I’ve collected and seen, partly because I don’t have much dedicated time to focus on root-cause discovery, and partly to give search hits to others seeing similar behavior.

Info

Both stations run satnogs-auto-scheduler with largely the same priorities file. This is to intentionally schedule obs of the same transmitter on co-located systems and allow comparisons.
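
For completeness, a rough sketch of how such a priorities file reads, assuming the whitespace-separated NORAD ID / priority / transmitter UUID format I understand satnogs-auto-scheduler to use (the helper and the example line are illustrative, not my actual file):

# Illustrative only: parse a satnogs-auto-scheduler style priorities file,
# assuming "NORAD_ID priority transmitter_UUID" per line, e.g.
#   12345 1.00 TransmitterUUIDxxxxxxxxxxx
def load_priorities(path):
    priorities = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            norad_id, priority, uuid = line.split()
            priorities.setdefault(int(norad_id), []).append((float(priority), uuid))
    return priorities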

#834 is a Pi4 driving a G-5500 rotator and a set of VHF+UHF Yagis. Yes, there is indeed no preamp between the antenna and the 60m of LMR-600 run to the diplexer. Some day…

#3150 has its Pi4 on the roof with PoE for network and power.

Obs 8247156 failure

Sep 28 23:17:13 satnogs-1 satnogs-client[432]: satnogsclient.scheduler.tasks - INFO - Spawning observer worker.
Sep 28 23:17:13 satnogs-1 satnogs-client[432]: satnogsclient.observer.observer - INFO - Start rotctrl thread.
Sep 28 23:17:13 satnogs-1 satnogs-client[432]: apscheduler.executors.default - ERROR - Job "spawn_observer (trigger: date[2023-09-29 04:17:13 UTC], next run at: 2023-09-29 04:17:13 UTC)" raised an exception
Sep 28 23:17:13 satnogs-1 satnogs-client[432]: Traceback (most recent call last):
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:   File "/var/lib/satnogs/lib/python3.9/site-packages/apscheduler/executors/base.py", line 125, in run_job
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:     retval = job.func(*job.args, **job.kwargs)
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:   File "/var/lib/satnogs/lib/python3.9/site-packages/satnogsclient/scheduler/tasks.py", line 64, in spawn_observer
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:     observer.observe()           
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:   File "/var/lib/satnogs/lib/python3.9/site-packages/satnogsclient/observer/observer.py", line 243, in observe
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:     self.run_rot()               
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:   File "/var/lib/satnogs/lib/python3.9/site-packages/satnogsclient/observer/observer.py", line 334, in run_rot
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:     self.tracker_rot.trackobject(self.location, self.tle)
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:   File "/var/lib/satnogs/lib/python3.9/site-packages/satnogsclient/observer/worker.py", line 184, in trackobject
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:     self._midpoint = WorkerTrack.find_midpoint(self.observer_dict, self.satellite_dict,
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:   File "/var/lib/satnogs/lib/python3.9/site-packages/satnogsclient/observer/worker.py", line 150, in find_midpoint
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:     timestamp_max = pytz.utc.localize(ephem.Date(observer.next_pass(satellite)[2]).datetime())
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:   File "/var/lib/satnogs/lib/python3.9/site-packages/ephem/__init__.py", line 534, in next_pass
Sep 28 23:17:13 satnogs-1 satnogs-client[432]:     result = _libastro._next_pass(self, body)
Sep 28 23:17:13 satnogs-1 satnogs-client[432]: ValueError: that satellite appears to be circumpolar and so will never cross the horizon

.... later ....

Sep 28 23:56:06 satnogs-1 satnogs-client[432]: satnogsclient.scheduler.tasks - ERROR - Observer job lock acquiring timed out.
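
For my own notes: find_midpoint() takes the max-altitude time of the next pass from pyephem, and next_pass() raises exactly that ValueError for a body that never crosses the horizon. A minimal sketch of the guard I would naively expect around it (this is not the actual satnogs-client code or the proposed fix):

# Sketch only: wrap the next_pass() call that find_midpoint() uses so a
# never-rising satellite gives None instead of an uncaught ValueError.
import ephem
import pytz

def safe_max_altitude_time(observer, satellite):
    try:
        # next_pass() returns (rise time, rise azimuth, max-altitude time,
        # max altitude, set time, set azimuth); index 2 is what the
        # traceback above shows find_midpoint() using.
        max_alt_time = observer.next_pass(satellite)[2]
    except ValueError:
        # "circumpolar and so will never cross the horizon"
        return None
    return pytz.utc.localize(ephem.Date(max_alt_time).datetime())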

Correlation?

Looking back at the obs history, maybe there is a pattern on #834: GOES-17 is the first failure, and after that the station fails everything until I reboot the Pi.

maybe?

… still looking for other correlations for the first failure of a series.

Never seen that error, but it should be handled more gracefully with this MR.
Looks like that was a manual schedule by @ENSFNM and not from the auto-scheduler; I recall there was some progress with geostationary observations.
More specifically, this seems to affect the rotator part of the code, so it should really be handled there as well; it should be possible to run such an obs.
Restarting the client should be sufficient, but applying that MR locally will make sure it doesn’t lock up.

Thanks for the tip, @SA2KNG.

I was thinking that the finally: OBSERVER_LOCK.release() on the next line would execute regardless of whether an exception was caught or not. Or the issue is elsewhere, and that is what is preventing the lock from releasing.
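
To spell out my mental model, a toy illustration of the pattern (not the actual satnogs-client code; the timeout value is arbitrary):

import threading

OBSERVER_LOCK = threading.Lock()

def spawn_job(work):
    # The next scheduled job gives up if the lock is still held after the
    # timeout, which matches the "Observer job lock acquiring timed out."
    # message in the log.
    if not OBSERVER_LOCK.acquire(timeout=60):
        print('Observer job lock acquiring timed out.')
        return
    try:
        work()  # even if this raises, the finally below still runs...
    finally:
        OBSERVER_LOCK.release()
    # ...so the lock should only stay held if the job never returns at all
    # (e.g. a thread it started is stuck), not merely because it raised.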

Before rebooting, I fire off a:

journalctl -u satnogs-client.service | pv | gzip > "$(date -I).gz"

The satnogsclient.scheduler.tasks - ERROR - Observer job lock acquiring timed out. message seems to begin a series of failed obs (which makes sense). What causes it in the first place is still ??? for me.

There are also frequent network timeouts, but those seem unrelated to me.

It seems the exception just trickles back to the previous level as it’s not caught.

If nothing else has happened since, then for scheduling a geostationary satellite on a station with a rotator, the exception in find_midpoint() should be caught.
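
And roughly the shape I mean for handling it in the rotator part as well (stand-in names, not the actual MR), so such an obs can still run without tracking:

import logging

LOGGER = logging.getLogger(__name__)

def run_rot_safely(tracker_rot, location, tle):
    # Stand-in for the rotator-tracking setup: if the pass geometry cannot
    # be computed (geostationary / circumpolar case), warn and skip rotator
    # tracking instead of letting the exception kill the observer job.
    try:
        tracker_rot.trackobject(location, tle)
    except ValueError:
        LOGGER.warning('No horizon crossing for this satellite; '
                       'skipping rotator tracking for this observation.')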

Failed obs: SatNOGS Network - Observation 8314166

More collecting of initial causes.

Oct 12 21:02:45 satnogs-1 satnogs-client[431]: satnogsclient.observer.observer - INFO - Start gnuradio thread.
Oct 12 21:02:46 satnogs-1 satnogs-client[10187]: Found Rafael Micro R820T tuner 
Oct 12 21:02:46 satnogs-1 satnogs-client[10187]: [INFO] Opening Generic RTL2832U OEM :: 00002004...
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]: Found Rafael Micro R820T tuner 
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]: [R82XX] PLL not locked!        
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]: [INFO] Using format CF32.      
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]: gr::log 2023-10-12 21:02:47,550 :INFO: satnogs.doppler_compensation: Output sampling rate should be less or equal the device sampling rate
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]: Traceback (most recent call last):
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:   File "/usr/bin/satnogs_fsk.py", line 544, in <module>
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:     main()                     
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:   File "/usr/bin/satnogs_fsk.py", line 527, in main
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:     tb = top_block_cls(antenna=options.antenna, baudrate=options.baudrate, bb_freq=options.bb_freq, bw=options.bw, dc_removal=options.dc_removal, decoded_data_file_path=options.decoded_data_file_path, dev_args=options.dev_args, doppler_correction_per_sec=options.doppler_correction_per_sec, enable_iq_dump=options.enable_iq_dump, file_path=options.file_path, framing=options.framing, gain=options.gain, gain_mode=options.gain_mode, iq_file_path=options.iq_file_path, lo_offset=options.lo_offset, other_settings=options.other_settings, ppm=options.ppm, rigctl_port=options.rigctl_port, rx_freq=options.rx_freq, samp_rate_rx=options.samp_rate_rx, soapy_rx_device=options.soapy_rx_device, stream_args=options.stream_args, tune_args=options.tune_args, udp_IP=options.udp_IP, udp_dump_host=options.udp_dump_host, udp_dump_port=options.udp_dump_port, udp_port=options.udp_port, waterfall_file_path=options.waterfall_file_path)
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:   File "/usr/bin/satnogs_fsk.py", line 139, in __init__
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:     self.satnogs_doppler_compensation_0 = satnogs.doppler_compensation(samp_rate_rx, rx_freq, lo_offset, baudrate*decimation, 1, 0)
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:   File "/usr/lib/python3/dist-packages/satnogs/doppler_compensation.py", line 91, in __init__
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]:     raise AttributeError       
Oct 12 21:02:47 satnogs-1 satnogs-client[10187]: AttributeError                 
Oct 12 21:02:50 satnogs-1 satnogs-client[431]: satnogsclient.observer.observer - INFO - Tracking stopped.

Then the waterfall creation raised an exception, because the .dat file was zero-size.
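
The gr::log line gives the cause away: the flowgraph asks satnogs.doppler_compensation for an output rate (baudrate × decimation) higher than the device sampling rate, and the block bails out with that bare AttributeError. A back-of-the-envelope check with illustrative numbers only (I have not confirmed the decimation the fsk flowgraph actually picks):

# Rough sanity check mirroring the doppler_compensation guard: refuse to
# run when the requested output rate exceeds the device sampling rate.
# All numbers here are illustrative, not taken from this observation.
def check_output_rate(samp_rate_rx, baudrate, decimation):
    out_rate = baudrate * decimation
    if out_rate > samp_rate_rx:
        raise AttributeError(
            f'output rate {out_rate} Hz > device rate {samp_rate_rx} Hz')
    return out_rate

check_output_rate(samp_rate_rx=2.048e6, baudrate=9600, decimation=48)      # fine
# check_output_rate(samp_rate_rx=2.048e6, baudrate=125000, decimation=48)  # raises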

The next observation, SatNOGS Network - Observation 8322858, then failed with

Oct 12 21:36:57 satnogs-1 satnogs-client[431]: satnogsclient.scheduler.tasks - ERROR - Observer job lock acquiring timed out.

Trying out some upgraded exception catching:

patched on station SatNOGS Network - Ground Station Valpo WIRED Lab 1

and

… now to wait for more errors in the logs.

(My read is) that this is an old issue we missed tracking & fixing. There was a mitigation implemented via a modification of satnogs-db, but there was no safeguard in place to prevent new transmitters with a high baudrate from being added to satnogs-db. Apparently this has happened now. The mitigation will be renewed by @fredy, and the underlying issue is now tracked in satnogs-client#466.
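
For illustration, roughly the kind of safeguard I mean, a sketch assuming the public db.satnogs.org /api/transmitters/ endpoint and its baud field (the 250 kBd threshold is made up, and pagination is ignored):

import requests

def high_baud_transmitters(max_baud=250_000):
    # Sketch only: list transmitters whose baud rate exceeds a threshold a
    # typical station sample rate cannot handle. Assumes the endpoint
    # returns a flat JSON list with a 'baud' field.
    resp = requests.get('https://db.satnogs.org/api/transmitters/',
                        params={'format': 'json'}, timeout=30)
    resp.raise_for_status()
    return [t for t in resp.json() if t.get('baud') and t['baud'] > max_baud]

for txr in high_baud_transmitters():
    print(txr['uuid'], txr['baud'], txr.get('description'))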

satnogs-client!621 is a “catch-all” fix which I would propose we adopt immediately, but a proper fix should be implemented anyway.

edit: stated my uncertainty, because I have not had time to properly look into the stacktraces / code yet.
