Observations vetting - Need for a reform

pierros · May 2, 2019, 2:42pm

Present state

Since the beginning of our Network operations we have had a vetting system. Vetting of observations is crucial for our operations: it provides extremely useful statistics for satellites and ground stations, which in turn give great insights and guidance for future observations and help identify issues (and could be used in the future by our auto-scheduling algorithms).

Although the guidelines and processes have changed in the past (with the introduction of new states like “Failed”), the general concept has remained the same. You can see the outline and our vetting guidelines here: https://wiki.satnogs.org/Operation#Rating_observations

The recent growth of the network and the growing diversity of ground stations and satellites are forcing us to re-evaluate those processes and guidelines. Thoughts of introducing more crowd-sourcing into the process, and possibly AI, further complicate the discussion.

The issue

The main issue at hand (among various others) is the difference between the “Failed” and “Bad” states. Coincidentally, this is also the most debated decision per observation, as is evident in various other threads (like this), and not without reason.

In theory the concept is (although admittedly not easily digestible in our guidelines) that observations should be marked “Failed” if there is a problem with the station (e.g. no artifacts returned, a problematic RF line, etc.), and “Bad” should be reserved for cases where a satellite is malfunctioning and thus we could not pick up a signal.

The problem is that in many situations it is really hard to tell what is happening, and it requires knowledge and research on the part of the user vetting the observation. The vetter should know what the capabilities of the station are, and also know whether the satellite is performing well, in order to deduce that it is a ground station issue. In some extreme cases it is even harder to tell what is happening, since only a handful of stations could determine the status of the satellite (e.g. the Kicksat situation, where it was only picked up by 3 stations, 1 of them being the Dwingeloo Radio Telescope).

Although there are clear cases of “Failed” (like when no artifacts, waterfall, audio etc are returned), there is always an amount of uncertainty on the “Bad” observations (it could be the station or it could be the satellite failing).

Scale comes to the rescue?

In principle since we are doing operations at large scale (3k+ observations per day with more than 200 stations) erroneous vetting should not be a problem. We will always have some level of uncertainty and it should be fine.

In theory we could stick to a much stricter vetting guideline (as suggested by @acinonyx and @fredy before) and mark as “Failed” only the absolutely obvious cases (like malformed waterfalls or missing artifacts). This could be done in an automated way, since those cases are easily detectable. Then for everything else (except the auto-vetted good) it is up to the users to vet as “Good” or “Bad” (and gradually an AI system could supplement this). That scenario removes any uncertainty between “Failed” and “Bad” but introduces a crucial issue about the statistics and the way we calculate them.

At present, transmitter (thus satellite) statistics are calculated based on Good vs Bad states of observations. Ground station statistics are calculated based on Good-Bad vs Failed states. If we were to change our policy to the above-mentioned scenario, this would skew the transmitter statistics to unrealistic numbers, since observations missed because of a station problem would be counted as “Bad” against the transmitter.
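To make the skew concrete, here is a minimal Python sketch of the two ratios. The exact formulas the Network uses are not spelled out in this post, so the function names, the ratios and the sample data below are assumptions for illustration only.

```python
from collections import Counter


def transmitter_success_rate(statuses):
    """Share of 'good' among 'good' + 'bad'; 'failed' is ignored (assumed formula)."""
    counts = Counter(statuses)
    vetted = counts["good"] + counts["bad"]
    return counts["good"] / vetted if vetted else None


def station_success_rate(statuses):
    """Share of 'good' + 'bad' among all vetted observations (assumed formula)."""
    counts = Counter(statuses)
    returned = counts["good"] + counts["bad"]
    total = returned + counts["failed"]
    return returned / total if total else None


# Hypothetical data: ten observations of one healthy transmitter, three of them
# lost to a station-side RF problem that still returned artifacts.
current_policy = ["good"] * 6 + ["bad"] * 1 + ["failed"] * 3
strict_policy = ["good"] * 6 + ["bad"] * 4  # the same three are now vetted "bad"

print(transmitter_success_rate(current_policy))
print(transmitter_success_rate(strict_policy))
```

With these made-up numbers the transmitter ratio drops from 6/7 to 6/10 even though nothing changed on the satellite, while the station ratio rises from 7/10 to 10/10.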

I am opening up this thread to gather feedback and ideas on the way forward and determine collectively what makes sense to follow as a policy and implement the technical aspect of it in our Network. Please do chime in!

K3RLD · May 2, 2019, 5:53pm

Ok, several comments here.

First of all, the “good, bad” options for technically successful observations are utterly confusing. Not for us “old hands”, but surely for new people.

I suggest the following:

1) “observed” - if you can detect something in the waterfall that is known to be the satellite that is scheduled, then you select “observed”
2) “not observed” - if there is nothing in the waterfall, if you can’t really tell if what you are looking for is in the waterfall, or if it is obvious that other satellites are in the waterfall, you select “not observed”
3) “failed” - remains the same - a purely hardware failure (obvious to all) that something went wrong (no waterfall, no audio, funky displayed sky track, solid color waterfall, etc.).

I would also suggest perhaps an allowed 100 to 200 characters for comments, such as “signal seen in waterfall is SSB and CW, thus cannot be from this FM sat”.

Second observation: The CW auto vetter doesn’t seem to be even close to reliable or accurate. I see many CW observations vetted as “good” where the decoded CW is shown as “IEEEEEIIIEEEEEIIIEE”.

–Roy
K3RLD

thebaldgeek · May 2, 2019, 6:07pm

Brand new user here (Just less than 24 hours). I would like to second the notion of ‘observed’ and ‘not observed’.
The self vetting was a bit odd to me until I started to look at what others had vetted as good and bad. It’s becoming clear to me that any signal, no matter how weak, is ‘good’. Even if I felt that a signal was so weak that no data could be gleaned from it, others had vetted their pass as good and thus so should I.

cgbsat · May 2, 2019, 7:27pm

I agree with the others. With the current classification labels I interpret the vetting as:

good: station operational and signals from the target satellite were received, possibly decoded
bad: station operational but signals from the target satellite were not received, regardless of whether the satellite is not transmitting or the signal is lost in the noise
failed: station is not operational

I like K3RLD’s suggestion of observed and not observed to signify if the transmitter was heard, though there may be other better suggestions.

n5fxh · May 2, 2019, 7:46pm

I do like the idea of changing the names “good” and “bad”, if possible, because they are really confusing at the beginning (however it will mean changes in a lot of scripts). “Observed” and “not observed” seem much better names.

There is another related issue which I have not seen debated. In the case where there is an “observed” signal from the proper satellite, how does it affect the vetting whether the signal is from the wanted transmitter or not (FSK4.8 instead of FSK9.6, FM instead of SSTV…)?

It also depends on what we want to do with this vetting. Maybe you (@pierros?) could give a reminder of the aims (produce data?).

ctriant · May 3, 2019, 9:46am

That’s a point that I also wanted to discuss. With the current vetting scheme the statistics related to the success ratio of a satellite’s specific transmitter are unreliable.

Thinking of the signal dataset that we want to provide in the future, this information should be reliable or else nonexistent, as inconsistent data is practically meaningless for the purposes of scientific research (e.g. machine learning on inconsistent data is a great pain).

My proposal on this issue, that however relies on the old vetting scheme, is a two-way vetting for the satellite status and the specific transmitter status. Take a look here.

A “good vs bad” scheme makes more sense to me. It provides the core information of interest for an observation, namely whether the target satellite was observed or not. After all, if the reason for a bad observation is related to the performance of the specific ground station, this will correctly affect the success ratio of the station. On the other hand, such a false alarm will not affect the statistics of the satellite that much, as “scale comes to the rescue”.

A more ambitious strategy, that I don’t know if it is feasible, could be to use a max-vote process for the stats related to the life status of a satellite, taking into consideration the reported status from all the observing ground stations on a pass.
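A minimal sketch of what such a max-vote could look like, in Python. The function name, the status strings and the choice to return no verdict on a tie are my assumptions, not an existing Network feature.

```python
from collections import Counter


def max_vote_status(reports):
    """Return the majority status among the 'good'/'bad' reports for one pass.

    Any other status (e.g. 'failed') is ignored. Returns None when there are
    no usable reports or when the vote is tied.
    """
    counts = Counter(r for r in reports if r in ("good", "bad"))
    ranked = counts.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no verdict
    return ranked[0][0]


# Hypothetical reports from stations that observed the same pass.
print(max_vote_status(["good", "bad", "good", "failed"]))
```

A plain majority would be misleading for weak satellites that only a few capable stations can hear, so in practice the votes would probably need to be weighted somehow.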

Summing up, as long as the current scheme provides inconsistent statistics for the status of a satellite, we should move on to a new methodology that we are sure provides the information as it should be, and let the existing statistics converge over time.

K3RLD · May 3, 2019, 11:49am

Yep, this is a good point. I observe all LilacSat-2 passes over my stations, and while the FM “transmitter” is rarely on, there is always some type of telemetry squawk on that transmitter when it’s not on. I vet these as “good”, even though the FM repeater is not on. Again, this might be a good usage of a short comment, such as “FM repeater off, Downlink telemetry only.”

bob · May 3, 2019, 3:08pm

How about ‘the good’, ‘the bad’ and ‘the ugly’?

Acinonyx · May 3, 2019, 4:40pm

This is a topic that has been discussed a lot in the past.

The current vetting process has always been problematic. I believe the reason for that is that during the early days of SatNOGS there was an oversimplified approach to checking whether a satellite was active or not. This worked out at that time because everything was “controlled”. There was a small group of vetters, ground stations were few and closely monitored, and observations were not that many. The state of the satellite was assumed to be the vetting status of the last observation of that satellite. But as the network grew, new stations came in which were not all that stable. This created a new problem; the old vetting process of “Good” vs “Bad” was influenced by the stability of the station. Unfortunately, a “quick and dirty” solution was implemented; the introduction of the “Failed” state. With the “Failed” state, a vetter could filter out a “bad” station observation and preserve the “Good” and “Bad” as an indication of the satellite state.

As expected, the “Failed” state introduced a whole new set of problems. Now, people cannot easily tell the difference between “Failed” and “Bad”, as the definition is indeed somewhat gray. This misunderstanding and subjectivity ultimately created inconsistent data. Such data is not only useless for ML, as @ctriant said, but also as input for traditional statistical analysis. But the root cause is that the level of vetting objectivity depends directly on the amount of research into past observations each vetter is willing to do, and this is not measurable.

I believe that there is one clean solution; vet neither that a satellite is alive nor that a ground station is functioning correctly. We need to move away from the old approach of trying to vet based on what people think the current state of the satellite or station is. The vetting should be about the observation artifacts themselves and limited to what the system cannot deduce on its own. We should basically offload to people only the tasks that the machine cannot do well. It is also very important that this process is trivial, like answering 1 or 2 simple questions with “Yes” or “No”, in order to remove as much subjectivity as possible and make it fast. That being said, in almost all cases vetting should be limited to checking that “some” signal is present at the center of the waterfall within a bandwidth defined by the mode and the quality of TLEs. No prior knowledge or research should be needed to vet. That would change the way we currently vet some observations. For example, unreasonably Doppler-shifted signals will be vetted bad because they won’t be inside the defined bandwidth. Also, signals which fit within the defined bandwidth will be vetted good even if they are not transmitted with the expected mode.
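As a rough illustration of the “defined bandwidth”, here is a small Python sketch. How the window would actually be derived from the mode and the TLE quality is not decided, so the function, its parameters and the numbers are purely hypothetical.

```python
def signal_in_vetting_window(offset_hz, mode_bandwidth_hz, tle_margin_hz):
    """Return True if a signal seen at offset_hz from the waterfall center
    falls inside the window the vetter is asked to look at.

    The window is assumed to be half the mode bandwidth on each side of the
    center, widened by a margin that stands in for TLE quality.
    """
    half_window_hz = mode_bandwidth_hz / 2 + tle_margin_hz
    return abs(offset_hz) <= half_window_hz


# Hypothetical numbers: 10 kHz mode bandwidth, 2 kHz allowance for TLE error.
print(signal_in_vetting_window(3_000, 10_000, 2_000))   # inside the window
print(signal_in_vetting_window(17_000, 10_000, 2_000))  # outside the window
```

The vetter would still answer the question by eye; the point is only that the window itself can be drawn by the system, so no research is needed to decide where to look.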

Nevertheless, the state of the satellite will be inferred correctly, since it depends on many factors which the system already has knowledge of (e.g. the vetting state of recent observations, the presence of demodulated data in recent observations decoded with the expected decoder, the quality of the ground stations which were used for these observations, etc.). Similarly, the state of the ground station will also be inferred from multiple factors, and again the vetting result is only one of them.

Regarding the “Failed” vetting state, it should be replaced with an internal flag of the observation and removed as a vetting option. This flag should be set by the system after checking the sanity of the uploaded data (e.g. no or zero-sized audio, no waterfall, obs cut short, etc). This will keep our data clean of what is obviously invalid.
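A minimal Python sketch of such a sanity check follows. The field names, the `UploadedObservation` type and the 90% duration threshold are assumptions made up for this example; they do not describe the current Network data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UploadedObservation:
    """Hypothetical summary of what a station uploaded for one observation."""
    audio_size_bytes: Optional[int]  # None if no audio was uploaded
    has_waterfall: bool
    scheduled_duration_s: float
    recorded_duration_s: float


def failure_reasons(obs: UploadedObservation,
                    min_duration_ratio: float = 0.9) -> List[str]:
    """Return the reasons to flag an observation as failed (empty list if sane)."""
    reasons = []
    if not obs.audio_size_bytes:  # covers both missing and zero-sized audio
        reasons.append("no or zero-sized audio")
    if not obs.has_waterfall:
        reasons.append("no waterfall")
    if obs.recorded_duration_s < min_duration_ratio * obs.scheduled_duration_s:
        reasons.append("observation cut short")
    return reasons


# Hypothetical upload: empty audio file and a recording that stopped halfway.
broken = UploadedObservation(audio_size_bytes=0, has_waterfall=True,
                             scheduled_duration_s=600, recorded_duration_s=300)
print(failure_reasons(broken))
```

The internal flag would then simply be set whenever the returned list is not empty.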

Another thing I want to point out is the importance of multi-vetting support. With this feature, the quality of the vetting data will increase dramatically. I also think that auto-vetting, which is based on whether demodulated data is decodable or not, must be removed. It creates an unnecessary feedback path at the wrong level and messes up the data inputs (machine - user).

With this new approach we will be creating a data set which is consistent and usable both for statistical analysis and machine learning which will hopefully alleviate the need for manual vetting at some point.

fredy · May 3, 2019, 5:20pm

I’m exactly on the same page with @Acinonyx, I just want to add one more parameter that will affect the whole vetting/rating process. There is a plan for moving from waterfall image to waterfall data.

So, the client will send data instead of an image to the network. This will give us better control over the waterfall data and how we visualize it, and will open the way for other features.

n5fxh · May 3, 2019, 5:53pm

Do you think there will be enough “vetting power” in the network if we remove auto-vetting and/or if there is multi-vetting support? If you also consider that the arrival of auto-scheduling will dramatically increase the quantity of observations (both observations eligible for auto-vetting and those needing manual vetting), it will also increase the vetting needs.

I think disabling auto-vetting or increasing the vetting work can discourage some people from submitting observations and can thus decrease the data collected.

Acinonyx · May 3, 2019, 10:41pm

Normally, there would be no need for vetting a signal that is verified by the system to be decoded correctly, as this would not contribute in any way to the satellite or station rating. We could totally skip vetting for such cases. But even if we still did that, it should not be done in a way that contaminates the vetting data. If we keep automatically setting the “Good” flag to basically skip vetting, then we must certainly exclude such observations somehow from the data set. They are not produced using the same process and thus they express different levels of certainty. I would not even recommend skipping vetting in such cases though. Decoders can make mistakes (e.g. the CW case, or decoding other satellites passing by) and an additional independent input will always improve accuracy. If we define some simple rules for vetting, then the process will be faster. Then we could create a UI which facilitates mass vetting based on these rules. I think there is much room for improvement in making vetting easier and faster in terms of UI/UX. Another thing I want to say is that we do not need to vet all observations; we just need a sample large enough to infer satellite and station state.
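One way to judge whether a sample is “large enough” is to look at the confidence interval around the observed share of good vets. The Python sketch below uses the standard Wilson score interval; choosing this particular interval and a 95% level (z = 1.96) is my assumption, and the counts are made up.

```python
import math


def wilson_interval(good, total, z=1.96):
    """Wilson score interval for the share of 'good' vets in a sample."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = good / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total
                               + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical samples with the same 80% share of good vets.
small = wilson_interval(8, 10)
large = wilson_interval(80, 100)
print(small, large)
```

The interval for 100 vetted observations is much narrower than the one for 10, so the sample size could be chosen per satellite or station based on how tight an estimate is needed.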

n5fxh · May 4, 2019, 5:19am

OK, This explains a lot of things.
Thanks,

daveh · May 7, 2019, 11:48am

This is a great thread and really helpful. For my personal interest, I am less interested in the distinction between the current “Good” (the satellite is still transmitting) and “Bad” (there was nothing in the waterfall) than in the distinction between a “Good” where there was a lot of data with a high signal-to-noise ratio, so that not only can you decode it but the decode is likely to be somewhat useful, and a “Good” where you can see the data on the waterfall but may only get a couple of bytes or less of meaningful decode.

The distinction between observed and not observed would be great, but I would like some kind of observation quality metric.

K3RLD · May 7, 2019, 1:00pm

Getting the “quality” rated would be fantastic - but really I think the trouble is figuring out HOW to do that. The automatic decoders (for Fox satellites, for example) give a somewhat “de facto” quality based on how many frames are decoded. Same with the APRS satellites. But how do you quantify something such as a voice downlink? Or an NOAA image? Those satellites are almost 100% decoded into an image - but the quality of the image can range from fantastic to just a black box.

g7kse · May 7, 2019, 9:00pm

+1 for the observed / not observed btw. Also +1 for the @Acinonyx volunteer vetting / auditing. But a couple of things.

Firstly, we need to understand the purpose of the observation and how station A compares with station B. FlightAware uses MLAT as a way of verifying data from ADS-B transmissions. OK, they have a lot more ground stations to make their comparisons, but from what I understand they aggregate data and improve consistency. Can we learn from that?

The volunteer auditing / checking would in itself need to be consistent. A person checking in depth might need additional training and need to be able to demonstrate that they are suitably qualified and experienced. Otherwise it could easily descend into opinion over fact. Ensuring that people can dip in and out of observations will be crucial. I sometimes have to go to work.

fredy · May 7, 2019, 9:18pm

One quick comment on this… This is exactly why, as @Acinonyx said, we need to start vetting observations and not stations/satellites/transmitters. We need to make the process simple, with well-defined steps that leave no one in doubt about how to vet and that keep vetting objective. If we manage this, then “vetting” (or better, stats) for stations/satellites/transmitters will be just a matter of statistical calculations.

n5fxh · May 8, 2019, 5:54am

If you want to suppress the effect of differences between vetters, you can also have each vetter vet observations from randomly chosen stations.

This way the remaining differences between vetters will be applied uniformly across all stations, and there will no longer be a difference in vetting between stations. At least the mean of the differences will converge to 0.
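A small Python sketch of such a random assignment is below. The function name and the round-robin split after shuffling are just one possible way to do it, and the IDs and vetter names are made up.

```python
import random


def assign_vetters(observation_ids, vetters, seed=None):
    """Assign each observation to a vetter independently of its station.

    Observations are shuffled and then dealt out round-robin, so every vetter
    gets a random mix of stations and a nearly equal share of the work.
    """
    if not vetters:
        raise ValueError("at least one vetter is required")
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(observation_ids)
    rng.shuffle(shuffled)
    return {obs_id: vetters[i % len(vetters)]
            for i, obs_id in enumerate(shuffled)}


# Hypothetical observation IDs and vetter names.
assignment = assign_vetters(range(10), ["alice", "bob", "carol"], seed=1)
print(assignment)
```

Because the station plays no role in who gets which observation, a vetter who is systematically stricter or more lenient affects all stations alike.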

ks1g · May 9, 2019, 3:04pm

Excellent point. Example: I am trying to capture #43678 DIWATA-2 FM voice downlinks. I consider this observation to be good/successful: https://network.satnogs.org/observations/639768/ There is an obvious voice signal on the waterfall and it is detectable in the audio. If I don’t see/hear the FM audio downlink, I vet the observation as bad: https://network.satnogs.org/observations/645691/ The S-curve frequency shift in the waterfall indicates to me that the signals are not associated with the satellite, and there is no detectable signal in the audio.

I’ve noticed that other observers vet a similar observation as good: https://network.satnogs.org/observations/629672/ Interestingly, the waterfall looks similar to the one I marked as “bad”. I’m not commenting on this observer, just noting that 2 observers apply different subjective criteria to similar observations and come up with different assessments.

Another example: https://network.satnogs.org/observations/629669/ I have seen something similar including the signal at 00:75 +17kHz that looks a lot like the “Foxtail” from an AMSAT Fox-1 transmitter and might be Fox-1C (AO-95) nominal 145.920. I marked my reception as “bad” because I did not think I observed the intended target (I also didn’t think it was Fox-1C and need to go back and check again).

So maybe we need a hierarchy that fits:
Excellent: detected the intended transmitter and received data that was auto-decoded or is human-interpretable.
Good: detected the intended transmitter with enough detectable information to be confident of the observation.
[insert appropriate term here]: detected something but cannot definitively associate it with the intended target or with other target satellite transmitters. (I’d put the above observations of DIWATA-2 and capturing Fox-1C(?) in this category).
Bad: did not detect anything (clean waterfall), or detected signals clearly not associated with the target or other possible targets; known to observer as local RFI, …

73 de KS1G

thebaldgeek · May 9, 2019, 4:14pm

I really like this idea.
The 4 levels do not make things overly complicated for new and experienced users alike, and I like the notion of the pass being so good that data / voice is detected.
As it currently stands, if you see a signal, it’s vetted as ‘good’, when in reality it’s less than excellent.

Perhaps this 4-level scheme could be implemented until such time as ML / auto vetting can be put in place.