Classi-Fly: Inferring Aircraft Categories from Open Data

In recent years, air traffic communication data has become easy to access, enabling novel research in many fields. Exploiting this new data source, a wide range of applications have emerged, from weather forecasting to stock market prediction and the collection of intelligence about military and government movements. Typically, these applications require metadata about the aircraft, specifically its operator and the aircraft category. armasuisse Science + Technology, the R&D agency for the Swiss Armed Forces, has been developing Classi-Fly, a novel approach to obtain metadata about aircraft based on their movement patterns. We validate Classi-Fly using several hundred thousand flights collected through open source means, in conjunction with ground truth from publicly available aircraft registries containing more than two million aircraft. We show that we can obtain the correct aircraft category with an accuracy of over 88%. In cases where no metadata is available, this approach can be used to create the data necessary for applications working with air traffic communication. Finally, we show that it is feasible to automatically detect sensitive aircraft such as police and surveillance aircraft using this method.


INTRODUCTION
Aircraft metadata is a key component for research and applications related to aviation tracking. Recent work includes the collection of open source intelligence on government and military operations [23], the detection of mergers and acquisitions of public companies [22], and the in-depth analysis of privacy leaks [20].
Journalists and researchers alike have used the ready availability of aircraft tracks to gain insights into the intentions and plans of the passengers or, in some cases, the pilots. For example, reporters have used such data to uncover federal surveillance aircraft deployed in the USA [1]. Most of these applications rely on metadata, since upfront information on the aircraft category (i.e., the broad category of user, such as government, surveillance, business or military) is not available. A common way to infer the category uses the owner (or operator) or the aircraft model (e.g., a corporate jet such as a Gulfstream), from which the use case and category of the aircraft may be inferred with a good level of certainty. For instance, the US Air Force operates military aircraft, which will likely be flying military operations, whereas business jets are likely to be used for corporate purposes.
However, in a large percentage of cases there is no meta information available for observed aircraft. This makes it much more difficult to identify the category of an aircraft. A recent study found that around 15% of all transponder-equipped aircraft could not be found using publicly available data [17]. Typically, these aircraft are from countries that do not provide an open aircraft registry. Furthermore, they may carry out sensitive operations, or be very recently registered.
In this paper, we present Classi-Fly and show that it is feasible to classify aircraft into different operator categories based on their flight movement patterns. The main advantage of using exclusively behavioural features is that they cannot be trivially altered or spoofed by the aircraft operator without significant cost (e.g., diversions or other changes to the mission pattern), in contrast to any classification based on the content of their communication.
In cases where no metadata is available, this approach can be used to obtain the aircraft categories with an accuracy of over 88%. The applications for our work range from investigative journalism to open source intelligence or research on the technical aspects of transponder equipage. We illustrate this with a case study in which previously unknown aircraft can be identified as surveillance aircraft with a very high likelihood.
In this paper, we make the following contributions:
• On a dataset of 6014 aircraft, we show that it is feasible to automatically estimate the category of a given aircraft with over 88% accuracy based solely on its flight behaviour.
• Using our approach, we classify a further 1,066 unknown aircraft into different categories, effectively deriving valuable metadata information for these aircraft, which can be used for popular research applications.
• We discuss the implications of our method, including potential countermeasures, and analyze a case study of previously unidentified aircraft with sensitive mission profiles.
The remainder of this paper is structured as follows: Section 2 describes the necessary background on air traffic control and tracking. Section 3 discusses the related work, while Section 4 introduces our data collection process. Section 5 describes our experimental design before Section 6 presents the results. Finally, we discuss our method in Section 7 and conclude this paper in Section 8.

BACKGROUND
This section provides the necessary background on how aircraft tracking works. Fig. 1 shows the wireless communication links of the two technologies considered, which are explained in the following.

Surveillance Technologies in Aviation
There are two main surveillance technologies used for cooperative tracking of civil aircraft. Secondary Surveillance Radar (SSR) uses the so-called transponder Modes A, C, and S, which provide digital target information (altitude, squawk identification), in contrast to traditional analog primary surveillance radar (PSR). Aircraft transponders are interrogated on the 1030 MHz frequency and reply with the desired information on the 1090 MHz channel (see Fig. 1, right). With the newer Automatic Dependent Surveillance-Broadcast (ADS-B) protocol (see Fig. 1, left), aircraft regularly broadcast their own identity, position, velocity and other information such as intent or emergency codes. These broadcasts do not require interrogation; position and velocity are automatically transmitted at 2 Hz [16].

Aircraft Identifiers in Air Traffic Communication
A 24-bit address assigned by the International Civil Aviation Organization (ICAO) to every aircraft is transmitted via both ADS-B and SSR. Crucially, this identifier is different from an aircraft's squawk or call sign. Squawks, of which only 4096 exist, are allocated locally and are not effective for continuous tracking. The call sign can be set separately through the flight deck for every flight. Call signs of private aircraft typically consist of the aircraft registration number, commercial airliners use the flight number, and military and government operators often use special call signs depending on their mission. In contrast, the ICAO identifier is globally unique and provides an address space of 16 million; while the transponder can be re-programmed by engineers, the identifier is not easily (or legally) changed by a pilot. These characteristics make it ideal for continuous tracking over a prolonged period of time.
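As a quick check of the two identifier spaces quoted above, the following Python sketch computes their sizes. It assumes the usual encodings (24 bits for the ICAO address, four octal digits for the squawk); the helper name format_icao is purely illustrative.

```python
# Illustrative only: sizes of the two identifier spaces discussed above.
ICAO_ADDRESS_BITS = 24    # 24-bit ICAO transponder address
SQUAWK_OCTAL_DIGITS = 4   # assumption: a squawk is four octal digits (0000-7777)

icao_space = 2 ** ICAO_ADDRESS_BITS
squawk_space = 8 ** SQUAWK_OCTAL_DIGITS


def format_icao(address: int) -> str:
    """Render a 24-bit address as the customary six-digit hex string."""
    if not 0 <= address < icao_space:
        raise ValueError("ICAO address must fit in 24 bits")
    return f"{address:06x}"


print(icao_space)    # 16777216, i.e. the "16 million" mentioned in the text
print(squawk_space)  # 4096
```

The gap of more than three orders of magnitude between the two spaces is what makes the ICAO address, rather than the squawk, suitable as a long-term tracking key.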

Required Data Mining Capabilities
Aircraft tracking is the act of obtaining live or delayed positional information on aircraft by purely passive actors. Their motivations range from traditional hobbyist planespotting enthusiasm through military and business interests to criminal intent. Where traditionally most spotters have conducted their trade purely using visual means, i.e., seeing and recognizing the aircraft near an airport, modern software-defined radio (SDR) technology has made accurate, fast and scalable tracking of aircraft feasible for anyone. There are two options to exploit SDRs: trackers can install their own personal receivers or use the SDR data aggregated by web tracking services. While a single receiver with a radius of up to 600 km can already provide interesting results, the insights increase considerably with a larger network. Both live tracking data and the required metadata are easily accessible online, as discussed in Section 4.

RELATED WORK
The classification of objects or subjects based on wireless communication has been a popular field of research, in particular with a focus on security and privacy aspects. Exemplary studies range from the mobility states of humans [14] to the classification of intruders (people, soldiers, vehicles) in a military setting [2]. The closest related academic research is the classification of different types of ground vehicles. Vehicle type classification is an important signal processing task with widespread military and civilian applications in intelligent transportation systems [7]. Several data types have been used for vehicle classification, collected for example from acoustic or seismic [19,25] sensor sources. The authors in [24] distinguish two classes of vehicles (trucks and passenger cars) using GPS data extracted from mobile traffic sensors with a misclassification rate of 4.6%.
The main features are based on the vehicles' acceleration and deceleration behaviour. Other work used GPS-based tracks of cab drivers to study their behaviour and classify them into high-earning and average-earning drivers through the use of angularity and travel time features [13]. Using taxi tracks with a different focus, further work attempted to uncover anomalous trajectories in a dataset by comparing and isolating tracks which are few and different from the majority [26].
In the aircraft domain, wireless classification has focused on traditional non-cooperative PSR communication as the medium. Such work focuses on both military [12] and commercial aircraft [27] and includes the exploitation of Doppler signatures [5] and high resolution range profiles [27] to identify the type of aircraft seen by the radar. However, primary radars are expensive and are being replaced globally with the more accurate and cost-efficient ADS-B. The closest non-academic work related to our approach is the successful attempt of investigative journalists to uncover unknown surveillance aircraft in the USA, which was presented at DEFCON 25 [9]. The authors report on the background of so-called spy aircraft, which are identified using a machine learning approach on aircraft flight data pre-processed by a large commercial tracking website. While we follow a similar basic approach concerning such surveillance aircraft in this work, we systematically analyze the effectiveness and validity of applying machine learning to aircraft behaviour by processing a large open data set. Importantly, we further generalize our approach to many aircraft categories.

DATA COLLECTION
We describe the processes for the collection of fine-grained tracking data and for obtaining aircraft ground truth from public sources. All data used in this work is openly available and is thus already accessible to researchers on an ever-growing scale.

The OpenSky Network
OpenSky is a crowdsourced network [16], which is used as the backbone of our data collection. As of July 2019, the OpenSky Network consists of more than 2000 registered sensors streaming data to its servers. The network has to date received and stored over 16 trillion ATC messages, adding over 15 billion messages from more than 50,000 different aircraft every day. As a non-profit, research-oriented network, OpenSky offers open access to its data to academic researchers and has been used for a large number of publications spanning many different domains. Detailed information about the history, infrastructure and use cases of OpenSky is provided in [16].
Data Acquisition and Pre-Processing. Aircraft tracks can be retrieved from the OpenSky Network for free by universities, flight authorities, and other non-profit research institutions (see https://opensky-network.org/data/impala). The available data goes back several years, for which it offers dense coverage of Europe and the US. More recently, coverage has spread to other continents, although coverage in Africa in particular is still lacking, as the network relies on volunteers to provide the locally broadcast aircraft communication. We obtained about 200,000 such aircraft trajectories for our ground truth and another 180,000 for the different classification categories. The raw data is obtained from OpenSky via an Impala shell and consists of so-called state vectors, which describe the state of every observed aircraft, i.e., its position, altitude, and velocity in increments of one second. All state vectors were then separated into flights by dividing the positional data messages received from each aircraft as follows: each positional state which is more than 10 minutes older than the next and is at an altitude of less than 2500 m is considered an arrival state, and hence the end of a finished flight. Note that not all flights seen by OpenSky are necessarily complete: if a flight begins or finishes outside the coverage area, the first/last message will constitute the end point of the flight. We did not differentiate between complete and incomplete flights in order to maximize the robustness of our approach. OpenSky conducts some additional processing to filter out erroneous messages and transmission-induced noise as well as potentially maliciously altered data [18].
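The flight separation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used: the reduced state representation (timestamp, altitude) and the function name split_into_flights are assumptions, while the two thresholds are taken from the text.

```python
from typing import List, Tuple

# Hypothetical reduced state vector: (unix time in seconds, altitude in metres).
State = Tuple[float, float]

GAP_SECONDS = 10 * 60      # "more than 10 minutes older than the next"
MAX_ARRIVAL_ALT_M = 2500   # "altitude of less than 2500 m"


def split_into_flights(states: List[State]) -> List[List[State]]:
    """Split one aircraft's time-ordered states into separate flights."""
    flights: List[List[State]] = []
    current: List[State] = []
    for i, (ts, alt) in enumerate(states):
        current.append((ts, alt))
        if i == len(states) - 1:
            break  # last observed state always closes the final flight
        next_ts = states[i + 1][0]
        # An arrival state: long silence afterwards AND the aircraft was low.
        if next_ts - ts > GAP_SECONDS and alt < MAX_ARRIVAL_ALT_M:
            flights.append(current)
            current = []
    if current:
        flights.append(current)
    return flights


track = [(0, 1000.0), (1, 1200.0), (2, 900.0), (1000, 500.0), (1001, 800.0)]
print(len(split_into_flights(track)))  # 2
```

Note that a long gap at cruise altitude does not split the track under this rule; such a gap is treated as a hole in receiver coverage rather than a landing.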

Aircraft Behavioural Ground Truth
To facilitate the feature selection in the next section, we required ground truth on the average flight and movement behaviour of aircraft. We first retrieved the positional data of 9880 randomly selected aircraft seen by OpenSky in the year 2017 to be able to obtain the average values as boundaries for our features. This data was capped at a maximum of 25 flights per aircraft, which resulted in more than 200,000 collected flights, with an average duration of 4669 seconds and a total of more than 30 million analyzed state vectors. Table 1 provides the details of the ground truth dataset.
We then used these randomly selected aircraft to learn the average aircraft behaviour with regard to its flight features, which are discussed in Section 5.2. For each feature, we quantized the data set into q quantiles and learned these quantiles' specific bounds. These are then used to obtain the relative behaviour of different aircraft categories for our classification task.
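The quantile step can be illustrated with a short Python sketch. It assumes that the per-aircraft feature is the share of an aircraft's values falling into each population quantile; the text does not spell out this detail, so both helper functions (quantile_bounds, bin_shares) are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def quantile_bounds(values: Sequence[float], q: int) -> List[float]:
    """Learn the q-1 inner boundaries that split `values` into q quantiles."""
    if not values or q < 2:
        raise ValueError("need data and at least two quantiles")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (k * n) // q)] for k in range(1, q)]


def bin_shares(values: Sequence[float], bounds: Sequence[float]) -> List[float]:
    """Share of one aircraft's values in each population quantile."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect_right(bounds, v)] += 1
    return [c / len(values) for c in counts]


# Bounds are learned once on the ground-truth population ...
population = list(range(100))
bounds = quantile_bounds(population, q=4)
# ... and then applied to the values of a single aircraft.
print(bin_shares([0, 10, 30, 60, 90, 99], bounds))
```

Expressing each aircraft relative to population-wide bounds makes the resulting feature vector comparable across aircraft with very different numbers of observations.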

Aircraft Metadata Ground Truth
There are several public sources which provide meta-information on aircraft based on their identifiers: the aircraft registration or the unique 24-bit address provided by ICAO. This typically includes the aircraft model (e.g., Airbus A320) and the owner/operator (e.g., British Airways), which we exploited to label our aircraft category ground truth.
We have used the following openly available sources to collect and verify the ground truth for our work:
• The OpenSky Network has recently released an aircraft database complementing its tracking efforts with crowdsourced metadata on over 495,000 aircraft (available at https://opensky-network.org/aircraft-database).
• Another non-profit project, Airframes.org, is a valuable source, offering comprehensive metadata about 609,000 aircraft identifiers. This includes background knowledge such as pictures and historical ownership information (available at http://www.airframes.org).
• For aircraft registered in the USA, the FAA provides a daily updated database of all owner records, online and for download. These naturally exclude any sensitive owner information but overall contain over 320,000 clean and well-organised records as of January 2018 (available at https://registry.faa.gov/aircraftinquiry/).
• Furthermore, the plane spotting community actively maintains many separate databases of spotted aircraft. Its members usually operate SSR receivers and manually enrich the received data with information such as operator, model, or registration. The database structure of Kinetic Avionic's BaseStation software has become the de facto standard format and is also used to exchange and share databases in forums and discussion boards. The database version we used stems from November 2017 and contains 455,457 rows of aircraft data.
• Lastly, web services such as FlightAware and FlightRadar24 provide online access to more than a million aircraft IDs (available at http://www.flightaware.com and http://www.flightradar24.com).
When considering all these databases together, we had access to metadata for 2,180,803 unique aircraft identifiers; this snapshot for our work was taken in January 2018.
Note that these sources are naturally noisy, since they rely on compiling many separate smaller databases, are often (partly) crowdsourced and change over time; aircraft are frequently registered, de-registered and transferred globally. Due to the number of aircraft involved in the experiments in this paper, we could not verify the model and operator of every aircraft by hand (i.e., by following their behaviour on web trackers and ensuring consistency with the existing database). Nonetheless, this is a realistic situation for anyone looking to accurately categorize aircraft and requires an approach which is robust to such noise fluctuations.

Aircraft Category Extraction
Based on the data provided by OpenSky and the collected metadata, we obtained flight behaviour data for eight different aircraft categories, described here in brief:
• Business jets: Business stakeholders typically fly jets seating 4-20 passengers. Gulfstream's G-range, Cessna's Citation jets and Bombardier's Learjet and Challenger aircraft are amongst the most popular choices. However, this category also comprises smaller and larger aircraft as long as they are operated for business use.
• Commercial airliners: A large group that makes up the vast majority of passenger miles in the air. It is defined by the operator, i.e., a commercial airline that conducts scheduled transport, typically with large aircraft seating 50 or more passengers (e.g., Airbus A320 or Boeing 737).
With the exception of the commercial airliners and small utility aircraft, all are potentially sensitive aircraft categories, knowledge of which is required for many use cases. We note that these categories are not determined solely by aircraft model but instead by their use cases as defined by the operator (i.e., military or not).
Indeed, there is also overlap in some military aircraft models; for example, Multi Role Tanker Transport (MRTT) aircraft fulfil several roles.
In the future, armasuisse plans to extend these categories to include unmanned aerial vehicles (UAV, or drones) and other nonstandard aircraft such as gliders or ultralight aircraft (ULAC). This requires these aircraft categories to have sufficiently broad equipage with ADS-B transponders or alternatives such as FLARM.

EXPERIMENTAL DESIGN
We describe the features used to determine aircraft behaviour and explain the experimental data set used to predict aircraft categories.

Experimental Data Sets
To select our main data set, we first queried the full sample of aircraft seen by OpenSky in January 2018, which spanned 87,000 aircraft in total. This sample was then classified into eight different categories based on operator and model metadata (see Section 4.4).
We aimed to obtain 1000 aircraft per category; however, for five of the subcategories (in particular the sensitive categories comprising military and surveillance aircraft) there are fewer aircraft with reliable identification and the necessary transponder equipment required to obtain the detailed flight behaviour data. Thus, we picked all available aircraft for fighters, surveillance aircraft, tankers, trainer and transport aircraft.
For small utility aircraft, the available pool was larger; however, because many surveillance aircraft share the same aircraft model (in particular Cessna 182s [1]), manual inspection of all aircraft and their tracks was required to accurately label the ground truth. For the abundant business and commercial categories, we randomly picked 1000 aircraft to represent each category.
Thus, the main data set used for our classification experiments consists of 6014 aircraft overall, each with a maximum of 50 flights. Table 2 provides the breakdown of all aircraft categories as well as the number of flights and individual state vectors used to obtain the classification features. The lowest number of flights (6918) and messages (751,000) could be obtained for the 921 fighter aircraft, presumably due to their comparatively rare use. At the upper end, the 1000 commercial aircraft were seen on 48,590 flights with over 12 million messages, illustrating the high utilization of commercial airliners. Overall, over 185,000 flights and almost 40 million messages were processed to obtain the behavioural features. Finally, Table 3 shows the main countries of origin of our dataset, with the US making up just under half of all aircraft, followed by several European countries, China, Australia and Canada.
Unknown Aircraft. We further obtained all features described in Section 5.2 from 1066 unknown aircraft, i.e., aircraft sending messages with identifiers for which no metadata was available from any of the structured sources. We use the communication received from these identifiers to gain insights into the potentially sensitive category of their aircraft. Naturally, we consider that there will be some noise in this dataset, which we will not be able to fully resolve due to the lack of ground truth. Thanks to OpenSky's sanity [...]
Based on the 24-bit identifier, if truthful, it is possible to obtain the country the aircraft is nominally registered in by comparing it with the official ranges defined by the ICAO [10]. Table 4 shows the main countries of origin, ranging from several European countries to China, Brazil and Australia. We find that the distribution is different from the main dataset (albeit with a small sample size); in particular, the lack of US aircraft is noteworthy.
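The country lookup from the 24-bit identifier amounts to a range search, sketched below in Python. The table is only a small illustrative excerpt of address blocks as commonly published; it is an assumption of this sketch and should be verified against the official ICAO allocation [10] before use. The function name country_of is hypothetical.

```python
from typing import Optional

# Illustrative excerpt of ICAO 24-bit address blocks: (start, end, country).
# Assumption: transcribed from the commonly published allocation table;
# check against the official ICAO source before relying on these values.
ALLOCATIONS = [
    (0x3C0000, 0x3FFFFF, "Germany"),
    (0x400000, 0x43FFFF, "United Kingdom"),
    (0x4B0000, 0x4B7FFF, "Switzerland"),
    (0x7C0000, 0x7FFFFF, "Australia"),
    (0xA00000, 0xAFFFFF, "United States"),
]


def country_of(icao_hex: str) -> Optional[str]:
    """Return the nominal country of registration, or None if not in the table."""
    address = int(icao_hex, 16)
    for start, end, country in ALLOCATIONS:
        if start <= address <= end:
            return country
    return None


print(country_of("4b1234"))  # Switzerland
print(country_of("000001"))  # None (outside the excerpt above)
```

As the text notes, the result is only as trustworthy as the transmitted identifier itself.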
We have several hypotheses and explanations for the absence of these unknown aircraft from available public sources:
(1) Sensitivity: Highly sensitive military or state aircraft are excluded from public records in most countries. Depending on their missions, their country, and their use cases, hobbyist plane spotters may not be able to fill these gaps with information gleaned through traditional planespotting.
(2) Novel aircraft: Depending on the quality of the public or private records, aircraft in many countries take several weeks or months until they turn up in public databases.

Feature Extraction
We selected 12 different features, divided into two main categories: flight level and state vector features. We explain these categories in the following; a full list of the chosen features is presented in Table 5. We also contrast these with non-behavioural features, which we chose not to integrate into our approach.
Non-behavioural Features. There are potential features available in OpenSky that can be derived from the content of the communication of the aircraft rather than its behaviour. Such non-behavioural features range from the aircraft's call sign and squawk code (see footnote 4) to the contents provided by the Mode S Enhanced Surveillance (EHS) protocol features used by many aircraft. We have decided not to use these for our classification task for the following reasons: First, they can easily be changed, manipulated or spoofed by the aircraft operator. Second, these communication options are not consistently used: over 50% of aircraft do not broadcast any information besides position and velocity.

Feature Analysis
Feature Correlation. Fig. 2 shows the correlation between the features calculated on the main dataset. We can see strong relationships mainly between the horizontal velocity and acceleration features: aircraft with many values in high X-velocity and acceleration bins also exhibit this behaviour in the Y-direction. On the other hand, many aircraft either fall into long flights with constant middling speeds (e.g., commercial aircraft), or instead exhibit many very low and very high speed and acceleration values over the course of their flights, typical for fighter jets or trainer aircraft.
Footnote 3: The actual position in longitude and latitude values itself is not relevant, as it does not generalize to be a distinguishing feature across aircraft models and continents.
Footnote 4: A squawk is a 12-bit code (four octal digits) used for local differentiation of the aircraft, introduced before the era of globally unique 24-bit ICAO codes.
Feature Quality. To obtain a clearer view of how the classification works and to identify potentially detracting features, we estimated their quality.
There is a given amount of uncertainty associated with the aircraft category: its entropy. This amount depends both on the number of classes (i.e., aircraft categories) and the distribution of the samples between them. As each feature reveals a certain amount of information about the aircraft category, this amount can be measured through the mutual information (MI). In order to measure the mutual information relative to the entire amount of uncertainty, the relative mutual information (RMI) is used. The RMI measures the percentage of entropy that is removed from the aircraft category (cat) when a feature (F) is known [3]. The RMI is defined as
$$\mathrm{RMI}(cat, F) = \frac{H(cat) - H(cat \mid F)}{H(cat)},$$
where H(A) is the entropy of A and H(A|B) denotes the entropy of A conditional on B. In order to calculate the entropy of a feature, it has to be discrete. As most features are continuous, we perform discretization using an Equal Width Discretization (EWD) algorithm with 20 bins [6]. This algorithm typically produces good results without requiring supervision. As outliers may have a drastic effect on the RMI computation, we use the 1st and 99th percentiles instead of the minimum and maximum values to compute the bin boundaries in order to prevent large distortions. A high RMI indicates that the feature is distinctive on its own, but it is important to consider the correlation between features as well when choosing a feature set. Additionally, features may be more distinctive when combined, even when they are not particularly useful on their own. Figure 3 shows the RMI for each of our selected behavioural features, with the colors indicating their physical feature group (positional, velocity, acceleration, or flight level). Overall, the velocity and acceleration features (red and blue, respectively) share the most information with the aircraft category, with many of these having an RMI of 15% or more.
The positional and flight level features are relatively less distinctive, which suggests that, for example, the distribution of heading values or the overall flight durations are more common to any aircraft mission than a consistent behavioural feature of a category. However, we choose to keep all features for our classification to produce the best results.
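The feature quality estimate described in this section, i.e., equal width discretization into 20 bins between the 1st and 99th percentiles followed by the relative mutual information, can be sketched as follows. This is a minimal pure-Python illustration; the nearest-rank percentile and the clipping of outliers into the outermost bins are assumptions of this sketch, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def percentile(ordered: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of pre-sorted data, p in [0, 100]."""
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[min(len(ordered) - 1, max(0, rank - 1))]


def equal_width_bins(values: Sequence[float], n_bins: int = 20) -> List[int]:
    """Equal width discretization between the 1st and 99th percentiles."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, 1), percentile(ordered, 99)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    # Assumption: values outside [lo, hi] are clipped into the outer bins.
    return [min(n_bins - 1, max(0, int((v - lo) / width))) for v in values]


def relative_mutual_information(categories: Sequence[Hashable],
                                feature_bins: Sequence[int]) -> float:
    """RMI = (H(cat) - H(cat | F)) / H(cat)."""
    h_cat = entropy(categories)
    if h_cat == 0:
        return 0.0
    groups = {}
    for cat, b in zip(categories, feature_bins):
        groups.setdefault(b, []).append(cat)
    n = len(categories)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_cat - h_cond) / h_cat


cats = ["fighter"] * 4 + ["commercial"] * 4
print(relative_mutual_information(cats, [0, 0, 0, 0, 1, 1, 1, 1]))  # 1.0
print(relative_mutual_information(cats, [0, 1, 0, 1, 0, 1, 0, 1]))  # 0.0
```

The two toy calls show the extremes: a feature that separates the categories perfectly removes all of the entropy, while a feature that is spread evenly across both categories removes none.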

Classification
For our experiments, we compare the performance of two different classifiers, the Random Forest (RF) algorithm and Support Vector Machines (SVM), using 5-fold cross validation for each.
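For readers unfamiliar with the procedure, a 5-fold split can be sketched in plain Python as below. Whether the folds in the experiments were shuffled or stratified by category is not stated, so the shuffling, the seed and the function name k_fold_indices are assumptions of this sketch.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumption: shuffled folds
    folds = [indices[i::k] for i in range(k)]     # near-equal fold sizes
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 6014 is the size of the main data set described in the text.
for train, test in k_fold_indices(6014):
    print(len(train), len(test))
```

Each classifier is then trained on the four training folds and evaluated on the held-out fold, and the reported accuracy is averaged over the five runs.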

RF. We chose a maximum number of splits of 20 and 300 decision trees (or learners) as parameters for the RF algorithm. In a classification problem such as the one we consider, the mode of all classes predicted by the individual trees is then used as the overall output.
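The aggregation step of the forest, taking the mode of the per-tree predictions, is simple enough to show directly. The sketch below assumes the individual tree outputs are already available as a list of category labels; the function name forest_predict and the vote counts are illustrative.

```python
from collections import Counter
from typing import Sequence


def forest_predict(tree_votes: Sequence[str]) -> str:
    """Return the mode of the per-tree predictions (first seen wins on ties)."""
    if not tree_votes:
        raise ValueError("need at least one tree prediction")
    return Counter(tree_votes).most_common(1)[0][0]


# Made-up votes from 300 trees for one aircraft.
votes = ["tanker"] * 120 + ["transport"] * 180
print(forest_predict(votes))  # transport
```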
SVM. We tested a linear, a quadratic, a cubic and several Gaussian kernel functions. For all kernels we varied the soft margin constant C between 1 and 100. We further used the automatic kernel scaling mode of the MATLAB classification learner app, which is a heuristic procedure to select the scale value using subsampling. The best results were achieved with the cubic SVM kernel and C = 1.5, which we used for our experiments. Finally, we used the one-vs-one method to handle the multi-class problem.
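Since an SVM is a binary classifier, the one-vs-one method trains one model per pair of categories and lets them vote. The Python sketch below shows only this voting scheme; the pairwise classifiers are stand-in callables rather than trained SVMs, and the category labels are shortened names for the eight categories used in this paper.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Tuple

CATEGORIES = ["business", "commercial", "fighter", "small utility",
              "surveillance", "tanker", "trainer", "transport"]

# One binary classifier per unordered pair: 8 * 7 / 2 = 28 in total.
PAIRS = list(combinations(CATEGORIES, 2))

Pair = Tuple[str, str]


def one_vs_one_predict(sample: object,
                       pairwise: Dict[Pair, Callable[[object], str]]) -> str:
    """Each pairwise classifier casts one vote; the most voted class wins."""
    votes = Counter(clf(sample) for clf in pairwise.values())
    return votes.most_common(1)[0][0]


# Stand-in classifiers (hypothetical): always prefer "fighter" when it is in
# the pair, otherwise the first class of the pair.
dummy = {p: (lambda s, p=p: "fighter" if "fighter" in p else p[0]) for p in PAIRS}
print(len(PAIRS))                       # 28
print(one_vs_one_predict(None, dummy))  # fighter
```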

RESULTS
In this section, we describe our results from the classification task. We first show the accuracy of the aircraft categorization, before we discuss the effectiveness of our approach for detecting surveillance and military aircraft. Second, we classify the set of unknown aircraft and obtain the best prediction for their aircraft categories. Finally, we illustrate the effectiveness of our approach in a case study of a detected surveillance aircraft.

Aircraft Categorization
The results of the classification show whether aircraft categories can be distinguished purely on their movement behaviour. The aircraft features were obtained with the minimum number of flights f_min = 30 and number of feature quantiles q = 10.
RF. Fig. 4 shows the detailed results of the classification based purely on flight level and state vector features. With an average accuracy of 85.1%, the classifier can overall accurately classify aircraft into different categories. Naturally, there are quantitative differences between the aircraft categories: commercial airliners (97% true positive rate) and fighter aircraft (95%) are the most accurate classes, potentially owing to their distinct and consistent behaviours and capabilities. The least accurate classes are military transport and tanker aircraft, both with a true positive rate of 68%. Looking more closely into the misclassified aircraft, the most common class mistaken for tankers is the military transport aircraft, which is sensible as many more modern tanker aircraft fulfil multiple roles, in particular those of transport aircraft [8].
Figure 5: Confusion matrix of the classification task using SVM (obtained with q = 10, f_min = 30)
SVM. Fig. 5 shows the results on the same task using the SVM classifier. With an overall accuracy of 87.1%, it is more accurate than the RF classifier. In particular, all individual categories exhibit a true positive rate of at least 75%, with the weakest class again found in the tanker aircraft, and the best results for the commercial and fighter classes. Similar to the random forest classifier, it is noteworthy that the main misclassifications happen between surveillance and small utility aircraft, which may often share the same basic aircraft model and behaviour (e.g., short and direct point-to-point flights) until the surveillance use case is required.

Effects of Number of Flights and Feature Quantiles. We further examined the effects of two parameters on the accuracy of the classification: first, the number of flights f_min collected for each aircraft's feature creation, and second, the number of quantiles q into which the state vector features were divided. Fig. 6 illustrates these relationships. With only a single flight used to create the features, the overall classification accuracy is at 61%; it quickly increases to over 80% with 5 collected flights, at which point it becomes sufficiently accurate for reasonable use cases. Increasing the number of flights per aircraft further, the accuracy increases to over 85% at 30 flights and 88.1% at 50 flights. All results were obtained with q = 10 and represent the mean of 100 classifications. For the number of feature quantiles, there is also a positive relationship with the classification accuracy. With the minimum of q = 5 the accuracy is at 84%, increasing to 86.9% at q = 10, and only increasing marginally thereafter until leveling off at q = 40 and 87.0% (f_min = 35 flights were used for this comparison, 100 repetitions). Further increases to q = 50 show no positive effect, even slightly hurting the classification accuracy, which falls to 86.9%.
Table 6 shows the classification of approximately 1000 aircraft for which no data was available in any publicly accessible database at the time of our snapshot. All selected aircraft had at least 10 flights and 500 state vector data points available for their feature creation, to reduce the amount of noise to a minimum and ensure that these are consistently used aircraft identifiers. To obtain categories for these aircraft, we used the random forest classifier trained on the known aircraft data as described above. As an ensemble classifier it provides confidence scores, i.e., the percentage of times a sample has been classified as a particular category.
We used these scores as a cut-off threshold, i.e., any sample that did not reach a score of at least 0.5 in one of the eight classes was judged as too uncertain to provide useful insights. Taking this into account, 52.3% of all aircraft were classified confidently into one group. Table 6 shows the full results. The commercial aircraft could overwhelmingly be verified using the most current online source, FlightRadar24, as having been put into service after the time our metadata snapshot was taken in January 2018. Indeed, of the 316 aircraft, 305 were classified correctly, with the 11 misclassifications being larger business jets. The new airliners in this set included, for example, 9 Boeing Dreamliners delivered to Norwegian in the first half of 2018 [4] or new aircraft in China, one of the biggest growth markets for commercial aviation. We further find that a large number of aircraft are seen by the classifier as business and small utility aircraft (10.9% and 4.6%, respectively). This is plausible, as information on such private aircraft is not necessarily well-publicized, potentially even sensitive, and many countries outside the English-speaking world either do not require such aircraft to be on a public register or do not publish any aircraft register at all. While we naturally still cannot verify the accuracy of the classification, many such classified aircraft are regulars at typical business airports (e.g., Farnborough, UK or Teterboro, US), improving our confidence. The final large group was made up of surveillance aircraft (6.9%), whose sensitivity provides a clear motivation for not publishing their meta information. We discuss a detailed case study on such an aircraft in the next section. There was a small minority of aircraft classified as trainer aircraft (0.2%). Finally, no military fighter, transport, or tanker aircraft were found in this dataset.
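The confidence cut-off can be sketched as follows, under the reading that an aircraft is only assigned a category if its highest vote share reaches the threshold. The score dictionaries are made-up examples, and the function name confident_label is hypothetical.

```python
from typing import Dict, Optional

CONFIDENCE_THRESHOLD = 0.5  # cut-off used in the text


def confident_label(scores: Dict[str, float],
                    threshold: float = CONFIDENCE_THRESHOLD) -> Optional[str]:
    """Return the top category if its score reaches the threshold, else None."""
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else None


# Made-up vote shares of the forest for two unknown aircraft.
print(confident_label({"commercial": 0.92, "business": 0.08}))  # commercial
print(confident_label({"business": 0.40, "small utility": 0.35,
                       "surveillance": 0.25}))                  # None
```

Aircraft for which the function returns None correspond to the samples judged too uncertain to be assigned to one group.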

Analysis of Unknown Aircraft
Detection of Surveillance Aircraft. Fig. 7 shows seven flights of an example aircraft in Croatia that was classified with very high confidence as a surveillance aircraft. While no information about this aircraft is available, as it does not appear in any database, it clearly exhibits the patterns of an aircraft used for surveillance of a narrow area, which are picked up by the classifier. From this, we can see that our approach generalizes across different countries and their surveillance institutions and is able to detect surveillance aircraft around the globe.

DISCUSSION
We now discuss the intended applications for Classi-Fly, its limitations, and potential countermeasures to our approach.

Applications
The key objective for armasuisse was to obtain metadata as input for radar and research applications, including potentially adversarial settings such as a target being aware of the analysis of its communications and movements. Considering this, Classi-Fly was developed to not require cooperation of the aircraft so that it is robust even against active distortions of the transponder communication.
Thus, Classi-Fly can be used as input for open source intelligence on military and government agencies [23]. With regard to such use cases, it is also important to consider whether it may be possible to refine Classi-Fly further to identify specific planes or operators with reasonable accuracy and whether there are countermeasures against it, as discussed below.
Finally, Classi-Fly can contribute towards open data initiatives such as the OpenSky Network aircraft metadata database, which is used for a wide variety of research applications (e.g., [17,20,22]).

Limitations
The greatest limitation of Classi-Fly is the inherent non-specificity of some categories. For example, it is difficult to identify the precise use case of a business aircraft; besides business travel, the same Gulfstream G550 could be used for transport of goods for the military or people for leisure. However, with further research into potential subcategories and how to define them based on metadata such as the operator or owner or the airports frequented, this could be mitigated and their different behaviour learned. This applies also to currently neglected aircraft categories such as UAV and ULAC, which will become transponder-equipped in larger numbers in the future and are a major interest of armasuisse Science + Technology.

Refinement
Besides improving the category ground truth, other, non-behavioural features can be integrated into Classi-Fly. As many wireless standards (not only in aviation) give manufacturers a large amount of freedom over the actual soft- and hardware implementations, differences emerge that can be used as classification features.
On the physical layer, [11] proves that it is feasible to distinguish aircraft transponders based on anomalies in the frequency stability of their messages. On the data link layer, research has exploited differences in the transponders' random backoff algorithms [21].
Besides these approaches, it is possible to add a host of features derived from the actual message content sent out by the aircraft. In a non-adversarial setting where the aircraft operators do not actively seek to obfuscate their identity (beyond excluding it from public databases), this would greatly improve classification accuracy.
Overall, we assume that certain uncommon aircraft may be individually identifiable through the combination of features. Future work will thus consider the possible granularity that several approaches can provide if they are combined and further quantify the privacy impact for aircraft owners and operators.

Countermeasures
As our classification approach is agnostic to any non-behavioural features, it is difficult to apply any effective countermeasures against it. Related work [22] has looked at countermeasures to the basic enabling mechanisms of aircraft tracking, which is generally based on the ICAO identifier or other directly identifying information broadcast voluntarily by the aircraft (such as its registration). There are two popular privacy-preserving approaches to aircraft tracking found in the aviation industry: the first consists of not displaying aircraft on popular web feeds (such as FlightRadar24 or FlightAware), the second comprises the use of shell companies to hide the real owners of an aircraft and thus undermine the collection of accurate metadata. Both ideas, while certainly popular, are ultimately not effective against a moderate threat model [22].
The most effective countermeasure as concluded by the literature consists of the randomisation of the aircraft's ICAO identifier, making it difficult to continuously track the same aircraft over time. If done globally for all aircraft, and in conjunction with other pseudonymisation measures regarding the registration, it could effectively thwart consistent aircraft tracking and by extension also Classi-Fly. However, the cat may largely be out of the bag already; with the current widespread availability of comprehensive aviation data there is sufficient input available for training.
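Why identifier randomisation would undermine the approach can be illustrated with a toy simulation. The sketch below is an assumption-laden illustration, not part of Classi-Fly: the function names, the fleet size, and the per-flight pseudonym scheme are hypothetical, and the 24-bit identifier width is general knowledge about ICAO addresses rather than something stated here. It only counts how many identifiers accumulate the f_min flights that per-aircraft feature creation depends on.

```python
import random
from collections import Counter


def observed_identifiers(num_aircraft, flights_each, randomise, seed=0):
    """Return the identifier seen on every flight of a toy fleet."""
    rng = random.Random(seed)
    seen = []
    for aircraft in range(num_aircraft):
        for _ in range(flights_each):
            if randomise:
                # Hypothetical scheme: a fresh 24-bit pseudonym per flight.
                seen.append(rng.getrandbits(24))
            else:
                # Fixed identifier: every flight links to the same aircraft.
                seen.append(aircraft)
    return seen


def trackable(identifiers, f_min):
    """Count identifiers observed on at least f_min flights."""
    counts = Counter(identifiers)
    return sum(1 for n in counts.values() if n >= f_min)


fixed = observed_identifiers(100, 20, randomise=False)
pseudonymous = observed_identifiers(100, 20, randomise=True)
print(trackable(fixed, f_min=5), trackable(pseudonymous, f_min=5))
```

In this toy setting every aircraft with a fixed identifier reaches the flight threshold, whereas per-flight pseudonyms leave essentially no identifier with enough linked flights to build behavioural features from.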
Lastly, aircraft could deliberately change their behaviour to avoid detection and classification. However, this has the major drawback that the aircraft may no longer be able to fulfil its intended function, for example surveillance aircraft not circling their target, or military fighter jets deliberately flying slowly. This limits the potential benefit of such an option.

CONCLUSION
In this work, we presented Classi-Fly, a method used by armasuisse Science + Technology to infer the categories of aircraft, both anonymous and known, based purely on their movement behaviour. We validated our approach using publicly available flight data, comprising several hundred thousand flights with tens of millions of states, in conjunction with meta information obtained from publicly available aircraft registries. Our results show that we can obtain the category of an aircraft with a likelihood of almost 90%, based on features obtained from 50 flights or fewer. In cases where no metadata is publicly available for an aircraft, we show that our approach can be used to create this data, which is necessary for many research projects based on air traffic communication. Finally, we have examined a case study showing that it is possible to automatically discover sensitive aircraft in a large data set using Classi-Fly, including police, surveillance and military aircraft.