VITAL-SIGN DATA FUSION MODELS FOR POST-OPERATIVE PATIENTS

Deterioration in patients who undergo upper-gastrointestinal surgery may be evident in the vital signs prior to adverse events. A dataset comprising observational vital-sign data from 128 post-operative patients was used to explore the trajectory of patients' vital-sign changes during their stay in the post-operative ward. A model of normality based on pre-discharge data from patients who had a “normal” recovery was constructed using kernel density estimates, and tested with “abnormal” data from patients who deteriorated sufficiently to be re-admitted to the Intensive Care Unit. The results suggest that the criticality of post-operative patients can be evaluated by assessment of the distributions of their vital signs after their admission to the post-operative ward.


I. INTRODUCTION
Delayed detection of clinical deterioration has been repeatedly associated with high rates of avoidable in-hospital death and Intensive Care Unit (ICU) readmissions (which are associated with a substantially increased mortality rate) [1], [2], [3].
According to large national surgical audits such as the UK National Confidential Enquiry into Post-operative Deaths, current systems of post-operative care fail to detect or respond appropriately to early signs of critical illness [4]. Such failures have been explained by lack of experienced senior nursing staff, inexperienced trainee medical staff [4], poor quality of care offered to critically ill patients [1], [5], and, more importantly, the inability of current systems to recognise clinical deterioration early. All of these factors can lead to deterioration in a patient's condition and admission to the ICU, or death.
The UK National Institute for Health and Clinical Excellence (NICE) [6] has recommended that physiological track and trigger (T&T) systems should be used to monitor all adult patients in acute hospital units, in order to promote the recognition of patient deterioration early enough to allow proper intervention by medical staff. These systems are based on early warning scores (EWS) calculated from the values of physiological variables observed periodically (such as heart rate, HR, measured in beats per minute; respiratory rate, RR, measured in breaths per minute; arterial blood oxygen saturation, SpO2, measured as a percentage; systolic blood pressure, SysBP, measured in mmHg; core temperature, measured with a tympanic thermometer in °C; and level of consciousness, typically assessed with the Glasgow Coma Scale, GCS). Univariate scoring criteria are applied to each physiological variable (vital sign) in turn, and care is then escalated to a higher level if any of the scores assigned to individual vital signs, or the sum of all such scores, exceeds some threshold. There is widespread interest in, and clinical use of, these scores in countries across Europe and Australasia, and increasingly in North America [7]. However, the quality of evidence supporting the use of T&T systems is poor [7], and they have a number of disadvantages. The thresholds and ranges of these EWS systems are mostly determined heuristically (although evidence-based methods have recently been proposed [8], [9]). Furthermore, each vital sign is treated independently and correlations between them are not taken into account. The clinical setting from which data are acquired for either validating or designing the EWS system is also an important consideration. Many studies have been conducted in Medical Assessment Units [7], [8], and it is questionable whether the scores can be extrapolated to other settings, such as post-operative wards or general wards.
An alternative approach to detecting patient deterioration from changes in vital signs is that of novelty detection [10], [11], [12], or one-class classification, which involves the construction of a multivariate, multimodal model of normality using examples of normal vital signs. This then allows the classification of test data as either "normal" or "abnormal" with respect to that model. Several approaches to novelty detection have been proposed, and an extensive review of these techniques is presented in [13], [14]. We have shown how novelty detection can be combined with continuous vital-sign monitoring of acutely ill in-hospital patients [15], [16], [17], [18].
In this paper, we investigate models of normality tuned to a specific patient population. We present results of the analysis of data acquired during phase one of a two-phase clinical trial in the post-operative ward of the Cancer Centre, Oxford University Hospitals NHS Trust, Oxford, UK (approved by the local Research Ethics Committee, REC reference: 08/H0607/79). The data consist of the vital-sign measurements recorded periodically by the nurses from ambulatory patients in this ward, as well as demographic data and outcomes.
These patients are recovering from upper gastro-intestinal (GI) surgery. They start in their most acute state and gradually stabilise. We learn the vital-sign trajectories associated with "normal" recovery of these patients, the aim being to identify "abnormal" trajectories. We also study the 24-hour variability of each physiological parameter during the patients' stay on the post-operative ward. Multivariate models of the distribution of both vital-sign and vital-sign variability data from "normal" patients, which describe the normal trajectories, are constructed using probabilistic and discriminative approaches (such as kernel density estimates and one-class support vector machines, respectively) and tested on "abnormal" data from patients who deteriorate sufficiently after surgery to be re-admitted to the ICU. It is this cohort of patients that we wish to identify as soon as possible after entry to the ward, with the goal of improving their outcomes.

A. Contributions of this paper
(i) We analyse a new dataset containing manual observations of vital signs acquired from surgical patients (which is described in Section II-A).
(ii) Trajectories associated with physiological recovery of patients after major surgery are described in Section II-B. The variability of each individual physiological variable is introduced and seems to be a key indicator of recovery.
(iii) We adopt a machine learning approach to fuse the observational vital-sign data acquired by nurses from ambulatory patients. In Section III, we describe the incorporation of variability indices into our models of normality, which are computed using four different methods, the majority of which have not previously been applied to patient monitoring data. Results are presented and discussed in Section IV, and finally in Section V we give some brief conclusions.

II. VITAL-SIGN ANALYSIS
The Computer Alerting Monitoring System 2 (CALMS-2, Oxford, UK) trial has been designed to assess whether monitoring of vital signs with computer-modelled alerting to detect patient deterioration reduces the patient's length of stay in hospital by alerting staff to clinical deterioration more effectively than current paper-based systems. A total of 200 patients were recruited during Phase I of the clinical trial.

A. Dataset
Vital-sign data were recorded by nursing staff during their regular observations of post-operative patients in the Upper GI ward at the Oxford Cancer Centre. The dataset used for the work described in this paper comprises measurements of HR, RR, SpO2, SysBP and temperature acquired by ward staff every hour or every two hours in the first days after patient admission (depending on the patient's condition), and every four hours in the last days of the patient's stay on the ward. These measurements were then transcribed by two independent research nurses into an electronic database. The dataset was first refined to include only observations with no missing physiological variables (for example, if an observation from a patient did not include HR, it was removed from the dataset). We divided the patients from the data collection phase of the CALMS-2 trial into two groups: the "normal" group of patients who were discharged home after their stay on the ward (177 or 88.5% of the patients), and the "abnormal" group (23 or 11.5% of the patients), which comprises those patients who were deemed by clinicians to be sufficiently "abnormal" (due to post-operative complications) that they required re-admission to the ICU (18 patients), as well as those who died unexpectedly on the ward (5 patients). We note that the mortality rate in the "abnormal" set of patients was above 20%, which shows the severity of the risk associated with ICU re-admission. Table I summarises the patient demographics and outcomes for both groups. The age of the 200 patients ranged from 20 to 82 years, with 58% male and 42% female. The median length of stay on the ward for the "normal" group was 10 days (25th percentile: 6 days; 75th percentile: 12 days). The equivalent figures for the "abnormal" group are 17 days (25th percentile: 7 days; 75th percentile: 31 days); i.e., a much higher median and 75th percentile than for the "normal" group, because the length-of-stay figures are skewed by the time spent in the ICU after re-admission from the ward. Of more relevance for the "abnormal" group is the median time to the abnormal event (re-admission to the ICU or death on the ward): 5 days with an inter-quartile range (IQR) of 4 days. As a result, the number of patients from the "abnormal" group still on the ward on Day 6 is halved with respect to the number on Day 1, and so our analysis of their physiological variables is restricted to the first 10 days after surgery.

B. Physiological variables pre-analysis
We first analyse the trajectory of each vital-sign variable for the 200 patients. Figure 1 (left column) shows the mean values, for each 24-hour period, of the five vital signs for the 177 patients in the "normal" group and for the 23 patients in the "abnormal" group. The mean values are displayed for the length of stay (or time to event in the case of the "abnormal" group) up to the 75th percentile (13 days and 10 days, respectively) for each group of patients.
Firstly, we note that the post-surgical recovery phase lasts, on average, six days. By Day 6, for example, the SysBP has reached its steady-state value. The SysBP for the "abnormal" group of patients has a very similar trajectory. In both cases, the values are well within the bounds of what would be considered to be normality; i.e., if we consider the EWS system in use at the time of the study, a score of 0 (corresponding to normal values) for SysBP covers the range from 100 to 180 mmHg, and this is typical of all EWS scoring systems [8], [9]. We can also observe that there is no physiologically significant difference in the values of RR between the two groups. The mean values for each day are within 1 breath per minute of each other throughout. The mean HR values are mostly between 80 and 90 beats per minute, with a peak at around 95 beats per minute on Day 6 for the "abnormal" group, but this is not likely to be significant. There is no clear pattern for SpO2, which may be due to the fact that most patients in both groups are, for varying lengths of time, on oxygen masks. Finally, although there is a 0.4°C rise in the mean temperature of the "abnormal" group for Days 8 and 9, there appear to be no significant differences between the two groups in the mean temperature readings from Day 1 to Day 7.
The mean values of the physiological variables, for each day, do not show any significant differences between the two groups of patients. Yet, over 10% of the patients have abnormal physiology in some sense, leading to their eventual re-admission to ICU or death on the ward (see Table I). We hypothesise that abnormal physiology for this group may be characterised instead by abnormal variation about the mean. We define a "variability index" to be the difference between the maximum and minimum values in a 24-hour period, for each physiological variable. The variability index was determined only for days on which four or more observations were made by the nurses. These variability indices are represented for both groups of patients in Figure 1 (right column).
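The variability index defined above can be sketched in a few lines. This is a minimal illustration rather than the code used in the study: the function name, the `(day, value)` input layout, and the example RR readings are all assumptions made for the sketch, and each 24-hour period is assumed to be identified by an integer day number.

```python
from collections import defaultdict

MIN_OBS_PER_DAY = 4  # from the text: days with fewer observations are skipped


def variability_index(observations):
    """Return {day: max - min} for one vital sign of one patient.

    `observations` is an iterable of (day, value) pairs (assumed layout).
    Days with fewer than MIN_OBS_PER_DAY observations are omitted.
    """
    by_day = defaultdict(list)
    for day, value in observations:
        by_day[day].append(value)
    return {
        day: max(values) - min(values)
        for day, values in sorted(by_day.items())
        if len(values) >= MIN_OBS_PER_DAY
    }


# Hypothetical RR readings (breaths per minute) for one patient.
rr = [(1, 16), (1, 22), (1, 18), (1, 25),
      (2, 17), (2, 19),
      (3, 15), (3, 16), (3, 18), (3, 17)]
rr_variability = variability_index(rr)
print(rr_variability)
```

Day 2 has only two observations in this example, so no index is produced for it.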
If we consider initially the variability data from Figure 1 for the "normal" group of patients, it is clear that variability is high, for all physiological variables, for the first two to four days on the ward. After this, variability decreases as the process of recovery from surgery takes place. The most relevant plots in Figure 1 for the "abnormal" group of patients are those for RR and SysBP. On Days 1 and 2, the variability is very high (higher than for the "normal" group), and although it also decreases for both variables from Day 3 onwards, the pattern of gradual reduction is not maintained. The mean values for the 24-hour variability index remain above 6 breaths per minute for RR and above 30 mmHg for SysBP. Any pattern in the variability of HR or the variability of SpO2 is hard to discern. There is possibly increased 24-hour variability in the HR of the "abnormal" group on Days 5 and 6, but there are not enough data for the evidence to be conclusive. It is also difficult to interpret the SpO2 pattern because a significant proportion of patients will have been on oxygen masks. Finally, there do not appear to be any significant differences in diurnal variations in temperature between the two groups of patients.
In short, the results show that the 24-hour variability indices of RR and SysBP during the first two weeks after admission to the Upper GI ward differ between the "normal" and "abnormal" groups, and they may therefore be good predictors of ICU re-admission (or death) for surgical patients.

III. A MACHINE LEARNING APPROACH
In order to study the ability of the 24-hour variability of RR and SysBP to identify "abnormal" patients, we consider the construction of two types of model of normality: one that takes as inputs the five vital signs (dimensionality of the input space, $D_1 = 5$), and another which takes not only the five vital signs, but also the variability indices of RR and SysBP as input variables (dimensionality of the input space, $D_2 = 7$). We first study the physiological trajectory during recovery of the patients in both groups (Section III-B) using the same strategy described in Pimentel et al. [19]. Then, four different novelty detection approaches are explored for the detection of patient deterioration from changes in vital signs and vital-sign variability (Section III-C).

A. Normalisation
We assume a priori that each physiological variable (including the two variability indices) has equal importance in the patient model. Therefore, each variable should first be scaled to have approximately the same dynamic range to ensure that variables with large changes (e.g., blood pressure in mmHg) do not dominate parameters with smaller changes (e.g., temperature in °C). In order to scale the physiological variables, Tarassenko et al. [15], [20] and Hann [16] normalise every vital-sign measurement, $x$, using the zero-mean unit-variance transformation, $x_n = (x - \mu)/\sigma$, where $x_n$ is the normalised value and $\mu$ and $\sigma$ are the parameter set mean and standard deviation, respectively, for the "normal" group of 177 patients.
We apply the zero-mean unit-variance normalisation to each one of the five vital signs. However, we take a slightly different approach to normalise the variability indices. From the results presented in Section II-B, we observe that the "normal" values of variability change significantly with time. In order to illustrate this point, we consider the example represented in Figure 2. The distributions of the 24-hour variability of RR for each day (the overall distribution is represented on the right, with a different colour) for the "normal" patients are shown on the vertical axis. The red line corresponds to the variability index of RR of one "abnormal" patient, who was re-admitted to the ICU after 9 days on the ward. As we can see, the variability index of this patient on Day 3 and Day 7 (see black arrows on the figure) is 9 breaths per minute. While this value looks normal with respect to the "normal" group of patients for Day 3, it is highly "abnormal" on Day 7, as it is far away from the mean of the distribution of the "normal" population for this day. A similar result is obtained for the SysBP variability index. Hence, we normalise every 24-hour variability index of RR and SysBP, $x_j$, using a zero-mean unit-variance transformation for each day, $x_j^* = (x_j - \mu_j)/\sigma_j$, where $x_j$ is the value observed on day $j$, $x_j^*$ is the normalised value, and $\mu_j$ and $\sigma_j$ are the parameter set mean and standard deviation of day $j$, respectively, for the "normal" group of 177 patients.
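The day-specific normalisation can be illustrated with a short sketch. The function names and the numbers below are hypothetical (they are not taken from the dataset), and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def fit_daily_stats(normal_values_by_day):
    """Per-day mean and standard deviation from the "normal" group only.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return {day: (mean(vals), stdev(vals))
            for day, vals in normal_values_by_day.items()}


def normalise(value, day, stats):
    """Zero-mean unit-variance transform using the statistics of day j."""
    mu_j, sigma_j = stats[day]
    return (value - mu_j) / sigma_j


# Hypothetical "normal"-group RR variability indices (breaths per minute).
normal_rr_variability = {3: [7, 9, 11, 9], 7: [3, 4, 5, 4]}
stats = fit_daily_stats(normal_rr_variability)

# The same raw index of 9 is unremarkable on Day 3 but extreme on Day 7.
z_day3 = normalise(9, 3, stats)
z_day7 = normalise(9, 7, stats)
print(z_day3, z_day7)
```

The sketch mirrors the Figure 2 example: an identical raw value maps to very different normalised values depending on the day against which it is compared.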

B. Physiological Trajectory
We have previously considered the construction of a model of normality, based on the average of the observations made on the last day on the ward (discharge day) of each patient from the "normal" group [19]. This subset of data contains the vital signs from the most physiologically stable period of the patient stay, because these data were acquired immediately prior to discharge from the ward, when the patient is at their most "normal" after recovering from surgery. In the current study, this set of "normal" pre-discharge data contains 177 vital-sign vectors (corresponding to the average of the vital-sign measurements made on the last day for each patient), $X_A \in \mathbb{R}^{D_A}$ with $D_A = 5$, which are subsequently used for the construction of one model of normality. For the second model of normality, we use the same 177 vectors augmented with the variability indices of RR and SysBP, $X_B \in \mathbb{R}^{D_B}$ with $D_B = 7$.
A kernel density estimate [21] is a technique that allows an underlying D-dimensional vital-sign probability density function (pdf) to be estimated from training data. A kernel density estimate was chosen because it is a non-parametric method, so no a priori assumptions are made about the form of the probability distribution. The pdf of each set of the $N = 177$ prototype vectors, $x^T_1, \ldots, x^T_N$ with $T = \{A, B\}$ (in the following equations, $x^T$ and $x^T_i$ are generically represented by $x$ and $x_i$, respectively), is estimated using the following equation:

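$$p(x \mid x_1, \ldots, x_N, \sigma) = \frac{1}{N (2\pi)^{D/2} \sigma^{D}} \sum_{i=1}^{N} \exp\left(-\frac{\lVert x - x_i \rVert^{2}}{2\sigma^{2}}\right) \qquad (1)$$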
which is a weighted sum of Gaussian kernels centred at the 177 prototype vectors, $x_i$, and where each kernel is isotropic with variance $\sigma^2$. The variance was determined using the nearest-neighbour method proposed by Bishop [10], in which the average of the squared Euclidean distance to the set of 10 nearest neighbours $\{NNs\}$ is determined for each point and $\sigma^2$ is estimated by calculating the average over all points:

$$\sigma^{2} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{10} \sum_{x_j \in \{NNs\}_i} \lVert x_i - x_j \rVert^{2}$$

The likelihood for all data from the "normal" group of patients was then calculated using Equation (1). The likelihood of all data from the "abnormal" group of patients, prior to the occurrence of an adverse event (either death or ICU admission), was also evaluated using the same model of normality.
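As an illustration of the two steps just described, the sketch below estimates the kernel variance from the 10 nearest neighbours and then evaluates an isotropic Gaussian kernel density estimate. It is a plain-Python sketch under stated assumptions, not the authors' implementation: the function names are invented for the example, and the prototype vectors are synthetic 2-D points rather than patient data.

```python
import math
import random


def nn_bandwidth_sq(centres, k=10):
    """sigma^2: mean squared Euclidean distance to the k nearest neighbours,
    averaged over all prototype vectors (requires len(centres) > k)."""
    total = 0.0
    for i, xi in enumerate(centres):
        sq_dists = sorted(
            sum((a - b) ** 2 for a, b in zip(xi, xj))
            for j, xj in enumerate(centres) if j != i
        )
        total += sum(sq_dists[:k]) / k
    return total / len(centres)


def kde_pdf(x, centres, sigma_sq):
    """Equally weighted sum of isotropic Gaussian kernels centred on `centres`."""
    dim = len(x)
    norm = len(centres) * (2 * math.pi * sigma_sq) ** (dim / 2)
    kernel_sum = sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * sigma_sq))
        for c in centres
    )
    return kernel_sum / norm


# Synthetic stand-in for normalised prototype vectors (assumption).
rng = random.Random(0)
prototypes = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
sigma_sq = nn_bandwidth_sq(prototypes, k=10)

p_central = kde_pdf([0.0, 0.0], prototypes, sigma_sq)
p_remote = kde_pdf([6.0, 6.0], prototypes, sigma_sq)
print(sigma_sq, p_central, p_remote)
```

A point close to the bulk of the prototypes receives a much higher density than a remote one, which is the property the novelty score relies on.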
In order to estimate the "abnormality" of a data point $x$, the departure from normality is usually quantified using a novelty score $z(x)$, which decreases as the likelihood $p(x|\theta)$ increases, where $\theta = \{x_i, \sigma\}$. "Normal" data, which have higher likelihoods $p(x|\theta)$, therefore generate low novelty scores $z(x)$; conversely, "abnormal" data, which have lower likelihoods, generate high novelty scores $z(x)$.
A consequence of having different values of $p(x)$ arising from the use of models with different dimensions is that the value of the novelty score $z_A(x)$ for one model will not be the same as the novelty score $z_B(x)$ for a model with a higher number of dimensions. Hann [16] addressed this problem by proposing a numerical approach which uses probability $P$ (where $0 < P < 1$) to relate the probability density or novelty values calculated using a kernel density estimate model with a high number of dimensions to the equivalent value of novelty in a lower-dimensional model. This is achieved by drawing a significantly large number of samples from each model and finding the fraction of samples that have a probability density higher than a series of successively lower thresholds (for more details, please refer to [16]). This method enables all computations of novelty (for whatever dimension) to be mapped onto the same scale as the original 5-D model, using the probability density from a higher-dimensional model. By applying this method, we are able to compare directly the values of the novelty scores given by the 5-D and 7-D models.
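The sampling step of this calibration can be sketched as follows. This is only an interpretation of the description above, not the procedure of [16] itself: the function names, the synthetic prototype vectors, the fixed kernel width of 1.0, and the sample count of 500 are all assumptions chosen to keep the example small.

```python
import bisect
import math
import random


def kde_pdf(x, centres, sigma):
    """Isotropic Gaussian kernel density estimate evaluated at x."""
    dim = len(x)
    norm = len(centres) * (2 * math.pi * sigma ** 2) ** (dim / 2)
    total = 0.0
    for c in centres:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, c))
        total += math.exp(-sq_dist / (2 * sigma ** 2))
    return total / norm


def sample_kde(centres, sigma, rng):
    """Draw one sample: choose a kernel centre, add isotropic Gaussian noise."""
    centre = rng.choice(centres)
    return [c + rng.gauss(0.0, sigma) for c in centre]


def exceedance_probability(centres, sigma, n_samples, rng):
    """Return a function mapping a density threshold to the fraction of
    samples drawn from the model whose density is higher than it."""
    densities = sorted(
        kde_pdf(sample_kde(centres, sigma, rng), centres, sigma)
        for _ in range(n_samples)
    )

    def probability(density):
        higher = len(densities) - bisect.bisect_right(densities, density)
        return higher / len(densities)

    return probability


rng = random.Random(0)
# Synthetic stand-ins for the 5-D and 7-D prototype vectors (assumption).
centres_5d = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
centres_7d = [[rng.gauss(0, 1) for _ in range(7)] for _ in range(40)]
prob_5d = exceedance_probability(centres_5d, 1.0, 500, rng)
prob_7d = exceedance_probability(centres_7d, 1.0, 500, rng)

# Densities from the two models are not comparable, but the fractions are.
p5 = prob_5d(kde_pdf([0.0] * 5, centres_5d, 1.0))
p7 = prob_7d(kde_pdf([0.0] * 7, centres_7d, 1.0))
print(p5, p7)
```

Because both fractions lie between 0 and 1 regardless of dimensionality, a density from the 7-D model can be matched to the 5-D density that yields the same fraction, which is the sense in which the two novelty scales become comparable.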

C. Identifying patient deterioration
We now consider the performance of different novelty detection schemes with respect to the manual observations made by ward staff and the computed 24-hour variability indices, in order to classify patients as either "normal" or "abnormal". Here, we use the data recorded during the entire length of stay of the patient on the ward to train our model of normality (using also the normalisation procedures described in Section III-A). The number of "abnormal" patients in this dataset is small in comparison with the number of "normal" patients and is insufficient to train a multi-class classifier; the novelty detection approach is therefore justified for this particular application.
1) Data partitioning: For each of the training sets (one with only the five vital signs, and the other with the five vital signs and the 24-hour variability of RR and SysBP), we use data from 60% of the patients in the "normal" group to construct our different models of normality. The remaining patients are split equally between the validation set (20%, to enable parameter and threshold optimisation for each different method) and the test set (20%, for evaluation of the trained models). The available examples of abnormality are split equally between the validation (50%) and test (50%) sets. We note that different patients have different numbers of observations. However, it is important that each of the 23 "abnormal" patients contributes to either the validation set or the test set, but not both. If one patient contributed data to both sets, the test set would no longer be independent of the training and validation sets, due to the dependence between observations for a single patient. The partitioning procedure therefore takes into account the number of patients, and not the number of observations. Furthermore, because of the absence of annotated observational data, we do not have a dataset that contains only "abnormal" observational data. We therefore classify a patient as being "abnormal" if at least one observation is classified as "abnormal" by the classifier. As a consequence of this procedure, a high number of false positives is expected.
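A patient-level partition of this kind, together with the "at least one abnormal observation" rule, might look like the sketch below. The identifiers, the rounding of the 60/20/20 split, and the use of a threshold on per-observation novelty scores are assumptions made for illustration; the source does not specify these details.

```python
import random


def split_patients(normal_ids, abnormal_ids, rng):
    """Partition by patient, never by observation, so that no patient
    contributes data to more than one set."""
    normal, abnormal = list(normal_ids), list(abnormal_ids)
    rng.shuffle(normal)
    rng.shuffle(abnormal)
    n_train = round(0.6 * len(normal))  # rounding choice is an assumption
    n_val = round(0.2 * len(normal))
    half = len(abnormal) // 2
    return {
        "train": normal[:n_train],
        "val": normal[n_train:n_train + n_val] + abnormal[:half],
        "test": normal[n_train + n_val:] + abnormal[half:],
    }


def patient_is_abnormal(novelty_scores, threshold):
    """A patient is flagged if any single observation exceeds the threshold."""
    return any(z > threshold for z in novelty_scores)


# Hypothetical patient identifiers matching the group sizes in the text.
normal_ids = [f"N{i}" for i in range(177)]
abnormal_ids = [f"A{i}" for i in range(23)]
partition = split_patients(normal_ids, abnormal_ids, random.Random(0))
print({name: len(ids) for name, ids in partition.items()})
```

Only "normal" patients appear in the training set, as required for a model of normality, while the "abnormal" patients are shared between validation and test.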
The split between the training, validation, and test sets is performed randomly. In order to test the dependence of the results on this random partitioning, 200 experiments are performed, each experiment containing a different random partition of patients between the training, validation, and test sets. Each experiment therefore included retraining of the classifier, revalidation, and retesting, in order to obtain fully independent results for each experiment. The area under the receiver operating characteristic (ROC) curve is used in order to optimise the parameters and find the best threshold.
2) Novelty detection schemes: We use the kernel density estimate (KDE) approach described in Section III-B, and compare the results obtained to three different methods: the Gaussian mixture model (GMM), the one-class support vector machine (SVM), and the one-class Gaussian process (GP).
The GMM is a semi-parametric technique [22], and is defined by the pdf $p(x) = \sum_{i=1}^{M} \pi_i \, p(x \mid \theta_i)$, which comprises $M$ component distributions, each of which has a prior probability $\pi_i$ and a likelihood $p(x \mid \theta_i) = \mathcal{N}(x \mid \mu_i, \Sigma_i)$, where $\mu_i$ and $\Sigma_i$ are the centre and covariance matrix for the multivariate Gaussian $i$, respectively. The maximum likelihood estimates of the model parameters are determined using expectation maximisation [21].
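To make the mixture density concrete, the sketch below evaluates a GMM pdf for given parameters. It is deliberately simplified: the covariance matrices are assumed to be diagonal so that the example needs no linear-algebra library, whereas the text allows full covariance matrices, and the parameters are supplied by hand rather than fitted with expectation maximisation. All names and values are illustrative.

```python
import math


def diag_gaussian_pdf(x, mean, variances):
    """Multivariate Gaussian density with a diagonal covariance matrix
    (a simplification of the full-covariance case described in the text)."""
    log_p = -0.5 * sum(
        math.log(2 * math.pi * v) + (a - m) ** 2 / v
        for a, m, v in zip(x, mean, variances)
    )
    return math.exp(log_p)


def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_i pi_i * N(x | mu_i, Sigma_i), with diagonal Sigma_i."""
    return sum(
        w * diag_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Hypothetical two-component mixture in one dimension.
weights = [0.5, 0.5]
means = [[0.0], [4.0]]
variances = [[1.0], [1.0]]
p_at_zero = gmm_pdf([0.0], weights, means, variances)
print(p_at_zero)
```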
The one-class SVM approach proposed by Schölkopf [23] defines a novelty boundary in the feature space corresponding to a kernel (typically a Gaussian kernel is used), by separating the transformed training data from the origin in the feature space, with maximum margin. This approach requires selecting a priori the percentage of positive (or "normal") data allowed to fall outside the description of the "normal" class. The parameter values are estimated here using 5-fold cross-validation.
The one-class GP is that proposed by Kemmler et al. [24], the details of which are not replicated here due to the limitations of space. This method uses the familiar GP classification framework [25], and parameter values are again estimated using 5-fold cross-validation.
The results obtained with the methods mentioned above are compared with those obtained with conventional EWS systems. For this, we used three different systems: the EWS system used at the time of our study (details of which can be found in Clifton et al. [26]), the ViEWS proposed by Prytherch et al. [8], and the CEWS proposed by Tarassenko et al. [9].

IV. RESULTS AND DISCUSSION

A. Physiological Trajectory
Novelty scores $z(x)$ computed using the two different models of normality (5-D and 7-D models), averaged for each day, are shown in Figure 3 for "normal" and "abnormal" patients. The novelty scores are displayed in Figure 3 for the length of stay (or time to event in the case of the "abnormal" group) up to the 75th percentile (13 and 10 days, respectively) for each group of patients.
From the trajectory of $z(x)$ obtained from both models for the "normal" group of patients, we can see a significant decrease in $z(x)$ in the first 4 days, after which $z(x)$ is approximately constant for $t \geq 4$ days. The first 4 days correspond to patient recovery immediately following surgery [27]. After Day 4, the majority of patients included in the "normal" group appear to have fully recovered from surgery and are physiologically stable. It could be argued that these patients are sufficiently stable for early discharge to be considered, or for them to be provided with a lower level of care should they need to remain in hospital for reasons not related to physiological instability. Conversely, $z(x)$ for the "abnormal" group of patients suggests that the physiological trajectory for these patients is significantly different to that of "normal" patients, with a sudden increase in novelty in the last 48 hours, following the gradual decrease prior to this. These results suggest that patient criticality could be assessed by evaluating the distribution of vital signs using the novelty scores after admission to the post-operative ward, following major surgery.
If we compare the trajectories obtained with each model, we can see that the difference between the "normal" and "abnormal" trajectories computed with the 7-D model is generally higher than that computed with the 5-D model (note that the vertical axis in Figure 3 is a logarithmic scale). In fact, the pattern of recovery appears to be more accentuated when the 24-hour variability indices of RR and SysBP are incorporated in the model. Moreover, the increase in the novelty score in the last 48 hours is also very pronounced. These results suggest that the variability indices of RR and SysBP may be good predictors of ICU re-admission or death, and consequently, improve the identification of patient deterioration.

B. Identifying patient deterioration
Table II shows the overall results after 200 experiments, at the "optimal" threshold for each experiment (that threshold determined from the validation set in each of the 200 experiments). Defining true positive (TP), true negative (TN), false positive (FP) and false negative (FN) to be the number of "abnormal" patients correctly identified as "abnormal", the number of "normal" patients correctly identified as "normal", the number of "normal" patients incorrectly identified as "abnormal", and the number of "abnormal" patients incorrectly identified as "normal", respectively, sensitivity is defined to be TP / (TP + FN), and specificity is TN / (TN + FP). Figure 4 shows the average ROC curves of the 200 experiments for each novelty detection scheme. The averaging procedure (vertical averaging) takes vertical samples of the ROC curves for fixed FP rates and averages the corresponding TP rates (more details can be found in [28]).
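The definitions above, and the idea of vertical averaging, can be sketched as follows. The counts and the two ROC curves are made-up illustrations, and the linear interpolation between ROC points is an assumption of this sketch; the exact sampling scheme is described in [28].

```python
def sensitivity_specificity(tp, tn, fp, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def tpr_at(curve, fpr):
    """TP rate of one ROC curve at a fixed FP rate (linear interpolation).

    `curve` is a list of (fpr, tpr) points sorted by fpr, running from
    (0, 0) to (1, 1).
    """
    for (f0, t0), (f1, t1) in zip(curve, curve[1:]):
        if f0 <= fpr <= f1:
            if f1 == f0:
                return max(t0, t1)
            return t0 + (t1 - t0) * (fpr - f0) / (f1 - f0)
    raise ValueError("fpr outside the range covered by the curve")


def vertical_average(curves, fpr_grid):
    """Average the TP rates of several ROC curves at each fixed FP rate."""
    return [
        (fpr, sum(tpr_at(c, fpr) for c in curves) / len(curves))
        for fpr in fpr_grid
    ]


# Illustrative patient counts for a single experiment (not from Table II).
sens, spec = sensitivity_specificity(tp=10, tn=25, fp=11, fn=2)

# Two hypothetical ROC curves from different random partitions.
curve_a = [(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.4, 0.6), (1.0, 1.0)]
averaged = vertical_average([curve_a, curve_b], [0.0, 0.2, 0.5, 1.0])
print(sens, spec, averaged)
```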
In general, the performance of the 7-D classifiers is higher than that of the 5-D models. For example, the novelty detector built using the KDE approach which includes the variability indices as inputs has an average sensitivity of 0.87 and an average specificity of 0.70, while the 5-D model constructed without the variability indices has a lower average sensitivity of 0.84 at the same average specificity of 0.70. These results are confirmed by the ROC curves shown in Figure 4, in which it may be seen that the ROC curves from the 7-D classifiers are higher than those for the 5-D models. The results also show that the 7-D classifiers have higher sensitivities but maintain the specificity obtained with the 5-D models. With the variability indices, we are detecting more "abnormal" patients without increasing the false positive rate. These results support the hypothesis that the 24-hour variability indices of RR and SysBP may be good predictors of ICU re-admission or death.
Table II also shows that the KDE approach achieved the highest area under the ROC (AUROC) curve in comparison with the other methods (see also Figure 4). Furthermore, the classification performances of all novelty detection methods explored are higher than any of the conventional EWS systems. The EWS system that provided the best result in terms of both sensitivity and specificity was the CEWS [9], with an average sensitivity of 0.65 and specificity of 0.57 (see Table III). The low sensitivity of these EWS systems, which were designed to identify abnormal values of physiological variables, shows that they fail to provide early warning of deterioration in post-
We now demonstrate the performance of the different approaches to novelty detection with case studies from "abnormal" patients who were known to deteriorate, ending with ICU re-admission, and, in some cases, death. Examples of the application of the technique to patient vital-sign data are shown in Figure 5. The first example (left-hand plots in Figure 5) shows a patient who deteriorated eight days after admission to the post-operative cancer ward, and was then admitted to the ICU. During the first day after surgery, the patient exhibits some physiological variability, showing episodes of high blood pressure (SysBP reaching 180 mmHg). In the following few days, the patient appears to be recovering "normally": a gradual decrease of the SysBP to more "normal" values and a reduction in the variability of the vital signs. In the last 48 hours before the patient was re-admitted to the ICU, the patient becomes slightly tachypneic (RR reaching 24 breaths per minute), and there is a slight increase in the blood pressure (SysBP reaching 160 mmHg). These small increases in the RR and SysBP are not enough to trigger an alarm using our 5-D model (the novelty scores remain below the defined threshold). However, if we look at the variability indices for both parameters, we observe that there are increases in both the variability index of RR (reaching 11 breaths per minute on the last day) and in the variability index of SysBP (reaching almost 60 mmHg). As a result, our proposed 7-D model, which includes these indices, reaches the defined threshold during the last 24 hours before the patient's ICU admission, and therefore detects the deterioration of the patient. All of the manual observations made for this patient in the last 48 hours were deemed to be "normal" by conventional EWS systems. This patient died 10 days later in the ICU.
The second example (right-hand plots in Figure 5) shows a patient who had some periods of instability after being admitted to the post-operative ward, following surgery. After 8 days on the post-operative ward, the patient was re-admitted to the ICU. On Day 4 (before ICU re-admission), the 24-hour variability of RR reaches 18 breaths per minute, which causes the novelty scores computed with the 7-D model to go above the threshold. In the last 36 hours, the patient exhibits desaturations in SpO2, decreasing to approximately 86%, and episodes of high blood pressure, reaching 171 mmHg. We note a slight increase in the novelty score of the 5-D KDE. Although the variability index of RR remains low (at 3 breaths per minute), the variability index of SysBP increases to 56 mmHg, which results in a more pronounced increase (reaching the corresponding threshold) of the novelty score computed with the 7-D model.

C. Limitations of the study
A few considerations should be made here regarding the analysis conducted and the results obtained. Firstly, we note that the size of the dataset used in this study limits the conclusions that can be drawn from the results of this analysis. In particular, the number of patients included in the "abnormal" group is not large enough to allow a comprehensive interpretation of the changes in some physiological parameters and 24-hour variability indices. However, we were able to identify that the 24-hour variability indices of RR and SysBP may be good predictors of ICU re-admission or death on the ward.
Another limitation of the current study is the fact that we do not have a dataset that contains only examples of "abnormal" data; i.e., we classify a patient as being "abnormal" if at least one observation made throughout the patient's entire stay on the ward is classified as "abnormal". The consequence of this procedure is a high false-positive rate (which can be observed in the performances of the different classifiers). This effect could be reduced by requiring two consecutive abnormal observations (instead of only one). Nevertheless, we were able to compare the different methods and evaluate the contribution of the 24-hour variability indices to the models of normality and to the identification of "abnormal" patients. We also note that the computation of a 24-hour variability index can become problematic in noisy datasets; i.e., if one of the observations has an extremely high (or extremely low) value of RR or SysBP due to noise or other artefacts, the variability index for the corresponding day will be artificially high. However, the dataset used in this study comprises only observational data recorded by nursing staff on the ward, and these data were subjected to a filtering process in which artefactual or noisy data were discarded and not considered for the study. Therefore, we do not expect many artefactual data to remain.
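The requirement of two consecutive abnormal observations suggested above can be sketched as a simple run-length rule. This is an illustrative sketch only: the function name `patient_flagged` and the parameter `k` are hypothetical, and an observation is assumed to be "abnormal" when its novelty score is at or above the threshold.

```python
def patient_flagged(novelty_scores, threshold, k=1):
    """Return True if at least k consecutive scores reach the threshold.

    k=1 reproduces the rule of flagging a patient on any single abnormal
    observation; k=2 requires two consecutive abnormal observations.
    ASSUMPTION: "abnormal" means score >= threshold.
    """
    run = 0
    for score in novelty_scores:
        # Extend the current run of abnormal observations, or reset it.
        run = run + 1 if score >= threshold else 0
        if run >= k:
            return True
    return False


# Usage: an isolated excursion triggers the k=1 rule but not the k=2 rule.
scores = [1.0, 5.0, 1.0, 5.0, 1.0]
flag_single = patient_flagged(scores, threshold=4.0, k=1)
flag_double = patient_flagged(scores, threshold=4.0, k=2)
```

Raising `k` trades sensitivity for specificity: isolated excursions no longer classify the whole stay as "abnormal", at the cost of a delay of at least one observation interval before a true deterioration is flagged.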
In addition, the dataset used in the analysis described in this paper consisted of manual measurements of vital signs acquired periodically (every 2 or 4 hours) by ward staff. Such infrequent observations can allow clinical deterioration, including "abnormal" 24-hour variability in the vital signs, to go unnoticed. Furthermore, our previous analysis (described in [29]) showed that transitory changes in patient physiology occur when nurses observe patients, which can bias the results. Reducing this bias is an important factor, and one highlighted in [29]. A solution to the infrequency of observational data is the use of a patient monitoring system based on continuous data acquired from patient-worn sensors. The challenge for such an approach is to provide early warning of patient deterioration in a robust manner, such that few false alarms are generated.

V. CONCLUSIONS
We have presented results from analyses of data acquired from patients who were admitted to the post-operative Upper GI ward after cancer surgery. We studied the vital-sign trajectories during the patients' stay on the ward. We observed no significant changes in the physiology of the "normal" patients from around halfway through their stay to the time of discharge, which suggests that these patients could have been considered for early discharge or provided with a lower level of care from halfway through their stay.
We have shown in this paper that the 24-hour variability of RR and SysBP during the first two weeks after admission to the Upper GI ward may be good predictors of ICU admission for surgical patients, adding information beyond that carried by the values of the physiological variables themselves. There have been several reports on the monitoring of patients post-operatively on surgical wards but, to the best of our knowledge, none has focused on the variability of physiological variables. We have also proposed one approach to including these clinically significant 24-hour variations in RR and SysBP in the construction of models of normality, which have provided better classification performance than the conventional EWS systems currently in use in most UK hospitals. These systems would alert clinical staff to unstable post-surgical patients not otherwise identified by the current systems. Earlier identification of instability would allow earlier escalation of care, which should then lead to improved outcomes, as delayed transfer of patients from general wards to the ICU is associated with increased morbidity and mortality [30].
In conclusion, we note that the analysis presented here has been retrospective, and that real-time use of these novelty detection approaches should be implemented and tested in a clinical environment. The clinical study described in this paper goes some way towards addressing the lack of clinical evidence for the efficacy of machine learning methods in patient monitoring. The on-going next phase of the clinical study will yield further data with which these preliminary findings may be confirmed, and aims to determine whether patient outcomes are improved by displaying the output of a simple novelty detection system [20] to ward nurses, in real time, during the patient's stay on the ward. Future work will concentrate on continuous data acquired from patient-worn body sensors, on the refinement of existing techniques for the target population, and on the improvement of model construction using dynamical modelling approaches such as Gaussian processes.

Fig. 1. On the left: the mean values of the 5 vital signs (RR, HR, SpO 2 , SysBP and Temperature) for the first 13 days post-operatively, for patients from the "normal" group (in dashed green) and patients from the "abnormal" group; on the right: the variability index (see text for explanation) of each of the 5 vital signs for the same period of time and the same groups of patients. Error bars denote one standard error of the group mean.

Fig. 2. Distributions of the 24-hour variability indices of RR for each day for the "normal" group of patients (green areas). The overall distribution (including all days) is shown on the right (yellow area). The variability indices of RR for one patient from the "abnormal" group are shown with red lines for each day. This patient exhibits the same value of the variability index on days 3 and 7 (indicated by black arrows).

Fig. 3. Daily average of the novelty scores z(x) against time for the "normal" group of patients (shown in green) and the "abnormal" group of patients (shown in red). Dotted lines correspond to the novelty scores computed using the 5-D model. Error bars denote one standard error of the group mean.

Fig. 4. ROC curves for the novelty detection results. Vertically averaged ROC curves (see text for explanation) from 200 experiments are shown. The black line indicates the line of no discrimination.

Fig. 5. Two example patients are shown, one on the left and one on the right. The upper plot in each column shows the observations of the vital signs RR, HR, SysBP and SpO 2 over time (refer to the right vertical axis for SpO 2 ). The 24-hour variability indices of RR and SysBP are also shown as pink and red dotted lines, respectively. The lower plot in each column shows the novelty scores determined using the 5-D and 7-D models of normality computed using the kernel density estimate approach. The decision thresholds are represented with dashed lines.

TABLE I. PATIENT DEMOGRAPHICS COMPARING THE "NORMAL" GROUP (DISCHARGED HOME) AND THE "ABNORMAL" GROUP (RE-ADMITTED TO THE ICU OR DIED ON THE WARD).

TABLE II. NOVELTY DETECTION PERFORMANCES, MEAN ± ONE STANDARD DEVIATION.