Neuropsychology Validation of an Automated Scoring Program for a Digital Complex Figure Copy Task Within Healthy Aging and Stroke

Objective: Complex Figure Copy Tasks are among the most commonly employed neuropsychological tests. However, manual scoring of this test is time-consuming, requires training, and can still be inconsistent between examiners. We aimed to develop and evaluate a novel, automated method for scoring a tablet-based Figure Copy Task. Method: A cohort of 261 healthy adults and 203 stroke survivors completed the digital Oxford Cognitive Screen-Plus (OCS-Plus) Figure Copy Task. Responses were independently scored by two trained human raters and by a novel automated scoring program. Results: Overall, the Automated Scoring Program was able to reliably extract and identify the separate figure elements (average sensitivity and specificity of 92.10% and 90.20%, respectively) and assigned total scores which agreed well with manual scores (intraclass correlation coefficient [ICC] = .83). Receiver operating characteristic analysis demonstrated that, compared to overall impairment categorizations based on manual scores, the Automated Scoring Program had an overall sensitivity and specificity of 80% and 93.40%, respectively (area under the curve [AUC] = 86.70%). Automated total scores also reliably distinguished between different clinical impairment groups, with subacute stroke survivors scoring significantly worse than longer-term survivors, who in turn scored worse than neurologically healthy adults. Conclusions: These results demonstrate that the novel Automated Scoring Program was able to reliably extract and accurately score Figure Copy Task data, even in cases where drawings were highly distorted due to comorbid fine-motor deficits. This represents a significant advancement, as this novel technology can be employed to produce immediate, unbiased, and reproducible scores for Figure Copy Task responses in clinical and research environments.

Question: Can an automated program extract and accurately score Figure Copy Task data, even in cases where drawings are highly distorted due to comorbid fine-motor deficits? Importance: This represents a significant advancement, as this novel technology can be employed to produce immediate, unbiased, and reproducible scores for Figure Copy Task responses in clinical and research environments. Next Steps: Trialing the Automated Scoring Program in clinical environments.

Keywords: neuropsychology, stroke, validation, automated scoring, complex figure copy

Supplemental materials: https://doi.org/10.1037/neu0000748.supp

The administration of neuropsychological tests is a key component of establishing brain-behavior relationships (Crawford et al., 1992; Ellis & Young, 2013). However, comparisons that employ these metrics can be limited by the quality of scoring of these neuropsychological tests. For example, tests that require subjective examiner judgments may introduce potentially confounding noise into neuropsychological analyses (Barker et al., 2011; Franzen, 2000; Moore et al., 2019; Watkins, 2017). Interrater reliability is traditionally improved by implementing extensive training courses, employing exhaustive scoring procedures, or requiring agreement across multiple independent raters or tests (Franzen, 2000; Huygelier et al., 2020). However, more demanding scoring procedures are often prohibitively time-consuming and can lead studies to rely on small, selected samples rather than larger, generalizable patient cohorts, or to administer only limited cognitive measures (e.g., the Mini-Mental State Examination; Folstein et al., 1983), which reduces informational richness. For these reasons, identifying new methods for efficiently improving scoring consistency on clinically feasible measures is critically important for improving both the scope and reliability of neuropsychological investigations. Here, we focus on validating this approach in a specific, prominently studied clinical cohort of stroke survivors, as an example group where automated scoring measures may improve methods to further elucidate specific aspects of domain-specific cognitive impairments in complex figure copy and recall.
The Figure Copy test is one of the most commonly employed neuropsychological assessment methods used to evaluate visuospatial constructional ability and nonverbal memory in clinical environments (Shin et al., 2006). In traditional versions of this test, participants complete two drawings of a composite geometric shape. First, participants are presented with a target image and are asked to copy it from sight. Next, the target figure is removed and participants are asked to reproduce it from memory (Demeyere et al., 2021; Schreiber et al., 1999). The Rey-Osterrieth Complex Figure Test (ROCFT; Somerville et al., 2000) is the most well-known Figure Copy test, though many variations, including computerized versions (e.g., Demeyere et al., 2021; Humphreys et al., 2017; Schreiber et al., 1999; Taylor, 1969), are in use.
Successful completion of any Figure Copy task requires participants to co-ordinate fine-motor movements, employ visuospatial perception, maintain visual images in working memory, and effectively plan and organize their responses (Shin et al., 2006). The Figure Copy Task has been found to act as a reliable metric of a wide range of cognitive functions and is therefore useful for establishing a diverse range of brain-behavior relationships. Chechlacz et al. (2014) conducted a voxel-lesion symptom mapping study aiming to identify the neural correlates of a range of deficits captured by performance in a Figure Copy Task. Analysis of this single behavioral assessment yielded significant and distinct neural correlates associated with general poor performance, lateralized omissions, spatial positioning errors, global feature impairment, and local feature impairment (Chechlacz et al., 2014). Similarly, Chen et al. (2016) conducted a lesion mapping study investigating the correlates of principal component analysis-derived factors underlying figure copy performance. This investigation identified brain regions associated with high-level motor control, visuomotor transformation, and multistep object use using only behavioral data from a Figure Copy Task. This wide range of assessed cognitive functions makes the Figure Copy Task an extremely valuable tool both for clinical diagnostic purposes and for research aiming to establish brain-behavior relationships.
The Figure Copy Task is comparatively simple to complete while assessing a diverse range of functions. These advantages mean that this task is frequently employed within clinical neuropsychological evaluations. A survey conducted by Rabin et al. (2016) found that the ROCFT was the eighth most popular single neuropsychological assessment employed by a sample of 512 North American neuropsychologists, with 7.6% reporting using this test (Rabin et al., 2016). Previous research has suggested that Figure Copy Task performance can effectively distinguish between various clinical populations (Alladi et al., 2006; Demeyere et al., 2021; Freeman et al., 2000). For example, Freeman et al. (2000) administered the ROCFT to a cohort of Alzheimer's disease, ischemic vascular dementia, and Parkinson's disease patients. This investigation identified significant differences in performance, with patients with Alzheimer's disease performing significantly worse than patients diagnosed with vascular dementia or Parkinson's disease (Freeman et al., 2000). These findings suggest that patients' Figure Copy Task scores may provide clinically relevant information which can be employed to inform diagnoses.
Patient performance on the Figure Copy Task is generally scored manually. For example, examiners score performance on the Oxford Cognitive Screen-Plus (OCS-Plus) Figure Copy Task by reporting the presence, accuracy, and position of each individual figure element independently (Demeyere et al., 2021). However, this scoring method is time-consuming, requires training, and is ultimately reliant on subjective examiner impressions. Individual examiners may disagree on which drawn line represents which element, especially in cases where a patient has committed many errors. A significant amount of training is required to ensure high agreement. This reliance on subjective examiner judgments inevitably introduces human biases into Figure Copy scores. Relying on subjective interpretations of objective criteria can result in systematic scoring biases, potentially undermining the validity of large-scale comparisons involving Figure Copy test data, especially in cases where multiple independent examiners are involved. Automated algorithms have repeatedly been demonstrated to perform many diagnostic and classification tasks with greater sensitivity and specificity than human experts (Dawes et al., 1989; Meehl, 1954).
For this reason, several automated tools have been developed to quantify performance on neuropsychological tests. Chen et al. (2020) developed a deep-learning-based automated scoring tool for the Clock Drawing Task, a common component of dementia screening batteries (Agrell & Dehlin, 1998; Pinto & Peters, 2009). This investigation compared algorithmic and expert-assigned scores in a cohort of 1,315 outpatients and concluded that the program exhibited a comparative scoring accuracy of 98.54% (Chen et al., 2020). Similarly, Moetesum et al. (2015) applied an automated approach to assessing performance on the Bender Gestalt Test (Koppitz, 1964) within a sample of 18 healthy adults. The performance of this program varied dramatically depending on the specific gestalt component being assessed (range = 6/18 for overlap to 18/18 for rotation; Moetesum et al., 2015). Two figure-copy-specific automated scoring algorithms have been developed. First, Canham et al. (2000) developed automated scoring software for the commonly used ROCFT. In this task, responses are generally manually scored by categorizing each of the target figure's 18 elements according to whether or not they are present, accurately drawn, and correctly placed within the response figure. Canham et al.'s (2000) automated software matched these scoring criteria by first identifying distorted areas of patient drawings, then locating and grading basic geometric shapes while employing unary metrics to remove unsuitable features from patient drawings. This method was found to perform well on real patient data, with 75% of features being within 5% of the manually assigned scores and 98.6% within 10% (Canham et al., 2000). Second, the most recent, "state-of-the-art" figure copy scoring tool was designed by Vogt et al. (2019) and demonstrated a .88 Pearson correlation with human ratings of Rey-Osterrieth Complex Figure performance. While this performance is near the documented human interrater agreement (.94), equivalence testing revealed that these scoring methods did not produce strictly equivalent total scores. However, these algorithms were designed specifically to score data from the Rey-Osterrieth Complex Figure test and do not generalize to other commonly used Figure Copy tests.
The purpose of the present investigation is to develop an automated scoring tool for the OCS-Plus (Demeyere et al., 2021) Figure Copy Task. This project aims to evaluate the efficacy of this automated scoring tool by comparing automated versus manually assigned scores and identifying potential sources of systematic disagreement. The utility of this automated software for distinguishing between different clinical populations is also explored. Ultimately, this project aims to deliver a robust automated clinical scoring tool providing immediate scoring and evaluation of individual performance on the OCS-Plus Figure Copy Task.

Method

Participants
A cohort of 261 neurologically healthy adults was recruited, as well as 203 stroke survivors who completed the Figure Copy Task within the OCS-Plus Tablet Screening Project (REC reference: 18/SC/0044; IRAS project ID: 241571). Of the stroke survivors, 49 were tested on the Figure Copy test within 6 months of their stroke (termed subacute stroke participants) and 154 were tested at or after 6 months poststroke (termed chronic stroke participants).
All healthy adult participants were recruited through convenience sampling as part of the OCS-Plus validation project (Demeyere et al., 2021) from an existing pool of older healthy aging research volunteers (University of Oxford, MSD-IDREC-C1-2013-209). Healthy adult participants were included in the OCS-Plus project if they were able to provide informed consent, had sufficient English language proficiency to comprehend instructions, were at least 18 years old, and were able to remain alert for at least 20 min. The exclusion criteria included inability for the participant to consent to take part, insufficient English language proficiency, and inability to stay alert for 20 min to do the task (Table 1).
We collected additional measures from clinical notes including the Barthel Index (Mahoney & Barthel, 1965) and the Oxford Cognitive Screen (OCS; Demeyere et al., 2015) to measure functional ability and domain-specific cognitive impairment. As part of the 6-month follow-up protocol for the overarching study, we collected data on the Hospital Anxiety and Depression Scale (Zigmond & Snaith, 1983) to measure anxiety and depression, the Stroke Impact Scale (Duncan et al., 2002) to measure the domain-specific impact of stroke, and the Quality of Life Scale (Al-Janabi et al., 2012) to assess the quality of life of the participants poststroke (Table 2).

The OCS-Plus Figure Copy Task and Manual Scoring Criteria
The OCS-Plus is a tablet-based cognitive screening tool designed to briefly assess cognitive impairments within clinical and research settings (Demeyere et al., 2021). The OCS-Plus version used in this investigation was created in MATLAB 2014b and was run on a Microsoft Surface Pro computer tablet (Windows 10 Pro, version 1511). The OCS-Plus begins with a short practice, involving tapping a shape in the center of the screen and drawing a line between two small dots, to ensure that even those with limited experience with computer-tablet technology can complete tasks accurately. The OCS-Plus includes a computerized Figure Copy Task which is designed to be inclusive for severely impaired patients, including a simple, multielement target figure. In this task, participants are asked to copy a composite geometric shape (Figure 1) once from sight and again from memory, immediately following completion of the copy condition. Participants are not informed that they will be asked to remember the figure until the beginning of the memory condition. Participants are instructed to complete their drawing using a tablet stylus within a marked area underneath the target figure. Participants are allowed unlimited time to complete each of these drawing tasks.
The Figure Copy Task records performance in terms of co-ordinates and a timeline, allowing full, detailed reconstruction of the drawing process. Each completed drawing is assigned a total score out of 60, with each of the 20 individual figure elements being scored independently according to three independent criteria: presence, accuracy, and position (Figure 2). An element is scored as present if it has been drawn anywhere in the response figure. Perseverative responses are not quantitatively penalized but are noted by the examiner. Elements are marked as accurate if they are drawn with reasonable accuracy as could be expected from a person with typical drawing ability. Reasonable allowances are made to account for the use of a tablet computer stylus on the relatively slippery screen surface and comorbid age-related fine motor impairments (e.g., arthritis). For example, slight inaccuracies in line joining as well as obvious attempts to correct such errors (e.g., doubling up a line to ensure that it is straight) are not penalized. Finally, element position is marked as correct if an element's location is reasonably accurate. As with accuracy scores, allowances are made to account for tablet usage and age-related fine motor impairments. Scorers are instructed to only penalize each drawing position error once and to disregard cases in which position errors within one element have led to placement errors within neighboring elements. These criteria are used to assign a score out of three for each individual element shown in Figure 2, and these element scores are summed to produce a total score. This scoring procedure is repeated for the copy and recall drawing conditions.
A full scoring manual detailing the exact instructions given to scorers is openly available on the Open Science Framework (Foster & Deardorff, 2017; https://osf.io/9dwpv/). Human raters completed approximately 2 hr of training with the manual before completing the manual ratings. The average time required to manually score a figure copy response varied between 1 and 5 min, depending on the degree of distortion and error present within the response drawing. The automated scoring program requires less than 5 s to score a drawing, so implementation of automated scoring can be expected to save between 2 and 10 min per participant (two drawings each). Note that we awarded full border-element points in cases where the participant had not drawn the figure border at all (i.e., had drawn no border elements). This approach was adopted in order to avoid penalizing participants who used the outer border of the rectangular drawing area as the figure border. Given that this error pattern occurred in both healthy adult participants (n = 23 in the copy condition, n = 27 in the recall condition) and patients (n = 14 in the copy condition and n = 11 in the recall condition), we judged this not to represent a clinical deficit, such as closing-in behavior, and instead attributed these errors to the space indicating where to draw on the tablet being too similar in size and shape to the target drawing, along with potential misunderstanding of instructions. Given the small number of patient responses affected (14 at maximum, and more for the healthy adults), this scoring rationale did not significantly impact the results of the conducted analyses.

Note. Subacute and Chronic refer to when the Figure Copy test was administered: either before 6 months poststroke (termed subacute) or at or after 6 months poststroke (termed chronic). ADL refers to activities of daily living; IADL refers to instrumental activities of daily living.

Automated Scoring Program
The Automated Scoring Program created in this project was developed in Python 3.7 and employs functions from the packages SciPy (Jones et al., 2001), Shapely (Gillies, 2015), Kivy (Virbel et al., 2011), and PyLaTeX (Fennema, 2014). This program employs output variables created by the OCS-Plus software including (x, y) co-ordinates of patient responses, time stamps, and final drawing images. Before scoring each element, this software first preprocesses these data in six sequential steps: noise removal, normalization, circle identification, line segmentation/extraction, star and cross identification, and line element identification.
First, in noise removal, all pen strokes totalling fewer than five pixels are removed, as these responses represent very small marks which were most likely created by accidentally touching the pen to the tablet. Similarly, all elements which are abnormally distant from other elements are removed, as these marks are unlikely to be part of a participant's intended response. Abnormal distance is determined by calculating the centroid of each element and then using k-dimensional trees (Maneewongvatana & Mount, 2002) to find nearest neighbors for each centroid within a distance r, where r = 0.5 × min(fig_h, fig_w), with fig_h the height of the figure and fig_w its width. Second, participant drawings are normalized. This step is essential due to the large variance in participant response sizes, orientation angles, and positions within the allocated drawing response area. Normalization is conducted by translating each drawing so that its bottom left-most point sits at co-ordinate (0, 0) and then scaling the x- and y-axes to match the dimensions of the target figure.
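As an illustration of these first two preprocessing steps, the sketch below filters out very small or isolated strokes and rescales a drawing to the target figure's dimensions. It is a minimal reconstruction based only on the description above (function names, the point-count proxy for stroke size, and parameter defaults are ours, not taken from the published program), using SciPy's cKDTree for the nearest-neighbour search.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_noise(strokes, fig_w, fig_h, min_points=5):
    """Drop tiny strokes and strokes whose centroid is abnormally far from all others.

    strokes: list of (n, 2) arrays of (x, y) pen co-ordinates.
    A stroke is kept if it has at least `min_points` recorded points (a proxy for
    the five-pixel criterion) and has another centroid within r = 0.5 * min(fig_h, fig_w).
    """
    strokes = [s for s in strokes if len(s) >= min_points]
    if len(strokes) < 2:
        return strokes
    centroids = np.array([s.mean(axis=0) for s in strokes])
    r = 0.5 * min(fig_h, fig_w)
    tree = cKDTree(centroids)
    kept = []
    for i, stroke in enumerate(strokes):
        # query_ball_point returns the stroke itself plus any neighbours within r
        neighbours = tree.query_ball_point(centroids[i], r)
        if len(neighbours) > 1:          # at least one other centroid nearby
            kept.append(stroke)
    return kept

def normalise(strokes, target_w, target_h):
    """Translate the drawing so its bottom-left point sits at (0, 0) and
    scale x and y to the target figure's dimensions."""
    points = np.vstack(strokes)
    mins, maxs = points.min(axis=0), points.max(axis=0)
    span = np.where(maxs - mins == 0, 1.0, maxs - mins)   # avoid divide-by-zero
    scale = np.array([target_w, target_h]) / span
    return [(s - mins) * scale for s in strokes]
```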
In the third step, circular elements are identified within the normalized response drawing. Circles are defined as a continuous path that meets the criteria detailed in Figure 2. The values of these parameter cut-offs were tuned to the values which optimized overall performance. Next, line segmentation is performed using the Ramer-Douglas-Peucker algorithm, which processes a series of points on a single curve and outputs a simplified element path composed of straight lines (Douglas & Peucker, 1973). Vector calculations are then used to determine the angles between the lines of each simplified curve, to identify turning points, and to subsequently split simplified lines into individual figure elements.
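The sketch below illustrates the simplification and splitting logic for a single stroke. It uses Shapely's simplify method (a Douglas-Peucker implementation) rather than the program's own routine, and the tolerance and turning-point threshold are illustrative values, not the ones used in the published software.

```python
import numpy as np
from shapely.geometry import LineString

def simplify_stroke(points, tolerance=3.0):
    """Reduce a raw pen stroke (>= 2 points) to its dominant straight segments
    using Shapely's Douglas-Peucker simplification."""
    simplified = LineString(points).simplify(tolerance, preserve_topology=False)
    return np.array(list(simplified.coords))

def split_at_turning_points(simplified, angle_threshold=45.0):
    """Split a simplified path into individual line elements wherever the
    direction changes by more than `angle_threshold` degrees."""
    segments, current = [], [simplified[0]]
    for a, b, c in zip(simplified[:-2], simplified[1:-1], simplified[2:]):
        current.append(b)
        v1, v2 = b - a, c - b
        cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
        turn = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        if turn > angle_threshold:       # treat this vertex as a turning point
            segments.append(np.array(current))
            current = [b]
    current.append(simplified[-1])
    segments.append(np.array(current))
    return segments
```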
In step five, star and cross figure elements are identified by finding all sets of lines composed of intersecting paths where the length of each line is less than half of the drawing's total height and individual line lengths fall within the third quartile plus 1.5 times the interquartile range of the intersecting lines' lengths. Sets of three or more lines in which the smallest angle between lines is greater than or equal to 30° are defined as stars, and sets of two or more lines in which the smallest angle between lines is greater than or equal to a threshold, empirically determined at 36°, are defined as crosses. Finally, in the last step, line elements of the response figure are identified. The orientation of each remaining unclassified figure element is determined as vertical, horizontal, right-slanted, or left-slanted by calculating the angles between the simplified lines and the normalized x-axis. Euclidian distance calculations (Deza & Deza, 2009) are then used to match each drawn line to its corresponding element in the target figure.
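A minimal sketch of the star/cross classification rule, assuming each candidate set of mutually intersecting short lines has already been found; only the 30° and 36° angle criteria are taken from the text, and the helper functions themselves are our own construction.

```python
import numpy as np

def smallest_pairwise_angle(lines):
    """Smallest acute angle (degrees) between any pair of line segments,
    where each line is a (start, end) pair of (x, y) points."""
    directions = [np.asarray(end, float) - np.asarray(start, float)
                  for start, end in lines]
    angles = []
    for i in range(len(directions)):
        for j in range(i + 1, len(directions)):
            v1, v2 = directions[i], directions[j]
            cos_a = abs(np.dot(v1, v2)) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
            angles.append(np.degrees(np.arccos(np.clip(cos_a, 0.0, 1.0))))
    return min(angles)

def classify_intersecting_set(lines, star_angle=30.0, cross_angle=36.0):
    """Label a set of intersecting lines as 'star', 'cross', or None,
    following the angle criteria described in the text."""
    if len(lines) < 2:
        return None
    angle = smallest_pairwise_angle(lines)
    if len(lines) >= 3 and angle >= star_angle:
        return "star"
    if angle >= cross_angle:             # two or more intersecting lines
        return "cross"
    return None
```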
Once this six-step preprocessing is completed, response drawing total scores are assigned automatically. As in manual scoring, each element is assessed based on presence, accuracy, and position. The Automated Scoring Program marks an element as present if it has been identified in the preprocessing steps described above. Accuracy scoring criteria differ based on the element being assessed. For components such as circles, stars, and crosses to be successfully identified by the preprocessing, they must be drawn with a reasonable degree of accuracy. For this reason, if a circle, star, or cross is marked as being present, it is also scored as being accurate.
The accuracy of linear elements is scored by calculating the best-fit line of the element via linear regression. The distance between a drawn point and a target point in 2D space is calculated as the absolute difference between their respective x- and y-co-ordinates. Linear elements are scored as accurate if the maximum distance from any point of the target element to the best-fit regression line is less than 10, the length of the best-fit line is greater than or equal to 70% of the target path length, and the angle between the best-fit line and the target line segment is less than 10°. If two line segments which are defined as separate in the figure template are drawn as a continuous line in the participant's drawing, the program is able to split the drawn line segment in order to assess the fit of the separated line segments to the original template and so avoid underscoring presence.
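The snippet below approximates the accuracy check for a linear element using the three thresholds quoted above (maximum distance of 10 units, 70% of the target length, 10° of angular deviation). Where the published program fits a best-fit line via linear regression, this sketch substitutes an SVD-based (total least squares) fit and perpendicular distances, which behave better for near-vertical strokes; it illustrates the criteria rather than reproducing the original implementation.

```python
import numpy as np

def is_line_accurate(drawn, target, max_dist=10.0,
                     min_length_ratio=0.70, max_angle_deg=10.0):
    """drawn: (n, 2) array of normalised pen points for the candidate element.
    target: (2, 2) array holding the target segment's start and end points."""
    drawn = np.asarray(drawn, dtype=float)
    target = np.asarray(target, dtype=float)
    centre = drawn.mean(axis=0)
    centred = drawn - centre

    # Best-fit direction: first right singular vector of the centred points
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    direction = vt[0]
    normal = np.array([-direction[1], direction[0]])

    # (a) distance of each target endpoint from the best-fit line
    offsets = (target - centre) @ normal
    if np.max(np.abs(offsets)) >= max_dist:
        return False

    # (b) extent of the drawn element along the best-fit line vs. target length
    spans = centred @ direction
    target_length = np.linalg.norm(target[1] - target[0])
    if spans.max() - spans.min() < min_length_ratio * target_length:
        return False

    # (c) angle between the best-fit line and the target segment
    target_dir = (target[1] - target[0]) / target_length
    cos_a = abs(float(direction @ target_dir))
    angle = np.degrees(np.arccos(np.clip(cos_a, 0.0, 1.0)))
    return angle < max_angle_deg
```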
Finally, element position is scored by comparing the location of drawn paths to the location of the corresponding element within the target figure. The program assigns each drawn linear element to its corresponding target element if it has the same orientation and the distance between the elements is less than 20% of the total drawn figure height. As these position criteria have to be met in order to identify a line, such a line is automatically scored as being in the correct position. The detail elements star, cross, and circle are scored as positioned correctly if their distance from the target location is less than 50% of the drawn figure height. Similarly to manual scoring, the automatic scoring program awards full border-element points if no border elements are present. This scoring process results in a total score out of 60 points for each response drawing. Full details on the design and implementation of this Automated Scoring Program can be found in the original master's dissertation which details the program (Yamshchikova, 2019). The Figure Copy software can be downloaded for academic use from the Oxford University Innovation Software Store (https://process.innovation.ox.ac.uk/software/).
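A compact sketch of the position rule described above, assuming element locations are summarized by their centroids (the exact distance definition used by the published program is not specified here): 20% of the drawn figure height for lines and 50% for the circle, star, and cross details.

```python
import numpy as np

def is_position_correct(drawn_centroid, target_centroid, figure_height,
                        element_type="line"):
    """Position rule: linear elements must lie within 20% of the drawn figure's
    height of their target location; the circle, star, and cross details are
    allowed up to 50%. Centroid-to-centroid distance is used as a simple proxy
    for the 'distance between the elements'."""
    threshold = 0.20 if element_type == "line" else 0.50
    distance = np.linalg.norm(np.asarray(drawn_centroid, dtype=float)
                              - np.asarray(target_centroid, dtype=float))
    return distance < threshold * figure_height
```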

Data Analysis
The manual scoring data included in this investigation were completed independently by SW (rater 1) and VK (rater 2). Both raters were trained to score drawings and both scored all 928 responses included in this investigation. During scoring, all figures were randomized and anonymized so that raters were blind to drawing condition, participant group, and identity. First, the degree of agreement between human rater scores was assessed. Given that figure copy total scores represent an aggregate measure that may not accurately capture interelement variation, these analyses were conducted both on total scores and on an element-wise basis. Agreement was measured in two ways. First, summed scores were compared using an intraclass correlation coefficient (ICC; model ICC1, i.e., single scores and random raters), which measures the ratio of true variance to true variance plus error variance (Koo & Li, 2016) and ranges from 0 to 1. Second, Cohen's kappa reliability statistic was used for binary data, such as whether a presence, accuracy, or position score was awarded or not; kappa is scaled as a standardized correlation coefficient to enable cross-study interpretation (McHugh, 2012). This investigation employs the ICC reliability benchmarks proposed by Koo and Li (2016): ≤.50 = poor reliability; >.50-≤.75 = moderate reliability; >.75-≤.90 = good reliability; >.90 = excellent reliability. All Cohen's kappa calculations employ the agreement benchmarks defined by McHugh (2012): 0-.20 = no agreement; .21-.39 = minimal agreement; .40-.59 = weak agreement; .60-.79 = moderate; .80-.90 = strong; >.90 = almost perfect.
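The reliability analyses themselves were run in R (see Data Analysis below); purely as an illustration of the two statistics, the following Python snippet computes an ICC1 with the pingouin package and a Cohen's kappa with scikit-learn on made-up rating data.

```python
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Made-up ratings: six drawings, each scored by both raters
scores = pd.DataFrame({
    "drawing": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "rater":   ["r1", "r2"] * 6,
    "total":   [58, 57, 44, 46, 60, 60, 35, 33, 52, 55, 59, 58],
})

# ICC1 (single scores, random raters) on total scores
icc = pg.intraclass_corr(data=scores, targets="drawing",
                         raters="rater", ratings="total")
print(icc[icc["Type"] == "ICC1"])

# Cohen's kappa on a binary element subscore (e.g., presence awarded or not)
rater1_presence = [1, 1, 0, 1, 0, 1]
rater2_presence = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(rater1_presence, rater2_presence))
```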
Next, the agreement between the Automated Scoring Program and aggregate human scores was determined. Element-wise sensitivity (True Positives/(True Positives + False Negatives)) and specificity (True Negatives/(True Negatives + False Positives)) were calculated. In these calculations, False Negatives represented cases in which an element was identified by manual scoring but not by the automated program. Conversely, False Positives represented cases where an element was identified by the automated program but not by human raters. Sensitivity analysis is usually used to determine whether a test correctly distinguishes one group of cases from another, in our case the presence or absence of an element. The benchmark for interpretation is that sensitivity + specificity should be close to or above 1.50 (or 150%-200% when reported as a percentage), where a value of 1 reflects an uninformative test and a value of 2 represents a perfect test (Power et al., 2013).
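The element-wise metrics follow directly from these definitions; the short helper below computes them (the counts in the example are invented, not study data).

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Element-wise sensitivity and specificity as defined above. A sum close
    to 2.0 indicates a highly informative test; a sum of 1.0 an uninformative
    one (Power et al., 2013)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Invented counts, purely for illustration
sens, spec = sensitivity_specificity(tp=921, fn=79, tn=902, fp=98)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}, sum = {sens + spec:.2f}")
```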
We also examined how the Automated Scoring Program resolved cases in which the raters did not assign identical scores. Next, a qualitative analysis of cases in which the automated program was and was not able to extract meaningful scores was conducted. Finally, the known-group discriminability of total scores assigned by the Automated Scoring Program was examined.
Statistical analyses were conducted in R (version 3.5.1, 2018-07-02; R Core Team, 2018), and the data and analysis scripts used to generate this manuscript are openly available (https://osf.io/3k6gs/). Several R packages, including ggplot2, were used for statistical analyses and visualization.

Results

Human Interrater Reliability
The average total score assigned by human raters was 57.72 (SD = 6.09) for copy condition drawings and 44.4 (SD = 10.69) for recall condition responses. Raters exhibited a high degree of agreement between assigned total scores, with a cumulative intraclass correlation (ICC) of . Of the 55,680 elements scored, only 3.64% were assigned conflicting element subscores by the assessors. Of all elements, raters disagreed on position scores most frequently (1.30%), followed by accuracy scores (1.26%), and then presence (1.08%). See Figure 3. Raters were found to disagree on more recall condition elements (5.51%) than copy condition elements (1.78%). This difference is likely due to the comparatively greater quality variation present within delayed recall drawing responses (recall variance = 126.63, copy variance = 17.41).
Next, elements that caused the highest degree of disagreement between the raters were identified. The most frequent element to be disagreed upon across all subscores was the middle bottom right interior divider slanted line (element 11; see Table 3), where the human raters disagreed on all three subscores a total of 37/928 times (3.99%). The small left vertical interior divider line (element 12) had the highest number of two-subscore disagreements (4.42%, n = 41), with position representing the most commonly disputed subscore. Finally, the circle (element 14) had the highest number of cases in which human raters differed within a single subscore (6.14%, n = 57). This disagreement primarily impacted position scores.

Automated Scoring Program Versus Human Raters
The comparative accuracy of the automated figure copy scores was evaluated against the manually assigned scores. For the element-wise analyses, only element scores on which both raters agreed (96.36% of all scores) were included. For total score comparisons, we averaged the two raters' total scores. This procedure was adopted to ensure the Automated Scoring Program was able to accurately score figures against agreed-upon scores before moving on to more complex cases. Overall, the scores assigned by the automated program and raters exhibited a high degree of agreement in terms of the total score, ICC = .83. A further way to compare the Automated Scoring Program to the human raters' scoring is to ask whether the same participants are identified as impaired under either scoring version. Receiver operating characteristic (ROC) analyses were conducted to compare total-score binarized impairment categorizations (i.e., more than 2 SDs below the mean) based on the automated scores to those based on the standard manual scoring and cut-offs in the copy and recall conditions. In this way we directly compared the impairment classification between manually and automatically derived scores, rather than trying to determine the presence of a stroke event. When compared to impairment categorizations made based on manual scores overall (i.e., across both copy and recall conditions), the Automated Scoring Program was found to have a total sensitivity of 80%, a specificity of 93.44%, and an area under the curve (AUC) of 86.72%, 95% CI [82.72%, 90.66%], Youden index = .73. The sensitivity and specificity were similarly high in the copy and recall conditions, with a slightly lower Youden index in the copy condition (copy: sensitivity = 79.13%, specificity = 90.14%, Youden index = .69; recall: sensitivity = 80.70%, specificity = 96.81%, Youden index = .78). When overall sensitivity and specificity are summed, we obtain a value of 173.44%, or 1.73 in raw units, well above the 1.50 benchmark, meaning that our test had excellent ability to reproduce impairment classifications based on manual scores (Power et al., 2013). Table 4 summarizes the average scores attained by each sample group per copy and recall condition and presents group-specific sensitivity and specificity statistics.
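As an illustration of this impairment-classification comparison, the snippet below runs a ROC analysis with scikit-learn on hypothetical data, treating the manual binarized impairment label as ground truth and the (negated) automated total score as the decision variable; the study's own ROC analyses were conducted in R, so this is only a sketch of the logic.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical data: 1 = impaired according to the manual cut-off (treated as
# ground truth) and the automated total score for the same drawings
manual_impaired = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])
automated_total = np.array([35, 42, 58, 55, 30, 60, 57, 59, 40, 56])

# Lower totals indicate worse performance, so negate the score to obtain a
# "higher = more likely impaired" decision value
fpr, tpr, _ = roc_curve(manual_impaired, -automated_total)
auc = roc_auc_score(manual_impaired, -automated_total)
youden = np.max(tpr - fpr)   # Youden's J = sensitivity + specificity - 1
print(f"AUC = {auc:.3f}, Youden index = {youden:.2f}")
```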
To further illustrate the degree of agreement between scoring methods overall in terms of total score, we classified the assigned automated scores into five categories: (a) a direct match with averaged rater total scores, (b) within 5% of averaged rater total scores, (c) between 5% (not inclusive) and 10% (inclusive) of averaged rater total scores, (d) between 10% and 15% deviation from averaged rater total scores, and (e) greater than 15% deviation from averaged rater scores (e.g., Canham et al., 2000). We found that 83.51% of scores from the program were within 15% of the average rater scores (39.76% within 5%) and that the maximum deviation was 52% (n = 1). In this single extreme case, the participant had drawn a nonelement outside of the figure boundary, but within the maximum bounds, skewing the normalization process such that the program failed to recognize one side of the otherwise perfect figure. 16.16% of responses were scored by the program with a deviation greater than 15%. See Figure 4. Table 3 presents the automated program's proportion of element hits, misses, false positives, and correct rejections for element-wise presence scores versus the human raters. Overall, the automated program was found to exhibit an average element sensitivity of 90.10% and an average specificity of 92.20%. See the Supplemental Materials for sensitivity tables for each element score (i.e., presence, accuracy, and position) and condition separately.
Next, Cohen's kappa analyses were performed to evaluate the degree of agreement between automated and manual element accuracy, position, and presence scores. These scoring methods were found to exhibit a high degree of agreement on position, k = .82, 95% CI [.82, .84], p < .001, and presence, k = .76, 95% CI [.76, .77], p < .001, scores, but a lower degree of agreement within accuracy scores, k = .41, 95% CI [.41, .42], p < .001. The greatest source of disagreement between the automated and manual scorings was found to be element accuracy false positives (22.63% accuracy false positives versus 2.81% position and 4.72% presence false positives), resulting in a comparatively reduced overall specificity, as seen in Table 3.

Automated Scoring Program Versus Nonmatched Human Data
Thus far, only data from element subscores on which the human raters agreed with one another, or averaged total scores, have been considered. However, it is also important to investigate the performance of the automated program in more ambiguous cases. For this reason, the Automated Scoring Program was then evaluated on drawings where human raters disagreed. To do this, we examined element or total score cases in which the two raters did not agree.
We then examined how the Automated Scoring Program resolved these rater disagreements, both in cases where there was clear disagreement between the human raters (i.e., three element points disagreed upon) and in cases with clearer agreement between the raters (i.e., only one element point disagreed upon). In cases where the scoring was more ambiguous (i.e., disagreement by 2 points was common on the small left vertical interior divider line, element 12), the automated scoring program scored less favorably, giving the majority of participants zero points (37.39%). For these participants, however, the consistency between the Automated Scoring Program and the average human rater total score was still good, ICC = .69, F(40, 40) = 15.80, p < .001, 95% CI [−.07, .90].
When looking at all element score disagreements between raters, regardless of the degree of disagreement (i.e., by 1 point or more), the automatic scoring program withheld accuracy, position, and presence points more often (57.64% of the time) than it awarded them (42.36% of the time). This was especially the case for accuracy, where the Automated Scoring Program scored elements as inaccurate far more often than as accurate. This can be seen in Figure 5.

Strengths of the Automated Scoring Program
Overall, the automated scores matched well with the scores assigned by human raters, with the majority of the automated total scores being within 15% of the manually assigned scores. The Automated Scoring Program was reliably able to extract and identify figure elements in drawings. For example, drawings that contained distorted or disconnected lines (Figure 6, Panel A), partial copies (Figure 6, Panel B), additional elements (Figure 6, Panel C), and mild tremor (Figure 6, Panel D) were generally scored accurately. Overall, the Automated Scoring Program was able to successfully discriminate elements from a wide range of imperfectly drawn figures.

Automated Scoring Program Challenges
The Automated Scoring Program was found to encounter the most difficulty when scoring circle and star elements, as the program must employ precise criteria (e.g., number and angle of intersecting lines) to identify these features. For example, the automated program struggled to identify the circular element, missing 7.22% of circles that were marked as present by the human raters. This systematic false negative specifically occurred when the circle was drawn as an arc, as multiple distinct overlapping lines, or as another nonclosed path (Figure 8, Panel A). Star elements also may not be correctly identified, with the automated program missing 5.60% of stars marked as present by the human raters. This false negative occurs if stars are drawn as a single continuous path, rather than as distinct lines (Figure 8, Panel B). However, it should be noted that overall the inaccurate scoring by the program was comparatively infrequent, impacting scoring on only 119/928 (12.82%) of the drawings considered.

Known-Group Discriminability
In order to sanity check the scoring of the Automated Scoring Program, we compared the three sample groups (i.e., subacute stroke, chronic stroke, and healthy adults) to see if they performed differently
from each other. Given typical recovery trajectories following stroke, we would expect that the subacute stroke group would score lower than the chronic stroke group and that the chronic stroke group would score lower than the healthy adult group. An ANCOVA was conducted to establish the differences in total scores between the healthy adult and stroke survivor groups while controlling for demographic differences in age and education. For the copy condition, the ANCOVA revealed a significant effect of group when controlling for age and education, F(2, 386) = 40.80, p < .001. A Tukey Honest Significant Difference (HSD) test indicated that healthy adults performed significantly better than both subacute stroke survivors (M difference = 7.65, p < .001, d = −1.72) and chronic stroke survivors (M difference = 3.96, p < .001, d = −1.10).
For stroke survivors specifically, lesion volume was added to the model as a covariate, and Tukey HSD demonstrated significant differences between the subacute and chronic stroke survivors (M difference = 4.69, p = .01, d = −.46). For the recall condition, when only controlling for the effects of age and education on total score, healthy adults again performed significantly better than both subacute stroke survivors (M difference = 8.94, p < .001, d = −1.31) and chronic stroke survivors (M difference = 4.63, p < .001, d = −.99). When lesion volume was additionally added for the recall data, the Tukey HSD analysis revealed no significant difference between the subacute and chronic stroke survivors in the recall condition (M difference = 3.89, p = .13, d = −.21). See Figure 9 for the distributions of the total scores in the copy and recall Figure Copy test conditions.
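For readers who want to reproduce this style of analysis, the following Python/statsmodels sketch fits an ANCOVA of copy-condition total score on group with age and education as covariates and then runs Tukey HSD comparisons. The file and column names are hypothetical, the paper's own analyses were run in R, and pairwise_tukeyhsd here compares raw group means rather than the covariate-adjusted means reported above.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical file and column names: one row per participant with the copy
# total score, group label, age, and years of education
df = pd.read_csv("figure_copy_scores.csv")

# ANCOVA: effect of group on the copy-condition total score,
# controlling for age and education
model = smf.ols("copy_total ~ C(group) + age + education", data=df).fit()
print(anova_lm(model, typ=2))

# Post hoc pairwise group comparisons (computed on raw group means here,
# not on covariate-adjusted means)
print(pairwise_tukeyhsd(endog=df["copy_total"], groups=df["group"]))
```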

Figure 5
ROC Curve Illustrating Sensitivity/Specificity of the Automated Scoring Program Binarized Impairment in Comparison to Averaged Rater Scores Binarized Impairment

Note. ROC = receiver operating characteristic. There were separate cut-offs for the recall and copy conditions of the Figure Copy Task. Impairment on the task was classified as greater than 2 SDs below the mean score, and the overall graph takes into account both conditions.

Discussion
This investigation aimed to develop a novel, automated program to score the OCS-Plus Figure Copy Task (Demeyere et al., 2021) and to evaluate the accuracy and utility of this automated tool. Overall, the automated scoring program was able to reliably extract and identify individual figure elements and to assign total scores which agreed well with manual scores across both the copy and recall conditions. Compared to overall impairment categorizations based on manual scores, the Automated Scoring Program had a high overall sensitivity and specificity and was reliably able to distinguish between different clinical impairment groups. The novel automated program was found to be generally robust and very close to the manual scoring overall. There is a clear benefit of automating Figure Copy scoring, in terms of time and cost savings, in particular allowing this screening assessment to be used without the need for highly trained neuropsychologists to administer and score the task. At the group level, the scoring tool is clearly able to distinguish groups, and diagnostic accuracy compared to manual scoring was very high with an overall AUC of 86.7%. At an individual patient level, we did note some specific response patterns which resulted in systematic scoring failures on the automatic tool. Even though these were low in incidence, if the scoring program is to be used at an individual diagnostic level, it is important to highlight these. Combining the automated scores with full visualization of the original drawing in the reports is key to help interpret all scores at the individual level (see https://osf.io/z3b46/ for an example output of the Automated Scoring Program). Overall, the very high alignment with manual scoring means this program represents a significant and pragmatic advancement over traditionally employed manual scoring procedures, setting the scene for potential implementation in wide-scale screening programs, potentially even in self-assessment settings.
Within this investigation, the two human raters were found to assign scores with a high degree of agreement. This consistency was present across individual element subscores as well as within both
copy and recall condition data. When human raters did not assign identical scores, the source of disagreement was most commonly individual element position and accuracy scores. However, the human raters in this investigation completed an extensive training program designed to standardize assigned scores, which is typically not feasible to implement at scale within clinical environments. Reducing the need for high-level training and for neuropsychologists to score such tasks is therefore a pragmatic solution that allows a more automated and wide-scale range of cognitive screening to be conducted.
Overall, the automated scoring program was able to reliably extract and accurately score individual elements within patient Figure Copy Task responses. In cases where human raters assigned identical scores, there was a high degree of consistency between automated and manually assigned total scores and moderate agreement within individual element scores. The overall human-program score correlations in this investigation were largely similar to those reported by Vogt et al. (2019; .83 vs. .88, respectively). Within individual elements, the Automated Scoring Program demonstrated extremely high sensitivity (92.10%) and specificity (90.20%). It is important to note that human-program scoring differences do not necessarily represent algorithmic errors, but instead suggest the use of slightly different, but not necessarily less valid, scoring criteria. For example, the automated program has a tendency to be stricter than human raters when awarding points within the accuracy element subscore. However, despite this systematic difference within accuracy subscores, the vast majority of automated total scores (83.51%) were within 15% of manually assigned total scores. These findings suggest that the automated program employs slightly different element scoring criteria than the human raters, but this variance does not result in substantial changes within response total scores.
The performance of the automated scoring program was also separately investigated within responses where human raters did not
assign identical scores. This is a particularly critical analysis to conduct, as an automated program can potentially provide a systematic, reproducible method for resolving such human rater disagreements. Within individual response elements which were assigned different scores by human raters, the automated program tended to employ more lenient scoring criteria. For example, when a specific element was scored as being inaccurately drawn by one rater but accurately drawn by the other, the Automated Scoring Program was more likely to report that the element had been drawn accurately. Despite this tendency to be more lenient, as a whole, automated scores exhibited high consistency with both rater one and rater two's assigned total scores in cases where the raters disagreed. This indicates that the automated program's systematically lenient scoring of disagreed-upon individual elements does not appear to produce systematic biases within overall response scores. In any case where quantitative scores are assigned to a response that does not have a clear "ground truth" score, some degree of subjective judgment is required. The Automated Scoring Program employs consistency where human raters may not and provides the clear advantage of being able to standardize scoring across all responses. The Automated Scoring Program was found to exhibit several clear strengths over manual scoring procedures. First, it was able to systematically assign completely reproducible scores even in cases where drawings were distorted. Given that this investigation included data from a representative sample of subacute stroke survivors exhibiting a range of common poststroke cognitive impairments, responses were frequently extremely dissimilar to the target figure. The automated program was found to cope well with drawing inaccuracies due to comorbid fine-motor impairments, omissions due to visuospatial deficits, perseveration errors, and other common poststroke impairment patterns. This robustness greatly adds to the automated program's potential clinical utility. Second, while manual scoring of Complex Figure Copy drawings requires training and time to complete, the automated program is able to instantly produce detailed score breakdowns. This makes employing an automated scoring procedure extremely time efficient, which is a valued attribute especially within clinical settings.

Finally, the scores generated by the automated program are completely reproducible. These standardized scores are one of the greatest advantages of employing automated over manual scoring methods, as they facilitate valid score comparisons across many different raters in many different patient groups. Despite these advantages, some potential weaknesses were identified within the automated scoring procedure. First, there are specific response patterns that were found to result in systematic underscoring. For example, the automated program struggled to identify circle and star elements that did not meet its exact mathematical extraction criteria but were easily identifiable by human raters (Figure 8). Similarly, the Automated Scoring Program struggled to accurately score drawings when large, extra features were present within the response space (Figure 7). These failures occurred infrequently but represent a potential avenue for improving the automated scoring procedure, or even simply for providing an extra confidence rating for each figure, to flag those which may have been underscored. Future research should aim to identify more flexible methods for identifying complex elements and for preventing the presence of large extra elements from distorting figure segmentation. Finally, the automated program employs slightly different element subscoring strategies than human raters. Where the circle, star, and cross elements can be identified by the automated program, they are automatically scored as "accurate" because the extraction equations have specific placement and line intersection requirements. This means that these three detail elements cannot be scored as present without also being scored as accurate. However, this difference in scoring was not found to result in significant disagreements between automated and human-assigned scores.
Several previous investigations of automated segmentation algorithms have found that the best results are achieved when scoring procedures employ limited human feedback to address minor weaknesses in otherwise robust algorithms. For example, Wang et al. (2016) developed a deep-learning program to identify and segment potentially cancerous tissue in mammograms and found that a trained pathologist achieved an AUC of 99.6% while the automated segmentation program achieved an AUC of 96.6%. However, when the automated output was briefly reviewed by the trained pathologist to remove obvious false positive cell clusters, a maximal AUC of 99.5% was achieved while retaining the time-efficiency benefit of employing an automated scoring method (Wang et al., 2016). A similar approach could potentially be taken to improve the performance of the automated scoring program presented within this investigation. For example, human raters could quickly screen all figures assigned very low scores by the program to flag cases where normalization errors have produced false-negative scores. Overall, the automated scoring program was found to provide a robust and reliable method for analyzing a wide range of Figure Copy Task responses. However, future investigations can aim to further explore clinical feasibility and acceptability, and within this, investigate whether employing a collaborative scoring approach could maximize the efficacy and accuracy of automated scoring processes. Importantly, the automated scores were found to reliably distinguish between participants falling into different impairment groups. On average, subacute stroke survivors were assigned significantly lower scores than chronic stroke survivors, who were in turn assigned lower average scores than neurologically healthy adults. These findings are in line with expectations, demonstrating the external validity of automated Figure Copy Task scores. Receiver operating characteristic analysis demonstrated that, compared to overall impairment categorizations based on manual scores, the Automated Scoring Program had high overall sensitivity and specificity (80% and 93.4%, respectively; AUC = 86.7%). This finding illustrates that impairment classifications based on automated scores alone are largely comparable to those assigned by human raters. Taken together, this external validity and ability to identify overall impairment highlight the automated scoring program's potential clinical utility.
Complex Figure Copy Tasks are commonly used as a component of neuropsychological evaluations within both clinical and research settings. From a clinical perspective, automated scoring offers a time-efficient solution for standardizing Figure Copy scores in order to more reliably detect impairment patterns across many different patient groups. Examiners will no longer have to complete time-consuming scoring or training procedures and will be provided with immediate, highly detailed scoring results. This in turn may help improve the speed and accuracy of identifying common visuospatial and nonverbal memory-based neuropsychological deficits and open the door to wider population-based cognitive screening and (assisted) self-assessments. From a research perspective, employing automated Figure Copy scoring helps reduce bias due to reliance on subjective examiner judgments. This is a critical advantage, as it facilitates valid, large-scale comparisons of Figure Copy Task data collected by different examiners, within different patient groups or research settings. Automated scoring is also completely reproducible, augmenting the reliability of any findings based on the analysis of these scoring data. Overall, the results of this investigation strongly suggest that the novel, automated Figure Copy scoring tool is a robust and reliable scoring methodology that can be employed to produce immediate, unbiased, and reproducible scores for Complex Figure Copy Task responses in clinical and research environments.

Limitations
There are several potential avenues through which future research can aim to expand on the findings of this investigation. First, Complex Figure Copy Tasks are not only commonly employed with stroke patients but are also regularly administered to patients with suspected dementia, traumatic brain injury, and other neurological deficits. Patients falling within each of these impairment categories may exhibit different error patterns within Figure Copy Tasks. Future research can aim to investigate whether this Automated Scoring Program performs equally well across these patient groups. For other commonly used target figures (e.g., the Rey-Osterrieth Complex Figure), future research will need to develop additional, specialized automated scoring algorithms. Similarly, the automated program relies on detailed (x, y) co-ordinates and timestamps produced by a tablet computer-based task. While computerized neuropsychological testing is rapidly being adopted in clinical and research environments (e.g., Bauer et al., 2012), many Figure Copy Tasks are still administered in pen-and-paper format, and the embedding of computerized testing in clinical practice remains a challenge (e.g., see Schmand, 2019).

Conclusions
This investigation presents a novel, automated scoring tool for the OCS-Plus Figure Copy Task (Demeyere et al., 2021). Overall, the automated scoring program was able to reliably extract and identify individual figure elements and to assign total scores which agreed well with manual scores across both the copy and recall conditions. This automated program was reliably able to identify overall impairment patterns and distinguish between different clinical impairment groups. This represents a significant advancement as this novel technology can be employed to produce immediate, unbiased, and reproducible scores for Complex Figure Copy Task responses in clinical and research environments. More generally, the findings of this investigation suggest that automated scoring procedures can be implemented to improve the scope and quality of neuropsychological investigations by reducing reliance on subjective examiner judgments and improving scoring time-efficiency.