Development of the TeamOBS‐PPH – targeting clinical performance in postpartum hemorrhage

This study aimed to develop a valid and reliable TeamOBS‐PPH tool for assessing clinical performance in the management of postpartum hemorrhage (PPH). The tool was evaluated using video‐recordings of teams managing PPH in both real‐life and simulated settings.

is therefore of high interest (7)(8)(9)(10). Previous research has evaluated performance in terms of outcome, technical performance, non-technical performance, and clinical performance.
Outcome performance assessment captures the incidence of PPH or maternal death due to PPH. Outcome assessment is useful for benchmarking or for comparing outcomes of PPH in two periods, for example, before and after changing treatment paradigms. Although outcome evaluation is important to organizations, it is used less as feedback for a single team, as the team performance is not necessarily reflected in the outcome (5).
Technical performance assessment is useful for evaluating teaching and the training of technical skills, for example, intrauterine palpation or suturing a cervical laceration. However, technical skills evaluation usually addresses only an individual and not the performance of the entire team (11).
Non-technical performance assessment evaluates leadership, communication, and teamwork, which are important dimensions that should be evaluated. However, such evaluations will not include the actual patient care provided (12).
Clinical performance assessment is focused on the quality of the patient care provided by the entire team. Tools for clinical performance assessment are very detailed, as each required action has a weighted score reflecting its importance and the timing of actions involved (13,14). Thus, these tools differ from procedural checklists that are used during management of PPH to help the team to remember elements specified in local guidelines (15,16). Clinical performance tools are very useful for feedback and formal debriefing of the team (17,18).
Assessments of outcome, technical performance, and non-technical performance are widely recognized as important tools in optimizing management of PPH (3), but we were unable to identify in the literature any tool assessing clinical performance in PPH. A tool that also assesses clinical performance is therefore needed. Such a tool will be valuable for quality assessment and benchmarking of performance over time and between hospitals or regions (9,13). Clinical performance assessment will also be valuable in education, where it can be included in feedback after simulation training or after a real-life PPH (14,17,18). Furthermore, in research, it may help us to define high-performing teams and discover the key to their success. The present study aimed to close this gap:
• by developing a tool for assessing the clinical performance of teams managing PPH;
• by evaluating the reliability, internal validity, and external validity of the tool using video-recordings of teams managing PPH in both real-life situations and in situ simulations.

Development of the TeamOBS-PPH tool
In the first part of this study, we conducted an e-mail-based Delphi process to develop the TeamOBS-PPH tool (Figure 1). The Delphi panel consisted of 12 senior obstetricians from maternity units in the UK (n = 3), Norway (n = 2), Sweden (n = 3), Denmark (n = 3), and Iceland (n = 1). In four rounds, the experts answered questionnaires (Supporting Information Appendix S1) concerning the items to be included in the tool and the weight to be assigned to each item (Figure 2). Based on these results (19,20), we developed the TeamOBS-PPH tool for assessment of clinical performance (Figure 3). The tool generates a score from 0 to 100 with a minimal pass level of 60, below which the risk of harming the patient is to be considered. We also defined a lower level of high performance at 85, above which no or only minor errors affected the score.
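The score bands above can be sketched as a simple classification function (a hypothetical illustration for clarity; the function name and return labels are not from the paper):

```python
def performance_band(score):
    """Classify a TeamOBS-PPH score (0-100) against the thresholds defined
    in the Delphi process: minimal pass level 60 and lower level of high
    performance 85."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 60:
        # Below the minimal pass level: risk of harming the patient
        # is to be considered.
        return "below pass level"
    if score < 85:
        return "pass"
    # At 85 or above, no or only minor errors affected the score.
    return "high performance"
```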

Testing of validity and reliability
In the second part of this study, to evaluate the validity and reliability, we used the framework and conceptual definitions devised by Cook and Beckman (2006), described as the five sources of validity evidence (21,22). To answer our validation questions as listed in Table 1, tests were performed at three levels.
Level 1: Performance in test setting. The initial testing of TeamOBS-PPH was conducted from November 2014 to February 2015. We used four selected video-recordings of in situ team training of simulated severe PPH (1300 mL) due to refractory uterine atony from two Danish hospitals: the Regional Hospital in Horsens, Maternal Care Level 2 (HEH), and Aarhus University Hospital, Maternal Care Level 3 (AUH) (23). The two expert teams consisted of five to six clinicians: obstetricians, midwives, and technicians. The two novice teams consisted of midwifery and medical students in their final year of training. The simulations were performed on the delivery unit with a hybrid mannequin (PROMPT Birthing Simulator, Limbs and Things).
We recruited four obstetric providers as raters (one registrar, one senior consultant, one junior midwife, and one senior midwife). These providers were not familiar with the tool, as they were not members of the Delphi panel. They were formally trained as "raters" in a one-hour session where they were introduced to the tool and had to discuss each item in relation to their daily life on the ward. For example, they had to agree which individual actions would score 0, 1, and 2. During training, raters also jointly assessed a video-recording of a team from an in situ simulation. After training, the four raters applied the TeamOBS-PPH tool to assess all four video-recordings independently. After 1 month, the four raters reassessed all video-recordings.

[Figure 2: The Delphi rounds. Round 1: the expert group was asked to add, remove, or suggest corrections to the 19 items (result: four items adjusted; none added or removed). Round 2: each member of the expert group weighted the importance of each item on a 5-point Likert scale (result: consensus achieved on 17 of the 19 items). Round 3: the expert group weighted the same items again, this time with an anonymous summary of the scores from Delphi round 2 (result: consensus achieved on all 19 items). After preliminary testing, the final assessment tool TeamOBS-PPH was approved.]
Level 2: Real-life setting, Denmark. We obtained video-recordings of teams managing real-life PPH situations at two Danish hospitals: HEH and AUH. In all 17 delivery rooms, two or three mini-dome high-definition surveillance cameras were placed in the corner at ceiling level to capture a view of the clinical team from the head-end facing the patient to minimize inappropriate patient exposure. However, staff and patient faces were potentially identifiable. A microphone was placed at the center of the ceiling. In each delivery room, we installed a video-recording system which was activated by a Bluetooth chip in the obstetrician's telephone when he or she entered the room. Camera capacity allowed recording day and night with a storage capacity of 5 min. When the system was activated upon the obstetrician's entry, the immediately preceding 5 min of video-recording were saved to the server, and all subsequent footage was captured. The video was deleted after 48 h. The midwives reported eligible cases (PPH ≥1000 mL) to the research team. The midwives' reports included the total blood loss (mL) assessed by weight. Thus, the research team had 48 h to obtain consent from all parties to download the video-recording from the server for research purposes. Inclusion of teams is visualized in Figure 4.
We recruited two senior consultant obstetricians from the studied hospitals (HEH and AUH), who had not participated in the Delphi process or in the Level 1 testing. They had the same formal one-hour training as the raters performing the Level 1 testing, and they subsequently assessed all 85 video-recordings independently, blinded to each other's scores and the reported total blood loss. After every five recordings, the raters were asked to discuss any difficulties they had experienced, but they were not allowed to alter any prior assessments. After 1 month, the raters reassessed 20% of the recordings (randomly selected) to evaluate intra-rater agreement over time.
Videos were collected during 15 months in 2014-2015 and analyzed in the spring of 2016.
Level 3: External validity, USA. The TeamOBS-PPH tool is based on traditional approaches in five countries in northern Europe. To explore whether our results can be applied in a different cultural context, we tested its use at the Obstetric Center at Lucile Packard Children's Hospital, Stanford, California, USA. Simulation was used because it was not possible to find another hospital conducting live video-recordings of PPH that could be used for research. To account for system differences, a simplified Delphi process was conducted as a group discussion among five clinicians working in the obstetric field (one Fellow, two Consultant Obstetricians, one Consultant Obstetric Anesthetist, and one Labor and Delivery Nurse). They were asked for approval of, or corrections to, each item considering standard clinical practice in their regional area. The final tool was adjusted and approved by the Delphi group.
Video-recordings of simulated in situ team training in the management of severe PPH (1500 mL) due to refractory uterine atony were collected in the same hospital in 2011-2016. Each recording showed a team of 7-10 clinicians: obstetricians, anesthesiologists, labor and delivery nurses, and technicians. Fifteen video-recordings were included. Six of the teams performed on the labor and delivery unit using a hybrid mannequin (Mama Natalie Birthing Simulator, Laerdal). Nine teams performed the simulation in the operating theater with the mannequin NOELLE (Maternal and Neonatal Birthing Simulator, Gaumard).
Two raters (one Fellow and one Consultant Obstetrician) assessed the 15 video-recordings independently, blinded to each other's scores. They underwent the same formal training as all other raters. Analysis was conducted in the spring of 2016.

[Figure 3 scoring instructions: Clinical performance score = (weighted score + patient safety score)/2. Scoring: put a cross in the box and give "not indicated", "cannot be assessed", "0", "1", or "2".]
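The combined score described in Figure 3 can be written as a one-line sketch (assuming, as an illustration, that both sub-scores are already expressed on a 0-100 scale):

```python
def clinical_performance_score(weighted_score, patient_safety_score):
    """Combine the two sub-scores as in Figure 3:
    (weighted score + patient safety score) / 2."""
    return (weighted_score + patient_safety_score) / 2
```

For example, a weighted score of 80 combined with a patient safety score of 90 gives a clinical performance score of 85; this equal weighting is consistent with the global score contributing not less than 50% of the total.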

Statistical analysis
The statistical analysis of the clinical performance scores was performed on the logit-transformed scale using the normal model and back-transformed using the inverse logit function (24,25). Further description of Bland-Altman plots can be found in Supporting Information Figure S1. Analysis of blood loss was performed using the normal model after log-transformation. The relation between clinical performance and total blood loss was analyzed using simple linear regression. The potential confounding of bleeding velocity (mL/min) was assessed using multiple linear regression analysis. The model was checked by diagnostic plots of residuals. An intra-class correlation (ICC) >0.75 was considered high agreement (26). We used STATA version 14.0 for the statistical analysis (StataCorp, College Station, TX, USA).
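The logit transformation and back-transformation of the 0-100 scores can be sketched as follows (the function names and the handling of boundary scores are illustrative assumptions; the paper does not specify how scores of exactly 0 or 100 were treated):

```python
import math

def logit(p):
    """Log-odds of a proportion p in the open interval (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit (logistic function), mapping back into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def mean_score_logit_scale(scores, eps=0.005):
    """Mean of 0-100 scores computed on the logit scale, then
    back-transformed to the 0-100 scale.

    Shrinking proportions into (0, 1) by eps is an assumption made here
    so that boundary scores do not break the logit.
    """
    props = [min(max(s / 100, eps), 1 - eps) for s in scores]
    mean_logit = sum(logit(p) for p in props) / len(props)
    return inv_logit(mean_logit) * 100
```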

Part 1: The TeamOBS-PPH tool development
All 12 experts completed the Delphi process. In the first round, they adjusted four items but did not add or remove any items. One example was the item on calling for delivery of blood, which was adjusted to include both whole blood and packed cells to allow for differences between local guidelines in various countries. In the second round, the experts reached consensus on 17 of the 19 items; in the third round, on all 19 items. The Delphi panel approved the final tool as shown in Figure 3.

Part 2: Validity and reliability
A complete list of validity process results is presented in Table 1.

[Figure 4: Non-eligible teams: consent not obtained within 48 hours owing to lack of contact information or early patient discharge (n = 50); staff declined consent (n = 17); patient declined consent (n = 11); technical failure (n = 7). Consent obtained from all participants; video downloaded and included in the project (n = 103).]

Level 1: Test setting. The inter-rater agreement of the four raters was ICC = … (95% CI 0.85-0.99); re-evaluation after 1 month of one rater was ICC = 0.88 (95% CI 0.71-0.96), and the average of four raters was ICC = 0.94 (95% CI 0.83-0.97). Based on the excellent inter-rater agreement, we decided to use two raters in Level 2.
Level 2: Real-life setting, Denmark. We included 85 of the 188 cases of severe PPH occurring among the 6700 deliveries performed at the two hospitals in 2015. Reasons for non-eligibility are given in Figure 4. The blood loss among the included cases was a median of 1648 mL (95% CI 1542-1760), compared with a median of 1641 mL (95% CI 1524-1767) among the excluded cases (p = 0.93). Eighty-one different team combinations were included, four of which appeared in two recordings. The average team size was five (physicians, midwives, and technicians). Rater 1 scored with a median of 87 (95% CI 85-89), range 45-100; rater 2 with a median of 89 (95% CI 87-91), range 59-99 (p = 0.99). The inter-rater agreement of the two raters across the 85 videos was ICC = 0.83 (95% CI 0.74-0.89). The intra-rater agreement after 1 month for the two raters was ICC = 0.96 (95% CI 0.92-0.98). The correlation between the weighted scores and the patient safety scores is visualized in Figure 5a. The difference in agreement is visualized with limits of agreement in Figure 5b. The tool was applicable in all 85 cases of PPH, irrespective of the cause (uterine atony, retained placenta, or lacerations).
Only one item, 3.1 "consider the cause of the bleeding", changed the internal consistency when the ICC was re-calculated after each item had been deleted in turn (27). Deleting this item increased the ICC by 0.02 to a total ICC of 0.85, a negligible change, indicating that the tool remains reliable with this item included.
The total blood loss in the 85 real-life teams was associated with the level of performance: A score of 60 was associated with a median blood loss of 2097 mL (95% CI 1696-2593 mL), a score of 85 with 1696 mL (95% CI 1577-1824 mL), and a score of 100 with 1493 mL (95% CI 1315-1695 mL). The difference between the three levels of performance was significant (p = 0.0029) ( Table 2). Our results remained stable when adjusting for bleeding velocity as a potential confounder (Supporting Information Table S1).
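The three reported medians fall almost exactly on a straight line in log(blood loss) versus score, consistent with the log-transformed linear model described in the statistical analysis. As an illustration, the sketch below infers hypothetical coefficients from the two extreme reported medians (these are not the paper's fitted regression estimates):

```python
import math

# Hypothetical coefficients inferred from two of the reported medians
# (2097 mL at score 60 and 1493 mL at score 100); NOT the paper's
# fitted regression estimates.
SLOPE = (math.log(1493) - math.log(2097)) / (100 - 60)
INTERCEPT = math.log(2097) - SLOPE * 60

def predicted_median_blood_loss(score):
    """Back-transform the log-linear model to a median blood loss in mL."""
    return math.exp(INTERCEPT + SLOPE * score)
```

The interpolated value at a score of 85 (about 1696 mL) agrees with the reported median, supporting the log-linear relation.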
Level 3: Simulation setting, USA. In the United States, the Delphi panel adjusted the phrasing of two items: "Blood test for FBC, blood type for compatibility and cross match" was rephrased as "Blood test for CBC, coags, cross match and type", and "Monitor observations: pulse, blood pressure, and respiratory rate" was replaced with "Monitor observations: pulse, blood pressure, pulse ox, and respiratory rate". One item, "Documentation on PPH chart", was adjusted in weight from 2.5 to 3.5. A new item was added: "Correct placement of the Bakri balloon". No items were deleted. Following these adjustments, the Delphi panel approved the tool for local/

Usability of the TeamOBS-PPH tool
The tool was easy to use. At all three levels, the raters were trained for one hour and then applied the tool while the video was playing, i.e. they did not watch any video more than once.

Discussion
This is the first study to develop a tool for assessment of the clinical performance of teams managing PPH. The TeamOBS-PPH tool includes an objective weighted score based on a checklist of 19 items and a more subjective overall patient safety score. The tool is applicable in both real-life and simulated settings. The reliability and validity are high, and the tool is adaptable and modifiable to local clinical guidelines.
The main strength of this study is that it is based on video-recordings of the clinical management of patients with real-life PPH. To our knowledge, this has not been done previously. Another strength is that the tool was developed through a Delphi process with a panel of experts from five countries, and we were able to show that the tool was valid cross-culturally. Furthermore, higher clinical performance was significantly associated with less bleeding.
A limitation is that, despite the high technical quality of the video-recordings, it was difficult to accurately assess the amount of blood loss, which is why the raters had to rely on the team itself to verbalize this variable.
The risk of response bias must be considered, as we deliberately chose raters from the study population in order to utilize their knowledge of the actual health organization. Furthermore, we included the more subjective global score, contributing not less than 50% of the total score (14,28). The reason for this decision was that the global score quite often further decreases lower weighted scores and increases higher weighted scores (Figure 5a). Based on our findings, we recommend that:
• Each PPH recording be assessed by two independent raters. Two is sufficient, as the median difference between the observers was only 3 points, which is far below our predefined acceptable difference of 15 points for a single event (Figure 5b). With less experienced raters, the inter-rater agreement might be lower; therefore, three raters could be considered.
• If the difference between the two raters exceeds 15 points, the two raters should compare their ratings. This happened in our study in eight of 85 cases (Figure 5b), primarily among scores below 80. Most often the raters realized that one of them had misinterpreted an item.
• The acceptable level of performance be set at 60. To ensure a fair judgement of the few cases with lower scores, the raters should also discuss and agree on these ratings.
• The tool be modified according to local clinical guidelines. For example, the US teams checked the uterine cavity and considered placement of an intrauterine balloon in the delivery room, whereas in Northern Europe this treatment would generally take place in the operating theater.
• The raters consider using the free TeamOBS-PPH App, which may help with assessment.
• The TeamOBS-PPH tool be used for post-event assessment, i.e. summative evaluation of performance. It is not designed to be used as an alternative to a procedural checklist or as a cognitive aid for teams to follow during the event (16).
We suggest that the TeamOBS-PPH tool be considered for use in the following situations:
• Education, as a feedback tool after simulation training to encourage reflective practice. The App version of the TeamOBS-PPH tool includes a feedback module, in which the performance in different categories is visualized in graphs.
• Structured debriefing after real-life clinical events. We suggest using the feedback module in the App, where the model "SHARP -5-step feedback and debriefing tool (29)" is included.
• Quality assurance. This step is important to facilitate the best possible conditions in the delivery wards, allowing teams to become more effective and improve patient safety.
In conclusion, our study provides a new tool for assessing clinical performance in the management of PPH. It was developed through an international Delphi process and tested in real-life PPH with acceptable validity. The TeamOBS-PPH tool offers an opportunity to assess actual team performance which is useful for education, research, and feedback. We hope that the App TeamOBS will serve as an aid for the busy clinician to ease the process of providing structured feedback and to encourage continuous learning to improve team performance during real-life PPH.