Orthopedics

Feature Article 

Intraobserver Reliability and Interobserver Agreement in Radiographic Classification of Heterotopic Ossification

Georgios I. Vasileiadis, MD, PhD; Yodhiaki Itoigawa, MD, PhD; Derek F. Amanatullah, MD, PhD; Luis Pulido-Sierra, MD; Jeremy R. Crenshaw, PhD; Christine Huyber; Michael J. Taunton, MD; Kenton R. Kaufman, PhD, PE

Abstract

The most widely used radiologic classification system for heterotopic ossification after total hip arthroplasty (THA) is the Brooker scale. In 2002, Della Valle et al proposed a modified rating system for heterotopic ossification to increase intraobserver reliability and interobserver agreement. To date, few studies have compared these 2 classification systems, and those that have were grossly underpowered. In the current study, 3 clinicians reviewed the charts of 236 patients with documented radiographic heterotopic ossification at least 2 months after THA and independently graded the amount of heterotopic ossification according to the Brooker and Della Valle classification systems. Then the intraobserver reliability and the interobserver agreement of each classification system were calculated with Cohen's kappa (κ) coefficient of agreement. The Brooker scale showed moderate to substantial intraobserver reliability (0.43≤κ<0.71), and the Della Valle classification system showed substantial intraobserver reliability (0.65≤κ<0.77). Both classification systems showed moderate interobserver agreement (0.40≤κ<0.60). Della Valle grade C (ie, presence of bone spurs from the pelvis or femur leaving less than 1 cm between opposing surfaces and apparent bone ankylosis) and Brooker grade IV had the best interobserver agreement. The best interobserver agreement for any grade was seen with grade C of the Della Valle classification system, which showed substantial interobserver reliability (0.60≤κ<0.80). The Della Valle classification system may be slightly better in patients with large amounts of heterotopic ossification, but both classification systems lack sufficient clarity and are open to significant subjective interpretation. [Orthopedics. 2017; 40(1):e54–e58.]

Heterotopic ossification is the formation of mature lamellar bone in nonosseous tissues. It is a common complication after total hip arthroplasty (THA), occurring in 15% to 90% of cases.1 There are 24 different radiologic classification systems for heterotopic ossification, such as the Brooker, Arcq, DeLee, and Hamblen systems. The Brooker scale (Figure 1) is used in 47% of published research studies.2,3 However, the Brooker scale does not have adequate intra- or interobserver reliability.4 The amount of heterotopic ossification that interferes with hip range of motion, classified as Brooker grades III and IV, varies from 7% to 63%.5–7 Some authors argue that small amounts of heterotopic ossification lead to significant restriction of hip mobility, whereas most believe that only a significant amount of heterotopic ossification interferes with hip mobility.1,5,8,9


Figure 1: Brooker scale. Isolated bony islands (A). Bone spurs from the pelvis or the proximal end of the femur with more than 1 cm space between opposing surfaces (B). Bone spurs from the pelvis or the proximal end of the femur with less than 1 cm space between opposing surfaces (C). Bridging or ankylosis of the hip (D).

To reduce inconsistencies in grading, increase the correlation of radiographic appearance with clinical significance, and improve intraobserver reliability and interobserver agreement, Della Valle et al4 proposed a simplified classification system for heterotopic ossification after THA (Figure 2). However, the reliability of the Della Valle classification has not been independently evaluated. Toom et al10 attempted to evaluate the interobserver agreement of the Della Valle, Arcq, Brooker, and DeLee classification systems, but their study was grossly underpowered. The current retrospective clinical study compared the reliability of the classic Brooker scale and the Della Valle classification. The Della Valle classification system consists of 3 grades, whereas the Brooker scale consists of 4 grades. The Della Valle classification system provides important information on the size of islands of heterotopic ossification, whereas the Brooker scale does not. The authors hypothesized that the Della Valle classification system, which is simpler (ie, fewer grades) and uses more specific descriptions (eg, size specifications), would have higher inter- and intraobserver reliability than the Brooker scale when measuring heterotopic ossification after THA.


Figure 2: Della Valle classification system. Absence of heterotopic ossification and presence of 1 or more islands of bone less than 1 cm long (A). Presence of 1 or more islands of bone at least 1 cm long and presence of bone spurs from the pelvis or femur, leaving at least 1 cm between opposing surfaces (B). Presence of bone spurs from the pelvis or femur, leaving less than 1 cm between opposing surfaces and apparent bone ankylosis (C).

Materials and Methods

After institutional review board approval was obtained, the charts of 236 patients with documented radiographic heterotopic ossification at least 2 months after THA were reviewed by 3 clinicians (G.I.V., D.F.A., L.P.-S.) with experience in independent evaluation of electronic radiographs. Each reviewer noted the grade of heterotopic ossification on anteroposterior pelvic radiographs according to the Brooker scale (Figure 1) and the Della Valle classification system (Figure 2). Each reviewer repeated the grading a second time at least 2 weeks after the previous review. Reviewers were blinded to their previous grades.

Intraobserver reliability and interobserver agreement for each classification system were assessed with Cohen's kappa (κ) coefficient of agreement (SAS version 9.3; SAS Institute Inc, Cary, North Carolina). Required sample size calculations were based on previously established incidences of heterotopic ossification within the THA population and an acceptable value of κ≥0.5.4 Inclusion of a total of 236 patients with heterotopic ossification after THA was necessary to achieve a 95% confidence interval of ±0.10 when κ≥0.5.11 Intraobserver reliability was assessed by comparing observations made by the same observer during the first and second evaluations. Interobserver agreement was assessed by comparing the observations made by each observer during the first evaluation. Interpretation of κ was based on the criteria of Landis and Koch, as follows: almost perfect (κ≥0.80), substantial (0.60≤κ<0.80), moderate (0.40≤κ<0.60), fair (0.20≤κ<0.40), and poor (κ<0.20).12
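As an illustration of the statistic described above, the following is a minimal sketch of Cohen's κ for 2 sets of grades, together with the interpretation thresholds quoted in this section. It is not the SAS procedure used in the study; the function names (`cohen_kappa`, `interpret_kappa`) and the example grades are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length sequences of grades."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must grade the same non-empty set of cases")
    n = len(ratings_a)
    # Observed agreement: proportion of cases given the same grade twice.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the marginal grade frequencies, summed over grades.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Label kappa with the cut points quoted in the Methods text."""
    if kappa >= 0.80:
        return "almost perfect"
    if kappa >= 0.60:
        return "substantial"
    if kappa >= 0.40:
        return "moderate"
    if kappa >= 0.20:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # Hypothetical Della Valle grades for 10 radiographs (not study data).
    first_review = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"]
    second_review = ["A", "A", "B", "C", "C", "C", "A", "B", "B", "A"]
    kappa = cohen_kappa(first_review, second_review)
    print(f"kappa = {kappa:.3f} ({interpret_kappa(kappa)})")
```

Passing the first and second evaluations of one observer gives an intraobserver value; passing the first evaluations of 2 different observers gives an interobserver value for that pair.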

Results

The Brooker scale showed moderate to substantial intraobserver reliability, with κ ranging from 0.43 to 0.71, and the Della Valle classification system showed substantial intraobserver reliability, with κ ranging from 0.65 to 0.77 (Figure 3). For each observer, intraobserver reliability did not differ significantly between the 2 classification systems, based on overlapping 95% confidence intervals. Overall intraobserver reliability for each classification system cannot be calculated because calculations must remain independent for each observer. However, the Della Valle classification system appears to have much less variability in intraobserver agreement among observers (Figure 3).


Figure 3: Intraobserver reliability of the Brooker scale and the Della Valle classification system for each observer. White bars indicate reliability of the Brooker scale. Black bars indicate reliability of the Della Valle classification system. Error bars indicate 95% confidence interval for each observer.

Both classification systems showed moderate interobserver agreement. For the Brooker scale, interobserver agreement was moderate (κ=0.41; 95% confidence interval, 0.37–0.45; P<.0001) (Figure 4). Interobserver agreement did not differ significantly among the Brooker grades, based on overlapping 95% confidence intervals. Interobserver agreement of the Della Valle classification system was also moderate (κ=0.53; 95% confidence interval, 0.47–0.59; P<.0001) (Figure 5). Overall, no statistically significant difference in interobserver agreement was found between Brooker grades and the Della Valle classification system, based on overlapping 95% confidence intervals. However, the best interobserver agreement for any grade was seen with grade C of the Della Valle classification system, which showed substantial interobserver agreement, based on nonoverlapping 95% confidence intervals (Figure 5).
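The text does not state how agreement for an individual grade was computed. One common approach, assumed here purely for illustration, is to collapse the grades to "this grade" vs "any other grade" and compute Cohen's κ on the resulting 2×2 table for a pair of observers. The function name `kappa_for_grade` and the example grades below are hypothetical and are not study data.

```python
def kappa_for_grade(ratings_a, ratings_b, grade):
    """Cohen's kappa after collapsing grades to `grade` vs any other grade.

    Assumption: per-grade agreement is obtained by dichotomizing the ratings;
    the article does not specify the method actually used.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must grade the same non-empty set of cases")
    n = len(ratings_a)
    is_grade_a = [r == grade for r in ratings_a]
    is_grade_b = [r == grade for r in ratings_b]
    # Observed agreement on the dichotomized ratings.
    observed = sum(x == y for x, y in zip(is_grade_a, is_grade_b)) / n
    # Chance agreement from each rater's proportion of cases given this grade.
    p_a = sum(is_grade_a) / n
    p_b = sum(is_grade_b) / n
    expected = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical first-evaluation grades from 2 observers (not study data).
    observer_1 = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"]
    observer_2 = ["A", "A", "B", "C", "C", "C", "A", "B", "B", "A"]
    for grade in ("A", "B", "C"):
        print(grade, round(kappa_for_grade(observer_1, observer_2, grade), 3))
```

With 3 observers, a value per grade would have to be combined across observer pairs or obtained from a multirater statistic; this sketch covers a single pair only.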


Figure 4: Interobserver agreement of the Brooker scale for each grade. Error bars indicate 95% confidence interval for each grade.


Figure 5: Interobserver agreement of the Della Valle classification system for each grade. Error bars indicate 95% confidence interval for each grade. Asterisk indicates statistical significance based on nonoverlapping 95% confidence intervals.

Discussion

This study assessed the intraobserver reliability and interobserver agreement of 2 commonly used classification systems for radiographic evaluation of heterotopic ossification. Neither system was found to be superior, and both systems had only moderate interobserver agreement. However, the Della Valle classification system had more consistent intraobserver reliability among the 3 observers.

Both the Brooker scale and the Della Valle classification system showed substantial reported intraobserver reliability.4 The authors found that intraobserver agreement ranged from 0.49 to 0.71 for the Brooker scale and from 0.66 to 0.77 for the Della Valle classification system, depending on the observer. A possible explanation for the discrepancy, especially among observers, is that both classification systems may be confusing to some observers (Figure 6). Specifically, 2 types of heterotopic ossification are not adequately described by any Della Valle grade. A small femoral or pelvic spur is classified as neither grade A nor grade B (Figure 7). Similarly, a large bone island that leaves little space between the island and the femur and pelvis is classified as neither grade B nor grade C (Figure 8). A similar argument can be made for the Brooker scale. A large bone island is classified as neither grade I nor grade IV (Figure 8). Hence, although the grades appear clear pictographically, in practical application they are subjective, leading to variability in intraobserver reliability.


Figure 6: Anteroposterior radiograph of the right hip with a large spur from the ilium and lesser trochanter (solid arrows) suggesting Brooker grade of III and Della Valle grade of B, but inclusion of the bone island formed in the abductor musculature (dashed arrow) might suggest Brooker grade of IV or Della Valle grade of C, depending on the reviewer. Complex radiographs such as this make determination of the final grade subjective.


Figure 7: Anteroposterior radiograph of the right hip showing a small pelvic spur (circle) that is not classified as Della Valle grade A or B.


Figure 8: Anteroposterior radiograph of the right hip showing a large bone island (circle) that leaves little space between the island and the femur and pelvis and is not classified as Brooker grade I or IV or as Della Valle grade B or C.

Both the Brooker scale and the Della Valle classification system showed moderate reported interobserver agreement.4 For the Brooker scale, κ was 0.31 to 0.54, with mean κ of 0.51. For the Della Valle classification system, κ was 0.38 to 0.69, with mean κ of 0.53. With the Della Valle classification system, evaluations by the same observer were more consistent, but overall interobserver agreement did not change. The current data independently corroborated this observation.

Conclusion

For each classification system, the most severe grade (ie, Della Valle grade C and Brooker grade IV) had the best interobserver agreement. Della Valle grade C showed statistically higher interobserver agreement, suggesting that this system may be more reliable than the Brooker scale in detecting high grades of heterotopic ossification. High grades of heterotopic ossification are easier to distinguish and may offer the most clinically relevant classifications.5,8 However, this is not universally true because a large bone island, classified as grade I on the Brooker scale, may have the same high clinical effect as a grade IV heterotopic ossification (Figure 8). Moreover, an ideal classification system should have high interobserver reliability, independent of the grade. Orthopedic surgeons should consider switching from the prevalent Brooker scale to the modestly more reliable Della Valle classification system when evaluating heterotopic ossification.

References

  1. Ahrengart L, Lindgren U. Functional significance of heterotopic bone formation after total hip arthroplasty. J Arthroplasty. 1989; 4(2):125–131. doi:10.1016/S0883-5403(89)80064-6
  2. Brooker AF, Bowerman JW, Robinson RA, Riley LH Jr. Ectopic ossification following total hip replacement: incidence and a method of classification. J Bone Joint Surg Am. 1973; 55(8):1629–1632.
  3. Neal B, Gray H, MacMahon S, Dunn L. Incidence of heterotopic bone formation after major hip surgery. ANZ J Surg. 2002; 72(11):808–821. doi:10.1046/j.1445-2197.2002.02549.x
  4. Della Valle AG, Ruzo PS, Pavone V, Tolo E, Mintz DN, Salvati EA. Heterotopic ossification after total hip arthroplasty: a critical analysis of the Brooker classification and proposal of a simplified rating system. J Arthroplasty. 2002; 17(7):870–875. doi:10.1054/arth.2002.34819
  5. Kjaersgaard-Andersen P, Ritter MA. Prevention of formation of heterotopic bone after total hip arthroplasty. J Bone Joint Surg Am. 1991; 73(6):942–947.
  6. Søballe K, Christensen F, Kristensen SS. Ectopic bone formation after total hip arthroplasty. Clin Orthop Relat Res. 1988; 228:57–62.
  7. Thomas BJ. Heterotopic bone formation after total hip arthroplasty. Orthop Clin North Am. 1992; 23(2):347–358.
  8. Eggli S, Rodriguez J, Ganz R. Heterotopic ossification in total hip arthroplasty: the significance for clinical outcome. Acta Orthop Belg. 2000; 66(2):174–180.
  9. Kromann-Andersen C, Sørensen TS, Hougaard K, Zdravkovic D, Frigaard E. Ectopic bone formation following Charnley hip arthroplasty. Acta Orthop Scand. 1980; 51(4):633–638. doi:10.3109/17453678008990854
  10. Toom A, Fischer K, Märtson A, Rips L, Haviko T. Inter-observer reliability in the assessment of heterotopic ossification: proposal of a combined classification. Int Orthop. 2005; 29(3):156–159. doi:10.1007/s00264-004-0603-9
  11. Altaye M, Donner A, Klar N. Inference procedures for assessing interobserver agreement among multiple raters. Biometrics. 2001; 57(2):584–588. doi:10.1111/j.0006-341X.2001.00584.x
  12. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977; 33(1):159–174. doi:10.2307/2529310
Authors

The authors are from the Department of Orthopedic Surgery, Mayo Clinic, Rochester, Minnesota.

Dr Amanatullah is a 2-time Blue Ribbon Article Award recipient (Orthopedics, November/December 2016). Dr Taunton is a previous Blue Ribbon Article Award recipient (Orthopedics, November/December 2016).

Dr Vasileiadis, Dr Itoigawa, Dr Pulido-Sierra, Dr Crenshaw, Ms Huyber, and Dr Kaufman have no relevant financial relationships to disclose. Dr Amanatullah is a paid consultant for Omni, Exactech, Sanofi, and Blue Jay Mobile Health and has received grants from Acumed, Stryker, and Blue Jay Mobile Health. Dr Taunton is a paid consultant for and receives royalties from DJO Global.

The authors thank Dr Chrisoula A. Toupadakis for the illustrations of the Della Valle and Brooker classification systems of heterotopic ossification of the hip.

Correspondence should be addressed to: Kenton R. Kaufman, PhD, PE, Department of Orthopedic Surgery, Mayo Clinic, 200 First St SW, Motion Analysis Laboratory, CN L-110, Rochester, MN 55905 (kaufman.kenton@mayo.edu).

Received: May 16, 2016
Accepted: July 26, 2016
Posted Online: September 30, 2016

doi:10.3928/01477447-20160926-05
