Skip to main content

Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis



Cluster analysis (CA) is a frequently used applied statistical technique that helps to reveal hidden structures and “clusters” found in large data sets. However, this method has not been widely used in large healthcare claims databases where the distribution of expenditure data is commonly severely skewed. The purpose of this study was to identify cost change patterns of patients with end-stage renal disease (ESRD) who initiated hemodialysis (HD) by applying different clustering methods.


A retrospective, cross-sectional, observational study was conducted using the Truven Health MarketScan® Research Databases. Patients aged ≥18 years with ≥2 ESRD diagnoses who initiated HD between 2008 and 2010 were included. The K-means CA method and hierarchical CA with various linkage methods were applied to all-cause costs within baseline (12-months pre-HD) and follow-up periods (12-months post-HD) to identify clusters. Demographic, clinical, and cost information was extracted from both periods, and then examined by cluster.


A total of 18,380 patients were identified. Meaningful all-cause cost clusters were generated using K-means CA and hierarchical CA with either flexible beta or Ward’s methods. Based on cluster sample sizes and change of cost patterns, the K-means CA method and 4 clusters were selected: Cluster 1: Average to High (n = 113); Cluster 2: Very High to High (n = 89); Cluster 3: Average to Average (n = 16,624); or Cluster 4: Increasing Costs, High at Both Points (n = 1554). Median cost changes in the 12-month pre-HD and post-HD periods increased from $185,070 to $884,605 for Cluster 1 (Average to High), decreased from $910,930 to $157,997 for Cluster 2 (Very High to High), were relatively stable and remained low from $15,168 to $13,026 for Cluster 3 (Average to Average), and increased from $57,909 to $193,140 for Cluster 4 (Increasing Costs, High at Both Points). Relatively stable costs after starting HD were associated with more stable scores on comorbidity index scores from the pre-and post-HD periods, while increasing costs were associated with more sharply increasing comorbidity scores.


The K-means CA method appeared to be the most appropriate in healthcare claims data with highly skewed cost information when taking into account both change of cost patterns and sample size in the smallest cluster.

Peer Review reports


Cluster analysis

Cluster analysis (CA) is a statistical technique that helps reveal hidden structures by grouping entities or objects (e.g., individuals, products, locations) with similar characteristics into homogenous groups while maximizing heterogeneity across groups [1, 2]. Entities or objects of interest are grouped together based on attributes that make them similar, with the final goal being to distinguish these entities or objects by clustering them into comparable groups and to separate them from differing groups. Conceptually, CA aims to identify cluster solutions that are relatively homogeneous within each group, leading to clusters that have high intra-class similarity, while maximizing heterogeneity between the groups, leading to low inter-class similarity across clusters. Geometrically, the objects within a cluster are close together, while the distance between clusters is further apart. CA is useful to identify groups when it is not clear which entity belongs to which group, and how many groups may best be used to cluster the entities; thus, CA helps to identify a latent structure within a dataset [13].

CA has been widely used in varied applications including finding a true typology, prediction based on groups, hypothesis generation, data exploration, and data reduction-or grouping similar entities into homogeneous classes, consequently organizing large quantities of information and enabling labels that facilitate communication [1, 4, 5]. Numerous specific examples of the use of CA have been reported in the literature, such as characterizing psychiatric patients on the basis of clusters of symptoms [6]; finding a group of genes that have similar biological functions [7]; or identifying medical patient groups most in need of targeted interventions [4, 5].

Less well investigated is the utility of CA in identifying macro-structures associated with changes in treatment outcomes documented in large healthcare claims databases. A particular challenge for the use of CA in healthcare claims datasets is that the distribution of healthcare expenditure data are commonly severely skewed, which complicates analyses [8, 9]. In spite of this challenge, CA may aid in identifying clusters of patients who experienced similar change in costs of care before and after treatment, and particular interest may lie in focusing attention on consistently high-cost groups or groups for whom healthcare costs dramatically increase after a change in treatment. This study employed CA to the patients with end-stage renal disease (ESRD) who were initiated on hemodialysis (HD) for their healthcare cost change patterns before and after HD and explored the feasibility of application of CA method in highly skewed claims data.

Affecting an estimated 600,000–900,000 patients in the United States, chronic kidney disease (CKD) is a complicated clinical issue increasingly recognized as both a pressing public health concern and a growing worldwide epidemic [1015]. Kidney function progressively declines in a proportion of patients with CKD, particularly without adequate therapy. However, often, even with adequate therapy, CKD eventually progresses to devastating ESRD [16]. Two types of dialysis are widely used: hemodialysis (HD) and peritoneal dialysis (PD). The most common and costly of the two, HD, uses a dialysis machine and a special filter called a dialyzer to clean blood outside of the body [17, 18]. The less commonly type is PD, a procedure in which blood is cleaned inside the body via the introduction of dialysate into the abdominal cavity [18].

Even though HD is the most expensive treatment for patients with ESRD [16, 17] little has been reported beyond the aggregate level on the economic impact of the transition of ESRD patients who had previously not received dialysis to HD [19]. Hence, examining healthcare cost patterns of patients with ESRD who initiated HD and classifying these patients into groups may provide useful information to healthcare decision-makers in relation to the cost burden of HD therapy. The objectives of this analysis were: 1) to apply CA techniques to an evaluation of change in all-cause healthcare costs in patients with ESRD before and after initiating HD; 2) to explore the feasibility of application of this method to administrative claim database with highly skewed cost information; 3) to present clusters that show meaningful patterns of change of costs before and after initiating HD; and 4) to further examine these clusters to identify differences in comorbidities and other variables in the pre- and post-HD period, to see if different clinical or demographic patterns may explain the variations in overall costs across clusters.


Study design and data

This retrospective, cross-sectional, observational study with 2007 to 2011 data was conducted using the Truven Health Analytics’ MarketScan® Commercial Claims and Encounter and Medicare Supplemental Databases [20]. The MarketScan database, one of the most commonly used for health economics outcomes research (HEOR), is one of the largest administrative claim databases that provides healthcare costs and resource utilization in real-world settings. The databases reflect inpatient, outpatient, and outpatient prescription drug information for approximately 53 million employees and their dependents covered under commercial health insurance plans sponsored by more than 300 employers in the United States. This database provides detailed cost (payment) and healthcare utilization information for services performed in both inpatient and outpatient settings, in addition to standard demographic variables (i.e., age, sex, employment status, and geographic location). Medical claims are linked to outpatient prescription drug claims and person-level enrollment data through the use of unique enrollee identifiers [20]. The study did not require informed consent or institutional review board approval because all study data were accessed using techniques compliant with the Health Insurance Portability and Accountability Act of 1996. Thus, no identifiable protected health information was extracted during the course of the study.

Sample selection and patient population

Patients aged ≥18 years were included in the analyses if 1) the patient had at least one confirmed diagnosis of ESRD and 2) initiated at least 2 HD sessions between 2008 and 2010. An “index date” was defined as the first HD claim within that time span. Patients were excluded if they did not have continuous enrollment for the 12 months prior to (the “pre-” HD period) or 12 months following (the “post-” HD period) the index date (pre- and post-HD periods thus may have included data from 2007 or 2011 as relevant based on index date). Patients who had a transplant or underwent PD were not excluded due to sample size and generalizability consideration. Therefore, there could be cases that patients had PD or transplant before index HD or switched to PD or had transplant after their index HD. Diagnoses were based on International Classification of Disease, Ninth Revision, Clinical Modification (ICD-9-CM) codes. Codes considered to indicate ESRD included ICD-9-CM codes 404.02, 404.12, 404.92, 404.03, 404.13, and 404.93 (hypertensive heart and CKD without heart failure and with CKD Stage V or ESRD), as well as ICD-9-CM codes 585.5 (CKD Stage 5/ESRD) and 585.6 (ESRD) (Appendix 1 includes a full set of patient medical codes that qualified a patient for inclusion in this study). Persons receiving HD were identified using Healthcare Common Procedure Coding System, Current Procedural Terminology, and ICD-9 codes, which are listed in Appendix 1 [2123].

Variables for clustering

The variables used for clustering were “all-cause medical costs”, or direct costs for each patient reported in the pre- and post-HD periods. All-cause medical costs included hospitalization, office, and emergency department visit costs for all purposes, including dialysis costs. Healthcare costs included payments from both insurance and out of pocket costs from patients including deductible copays and coinsurances.

Variables for describing clusters

The variables for describing patients in clusters included gender (male or female), geographic region (Northeast, North central, South, or West), insurance type (Health Maintenance Organization [HMO] or Point-of-Service [POS] capitation, Fee-for-Service [FFS]), age (stratified as 18–24, 25–34, 35–44, 45–54, 55–64, and ≥ 65 years), and the comorbidity measures—Charlson Comorbidity Index (CCI), Elixhauser Comorbidity Index (ECI), and the Agency for Healthcare Research and Quality”s (AHRQ) top 10 Clinical Classification Software (CCS) categories. The CCI composite comorbidity score was calculated from medical records as a weighted sum of the presence of 19 documented health conditions including diabetes, peripheral vascular disease, or congestive heart failure. Weighting was accomplished by assigning a value of 1, 2, 3, or 6 to each appropriate comorbidity condition and summing these values-thus, higher values reflect greater comorbidity [2426]. The ECI score was used to measure the burden of comorbid conditions not directly related to HD. ECI distinguishes 30 comorbid conditions identified using ICD-9-CM codes from complications by considering only secondary diagnoses unrelated to the primary diagnosis [27]. The mean ECI score for each cluster was determined; like the CCI, higher scores reflect greater comorbidity burden. The AHRQ CCS for the ICD-9-CM provides a system for classifying ICD-9-CM diagnoses or procedures into a manageable number of clinically meaningful categories. One use of the CCS method is to identify the most frequent types of conditions present in study populations. The single-level diagnosis CCS approach combines illnesses and conditions into 285 mutually exclusive categories [22, 28]. The same individual might receive a flag for as many CCS categories as the recorded diagnoses support. The CCS uses a broad definition for each disease and, unlike Charlson instruments, the CCS is reported to make little distinction regarding disease severity.

Statistical analysis

The goal of these analyses was to cluster patients in terms of all-cause costs in the “pre” period and “post” period. Values for all-cause costs were normalized by subtracting the minimum from each value and dividing that difference by the range of all values. CA was conducted on normalized all-cause costs. Patients with similar cost patterns were “grouped” together into a set of clusters based on their costs in the pre- and post-HD period using different CA methods. Patterns of demographic information and comorbidities within each cluster were reviewed and compared/contrasted across clusters. Two major CA methods, K-means (non-hierarchical) and hierarchical CA with various linkage methods, were applied to normalized costs within the pre- and post-HD periods to identify clusters. PROC FASTCLUS and PROC CLUSTER procedures in SAS, Version 9.3, were used to conduct the cluster analyses. All other analyses were also performed using SAS, Version 9.3 [29, 30].

Several important questions must be addressed when conducting CA [1], including: What measures of similarity should be chosen to compare the entities under consideration? How should clusters be formed? And what is the optimal number of clusters? Similarity between objects is most often assessed by a distance measure, with higher values (i.e., greater distances between cases) representing greater dissimilarity between entities. Various measures are available to express similarity or dissimilarity between pairs of objects. In these analyses, we used Euclidean distance, or straight-line distance between individuals in the database-this is the most commonly used type of similarity measure when analyzing ratio or interval-scaled data [31]. Mathematically, the Euclidean distance between any 2 entities, such as B and C, with regard to 2 variables, x and y, can be expressed by the following formula [31]:

$$ {d}_{Euclidean}\left(B,C\right) = \sqrt{{\left({x}_B-{x}_C\right)}^2 + {\left({y}_B-{y}_C\right)}^2} $$

The values obtained from comparing all entities on both x and y (in this case, pre- and post-HD costs) form a distance matrix capturing the distances between all pairs of entities.

Clusters can be formed using either hierarchical or non-hierarchical methods. Hierarchical CA attempts to identify relatively homogenous groups of cases based on selected characteristics using an algorithm that either agglomerates or divides entities to form clusters [32]. Agglomerative algorithms begin with each entity in a separate cluster; in each subsequent step, the two clusters that are most similar are combined to build an aggregate cluster. This process is repeated until all objects are finally combined into a single cluster. Once formed, clusters cannot be split, and similarity decreases during each step. A variety of “linkage” methods may be chosen to facilitate an agglomerative algorithm and define how similar or dissimilar any two clusters may be, including, single-, complete-, or average-linkage methods, flexible beta method, McQuitty’s method, as well as the centroid method or Ward’s method (Table 1).

Table 1 Common agglomerative algorithms for forming clusters

In a divisive algorithm, analyses start with a single cluster containing all entities, which is then divided at each subsequent step into two additional clusters that contain the most dissimilar objects. Splitting continues until all observations are in a single-member cluster. The end product of either an agglomerative or divisive hierarchical clustering method is the construction of a hierarchy or structure depicting the formation of clusters.

The K-means method is the primary example of non-hierarchical CA. In contrast to hierarchical analyses, non-hierarchical approaches do not involve the construction of groups via iterative division or clustering; instead, they assign objects into clusters once the number of clusters is specified. To accomplish this, starting points (or cluster seeds) for each cluster must be identified, and each observation is assigned to one of the cluster seeds via some process or algorithm. In K-means CA, “k” points are entered into the space represented by the entities being clustered-these points represent initial group centroids [33]. The n observations are then partitioned into k clusters in which each observation belongs to the cluster with the nearest mean. Once all objects have been assigned, the positions of the k centroids are recalculated. These steps are repeated until the centroids no longer move, yielding a separation of the objects into groups from which the metric to be minimized can be calculated. Both hierarchical and K-means CA methods have their strengths and weakness (Table 2), and they are sometimes used in complementary fashion to converge upon an optimal cluster solution.

Table 2 Strengths and weaknesses of hierarchical and K-means CA methods

The process of conducting CA leads to a set of decisions related to the CAs performed: which method is best, and what is a reasonable number of clusters to form? In this regard, there is no right or wrong approach; ultimate consideration is given to developing a model that not only represents the data appropriately, but can be easily interpreted and understood in the context of the entities investigated-thus, successful CA requires experience and perspective to inform the selection of meaningful clusters. In this study, a final model was chosen based the following criteria: 1) In order to have a meaningful number of clusters, it was important not to have too few observations (<10) in the smallest cluster or too many small clusters; 2) As to generate a reasonable clustering pattern, it was essential to have interpretable clustering patterns; and 3) Having a reasonable number of clusters for further analysis. Selecting the number of clusters can be aided by maximizing key statistical elements of the CA: larger values of the Pseudo-F Statistic (PsF) [34] and the Cubic Clustering Criterion (CCC) [35] suggest better model fit in terms of number of clusters [29, 30, 36].



After applying the entry criteria for this study and from 140,720 individuals, a total of 18,380 individuals were identified in the MarketScan Database (Fig. 1). The average age was 63.2 years (standard deviation [SD] = 14.1); 46 % were aged ≥65 years, and 29 % were aged 55 to 64 years. Of the total individuals, 58 % were males, 84 % had FFS insurance plans, and 14 % had HMO or POS capitation plans. At baseline, average ECI scores were 5.8 (SD = 2.6) in the full sample and CCI scores were 4.6 (SD = 2.3); at follow-up, ECI scores had increased to 7.1 (SD = 3.0) while CCI scores had increased to 5.3 (SD = 2.4).

Fig. 1
figure 1

Patient selection diagram. Abbreviations: ESRD, end-stage renal disease; HD, hemodialysis

Overall costs, pre- and post-HD periods

Medical costs for all patients during the pre- and post-HD periods are summarized in Table 3. We defined annual medical costs ≤ $50,000 as “average”, $50,001 to ≤ $500,000 as “high”, and > $500,000 as “very high”.

Table 3 All-cause medical costs in the 12-month baseline and follow-up periods

Clustering techniques

Hierarchical CA with the average, centroid, single-linkage, complete-linkage, and McQuitty’s similarity methods led to cluster solutions that included clusters with unreasonable sample sizes (i.e., prone to the creation of very small clusters with <10 observations; Table 4). Both K-means CA and hierarchical CA with either the flexible-beta method or Ward’s method yielded reasonable solutions. However, the K-means solutions were more meaningful and more easily interpreted, particularly for cluster number <5, circumstances in which both Ward’s method and the flexible-beta method generated at least one cluster with large variation, which is not helpful in practice (Appendix 2, Appendix 3, and Appendix 4, respectively).

Table 4 Summary of results from clustering analysis methods applied

Upon inspection, the best K-means solution included 4 clusters (Fig. 2). More formal criteria associated with each of the K-means solutions suggested 4 clusters yielded maximum separation between clusters (4-cluster solution: PsF = 13,979.98; CCC = −63.928 compared with PsF = 10,502.25 and CCC = −99.702 for a 3-cluster solution, and PsF = 13,109.62 and CCC = −70.634 for a 5-cluster solution). Empirically, the 4-cluster solution was judged to be more appropriate and more easily interpretable than either the 3- or 5-cluster solution. Thus, a 4-cluster K-means solution was chosen for further investigation (Fig. 3). The 4 clusters in this model included a cluster with average costs pre-HD and high costs post-HD (Cluster 1: Average to High); a cluster (the smallest) with very high costs in the pre-HD period and high costs in the post-HD period, along with a substantial decrease in average cost from pre- to post-HD (Cluster 2: Very High to High); a group (the largest) exhibiting average costs in both the pre- and post-HD periods, with a small decrease in average costs from baseline to follow-up (Cluster 3: Average to Average); and finally, the second largest group, exhibiting “high” costs in both the pre- and post-HD periods, along with relatively sizeable cost increases from baseline to follow-up (Cluster 4: Increasing Costs, High at Both Points). Figure 3 and its corresponding table summarize the cost changes in the 12-month pre- and post-HD periods, respectively. Cluster 1 (Average to High) reveals median costs that increased from $185,070 to $884,605; Cluster 2 (Very High to High) shows that the median costs decreased from $910,930 to $157,997; Cluster 3 (Average to Average) reports that the median costs were relatively stable and remained low from $15,168 to $13,026, and Cluster 4 (Increasing Costs, High at Both Points) reveals that the median costs increased from $57,909 to $193,140.

Fig. 2
figure 2

Scatter plot by cluster of all-cause medical costs in pre- and post-HD periods by K-means CA with four cluster solutionsa. Footnote: aPseudo F Statistics = 13,979.98; Approximate Expected Over-All R 2 = 0.79; Cubic Clustering Criterion = −63.93. Each cluster is labeled by corresponding number

Fig. 3
figure 3

All-cause medical costs in pre- and post-HD periods by clustera

Basic demographic information and clinical characteristics of the sample divided into the four clusters suggested by K-means analysis are summarized in Table 5; the top 10 CSS disease categories in the baseline and follow-up period for each cluster are reflected in Appendix 5. Patients in Cluster 3 (Average to Average) (i.e., those with stable average costs before and after initiating HD) tended to be older, with an average age of 63.9 years compared with an average age of 55.5 through 57.6 years in the other three clusters. Otherwise, there was little to no meaningful difference across each cluster in terms of gender, living region, or health insurance type (Table 5). Economically, Clusters 1 (Average to High) and 4 (Increasing Costs, High at Both Points) were both associated with increasing costs from pre- to post-HD. Clinically, substantial increases in comorbidity scores, including both the ECI and the CCI, were observed from baseline to the follow-up period in both these groups. In contrast, Cluster 2 (Very High to High) experienced a reduction in costs after starting HD, from very high to high costs, and both ECI and CCI scores were relatively stable after initiating HD. In addition, relatively stable ECI and CCI scores were reported in Cluster 3, where stable average costs before and after HD were identified. Cluster 3 (Average to Average) exhibited notably low comorbidity scores during the post-HD period when compared with the three other clusters (Table 5).

Table 5 Demographic and clinical characteristics of patients grouped into 4 proposed clusters using K-means CA


In this retrospective observational analysis of claims data from commercially insured ESRD patients initiating HD, CA successfully revealed a latent structure underlying all-cause cost data before and after the start of HD. Several clustering techniques were applied, including both K-means CA and a set of hierarchical clustering analyses with multiple agglomerative algorithms that included average, centroid, single- and complete-linkage methods; McQuitty’s similarity method; and both the flexible-beta and Ward’s methods. Models generated by both K-means and hierarchical cluster CA with flexible beta and Ward’s methods produced clusters of reasonable sample size. K-means CA yielded the most informative categorization of patients generating more reasonable clusters from a practical perspective than did the other statistical methods. In addition, the K-means solutions were the most easily interpreted. In contrast, Ward’s and the flexible-beta methods led to solutions with at least one cluster with large variability (or spread), which can be difficult to interpret. Among the models suggested by K-means CA, a 4-cluster solution appeared to be the most appropriate for these data: associated criteria suggested a 4-cluster solution offers maximum separation of clusters compared with either a 3- or 5-cluster solution. In addition, a 4-cluster solution was more interpretable, and thus more appropriate to apply than other methods.

Mean all-cause medical costs in this sample of privately insured patients ranged from approximately $45,000 (USD) prior to the initiation of HD to $49,000 (USD) after; median costs ranged from $17,000 in the 12 months before HD initiation to $16,000 in the 12 months following HD initiation. Interestingly, these reported costs are generally lower than those found in other analyses in other populations. In 2004, the average annual Medicare expenditure for an ESRD patient started on HD was reported to be $72,000 (USD) [37], increasing to $77,500 (USD) in 2012 [11]. Other estimates suggest annual all-cause costs for HD patients to be as high as $174,000 (USD) in a privately insured population [17]. It is worth noting that the current results reflect payment from insurance claims made in the “real-world setting”. Importantly, a switch to HD from no dialysis in the present data set was only associated with a modest increase in average and median annual costs for ESRD patients on the whole, suggesting that the transition to HD does not generally add substantial costs to average annual care for a patient and may be associated with quite similar costs for the majority of late-stage patients with renal disease in comparison to their cost of care immediately before initiating HD. It is interesting to note that in both the pre- and post-HD assessment periods, 75 % of patients had costs below the average of $45,000 and $49,000 (USD), respectively-thus, it appears as if a relatively small fraction of patients are driving up the overall increase in costs after initiating HD, a contention supported by CA.

More specifically, CA demonstrated that the data could be reasonably represented by 4 clusters of patients: those with average costs before and after initiating HD (90 % of the full sample); those with high costs before and high/increased costs after (8 %); those with average costs who incur high costs after initiating HD (0.6 %); and a cluster with very high costs prior to initiating HD who see their annual costs reduced to a high level (0.5 %). Thus, overall costs stay stable for most ESRD patients initiating HD, suggesting transition to HD per se is not an important driver of cost for the majority of patients. A minority of patients drive an increase in overall costs after HD initiation.

Because of the different cost patterns in each group, it is worthwhile to better understand patients in each cluster to help predict and contain the costs of HD. Comorbidities seem to be particularly relevant to costs, with increasing comorbidity scores from baseline to follow-up periods in those clusters associated with an increase in costs during follow-up, and more stable comorbidity scores associated with more stable costs (or even declining costs). This is consistent with other research: one study demonstrated that an increased level of comorbidity was associated with higher cost in the 2 years prior to starting HD [13], while another demonstrated a clear relationship between CCI scores and costs [38]. These data suggest timely management of comorbidities or the prevention of comorbidities may be critical for containing costs in patients starting HD. Interestingly, the older age of the patients in the most stable cost cluster (i.e., Cluster 3) suggests that there may be a difference in expression of ESRD in these patients compared with the other clusters, perhaps a factor that manifests itself as both a later-in-life need for HD as well as better overall health (e.g., fewer comorbidities).

In aggregate, costs are high at an absolute level, both before and after the initiation of HD, suggesting that the healthcare costs of the majority of ESRD patients not treated with HD are not substantially lower than the costs of care for these patients immediately after starting HD. Thus, HD does not add substantial costs for most patients and seems like an economically feasible option in most patients with CKD, given the overall high cost of care for these patients prior to initiating HD. True cost containment for patients with ESRD likely requires more aggressive or widespread intervention before patients reach this advanced stage of disease, where costs are high before and after HD. One overall strategy that may reduce costs includes early referral to a nephrologist in the period before starting HD [16]. HD is not an important cost driver for the majority of patients, so limiting HD may not contain costs for these patients. There is a need to better understand the fraction of the population that is driving higher post-HD costs, and consider ways to mitigate the costs associated with their transition to HD.


Interpretation of these results must be informed by limitations of these analyses. First, these analyses were conducted only in those employed individuals with commercial insurance coverage and some individuals with Medicare coverage; thus, these results from a relatively healthy population may not be fully generalized to individuals with Medicare, Medicaid, other insurance, and no insurance. Second, administrative claims data cannot capture deaths and changes of employment; therefore, the cost not captured due to loss to follow-up may lead to selection bias. In addition, administrative claims data are not collected for research purposes and measurement error may have been introduced by coding that was in error or driven by reimbursement needs more so than research needs. Further, administrative claims data does not collect clinical information that would have been valuable additions to these analyses, such as laboratory test results or vital signs. Access to patients’ claims prior to their enrollment in MarketScan databases is not available. Retrospective analysis limits the study to those who are clinically diagnosed and incur health care resource utilization through claims; resource utilization not identified by claims would not be included in these analyses. Finally, treatment costs in future studies should examine what cost drivers may have influenced increases or decreases in costs for each cluster.


CA was a useful statistical technique for evaluating a claims data set that included skewed healthcare cost data. One implication of these analyses is that costs for most patients with ESRD stay relatively stable after starting HD; a minority of patients drive overall increasing annual costs after initiation of dialysis. These increasing costs may be driven, in part, by a greater comorbidity burden among these patients.


  1. Dilts D, Khamalah J, Plotkin A. Using cluster analysis for medical resource decision making. Med Decis Making. 1995;15(4):333–47.

    Article  CAS  PubMed  Google Scholar 

  2. McLachlan GJ. Cluster analysis and related techniques in medical research. Stat Methods Med Res. 1992;1(1):27–48.

    Article  CAS  PubMed  Google Scholar 

  3. Romesburg HC. Cluster analysis for researchers. Belmont: Lifetime Learning Publications; 1984.

    Google Scholar 

  4. Clatworthy J, Buick D, Hankins M, Weinman J, Horne R. The use and reporting of cluster analysis in health psychology: a review. Br J Health Psychol. 2005;10(Pt 3):329–58.

    Article  PubMed  Google Scholar 

  5. Weir MR, Maibach EW, Bakris GL, Black HR, Chawla P, Messerli FH, Neutel JM, Weber MA. Implications of a health lifestyle and medication analysis for improving hypertension control. Arch nter Med. 2000;160:481–90.

    Article  CAS  Google Scholar 

  6. Blashfield R. The classification of psychopathology: Neo-Kraepelinian and quantitative approaches, Softcover reprint of the original. 1st ed. New York: Springer; 1984. p. 328.

    Book  Google Scholar 

  7. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95(25):14863–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for analyzing health care utilization and costs. Annu Rev Public Health. 1999;20:125–44.

    Article  CAS  PubMed  Google Scholar 

  9. Griswold M, Parmigiani G, Potosky A, Lipscomb J. Analyzing health care costs: a comparison of statistical methods motivated by Medicare colorectal cancer charges. Biostatistics. 2004;1(1):1–23.

    Google Scholar 

  10. Rossert JA, Wauters JP. Recommendations for the screening and management of patients with chronic kidney disease. Nephrol Dial Transplant. 2002;17 Suppl 1:19–28.

    Article  PubMed  Google Scholar 

  11. Levey AS, Coresh J. Chronic kidney disease. Lancet. 2012;379(9811):165–80.

    Article  PubMed  Google Scholar 

  12. Stevens PE, Farmer CK, Hallan SI. The primary care physician: nephrology interface for the identification and treatment of chronic kidney disease. J Nephrol. 2010;23(1):23–32.

    PubMed  Google Scholar 

  13. St Peter WL, Wazny LD, Patel UD. New models of chronic kidney disease care including pharmacists: improving medication reconciliation and medication management. Curr Opin Nephrol Hypertens. 2013;22(6):656–62.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Hall ME, do Carmo JM, da Silva AA, Juncos LA, Wang Z, Hall JE. Obesity, hypertension, and chronic kidney disease. Int J Nephrol Renovasc Dis. 2014;7:75–88.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Andersen MJ, Friedman AN. The coming fiscal crisis: nephrology in the line of fire. Clin J Am Soc Nephrol. 2013;8(7):1252–7.

    Article  PubMed  Google Scholar 

  16. Lee J, Lee JP, Park JI, Hwang JH, Jang HM, Choi JY, Kim YL, Yang CW, Kang SW, Kim NH et al. Early nephrology referral reduces the economic costs among patients who start renal replacement therapy: a prospective cohort study in Korea. PLoS One. 2014;9(6):e99460.

  17. Berger A, Edelsberg J, Inglese GW, Bhattacharyya SK, Oster G. Cost comparison of peritoneal dialysis versus hemodialysis in end-stage renal disease. Am J Manag Care. 2009;15(8):509–18.

    PubMed  Google Scholar 

  18. Dialysis []. Accessed 2 September 2015.

  19. United States Renal Data System. 2014 USRDS annual data report: Epidemiology of kidney disease in the United States. Bethesda: National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases; 2014.

    Google Scholar 

  20. Truven Health Analytics [homepage on the Internet]. []. Accessed 2 September 2015.

  21. HCPCS-General Information []. Accessed 2 September 2015.

  22. ICD-9 Codes []. Accessed 2 September 2015.

  23. CPT-Current Procedural Terminology []. Accessed 2 September 2015.

  24. Charlson M, Szatrowski TP, Peterson J, Gold J. Validation of a combined comorbidity index. J Clin Epidemiol. 1994;47(11):1245–51.

    Article  CAS  PubMed  Google Scholar 

  25. Olomu AB, Corser WD, Stommel M, Xie Y, Holmes-Rovner M. Do self-report and medical record comorbidity data predict longitudinal functional capacity and quality of life health outcomes similarly? BMC Health Serv Res. 2012;12:398.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–83.

    Article  CAS  PubMed  Google Scholar 

  27. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998;36(1):8–27.

    Article  CAS  PubMed  Google Scholar 

  28. HCUP - Databases and Product Releases []. Accessed 2 September 2015.

  29. SAS/STAT 9.3 User’s Guide, SAS Institute Inc []. Accessed 2 September 2015.

  30. Methodological Approach To Performing Cluster Analysis With SAS, SESUG Proceedings. []. Accessed 2 September 2015.

  31. Afifi A, May S, Clark VA. Practical multivariate analysis. 5th ed. Boca Raton: CRC Press; 2012.

    Google Scholar 

  32. Everitt BS. Cluster analysis of subjects, hierachial methods. Hoboken, New Jersey, US: John Wiley & Sons, Ltd; 2005.

  33. MacQueen JB. Some methods for classification and analysis of multivariate observations, 2. Proc Fifth Berkeley Sym Mathematical Stat Prob. 1967;1:281–97.

    Google Scholar 

  34. Sarle WS. The cubic clustering criterion, SAS technical report A-108. Cary: SAS Institute; 1983.

    Google Scholar 

  35. Calinski RB, Harabasz J. A dendrite method for cluster analysis. Comm Stat. 1974;3:1–27.

    Article  Google Scholar 

  36. Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50:159–79.

    Article  Google Scholar 

  37. Shih YC, Guo A, Just PM, Mujais S. Impact of initial dialysis modality and modality switches on Medicare expenditures of end-stage renal disease patients. Kidney Int. 2005;68(1):319–29.

    Article  PubMed  Google Scholar 

  38. Beddhu S, Bruns FJ, Saul M, Seddon P, Zeidel ML. A simple comorbidity scale predicts clinical outcomes and costs in dialysis patients. Am J Med. 2000;108(8):609–13.

    Article  CAS  PubMed  Google Scholar 

  39. Sokal RR, Michener CD. A statistical method fro evaluating systematic relationships. Univ Kansas Sci Bull. 1958;38:1409–38.

    Google Scholar 

  40. Florek K, Lukaszewicz J, Perkal J, Zubrzycki S. Taksonomia wroclawska. Przeglad Antropol. 1951;17:193–211.

    Google Scholar 

  41. Sneath PH. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1):201–26.

    Article  CAS  PubMed  Google Scholar 

  42. McQuitty LL. Elementary linkage analysis for isolating orthogonal and oblique types and typical relevancies. Educ Psychol Meas. 1957;17:207–29.

    Article  Google Scholar 

  43. Sorensen TA. Method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish Commons. Biologiske Skrifter. 1948;5:1–34.

    Google Scholar 

  44. Lance GN, Williams WT. A general theory of classificatory sorting strategies 1. Hierarchical system. Comp J. 1967;9(4):373–80.

    Article  Google Scholar 

  45. A Study of the Beta-Flexible Clustering Method, Technical Report 87–61 [].

  46. McQuitty LL. Similarity analysis by reciprocal pairs for discrete and continuous data. Educ Psychol Meas. 1966;26:825–31.

    Article  Google Scholar 

  47. Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58.

Download references


The authors acknowledge individuals who contributed and provided assistance during the development of this manuscript. Steve Candela, PhD, and Michelle A. Adams, BSJ, MA, are Write All, Inc. consultants who provided medical writing and editorial assistance for this manuscript.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Yunfeng Li.

Additional information

Competing interests

ML is an Analyst at KMK Consulting Inc. and works as a consultant for Novartis Pharmaceuticals Corporation. YL, FK, and SA are employees of Novartis Pharmaceuticals Corporation. EO is an Outcomes Research Fellow at Novartis Pharmaceuticals Corporation and a Post-Doctoral Research Associate at the Institute for Health Outcomes, Policy, and Economics, Rutgers University, Piscataway, NJ. ML, YL, FK, SA, and EO have made substantial contributions to conception and design, or acquisition of data, or analysis and interpretation of data; have been involved in drafting the manuscript or revising it critically for important intellectual content; have given final approval of the version to be published; and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. Funding for this project was provided by Novartis Pharmaceuticals Corporation, East Hanover, NJ, USA. Publication of the study results was not contingent upon sponsor’s approval.

Authors’ contributions

All listed authors met the criteria for authorship set for by the International Committee for Medical Journal Editors (ICMJE). ML, YL, FK, EO, and SA participated in study’s conception and design; ML and YL handled the database, collected and analyzed the data. ML, YL, FK, EO, and SA drafted the article, and interpreted the data. ML, YL, FK, EO, and SA revised it critically for important intellectual content and gave final approval. All authors read and approved the final manuscript.


Appendix 1

Table 6 Medical codes indicating HD

Appendix 2

Fig. 4
figure 4

Scatter plots of all-cause medical costs in pre- and post-HD periods by K-means CA with 3- and 5-cluster solutionsa. aEach cluster is labeled by corresponding number

Appendix 3

Fig. 5
figure 5

Scatter plots of all-cause medical costs in pre- and post-HD periods by Hierarchical CA with Ward Method and 3-, 4-, and 5-cluster solutions. aEach cluster is labeled by corresponding number

Appendix 4

Fig. 6
figure 6

Scatter plots of all-cause medical costs in pre- and post-HD periods s by Hierarchical CA with Flexible-Beta Method and 3-, 4-, and 5-cluster solutionsa. aEach cluster is labeled by corresponding number

Appendix 5

Table 7 Top 10 CSS disease categories in the pre- and post-HD periods

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liao, M., Li, Y., Kianifard, F. et al. Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrol 17, 25 (2016).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: