Latent Variable Model
How it works
Step 1 — Measure what happened. For every diagnosis and procedure code (ICD-10-CM and ICD-10-PCS) that appears within an SSP, we look at all the claims where that code was billed. We then calculate 14 signals of resource use, including ICU admission rate, average length of stay, rate of mechanical ventilation, and total charges. Codes with fewer than 100 supporting claims are excluded to keep the statistics reliable.
Step 2 — Estimate intensity. We use principal component analysis (PCA), the latent variable model used throughout this page, to distill all 14 signals into a single intensity score per code. A higher score means the encounters that carry that code were, on average, more resource-intensive.
Step 3 — Cut into tiers. Codes are grouped into tiers based on their intensity score. The number of tiers is chosen automatically to maximise how cleanly separated the groups are. Tier 1 is the least intensive; the highest tier is the most.
Step 4 — Validate. For diagnosis codes, we cross-check our tiers against the CMS severity labels (Major Complication, Complication, No Complication). Well-calibrated tiers should line up: tier 1 codes should be predominantly "No Complication" and the top tier predominantly "Major Complication."
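The validation idea in Step 4 can be sketched as a simple cross-tabulation of tiers against severity labels. This is a minimal illustration and not the pipeline's code: the function name `severity_mix_by_tier` and the toy tier and label lists are hypothetical.

```python
from collections import Counter, defaultdict

def severity_mix_by_tier(tiers, labels):
    """Return {tier: {severity_label: share}} for a set of diagnosis codes."""
    counts = defaultdict(Counter)
    for tier, label in zip(tiers, labels):
        counts[tier][label] += 1
    return {
        tier: {label: n / sum(c.values()) for label, n in c.items()}
        for tier, c in sorted(counts.items())
    }

# Hypothetical toy data: one entry per diagnosis code.
tiers = [1, 1, 1, 1, 2, 2, 3, 3, 3]
labels = (
    ["No Complication"] * 3 + ["Complication"]          # tier 1
    + ["Complication"] * 2                               # tier 2
    + ["Major Complication"] * 2 + ["Complication"]      # tier 3
)

mix = severity_mix_by_tier(tiers, labels)
for tier, shares in mix.items():
    print(tier, shares)
```

In a well-calibrated result, the dominant label in tier 1 would be "No Complication" and the dominant label in the top tier would be "Major Complication", as in this toy input.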
Latent Intensity Score
Note: on this page, "PCA" and "Latent Variable Model" refer to the same method. PCA is the specific latent variable model used here.
After feature engineering produces 14 numeric signals per SSP × code, PCA compresses them into a single intensity score. The score is the coordinate of each code along the first principal component, the direction in 14-dimensional feature space that explains the most variance.
The 14 features are highly correlated: an encounter with a long LOS tends to also have ICU admission, higher charges, and more organ systems involved. Most of their shared information lives on a single latent axis. PCA finds that axis without any hand-tuned weights, and the first PC empirically aligns with overall resource intensity (LOS, ICU, ventilation, charges). A single score is then straightforward to threshold into tiers.
Step-by-step
2a — Standardize
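Each feature is z-scored to zero mean and unit variance:
$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$
where $x_{ij}$ is the value of feature $j$ for code $i$, and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$.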
This is necessary because features differ widely in scale — avg_length_of_stay is in
days while rate_icu is a proportion — and PCA is sensitive to scale.
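A minimal NumPy sketch of the standardization step follows. The function name `standardize`, the toy matrix, the use of the population standard deviation, and the guard for constant columns are all assumptions made for illustration.

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score each column of X to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)                      # population std (ddof=0), an assumption
    sigma = np.where(sigma == 0, 1.0, sigma)   # avoid dividing by zero on constant columns
    return (X - mu) / sigma

# Hypothetical toy data: columns are avg_length_of_stay (days) and rate_icu (proportion).
X = np.array([
    [2.0, 0.05],
    [5.0, 0.20],
    [9.0, 0.60],
    [4.0, 0.15],
])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

After this step both columns are on the same scale, so the feature measured in days no longer dominates the variance simply because its raw numbers are larger.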
2b — PCA
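PCA is fit on the full standardized matrix. The first principal component solves:
$$w = \arg\max_{\lVert v \rVert = 1} \; v^\top \Sigma \, v$$
where $\Sigma$ is the empirical covariance matrix of the standardized features and $w$ is the vector of PC1 loadings.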
Sign orientation. PCA does not guarantee the sign of the eigenvector. After fitting,
if PC1 is negatively correlated with avg_length_of_stay, scores and loadings are
multiplied by $-1$ so that higher intensity score = higher resource use.
Quality check. The proportion of variance explained by PC1 is reported in
pc1_var. A value above ~40% indicates that the features share a strong common
factor and a single score is a reasonable summary. The scree plot is included in
each per-SSP report.
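The fit, sign orientation, and quality check described in this step can be sketched with a plain eigendecomposition. This is an illustrative NumPy version, not necessarily the library call the pipeline uses. The function name `fit_pc1`, the `anchor_col` parameter (standing in for the avg_length_of_stay column), and the simulated data are assumptions.

```python
import numpy as np

def fit_pc1(Z: np.ndarray, anchor_col: int):
    """Return (scores, loadings, pc1_var) for the first principal component.

    Z must already be column-standardized. anchor_col is the column used
    for sign orientation.
    """
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    w = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    pc1_var = eigvals[-1] / eigvals.sum()    # proportion of variance explained by PC1
    scores = Z @ w
    # Sign orientation: flip so scores correlate positively with the anchor feature.
    if np.corrcoef(scores, Z[:, anchor_col])[0, 1] < 0:
        w, scores = -w, -scores
    return scores, w, pc1_var

# Simulated data: three noisy copies of one latent factor (hypothetical).
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([latent + 0.3 * rng.normal(size=200) for _ in range(3)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

scores, w, pc1_var = fit_pc1(Z, anchor_col=0)
print(w, round(float(pc1_var), 3))
```

Because the simulated columns share one strong common factor, PC1 explains most of the variance here, which is the situation the ~40% check is meant to detect.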
2c — Key drivers
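Each intensity score decomposes exactly into per-feature contributions:
$$s_i = \sum_{j} c_{ij}, \qquad c_{ij} = w_j \, z_{ij}$$
where $w_j$ is the PC1 loading of feature $j$ and $z_{ij}$ is the standardized value of feature $j$ for code $i$. The top 3 features by $|c_{ij}|$ are stored in key_drivers as a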
human-readable string, e.g.:
"icu (+0.45), length of stay (+0.38), dialysis (+0.21)"
Positive values pushed the score up (more intensive); negative values pulled it down.
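A small sketch of how such a string could be built from loadings and standardized values is shown below. The function name, the feature labels, and the numeric inputs are hypothetical, and ranking by absolute contribution is an assumption about the pipeline's behavior.

```python
import numpy as np

def key_drivers(z_row, w, names, top_k=3):
    """Format the top_k per-feature contributions w_j * z_j as a readable string."""
    contrib = np.asarray(w) * np.asarray(z_row)
    order = np.argsort(-np.abs(contrib))[:top_k]   # largest absolute contribution first
    return ", ".join(f"{names[j]} ({contrib[j]:+.2f})" for j in order)

# Hypothetical loadings and standardized feature values for one code.
names = ["icu", "length of stay", "dialysis", "ventilation"]
w = np.array([0.5, 0.5, 0.5, 0.5])
z_row = np.array([0.90, 0.76, 0.42, -0.10])

drivers = key_drivers(z_row, w, names)
score = float(z_row @ w)   # equals the sum of all per-feature contributions
print(drivers)
```

The contributions sum to the code's intensity score, so the string shows which features account for most of it.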
Interactive demo
The plot below illustrates the core idea using three representative RII features:
rate_icu, avg_length_of_stay, and avg_organ_system_count. In the actual pipeline
all 14 features are used.
- 3-D panel — standardized feature space. The purple arrow is the PC1 axis; dashed lines drop each code perpendicularly onto it.
- 1-D panel — the same codes projected onto PC1. Separation between the three simulated tiers confirms that PC1 recovers the latent intensity gradient.
All three features collapse onto a single latent intensity axis. Each dot is the same code as in the 3-D plot, now positioned only by its PC1 score.
| Feature | Loading (w) |
|---|---|
| rate_icu | +0.4140 |
| avg_length_of_stay | +0.6489 |
| avg_organ_system_count | +0.6384 |
All loadings are positive, consistent with PC1 acting as a single intensity axis. The real pipeline uses all 14 RII features; these three are shown for illustration.
Interpreting loadings
The loading $w_j$ for feature $j$ shows how much a one-standard-deviation increase in that feature contributes to the intensity score. In well-behaved SSPs the dominant loadings are clinical intensity signals (LOS, ICU rate, mechanical ventilation, total charges).
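As a quick numeric illustration, the snippet below uses the three demo loadings from the table above and checks that raising one standardized feature by one unit shifts the score by exactly that feature's loading. The baseline vector is a hypothetical "average" code with all standardized features at zero.

```python
import numpy as np

names = ["rate_icu", "avg_length_of_stay", "avg_organ_system_count"]
w = np.array([0.4140, 0.6489, 0.6384])   # demo loadings from the table

z_base = np.zeros(3)                     # hypothetical code at the mean of every feature
z_bumped = z_base.copy()
z_bumped[1] += 1.0                       # avg_length_of_stay one standard deviation higher

delta = float(z_bumped @ w - z_base @ w)
norm = float(np.linalg.norm(w))
print(delta, norm)
```

The loading vector has approximately unit length, as expected for a principal component direction, so the loadings can be compared directly with one another.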
Output
The intensity score for each code is stored in rii_code_tiers.intensity_score.
Tier Assignment then applies 1-D KMeans to partition codes into
discrete tiers on this scalar axis.
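The sketch below shows one way 1-D KMeans tiering could work with scikit-learn. The silhouette score is used here as the separation criterion for choosing the number of tiers, which is an assumption since this page does not name the metric. The function name `assign_tiers`, the candidate range of k, and the simulated scores are also hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def assign_tiers(scores, k_range=range(2, 7), seed=0):
    """Cluster 1-D scores with KMeans and pick k by silhouette score.

    Returns (tiers, k), where tier 1 has the lowest cluster center.
    """
    x = np.asarray(scores, dtype=float).reshape(-1, 1)
    best_k, best_sil, best_model = None, -1.0, None
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x)
        sil = silhouette_score(x, model.labels_)
        if sil > best_sil:
            best_k, best_sil, best_model = k, sil, model
    # Relabel clusters so tier numbers increase with the cluster center.
    order = np.argsort(best_model.cluster_centers_.ravel())
    rank = np.empty_like(order)
    rank[order] = np.arange(1, best_k + 1)
    return rank[best_model.labels_], best_k

# Simulated intensity scores with three well-separated groups (hypothetical).
rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(-3.0, 0.2, 30),
    rng.normal(0.0, 0.2, 30),
    rng.normal(3.0, 0.2, 30),
])
tiers, k = assign_tiers(scores)
print(k, np.bincount(tiers)[1:])
```

KMeans cluster labels are arbitrary, so the relabeling step is what guarantees that tier 1 is the least intensive group and the highest tier the most intensive.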