Physician Groups DAG
DAG ID: spines_physician_groups
Schedule: Manual trigger only
Tags: turquoise-health, data-science, physician-groups, graph-db, provider-network
Source: airflow_dags/dags/data_science/spines/physician_groups
Overview
The spines_physician_groups DAG builds and refreshes the Physician Groups spines tables. It uses PECOS, Definitive, Komodo, and MRF data to construct a graph database of provider relationships, then derives health system affiliations and produces 6 output spines tables.
Pipeline Phases
The DAG runs sequentially through 6 phases. Most of the pipeline is straightforward SQL — the only complex part is the NetworkX graph analysis in Phase 3, which feeds one spines table (associated_health_systems). The other 5 spines tables are built directly from bridge/similarity tables and can be understood independently.
start_dag
→ Phase 1: Data Preparation ┐
→ Phase 2: HCP Similarity Analysis ├─ SQL-only
→ Phase 3: Analysis & Graph Building │ (NetworkX graph - see below)
→ Phase 4: Spines Table Creation │
→ Phase 5: QA Tests │
→ Phase 6: Promotion ┘
end_dag
Spines Pipeline (SQL-only)
These phases cover the bulk of the DAG. If you only need to understand how the spines tables are built, this section is self-contained — 5 of the 6 output tables are produced here with no dependency on the NetworkX graph.
Phase 1: Data Preparation
Prepares PECOS data and creates bridge tables that map relationships between entity types.
| Task | Description | Output Table |
|---|---|---|
create_clean_provider_types | Joins PECOS data with manual provider type mappings | pecos_data_clean_provider_types_{version} |
create_physician_groups | Loads PECOS/Definitive PGs, filters to PACs with >5 reassignments. Excludes optometry, therapy & rehab, behavioral health, nurse/PA | physician_groups_{version} |
create_tin_hcp_bridge | TIN ↔ HCP NPI bridge from MRF compressed_providers | tin_hcp_bridge_{version} |
create_pac_hcp_bridge | PAC ↔ HCP NPI bridge from PECOS reassignments | pac_hcp_bridge_{version} |
create_hco_hcp_bridge | HCO NPI ↔ HCP NPI bridge from Komodo claims | hco_hcp_bridge_{version} |
create_hs_hco_bridge | Health System ↔ HCO NPI bridge from spines tables | hs_hco_bridge_{version} |
Phase 2: HCP Similarity Analysis
Runs 4 tasks in parallel, each computing Jaccard similarity, containment scores, and intersection counts between pairs of organizations based on shared HCP NPIs.
| Task | Edge Type | Output Table |
|---|---|---|
create_pac_hco_similarity | PAC ↔ HCO | pac_hco_similarity_{version} |
create_pac_tin_similarity | PAC ↔ TIN | pac_tin_similarity_{version} |
create_pac_pac_similarity | PAC ↔ PAC | pac_pac_similarity_{version} |
create_hco_tin_similarity | HCO ↔ TIN | hco_tin_similarity_{version} |
Edge filtering thresholds:
jaccard_threshold: 0.3
min_containment_threshold: 0.5
intersection_count: 50
An edge is kept if the intersection is ≥50 shared HCPs (or ≥25% of reassignments) AND either Jaccard > 0.3 or at least one containment > 0.5.
Phase 4: Spines Table Creation
Creates the 6 final output tables. 5 of 6 are pure SQL built from the bridge and similarity tables above:
| Task | Output Table | Inputs |
|---|---|---|
create_spines_provider_physician_groups | spines_provider_physician_groups | physician_groups, PECOS, Definitive, NUCC taxonomy |
create_spines_associated_tins | spines_provider_physician_groups_associated_tins | pac_tin_similarity, physician_groups |
create_spines_associated_facilities | spines_provider_physician_groups_associated_facilities | pac_hco_similarity, physician_groups, hospital spines |
create_spines_associated_physicians | spines_provider_physician_groups_physicians | pac_hcp_bridge, NPPES, NUCC taxonomy |
create_spines_associated_npis | spines_provider_physician_groups_additional_npis | Definitive PG tables, NPPES, spines provider |
The remaining table depends on the NetworkX graph output:
| Task | Output Table | Inputs |
|---|---|---|
create_spines_associated_health_systems | spines_provider_physician_groups_associated_health_systems | pg_hs_affiliations (from Phase 3 — see NetworkX Graph Analysis below) |
Phase 5: QA Tests
Runs 16 tests in parallel across 7 QA sub-groups:
- spines_provider_physician_groups_qa — uniqueness on
provider_id+ row count checks - associated_tins_qa — uniqueness on
(provider_id, tin), row count, containment/Jaccard/intersection rule validation - associated_health_systems_qa — uniqueness on
(provider_id, hs_id), distinct count checks - spines_physicians_qa — uniqueness on
(provider_id, hcp_npi), row count checks - additional_npis_qa — uniqueness on
(provider_id, npi) - associated_facilities_qa — uniqueness on
(provider_id, facility_npi) - general_qa — cross-table count validations
Phase 6: Promotion
Promotes all 6 spines tables from the dev schema to the production schema using CTAS statements.
NetworkX Graph Analysis (Phase 3)
This phase builds a NetworkX graph from the similarity tables (Phase 2) and traverses it to determine which health systems each physician group is affiliated with. Its sole downstream consumer is the associated_health_systems spines table.
| Task | Description | Output |
|---|---|---|
create_pac_connections | Aggregates similarity tables; creates arrays of HCO neighbors, TIN neighbors, and union of HCPs per PAC | pac_connections_{version} |
create_networkx_graph | Builds undirected graph with PAC, HCO, TIN, HS nodes and all edges. Saves to S3 as pickle | S3: provider_network_graph_{version}.pkl |
create_pac_to_hs_associations | Traverses graph for PAC→HCO→HS and PAC→TIN→HCO→HS paths. Classifies as "Single Affiliated" or "Multiple Affiliated" | pac_to_hs_associations_{version} |
create_final_output | Merges graph associations with manual HS assignments, PAC metadata, and HS name lookups | pg_hs_affiliations_{version} |
Affiliation logic: If the graph identifies a single health system → use it. If multiple → fall back to manual assignments → fall back to a fallback health system.
For more detail on the graph methodology, see PG Graph.
Configuration
Default DAG parameters:
| Parameter | Default | Description |
|---|---|---|
jaccard_threshold | 0.3 | Jaccard similarity threshold for filtering edges |
min_containment_threshold | 0.5 | Containment score threshold |
intersection_count | 50 | Minimum HCP NPI overlap to include edge |
sub_version | 2025_07 | Version suffix for table naming |
schema | tq_dev.internal_dev_csong_graph | Working dev schema |
prod_schema | tq_dev.internal_dev_csong_graph | Target promotion schema |
External table config: Defined in config/table_config.yaml — maps CMS, Definitive, Spines, Komodo, and MRF source tables.
Data Lineage
PECOS Raw Data + Manual Provider Type Mappings
↓
[clean_provider_types]
↓
[physician_groups] ← Definitive
↓
[Bridge Tables: PAC-HCP, HCO-HCP, TIN-HCP, HS-HCO]
↓
[Similarity Tables: PAC-HCO, PAC-TIN, HCO-TIN, PAC-PAC]
│
├──→ 5 Spines Tables (SQL-only) → QA → Promotion
│
└──→ [NetworkX Graph] → S3
↓
[PAC-HS Associations] ← Manual HS assignments
↓
[pg_hs_affiliations]
↓
associated_health_systems → QA → Promotion
Data Sources
| Source | Tables Used |
|---|---|
| CMS PECOS | pecos_provider_revalidation, pecos_provider_revalidation_reassignment |
| Definitive | physician_groups_overview, physician_groups_locations, physician_groups_affiliations |
| Komodo | medical_headers, medical_service_lines |
| MRF | compressed_providers (dynamically constructed per month) |
| Spines | spines_provider_hospitals, spines_provider_health_systems, and association tables |