Artificial intelligence is moving from proof-of-concept to point-of-care across nephrology: risk prediction, image and biopsy interpretation, dialysis optimization, transplant decision support, and now generative tools at the bedside. Yet, as of this writing, no major society (KDIGO, ASN, ERA) has issued a dedicated AI practice guideline. This guide addresses that gap with a physiology-grounded, evidence-anchored perspective for the practicing nephrologist: an editorial that organizes the current evidence and offers pragmatic guardrails until formal guidance exists.
What This Guide Is — and What It Is Not
This is a clinician-facing perspective: editorial and integrative, not a disease-specific patient handout, not a systematic review, and not a technical AI primer. Consistent with an integrative-functional approach to nephrology, AI is framed here as an amplifier of physiologic reasoning and the therapeutic toolkit — not a replacement for them. Every use case is tied back to mechanism and to cross-organ (kidney–cardiovascular–metabolic) integration.
On completing this guide, the reader will be able to: define and distinguish machine learning, deep learning, and large language models, and map each to concrete nephrology tasks; interpret common performance metrics (AUROC, calibration, sensitivity at a chosen alert threshold, lead time) and their clinical trade-offs; critically appraise an AI study using a structured checklist (data provenance, external validation, calibration, fairness, prospective evidence); identify the highest-yield, evidence-supported use cases by domain (AKI, CKD progression, pathology, dialysis, transplant); use generative AI safely for documentation, literature synthesis, and patient education while avoiding hallucination and confidentiality pitfalls; and apply governance, bias-mitigation, and regulatory principles — including the eGFR race-coefficient debate and software-as-a-medical-device pathways.
The editorial throughline
AI should sharpen pathophysiologic reasoning and therapeutic precision, not substitute for them. A prediction is only useful when it routes to a physiologically rational, guideline-aligned action.
Patient-facing companion figure for sharing in the clinic or on social media — where AI typically appears across kidney care, with the line that says the clinician still owns the decision.
The guide is organized into eight modules. Each module carries the same internal structure: objective → mechanistic framing → key content → clinical decision point → evidence anchors → caveats. The build map below is the orientation; bracketed numbers are reference IDs in the bibliography at the end of the guide.
| Mod | Module | Core question it answers | Evidence anchors |
|---|---|---|---|
| 1 | Foundations & taxonomy | What is AI / ML / DL / LLM, and why nephrology now? | [1,2,3,4] |
| 2 | Decision support & CKD risk | Who will progress, and how should that change management? | [1,14] |
| 3 | Acute kidney injury prediction | Can we see AKI coming early enough to act? | [5,6,7,8,9,10] |
| 4 | Digital pathology & imaging | Can algorithms read the biopsy / image and add signal? | [11,12,13] |
| 5 | Dialysis optimization | Can AI improve volume, anemia, and PD decisions? | [15,16,17] |
| 6 | Transplantation | Can AI improve matching, rejection, and dosing? | [18,19] |
| 7 | Generative AI & LLMs | How do I use ChatGPT-class tools safely in practice? | [20,21,22] |
| 8 | Governance, bias & regulation | How do I deploy this responsibly and lawfully? | [23,24,25,26] |
Foundations & Taxonomy — a Mental Model Without the Engineering Jargon
Objective. Give the clinician a working mental model of AI methods without engineering jargon — enough to read a paper, interrogate a vendor, and ask the right second question.
A nested mental model. AI ⊃ ML ⊃ DL ⊃ LLM — each layer maps to a concrete nephrology task. The renal-example cards on the right are what each layer actually does at the bedside.
Three nested ideas anchor the field. Machine learning (ML) is the broad family — patterns learned from data rather than hand-coded rules. Deep learning (DL) is the subset that uses multilayer neural networks, well suited to images and sequences (biopsy slides, dialysis waveforms, EHR time series). Large language models (LLMs) are deep networks trained on text via next-token prediction; their power and their pitfalls (hallucination, confidence without grounding) both flow from that single objective.
Three learning regimes are the second axis. Supervised learning trains on labeled examples — the canonical case is AKI label prediction from EHR features. Unsupervised learning finds structure without labels — for example, phenotype clustering of CKD that surfaces hidden subgroups within an apparently homogeneous KDIGO category. Reinforcement learning learns a policy from outcomes — dialysis-dosing or anemia-management agents are the archetype, and also the place where prospective validation lags the furthest behind the headlines.
Inputs that matter in nephrology cluster into four families: structured EHR labs and vitals, waveform and dialysis-machine telemetry, whole-slide histology, and free clinical text. Each maps to a different model family — gradient-boosted trees and tabular networks for the first, recurrent and transformer networks for the second, convolutional and vision-transformer networks for the third, language models for the fourth. The choice is not aesthetic; it shapes what the model can and cannot see.
Metrics literacy — why a high AUROC can still be useless at the bedside
Discrimination (AUROC) is the model's ability to rank a positive case above a negative one. Calibration is whether a predicted 30% risk corresponds to roughly 30 events per 100 patients who receive that prediction; it is the property that decides whether a threshold means what you think it does. Clinical utility is net benefit at the threshold you would actually act on, and depends on prevalence, alert burden, and downstream action. Lead time and alert specificity decide whether the prediction arrives early enough, and cleanly enough, to change care rather than merely annotate it.
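The distinction between discrimination, calibration, and net benefit can be made concrete with a small sketch. The cohort below is a hypothetical ten-patient toy example, and the function names are illustrative; the net-benefit expression is the standard decision-curve form (true positives minus false positives weighted by the threshold odds, per patient). The point is that a strictly monotone inflation of every predicted risk leaves AUROC untouched while calibration and net benefit at a fixed threshold both degrade.

```python
from typing import Sequence, Tuple


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random event case is ranked above a random non-event case."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Pairwise comparison; a tie counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(y_true: Sequence[int],
                             y_prob: Sequence[float]) -> Tuple[float, float]:
    """Return (mean predicted risk, observed event rate)."""
    return sum(y_prob) / len(y_prob), sum(y_true) / len(y_true)


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Decision-curve net benefit of acting on every patient at or above the threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# HYPOTHETICAL toy cohort: 10 patients, 2 events (20% event rate).
outcomes = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
calibrated = [0.05, 0.05, 0.10, 0.10, 0.10, 0.15, 0.20, 0.25, 0.45, 0.55]
# Same ranking, every risk inflated by a strictly monotone transform (square root).
inflated = [p ** 0.5 for p in calibrated]

for name, probs in (("calibrated", calibrated), ("inflated", inflated)):
    mean_pred, observed = calibration_in_the_large(outcomes, probs)
    print(name,
          "AUROC", round(auroc(outcomes, probs), 3),
          "mean predicted", round(mean_pred, 3),
          "observed", round(observed, 3),
          "net benefit at 0.2", round(net_benefit(outcomes, probs, 0.2), 3))
```

In this toy cohort both sets of predictions rank patients identically, but at an action threshold of 20% the inflated model flags every patient and its net benefit collapses to that of treating everyone.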
Clinical decision point
Before trusting any model, ask three questions in order: (1) what was it trained to predict, (2) in whom was it trained, and (3) does its output arrive early enough and specifically enough to change an action? A discrimination metric in isolation answers none of these.
Evidence anchors: Loftus 2022 [1] and Hueso 2024 [2] for the field map; Cheungpasitporn 2024 [3] for critical-care nephrology framing; Filler 2022 [4] for the call-to-action framing in pediatric nephrology.
CKD Risk Stratification — from Population Risk to Action-Linked Prediction
Objective. Move from population risk to individualized, action-linked prediction of CKD progression.
Mechanistic framing. CKD progression integrates three axes: glomerular hemodynamics (intraglomerular pressure, single-nephron hyperfiltration), proteinuria-driven tubulointerstitial injury (the strongest modifiable driver of decline at any baseline eGFR), and cardiometabolic load (diabetes, hypertension, obesity). Multivariable ML extends the logic of the Kidney Failure Risk Equation (KFRE) by capturing nonlinear interactions among these axes — for example, how an elevated UACR amplifies risk far more steeply in the presence of poorly-controlled diabetes than in its absence.
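As a minimal sketch of what an interaction between risk axes means numerically, the toy logistic model below adds a product term between UACR and diabetes status. Every coefficient is hypothetical and chosen only to illustrate the shape of the effect; none is fitted to data, and this is neither KFRE nor any published ML model. A single product term is the simplest possible interaction; tree ensembles and neural networks learn richer versions of the same idea.

```python
import math

# HYPOTHETICAL coefficients for illustration only; not fitted to any dataset.
INTERCEPT = -4.0
BETA_LOG2_UACR = 0.30     # log-odds per doubling of UACR without diabetes
BETA_DIABETES = 0.50      # log-odds shift for poorly controlled diabetes
BETA_INTERACTION = 0.40   # extra log-odds per UACR doubling when diabetes is present
UACR_REFERENCE_MG_G = 30.0


def progression_risk(uacr_mg_g: float, poorly_controlled_dm: bool) -> float:
    """Toy progression risk with a UACR x diabetes interaction term."""
    doublings = math.log2(uacr_mg_g / UACR_REFERENCE_MG_G)
    dm = 1.0 if poorly_controlled_dm else 0.0
    logit = (INTERCEPT
             + BETA_LOG2_UACR * doublings
             + BETA_DIABETES * dm
             + BETA_INTERACTION * doublings * dm)
    return 1.0 / (1.0 + math.exp(-logit))


def risk_increase(poorly_controlled_dm: bool) -> float:
    """Absolute risk change when UACR rises from 30 to 300 mg/g."""
    return (progression_risk(300.0, poorly_controlled_dm)
            - progression_risk(30.0, poorly_controlled_dm))


print("UACR 30 -> 300, no diabetes:", round(risk_increase(False), 3))
print("UACR 30 -> 300, poorly controlled diabetes:", round(risk_increase(True), 3))
```

With the interaction term set to zero, the log-odds slope for UACR would be identical in both groups; the product term is what lets the same rise in UACR carry a steeper penalty when diabetes is present.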
From KFRE to ML — what the added complexity buys
KFRE remains the right tool for most outpatient stratification: it is parsimonious, externally validated, and easy to operationalize. Where ML adds discrimination is in cardiometabolic populations whose risk is dominated by nonlinear interactions the four- and eight-variable KFRE cannot fully capture. The worked exemplar is Klinrisk, an ML model for CKD progression that was externally validated in the CANVAS program and the CREDENCE trial, with discrimination meaningfully above KFRE in patients with type 2 diabetes [14]. The headline is not that ML always wins; it is that ML wins where the underlying biology is most nonlinear, and the modeling choice should be driven by the population.
The decision a risk model should change is not diagnosis (the patient already has CKD) but therapeutic escalation timing and access planning: SGLT2-inhibitor and non-steroidal MRA intensification, nephrology referral timing, and vascular access planning all benefit from a numerically anchored future risk. KDIGO recommends KFRE-anchored referral thresholds; an ML risk score that beats KFRE in your population is a legitimate substitute for the same workflow, not a license to defer therapy until the score crosses an arbitrary line.
Clinical decision point
Use validated risk output to escalate guideline-directed therapy and to time referral and access — not as a standalone prognosis delivered to the patient. A 30% 2-year risk routed to "do nothing differently" is a wasted prediction.
Evidence anchors: Tangri 2024 [14]; framed within Loftus 2022 [1].
Acute Kidney Injury Prediction — the Modifiable Window
Objective. Show where continuous, EHR-driven AKI prediction is mature enough to influence care — and where it is not.
Mechanistic framing. AKI is a final common pathway of hemodynamic, septic, nephrotoxic, and obstructive insults. Early prediction targets the modifiable window — perfusion, nephrotoxin exposure, and fluid strategy — before tubular injury becomes established and creatinine has risen. This window is what makes lead-time prediction valuable; it is also what makes alert burden so dangerous, because the model that fires 48 hours early is the same model that, miscalibrated, fires on half the ICU.
The landmark — and its honest caveats
The 2019 DeepMind continuous AKI model predicted 55.8% of inpatient AKI and 90.2% of AKI requiring dialysis up to 48 hours ahead, a step-change in lead time over creatinine-trigger systems [5]. The honest caveats were published alongside the headline numbers and matter at least as much: two false alerts per true alert at the operating threshold, and a training dataset (US Department of Veterans Affairs) that was ~94% male. The first caveat is the alert-fatigue problem in numerical form; the second is the transportability problem made concrete.
Setting-specific models have followed: cardiac surgery AKI [6], sepsis-associated AKI with interpretable approaches that surface the contributing features rather than emitting a single opaque score [7,8], and pediatric critical-care AKI [9], where the pre-AI baseline is least mature. Pickkers 2021 [10] is the indispensable physiology and management backdrop: the "why" without which an AKI alert is just a notification.
Audit alert burden locally before adoption
The published positive-predictive value at any threshold is a function of the original population's AKI prevalence. In a lower-acuity unit with lower prevalence, the same model at the same threshold will have a lower positive-predictive value, so the false-alert ratio will rise even if sensitivity and specificity are unchanged. Pilot on a quality metric (UF rate, nephrotoxin holds, dose-adjustment uptake) before letting it touch order sets, and define a kill-switch criterion in writing.
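The dependence of alert burden on prevalence follows directly from Bayes' rule, and a short sketch makes the local audit arithmetic explicit. The sensitivity, specificity, and prevalence values below are hypothetical inputs for illustration, not figures from any cited study. The only number taken from the text is the ratio itself: two false alerts per true alert corresponds to a positive-predictive value of one third.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive-predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def false_alerts_per_true_alert(ppv_value: float) -> float:
    """Convert a PPV into the false-to-true alert ratio seen at the bedside."""
    return (1.0 - ppv_value) / ppv_value


# HYPOTHETICAL operating point and unit prevalences, for illustration only.
SENSITIVITY = 0.56
SPECIFICITY = 0.95
for unit, prevalence in (("higher-acuity unit", 0.13), ("lower-acuity unit", 0.03)):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(unit,
          "PPV", round(value, 3),
          "false alerts per true alert", round(false_alerts_per_true_alert(value), 2))
```

Running the same calculation with a unit's own AKI prevalence is a quick first check on whether a published operating threshold is tolerable locally before any prospective pilot.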
From alert to action — the five-step bundle every AI AKI alert should route to. Without the bundle, the algorithm has annotated the chart without changing care.
Evidence anchors: Tomašev 2019 [5]; Tseng 2020 [6]; Yue 2022 [7]; Fan 2023 [8]; Dong 2021 [9]; Pickkers 2021 [10].
Digital Pathology & Imaging — Reproducibility, Not Autonomy
Objective. Assess where computational pathology and imaging add reproducible signal to nephrology diagnosis — and where the right framing is augmentation rather than replacement.
Deep-learning histopathologic assessment of kidney tissue is the most mature image domain. Hermsen 2019 [11] demonstrated automated segmentation of glomeruli, tubules, and interstitium on PAS-stained biopsies with performance approaching trained pathologists, turning what had been a qualitative impression ("mild to moderate IFTA", that is, interstitial fibrosis and tubular atrophy) into a quantitative, reproducible measurement. The clinical value is less in diagnosing the unknown and more in reducing inter-reader variability for grading metrics that drive prognosis.
A perennial reproducibility barrier in digital pathology is stain variability: a model trained on one laboratory's slides degrades on another's. Bouteldja 2022 [12] addressed this directly with stain-independent deep learning that generalizes across laboratories, an important step toward tools that survive the move from the research site to the community lab. The lesson generalizes: any pathology AI should be re-validated on the local lab's stains, scanners, and case mix before being trusted at the report level.
The non-invasive frontier is the oculo-renal axis: Meng 2025 [13] showed deep learning on retinal images can infer diabetic kidney disease at the population level. The mechanistic plausibility is real (the retina and the glomerulus share microvascular biology, and diabetic retinopathy and DKD co-occur far more often than chance), but the bedside application is screening, not biopsy substitution. The right framing is a non-invasive microvascular window, not a non-invasive biopsy.
Clinical decision point
Treat algorithmic pathology as a quantification and consistency aid for the pathologist, not an autonomous diagnostician; require local validation on the lab's own stains and scanners before relying on the output at the report level.
Why deep learning on retinal images can infer diabetic kidney disease. The retina and the glomerulus share the same microvascular biology, so a model trained on fundus photographs learns a non-invasive microvascular signature that tracks DKD risk. Screening only — not a biopsy substitution.
Evidence anchors: Hermsen 2019 [11]; Bouteldja 2022 [12]; Meng 2025 [13].
Dialysis Optimization — Volume, Anemia, and Modality-Specific Risks
Objective. Map AI to the recurring decisions of HD and PD: volume, anemia, intradialytic events, access surveillance, and PD-specific risks.
Mechanistic framing. Intradialytic instability is the mismatch between ultrafiltration rate and plasma-refill / cardiovascular reserve. Predictive models target this mismatch to pre-empt hypotension and chronic fluid overload — the two failure modes that dominate hospitalization risk on chronic hemodialysis. Volume models are usefully framed as UF-rate decision support, not as autonomous prescribers.
Volume & IDH prediction
Intradialytic-hypotension (IDH) models are built on machine vitals, interdialytic weight gain (IDWG), prescription, and ultrafiltration trajectory; they are best deployed against a quality metric (UF rate exceedances, IDH episodes) rather than direct prescription edits [15,16].
Anemia / ESA dosing
Reinforcement-learning agents for erythropoiesis-stimulating agent (ESA) titration in maintenance HD. Promising on dose stability and target-band time; evidence remains predominantly single-center and retrospective [15,16].
AV access surveillance
Image- and waveform-based stenosis detection on fistula and graft monitoring, most useful as a triage layer that routes ambiguous studies to the access team [16].
Peritoneal dialysis
Technique-failure risk, peritonitis prediction, and cardiovascular-event prediction. PD-specific evidence is more recent and remains preliminary [17].
Reality check
Guidelines remain cautious; most published dialysis AI is single-center and retrospective. Pilot dialysis AI as decision support on quality metrics (UF rate, hypotension episodes, dose stability) with prospective audit before it touches prescriptions [15,16].
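Auditing against a quality metric rather than editing prescriptions can start with something as simple as the sketch below, which computes the ultrafiltration rate per session and the share of sessions above a limit. The session records are hypothetical, the field names are illustrative, and the 13 mL/kg/h default is a placeholder assumption to be replaced by the unit's own policy; normalizing to target weight is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List

# ASSUMPTION: placeholder audit limit; set this from local unit policy.
UF_RATE_LIMIT_ML_KG_H = 13.0


@dataclass
class Session:
    uf_volume_ml: float      # total ultrafiltration volume for the session
    duration_h: float        # treatment time in hours
    target_weight_kg: float  # ASSUMPTION: rate is normalized to target weight
    idh_episode: bool        # whether intradialytic hypotension was recorded


def uf_rate_ml_kg_h(session: Session) -> float:
    """Ultrafiltration rate in mL/kg/h."""
    return session.uf_volume_ml / (session.duration_h * session.target_weight_kg)


def audit(sessions: List[Session], limit: float = UF_RATE_LIMIT_ML_KG_H) -> Dict[str, float]:
    """Summarize UF-rate exceedances and IDH episodes across sessions."""
    n = len(sessions)
    exceedances = sum(1 for s in sessions if uf_rate_ml_kg_h(s) > limit)
    idh = sum(1 for s in sessions if s.idh_episode)
    return {
        "sessions": float(n),
        "uf_exceedance_fraction": exceedances / n,
        "idh_fraction": idh / n,
    }


# HYPOTHETICAL sessions for illustration only.
sessions = [
    Session(3000, 4.0, 70, False),
    Session(4000, 3.5, 60, True),
    Session(2500, 4.0, 80, False),
    Session(3600, 4.0, 65, False),
]
print(audit(sessions))
```

Tracking these two fractions before and after a model is switched on, without letting the model change any prescription, is one way to run the prospective audit described above.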
Evidence anchors: Burlacu 2020 [15]; Sandys 2022 [16]; Bai 2022 [17].
Transplantation — Matching, Rejection, and Immunosuppression Dosing
Objective. Survey AI across the transplant continuum: pre-transplant organ matching, post-transplant rejection prediction, and tacrolimus dose individualization.
Pre-transplant, the workhorses are matching and waitlist decision support: models that rank donor–recipient pairs on graft-survival probability or composite utility scores [18]. Post-transplant, graft-rejection prediction draws on serial labs, DSA dynamics, and (where available) donor-derived cell-free DNA, with ML offering modest discrimination gains over single-marker thresholds [18]. Tacrolimus dose individualization is one of the more clinically convincing transplant use cases: nonlinear pharmacokinetics, narrow therapeutic index, and genuine variability in trough achievement make pharmacometric ML models more useful than a flat per-kilogram rule [19].
The delivery layer wrapping these models is increasingly important. Schwantes 2021 [19] frames technology-enabled care and remote monitoring (connected blood-pressure cuffs, home labs, asynchronous symptom check-ins) as the platform on which transplant AI actually lands. The implication is operational: a dosing model without a remote-monitoring infrastructure rarely changes care.
Clinical decision point
AI may refine donor–recipient matching and dosing, but allocation and immunosuppression remain physician-and-protocol governed. Use AI to surface candidates and propose doses, not to decide. The accountable physician owns both the rationale and the outcome.
Evidence anchors: Alamgir 2022 [18]; Schwantes 2021 [19].
Generative AI & LLMs in Practice — Patterns That Are Safe at the Bedside
Objective. Give practical, safe patterns for ChatGPT-class tools at the point of care and the desk — where they help, where they fail, and what the non-negotiables are.
The high-yield uses are surprisingly narrow and very real: drafting documentation (a discharge summary skeleton, a clinic letter first pass, a procedure note template), summarizing literature (a structured abstract digest, a comparison table across two trials, a glossary for a patient), generating patient-education material (consistent with this site's guide library — the lay-language paragraph the clinic visit did not have time for), and answering bounded clinical questions where the answer is easy to verify against a primary source. Each of these reuses what LLMs are genuinely good at: fluent text from a structured starting point, with a human verifier downstream.
Retrieval-Augmented Generation (RAG) — the safer architecture for clinical use
A general-purpose LLM hallucinates partly because it has no idea what it does not know. Retrieval-Augmented Generation (RAG) grounds the model in a curated corpus (for nephrology, that might be KDIGO guidelines, ASN core curriculum, your institution's order sets, the specific calculator scripts on this site), so that the system first retrieves the relevant passage and the model then writes around it [20]. The hallucination rate drops because the model is no longer generating plausible-sounding citations from training-time priors; it is paraphrasing a retrieved source you can audit. For any clinical-content deployment, prefer a RAG architecture over a raw LLM.
Retrieval-Augmented Generation in five blocks. The model paraphrases an audited source instead of generating plausible-sounding citations from training-time priors — and the clinician still signs.
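The retrieve-then-write pattern can be sketched without any model at all, which helps show where the audit trail comes from. The toy below uses keyword overlap as a stand-in for a real retriever, a three-passage corpus whose text is placeholder wording rather than real guidance, and hypothetical function names; the language-model call itself is deliberately left out. A production system would typically use embedding-based retrieval, so treat this as an illustration of the control flow only.

```python
import re
from typing import Dict, List, Optional, Set

# HYPOTHETICAL curated corpus. Passage text is placeholder wording, not real guidance.
CORPUS: List[Dict[str, str]] = [
    {"id": "local-protocol-001",
     "text": "PLACEHOLDER local policy text on holding nephrotoxins after an AKI alert."},
    {"id": "local-protocol-002",
     "text": "PLACEHOLDER local policy text on vascular access referral timing."},
    {"id": "patient-leaflet-003",
     "text": "PLACEHOLDER patient leaflet text on peritoneal dialysis exit-site care."},
]

STOPWORDS = {"a", "an", "the", "on", "of", "after", "is", "be", "when", "what",
             "should", "and", "to", "in", "for", "text", "placeholder"}


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens with common filler words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, k: int = 2, min_overlap: int = 2) -> List[Dict[str, str]]:
    """Return up to k passages sharing at least min_overlap keywords with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(doc["text"])), doc) for doc in CORPUS]
    hits = [(score, doc) for score, doc in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits[:k]]


def build_grounded_prompt(question: str) -> Optional[str]:
    """Build a prompt that confines the model to retrieved sources, or abstain."""
    passages = retrieve(question)
    if not passages:
        return None  # nothing retrieved: abstain rather than let the model improvise
    sources = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return ("Answer using ONLY the sources below and cite their ids. "
            "If the sources do not answer the question, say so.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}")


print(build_grounded_prompt("When should nephrotoxins be held after an AKI alert?"))
print(build_grounded_prompt("What is the tacrolimus target trough?"))
```

The second question retrieves nothing from this corpus, so the sketch abstains instead of producing a prompt; that refusal path, together with the passage ids carried into the prompt, is what gives the verifying clinician something concrete to check.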
Performance and limits
Structured evaluation of LLM decision support shows the same pattern across specialties: competence with meaningful error rates. Niel 2025 [22] is a pediatric-nephrology exemplar: capable performance on bounded clinical questions, but errors that a non-expert reader would not catch. Clinician adoption is also moving ahead of validation: Eppler 2023 [21] documented widespread ChatGPT use among trainees and clinicians at a rate that already outstrips the evidence base.
Non-negotiables
(1) No PHI into consumer tools. A free public chatbot is generally not covered by a HIPAA business associate agreement or an equivalent data-processing agreement, and prompts may be retained and used for model training by default. (2) Verify every fact and citation before it reaches a chart or a patient; LLMs fabricate confidently. (3) The model never owns the decision. The clinician signs the note, the order, and the chart.
Clinical decision point
Use LLMs to draft and synthesize, then apply physician verification before anything reaches the chart or the patient. Prefer RAG / grounded tools for clinical content; reserve raw LLMs for non-clinical drafting where the verifier downstream is also you.
Patient-facing companion figure for sharing in the clinic or on social media — the verification loop in plain language. The AI tool never owns the decision; the clinician does.
Evidence anchors: Miao 2024 [20]; Niel 2025 [22]; Eppler 2023 [21].
Governance, Bias & Regulation — Deploying AI Responsibly
Objective. Equip the clinician to deploy AI responsibly — fairness, oversight, law, and the human relationship.
Algorithmic bias made concrete — the eGFR race-coefficient debate
Algorithmic bias is easy to discuss abstractly and hard to feel as a clinician until you watch it reclassify your own patients. The eGFR race coefficient is the worked example. For two decades, MDRD- and CKD-EPI-based equations applied a multiplicative adjustment to creatinine-based eGFR in patients reported as Black, producing a higher estimated GFR for the same creatinine. Removing the coefficient, and migrating to cystatin C-based equations where feasible, shifts who is labeled as having kidney dysfunction, which in turn shifts referral, transplant listing, and drug-dosing decisions. Pinsino 2023 [26] traces this reclassification through an ICU-survivor cohort: a non-trivial share of patients move between CKD categories, and the downstream care follows.
Honesty box — bias is not abstract
Removing the eGFR race coefficient is not a cosmetic adjustment. It changes who is diagnosed, who is referred, who is dosed, and who is listed. The same logic generalizes to any AI deployed in nephrology: the variables embedded in the model — and the populations it was trained in — are clinical decisions, not engineering details. Inspect them.
Bias made concrete. Same creatinine, different equation, different CKD category, different downstream decisions — the same logic that applies to any AI model deployed in nephrology.
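The reclassification can be reproduced numerically. The sketch below implements the CKD-EPI 2009 creatinine equation (with the race coefficient) and the race-free CKD-EPI 2021 creatinine equation, then maps each result to a KDIGO GFR category. The coefficients are not given in this guide; they are transcribed from the published equations as recalled and should be verified against the original publications before any real use. The patient is hypothetical, and this is an illustration, not a clinical calculator.

```python
def ckd_epi_2009(scr_mg_dl: float, age: float, female: bool, black: bool) -> float:
    """CKD-EPI 2009 creatinine eGFR (mL/min/1.73 m2), including the race coefficient."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = 141.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159  # the multiplicative race coefficient under debate
    return egfr


def ckd_epi_2021(scr_mg_dl: float, age: float, female: bool) -> float:
    """CKD-EPI 2021 creatinine eGFR (mL/min/1.73 m2), race-free."""
    kappa = 0.7 if female else 0.9
    alpha = -0.241 if female else -0.302
    ratio = scr_mg_dl / kappa
    egfr = 142.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.200 * 0.9938 ** age
    if female:
        egfr *= 1.012
    return egfr


def gfr_category(egfr: float) -> str:
    """KDIGO GFR category from eGFR."""
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


# HYPOTHETICAL patient: 60-year-old man reported as Black, creatinine 1.4 mg/dL.
with_coefficient = ckd_epi_2009(1.4, 60, female=False, black=True)
race_free = ckd_epi_2021(1.4, 60, female=False)
print("2009 with race coefficient:", round(with_coefficient, 1), gfr_category(with_coefficient))
print("2021 race-free:", round(race_free, 1), gfr_category(race_free))
```

For this hypothetical patient the same creatinine falls on opposite sides of the 60 mL/min/1.73 m2 boundary depending on the equation, which is exactly the kind of shift that changes a CKD label and the referral and dosing decisions attached to it.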
Regulation — what "FDA-cleared" does and does not guarantee
The regulatory frame for medical AI is the Software-as-a-Medical-Device (SaMD) pathway. Yu 2023 [25] maps the FDA innovation process and the cleared landscape, useful context for understanding what a clearance actually certifies. The short answer: clearance attests that the manufacturer documented intended use, demonstrated substantial equivalence (510(k)) or supplied premarket data (PMA), and committed to post-market surveillance. It does not certify clinical superiority over existing standards, generalizability to your population, or maintenance of performance after model updates. The clearance is a floor, not a ceiling.
The five phases a deployable nephrology AI has to live inside. If your institution cannot name the accountable owner and the kill-switch criterion, the tool is not ready — regardless of its publication record.
Ethics, accountability, and the therapeutic relationship
The ethical, social, and legal scaffolding around medical AI (accountability, transparency, consent, liability) is still being constructed in case law and regulation [23]. Two practical implications: (1) every deployed tool needs a named accountable owner within the institution, and (2) patients increasingly deserve disclosure that an AI tool participated in a decision. Morrow 2023 [24] adds a softer but equally important consideration: as workflows automate, preserving the compassion and the therapeutic alliance becomes an active design choice, not an automatic outcome.
Clinical decision point
Adopt only tools with transparent training populations, demonstrated calibration in your patients, a named accountable owner, and a documented audit and override mechanism. If any of these are missing, the answer is not "later" — it is "not until."
Evidence anchors: Sung 2023 [23]; Morrow 2023 [24]; Yu 2023 [25]; Pinsino 2023 [26].
The 7-Point AI Appraisal Checklist
A reusable structure for any nephrology AI claim — from a peer-reviewed paper to a vendor demo. Each item is small; together they are the difference between adopting a tool and being adopted by it. Run the seven questions in order. A "no" on any one is not necessarily a veto, but it is a question that must be answered before deployment, not after.
How to use the checklist
Print it on the back of a card or paste it into your appraisal template. Apply it to one published model per month — the discipline matters more than the throughput. Two of the seven items (calibration; subgroup fairness) are the ones most often glossed over in conference talks; spend extra time there.
| # | Question | Why it matters |
|---|---|---|
| 1 | What exactly does it predict? | Outcome definition and label quality determine everything downstream. "AKI" defined as a coding flag, a KDIGO creatinine criterion, and a clinical adjudication are three different models in a trench coat. |
| 2 | In whom was it trained? | Population mismatch (age, sex, race/ethnicity, comorbidity, region, care setting) breaks transportability. A model trained on a 94%-male VA cohort behaves differently in a community OB-medical service. |
| 3 | Externally validated? | Internal-only performance routinely overstates real-world accuracy. Require validation in a population that did not contribute to training — and report the drop honestly. |
| 4 | Is it calibrated? | Good discrimination with poor calibration misleads at the decision threshold. A model that ranks well but assigns a 50% risk to events that occur 20% of the time will systematically over-treat. |
| 5 | Does it arrive in time to act? | Lead time and alert specificity decide whether the prediction can change care. An alert that fires when the action window has closed is documentation, not decision support. |
| 6 | Is it fair? | Check subgroup performance and embedded variables (e.g., race coefficients). A model whose accuracy is preserved by reclassifying patients who should not be reclassified is not fair — it is convenient. |
| 7 | Who is accountable? | Named owner, override path, monitoring cadence, and regulatory status. If no one in the institution can name the responsible clinical lead and the kill-switch criterion, the deployment is not ready. |
A printable card to apply to any AI claim.
| Tier | Evidence level |
|---|---|
| 1 | Strongest evidence (externally validated, calibrated, prospective) |
| 2 | Externally validated but retrospective |
| 3 | Single-center, internal validation only |
| 4 | Preprint or vendor claim without peer review |
Map each model to a tier when you fill the checklist; the tier should track how aggressively the model is allowed to touch a workflow.
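The tier definitions above can be encoded so that an appraisal template assigns the tier consistently. The sketch below is one possible reading: the four evidence flags and the precedence of the rules are assumptions of this sketch (for example, a peer-reviewed, externally validated model that lacks either calibration or prospective evidence is placed in Tier 2), and the workflow scopes attached to each tier are hypothetical examples, not recommendations from the guide.

```python
from dataclasses import dataclass


@dataclass
class EvidenceProfile:
    peer_reviewed: bool
    externally_validated: bool
    calibrated: bool
    prospective: bool


def evidence_tier(profile: EvidenceProfile) -> int:
    """Map appraisal flags to the guide's four evidence tiers (rule order is an assumption)."""
    if not profile.peer_reviewed:
        return 4  # preprint or vendor claim without peer review
    if not profile.externally_validated:
        return 3  # internal validation only
    if profile.calibrated and profile.prospective:
        return 1  # externally validated, calibrated, prospective
    return 2      # externally validated but short of Tier 1


# HYPOTHETICAL mapping from tier to permitted workflow reach; adapt to local governance.
PERMITTED_SCOPE = {
    1: "may drive clinician-facing alerts with override and monitoring",
    2: "silent-mode pilot against quality metrics",
    3: "retrospective local evaluation only",
    4: "not deployed; request peer-reviewed evidence",
}

example = EvidenceProfile(peer_reviewed=True, externally_validated=True,
                          calibrated=True, prospective=False)
tier = evidence_tier(example)
print("Tier", tier, "->", PERMITTED_SCOPE[tier])
```

Recording the four flags alongside the tier keeps the reasoning visible, so a later reviewer can see which missing piece of evidence is holding a model back.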
Kidney–Cardiovascular–Metabolic Integration — the Sidebar Every Module Carries
A recurring thread runs across all eight modules: every AI use case ties back to mechanism and to the interconnected physiology of the kidney, heart, and metabolism. CKD progression is also cardiovascular progression — and an SGLT2 inhibitor escalated on the back of a CKD-risk model is also a heart-failure and cardiovascular-event intervention. AKI prediction is also hemodynamic prediction, and the patient flagged by the model is also the patient at risk for a non-renal cardiovascular event. Anemia management in dialysis is also a cardiovascular conversation. Even the LLM use case — synthesizing patient-education material — is most useful when it explains the same physiology to the patient that the clinician is acting on.
The corollary is a deployment principle: a prediction is only useful when it routes to a physiologically rational, guideline-aligned action (KDIGO, ADA, ACC, ERA). The editorial throughline, again: AI should sharpen pathophysiologic reasoning and therapeutic precision — not substitute for them.
The kidney–cardiovascular–metabolic triangle every AI use case in nephrology touches. A prediction is only useful when it routes to a physiologically rational, guideline-aligned action.
Guideline alignment and positioning
As of this draft, KDIGO, ASN, and ERA have not published a dedicated AI practice guideline. This guide is therefore positioned as a perspective that organizes the current evidence and offers pragmatic guardrails until formal guidance exists — explicitly aligned with existing KDIGO (CKD/AKI), ADA (diabetic kidney disease), and ACC (cardiorenal) recommendations wherever an AI use case touches a managed decision. When society guidance arrives, this perspective should yield to it; until then, the seven-point checklist and the per-module decision points are the working scaffold.
