
Published on LinkedIn on June 17th
1. Not enough evidence!
When we first embarked on developing HepatoPredict, our idea was very clear: create a biomarker-based test that could reliably inform clinical decisions about liver transplantation in patients with hepatocellular carcinoma (HCC). In our minds, the pathway was straightforward: demonstrate a strong correlation with clinical outcomes, publish the results, and voilà, we’d be ready to introduce it into practice. After all, that’s what the literature seemed to say. Retrospective cohort studies, often multicenter, published promising performance metrics, and the market was ready to adopt. So, naively, we thought: “Publish a paper showing better performance than existing markers or tools, and the rest will follow.” But reality, as always, had other plans.
Initially, we envisioned a business model centered on a central laboratory, where we’d run the tests and provide results to transplant centers – that was what we knew from our previous company, a clinical testing laboratory. But, as I discuss in another chapter, the decision to shift from a laboratory service to a kit that other labs could use was transformative. In fact, it was one of the most momentous decisions we made, for better or worse. It took us into the world of in vitro diagnostics (IVD) manufacturing, with all the regulatory, quality, and validation challenges that entails. What we didn’t fully appreciate at the outset was how complex the validation process truly is. In a clinical context, the market, the medical community, and the regulatory bodies demand rigorous demonstration not only that the biomarker correlates with outcomes, but that it can be reliably measured across different laboratories, that it truly improves patient management, and that it does so in a way that justifies its costs and risks. Why were we so surprised? Well, in part ignorance: coming from diagnostics services, developing what are commonly described as Laboratory Developed Tests (LDTs), we did not realize that a manufactured device was a different beast.
A second reason is harder to rationalize. HepatoPredict was aimed at improving on the multitude of clinical decision criteria used by hepatology and hepatobiliary teams to select liver cancer patients for transplantation. These criteria are mostly based on scoring how many tumors the patient has and how large they are: if they are too many or too big, the patient is no longer eligible for transplantation. These are empirical rules, and the international guidelines give each clinical center the freedom to decide which empirical criteria to use. Most of these criteria were adopted based on a single-center retrospective observational cohort study. Many of these studies enrolled underwhelming numbers of patients, which is to be expected given the specialized nature of the procedure – liver transplantation – as it is hard to assemble large case series. Our initial assumption was therefore that if we assembled a series of retrospective, multicenter studies that showed better performance metrics, i.e. superior accuracy, sensitivity, and specificity, clinicians would adopt it. But that’s not how it works. In fact, for biomarkers, the standard of evidence required is higher. It is that realization that motivated me to write this article, to share some lessons and simplify life for those trying to develop their own biomarker sets.
But there is a third reason why stronger clinical evidence was required: the need for an extra procedure with associated risks – a liver biopsy. The sample we relied on was, at first glance, familiar: coming from an oncology background, we thought that tissue samples or blood tests would be straightforward. But in hepatocellular carcinoma, the most common type of liver cancer, the landscape is unique, as diagnosis often doesn’t require a tissue biopsy at all. And as I already alluded to above, the decision to transplant hinges on imaging, serum markers like AFP, and obviously liver function assessments. So, a new procedure would have to be introduced, one with associated risks. In fact, as we were developing this test I saw a close friend suffer massive bleeding following a liver biopsy, and he had to spend time in the ICU before he could be discharged (he is fine now!). In addition, there was a single report that biopsying a liver tumor seeded new tumors, and because of that, the patient, now with more tumors than before the biopsy, became ineligible for transplantation. Technology has evolved, and we learnt from the specialists that biopsies are now conducted with Teflon-coated needles, that glue(!) is injected to seal the wound and prevent tumor spreading, and that seeding is not a real problem. However, the notion that a liver biopsy might cost a patient access to a life-saving transplantation made it a difficult sell. So, in the end, the hepatology and hepatobiliary communities kept telling us that HepatoPredict was great, that it was really needed, but that we needed… more evidence!
How do we demonstrate that our biomarker-based test performs reliably across different labs, patient populations, and clinical settings? How do we prove that it truly impacts patient outcomes, not just correlates with them? The scientific literature offers little clarity on what validation is truly required for an IVD to be adopted. The nature of the clinical need, the risks involved, and the perspectives of clinicians and patients all influence what evidence is deemed sufficient. I learned that, beyond technical performance, the validation process must encompass a broader view that considers reproducibility, robustness, and real-world performance. In this article, I want to share what I’ve learned from that journey, lessons I wish I had known at the start. Because, in the end, the path from a promising biomarker to a clinically validated IVD is far more complex than a simple correlation. It’s a process that demands a clear understanding of what validation entails, what evidence is needed, and how to navigate the murky waters of clinical and regulatory expectations. My hope is that by outlining these insights, I can help others avoid some of the pitfalls we encountered and better prepare for the rigorous validation journey ahead. It doesn’t matter whether you’re an investigator, a clinician, or an investor – understanding what validation really means is essential to bringing meaningful, reliable diagnostics to the bedside.
2. What Type of Evidence Is Required for Biomarker Adoption?
Despite thousands of biomarker discoveries reported in the academic literature (!), very few ever reach clinical use. A widely cited review in Nature estimated that fewer than 1 in 100 biomarker candidates undergo sufficient validation to be implemented in patient care.[1] The majority fail not because the underlying biology is flawed, but because researchers rarely provide the rigorous, multi-dimensional evidence required for adoption. These failures stem from limited investment in analytical validation, insufficient demonstration of clinical utility, and a disconnect between discovery science and real-world healthcare needs.[2,3] As a result, the translational bottleneck remains a critical challenge: promising biomarkers are published, praised, and forgotten, never reaching the patients.
For a biomarker to reach clinical adoption, it must meet evidence standards that extend far beyond statistical correlation. Over the past two decades, a structured vocabulary has emerged, anchored in regulatory frameworks such as the FDA’s Biomarker Qualification Program and the EU’s In Vitro Diagnostic Regulation (IVDR), that defines four core categories of evidence: scientific validity, analytical validity, clinical validity, and clinical utility. This framework is now widely used by regulators, health technology assessors, and payers to evaluate the totality of evidence supporting diagnostic tools.[4-6]
Scientific validity refers to the biological plausibility of the biomarker, its mechanistic link to disease processes. This evidence typically arises from genomic, proteomic, or pathophysiological studies that show a consistent, causally plausible relationship between the biomarker and the condition it purports to detect or predict. Without this foundational rationale, even a technically sound test will face skepticism.
Analytical validity addresses whether the test can reliably and reproducibly measure the biomarker across different laboratory conditions, operators, and platforms. It encompasses performance characteristics such as accuracy, precision, linearity, limit of detection, and robustness against pre-analytical variation. Failure to demonstrate analytical validity precludes adoption, especially in decentralized testing environments.
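To make one of these performance characteristics concrete, here is a minimal sketch of how precision might be quantified as a coefficient of variation (CV), the metric typically reported in EP05-style repeatability studies. All measurement values are invented for illustration; they are not HepatoPredict data.

```python
import statistics

def cv_percent(measurements):
    """Coefficient of variation (%) – a standard precision metric
    reported in analytical validation (e.g. CLSI EP05-style studies)."""
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)  # sample standard deviation
    return 100.0 * sd / mean

# Hypothetical triplicate runs of the same control sample on two
# different days (illustrative qPCR Ct values only).
day1 = [24.1, 24.3, 24.2]
day2 = [24.6, 24.4, 24.5]

within_day = cv_percent(day1)
overall = cv_percent(day1 + day2)
print(f"within-day CV: {within_day:.2f}%  overall CV: {overall:.2f}%")
```

A full precision study would partition variance across runs, days, lots, and operators rather than pooling replicates, but the CV is the common currency in which those components are reported.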
Clinical validity evaluates the degree to which biomarker measurements are associated with a clinical condition or outcome. This requires evidence that the test meaningfully distinguishes between disease states or predicts future outcomes. Unlike therapeutic trials that assess intervention efficacy, clinical validity studies for biomarkers may employ case-control designs, enrichment models, or prospective-retrospective studies. Importantly, the choice of study design depends on the biomarker’s intended role, predictive, prognostic, or companion, and must align with its regulatory “context of use” as defined by the FDA and EMA.
Clinical utility addresses the final and often most demanding question: does the use of the biomarker lead to improved clinical decisions, better patient outcomes, or increased cost-effectiveness? It shifts the focus from correlation to consequence. Demonstrating utility typically requires health-economic modeling, comparative effectiveness studies, or decision impact analyses.
Together, these four pillars provide a comprehensive framework for biomarker validation, and I’ll go over them individually in the next sections.
3. Scientific Validity
Scientific validity refers to the extent to which a biomarker is plausibly linked to a biological process relevant to disease. It is the foundational layer of evidence in biomarker development, establishing the conceptual rationale that a measurable molecular or physiological parameter reflects a meaningful clinical state. This concept goes beyond statistical association; it demands mechanistic plausibility based on pathophysiology, molecular biology, or disease modeling. According to the European Medicines Agency (EMA), scientific validity forms the initial basis for determining whether a biomarker should be pursued in further development or regulatory qualification [7]. Without it, any downstream efforts in assay development or clinical validation are at risk of failure, as regulators and clinicians will lack confidence in the marker’s relevance to health outcomes. The importance of scientific validity has been repeatedly emphasized in the translational science literature: biomarkers that fail to demonstrate a credible biological link rarely survive the leap from academic publication to clinical practice [8,9].
For scientists trained in the hypothesis-testing paradigm, particularly those rooted in the reductionist views of molecular biology, this may represent a self-evident truth. But in the era of big data, establishing scientific validity has become increasingly complex. Many biomarker discovery efforts now begin with high-throughput statistical or machine learning analyses of large genomic datasets, rather than a specific biological hypothesis. Once patterns emerge, researchers often retrofit mechanistic plausibility using gene set enrichment analyses or pathway annotations—tools that offer interpretability but may not confirm biological causality. This reversal of the traditional hypothesis-testing model is increasingly common and sometimes problematic. Notably, several gene expression-based prognostic tools that are now approved to support specific decisions in the clinical management of breast cancer, such as MammaPrint, Oncotype DX, EndoPredict, and PAM50/Prosigna, originated from such agnostic approaches. Their gene signatures were derived from statistical pattern recognition across large patient cohorts without an a priori biological framework, yet each has demonstrated clinical utility. MammaPrint was developed through unsupervised clustering of expression data in young breast cancer patients [10], while Oncotype DX used regression modeling to select recurrence-associated genes in tamoxifen-treated cohorts [11]. EndoPredict and PAM50 similarly emerged from transcriptomic analysis linked to outcome prediction [12,13]. These tests validate the notion that statistical significance and clinical usefulness can exist even in the absence of clear mechanistic insight, but they also underscore the need to distinguish predictive performance from biological plausibility when evaluating new biomarkers.
My personal bias here has always been to exploit big data to generate hypotheses based on predictive power alone, independent of biological rationale, as this approach maximizes predictive power and is less likely to be limited by our ignorance of biological mechanisms. In some specific cases, I would argue that biological processes don’t exist in isolation: as long as we capture a snapshot of the status of a cell or tissue through a small number of biomarkers, then we are capturing the full status, even if we are not directly measuring the causal mechanisms of disease. There are many reasons to defend this view, for example the observation that many random gene expression signatures have comparable value in prognosticating liver cancer to the “best gene signature” [14], and, for the breast cancer signatures described above, one of the most data-driven signatures, Prosigna, is the one with the highest prognostic value [15] – also, while they are all very good at informing exactly the same clinical decision… they have few genes in common! However, I need to heed some caution. Having morphed into a bioinformatician early in my scientific career, I have been involved in the design and analysis of multiple high-throughput studies, and have come to realize that they are not all designed equally. Data from a badly conceived study may lead to incorrect conclusions. I was fortunate to have had the opportunity to interact with Sidney Brenner occasionally in the last decade of his life – the Nobel Prize-winning scientist who played a pivotal role in deciphering the genetic code, including demonstrating the triplet nature of codons and helping establish the concept of messenger RNA, and probably the most intelligent man I have ever met. A sentence he often repeated when discussing large-scale data sets, and the lack of reasoning that went into planning them properly, comes to mind: low input, high throughput, no output science.
Nonetheless, scientific validity is not an abstract academic requirement: it is an expectation enforced across the translational ecosystem. Regulatory agencies such as the FDA and EMA require a clearly articulated rationale linking biomarker biology to its proposed clinical role, especially when seeking formal qualification or use as a companion diagnostic. Health technology assessment bodies often demand evidence that biomarker targets are biologically stable and disease-relevant before accepting cost-effectiveness models or outcome projections. Perhaps more important, peer reviewers and clinicians, particularly in conservative fields like oncology and transplantation, demand biological coherence before accepting biomarker data as actionable – or before accepting your manuscript describing the wonderful biomarker performance for publication. Trust me, I’ve been there! So, even if you are a data-driven person (like me) with a technically sound or statistically significant test in hand, you will be met with skepticism or outright resistance if you cannot show biological plausibility.
4. Analytical Validation
Analytical validation defines the technical reliability and performance of a biomarker assay. It ensures that the test consistently and accurately measures the intended analyte under predefined conditions. In clinical diagnostics, analytical validation acts as a cornerstone for downstream clinical and scientific credibility. It answers the fundamental question: Does the assay work as a measurement tool? This means answering many questions: Does it measure what you think it measures? How little can it detect? Does it always provide the same measurement when faced with the same sample? Independently of the operator? And so on.
According to the U.S. FDA and the European Union’s In Vitro Diagnostic Regulation (IVDR), analytical validity must be demonstrated through documented studies of assay performance, covering key characteristics such as accuracy, precision, sensitivity, specificity, and reproducibility [16,17]. The scope of analytical validation depends on the nature of the biomarker and the context of use:
- For laboratory-developed tests (LDTs), this often involves establishing the assay’s dynamic range, limit of detection, linearity, and robustness under varied sample conditions. LDTs are evolving from a regulatory perspective, with increasingly intense validation requirements under the IVDR in Europe.
- For commercial assays undergoing regulatory approval, analytical studies must adhere to Good Laboratory Practices (GLP) and provide validation metrics in line with standards such as CLSI EP05 (precision) and EP17 (detection capability). This process ensures inter-laboratory reproducibility and guards against systematic bias.
Analytical validation, particularly for the more scientifically inclined, is neither fascinating nor motivating. It requires obsessive repetition and painstaking demonstration of… sameness. But it cannot be dismissed as a mere technical hurdle, as its implications are far-reaching: poor analytical performance can lead to false biomarker results, misclassification of patient subgroups, and ultimately, harm in clinical decision-making. Analytical validation is not just a nuisance that one performs to ensure compliance; it is a commitment to data integrity and patient safety.
But note, analytical validation is costly! In our own development of HepatoPredict, it is hard to extricate the costs allocated to this specific task, as we were building the manufacturing processes and quality system at the same time, but my best estimate, in a European context and with a lot of the typical Portuguese-style “do more with less”, places the analytical validation cost for HepatoPredict at $1 million. Published estimates however typically range from $2–5 million, with key cost drivers including sample size, technology complexity, required expertise, regulatory documentation, and laboratory/testing expenses [18,19].
I must also stress that analytical validation is not something you do once and it’s over. In fact, after the initial effort targeted at creating the technical files for the initial regulatory submission, we have revisited it time and time again. One example is in extending the range of equipment that is acceptable for HepatoPredict. This test is based on a real-time PCR assay, but not all machines are equal. For example, in our experience, the range of ΔCt we observed for exactly the same sample, same operator, and same batch of reagents differed between two instruments from two major makers of qPCR equipment, sitting next to each other on the same bench and running at the same time. So, as we find new potential clients and go over the lab qualification and adoption procedures, we have to come back to validating the assay for that specific equipment. So far, these inter-equipment differences have been minor, but they have already forced us to re-define sample acceptance thresholds based on the equipment used. In short, assays deployed in real-world clinical settings are subject to batch variation, reagent degradation, and instrument drift. Therefore, routine quality controls, calibration protocols, and re-validation are necessary components of maintaining analytical validity over time.
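The practical consequence of those inter-instrument ΔCt differences can be sketched as an instrument-calibrated acceptance check. Everything below is hypothetical – the offsets, threshold, and instrument names are invented for illustration and are not HepatoPredict’s actual parameters – but it shows why a sample sitting near the threshold can pass on one machine and fail on another until the offset is characterized.

```python
# Hypothetical sketch: a sample-acceptance threshold calibrated per
# qPCR instrument. All values are illustrative, not real parameters.

# Instrument-specific ΔCt correction derived from side-by-side runs
# of the same samples on each machine (illustrative offsets).
INSTRUMENT_OFFSET = {"maker_A_model": 0.0, "maker_B_model": -0.4}

MAX_DELTA_CT = 12.0  # illustrative acceptance threshold

def delta_ct(ct_target, ct_reference):
    """ΔCt = Ct(target) - Ct(reference); lower ΔCt means more target."""
    return ct_target - ct_reference

def sample_acceptable(ct_target, ct_reference, instrument):
    """Apply the instrument-specific offset before thresholding."""
    dct = delta_ct(ct_target, ct_reference) + INSTRUMENT_OFFSET[instrument]
    return dct <= MAX_DELTA_CT

# The same raw readings, near the threshold, diverge by instrument:
print(sample_acceptable(36.2, 24.0, "maker_A_model"))  # ΔCt ≈ 12.2 → False
print(sample_acceptable(36.2, 24.0, "maker_B_model"))  # ≈ 11.8 → True
```

The design point is that the correction lives in a lookup table populated only after each new instrument model has been through a bridging study, which is exactly the re-validation loop described above.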
5. Clinical Validation
Clinical validation represents the cornerstone of a biomarker’s real-world applicability. While scientific validity establishes biological plausibility and analytical validation ensures technical performance, it is clinical validation that demonstrates whether a biomarker reliably identifies or predicts a clinically relevant outcome in actual patients. This is the stage where a test must prove its value in clinical practice: can it identify disease early, stratify risk, guide treatment, or inform prognosis? Without rigorous clinical validation, even the most biologically insightful and analytically precise biomarker remains clinically useless. Regulatory authorities such as the FDA and EMA define clinical validity as the ability of a biomarker to correlate with the presence, absence, or risk of a specific condition or response, and consider it an essential prerequisite for diagnostic approval or clinical guideline integration [20,21]. To be blunt, clinical validation is what gets clinicians to even begin to consider your biomarkers, and this is where most of the time and money will go. It is important, in my experience, to have a very clear clinical validation roadmap, extensively discussed with relevant KOLs, as you must target their requirements for validation and not your own pre-conceptions of what is enough (yes… experience sucks!).
The question that will arise is: what type of clinical evidence will I need? Unlike therapeutic interventions, for which randomized controlled trials are the gold standard, biomarker validation relies on a more diverse and flexible hierarchy of evidence. This reflects both the diagnostic nature of biomarkers and the impracticality, or ethical complexity, of RCTs in many contexts. Several frameworks have been proposed to classify the strength of evidence supporting biomarker use. I’ll go with a specific one that stratifies clinical evidence into four classes of evidence strength [22,23], but I make no claim of its superiority over others, as this is just one way to organize strength of information. From strongest to weakest:
- Level I includes prospective studies where biomarker use demonstrably improves patient outcomes;
- Level II comprises prospective-retrospective or large retrospective studies with pre-specified analysis plans;
- Level III includes observational studies showing statistical associations with outcomes;
- Level IV consists of biologically plausible but untested or preliminary findings.
Systems such as the Tumor Marker Utility Grading System (TMUGS) and the Evaluation of Genomic Applications in Practice and Prevention (EGAPP) reflect similar logic, emphasizing that only biomarkers with high-level evidence, typically Level I or II, should inform clinical decision-making [24].
The strength of clinical validation depends not only on statistical metrics but also on the design of the supporting studies, which directly inform the level of evidence assigned to a biomarker. The most common starting point is the fully retrospective cohort study, where archival samples and clinical data are analyzed post hoc. While efficient, these studies are susceptible to bias, lack of pre-specification, and unmeasured confounding, often aligning with Level III evidence. Earlier I mentioned how clinical criteria for liver transplantation in hepatocellular carcinoma were based on single-cohort retrospective data: these were Level III evidence. In my experience, for biomarkers, no matter how strong your Level III evidence is, you will see no acceptance from the medical community. More robust are prospective–retrospective studies, in which samples and outcomes were collected prospectively for another purpose, but the biomarker analysis is planned and executed with a predefined protocol. This design, used in the clinical validation of Oncotype DX and PAM50 [25,26] for example, offers improved credibility and typically supports Level II evidence. At the top of the evidence pyramid are true prospective clinical validation studies, where the biomarker is tested in real time and integrated into patient care decisions. These studies, while rare due to cost and complexity, provide Level I evidence, particularly when they demonstrate improved clinical outcomes. Finally, external validation in independent cohorts is critical at any level to confirm generalizability and prevent overfitting [27,28].
The credibility of clinical validation is completely dependent on the rigorous assessment of a biomarker’s ability to distinguish between relevant clinical states. Core performance metrics include sensitivity (true positive rate), specificity (true negative rate), and area under the receiver operating characteristic curve (AUC), which quantifies discriminative ability. For binary classification tasks, positive predictive value (PPV) and negative predictive value (NPV) offer important insight, especially in populations where disease prevalence may affect test interpretation. In time-to-event contexts, metrics such as Harrell’s c-index, Kaplan–Meier survival stratification, and hazard ratios are commonly used. Importantly, biomarkers intended for prognostic use must demonstrate statistical independence from existing clinical predictors, often through multivariate regression or Cox proportional hazards models. These metrics form the statistical backbone of clinical validation but stop short of demonstrating clinical benefit. That next step, whether the test improves patient care, is addressed in the domain of clinical utility [29-31], discussed next.
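To make these definitions concrete, here is a minimal sketch computing the four 2×2-table metrics from hypothetical counts (an invented cohort of 200 patients, not HepatoPredict data). Note how PPV and NPV, unlike sensitivity and specificity, shift with disease prevalence.

```python
def binary_metrics(tp, fp, fn, tn):
    """Core clinical-validity metrics from a 2x2 confusion table."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return sensitivity, specificity, ppv, npv

# Illustrative counts: 100 diseased, 100 non-diseased patients.
sens, spec, ppv, npv = binary_metrics(tp=80, fp=10, fn=20, tn=90)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f}")

# Same test, lower prevalence (100 diseased vs. 900 non-diseased):
# sensitivity and specificity are unchanged, but PPV drops sharply.
_, _, ppv_low, _ = binary_metrics(tp=80, fp=90, fn=20, tn=810)
print(f"PPV at 10% prevalence: {ppv_low:.2f}")
```

AUC, c-index, and Cox models need survival data and are usually delegated to statistical packages, but this prevalence effect is worth internalizing early: it is often the first question a skeptical clinician asks about a screening-like use case.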
It is all too easy to obsess about performance metrics and forget that we are discussing, potentially, human lives. So at Ophiomics I always tried to discuss these metrics in terms of lives saved and lost, which I believe makes it all more real to the team. For HepatoPredict, for example, instead of thinking of sensitivity in absolute terms, I find it more useful to think that improving sensitivity by 5% means offering a curative-intent approach to 5 more people out of every hundred – people who would otherwise have no curative avenue ahead of them.
Everybody will require clear and strong clinical validation. Health authorities such as the U.S. FDA, the European Medicines Agency (EMA), and agencies overseeing the In Vitro Diagnostic Regulation (IVDR) require well-documented evidence that a biomarker consistently correlates with a defined clinical condition, prognosis, or treatment response. This correlation must be established with high precision in a population relevant to where the test will be applied – this is one of the stickiest issues, as every single clinician, early on, will ask “will this work on my patients?”, questioning results obtained from cohorts they deem irrelevant. This will be the source of endless frustration if you are validating “your biomarkers”. In parallel, health technology assessment (HTA) bodies and clinical guideline committees expect validation in representative real-world populations, often requiring Level I or II evidence before endorsing routine use. In most cases, this is not the language that will be used (levels of evidence), but it will be what is asked for. Clinical societies, especially in conservative specialties such as oncology or transplantation, will further demand that biomarkers be validated in multicenter or international settings to ensure generalizability. I strongly recommend that you attend a consensus conference by the relevant clinical society(ies) for your biomarker to understand the process. In practice, even a highly innovative and potentially effective biomarker will face resistance if it cannot be backed by robust, peer-reviewed, and reproducible clinical validation data. This is the de facto gatekeeper for regulatory approval, reimbursement, and clinical trust [32-34]. Brace for a long and frustrating process, complain as much as you want, but you need to do it – if your biomarker is important, you will be wanting to improve or save people’s lives, so you had better show that it really, really, really works!
6. Clinical Utility
We’ve been involved in many a discussion with clinical teams, and the question that always arises is: what will I do differently for my patients? Clinical utility refers to the extent to which a biomarker improves patient outcomes by influencing or altering clinical decision-making. It represents the final and most consequential tier in the biomarker validation framework: a biomarker may be biologically plausible, analytically accurate, and statistically significant, but if it does not impact care, its clinical value remains null.
Different types of biomarkers manifest utility in different ways: A prognostic biomarker has utility if it enables physicians to stratify patients based on risk and choose surveillance or intervention accordingly, thus improving timing or intensity of care. A predictive biomarker is clinically useful if it helps identify subgroups of patients who are more likely to respond (or not respond) to a given treatment, preventing both over- and under-treatment. A monitoring biomarker demonstrates utility when it allows clinicians to assess treatment effectiveness or disease progression over time, informing adjustments in therapy. A stratification biomarker shows value if it guides trial design or therapeutic decisions by allocating patients to distinct clinical pathways based on biological profiles. In all these cases, utility is demonstrated when use of the biomarker leads to improved outcomes, reduced harm, or more efficient resource allocation.
In practice, several types of studies are used to assess clinical utility. The most robust are prospective interventional trials, where patients are randomized to standard care versus biomarker-informed care. These provide Level I evidence but are rarely feasible due to cost, ethical complexity, or logistical constraints. More commonly employed are decision impact studies, which evaluate whether the availability of biomarker results leads to changes in clinician behavior or management. A well-known example is the DECIDE study, which evaluated the clinical utility of the Decipher genomic classifier in prostate cancer and showed that it influenced treatment choices by reducing the number of patients undergoing unnecessary adjuvant therapy [35].
Health economics is an inseparable part of clinical utility. A biomarker might guide optimal therapy, but unless it is also cost-effective, its adoption can be impeded. Health Technology Assessment (HTA) bodies increasingly demand models incorporating not just clinical outcomes but also cost-effectiveness ratios, quality-adjusted life years (QALYs), and real-world implementation feasibility. For example, a test that reduces chemotherapy in low-risk breast cancer may be considered valuable not only due to patient benefit but also cost savings to healthcare systems. Health economics studies require data, and ideally independent validation data, on which to be based. These studies are based on mathematical modelling of patient pathways with and without your test, assessing life quality and quantity gained or lost (measured in QALYs) and computing the associated costs to determine the ICER (Incremental Cost-Effectiveness Ratio). A good intervention is one that gains QALYs below an ICER threshold that represents the national health system’s or insurer’s willingness to pay, which is highly variable by geography. This is absolutely essential for reimbursement, which will be the crucible of success of most biomarker-based tests. As with everything else in medical research, this will be costly, and we’ve had quotes ranging from $200K USD to $1M USD for an independent study and preparation of reimbursement files. We decided to run our own health economics modelling in house prior to engaging external providers, to ensure that our test’s performance was already within an acceptable ICER range – it was! – which I believe is a good way of de-risking this expense.
A growing body of literature and regulatory guidance now recognizes real-world evidence (RWE) as a viable path to demonstrating utility. Observational studies in real-world clinical settings can demonstrate that biomarker-guided management leads to better clinical outcomes, fewer adverse events, or improved resource allocation. In fact, HTA bodies and payers often require economic models derived from such studies before agreeing to reimbursement: utility cannot be inferred, it must be demonstrated.
It is also critical to note that clinical utility is not a fixed concept. It depends on clinical context, available interventions, healthcare system structures, and even patient preferences. A biomarker test may offer clear value in a well-resourced tertiary center but fail to improve outcomes in a setting lacking access to targeted therapies. As such, utility must be demonstrated through continuous evaluation and context-specific adaptation. This is why many diagnostics enter the market under restricted indications and gradually expand as evidence builds across diverse populations and systems.
7. Clinical Validation: Ethics
It is impossible to discuss clinical evidence without making space for the issue of ethics. Because this became such an important issue in the development of HepatoPredict – after all, these are life-or-death decisions – I am dedicating a separate section to it.
Traditional clinical trial frameworks create significant ethical challenges when applied to biomarker-based in vitro diagnostics (IVDs). Clinical equipoise, i.e. the requirement for genuine uncertainty about treatment merits, becomes problematic for diagnostic validation, where the question shifts from treatment effectiveness to diagnostic accuracy. Research reveals seven logically distinct definitions of equipoise among stakeholders, creating inconsistent ethical standards [36]. Unlike therapeutic trials focusing on treatment efficacy, diagnostic trials must address the ethics of withholding potentially actionable diagnostic information from participants [36].
A central tension emerges around biomarker disclosure: participants may legitimately want their results, yet disclosure can cause harm when effective interventions are unavailable. In Alzheimer’s disease research, for example, experts argue against disclosing biomarker results to cognitively unimpaired participants due to limited clinical validity [37]. Ethical frameworks must balance return of individual results, participants’ data access rights, and transparent trial enrollment.
Biomarker trials face unique challenges that therapeutic frameworks cannot address adequately. Informed consent complexities arise because participants may not understand diagnostic versus therapeutic implications, and these become even more problematic when “the investigator is also the patient’s treating clinician.” Risk-benefit calculations for therapeutics don’t translate to diagnostics, where “risks” may be psychological while “benefits” are informational rather than therapeutic. Diagnostic contexts demand nuanced consent models that distinguish between knowledge provision and therapeutic intervention [38].
Research demonstrates that “context matters for successful implementation of medical interventions,” with effectiveness varying by healthcare systems and cultural factors. This creates ethical complexities when designing validation trials across diverse contexts [39]. Additional concerns include algorithm fairness, surveillance implications, and data privacy, issues that are distinct from therapeutic trial ethics but critically important in diagnostics.
Recognition of these limitations drives development of diagnostic-specific ethical approaches. Researchers advocate moving beyond traditional RCT frameworks, suggesting evidence-based models tailored to diagnostics. This includes adaptive consent procedures, flexible study designs, and integration of post-market surveillance [40].
Current frameworks provide “incomplete guidance” for biomarker-specific ethical challenges, including disclosure standards, appropriate control designs, and real-world evidence integration. The literature demonstrates clear recognition that traditional clinical trial frameworks inadequately address biomarker-based IVD validation ethics, driving development of diagnostic-specific frameworks balancing scientific rigor with ethical imperatives [40,41].
At Ophiomics we engaged an ethics mentor, initially at the behest of the European Commission – a recommendation by the evaluators of our EIC Accelerator grant. We later retained this mentor and continued the engagement, for example by organizing a full internal training day, and more are likely to come in the future. We’ve learnt a lot from the process and from having this mentor with us. Ethics also plays an important part in how we interact with our clinical research partners, beyond the obvious need for ethics approvals for any study we submit. In Europe, between the AI Act, the General Data Protection Regulation and, obviously, a multitude of ethical rules – written, interpreted or imagined – investing in understanding the ethics-related issues in our activity is not an option but a necessity.
8. Real-World Evidence
The evaluation of biomarkers increasingly depends on real-world evidence (RWE) as a complement, or even an alternative, to randomized controlled trials (RCTs). Understanding this begins with distinguishing real-world data (RWD) from RWE. RWD encompasses health-related data collected outside traditional clinical trials: electronic health records (EHRs), insurance claims, registries, laboratory systems, and patient-generated data. RWE, by contrast, is the clinical insight derived from rigorous analysis of RWD. According to the FDA, RWE is “the clinical evidence regarding the usage and potential benefits or risks of a medical product derived from analysis of real-world data”. In short: RWD is the raw input; RWE is the interpretable output used to support regulatory, clinical, and reimbursement decisions [42-47].
RWE plays a particularly vital role in biomarker validation, where highly controlled trials sometimes fail to reflect the realities of clinical settings. RWE allows evaluation of a test’s performance across diverse populations and practice contexts. It is especially relevant in establishing clinical utility, i.e. showing whether a biomarker meaningfully affects treatment decisions, outcomes, or costs. Observational studies, registries, and claims databases can highlight reductions in unnecessary procedures or treatment optimization, even when prospective trials are impractical.
For example, decision impact studies show whether test results change physician behavior, and real-world implementation studies assess how tests influence outcomes and cost-effectiveness. Health technology assessment (HTA) agencies increasingly accept RWE for modeling quality-adjusted life years (QALYs), budget impact, and incremental cost-effectiveness, which are critical steps in reimbursement approval.
However, generating robust RWE is not simple. It demands sophisticated epidemiological methods to control for confounding, bias, and missing data. Techniques like propensity score matching or causal inference are often required. This is not work to be done “on the side” by undertrained staff. It requires professionals with a deep understanding of both biomedicine and real-world data science. Unfortunately, such profiles are rare; what is not rare is the number of people who claim to be experts.
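To give a flavour of what propensity score matching involves, here is a deliberately toy sketch of its matching step. It assumes each patient's propensity score (the estimated probability of receiving the biomarker-guided pathway, typically obtained by logistic regression on confounders) has already been computed; all numbers are invented for illustration:

```python
# (propensity_score, outcome) pairs for treated and control patients;
# scores are assumed pre-estimated, outcomes are binary (1 = good outcome)
treated = [(0.81, 1), (0.62, 1), (0.35, 0)]
controls = [(0.80, 1), (0.60, 0), (0.33, 0), (0.10, 0)]

def match_nearest(treated, controls):
    """Greedy 1:1 nearest-neighbour matching on propensity score."""
    pool = list(controls)
    pairs = []
    for score, outcome in treated:
        # pick the unmatched control with the closest propensity score
        best = min(pool, key=lambda c: abs(c[0] - score))
        pool.remove(best)
        pairs.append(((score, outcome), best))
    return pairs

pairs = match_nearest(treated, controls)
# Average treatment effect on the treated: mean outcome difference
att = sum(t[1] - c[1] for t, c in pairs) / len(pairs)
print(f"matched pairs: {len(pairs)}, ATT estimate: {att:.2f}")
```

Real analyses add caliper restrictions, covariate balance diagnostics, and sensitivity analyses for unmeasured confounding, which is precisely why this is specialist work and not something to improvise.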
Regulators, including the FDA and EMA, now emphasize RWE in guidance documents, particularly for post-market surveillance and adaptive validation strategies. When carefully conducted and transparently reported, RWE enhances biomarker credibility and accelerates real-world adoption. But poorly designed or opportunistic RWE studies risk misleading conclusions and regulatory setbacks.
In short, RWE is a complementary and sometimes superior pathway for demonstrating real-world value. For biomarkers, whose performance often depends on complex implementation environments, RWE provides the bridge from potential to practical benefit. Obviously, you need users to adopt your product before you can start collecting RWD, and herein lies the reason most people developing diagnostic solutions will not even consider this early on. However, if one can think of ethically acceptable contexts for early use, this is an avenue where early investment will make sense.
9. Conclusion: Lessons and Roadmap
After years of navigating the long, winding path of biomarker development, from early discovery through to regulatory approval and clinical adoption, if there’s one overarching lesson, it’s that it will take a lot longer, a lot more money, and a lot more evidence than one imagines. Diagnostics have their own validation path, as I have tried to describe here. They are not miniature therapeutics, and we cannot simply repurpose frameworks, expectations, or regulatory routes developed for drugs and assume they’ll apply to biomarkers. Unfortunately, many in the medical establishment won’t understand the difference and will demand the same standard of evidence. Investors, on the other hand, will expect a speed in reaching market readiness and commercial success that is unreasonably at odds with the long validation pathways that, as for new drugs, characterize new diagnostics.
There are some aspects we left out of this discussion. Some will be considered in the chapter dedicated to discussing business models (the impact of pre-analytical variables and sample handling on assay reliability, the need for ongoing post-market surveillance and lifecycle management, as well as the integration of validated biomarkers into clinical workflows and decision support systems); others will get dedicated chapters (equity and representativeness of studied populations, and software as a medical device). Still, I hope this chapter provides some useful insight.
For those venturing into this field, here are the practical takeaways I wish I had fully appreciated when I started:
- All evidence is necessary: Scientific, analytical, and clinical validity are all necessary and none can be sacrificed; try to start on all fronts as early as you can.
- Design for utility from the start: Clinical utility isn’t just a bonus, it’s the goal. So, don’t wait until the end to think about how your biomarker will be used.
- Do not underestimate analytical validation: It’s thankless and technical, but without it, everything else is moot.
- Ethics are real: Especially in diagnostics, where information can harm as much as it helps. Engage patients and ethicists early.
- Real-World Evidence is not a shortcut: It’s a powerful complement, but it demands rigor and specialized skill. Doing it poorly is worse than not doing it at all.
- Communicate: Your audience may include clinicians, payers, regulators, patients. They all speak different languages. Learn to adapt.
- Don’t go at it alone: Unless you’re sitting on a multidisciplinary team of clinical researchers, biostatisticians, regulatory experts, reimbursement strategists, and lab operations specialists (and most startups aren’t), seek expert guidance early. The right advisors and consultants won’t just save time: they’ll prevent errors that can derail an otherwise good diagnostic. Hire advisors, sub-contract tasks, find mentors. But don’t improvise or try to bootstrap it; you will regret it sooner or later.
10. References
[1] Poste G. Bring on the biomarkers. Nature. 2011;469(7329):156–157. https://doi.org/10.1038/469156a
[2] Pepe MS, Etzioni R, Feng Z, et al. Phases of biomarker development for early detection of cancer. J Natl Cancer Inst. 2001;93(14):1054–1061. https://doi.org/10.1093/jnci/93.14.1054
[3] Kern SE. Why your new cancer biomarker may never work: recurrent patterns and remarkable diversity in biomarker failures. Cancer Res. 2012;72(23):6097–6101. https://doi.org/10.1158/0008-5472.CAN-12-3488
[4] U.S. Food and Drug Administration. Biomarker Qualification Program. https://www.fda.gov/drugs/cder-biomarker-qualification-program
[5] European Commission. Regulation (EU) 2017/746 on in vitro diagnostic medical devices. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32017R0746
[6] McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumor MARKer prognostic studies (REMARK). J Natl Cancer Inst. 2005;97(16):1180–1184. https://doi.org/10.1093/jnci/dji237
[7] EMA. Qualification of novel methodologies for drug development: Scientific advice and qualification of biomarkers. https://www.ema.europa.eu/en/documents/scientific-guideline/qualification-opinion-novel-methodology-biomarker-qualification-november-2020_en.pdf
[8] Kang B, Fan R, Cui C, Cui Q. Comprehensive prediction and analysis of human protein essentiality based on a pretrained large language model. Nat Comput Sci. 2025. https://www.nature.com/articles/s43588-024-00733-1
[9] Donaubauer AJ, Frey B, Weber M, et al. Defining intra-tumoral and systemic immune biomarkers for locally advanced head-and-neck cancer. Front Oncol. 2024. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427411/
[10] van ’t Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–536. https://doi.org/10.1038/415530a
[11] Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004;351(27):2817–2826. https://doi.org/10.1056/NEJMoa041588
[12] Filipits M, Rudas M, Jakesz R, et al. A new molecular predictor of distant recurrence in ER-positive, HER2-negative breast cancer. Clin Cancer Res. 2011;17(18):6012–6020. https://doi.org/10.1158/1078-0432.CCR-11-0926
[13] Parker JS, Mullins M, Cheang MCU, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160–1167. https://doi.org/10.1200/JCO.2008.18.1370
[14] Itzel T, Maass T, Munker S, et al. Random gene sets in predicting survival of patients with hepatocellular carcinoma. J Mol Med. 2019;97:1463–1472. https://link.springer.com/article/10.1007/s00109-019-01764-2
[15] Dubsky P, Brase JC, Jank P, et al. Comparative survival analysis of multiparametric tests—when molecular tests disagree: A TEAM Pathology study. NPJ Breast Cancer. 2023. https://www.nature.com/articles/s41523-023-00530-w
[16] FDA. Biomarker Qualification: Evidentiary Framework. U.S. Food & Drug Administration. https://www.fda.gov/drugs/cder-biomarker-qualification-program/biomarker-qualification-evidentiary-framework
[17] European Commission. In Vitro Diagnostic Medical Devices Regulation (IVDR). Regulation (EU) 2017/746. https://health.ec.europa.eu/system/files/2022-10/md_ivdr_en_0.pdf
[18] Booz Allen Hamilton. The Cost of Biomarker Development. U.S. Department of Health and Human Services; 2016. https://aspe.hhs.gov/sites/default/files/private/pdf/257926/BiomarkrCost.pdf
[19] Poste G. Bring on the biomarkers. Nature. 2011;469(7329):156–157. https://doi.org/10.1038/469156a
[20] FDA. Biomarker Qualification Program. https://www.fda.gov/drugs/cder-biomarker-qualification-program
[21] EMA. Qualification of Novel Methodologies for Drug Development. https://www.ema.europa.eu/en/human-regulatory/research-development/scientific-advice-protocol-assistance/qualification-novel-methodologies-medicine-development
[22] Pepe MS, et al. Phases of biomarker development for early detection of cancer. J Natl Cancer Inst. 2001;93(14):1054–61.
[23] Simon RM, Paik S, Hayes DF. Use of archived specimens in evaluation of prognostic and predictive biomarkers. J Natl Cancer Inst. 2009;101(21):1446–1452.
[24] Hayes DF, et al. Tumor Marker Utility Grading System: A framework to evaluate clinical utility of tumor markers. J Natl Cancer Inst. 1996;88(20):1456–66.
[25] Paik S, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004;351(27):2817–2826. https://doi.org/10.1056/NEJMoa041588
[26] Filipits M, et al. Predicting distant recurrence in receptor-positive breast cancer patients with limited clinicopathological risk: using the PAM50 Risk of Recurrence score in 1,478 postmenopausal women of the ABCSG-8 trial. Ann Oncol. 2014;25(2):339–345. https://doi.org/10.1093/annonc/mdt494
[27] McShane LM, et al. REporting recommendations for tumor MARKer prognostic studies (REMARK). J Natl Cancer Inst. 2005;97(16):1180–4.
[28] Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers. J Clin Oncol. 2005;23(29):7332–41.
[29] Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19(4):453–73.
[30] Pencina MJ, et al. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157–72.
[31] Vickers AJ, Elkin EB. Decision curve analysis: A novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–74.
[32] FDA. In Vitro Companion Diagnostic Devices. https://www.fda.gov/media/119851/download
[33] EMA. Guideline on Clinical Evaluation of Diagnostic Agents. https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-clinical-evaluation-diagnostic-agents_en.pdf
[34] Schwarzer R, et al. Health technology assessment of biomarkers. Int J Technol Assess Health Care. 2013;29(3):300–7.
[35] Badani KK, Thompson DJ, Buerki C, Davicioni E, Garrison JC, Amling C, et al. Impact of a genomic classifier of metastatic risk on postoperative treatment recommendations for prostate cancer patients: a report from the DECIDE study group. Urology Practice. 2019;6(2):88–95. https://doi.org/10.1097/UPJ.0000000000000055
[36] Miller FG, Brody H. A critique of clinical equipoise: therapeutic misconception in the ethics of clinical trials. Hastings Cent Rep. 2003;33(3):19–28.
[37] Grill JD, Karlawish J. Disclosing Alzheimer Disease Biomarker Results to Research Participants. JAMA Neurol. 2022 Jul 1;79(7):645-646. doi: 10.1001/jamaneurol.2022.1307. PMID: 35666532.
[38] Resnik DB. The ethics of research with human subjects: protecting people, advancing science, promoting trust. Springer, 2018.
[39] Heintz E, Lintamo L, Hultcrantz M, et al. Framework for systematic identification of ethical aspects of healthcare technologies: the SBU approach. Int J Technol Assess Health Care. 2015;31(3):124–130. https://doi.org/10.1017/S0266462315000264
[40] Makady A, et al. Real-world evidence for health technology assessment of cancer medicines: A stakeholder analysis. Value Health. 2018;21(10):1229–1236.
Makady A, van Veelen A, Jonsson P, Moseley O, D’Andon A, de Boer A, Hillege H, Klungel O, Goettsch W. Using Real-World Data in Health Technology Assessment (HTA) Practice: A Comparative Study of Five HTA Agencies. Pharmacoeconomics. 2018 Mar;36(3):359-368. doi: 10.1007/s40273-017-0596-z. PMID: 29214389; PMCID: PMC5834594. https://pubmed.ncbi.nlm.nih.gov/29214389/
[41] Beauchamp TL, Childress JF. Principles of biomedical ethics. 7th ed. Oxford University Press; 2013.
[42] U.S. Food and Drug Administration. Framework for FDA’s Real-World Evidence Program. 2018. https://www.fda.gov/media/120060/download
[43] Makady A, de Boer A, Hillege H, Klungel O, Goettsch W. What is real-world data? A review of definitions based on literature and stakeholder interviews. Value Health. 2017;20(7):858–865. https://doi.org/10.1016/j.jval.2017.03.008
[44] Sherman RE, et al. Real-world evidence – what is it and what can it tell us? N Engl J Med. 2016;375(23):2293–2297. https://doi.org/10.1056/NEJMsb1609216
[45] Berger ML, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Value Health. 2017;20(8):1003–1008. https://doi.org/10.1016/j.jval.2017.08.3016
[46] Corrigan-Curay J, et al. Real-world evidence and real-world data for evaluating drug safety and effectiveness. JAMA. 2018;320(9):867–868. https://doi.org/10.1001/jama.2018.10136
[47] Malone DC, et al. Real-world data: a new gold standard for health care research? J Manag Care Spec Pharm. 2018;24(10):968–972. https://doi.org/10.18553/jmcp.2018.24.10.968