Modelling lets evaluators test-drive change safely and cheaply, using a diversity of non-RCT evidence

by Sally Brailsford

Enhanced decision-making, blue-skies thinking and quick trials of hypotheses are all much easier if modelling is in your evaluation tool kit, explains Sally Brailsford

Everyone thinks that they know what a model is. But we all have different conceptions. I like the definition from my colleague Mike Pidd, from Exeter University. He sees a model as ‘an external and explicit representation of a part of reality’.  People use it ‘to understand, to change, to manage, and to control that part of reality’.

We tend to acknowledge the limitations that models have, but fail to fully appreciate their potential.  ‘All models are wrong,’ as George Box said, ‘but some are useful’.

I work in Operational Research. It’s a tool kit discipline. In one part, we make use of statistics, mathematics and highly complex algorithmic models. In another, we draw pictures and play games. I use these elements to create simulation – I build a model in a computer which replicates a real system and then we can play ‘what if’ with it.

Models inform decision-making

I use models mainly for informing decision-making. Sometimes, they don’t actually need much data to be very useful. For example, there is a famous model about optimal hospital bed occupancy, created by Adrian Bagust and colleagues at Liverpool University’s Centre for Health Economics.  It includes some numbers but they are not based on any specific hospitals. It shows that if a hospital tried to keep all its beds fully occupied, then some patients would inevitably have to be turned away.

The model varies patient arrivals as occupancy increases and demonstrates how often the hospital has to turn away emergency patients. It shows that hospitals deemed inefficient, because they occasionally have empty beds, are actually operating effectively. The finding really influenced policy. It showed that, as a hospital reaches about 85 per cent occupancy, it is increasingly likely to have to turn emergency patients away. It is a simple model. It did not involve long-running, expensive randomised controlled trials. Yet it provided vital evidence and was powerful in influencing occupancy targets.
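To make the occupancy argument concrete, here is a minimal sketch. It is not the Bagust simulation: it swaps in the textbook Erlang loss formula, which assumes random (Poisson) arrivals and that any patient who finds every bed full is turned away. The 200-bed ward and the demand levels are hypothetical values chosen purely for illustration.

```python
def erlang_b(beds: int, offered_load: float) -> float:
    """Probability that an arriving patient finds every bed full.

    offered_load = arrival rate x average length of stay,
    i.e. the average number of beds that would be in use if no one were turned away.
    """
    blocking = 1.0  # with zero beds every arrival is turned away
    for k in range(1, beds + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    return blocking


def occupancy(beds: int, offered_load: float) -> float:
    """Average fraction of beds in use once turned-away patients are excluded."""
    return offered_load * (1.0 - erlang_b(beds, offered_load)) / beds


BEDS = 200  # hypothetical ward size, not taken from any real hospital

for load in (150, 170, 190, 210):  # hypothetical demand levels
    print(
        f"demand {load:>3}: occupancy {occupancy(BEDS, load):.1%}, "
        f"turned away {erlang_b(BEDS, load):.2%}"
    )
```

Under these assumptions the chance of turning a patient away climbs as average occupancy rises, which is the qualitative pattern described above. The sketch makes no claim about where the threshold sits for a real hospital.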

30-year clinical trial in five minutes

In another model, we looked at patients with diabetes at risk of developing retinopathy. Everyone agreed that it was a good idea to screen patients with diabetes to prevent retinopathy before it leads to blindness. However, there was a whole range of screening practices. We used data from all over the place, from the US and from the UK. The model followed patients with diabetes through the life course and through different progression stages.

We had to draw data from very early studies because it would be unethical to conduct a clinical trial that did not treat people according to best practice. We then adapted the model for different populations, with varying ethnic mixes and probabilities of diabetic incidents. We superimposed on the model a range of different screening policies to see which was most cost-effective. In effect, once we felt confident that the model was valid, we could run a clinical trial on a computer in five minutes rather than running a real clinical trial for 30 years. As a result, we discovered really valuable findings.

The beneficial difference between all the various techniques and screening programmes proved to be minor compared with the large impact of more people being screened. We realised that raising attendance, perhaps by social marketing, offered much better value than buying expensive equipment.
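The logic behind that finding can be sketched with a toy cohort model. This is not our retinopathy model: every number below (onset and progression rates, attendance and sensitivity levels, the cohort size) is a hypothetical placeholder, and the sketch assumes that detected patients are treated and never go blind. It simply follows a cohort for 30 simulated years under three screening policies.

```python
def blind_after(years: int, attendance: float, sensitivity: float,
                p_onset: float = 0.05, p_progress: float = 0.10,
                cohort: float = 10_000.0) -> float:
    """Expected number of people blind after `years` of annual screening.

    All rates are hypothetical. A case is detected in a given year only if the
    person attends screening AND the test picks it up.
    """
    healthy, undetected, treated, blind = cohort, 0.0, 0.0, 0.0
    p_detect = attendance * sensitivity
    for _ in range(years):
        new_cases = healthy * p_onset
        detected = undetected * p_detect
        progressed = (undetected - detected) * p_progress
        healthy -= new_cases
        undetected += new_cases - detected - progressed
        treated += detected  # assumption: treatment fully prevents blindness
        blind += progressed
    return blind


# Three hypothetical policies, each run as a 30-year "trial"
policies = {
    "baseline (50% attend, 80% sensitive test)": (0.50, 0.80),
    "better test (50% attend, 95% sensitive test)": (0.50, 0.95),
    "better attendance (80% attend, 80% sensitive test)": (0.80, 0.80),
}
for name, (attendance, sensitivity) in policies.items():
    print(f"{name}: {blind_after(30, attendance, sensitivity):.0f} blind")
```

With these made-up inputs, raising attendance prevents more blindness than upgrading the test, because detection depends on the product of the two. The real model was far richer, but the mechanism is the same.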

Guiding design of hypothetical systems

The next model is even more hypothetical. Three engineers had an exciting, blue skies idea for patients with bipolar disorder. What if, they asked, different sensors tracked a person’s behavioural patterns and, having established an individual’s ‘activity signature’, could spot small signs of a developing episode that would trigger a message that the person might need help?

We expected, rightly, that success depended on what monitoring individuals could tolerate – perhaps a bedside touch-sensor mat, a light sensor in their sitting room, sound sensors or GPS. We built these different possibilities into the model. We could also check how accurate the algorithms would have to be, if this technology was developed. So we were guiding design of a hypothetical system.

Many, particularly those from clinical backgrounds, find it hard to accept that modelling can provide evidence upon which to make a major decision. People often expect the same kind of statistical evidence as from randomised controlled trials. Modelling does not claim to provide that level of certainty. It is a decision-support tool, helping you understand what might happen if you do something.

Appreciate modelling advantages

We should recognise the advantages of models. They are quick and cheap – you can run a clinical trial that could last decades in a matter of minutes. If you lack confidence statistically in your model, there are solutions: expert opinion and judgement can help fill the gaps. A model allows people to talk about issues in a policy setting and to articulate their assumptions. Quite often the conversations along the road are more important than the eventual model and the model is just a means to that end.

As in the bipolar project, you can model innovations that don’t even exist. So I often use modelling when hospitals are redesigning a system or a service. The development does not exist yet, so there are no data – you must gather all the available evidence you can and build it into your model. It lets you explore more than when using traditional methods because your assumptions can be more flexible.

Collecting primary data is hugely expensive, sometimes impossible.  You can consider all sorts of options that it would be unethical to explore in reality. As the bed occupancy model shows, the findings can be powerful and influential.

There is a saying that, if all you have is a hammer, then every problem is a nail. As researchers, we should avoid being confined by preferred methods, whatever our discipline. Modelling can be a valuable research tool.

Sally Brailsford is Professor of Management Science at the University of Southampton. Her blog is based on her presentation on 4 July 2014 at PIRU’s Conference: ‘Evaluation – making it timely, useful, independent and rigorous’.

‘Different contexts should not be allowed to paralyse wider roll-out – some differences don’t really matter.’

by Mark Petticrew

Interventions that succeed in some instances may or may not work in other circumstances. You have to consider whether the contextual differences really are ‘significant’, says Mark Petticrew

How important is the particular context of a policy intervention in deciding whether that intervention can work elsewhere? The answer must lie in the significance of the context. Every place is different. Every time is different. Everybody is different. The important question must be: which differences really matter, which are actually significant? We should avoid mistakenly thinking that the inherent uniqueness of everything means that a particular intervention will never work elsewhere. It might still be generalisable and transferable elsewhere.

Similarity and uniqueness

It is, of course, highly implausible that interventions work the same way across different contexts. Nevertheless, it is equally implausible that evidence collected in one context has no value for another. These polar positions are unhelpful because neither is true. (‘We are all individuals,’ shouted the crowd to Brian in the Monty Python movie. ‘I’m not,’ said a lone dissenter.) Clearly, all individual study contexts are different, but there may be similarities.

Similarity and portability across apparently very different contexts were aptly illustrated to me when I was involved in housing research. The earliest controlled trial of a housing improvement intervention was done in Stockton on Tees in 1929. Families were moved out of the slums, which were then demolished, and moved into new housing. Unexpectedly, many people’s health deteriorated.

This type of intervention is common today.  Urban improvement accompanied by large-scale housing regeneration occurs frequently. However, the context is very different from 1929.  In those days, poverty was probably more widespread, as was slum housing. Yet, more recently the same unanticipated adverse effect has been found in one study, with a minority of people’s health deteriorating when their housing improves. Although the context looks very different, the underlying mechanisms seem to be the same, namely that, when the housing is improved, rents rise and so people scrimp on their diets and their health gets worse.

Another field where the same mechanisms apparently work across different contexts is smoke-free legislation which aims to restrict the impact of second-hand smoke in work and public places. This has been evaluated at least 11 times in very different contexts. When the issue reached the UK, critics, often in the hospitality industry, said this might have worked in these other countries but it wasn’t going to work in pubs in Glasgow, say, or in London. The same arguments were raised around the implementation of smoke-free legislation in Ireland, that these are very different contexts, that people’s drinking and smoking were wedded. Yet, in fact, the success of implementation has been broadly similar across many different states and countries.

Aspects of context that matter

In short, predicting the generalisability of an intervention is all about understanding the significance of context. So the first step must be to reflect on which aspects of contexts might really matter. A lot of checklists to help this task have been put together. Dr Helen Burchett from the London School of Hygiene and Tropical Medicine has reviewed dozens of these frameworks which are used to help users to judge whether evidence collected in one setting might be applicable in another context.  Her study found that there are 19 categories of context that might be important and a few more can probably be added.

Some of the work that we have been doing as part of the NIHR School for Public Health Research has been particularly enlightening around economic contexts. Local practitioners tell us that the current economic climate has been a big constraint not only on the use of evidence by, for example, local government, but also on evaluation itself, which is often seen as a luxury.

However, as I have tried to show, context always varies and simply pointing out the differences is not sufficient. You have to determine, or sometimes make assumptions about, which of these variations actually matter – which are likely to be clinically or socially significant. How do you do this? This assessment should be informed by at least three considerations. First, there is knowledge of the existing evidence, which helps one discover whether and how the intervention has worked in other settings. Second, understanding the underlying theory and assumptions about how the intervention works and is moderated can be helpful. Finally, one can draw on the judgement of experts, practitioners and policy makers who might have insights into whether one context is significantly different from another.

There is a lot more scope for research in this field. For example, there may be classes of interventions that are less context-dependent than others. Smoke-free legislation with its 11 evaluations would be a case in point, and suggests that perhaps regulatory interventions are less affected by context than interventions that require more individual behavioural change.

Context and interventions intertwined

We may also need to revise our sometimes simplistic view of the relationship between context and intervention. There is a tendency to see context merely as a moderator, something that interferes with an intervention in some way. Yet there are many situations and policies where the intervention is the context. The intervention changes the nature of the system in some way so that the intervention and the context are, in effect, the same thing.  This makes defining the start and the end of an intervention and its boundaries – and thinking about how you evaluate it – hugely challenging.

The significance of context in generalisability also places question marks against the culture of systematic reviews. During such reviews, researchers aim to put all the evidence together from interventions and attempt to discern a single effect based on everything that is known about an issue. It is an attempt to separate the ‘things that work’ from the ‘things that don’t work’ and identify an overall effect size. This may be problematic because, during this process, the context that produces that effect usually gets stripped away. As a result, in the process of producing evidence, we lose the context.

As researchers we also have a tendency to see the world in terms of studies of ‘magic bullets’ which tell us that, if things work, then they work everywhere. However, at least in public health, we are increasingly putting together assemblages of evidence from different contexts that show what happened when those interventions were implemented in different places to guide future decision makers. This is very different from saying simply that something always ‘works’.  It might be more helpful to see the wider goal of collecting evidence as being to inform decisions, rather than to simply test hypotheses. This may be one way forward to make proper sense of context, rather than trying either to eradicate it or allowing its uniqueness to rule out the possibility that an intervention can be transferred across time and space.

Dr Mark Petticrew is Professor of Public Health Evaluation at the London School of Hygiene and Tropical Medicine and a member of PIRU. He is also a co-director of the NIHR School for Public Health Research at LSHTM.


‘Research units are performing a difficult balancing act … but we’re still smiling.’

by Nicholas Mays

Our ambition to co-produce evidence with advisors and officials is fraught with challenges, but remains a worthy goal with valuable benefits, explains PIRU director, Nicholas Mays.

When PIRU was set up three and a half years ago, there was a great deal of ambition on all sides. The Department of Health, as funder, wanted us ‘to strengthen the use of evidence in the initial stages of policy making’. That was the distinctive, exciting bit for us. We were to support or undertake evaluation of policy pilots or demonstration initiatives across all aspects of the Department’s policy activity – public health, health services and social care.

We were also brave, seeking to ‘co-produce’ evidence by working closely with policy advisors and officials, aiming to break down conventional sequences in which evaluation tends to follow policy development. We wanted early involvement from horizon scanning to innovation design and implementation design, plus support work for evaluations or to do them ourselves. It was clear that if we could be engaged, flexible and responsive, officials would be more likely to work with us.

Some researchers prefer planned, longer term work. They see the responsive element as regrettably necessary to pay the mortgage. In fact, our more responsive work has often turned out to be the most interesting:  some of it we would probably have planned to do in any case; other parts have led to substantial pieces of research. It can be highly productive, not least because policy advisors are fired up about the findings.

Wide-ranging roles

In our first years, we have tried hard to work across all stages of policy development. To support the early stages of policy innovation, we did some rapid evidence syntheses.  We have advised on the feasibility of a number of potential evaluations – for example, we looked at the Innovation Health and Wealth Strategy to examine which of the strategy’s 26 actions could credibly be evaluated. We have advised on the commissioning and management of early stage policy evaluations. We have also helped define more precisely what the intervention is in a particular pilot because, in pilot schemes or demonstrations, the ‘what’ is often presumed, but can actually be rather unclear.

We had expected to guide roll-out, using the learning from evaluations, but that’s not always easy for academic evaluators. PIRU often works with different parts of the social care and health policy system, perhaps for quite short periods of time, which is a very different relationship from working, say, with clinicians for an extended period.  Also, in policy and management, unlike the clinical world, people change jobs fairly frequently making it difficult to sustain relationships.

We have also advised on modelling and simulation, which is useful for playing out possible effects of innovations and to debate potential designs. However, that work typically tends to happen within government rather than through outsiders such as PIRU.


Indeed, we have found it difficult to become involved in the early stages of policy development, partly because health and social policy decision-making in England has been restructured and become more complicated as a result of the Health and Social Care Act 2012. There are new agencies and new people, altering long-established relationships between policy makers and evaluators.

Engaging us early on is also demanding. It requires greater openness and communication within government, so that research managers actually know when an initiative is starting, and a willingness to share early intelligence with outsiders in the research community. Some policy makers also find that the perceived benefits of sharing new thinking with us fail to outweigh the perceived risks of having us at the table early on.


There have been other big issues. How close should evaluators get to those who commission an evaluation? How candid – and sometimes negative – should we be?  Should we refuse to do an impact evaluation because we know that too little time will be allowed to elapse to demonstrate a difference?  Should we actively create dissonance with customers who are also funders through a process of constructive challenge? Strangely, the researchers are sometimes the ones saying, ‘No, we should not be looking at outcomes. You are better doing a process evaluation or no evaluation at this stage.’ In some cases, the researchers are asking for less evaluation and the policy makers are asking for more.

Can it be predicted that certain pilots do not realistically lend themselves to being evaluated? For example, we conducted a study of a pilot scheme allowing patients to either visit or register with GP practices outside the area in which they live.  We highlighted in our report that we couldn’t look at the full range of impacts in the 12 months for which the pilot ran.  Nevertheless, critics of the policy were annoyed with the evaluation because it was seen to legitimise what was, in their minds, an inadequate pilot of a wrong-headed policy.

We frequently have to say that the policy pilot will take a lot longer than expected to be implemented. However, the commissioners of evaluation often have no time to wait and want the results right away. The danger is that lots of time is spent interviewing people and looking for implementation effects, only to discover that not very much has happened yet.

So we face many challenges. But that’s hardly surprising. In an ideal world, we would have closer sets of relationships with a defined set of potential users. In reality, we are working across a very wide range of policy issues with an overriding expectation that we should engage at an early stage and speedily. It’s a difficult but rewarding balancing act.

Nicholas Mays is Professor of Health Policy at the London School of Hygiene and Tropical Medicine and Director of PIRU. This piece is based on a presentation that Professor Mays gave at the meeting ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014, organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).


Evaluations should share interim findings only when pieces of research are completed and on terms that are agreed in advance

by Oona Campbell

A global maternal health initiative that could save thousands of lives has highlighted dilemmas for those assessing its performance, says Oona Campbell


It is difficult to overestimate the urgency of improving maternal health in developing countries. Women die in childbirth or from complications during pregnancy, day in and day out. Some 99 per cent of maternal deaths are in the developing world – these tragic mortality figures are the public health indicator showing the greatest gulf between rich and poor countries. Most of these deaths occur during labour or within 24 hours after delivery, typically because of excessive bleeding.

So it is extremely important to us to have been asked to evaluate MSD’s 10-year, $500 million MSD for Mothers initiative, designed to create a world where no woman dies giving life. As the team chosen to evaluate parts of the initiative, we had to think about which interim findings to share, when and how and with whom. We appreciate the importance of communicating findings quickly, but it is vital that evaluation is independent and that learning is robust. We wonder whether communicating too soon or too frequently would undermine independence. Right now, our view is that we should not wait until the study ends to detail some interim findings. But we plan to share only completed pieces of research with clear protocols and objectives.

Why evaluate?

MSD for Mothers focuses on two leading causes of maternal mortality – post-partum haemorrhaging and pre-eclampsia. There are a number of priority countries – among them we focus on work in India and Uganda. It is a big initiative, with multiple pillars: product innovation, global awareness and advocacy and also numerous projects aiming to improve access to affordable, quality care for women.

Why was the company interested in having an evaluation? They sought an independent assessment of their contribution to maternal mortality reduction and to identifying sustainable solutions. They wanted guidance in their existing strategy and to ensure they were investing in high impact programmes. From the policy perspective the aim is to contribute to the evidence base for better decision-making globally and to have robust research available through publications in peer reviewed journals.

The difficult issue for us was how we might contribute to guiding the existing strategy, ensuring investment in high impact potential projects. When do we provide input? What do we do? How do we do it? Does this affect our ability to be independent? If we get that involved in programme design, will it affect our ability to do robust evaluation? How do we work with the implementers who are actually doing the projects? Will they continue to work with us if we share interim findings? How does all this affect our ability to be relevant?

Trying to be helpful

Our initial thought was that we wanted our evaluation to be used. Too few resources go into women’s maternal health in low income countries, so we certainly did not want to say, at the end of 10 years: ‘No, it didn’t work.’ We wanted to maintain a dialogue with policy makers, commissioners of research and the implementers. As Tom Woodcock from NIHR CLAHRC Northwest London has explained, this can be very successful. So there was an assumption that we should be responsive, engaged and give feedback. But when should we do this? What does it mean to be ‘maximally responsive’ and should policy makers – or in this case the funder – have sight of the interim findings?

Our approach

We are using a multi-disciplinary, mixed method approach that tries to capture the scope and range of the activities. Our basic approach is to work with MSD to identify overarching questions and then to work with the implementers, usually non-governmental organisations in specific countries, to understand what they are trying to do. Then we identify projects to evaluate and agree key evaluation questions. Within that, we try to understand exactly what people are doing, the theory of change and how the implementer thinks it is going to work. Where possible, we like to recommend ways to design their implementation that allow for evaluation, but typically that’s not possible. Then we provide technical support to improve the rigour of monitoring in specific projects and we aim to use robust, analytical methods for the independent evaluation, including by gathering further data.

Guidance from global literature on sharing of findings tends to be vague, but mentions sharing interim findings. The Centers for Disease Control and Prevention is probably most explicit saying: ‘It’s important to use the findings that you learn all along the way because if we don’t, opportunities are missed, if you wait until the very end of your evaluation to use some of those results. And sometimes those key nuggets of information in terms of interim findings may not necessarily be captured in that final report, so it’s important to use them as you learn about them.’

Clinical trials have formal mechanisms for interim findings. Data monitoring committees look at elements of implementation such as adequacy of enrolment, as well as trial endpoints and adverse events. Insights from clinical trials tend to focus on ethical obligations to stop trials early to reduce study participants’ exposure to inferior treatment. But there is also concern that multiple interim analyses of accumulating data can find differences when actually there are none.
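The concern about multiple interim analyses can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any real trial: the data are simulated with no true difference between arms, each observation is a difference with known unit variance, and the 1.96 cut-off, batch size and number of looks are hypothetical choices. The function name is ours.

```python
import math
import random


def false_positive_rate(looks: int, n_per_look: int = 40, trials: int = 2000,
                        z_crit: float = 1.96, seed: int = 1) -> float:
    """Share of simulated null trials declared 'significant' at any look.

    Data are analysed after every batch of n_per_look observations and the
    trial stops at the first look where |z| exceeds z_crit.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(looks):
            for _ in range(n_per_look):
                total += rng.gauss(0.0, 1.0)  # true effect is zero
            n += n_per_look
            z = total / math.sqrt(n)  # z-statistic for the accumulated data
            if abs(z) > z_crit:
                hits += 1
                break
    return hits / trials


print(f"one final analysis:  {false_positive_rate(1):.1%} false positives")
print(f"five interim looks:  {false_positive_rate(5):.1%} false positives")
```

Checking the accumulating data five times with the same unadjusted threshold produces noticeably more spurious 'differences' than a single final analysis, which is why trials pre-specify how interim looks are handled.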

We wish to learn from these approaches, but in terms of our evaluation, an important consideration is the multiplicity of interventions underway in this wide-ranging programme.  The application of interim findings about simple interventions, such as those usually tested in clinical trials, is more straightforward. Imagine evaluating interventions to reduce the incidence of maternal tetanus. The interventions might be ensuring clean delivery – because an unhygienic birth environment exposes a mother potentially to tetanus spores – plus immunization with sufficient doses of tetanus toxoid to prevent the onset of maternal tetanus.

But what about a complex intervention where you are trying to change maternity care?  A huge range of interventions are required, including, for example, health worker training, changes to ambulance services, accreditation of private providers, behavioural change communication, health insurance etc. This programme might involve a long complex causal chain with feedback loops and multiple groups of individuals. Is an interim finding on one aspect a solid basis for changing the implementation?

Uses of interim findings

There is also a wide variety of potential purposes for interim evaluation. They might include: stopping a complex intervention that is harmful; proclaiming success and rolling out elsewhere; improving implementation of the intervention; changing the intervention and bolstering failing or problematic parts; ensuring politicians and policymakers remain engaged; and responding to a need for quick results.

We’re trying to better understand how to share interim findings and with whom. Should it be with programme implementers, policy makers, funders or others? Or, perhaps, all of them? Should it be simply findings on implementation or just on outputs and impacts? Our ‘interim conclusion’ on the interim findings is that we certainly do not have to wait until the end of the programme and we do want to communicate some research. But, by and large, we will be clear that these will only be completed pieces of research, with clear protocols and objectives, formally specified before implementation begins.

Dr Oona Campbell is Professor of Epidemiology and Reproductive Health at the London School of Hygiene and Tropical Medicine. This piece is based on a presentation that Professor Campbell gave at the meeting ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014, organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).

Rapid, real-time feedback from evaluation, as well as programme flexibility, is vital to health service improvement

by Tom Woodcock


We should learn from Melanesian Islanders that feedback provides the deep understanding of interventions which is key to wide-scale, successful roll-out, says Tom Woodcock

Osmar White, the Australian war correspondent, told an amazing story about how, after the Second World War, when military bases had closed in the Pacific, Melanesian Islanders built crude imitation landing strips, aircraft and radio equipment, and mimicked the behaviour that they had observed of the military personnel operating them. White’s book, ‘Parliament of a Thousand Tribes’, explains how the islanders were trying to reproduce the glut of goods and products that Japanese and American combatants had brought to the region. They believed that they could actually summon these goods again.

In her recent paper published in Implementation Science*, Professor Mary Dixon-Woods of Leicester University highlights this extraordinary story as a graphic illustration of how an innovation fails to be replicated successfully in different circumstances because there is poor understanding of the original intervention. It illuminates the difficulties that can arise when one tries to implement and roll out improvement programmes.  Deep understanding of the intervention is vital.

How do we achieve that understanding? It’s a big issue for NIHR’s Collaboration for Leadership in Applied Health Research and Care (CLAHRC). In Northwest London, we’re funded for the next five years to accelerate the translation of health research into patient care. Our experience is that rapid and continuous real-time feedback from evaluation, combined with flexibility in programme adaptation, is vital to ensure rapid improvement of health service practice. It is also central to meeting the longer-term challenges of achieving sustainability and reproducibility of change.

Challenges of transferring successful interventions

The nature of the challenge was highlighted by the Michigan Central Line Project. This was a highly successful US quality improvement project designed to reduce central line infections. Mortality was reduced significantly. ‘Matching Michigan’ was a subsequent initiative in 200 English hospitals to replicate Michigan’s results. It didn’t work as well as hoped. Drawing parallels with the Melanesian story, Professor Dixon-Woods’ paper argues that the Michigan innovation transfer likewise demonstrated inadequate understanding of the true intervention.

How can real-time evaluation help to avoid these misunderstandings? First, it offers a better chance to optimise interventions in their original settings, as well as in subsequent roll-out sites. Secondly, it can lead to a richer, more real understanding of the system and how it works. This can lead, I believe, to a fuller evaluation and more successful transferability.  The opportunity offered by real-time evaluation might be at a specific project level, implementing an intervention at a specific setting, but its strengths are also useful at higher policy levels and in the support and training levels lying between policy and practice.

Why does testing an intervention in situ with real time evaluative feedback produce a better eventual implementation? That’s partly due to being able to fit the intervention to its context effectively. The project team gain much better insight into what is actually happening during implementation, which is sometimes so complex that it is easy to miss key aspects of what is occurring. There can also be early checks on the intended impacts – if an intervention is being implemented successfully but not improving outcomes, there are statistical approaches that allow evaluators to explore the reasons quickly and take appropriate action. Feedback also increases motivation and engagement within the initiative, encouraging reflective thought.

A closer working relationship between evaluators and the team can expose underlying assumptions within an intervention which might otherwise be obscured. Typically, members of the team also better appreciate the value of evaluation, leading them to develop higher quality data. Team challenges to the data – observations that ‘this does not make sense to me’ – can be illuminating and help create both between-site and within-site consistency. In her ‘Matching Michigan’ study, Mary Dixon-Woods highlights huge inconsistencies between the data collected in the different sites despite each site supposedly working to an agreed, common operational framework. Achieving such consistency is extremely difficult. Close working between the evaluation and implementation teams can help, and it provides greater access to the mechanism in which, and by which, an intervention works. It offers a lot of information about sensitivity and specificity of measures.

Challenges of real time evaluation

Real time feedback and evaluation do have problems: they are more resource intensive and can blur the lines between an evaluation and the intervention itself. There are methodological challenges – if early feedback is followed by a working and responsive change, then the evaluation is, in theory, dealing with a different intervention from the one it began to examine. Inevitably, there are questions about the impartiality of the evaluators if they work very closely with the implementation team.

At CLAHRC Northwest London, we reckon that the increased costs of real time feedback are more than outweighed by the benefits. It helps that the very nature of the interactive feedback implies starting on a smaller scale, which allows an initial programme to build in the interactive feedback, with later findings then used to inform roll-out.

It is vital to clarify the intervention.  Laura J Damschroder’s 2009 paper** published in Implementation Science reviews the literature to articulate a framework distinguishing the core and the periphery of an intervention. The core represents the defining characteristics which should be the same wherever implemented, but there is also the flexible, context-sensitive periphery.

Regarding concerns about compromising objectivity, that is essentially a case of planning carefully, delivering against the protocol and then justifying and accurately reporting any additional analyses or modifications so that anyone reading an evaluation understands what was planned originally and what was added as part of the interactive feedback.

Typically, people tend to think of two distinct processes – implementation and evaluation. In CLAHRC NWL, there is much more overlap. The CLAHRC NWL support team essentially perform an evaluative role and attend implementation team meetings to provide real time evaluation feedback on the project measures. Biannually, Professor James Barlow and his team at Imperial College London provide evaluation of the CLAHRC NWL programme, predominantly at higher levels, but there is still an interactive process going on.

Clarity about interventions

Take, for example, our programme to improve the management of chronic obstructive pulmonary disease (COPD). There are some high level factors that we wish to influence by implementing the intervention, including reduced patient smoking, increased capacity to use inhalers properly when patients are out of hospital, plus better general fitness and levels of exercise. There is a whole series of interventions, ranging from general availability of correct inhaler advice to much more specific provision of specialist staff education sessions, improving their ability to train patients in inhaler techniques. This is a useful way of separating the core of the intervention from the periphery – the more one is discussing generalities, the closer one is to the core of the intervention, whereas detailed particular measures are more sensitive to local context. So, for example, it may be that, in one hospital, there is already an embedded staff training programme on inhaler technique, so it is unnecessary to implement this peripheral intervention in that situation.

Implementation is clearly complex. Real time feedback, I believe, can help improvement programmes to develop and to be implemented successfully. It can also make for a better evaluation, but that requires very particular approaches to ensure rigour.

Dr Tom Woodcock is Head of Information at NIHR CLAHRC Northwest London and Health Foundation Improvement Science Fellow. This piece is based on a presentation that Dr Woodcock gave at the meeting ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014, organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).


* Dixon-Woods, M. et al (2013) “Explaining Matching Michigan: an ethnographic study of a patient safety program”, Implementation Science 8:70.

** Damschroder, L.J. (2009) “Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science”, Implementation Science 4:50.

Let’s be honest that pilots are not just about testing: they’re also about engineering the politics of change

By Stefanie Ettelt

There is more to policy piloting than evaluation – piloting is a policy tool in itself, not only a means for conducting research, says Stefanie Ettelt


Pilot evaluation tends to frustrate and disappoint some or all of its stakeholders, be they policy-makers, local implementers or evaluators, according to a study I have been working on for PIRU. Policy makers typically want robust, defensible proofs of success, ideally peppered with useful tips to avoid roll-out embarrassments. But they are distinctly uncomfortable with potentially negative or politically damaging conclusions that can also spring from rigorous evaluation.

Meanwhile, implementers of pilots at a local level don’t welcome the ambivalence that evaluation suggests, particularly when randomised controlled trials (RCTs) are used, given the associated assumption of uncertain outcome (equipoise). Implementers understandably worry that all their hard work putting change into action might turn out to have been a waste of time, producing insufficient improvement and leading to a programme being scrapped.

The evaluators may prefer a more nuanced approach than either of the above want, in order to capture the complex results and uncertainties of change. But this approach might find little favour with those commissioning the work.  Evaluators are often dissatisfied with the narrow time frames and limited sets of questions that are allowed for their investigations. They may feel tasked with gathering what they consider to be over-simplistic measures of success as well as being disappointed to discover that a roll-out has begun regardless of –  or even in advance of – their findings.

Keeping all of these stakeholders happy is a big ask. It’s probably impossible, not least because satisfying any one of these stakeholders may mutually exclude contentment among the others.  Why do we find ourselves in such a difficult situation?

Why is it so hard to satisfy everyone about pilots?

Perhaps this tricky issue is linked to the particular way in which British policy-making is institutionalised. These days, policy-making in the UK seems to be less ideologically-driven – or at least less supported by ideology – than it was in the past. With this loss of some ideological defences have also gone some of the perceived – albeit sometimes flawed – certainties that may once have protected policies from criticism. As a result, there are sometimes overblown expectations of research evidence in the UK and sometimes illusory beliefs that evidence can create new certainties.

The institutional design of the Westminster system perhaps invites excessive expectations that policy can be highly informed by evidence, because political centralisation means that there seem to be fewer actors who can veto decisions than in some other countries, such as Germany. There are more regional players in Germany’s federal system who can veto, obstruct or influence a decision. Relatively minor coalition partners in Berlin also have a long standing tradition of providing strong checks and balances on the larger governing party. So, in Germany, there is more need for consensus and agreement at the initial policy-making stage. This participative process tends to reduce expectations of what a policy can deliver and also, perhaps, the importance of evidence in legitimising that policy.

Britain compared with Germany

In contrast, the comparatively centralised Westminster system seems more prone to making exaggerated claims for policy development and more in need of other sources of legitimacy. Piloting may, thus, at times become a proxy for consensus policy-making and a means of securing credibility for decisions. It might help to reduce expectations, and thus avoid frustration, if policy makers were clearer about their rationale for piloting. So, for example, they might explain whether a pilot is designed to promote policy or to question if the policy is actually a good way forward. If the core purpose is to promote policy, then some forms of evaluation such as RCTs may be inappropriate.

Evaluators understandably find it difficult to accept that the purpose of piloting and evaluation might first and foremost be for policy-makers to demonstrate good policy practice and to confirm prior judgements (i.e. ‘symbolic’). But there should be recognition that piloting sometimes does have such a political nature which is genuinely distinct from it having a purely evaluative role.

Of course, such a distinction is not made any easier by policy makers who tend to use rhetoric such as ‘what works’ and ‘policy trials/experiments’ when they already know that the purpose of the exercise is simply to affirm what they are doing. If policy makers – including politicians and civil servants – use such language, they really are inviting, and should be prepared to accept, robust evaluation and acknowledge that sometimes the findings will be negative and uncomfortable for them.

Improving piloting and evaluation

There are ways in which we can improve evaluation methods to make them more acceptable to all concerned. More attention could be given to identifying the purpose of piloting to avoid disappointment and manage the expectations of evaluators, policy-makers and local implementers. If the intention is to promote local and national policy learning, more participation from local implementers in the objectives and design of evaluations of pilots would be desirable, so that these stakeholders might feel less worried by the process. Evaluators might also be more satisfied with more extensive use of ‘realist evaluation’. This approach particularly explores how context influences the outcomes of an intervention or policy, which is useful information for roll-out.

I would like to see local stakeholders more directly involved in policy-making and their role more institutionalised, so that their involvement would be ongoing and not abandoned if it was considered unhelpful by a different incoming government. These are roles that need time to grow, to become embedded and for skills to develop. Such a change would enhance the localism agenda. It would also acknowledge that local implementers are already key contributors to national policy learning through all the local trial and error that they employ.

Dr Stefanie Ettelt is a Lecturer in Health Policy at the London School of Hygiene and Tropical Medicine. She contributes to PIRU through her work on piloting and through participating in the evaluation of the Direct Payments in Residential Care Trailblazers. She also currently explores the role of evidence in health policy, comparing England and Germany, as part of the “Getting evidence into policy” project at the LSHTM.



Follow Africa’s lead in meticulous evaluation of P4P schemes for healthcare

By Mylene Lagarde

Working with researchers to evaluate the introduction of financial incentives in developed healthcare economies would yield vital knowledge, explains Mylene Lagarde

The jury is very much out on pay-for-performance (P4P) schemes in healthcare – at least as far as the research community is concerned. Lots of unanswered questions remain over their effectiveness and hidden costs, as well as potential unintended consequences and their merit relative to other potential approaches. Yet many policy makers seem to have made their minds up already. These schemes, which link financial rewards to healthcare performance, make sense intuitively. They are being introduced widely.

This disconnection between the research and policy-making worlds means that we are almost certainly not getting the best out of P4P initiatives. Perhaps more worrying, there is a danger that the tree will hide the forest – that the attractive, sometimes faddish, simplicity of pay-for-performance may obscure other, perhaps more complicated but possibly more cost-effective ways to improve healthcare. As systems struggle to configure themselves to address modern demographics and disease profiles, harnessing latest technologies, we need to know what works best to reshape behaviours.

There are three key issues that weaken the case for P4P in healthcare, as we set out in the PIRU report “Challenges of payment-for-performance in health care and other public services – design, implementation and evaluation”. These concern a lack of evidence about the schemes’ costs, about their effectiveness, and about which particular P4P designs may work better than others.

First, costs. P4P schemes are complex to design. They usually involve lots of preliminary meetings between the many participants. Yet studies have largely ignored these transaction costs and frequently also fail to track and record carefully the considerable costs of monitoring performance.

Second, the effectiveness of P4P is often impossible to assess with enough certainty. Typically, introduction of a new scheme does not include a control group. For example, if a scheme incentivises reduced hospital length of stay or emergency admissions for one hospital, it may be difficult to find a comparable hospital to serve as a counterfactual. That makes it harder to attribute a particular change to P4P – maybe it would have happened anyway.

Furthermore, only small groups of outcomes are usually monitored by P4P schemes, so evaluators may be left with a narrow, and thus weak, selection of effects. For example, reductions in hospital lengths of stay may be identified, but these may coincide with poorer outcomes elsewhere in the system, such as increased admissions to nursing homes. These unintended effects, perhaps reflecting a shift rather than a reduction in costs and problems, are often not collected by the programme. That makes whole system analysis difficult.

Third, P4P is not a unique and uni-dimensional intervention. It is a family of interventions. They are all based on the premise that financial incentives can support change, but there are many variables: the size of the reward; how frequently it is offered; whether it is focussed on relative or absolute targets; whether it is linked to competition between providers or it is universally awarded. Very often, one type of intervention is used but another might equally well be employed. Each variation can produce different results, yet we still know little about the relative performance of alternative designs for these incentive schemes.

Researchers are not completely in the dark about P4P in healthcare. We are beginning to understand factors that characterise successful schemes. These typically involve a long lead-in time to plan, test and reflect carefully on the different elements of a programme. However, we must strengthen evaluation.

The first step would be to involve researchers at an early stage of the programme design. That’s the moment to spot where in the system you might need data to be collected. It’s also the time to identify control groups so that the causal impacts of these programmes can eventually be attributed more confidently.
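To make the role of a control group concrete, here is a minimal sketch of a difference-in-differences comparison. This is a standard evaluation technique that the text above does not name, and it is offered only as an illustration. The function name, the two hospital groups and the length-of-stay figures are all invented for the example, and the estimate is only meaningful if one assumes both groups would otherwise have followed the same trend.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Estimate the change attributable to a scheme.

    Assumes (hypothetically) that, without the scheme, the treated group
    would have changed by the same amount as the control group.
    """
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Invented figures: mean length of stay in days, before and after a scheme.
TREATED_BEFORE, TREATED_AFTER = 7.0, 6.0   # hospitals in the P4P scheme
CONTROL_BEFORE, CONTROL_AFTER = 7.5, 7.0   # comparable hospitals outside it

# A simple before-and-after comparison credits the scheme with the whole change.
naive_change = TREATED_AFTER - TREATED_BEFORE

# The control group shows how much of that change might have happened anyway.
did_estimate = difference_in_differences(
    TREATED_BEFORE, TREATED_AFTER, CONTROL_BEFORE, CONTROL_AFTER
)

print(f"Before-and-after change: {naive_change} days")
print(f"Difference-in-differences estimate: {did_estimate} days")
```

With these made-up numbers, half of the apparent one-day improvement also appears in the comparison hospitals, so only the remainder could plausibly be attributed to the scheme. Without a control group identified at the design stage, that adjustment cannot be made.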

Good evaluation requires political willingness to evaluate, which is sometimes lacking. When an initiative has a political breeze behind it, policy makers worry that researchers will let the wind out of the sails. But some Low and Middle Income Countries are taking the risk. There have been large numbers of randomised controlled trials over the last few years in African countries, looking at the effects of P4P schemes. Most are ongoing, but, so far, the evidence is promising. Rwanda was one of the first African countries to evaluate these financial incentives, mainly for increasing uptake of primary healthcare. Its programme is now being scaled up.

Why is Africa leading the way in setting high standards for P4P evaluation? Because the funders of these schemes, typically external donors (e.g. the World Bank, DfID, USAID), are well placed to demand meticulous evaluation by the receiving governmental authorities as a condition for the cash. Researchers, particularly in developed countries, rarely enjoy such firm leverage over national policy makers. And national policy-makers in these countries do not apply to themselves the degree of scrutiny they exercise with international aid recipients. Yet, if we are to get the best out of P4P – and not attach potentially false hopes to this healthcare innovation – we need more of the disciplined approach that is currently being used in Africa.

Dr Mylene Lagarde is a Senior Lecturer in Health Economics at the London School of Hygiene and Tropical Medicine. “Challenges of payment-for-performance in health care and other public services – design, implementation and evaluation” by Mylene Lagarde, Michael Wright, Julie Nossiter and Nicholas Mays is published by PIRU.

Don’t ditch evaluations just because pilots are hitch-free

By Nicholas Mays

PIRU’s experience with the ‘Choice of GP Practice Pilot’ suggests the need for continuing independent evaluation of policy roll-outs

The greatest benefits – and potential disbenefits – of any piloted policy change are usually felt in the longer term and after roll-out. Yet evaluations are often quite short term, sometimes ending before really important issues emerge and possibly even cast a shadow over the enterprise. So we should think carefully before we ditch evaluations once initial pilots show few or no major hitches.

PIRU has evaluated a pilot that let patients register with a GP practice even if they lived outside the practice’s catchment area.  Some 43 practices in three urban areas, half of them in Westminster, were involved in the 12 month pilot and just over a thousand patients registered ‘out of area’.  About a third taking advantage were commuters, often young, working and in good health. About a quarter were moving house and keen to retain their GP practice, while another quarter had picked a local practice only to find that, though they were technically outside the practice catchment area, they were able to register. Finally, about one in seven used the option to register out-of-area for different reasons, such as wanting a practice that offered specialisation in a particular condition.

In short, the pilot revealed a small number of generally positive patients. There were a few practical problems but they did not seem insurmountable. Armed with these findings, the Government recently announced the scheme’s roll-out across the country on a voluntary basis. Our evaluation finished when the pilot ended.

Yet that is really just the beginning, rather than the end of the story. Roll-out will affect not just a thousand but possibly hundreds of thousands of patients, as well as hundreds of practices  – not just in the pilot areas of Westminster, Salford, Manchester and Nottingham City. The pilot was for 12 months, but some of the practices did not register any ‘out of area’ patients until six months and a quarter of the practices didn’t register any at all. The roll-out will carry on until further notice. It is likely to gather momentum as the option of ‘out of area’ registration becomes increasingly widely known. But we don’t really know the full consequences. Why? Because the roll-out is essentially an experiment. Yet the evaluation has ceased.

What should any further evaluation look at?  It would be good to be able to look at the set up and running of this scheme on a national basis and to assess the overall impacts in terms of costs, usage and health outcomes. There are some important other questions to answer.

First, will there be problems managing GP capacity in areas with large inward and outward flows of patients? For example, a GP from a rural area expressed concern to me about the potential flight of mainly young, relatively healthy commuters, who might prefer to register close to their work (as our evaluation suggested). These comparatively infrequent, fitter users of health services partly cross-subsidise older, more frequent users. The GP feared that their loss might challenge practice viability in rural areas.

At the other end, some GPs in London have expressed concerns about striking the right balance of care between residents and incomers. Some GPs feel their practices are already over-stretched by a high-need, elderly population with multiple long-term conditions. They worry about resources being diverted by an influx of younger commuters attending with mainly self-limiting conditions. Practices might end up not having the capacity to register local residents who would then have to travel further and register out-of-area themselves.  GPs also worry about the consequences of patients staying on their lists when they move house even short distances beyond the practice catchment, particularly if they are elderly and require home visits.  In a congested city, this could make a big difference to the number of patients that the doctor can see in a day.

More broadly, we have yet to see whether loosening the rules of registration may lead to lists becoming socio-economically segregated and how that shift might be managed in terms of the allocation of finance to different practices.

Second, there are also the unexplored issues of the challenges and costs to CCGs of funding diagnostics and hospital care for those registered with GPs far from their homes. It will be important that the numbers of out-of-area patients registered with practices within CCGs are kept up to date, so undercounting does not lead to underfunding of the CCG.

The system will also need to be sensitive to the possibly rapidly changing needs of patients registered out of area. For example, a pregnant woman might wish to receive her ante-natal care close to work in London, but access peri-natal, delivery, post-natal and paediatric care closer to home. Similar issues may arise with patients requiring continuing care. Will GP practices be flexible about de-registering and re-registering patients in such circumstances?  And how well will emergency primary care be provided to patients near where they live when they are registered with practices elsewhere?

We can expect that at least some of these issues will cause problems. The fact that the ‘pilot’ phase of this scheme has not been long enough to explore them raises questions about the purpose of pilots. Researchers tend to think of a pilot as an experiment before a programme’s adoption. Others see pilots simply as feasibility studies. More often than not, the fact that a pilot has been set up shows that it already has a lot of government support and roll-out is essentially a done deal, with the pilot designed to spot any big wrinkles and to deal with critics.

Whatever the truth about pilots – and it probably varies across government – we do need to appreciate that roll-outs often remain, as in this case, experiments just as much as the initial pilots. No doubt NHS England will monitor developments. However, knowledge and policy would benefit from further in-depth, independent evaluation of how things are working out.

Nicholas Mays is Professor of Health Policy at the London School of Hygiene and Tropical Medicine and Director of PIRU. He is lead author of “Evaluation of the choice of GP practice pilot, 2012-13: Final Report”, published by PIRU in March 2014.

Payment by results – a route to social policy innovation in nervous, cash-strapped times



Be pragmatic: ask not ‘whether’ but ‘how’ to adopt public sector performance-related pay

By Simon Burgess
