How to commission evaluations of national policy pilots

BY STEFANIE ETTELT AND NICHOLAS MAYS

Evaluations of national policy pilots are often embarked on with high expectations and end with a sense of frustration on all sides.  Policy-makers often expect clearer, and more positive, verdicts from evaluation than researchers are able to provide; researchers hope for their findings to be more influential; and implementers in pilot sites struggle to put in place what they think they are being expected to deliver within the limited timescale of the pilot while wondering what they have gained from either the pilot programme or the national evaluation.

To ease some of these frustrations, we have developed guidance aimed primarily at national level staff involved in policy-making and in initiating policy-relevant pilots and their evaluations.  We think the guidance will also be helpful to evaluators. Our advice stems from both experience and analysis of the fate of policy pilots (Ettelt et al, 2015a; Ettelt et al, 2015b).  Two observations, in particular, from evaluating policy pilots in health and social care have shaped our thinking.

The first observation is that it is often unclear what an evaluation is intended to contribute to policy development.  This lack of clarity is often a symptom of a deeper problem which has more to do with confusion and conflicts over the reasons for piloting than with the evaluation itself.  Indeed, the objectives of the evaluation can be expressed perfectly clearly, and yet it can entirely ‘miss the point’ if the purpose of piloting is not thought through.  As we have argued elsewhere, policy pilots can serve different purposes, many of which have more to do with the realities of policy-making, and the dynamics of policy formulation and implementation, than with piloting for the purpose of testing effectiveness (Ettelt et al, 2015a).  Different groups involved in a policy pilot can have different ideas about the purpose of piloting.  These purposes also often change over time, for example as a consequence of a ministerial decision to roll out the policy irrespective of whether the evaluation has been completed.  For example, the Direct Payments in Residential Care pilots, which PIRU is evaluating, were rebranded early in the life of the programme to become ‘trailblazers’ when it was decided, ahead of the results of the pilots, that direct payments would be rolled out nationally in 2016 alongside other aspects of the 2014 Care Act.  However, the policy context of the ‘trailblazers’ continues to change.  As a result, the Department of Health is currently reconsidering whether direct payments should move forward at the speed originally expected.

We think it is important that the goals of such programmes are stated explicitly and that their implications are thought through carefully at the beginning of a pilot programme while it is still possible to make adjustments more easily than later in the process.  This is also the time to identify the target audience for the evaluation.  Whose knowledge is the evaluation aiming to contribute to?  There are likely to be important differences in the information needs and preferences of national policy-makers and local implementers that require some forethought if they are to be addressed adequately.

The second observation is that, under the influence of the mantra of ‘evidence-based policy’, policy-makers increasingly feel that they should prioritise specific research designs for the evaluations of policy pilots, especially experimental designs.  Yet, this consideration often comes too early in the discussion about pilot evaluations and is introduced for reasons that have more to do with the reputation of the design as producing particularly ‘valid’ evidence of policy effectiveness than with its appropriateness to generate insights given the objectives of the specific pilot programme.  The choice of research design does not make a programme more or less effective.  Conducting an RCT is pointless if the purpose of a pilot is to find out whether or not, and, if so, how, a policy can be implemented.  In such a situation, the ‘active ingredients’ of the intervention have not yet been determined and thus cannot be easily experimented with.  The Partnerships for Older People Projects (POPPs) pilots, conducted in the mid-2000s, are an example of a pilot programme that brought together a large number of local projects (of which about 150 were considered ‘core’), indicating an intention to foster diverse local innovations in care, with an evaluation commissioned and designed accordingly.  However, this did not stop national policy-makers subsequently changing direction and demanding a robust outcome analysis from a pilot programme and related evaluation which were both established to meet a different set of objectives.

A similar tension between piloting to encourage local actors to develop their own solutions to problems of service delivery and the desire for definitive (cost-) effectiveness evaluation of ‘what works’ can be seen in other pilot programmes.  For example, the Integrated Care and Support Pioneers were selected as leaders in their potential ability to develop and implement their own solutions to overcoming the barriers to integrating health and social care.  Yet, the evaluation requirement includes a focus on assessing the cost-effectiveness of integrated care and support.  This is extremely challenging in the face of such a diverse programme.

Beyond our two initial observations, the question of ‘evaluability’, which is relevant to all policy evaluation, is particularly pertinent in relation to RCTs and similar experimental designs.  RCTs require a substantial degree of researcher control over both the implementation of the pilots (e.g. a degree of consistency to ensure comparability) and the implementation of the evaluation (e.g. compliance with a randomised research protocol).  This level of control is not a given, and the influence of researchers on pilot sites is much more likely to be based on negotiation and goodwill than compliance.  This does not mean that conducting RCTs is impossible, but that pilot evaluations of this type require a significant and sustained commitment from pilot sites and policy-makers for the duration of the pilot programme to stick with the research protocol, and manage the added risk and complexity associated with the trial.

To help policy-makers make these decisions and plan (national) pilot programmes and their evaluations better, we have developed a guidance document.  ‘Advice on commissioning external academic evaluations of policy pilots in health and social care’ is available as a discussion paper here.  We are keen to receive comments.

This is an expanded version of an article written for the December 2015 edition of ‘Research Matters’, the quarterly magazine for members of the Social Research Association.

References

Ettelt, S., Mays, N. and P. Allen (2015a) ‘The multiple purposes of policy piloting and their consequences: Three examples from national health and social care policy in England’. Journal of Social Policy 44 (2): 319-337.

Ettelt, S., Mays, N. and P. Allen (2015b) ‘Policy experiments: investigating effectiveness or confirming direction?’ Evaluation 21 (3): 292-307.

‘We need critical friends and robust challenge, not aloofness and separation’

by anna dixon

A strong relationship between policy-makers and academic evaluators is vital, particularly to support high quality implementation of change, says Anna Dixon, the Department of Health’s Director of Strategy and Chief Analyst.

There continues to be a view that policy making is a very neat process. An issue supposedly arises and there’s an option appraisal about how we might address it. Then, following some consultation, an implementation process is designed. After that, as good policy makers, we always evaluate what we did, how it worked and those insights feed back very nicely to inform future policy making.

Alas, it’s all a bit more complicated than that. However, my message is that in health policy – as well as in other areas of government – we are serious about commissioning evaluation, and ambitious about using the results. Evaluation matters to us. The conditions for it, albeit imperfect, are improving. Evaluation can be formative, providing learning and feedback as a policy is rolled out, or summative, focused on impact and retrospective learning. In practice, many evaluations cover both implementation and impact.

Strong support for evaluation

Enthusiasts will be relieved that the aspiration for evidence-based policy is very much alive in government.  Sir Jeremy Heywood, the Cabinet Secretary, has said that an excellent civil service should be skilled in high-quality, evidence-based decision-making. The Treasury is a crucial driver, requiring the Department of Health to do process, impact and cost-benefit evaluations of policy interventions, particularly where they involve significant public expenditure and regulation.

However, delivering on good intentions can be difficult. The National Audit Office (NAO) recently defined best practice as ‘evaluations that can provide evidence on attribution and causality and whether the policy delivered the intended outcomes and impact and to what extent these were due to the policy’. Doesn’t that sound very simple and easy? If only it were so.

In reality, it is incredibly difficult in the messy world of policy implementation to tease out the isolated impacts of one policy compared with all the layering effects of many policies changing as they are implemented. It is far from easy to identify any neat causality between particular policy interventions and outcomes.

The NAO found that much more could be done to use previous evaluations in developing impact assessments of new policies. A survey of central government departments found that plans for evaluation are sometimes not carried out.

Large evaluations commissioned

The Department of Health commissioned a large scale programme of evaluation of the Labour government’s NHS reforms which was coordinated by Nicholas Mays (now director of PIRU). We’re now also commissioning an evaluation of the Coalition’s reforms of the English NHS and also thinking about evaluating impacts from policy responses to the Francis Inquiry. These are substantial evaluation programmes tackling many interventions, occurring simultaneously against a background where much else is changing. It will not be easy to tease out the ‘Francis effect’ in the current economic context with many other policy initiatives taking place at the same time. As well as funding the NIHR and the Policy Research Units like PIRU, the Government recently developed the ‘What Works Centres’. These aim to help government departments and local implementers – schools and probation services and others – to access higher quality evidence of what works.

Policy-making misunderstood

Will all this activity make a difference? I feel confident that it can lead to more successful implementation of particular interventions and can contribute to better policy-making. But it is only one input into the process. Policy is often driven by evidence of a different kind. That may be the personal experience of the Minister, deliberative exercises, practical wisdom and so on. Insights about what can work on the ground – ‘implementability’ – are also rightly important. And there is the more political dimension – what is acceptable to the public? All these elements go into the mix along with more formal research evidence.

Benefits of implementation evaluation

The influence of evaluation on implementation is more compelling, and demonstrates real value more clearly, than its influence on policy. We have seen this recently with the Care Quality Commission’s new hospital inspection programme. Researchers, led by Kieran Walshe, went out with hospital inspectors on the first wave, which immediately fed into the design of the second wave. That’s also been evaluated and is now feeding into the approach that will be rolled out for future hospital inspections and in other sectors of health and care. These pragmatic, real-time evaluations can be very useful. They are critical now for the Department of Health because its separation from NHS England means that many people who had experience of more operational roles are no longer working directly within the policy-making environment.

Implementation evaluation is beginning to be reflected in the language used by government. The Treasury continues to emphasise summative evaluation, focussing on outcomes and cost-benefit ratios, but the policy implementation ‘gap’ is now recognised as being particularly important.  We are in a phase where ‘policy pilots’ seem to be out and we have tried ‘demonstrators’. Now we have ‘pioneers’. The language increasingly reflects that the main goal is to understand how to make something work better.

Evaluation can be more effective

What can government and academia do to increase the influence and usefulness of evaluation? We share a challenge to create engagement at the earliest possible stage – ideally the policy design stage. This means building relationships so that academics understand the policy questions and policy makers can share their intentions. So, evaluators should make sure that they talk to the relevant officials and find out who’s working on what. Success can yield opportunities to help design policy or implementation in ways that will support better evaluation.

Academics should be willing to share interim findings and work in progress, even if it is not complete. Otherwise there is a risk that they will miss the boat. On the Government side, we need to be more honest and open about the high priority evaluation gaps at our end.

In terms of rigour, Government is trying to provide better access to data. For example, organisations implementing interventions in criminal justice are able to use large linked data sets, established by the Ministry of Justice, so it is much easier to see impacts of policy changes on reoffending rates. We must make sure that our routine data collections measure the most important outcomes and that these measures are robust. Clearly, one of the challenges for evaluators is to understand the messiness of context.

Independence

The one word I have avoided is ‘independence’ of researchers. If independence means aloofness and separation, I don’t think the relationship works well.  We need to know each other: academics need to know the policy world; the policy world needs to understand academia.  In government, we need critical friends and robust challenge. The fruitful way forward for both sides is to have ongoing discussion, engagement, creating good relationships that mean, even in this messy world, that we can make greater use of evaluation to inform decision-making.

Dr Anna Dixon is Director of Strategy and Chief Analyst at the Department of Health.  This blog is based on a presentation she gave at the meeting, ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014 organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).

Policy process for implementing individual budgets highlights some of the tensions in public policy evaluation

by gerald wistow

A high profile initiative to transform social care delivery demonstrates how the demand for rigorous evaluation can be difficult to fulfil alongside enthusiastic policy advocacy, explains former government advisor, Gerald Wistow

Over 40 years ago, the eminent social psychologist, Donald T Campbell, complained that excessive commitment to policies had prevented proper evaluation of Lyndon Johnson’s ‘Great Society’ reforms. Campbell urged social scientists to engage with policy makers to ensure that they appreciated the value of evaluation and did not allow its political risks to preclude its thorough application. His comments are just as relevant today.

I am grateful to Stefanie Ettelt for drawing my attention to a quote from Campbell’s 1969 paper, ‘Reforms as experiments’. In it, he declares: ‘If the political and administrative system has committed itself in advance to the correctness or efficacy of its reforms, it cannot tolerate learning of failure. To be truly scientific we must be able to experiment. We must be able to advocate without that excess of commitment that blinds us to reality testing.’ 

These sentiments spring to mind when reflecting on the piloting of individual budgets for adult social care that took place from 2005. This process highlights the risk that powerful advocacy within government can still lead to what, from the perspective of evaluation, might be considered excessive commitment and so obscure the ‘reality testing’ that evaluation is supposed to provide.

From 2005, I was a scientific advisor to the individual budgets policy team at the Department of Health, providing advice and support through all stages of the evaluation.  At the time, policy processes were being modernised and made more professional. The New Labour mantra, ‘what matters is what works’, meant policy makers were supposed to favour analysis over ideology, not least through experimentation and evaluation in advance of universal national roll-out. The Modernising Government White Paper (1999) emphasised that evaluation should have a clearly defined purpose with criteria for success established from the outset, that evaluation methods should be built into the policy-making process from the beginning, and that learning from pilots should be interpreted and applied.

A key starting point for the formal introduction of individual budgets was the implementation of the ‘Valuing People’ White Paper (2001) which established the central importance of people with learning disabilities being treated as full citizens rather than being excluded from living normally in society. Its four key principles were rights, choice, independence and inclusion.

The Department of Health established a ‘Valuing People Support Team’ to help local authorities and the NHS to implement these principles. In 2003, the Team formed a partnership with Mencap, known as ‘In Control’, to implement a process of ‘self-directed support’ which was piloted with limited evaluation in six local authorities.  The pilots were designed to enable people with learning disabilities to assess their own needs, write their own care plans and organise their own support. The background to this initiative was the need for people with learning disabilities to have greater opportunities to secure more flexible and individualised services because of the low take-up of direct payments (one per cent of all community care packages in around 2003). At the time, some 75 per cent of all money on learning disabilities was still being spent on three traditional, institutional services – residential and nursing home care and day care.

In Control quickly became an organised movement which penetrated national and local government (almost every local authority in the country soon signed up to its programme). By 2005, it had also allied with the physical disability movement which had been working with the Cabinet Office to develop a national strategy that included proposals for a programme of individual budgets.  The concept envisaged that individuals would be able to combine into a single budget all the different funding streams to which an individual might be entitled – such as social security, housing, access to employment and social care.  Individuals would be able to use such a budget on the basis of their assessed needs to purchase the services that they thought most suited those needs. This fitted in with the principles of improving social care services, scoring high on choice, control and independent living.

So by 2005, proposals for individual budgets were coming from the heart of government: from the Prime Minister’s Strategy Unit, the Department of Health and the Department for Work and Pensions. The policy was in the 2005 Labour Party manifesto and, during the General Election itself, Downing Street wrote a scoping paper on implementation. All of these champions envisaged that a process of piloting and evaluation would be necessary and appropriate. In January 2005, the Cabinet Office had described individual budgets as a radical initiative, which would take time to get right, but which would be progressively implemented and, subject to evaluation and resource availability, would be rolled out nationally by 2012. However, by March, the DWP was saying it would be rolled out nationally by 2010.

There remained in these narratives the possibility of failure – everything was subject to evidence that it worked. Evaluation was part of the Government’s risk management – the risk of introducing a radical change that some people strongly supported but whose workings remained unclear. It also appealed to sceptics by saying, ‘Let’s do it progressively, let’s evaluate, let’s make sure that it works’.

The Treasury also had considerable interest in what the programme would cost to introduce, its outcomes and cost effectiveness compared with conventional approaches to service delivery. This last requirement drove the evaluation design so that its core element was a randomised controlled trial. There was also a process evaluation of factors that facilitated and inhibited implementation but the central focus at the outset was to evaluate how the costs and outcomes of individual budget pilots would compare with standard service delivery arrangements.

Although RCTs were widely regarded in DH as the gold standard for evaluation methodologies, especially for clinical interventions, other government departments were less comfortable with the idea that trials were appropriate in the context of individual budgets. The DH implementation support team, and some local staff, shared these concerns and particularly questioned the ethics of denying some participants in the trial access to individual budgets in order to provide comparisons with those who received such budgets.

Meanwhile, the evaluators soon realised, as is often the case, that the intervention to be evaluated was poorly specified. With the policy team, they had to ask: What is an individual budget? How is it allocated? What’s the operating system? How is need to be assessed? How would an assessment of need be converted into a financial sum that someone had available to spend on their care and support? Fortunately, from one point of view, ‘In Control’ had developed a model in their earlier six pilots that not only filled the vacuum but effectively became the intervention to be piloted and evaluated.

Then, in 2006, a new Minister moved the goal posts and announced that, in his view, the inherent value of individual budgets was not in doubt and that he had decided that the initiative should be rolled out nationally from 2010. The evaluation still had an important role, but it would now advise on how best to implement that decision rather than provide evidence to inform whether such a decision should in fact be made. So the RCT continued, but it was undermined. Sites felt more reluctant to identify participants in the study who would not receive a service that had now been ministerially endorsed. Recruitment to the study was slow and, with systems change lagging behind the evaluation timetable, some participants had not received services for the full follow up period before the pilots ended.

The evaluation reported on time and found that people in receipt of budgets, and their carers, reported greater independence and control over how care was provided. Individual budgets were slightly more cost-effective for some (but not all) groups of people. In addition, the implementation of individual budgets had important implications for staff roles, training and the management of funding streams.

In practice, the evaluation was conducted at the intersection of politics, policy-making and implementation. Ministers wanted to prove they could deliver change during their frequently short periods in a particular post. They were also greatly influenced by their own informal networks, including, in the case of the second minister, his own previous experience of social care services and knowledge of the ‘In Control’ model.

The Department of Health implementation support team, which was helping the local sites to implement individual budgets, was also closely associated with ‘In Control’ and its operating model for individual budgets.

The experience of implementing the individual budget pilots demonstrated how the value base of health and social care competed with the arguments about technical rationality underlying the modernising government and public sector reform agendas. The former emphasised the rights of older people and people with disabilities to have greater control over their lives. The latter required evidence to demonstrate the benefits of such control, or at least the costs and effectiveness of an intervention that more anecdotal evidence already appeared to support, in advance of results becoming available from the independent evaluation commissioned by the Department of Health.

As Russell and colleagues (2008) argue – and the individual budgets example supports – policy-making in practice is more a ‘formal struggle over ideas and values’ than a systematically structured search to find and apply the best evidence of what works. As the same authors also underline, there is no single ‘right answer’ to be identified in the messy world of policy-making but only ‘more-or-less good reasons to arrive at more-or-less plausible conclusions’ (Russell et al 2008).

It is sometimes argued that policy makers need better understanding of evaluation but it is perhaps no less true that evaluators need better understanding of policy-making and political processes. There are, for example, some givens in public policy which inevitably and necessarily impact on the conduct and interpretation of evaluation. These givens include the impact of electoral and financial cycles as well as electoral and bureaucratic politics. There are also multiple actors and stakeholders, some of whose actions and influence within policy processes are less apparent than others. For example, for policy researchers there are fascinating questions about how the radical concept of individual budgets was developed and rolled out universally within less than a decade. How a small and newly established organisation such as ‘In Control’ was able to achieve the transformation of national social care policy and service delivery guidelines so rapidly and subsequently begin to extend its model into the NHS is, in itself, an evaluation topic of great interest and relevance to policy researchers.

As for social policy evaluators, these reflections underline the advice of Donald Campbell cited above from another era of social policy transformation. Moreover, in an inherently political clash between values and evidence, the roles of evaluators can perhaps usefully be summarised as being to provide challenge which is both rigorous and sustained; to serve as professional sceptics where others are the professional advocates of change; and, finally, to suspend belief in the absence of independent analysis.

Gerald Wistow is Visiting Professor in Social Policy at the London School of Economics. This piece is based on a presentation that Professor Wistow gave at the meeting ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014, organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).

Modelling lets evaluators test-drive change safely and cheaply, using a diversity of non-RCT evidence

by sally brailsford

Enhanced decision-making, blue-skies thinking and quick trials of hypotheses are all much easier if modelling is in your evaluation tool kit, explains Sally Brailsford

Everyone thinks that they know what a model is. But we all have different conceptions. I like the definition from my colleague Mike Pidd, from Exeter University. He sees a model as ‘an external and explicit representation of a part of reality’.  People use it ‘to understand, to change, to manage, and to control that part of reality’.

We tend to acknowledge the limitations that models have, but fail to fully appreciate their potential.  ‘All models are wrong,’ as George Box said, ‘but some are useful’.

I work in Operational Research. It’s a tool kit discipline. In one part, we make use of statistics, mathematics and highly complex algorithmic models. In another, we draw pictures and play games. I use these elements to create simulations – I build a model in a computer which replicates a real system and then we can play ‘what if’ with it.

Models inform decision-making

I use models mainly for informing decision-making. Sometimes, they don’t actually need much data to be very useful. For example, there is a famous model about optimal hospital bed occupancy, created by Adrian Bagust and colleagues at Liverpool University’s Centre for Health Economics.  It includes some numbers but they are not based on any specific hospitals. It shows that if a hospital tried to keep all its beds fully occupied, then some patients would inevitably have to be turned away.

The model varies patient arrivals as occupancy increases and demonstrates how often the hospital has to turn away emergency patients. It shows that hospitals deemed inefficient, because they occasionally have empty beds, are actually operating effectively. The finding really influenced policy. It showed that, as a hospital reaches about 85 per cent occupancy, it is increasingly likely to have to turn emergency patients away. It is a simple model. It did not involve long-running, expensive randomised controlled trials. Yet it provided vital evidence and was powerful in influencing occupancy targets.
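As a rough illustration of the kind of model being described – a sketch written with invented numbers, not the published bed-occupancy model itself – the short simulation below fills a fixed pool of beds with randomly arriving emergency patients and random lengths of stay, and counts how often an arrival finds every bed full as average occupancy rises.

```python
# Minimal sketch of a bed-occupancy loss model (illustrative assumptions only):
# emergency patients arrive at random, stay a random number of days, and are
# turned away if every bed is occupied.
import heapq
import random

def simulate(beds=100, mean_los_days=5.0, arrivals_per_day=15.0,
             days=2000.0, seed=1):
    """Return (average occupancy as a fraction, fraction of arrivals turned away)."""
    rng = random.Random(seed)
    discharge_times = []        # heap of discharge times for currently occupied beds
    arrivals = turned_away = 0
    occupancy_seen = 0.0
    t = 0.0
    while t < days:
        t += rng.expovariate(arrivals_per_day)       # time of next emergency arrival
        while discharge_times and discharge_times[0] <= t:
            heapq.heappop(discharge_times)           # free beds whose stay has ended
        arrivals += 1
        occupancy_seen += len(discharge_times)       # occupancy seen by this arrival
        if len(discharge_times) < beds:
            heapq.heappush(discharge_times, t + rng.expovariate(1.0 / mean_los_days))
        else:
            turned_away += 1                         # no bed free: patient diverted
    return occupancy_seen / arrivals / beds, turned_away / arrivals

for demand in (13, 15, 17, 18, 19):                  # emergency arrivals per day
    occupancy, diverted = simulate(arrivals_per_day=demand)
    print(f"{demand}/day: occupancy {occupancy:.0%}, turned away {diverted:.1%}")
```

Under these made-up assumptions, turn-aways are negligible while average occupancy sits around 75 per cent, start to appear by the mid-80s and grow quickly beyond that – the qualitative pattern described above, produced by a few dozen lines of code rather than a trial.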

30-year clinical trial in five minutes

In another model, we looked at patients with diabetes at risk of developing retinopathy. Everyone agreed that it was a good idea to screen patients with diabetes to prevent retinopathy before it leads to blindness. However, there was a whole range of screening practices. We used data from all over the place, from the US and from the UK. The model followed patients with diabetes through the life course and through different progression stages.

We had to draw data from very early studies because it would be unethical to conduct a clinical trial that did not treat people according to best practice. We then adapted the model for different populations, with varying ethnic mixes and probabilities of diabetic incidents. We superimposed on the model a range of different screening policies to see which was most cost-effective. In effect, once we felt confident that the model was valid, we could run a clinical trial on a computer in five minutes rather than running a real clinical trial for 30 years. As a result, we discovered really valuable findings.

The beneficial difference between all the various techniques and screening programmes proved to be minor compared with the large impact of more people being screened. We realised that raising attendance, perhaps by social marketing, offered much better value than buying expensive equipment.
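As an illustration of the general shape of such a model – a deliberately small sketch with invented transition probabilities, attendance rates and costs, not the model described above – the code below follows a cohort through disease states year by year and overlays different screening policies.

```python
# Minimal sketch of a state-transition (Markov-style) screening model with
# invented annual transition probabilities, sensitivity and costs -- purely
# illustrative, not the diabetic retinopathy model described in the text.
def run_policy(screen_every_years, attendance, years=30, cohort=10000.0,
               sensitivity=0.85, cost_per_screen=25.0):
    # states: no retinopathy, background retinopathy,
    #         sight-threatening retinopathy (untreated / treated), blind
    ok, background, st_untreated, st_treated, blind = cohort, 0.0, 0.0, 0.0, 0.0
    p_ok_to_background, p_background_to_st = 0.06, 0.08
    p_blind_untreated, p_blind_treated = 0.20, 0.05
    screening_cost = 0.0
    for year in range(1, years + 1):
        # annual disease progression, computed from the start-of-year state
        new_background = ok * p_ok_to_background
        new_st = background * p_background_to_st
        new_blind = st_untreated * p_blind_untreated + st_treated * p_blind_treated
        ok -= new_background
        background += new_background - new_st
        st_untreated += new_st - st_untreated * p_blind_untreated
        st_treated -= st_treated * p_blind_treated
        blind += new_blind
        # screening round: attenders with sight-threatening disease can be
        # detected and moved onto treatment
        if year % screen_every_years == 0:
            attending = (ok + background + st_untreated + st_treated) * attendance
            screening_cost += attending * cost_per_screen
            detected = st_untreated * attendance * sensitivity
            st_untreated -= detected
            st_treated += detected
    return blind, screening_cost

for interval in (1, 2):
    for attendance in (0.6, 0.8):
        blind, cost = run_policy(interval, attendance)
        print(f"screen every {interval}y, attendance {attendance:.0%}: "
              f"{blind:,.0f} blind after 30 years, screening cost £{cost:,.0f}")
```

The real model attached evidence-based probabilities, populations and costs to this kind of structure, which is what made it possible to compare screening technologies, intervals and attendance levels on a like-for-like basis in minutes rather than decades.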

Guiding design of hypothetical systems

The next model is even more hypothetical. Three engineers had an exciting, blue skies idea for patients with bipolar disorder. What if, they asked, different sensors tracked a person’s behavioural patterns and, having established an individual’s ‘activity signature’, could spot small signs of a developing episode that would trigger a message that the person might need help?

We expected, rightly, that success depended on what monitoring individuals could tolerate – perhaps a bedside touch-sensor mat, a light sensor in their sitting room, sound sensors or GPS. We built these different possibilities into the model. We could also check how accurate the algorithms would have to be if this technology were developed. So we were guiding the design of a hypothetical system.
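A deliberately crude back-of-envelope version of that accuracy question – with invented numbers, offered only to show the flavour of the calculation rather than anything from the project – is to ask how many false alerts a year a person would receive, and what share of alerts would be genuine, for a given detector performance.

```python
# Crude sketch (invented numbers) of detector accuracy versus alert burden for
# a hypothetical episode-detection system.
def alert_burden(episodes_per_year=3, prodrome_days=7,
                 sensitivity=0.8, false_alarm_rate_per_day=0.02):
    """Return (false alerts per year, share of alerts that are genuine)."""
    episode_days = episodes_per_year * prodrome_days
    true_alert_days = episode_days * sensitivity
    false_alerts = (365 - episode_days) * false_alarm_rate_per_day
    share_genuine = true_alert_days / (true_alert_days + false_alerts)
    return false_alerts, share_genuine

for rate in (0.05, 0.02, 0.005):
    false_alerts, genuine = alert_burden(false_alarm_rate_per_day=rate)
    print(f"false-alarm rate {rate:.3f}/day: {false_alerts:.0f} false alerts a year, "
          f"{genuine:.0%} of alerts genuine")
```

Under assumptions like these, a seemingly low daily false-alarm rate can still mean that around half of all alerts are spurious – exactly the sort of finding that tells designers how accurate the algorithms would need to be before such a system is worth building.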

Many, particularly those from clinical backgrounds, find it hard to accept that modelling can provide evidence upon which to make a major decision. People often expect the same kind of statistical evidence as from randomised controlled trials. Modelling does not claim to provide that level of certainty. It is a decision-support tool, helping you understand what might happen if you do something.

Appreciate modelling advantages

We should recognise the advantages of models. They are quick and cheap – you can run a clinical trial that could last decades in a matter of minutes. If you lack confidence statistically in your model, there are solutions: expert opinion and judgement can help fill the gaps. A model allows people to talk about issues in a policy setting and to articulate their assumptions. Quite often the conversations along the road are more important than the eventual model and the model is just a means to that end.

As in the bipolar project, you can model innovations that don’t even exist. So I often use modelling with hospitals that are redesigning a system or a service. The development does not exist yet, so there are no data – you must gather all the available evidence you can and build it into your model. It lets you explore more than traditional methods allow because your assumptions can be more flexible.

Collecting primary data is hugely expensive, sometimes impossible.  You can consider all sorts of options that it would be unethical to explore in reality. As the bed occupancy model shows, the findings can be powerful and influential.

There is a saying that, if all you have is a hammer, then every problem is a nail. As researchers, we should avoid being confined by preferred methods, whatever our discipline. Modelling can be a valuable research tool.

Sally Brailsford is Professor of Management Science at the University of Southampton. Her blog is based on her presentation on 4 July 2014 at PIRU’s Conference: ‘Evaluation – making it timely, useful, independent and rigorous’.

‘Different contexts should not be allowed to paralyse wider roll-out – some differences don’t really matter.’

by mark petticrew

Interventions that succeed in some instances may or may not work in other circumstances. You have to consider whether the contextual differences really are ‘significant’, says Mark Petticrew

How important is the particular context of a policy intervention in deciding whether that intervention can work elsewhere? The answer must lie in the significance of the context. Every place is different. Every time is different. Everybody is different. The important question must be: which differences really matter, which are actually significant? We should avoid mistakenly thinking that the inherent uniqueness of everything means that a particular intervention will never work elsewhere. It might still be generalisable and transferable.

Similarity and uniqueness

It is, of course, highly implausible that interventions work the same way across different contexts. Nevertheless, it is equally implausible that evidence collected in one context has no value for another. These polar positions are unhelpful because neither is true. (‘We are all individuals,’ shouted the crowd to Brian in the Monty Python movie.  ‘I’m not,’ said a lone dissenter). Clearly, all individual study contexts are different, but there may be similarities.

Similarity and portability across apparently very different contexts were aptly illustrated to me when I was involved in housing research. The earliest controlled trial of a housing improvement intervention was done in Stockton-on-Tees in 1929. Families were moved out of the slums, which were then demolished, and moved into new housing. Unexpectedly, many people’s health deteriorated.

This type of intervention is common today.  Urban improvement accompanied by large-scale housing regeneration occurs frequently. However, the context is very different from 1929.  In those days, poverty was probably more widespread, as was slum housing. Yet, more recently the same unanticipated adverse effect has been found in one study, with a minority of people’s health deteriorating when their housing improves. Although the context looks very different, the underlying mechanisms seem to be the same, namely that, when the housing is improved, rents rise and so people scrimp on their diets and their health gets worse.

Another field where the same mechanisms apparently work across different contexts is smoke-free legislation which aims to restrict the impact of second-hand smoke in work and public places. This has been evaluated at least 11 times in very different contexts. When the issue reached the UK, critics, often in the hospitality industry, said this might have worked in these other countries but it wasn’t going to work in pubs in Glasgow, say, or in London. The same arguments were raised around the implementation of smoke-free legislation in Ireland, that these are very different contexts, that people’s drinking and smoking were wedded. Yet, in fact, the success of implementation has been broadly similar across many different states and countries.

Aspects of context that matter

In short, predicting the generalisability of an intervention is all about understanding the significance of context. So the first step must be to reflect on which aspects of context might really matter. Many checklists have been put together to help with this task. Dr Helen Burchett from the London School of Hygiene and Tropical Medicine has reviewed dozens of these frameworks, which are used to help users judge whether evidence collected in one setting might be applicable in another context.  Her study found that there are 19 categories of context that might be important, and a few more can probably be added.

Some of the work that we have been doing as part of the NIHR School for Public Health Research has been particularly enlightening around economic contexts. Local practitioners tell us that the current economic climate has been a big constraint not only on the use of evidence by, for example, local government, but also on evaluation itself, which is often seen as a luxury.

However, as I have tried to show, context always varies and simply pointing out the differences is not sufficient. You have to determine – or sometimes make assumptions about – which of these variations actually matter, and which are likely to be clinically or socially significant. How do you do this? This assessment should be informed by at least three considerations. First, there is knowledge of the existing evidence, which helps one discover whether and how the intervention has worked in other settings. Second, understanding the underlying theory and assumptions about how the intervention works and is moderated can be helpful. Finally, one can draw on the judgement of experts, practitioners and policy makers who might have insights into whether one context is significantly different from another.

There is a lot more scope for research in this field. For example, there may be classes of interventions that are less context-dependent than others. Smoke-free legislation with its 11 evaluations would be a case in point, and suggests that perhaps regulatory interventions are less affected by context than interventions that require more individual behavioural change.

Context and interventions intertwined

We may also need to revise our sometimes simplistic view of the relationship between context and intervention. There is a tendency to see context merely as a moderator, something that interferes with an intervention in some way. Yet there are many situations and policies where the intervention is the context. The intervention changes the nature of the system in some way so that the intervention and the context are, in effect, the same thing.  This makes defining the start and the end of an intervention and its boundaries – and thinking about how you evaluate it – hugely challenging.

The significance of context in generalisability also places question marks against the culture of systematic reviews. During such reviews, researchers aim to put all the evidence together from interventions and attempt to discern a single effect based on everything that is known about an issue. It is an attempt to separate the ‘things that work’ from the ‘things that don’t work’ and identify an overall effect size. This may be problematic because, during this process, the context that produces that effect usually gets stripped away. As a result, in the process of producing evidence, we lose the context.

As researchers we also have a tendency to see the world in terms of studies of ‘magic bullets’ which tell us that, if things work, then they work everywhere. However, at least in public health, we are increasingly putting together assemblages of evidence from different contexts that show what happened when those interventions were implemented in different places to guide future decision makers. This is very different from saying simply that something always ‘works’.  It might be more helpful to see the wider goal of collecting evidence as being to inform decisions, rather than to simply test hypotheses. This may be one way forward to make proper sense of context, rather than either trying to eradicate it or allowing its uniqueness to rule out the possibility that an intervention can be transferred across time and space.

Dr Mark Petticrew is Professor of Public Health Evaluation at the London School of Hygiene and Tropical Medicine and a member of PIRU. He is also a co-director of the NIHR School for Public Health Research at LSHTM. (mark.petticrew@lshtm.ac.uk)

 

‘Research units are performing a difficult balancing act … but we’re still smiling.’

by nicholas mays

Our ambition to co-produce evidence with advisors and officials is fraught with challenges, but remains a worthy goal with valuable benefits, explains PIRU director, Nicholas Mays.

When PIRU was set up three and a half years ago, there was a great deal of ambition on all sides. The Department of Health, as funder, wanted us ‘to strengthen the use of evidence in the initial stages of policy making’. That was the distinctive, exciting bit for us. We were to support or undertake evaluation of policy pilots or demonstration initiatives across all aspects of the Department’s policy activity – public health, health services and social care.

We were also brave, seeking to ‘co-produce’ evidence by working closely with policy advisors and officials, aiming to break down conventional sequences in which evaluation tends to follow policy development. We wanted early involvement from horizon scanning to innovation design and implementation design, plus support work for evaluations or to do them ourselves. It was clear that if we could be engaged, flexible and responsive, officials would be more likely to work with us.

Some researchers prefer planned, longer term work. They see the responsive element as regrettably necessary to pay the mortgage. In fact, our more responsive work has often turned out to be the most interesting:  some of it we would probably have planned to do in any case; other parts have led to substantial pieces of research. It can be highly productive, not least because policy advisors are fired up about the findings.

Wide-ranging roles

In our first years, we have tried hard to work across all stages of policy development. To support the early stages of policy innovation, we did some rapid evidence syntheses.  We have advised on the feasibility of a number of potential evaluations – for example, we looked at the Innovation Health and Wealth Strategy to examine which of the strategy’s 26 actions could credibly be evaluated. We have advised on the commissioning and management of early stage policy evaluations. We have also helped define more precisely what the intervention is in a particular pilot because, in pilot schemes or demonstrations, the ‘what’ is often presumed, but can actually be rather unclear.

We had expected to guide roll-out, using the learning from evaluations, but that’s not always easy for academic evaluators. PIRU often works with different parts of the social care and health policy system, perhaps for quite short periods of time, which is a very different relationship from working, say, with clinicians for an extended period.  Also, in policy and management, unlike the clinical world, people change jobs fairly frequently, making it difficult to sustain relationships.

We have also advised on modelling and simulation, which is useful for playing out possible effects of innovations and to debate potential designs. However, that work typically tends to happen within government rather than through outsiders such as PIRU.

Challenges

Indeed, we have found it difficult to become involved in the early stages of policy development, partly because health and social policy decision-making in England has been restructured and become more complicated as a result of the Health and Social Care Act 2012. There are new agencies and new people, altering long-established relationships between policy makers and evaluators.

Engaging us early on is also demanding. It requires greater openness and communication within government, so that research managers actually know when an initiative is starting, and a willingness to share early intelligence with outsiders in the research community. Some policy makers also find that the perceived benefits of sharing new thinking with us fail to outweigh the perceived risks of having us at the table early on.

Dilemmas

There have been other big issues. How close should evaluators get to those who commission an evaluation? How candid – and sometimes negative – should we be?  Should we refuse to do an impact evaluation because we know that too little time will be allowed to elapse to demonstrate a difference?  Should we actively create dissonance with customers who are also funders through a process of constructive challenge? Strangely, the researchers are sometimes the ones saying, ‘No, we should not be looking at outcomes. You are better doing a process evaluation or no evaluation at this stage.’ In some cases, the researchers are asking for less evaluation and the policy makers are asking for more.

Can it be predicted that certain pilots do not realistically lend themselves to being evaluated? For example, we conducted a study of a pilot scheme allowing patients to either visit or register with GP practices outside the area in which they live.  We highlighted in our report that we couldn’t look at the full range of impacts in the 12 months for which the pilot ran.  Nevertheless, critics of the policy were annoyed with the evaluation because it was seen to legitimise what was, in their minds, an inadequate pilot of a wrong-headed policy.

We frequently have to say that the policy pilot will take a lot longer than expected to be implemented. However, the commissioners of evaluation often have no time to wait and want the results right away. The danger is that lots of time is spent interviewing people and looking for implementation effects, only to discover that not very much has happened yet.

So we face many challenges. But that’s hardly surprising. In an ideal world, we would have closer sets of relationships with a defined set of potential users. In reality, we are working across a very wide range of policy issues with an overriding expectation that we should engage at an early stage and speedily. It’s a difficult but rewarding balancing act.

Nicholas Mays is Professor of Health Policy at the London School of Hygiene and Tropical Medicine and Director of PIRU. This piece is based on a presentation that Professor Mays gave at the meeting ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014, organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).

 

Evaluations should share interim findings only when pieces of research are completed and on terms that are agreed in advance

by oona campbell

A global maternal health initiative that could save thousands of lives has highlighted dilemmas for those assessing its performance, says Oona Campbell

 

 It is difficult to over-estimate the urgency of improving maternal health in developing countries. Women die in childbirth or from complications during pregnancy, day in and day out. Some 99 per cent of maternal deaths are in the developing world – these tragic mortality figures are the public health indicator showing the greatest gulf between rich and poor countries. Most of these deaths occur during labour or within 24 hours after delivery, typically because of excessive bleeding.

So it is extremely important to us to have been asked to evaluate MSD’s 10 year, $500 million MSD for Mothers initiative, designed to create a world where no woman dies giving life.  As the team chosen to evaluate parts of the initiative, we had to think about which interim findings to share, when and how and with whom. We appreciate the importance of communicating findings quickly, but it is vital that evaluation is independent and that learning is robust. We wonder whether communicating too soon or too frequently would undermine independence.  Right now, our view is that we should not wait until the study ends to detail some interim findings. But we plan to share only completed pieces of research with clear protocols and objectives.

Why evaluate?

MSD for Mothers focuses on two leading causes of maternal mortality – post-partum haemorrhaging and pre-eclampsia. There are a number of priority countries – among them we focus on work in India and Uganda. It is a big initiative, with multiple pillars: product innovation, global awareness and advocacy and also numerous projects aiming to improve access to affordable, quality care for women.

Why was the company interested in having an evaluation? They sought an independent assessment of their contribution to maternal mortality reduction and to identifying sustainable solutions. They wanted guidance on their existing strategy and to ensure they were investing in high-impact programmes. From the policy perspective, the aim is to contribute to the evidence base for better decision-making globally and to have robust research available through publications in peer-reviewed journals.

The difficult issue for us was how we might contribute to guiding the existing strategy, ensuring investment in high impact potential projects. When do we provide input? What do we do? How do we do it? Does this affect our ability to be independent? If we get that involved in programme design, will it affect our ability to do robust evaluation? How do we work with the implementers who are actually doing the projects? Will they continue to work with us if we share interim findings? How does all this affect our ability to be relevant?

Trying to be helpful

Our initial thought was that we wanted our evaluation to be used. Too few resources go into women’s maternal health in low-income countries, so we certainly did not want to say, at the end of 10 years: ‘No, it didn’t work.’ We wanted to maintain a dialogue with policy makers, commissioners of research and the implementers. As Tom Woodcock from NIHR CLAHRC Northwest London has explained, this can be very successful. So there was an assumption that we should be responsive, engaged and give feedback. But when should we do this? What does it mean to be ‘maximally responsive’ and should policy makers – or in this case the funder – have sight of the interim findings?

Our approach

We are using a multi-disciplinary, mixed method approach that tries to capture the scope and range of the activities. Our basic approach is to work with MSD to identify overarching questions and then to work with the implementers, usually non-governmental organisations in specific countries, to understand what they are trying to do. Then we identify projects to evaluate and agree key evaluation questions. Within that, we try to understand exactly what people are doing, the theory of change and how the implementer thinks it is going to work. Where possible, we like to recommend ways to design their implementation that allow for evaluation, but typically that’s not possible. Then we provide technical support to improve the rigour of monitoring in specific projects and we aim to use robust, analytical methods for the independent evaluation, including by gathering further data.

Guidance from the global literature on sharing findings tends to be vague, but does mention sharing interim findings. The Centers for Disease Control and Prevention is probably the most explicit, saying: ‘It’s important to use the findings that you learn all along the way because if we don’t, opportunities are missed, if you wait until the very end of your evaluation to use some of those results. And sometimes those key nuggets of information in terms of interim findings may not necessarily be captured in that final report, so it’s important to use them as you learn about them.’

Clinical trials have formal mechanisms for interim findings.  Data monitoring committees look at elements of implementation such as adequacy of enrolment, as well as trial endpoints and adverse events.  Insights from clinical trials tend to focus on ethical obligations to stop trials early to reduce study participants’ exposure to inferior treatment. But there is also concern that multiple interim analyses of accumulating data can find differences when actually there are none.
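The scale of that risk is easy to demonstrate with a small simulation – purely illustrative, and not drawn from the evaluation described here – in which two arms with no true difference are compared at several interim looks, each tested at p < 0.05, and the proportion of simulated trials producing at least one spuriously ‘significant’ result is counted.

```python
# Illustrative simulation of type I error inflation from repeated, unadjusted
# interim looks at accumulating data when there is no true difference.
import random
from statistics import NormalDist

def false_positive_rate(looks, n_per_arm=500, alpha=0.05, trials=2000, seed=7):
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_arm // looks
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n_per_arm)]   # control arm
        b = [rng.gauss(0, 1) for _ in range(n_per_arm)]   # identical 'treatment' arm
        for look in range(1, looks + 1):
            n = look * step
            diff = sum(b[:n]) / n - sum(a[:n]) / n
            z = diff / (2.0 / n) ** 0.5                   # known unit variance per arm
            if abs(z) > z_crit:                           # 'significant' at this look
                hits += 1
                break
    return hits / trials

for looks in (1, 3, 5):
    print(f"{looks} look(s): chance of a false positive ≈ {false_positive_rate(looks):.1%}")
```

With a single final analysis the false-positive rate stays close to the nominal 5 per cent; with repeated unadjusted looks it rises well above that, which is why data monitoring committees rely on formal stopping rules rather than simply re-testing as the data accumulate.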

We wish to learn from these approaches, but in terms of our evaluation, an important consideration is the multiplicity of interventions underway in this wide-ranging programme.  The application of interim findings about simple interventions, such as those usually tested in clinical trials, is more straightforward. Imagine evaluating interventions to reduce the incidence of maternal tetanus. The interventions might be ensuring clean delivery – because an unhygienic birth environment exposes a mother potentially to tetanus spores – plus immunization with sufficient doses of tetanus toxoid to prevent the onset of maternal tetanus.

But what about a complex intervention where you are trying to change maternity care?  A huge range of interventions are required, including, for example, health worker training, changes to ambulance services, accreditation of private providers, behavioural change communication, health insurance etc. This programme might involve a long complex causal chain with feedback loops and multiple groups of individuals. Is an interim finding on one aspect a solid basis for changing the implementation?

Uses of interim findings

There is also a wide variety of potential purposes for interim evaluation. They might include: stopping a complex intervention that is harmful; proclaiming success and rolling the intervention out elsewhere; improving the implementation of the intervention; changing the intervention to bolster failing or problematic elements; ensuring politicians and policy makers remain engaged; and responding to a need for quick results.

We’re trying to better understand how to share interim findings and with whom. Should it be with programme implementers, policy makers, funders or others? Or, perhaps, all of them? Should it be simply findings on implementation or just on outputs and impacts? Our ‘interim conclusion’ on the interim findings is that we certainly do not have to wait until the end of the programme and we do want to communicate some research. But, by and large, we will be clear that these will only be completed pieces of research, with clear protocols and objectives, formally specified before implementation begins.

Dr Oona Campbell is Professor of Epidemiology and Reproductive Health at the London School of Hygiene and Tropical Medicine. This piece is based on a presentation that Professor Campbell gave at the meeting ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014, organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).

Rapid, real-time feedback from evaluation, as well as programme flexibility, is vital to health service improvement

By Tom Woodcock

We should learn from Melanesian Islanders that feedback provides the deep understanding of interventions which is key to wide-scale, successful roll-out, says Tom Woodcock

Osmar White, the Australian war correspondent, told an amazing story about how, after the Second World War, when military bases had closed in the Pacific, Melanesian Islanders built crude imitation landing strips, aircraft and radio equipment, and mimicked the behaviour that they had observed of the military personnel operating them. White’s book, ‘Parliament of a Thousand Tribes’, explains how the islanders were trying to reproduce the glut of goods and products that Japanese and American combatants had brought to the region. They believed that they could actually summon these goods again.

In her recent paper published in Implementation Science*, Professor Mary Dixon-Woods of Leicester University highlights this extraordinary story as a graphic illustration of how an innovation can fail to be replicated successfully in different circumstances when the original intervention is poorly understood. It illuminates the difficulties that can arise when one tries to implement and roll out improvement programmes.  Deep understanding of the intervention is vital.

How do we achieve that understanding? It’s a big issue for NIHR’s Collaboration for Leadership in Applied Health Research and Care (CLAHRC). In Northwest London, we’re funded for the next five years to accelerate the translation of health research into patient care. Our experience is that rapid and continuous real-time feedback from evaluation, combined with flexibility in programme adaptation, is vital to ensure rapid improvement of health service practice. It is also central to meeting the longer-term challenges of achieving sustainability and reproducibility of change.

Challenges of transferring successful interventions

The nature of the challenge was highlighted by the Michigan Central Line Project. This was a highly successful US quality improvement project designed to reduce central line infections. Mortality was reduced significantly. ‘Matching Michigan’ was a subsequent initiative in 200 English hospitals to replicate Michigan’s results. It didn’t work as well as hoped. Drawing parallels with the Melanesian story, Professor Dixon-Woods’ paper argues that the Michigan innovation transfer likewise demonstrated inadequate understanding of the true intervention.

How can real-time evaluation help to avoid these misunderstandings? First, it offers a better chance to optimise interventions in their original settings, as well as in subsequent roll-out sites. Secondly, it can lead to a richer, more realistic understanding of the system and how it works. This can lead, I believe, to a fuller evaluation and more successful transferability.  The opportunity offered by real-time evaluation might be at the level of a specific project implementing an intervention in a specific setting, but its strengths are also useful at higher policy levels and in the support and training levels lying between policy and practice.

Why does testing an intervention in situ with real-time evaluative feedback produce a better eventual implementation? Partly because it allows the intervention to be fitted to its context effectively. The project team gain much better insight into what is actually happening during implementation, which is sometimes highly complex, making it easy to miss key aspects of what is occurring. There can also be early checks on the intended impacts – if an intervention is being implemented successfully but not improving outcomes, there are statistical approaches that allow evaluators to explore the reasons quickly and take appropriate action. Feedback also increases motivation and engagement within the initiative, encouraging reflective thought.
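One family of approaches often used in improvement work for this kind of rapid checking is statistical process control. The sketch below is a minimal, hypothetical example (not CLAHRC NWL’s actual tooling or data) of a p-chart that flags weeks whose performance falls outside expected limits, so that the team can investigate quickly.

    # Hypothetical weekly audit data: (patients audited, patients given
    # correct inhaler training). A p-chart separates ordinary week-to-week
    # variation from signals worth investigating in near real time.
    import math

    weeks = [(40, 28), (38, 25), (42, 30), (41, 29), (39, 16), (40, 31), (43, 33)]

    total_n = sum(n for n, _ in weeks)
    total_x = sum(x for _, x in weeks)
    p_bar = total_x / total_n  # centre line: overall proportion trained correctly

    for i, (n, x) in enumerate(weeks, start=1):
        p = x / n
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        lower, upper = max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)
        signal = "investigate" if p < lower or p > upper else "common-cause variation"
        print(f"week {i}: p = {p:.2f} (limits {lower:.2f}-{upper:.2f}) -> {signal}")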

A closer working relationship between evaluators and the team can expose underlying assumptions within an intervention which might otherwise be obscured. Typically, members of the team also come to appreciate the value of evaluation more, leading them to produce higher-quality data. Team challenges to the data – observations that ‘this does not make sense to me’ – can be illuminating and help create both between-site and within-site consistency. In her ‘Matching Michigan’ study, Mary Dixon-Woods highlights huge inconsistencies between the data collected in different sites despite each site supposedly working to an agreed, common operational framework.  Achieving such consistency is extremely difficult.  Close working between the evaluation and implementation teams can help, and it provides greater access to the mechanisms by which an intervention works. It also offers a lot of information about the sensitivity and specificity of measures.

Challenges of real-time evaluation

Real-time feedback and evaluation do have problems: they are more resource-intensive and can blur the lines between an evaluation and the intervention itself. There are methodological challenges – if early feedback prompts a responsive change to the intervention, then the evaluation is, in theory, dealing with a different intervention from the one it began to examine.  Inevitably, there are also questions about the impartiality of evaluators who work very closely with the implementation team.

At CLAHRC Northwest London, we reckon that the increased costs of real-time feedback are more than outweighed by the benefits.  It helps that interactive feedback, by its very nature, implies starting on a smaller scale: an initial programme can build in the feedback, and its findings can then be used to inform roll-out.

It is vital to clarify the intervention.  Laura J Damschroder’s 2009 paper**, published in Implementation Science, reviews the literature to articulate a framework distinguishing the core of an intervention from its periphery. The core represents the defining characteristics, which should be the same wherever the intervention is implemented; around it sits a flexible, context-sensitive periphery.

Concerns about compromising objectivity are essentially addressed by planning carefully, delivering against the protocol, and then justifying and accurately reporting any additional analyses or modifications, so that anyone reading the evaluation understands what was planned originally and what was added as part of the interactive feedback.

Typically, people tend to think of two distinct processes – implementation and evaluation. In CLAHRC NWL, there is much more overlap.  The CLAHRC NWL support team essentially perform an evaluative role and attend implementation team meetings to provide real-time evaluation feedback on the project measures. Biannually, Professor James Barlow and his team at Imperial College London evaluate the CLAHRC NWL programme, predominantly at higher levels, but there is still an interactive process going on.

Clarity about interventions

Take, for example, our programme to improve the management of chronic obstructive pulmonary disease (COPD).  There are some high-level factors that we wish to influence by implementing the intervention, including reduced patient smoking, increased capacity to use inhalers properly when patients are out of hospital, and better general fitness and levels of exercise. There is then a whole series of interventions, ranging from the general availability of correct inhaler advice to much more specific provision of specialist staff education sessions that improve staff’s ability to train patients in inhaler technique. This is a useful way of separating the core of the intervention from the periphery – the more one is discussing generalities, the closer one is to the core of the intervention, whereas detailed, particular measures are more sensitive to local context. So, for example, it may be that one hospital already has an embedded staff training programme on inhaler technique, making it unnecessary to implement this peripheral element there.
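As an illustration only (the elements below are hypothetical simplifications, not CLAHRC NWL’s actual programme definition), the core/periphery distinction can be made concrete by recording which elements are fixed everywhere and which are adapted or skipped depending on local context.

    # Hypothetical sketch of separating an intervention's core from its periphery.
    core_elements = [
        "patients receive correct inhaler advice",
        "smoking cessation support is offered",
        "exercise and general fitness are promoted",
    ]

    # Peripheral elements are context-sensitive: implement only where needed.
    def build_plan(site):
        plan = list(core_elements)  # the core is the same everywhere
        if not site.get("has_inhaler_training_programme", False):
            plan.append("run staff education sessions on inhaler technique")
        return plan

    for site in [
        {"name": "Hospital A", "has_inhaler_training_programme": True},
        {"name": "Hospital B", "has_inhaler_training_programme": False},
    ]:
        print(site["name"], "->", build_plan(site))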

Implementation is clearly complex. Real-time feedback, I believe, can help improvement programmes to develop and to be implemented successfully.  It can also make for a better evaluation, but that requires very particular approaches to ensure rigour.

Dr Tom Woodcock is Head of Information at NIHR CLAHRC Northwest London and Health Foundation Improvement Science Fellow. This piece is based on a presentation that Dr Woodcock gave at the meeting ‘Evaluation – making it timely, useful, independent and rigorous’ on 4 July 2014, organised by PIRU at the London School of Hygiene and Tropical Medicine, in association with the NIHR School for Public Health Research and the Public Health Research Consortium (PHRC).

* Dixon-Woods, M. et al (2013) “Explaining Matching Michigan: an ethnographic study of a patient safety program”, Implementation Science 8:70.

** Damschroder, L.J. (2009) “Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science”, Implementation Science 4:50.

Let’s be honest that pilots are not just about testing: they’re also about engineering the politics of change

By Stefanie Ettelt

There is more to policy piloting than evaluation – piloting is a policy tool in itself, not only a means for conducting research, says Stefanie Ettelt

Pilot evaluation tends to frustrate and disappoint some or all of its stakeholders, be they policy-makers, local implementers or evaluators, according to a study I have been working on for PIRU. Policy makers typically want robust, defensible proofs of success, ideally peppered with useful tips to avoid roll-out embarrassments. But they are distinctly uncomfortable with potentially negative or politically damaging conclusions that can also spring from rigorous evaluation.

Meanwhile, implementers of pilots at a local level don’t welcome the ambivalence that evaluation suggests, particularly when randomised controlled trials (RCTs) are used, given the associated assumption of uncertain outcome (equipoise). Implementers understandably worry that all their hard work putting change into action might turn out to have been a waste of time, producing insufficient improvement and leading to a programme being scrapped.

The evaluators may prefer a more nuanced approach than either of the above want, in order to capture the complex results and uncertainties of change. But this approach might find little favour with those commissioning the work.  Evaluators are often dissatisfied with the narrow time frames and limited sets of questions that are allowed for their investigations. They may feel tasked with gathering what they consider to be over-simplistic measures of success as well as being disappointed to discover that a roll-out has begun regardless of –  or even in advance of – their findings.

Keeping all of these stakeholders happy is a big ask. It’s probably impossible, not least because satisfying any one of them may preclude contentment among the others.  Why do we find ourselves in such a difficult situation?

Why is it so hard to satisfy everyone about pilots?

Perhaps this tricky issue is linked to the particular way in which British policy-making is institutionalised. These days, policy-making in the UK seems to be less ideologically driven – or at least less supported by ideology – than it was in the past. With this loss of some ideological defences have also gone some of the perceived – albeit sometimes flawed – certainties that may once have protected policies from criticism. As a result, there are sometimes overblown expectations of research evidence in the UK and sometimes illusory beliefs that evidence can create new certainties.

The institutional design of the Westminster system perhaps invites excessive expectations that policy can be highly informed by evidence, because political centralisation means that there seem to be fewer actors who can veto decisions than in some other countries, for example Germany.  There are more regional players in Germany’s federal system who can veto, obstruct or influence a decision. Relatively minor coalition partners in Berlin also have a long-standing tradition of providing strong checks and balances on the larger governing party. So, in Germany, there is more need for consensus and agreement at the initial policy-making stage. This participative process tends to reduce expectations of what a policy can deliver and also, perhaps, the importance of evidence in legitimising that policy.

Britain compared with Germany

In contrast, the comparatively centralised Westminster system seems more prone to making exaggerated claims for policy development and more in need of other sources of legitimacy. Piloting may, thus, at times become a proxy for consensus policy-making and a means of securing credibility for decisions. It might help to reduce expectations, and thus avoid frustration, if policy makers were clearer about their rationale for piloting. So, for example, they might explain whether a pilot is designed to promote policy or to question if the policy is actually a good way forward. If the core purpose is to promote policy, then some forms of evaluation such as RCTs may be inappropriate.

Evaluators understandably find it difficult to accept that the purpose of piloting and evaluation might first and foremost be for policy-makers to demonstrate good policy practice and to confirm prior judgements (i.e. a ‘symbolic’ purpose). But there should be recognition that piloting sometimes does have such a political character, which is genuinely distinct from a purely evaluative role.

Of course, such a distinction is not made any easier by policy makers who tend to use rhetoric such as ‘what works’ and ‘policy trials/experiments’ when they already know that the purpose of the exercise is simply to affirm what they are doing. If policy makers – including politicians and civil servants – use such language, they really are inviting, and should be prepared to accept, robust evaluation and acknowledge that sometimes the findings will be negative and uncomfortable for them.

Improving piloting and evaluation

There are ways in which we can improve evaluation methods to make them more acceptable to all concerned. More attention could be given to identifying the purpose of piloting, to avoid disappointment and to manage the expectations of evaluators, policy-makers and local implementers. If the intention is to promote local and national policy learning, more participation from local implementers in setting the objectives and design of pilot evaluations would be desirable, so that these stakeholders might feel less worried by the process. Evaluators might also be more satisfied with more extensive use of ‘realist evaluation’. This approach particularly explores how context influences the outcomes of an intervention or policy, which is useful information for roll-out.

I would like to see local stakeholders more directly involved in policy-making and their role more institutionalised, so that their involvement would be ongoing and not abandoned if a different incoming government considered it unhelpful. These are roles that need time to grow, to become embedded and for skills to develop.  Such a change would enhance the localism agenda.  It would also acknowledge that local implementers are already key contributors to national policy learning through all the local trial and error that they undertake.

Dr Stefanie Ettelt is a Lecturer in Health Policy at the London School of Hygiene and Tropical Medicine. She contributes to PIRU through her work on piloting and through participating in the evaluation of the Direct Payments in Residential Care Trailblazers. She also currently explores the role of evidence in health policy, comparing England and Germany, as part of the “Getting evidence into policy” project at the LSHTM.

Follow Africa’s lead in meticulous evaluation of P4P schemes for healthcare

By Mylene Lagarde

Working with researchers to evaluate the introduction of financial incentives in developed healthcare economies would yield vital knowledge, explains Mylene Lagarde

The jury is very much out on pay-for-performance (P4P) schemes in healthcare – at least as far as the research community is concerned. Many unanswered questions remain over their effectiveness and hidden costs, as well as their potential unintended consequences and their merit relative to other possible approaches. Yet many policy-makers seem to have made up their minds already. These schemes, which link financial rewards to healthcare performance, make sense intuitively. They are being introduced widely.

This disconnection between the research and policy-making worlds means that we are almost certainly not getting the best out of P4P initiatives. Perhaps more worrying, there is a danger that the tree will hide the forest – that the attractive, sometimes faddish, simplicity of pay-for-performance may obscure other, perhaps more complicated but possibly more cost-effective, ways to improve healthcare. As systems struggle to reconfigure themselves to address modern demographics and disease profiles, and to harness the latest technologies, we need to know what works best to reshape behaviours.

There are three key issues that weaken the case for P4P in healthcare, as we set out in the PIRU report “Challenges of payment-for-performance in health care and other public services – design, implementation and evaluation”. These concern the lack of evidence about the costs of P4P schemes, about their effectiveness, and about which particular P4P designs may work better than others.

First, costs. P4P schemes are complex to design. They usually involve lots of preliminary meetings between the many participants. Yet studies have largely ignored these transaction costs and frequently also fail to track and record carefully the considerable costs of monitoring performance.

Second, the effectiveness of P4P is often impossible to assess with enough certainty. Typically, the introduction of a new scheme does not include a control group. For example, if a scheme incentivises reduced hospital length of stay or fewer emergency admissions for one hospital, it may be difficult to find a comparable hospital to serve as a counterfactual. That makes it harder to attribute a particular change to P4P – maybe it would have happened anyway.
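A minimal numerical sketch, using invented figures, shows why a comparator matters: a simple difference-in-differences comparison strips out the change that would have happened anyway before attributing the remainder to P4P.

    # Hypothetical average lengths of stay (days), before and after a P4P scheme.
    p4p_before, p4p_after = 6.8, 5.9      # hospital under the scheme
    ctrl_before, ctrl_after = 6.9, 6.5    # comparable hospital without the scheme

    naive_change = p4p_after - p4p_before            # -0.9 days: over-credits P4P
    background_trend = ctrl_after - ctrl_before      # -0.4 days: would have happened anyway
    did_estimate = naive_change - background_trend   # -0.5 days plausibly attributable to P4P

    print(f"Naive before/after change:          {naive_change:+.1f} days")
    print(f"Change in the comparison hospital:  {background_trend:+.1f} days")
    print(f"Difference-in-differences estimate: {did_estimate:+.1f} days")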

Furthermore, only small sets of outcomes are usually monitored by P4P schemes, so evaluators may be left with a narrow, and thus weak, selection of effects. For example, reductions in hospital lengths of stay may be identified, but these may coincide with poorer outcomes elsewhere in the system, such as increased admissions to nursing homes. Data on these unintended effects, which may reflect a shift rather than a reduction in costs and problems, are often not collected by the programme. That makes whole-system analysis difficult.

Third, P4P is not a single, uni-dimensional intervention. It is a family of interventions. They are all based on the premise that financial incentives can support change, but there are many variables: the size of the reward; how frequently it is offered; whether it is focussed on relative or absolute targets; and whether it is linked to competition between providers or awarded universally. Very often, one type of intervention is used when another might equally well be employed. Each variation can produce different results, yet we still know little about the relative performance of alternative designs for these incentive schemes.

Researchers are not completely in the dark about P4P in healthcare. We are beginning to understand factors that characterise successful schemes. These typically involve a long lead-in time to plan, test and reflect carefully on the different elements of a programme. However, we must strengthen evaluation.

The first step would be to involve researchers at an early stage of the programme design. That’s the moment to spot where in the system you might need data to be collected. It’s also the time to identify control groups so that the causal impacts of these programmes can eventually be attributed more confidently.

Good evaluation requires political willingness to evaluate, which is sometimes lacking. When an initiative has a political breeze behind it, policy-makers worry that researchers will let the wind out of its sails. But some low- and middle-income countries are taking the risk. A large number of randomised controlled trials looking at the effects of P4P schemes have been conducted in African countries over the last few years. Most are ongoing but, so far, the evidence is promising. Rwanda was one of the first African countries to evaluate these financial incentives, mainly for increasing the uptake of primary healthcare. Its programme is now being scaled up.

Why is Africa leading the way in setting high standards for P4P evaluation? Because the funders of these schemes, typically external donors (e.g. the World Bank, DfID, USAID), are well placed to demand meticulous evaluation by the receiving governmental authorities as a condition for the cash. Researchers, particularly in developed countries, rarely enjoy such firm leverage over national policy-makers. And policy-makers in developed countries do not apply to themselves the degree of scrutiny that they exercise over international aid recipients. Yet, if we are to get the best out of P4P – and not attach potentially false hopes to this healthcare innovation – we need more of the disciplined approach currently being used in Africa.

Dr Mylene Lagarde is a Senior Lecturer in Health Economics at the London School of Hygiene and Tropical Medicine. “Challenges of payment-for-performance in health care and other public services – design, implementation and evaluation” by Mylene Lagarde, Michael Wright, Julie Nossiter and Nicholas Mays is published by PIRU.