GiveWell’s Top Charities Are (Increasingly) Hard to Beat

In this post, “we” refers to Good Ventures and the Open Philanthropy Project, who work as partners.

Our thinking on prioritizing across different causes has evolved as we’ve made more grants. This post explores one aspect of that: the high bar set by the best global health and development interventions, and what we’re learning about the relative performance of some of our other grantmaking areas that seek to help people today. To summarize:

When we were getting started, we used unconditional cash transfers to people living in extreme poverty (a program run by GiveDirectly) as “the bar” for being willing to make a grant, on the grounds that such giving was quite cost-effective and likely extremely scalable and persistently available, so we should not generally make grants that we expected to achieve less benefit per dollar than that. Based on the roughly 100-to-1 ratio of average consumption between the average American and GiveDirectly cash transfer recipients, and a logarithmic model of the utility of money, we call this the “100x bar.” So if we are giving to, e.g., encourage policies that increase incomes for average Americans, we need to increase them by $100 for every $1 we spend to get as much benefit as just giving that $1 directly to GiveDirectly recipients.
GiveWell (which we used to be part of and remain closely affiliated with) has continued to find more and larger opportunities over time, and become more optimistic about finding more cost-effective ones in the future. This has the implication that we should raise “the bar” to something closer to the current estimated cost-effectiveness of GiveWell’s unfunded top charities, which they believe to be in the range of 5-15x more cost-effective than 100x cash transfers, meaning a bar of benefits 500-1,500 times the cost of our grants (which we approximate to a “1,000x bar”).
Since adopting cash transfers as the relevant benchmark for our giving aimed at helping people alive today, we’ve given ~$100M in U.S. policy, ~$100M in scientific research, and ~$300M based on GiveWell recommendations. According to our extremely rough internal calculations, we do expect many of our grants in scientific research and U.S. policy to exceed the “100x bar” represented by unconditional cash transfers, but relatively few to clear a “1,000x bar” roughly corresponding to high-end estimates for GiveWell’s unfunded top charities. This would imply that it’s quite difficult/rare for work in these categories to look more cost-effective than GiveWell’s top charities. However, these calculations are extraordinarily rough and uncertain.
In spite of these calculations, we think there are some good arguments to consider in favor of our current grantmaking in these areas.
We continue to think it is likely that there are causes aimed at helping people today (potentially including our current ones) that could be more cost-effective than GiveWell’s top charities, and we are hiring researchers to work on finding and evaluating them.
We are still thinking through the balance of these considerations. We are not planning any rapid changes in direction.

Cash transfers to people in extreme poverty

In 2015, when we were still part of GiveWell, we wrote:

By default, we feel that any given grant of $X should look significantly better than making direct cash transfers (totaling $X) to people who are extremely low-income by global standards – abbreviated as “direct cash transfers.” We believe it will be possible to give away very large amounts, at any point in the next couple of decades, via direct cash transfers, so any grant that doesn’t meet this bar seems unlikely to be worth making….

It’s possible that this standard is too lax, since we might find plenty of giving opportunities in the future that are much stronger than direct cash transfers. However, at this early stage, it isn’t obvious how we will find several billion dollars’ worth of such opportunities, and so – as long as total giving remains within the … budget – we prefer to err on the side of recommending grants when we’ve completed an investigation and when they look substantially better than direct cash transfers.

It is, of course, often extremely unclear how to compare the good accomplished by a given grant to the good accomplished by direct cash transfers. Sometimes we will be able to do a rough quantitative estimate to determine whether a given grant looks much better, much worse or within the margin of error. (In the case of our top charities, we think that donations to AMF, SCI and Deworm the World look substantially better.) Other times we may have little to go on for making the comparison other than intuition. Still, thinking about the comparison can be informative. For example, when considering grants that will primarily benefit people in the U.S. (such as supporting work on criminal justice reform), benchmarking to direct cash transfers can be a fairly high standard. Based on the idea that the value of additional money is roughly proportional to the logarithm of income, ^[1] and the fact that mean American income is around 100x annual consumption for GiveDirectly recipients, we assume that a given dollar is worth ~100x as much to a GiveDirectly recipient as to the average American. Thus, in considering grants that primarily benefit Americans, we look for a better than “100x return” in financial terms (e.g. increased income). Of course, there are always huge amounts of uncertainty in these comparisons, and we try not to take them too literally.

To walk through the logic of how this generates a “100x” bar a bit more clearly:

We want to be able to compare philanthropic opportunities that will save the U.S. or state governments money, or increase incomes for average Americans, against opportunities to directly help the global poor (or deliver other benefits) in a somewhat consistent fashion. For instance, we could imagine a hypothetical domestic advocacy opportunity that might be able to save the government $100 million, or increase productivity by $100 million, for a cost of $1 million; we would call that opportunity roughly “100x” because the benefit in terms of income to the average American is $100 for every $1 we spend.^[2] If we just directly gave a random person in the U.S. $1,000, we’d expect to get “1x” because the benefit to them is equal to the cost to us (ignoring transaction costs). That is, we take our core unit of measurement for this exercise as “dollars to the average American.” Then we face the question: how should we compare transfers to the global poor (or other programs) to transfers to the average American?
GiveWell reports that the income of GiveDirectly recipients averages $0.79 per day^[3] — so approximately $290 per person per year, compared to more than $34,000 per capita per year in the U.S.^[4] This means $34,000 could double one individual’s income for a year in the U.S., or (after ~10% overhead is taken out) double the income of about 106 GiveDirectly recipients for a year.^[5]
In this context we assume a logarithmic utility function for income, which is a fairly common simplification and assumes that doubling a person’s income contributes the same amount to their well-being regardless of how much income they started with. We think this is a plausible starting point based on evidence from life satisfaction surveys.^[6] However, it is worth noting that there are credible arguments that a logarithmic utility function places either too much or too little weight on income at the high end.^[7]
A logarithmic utility function implies that $1 for someone with 100x less income/consumption is worth 100x as much. This implies direct cash transfers to the extreme global poor go about 100x as far as the same money spent in the U.S., on average, and means any potential grant should create an expected value at least 100x the cost of the grant if it is to be considered a better use of money than such direct cash transfers.
With other causes, in addition to looking at monetary savings or gains, we also use “value of a statistical life” techniques to try to account for health and quality-of-life benefits. That yields more cost-effectiveness estimates, all generally framed in the language of “This seems roughly as good as saving an average American $N for each $1 we spend” or simply “Nx.”
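
The arithmetic behind the steps above can be sketched in a few lines of Python. This is only an illustration under the post’s own simplifying assumptions (a logarithmic utility function, the $0.79/day and $34,000 figures, and ~10% overhead); the variable and function names are illustrative and not part of any GiveWell or Open Philanthropy model.

```python
import math

# Figures quoted in the post; the names below are illustrative assumptions.
GD_DAILY_INCOME = 0.79      # USD per person per day for GiveDirectly recipients
US_ANNUAL_INCOME = 34_000   # USD per capita per year (rounded, as in the post)
OVERHEAD = 0.10             # ~10% of each donated dollar does not reach recipients

gd_annual_income = GD_DAILY_INCOME * 365  # about $288 per year

# Donation needed to double one recipient's income for a year, after overhead.
cost_to_double_one_recipient = gd_annual_income / (1 - OVERHEAD)

# Number of recipients whose income $34,000 could double for a year.
recipients_doubled = US_ANNUAL_INCOME / cost_to_double_one_recipient


def log_utility_gain(income: float, transfer: float) -> float:
    """Utility gain from a transfer under u(income) = ln(income)."""
    return math.log(income + transfer) - math.log(income)


# Under log utility, the value of one extra dollar scales roughly with 1/income,
# so the ratio of gains is close to the ratio of incomes.
marginal_ratio = log_utility_gain(gd_annual_income, 1.0) / log_utility_gain(
    US_ANNUAL_INCOME, 1.0
)

print(f"Recipients doubled per $34,000: {recipients_doubled:.0f}")
print(f"Value of $1 to a recipient vs. the average American: {marginal_ratio:.0f}x")
```

Both numbers land a little above 100 (roughly 106 after overhead, and somewhat higher before it), which is consistent with rounding to a “100x bar.”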

Obviously, calculations like this remain deeply uncertain and vulnerable to large mistakes, so we try to not put too much weight on them in any one case. But the general reality that they reflect — of vast global inequalities, and the relative ease of moving money from people who have a lot of it to people who have little – seems quite robust.

Although we stopped formally using this 100x benchmark across all of our giving a couple of years ago because of considerations relating to animals and future generations, we have continued to find it a useful benchmark against which “near-termist, human-centric” grants – those that aim to improve the lives of humans on a relatively short time horizon, including a mix of direct aid, policy work, and scientific research – can be measured.

The best programs are even harder to beat

In 2015, when we first wrote about adopting the cash transfer benchmark, it looked like GiveWell could plausibly “run out” of their more-cost-effective-than-cash giving opportunities. At the time, they had three non-cash-transfer top charities they estimated to be in the 5-10x cash range (i.e., 5 to 10 times more cost-effective than cash transfers),^[8] with ~$145 million of estimated short-term room for more funding. That, plus uncertainty about the amount of weight to put on these figures, led us to adopt the cash transfer benchmark. (In the remainder of this post, I occasionally shorten “cash transfer” to just “cash.”) But by the end of 2018, GiveWell had expanded to seven non-cash-transfer top charities estimated to be in the ~5-15x cash range, with $290 million of estimated short-term room for more funding, and with the top recommended unfilled gaps at ~8x cash transfers.^[9] If we combine cash transfers at “100x” and large unfilled opportunities at ~5-15x cash transfers, the relevant “bar to beat” going forward may be more like 500-1,500x.^[10] And earlier this year GiveWell suggested that they expected to find more cost-effective opportunities in the future, and they are staffing up in order to do so.
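
As a rough sketch of that multiplication (the function name is illustrative, and the inputs are simply the estimates quoted above):

```python
CASH_BAR = 100  # cash transfers valued at ~100x "dollars to the average American"


def bar_from_cash_multiple(cash_multiple: float, cash_bar: float = CASH_BAR) -> float:
    """Convert an 'N times cash' estimate into the post's 'Nx' units."""
    return cash_multiple * cash_bar


# 5x and 15x bound the quoted range; 8x is the quoted top unfilled gap.
for multiple in (5, 8, 15):
    print(f"{multiple}x cash -> {bar_from_cash_multiple(multiple):,.0f}x")
```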

Another approach to this question is to ask, how much better than direct cash transfers should we expect the best underfunded interventions to be? I find scalable interventions worth ~5-15x cash a bit surprising, but not wildly so. It’s not obvious where to look for a prior on this point, and it seems to correlate strongly with general views about broad market efficiency: if you think broad “markets for doing good” are efficient, finding a scalable ~5-15x baseline intervention might be especially surprising; conversely if you think markets for doing good are riddled with inefficiencies, you might expect to find many even more cost-effective opportunities.

One place to potentially look for priors on this point might be compilations of the cost-effectiveness of various evidence-based interventions. I know of five compilations of the cost-effectiveness of different interventions within a given domain that contain easily available tabulations of the interventions reviewed:^[11]

The Washington State Institute for Public Policy benefit-costs results database (archive), focused on U.S. social policies.
Two reviews of public health interventions considered by the UK’s National Institute for Health and Care Excellence (NICE).
The Disease Control Priorities report 2nd Edition (archive), focused on global health interventions.
The Disease Control Priorities report 3rd Edition (archive), focused on global health interventions.
WHO Choice results (archive) for the AFR E region (archive), focused on global health interventions.

For this purpose, I was just curious about the general distribution of the estimates. I didn’t attempt to verify any of them, and I was very rough in discarding estimates that were negative or didn’t have numerical answers, which may bias my conclusions. In general, we regard the calculations included in these compilations as challenging and error-prone, and we would caution against over-reliance on them.^[12]

I made a sheet summarizing the sources’ estimates here. All five distributions appear to be (very roughly) log-normal, with standard deviations of ~0.7-1 in base-10 log terms, implying that a one-standard-deviation increase in cost-effectiveness would equate to a 5-10x improvement. However, any errors in these calculations would typically inflate that figure, and we think they are structurally highly error-prone, so these standard deviations likely substantially overstate the true ones.^[13]
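
To see how a log-scale standard deviation maps to a multiplicative factor, here is a minimal sketch. It assumes the standard deviations are in base-10 log units, the reading under which ~0.7-1 corresponds to ~5-10x; the function name is illustrative.

```python
def fold_change_per_sd(log10_sd: float, n_sd: float = 1.0) -> float:
    """Multiplicative change in cost-effectiveness for a move of n_sd standard deviations."""
    return 10 ** (log10_sd * n_sd)


# Assumption: SDs are measured in log10 units of cost-effectiveness.
for sd in (0.7, 1.0):
    print(f"log10 SD {sd}: one SD = {fold_change_per_sd(sd):.1f}x")
```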

We don’t know what the mean of the true distribution of cost-effectiveness of global development opportunities might be, but assuming it’s not more than a few times different from cash transfers (in either direction), and that measurement error doesn’t make up more than half of the variance in the cost-effectiveness compilations reviewed above (a non-trivial assumption), then these figures imply we shouldn’t be too surprised to see top opportunities ~5-15x cash. A normal distribution would imply that an opportunity two standard deviations above the mean is in the ~98th percentile. These figures would support more skepticism towards an opportunity from the same rough distribution (evidence-based global health interventions) that is claimed to be even more cost-effective (e.g., 100x or 1,000x cash rather than 10x).
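
One way to make this reasoning concrete, as a sketch under stated assumptions: if measurement error is independent and additive in log space, and contributes some share of the observed variance, then the true standard deviation is the observed one scaled by the square root of the remaining share. The code below applies that with half the variance attributed to error, treats the standard deviations as base-10 logs, and uses Python’s standard-library `NormalDist` for the percentile. The function names are illustrative.

```python
import math
from statistics import NormalDist


def true_log10_sd(observed_sd: float, error_share_of_variance: float) -> float:
    """SD left after removing independent measurement error (additive in log space)."""
    return observed_sd * math.sqrt(1.0 - error_share_of_variance)


def multiple_of_center(log10_sd: float, z: float) -> float:
    """How many times the center of the distribution an opportunity z SDs above it is."""
    return 10 ** (log10_sd * z)


# Share of a normal distribution lying below two standard deviations above the mean.
percentile_at_2_sd = NormalDist().cdf(2.0)
print(f"Two SDs above the mean: {percentile_at_2_sd:.1%} percentile")

# Assumption: half of the observed variance is measurement error.
for observed in (0.7, 1.0):
    sd = true_log10_sd(observed, error_share_of_variance=0.5)
    print(f"observed SD {observed}: two SDs up = {multiple_of_center(sd, 2.0):.0f}x the center")
```

Under these assumptions, an opportunity two standard deviations above the center comes out at roughly 10-26x the center, so top opportunities at ~5-15x cash would not be especially surprising if the center is within a few times cash transfers.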

Stepping back from the modeling, given the vast difference in treatment costs per person for different interventions (~$5 for bednets, $0.33 to ~$1 for deworming, ~$250 for cash transfers), it does seem plausible to have large (~10x) differences in cost-effectiveness.

Even if scalable global health interventions were much worse than we currently think, and, say, only ~3x as cost-effective as cash transfers, I expect GiveWell’s foray into more leveraged interventions to yield substantial opportunities that are at least several times more cost-effective, pushing back towards ~10x cash transfers as a more relevant future benchmark for unfunded opportunities.

Overall, given that GiveWell’s numbers imply something more like “1,000x” than “100x” for their current unfunded opportunities, that those numbers seem plausible (though by no means ironclad), and that they may find yet-more-cost-effective opportunities in the future, it looks like the relevant “bar to beat” going forward may be more like 1,000x than 100x.

Our other grantmaking aimed at helping people today

While we think a lot of our “near-termist, human-centric” grantmaking clears the 100x bar, we see less evidence that it will clear a ~1,000x bar.

Since we initially adopted the cash transfer benchmark in 2015, we’ve made roughly 300 grants totalling almost $200 million in our near-termist, human-centric focus areas of criminal justice reform, immigration policy, land use reform, macroeconomic stabilization policy, and scientific research. To get a sense of our estimated returns for these grants, we looked at the largest grants and found 33 grants totalling $73M for which the grant investigator conducted an ex ante “back-of-the-envelope-calculation” (“BOTEC”) to roughly estimate the expected cost-effectiveness of the potential grant for Open Philanthropy decision-makers’ consideration.

All of these 33 grants were estimated by their investigator to have an expected cost-effectiveness of at least 100x. This makes sense given the existence of our “100x bar.” Of those 33, only eight grants, representing approximately $32 million, had BOTECs of 1,000x or greater. Our large grant to Target Malaria accounts for more than half of that.

Although we don’t typically make our internal BOTECs public, we compiled a set here (redacted somewhat to protect some grantees’ confidentiality) to give a flavor of what they look like. As you can see, they are exceedingly rough, and take at face value many controversial and uncertain claims (e.g., the cost of a prison-year, the benefit of a new housing unit in a supply-constrained area, the impact of monetary policy on wages, the likely impacts of various other policy changes, stated probabilities of our grantees’ work causing a policy change). We would guess that these uncertainties would generally lead our BOTECs to be over-optimistic (rather than merely adding unbiased noise) for a variety of reasons:

Program officers do the calculations themselves, and generally only do the calculations for grants they’re already inclined to recommend. Even if there’s zero cynicism or intentional manipulation to get “above the bar,” grantmakers (including me) seem likely to be more charitable to their grants than others would be.
Many of these estimates don’t adjust for relatively straightforward considerations that would systematically push towards lower estimated cost-effectiveness, like declining marginal returns to funding at the grantee level, time discounting, or potential non-replicability of the research our policy goals are based on. The comparison with the level of care in the GiveWell cost-effectiveness models on these features is pretty stark.
Holden made some more general arguments along these lines in 2011.

We think it’s notable that despite likely being systematically over-optimistic in this way, it’s still rare for us to find grant opportunities in U.S. policy and scientific research that appear to score better than GiveWell’s top charities. Of course, compared to GiveWell, we make many more grants, to more diverse activities, and with an explicit policy of trying to rely more on program officer judgment than these BOTECs. So the idea that our models look less robust than GiveWell’s is not a surprise — we’ve always expected that to be the case — but combining that with GiveWell’s rising bar is a more substantive update.

Some counter-considerations in favor of our work

As we’re grappling with the considerations above, we don’t want to give short shrift to the arguments in favor of our work. We see two broad categories of arguments in this vein: (a) this work may be substantively better than the BOTECs imply; and (b) it’s a worthwhile experiment.

This work may be better than the BOTECs imply

There are a few big reasons why Open Phil’s near-termist, human-centric work could turn out to be better than implied by the figures above:

Values/moral weights. A logarithmic utility function and view that “all lives have equal value” push strongly towards work focused on the global poor. But many people endorse much flatter utility functions in money and the use of context-specific “value of a statistical life” figures, both of which would make work in the U.S. generally look much more attractive. And of course many people think we have stronger normative obligations to attend to our neighbors and fellow citizens, which would also make our non-GiveWell near-termist work look more valuable (though we have historically been skeptical of such normative views). (You could make similar arguments on instrumental rather than normative grounds too, e.g., by arguing that flow-through effects from work in the U.S. would be larger.) Arguably we should put some weight on ideas like these in our worldview diversification process.
Hits. We are explicitly pursuing a hits-based approach to philanthropy with much of this work, and accordingly might expect just one or two “hits” from our portfolio to carry the whole. In particular, if one or two of our large science grants ended up 10x more cost-effective than GiveWell’s top charities, our portfolio to date would cumulatively come out ahead. In fact, the dollar-weighted average of the 33 BOTECs we collected above is (modestly) above the 1,000x bar, reflecting our ex ante assessment of that possibility. But the concerns about the informational value of those BOTECs remain, and most of our grants seem noticeably less likely to deliver such “hits.”
Mistaken analysis. As we’ve noted, we consider our BOTECs to be extremely rough. We think it’s more likely than not that better-quality BOTECs would make the work discussed above look still weaker, relative to GiveWell top charities – but we are far from certain of this, and it could go either way, especially if our policy reform efforts could contribute meaningfully to “tipping points” that lead to accelerating policy changes in the future.
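
To illustrate how a single “hit” can carry a dollar-weighted average, here is a sketch using purely hypothetical numbers (they are not drawn from our BOTECs or grant records, and the function name is illustrative):

```python
def dollar_weighted_average(grants):
    """grants: list of (amount_in_millions, estimated_multiple) pairs."""
    total = sum(amount for amount, _ in grants)
    return sum(amount * multiple for amount, multiple in grants) / total


# Hypothetical portfolio: $180M of grants at 100x plus one $20M grant at 10,000x
# (10x a 1,000x benchmark). These figures are made up for illustration only.
hypothetical_portfolio = [(180, 100), (20, 10_000)]

print(dollar_weighted_average(hypothetical_portfolio))  # 1090.0
```

In this made-up case, one grant representing a tenth of the dollars is enough to pull the portfolio average just above 1,000x, even though every other grant sits at 100x.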

It’s a worthwhile experiment

Our near-termist, human-centric giving since adopting the cash benchmark can be broken into roughly three groups: ~$100M for U.S. policy, ~$100M for scientific research, and ~$300M based on GiveWell recommendations in global health and development. We think given the amount of giving we anticipate doing in the future, an experimental effort of that scale is worth running. As we’ve discussed before, we see many benefits from giving in multiple different kinds of causes that are not fully captured by the impact of the grants themselves, including:

Learning. If our only giving were in long-termist and animal-oriented causes and to the GiveWell top charities, we think we’d learn a lot less about effective giving writ large and the full suite of tools available to a philanthropist aiming to effect change. We think that would make us less effective overall (though we give limited weight to this consideration).
Developing a concrete track record. We see a lot of value for ourselves and others in working in some areas with relatively short feedback loops, where it’s easier to observe whether our giving is achieving its intended impact. We would like it to be possible for us and others to recognize whether we are achieving any of our desired impacts, and that looks far more likely in our near-termist human-centric causes than in the bulk of the other causes we work on.
Option value. Developing staff capacity to work in many (different kinds of) causes provides the ability to adjust if our desired worldview allocation changes over time (which seems quite possible).
Helping other donors. Giving in a diverse set of causes increases our long-run odds of having large effects on the general dialogue around philanthropy, since we could provide tangibly useful information to a larger set of donors.

We see a number of other practical benefits to working in a broad variety of causes, including presenting an accurate public-facing picture of our values and making our organization a more appealing place to work.

Finally, it is worth noting that while we think GiveWell’s cost-effectiveness estimates are (far) more reliable than the very rough BOTECs we have done, we do not think their estimates (or any cost-effectiveness estimates we’ve ever seen) can be taken literally, or even used with much confidence.

It should be possible to outperform the GiveWell top charities

Although this post describes some doubts about how some of our giving to date may compare to the GiveWell top charities, we continue to think it should be possible to achieve more cost-effective results than the current GiveWell top charities via advocacy or scientific research funding rather than direct services. To the extent that there is a single overarching update here — which we are uncertain about — we think it is likely to be against the possibility of achieving sufficient leverage via advocacy or scientific research aimed at benefiting people in the U.S. or other wealthy countries alone. We have only explored a small portion of the space of possible causes in this broad area, and continue to expect that advocacy or scientific research, perhaps more squarely aimed at the global poor, could have outsized impacts. Indeed, GiveWell seems to agree this is possible, with their expansion into considering advocacy opportunities within global health and development.

As we look more closely at our returns to date and going forward, we’re also interested in exploring other causes that may have especially high returns. One hypothesis we’re interested in exploring is the idea of combining multiple sources of leverage for philanthropic impact (e.g., advocacy, scientific research, helping the global poor) to get more humanitarian impact per dollar (for instance via advocacy around scientific research funding or policies, or scientific research around global health interventions, or policy around global health and development). Additionally, on the advocacy side, we’re interested in exploring opportunities outside the U.S.; we initially focused on U.S. policy for epistemic rather than moral reasons, and expect most of the most promising opportunities to be elsewhere.

If this sounds interesting, you should consider applying: we’re hiring researchers to help.

Conclusion

We are still in the process of thinking through the implications of these claims, and we are not planning any rapid changes to our grantmaking at this time. We currently plan to continue making grants in our current focus areas at approximately the same level as we have for the last few years while we try to come to more confident conclusions about the balance of considerations above. As Holden outlined in a recent blog post, a major priority for the next couple years is building out our impact evaluation function. We expect that will help us develop a more confident read on our impact in our most mature portfolio areas, and accordingly will place us in a better position to approach big programmatic decisions. We will hopefully improve the overall quality of our BOTECs in other ways as well.

If, after building out this impact evaluation function and applying it to our work to date, we decided to substantially reduce or wind down our giving in any of our current focus areas, we’d do so gradually and responsibly, with ample warning and at least a year or more of additional funding (as much as we feel is necessary for a responsible transition) to our key partner organizations. We have no current plans to do this, and we know funders communicating openly about this kind of uncertainty is unusual and can be unnerving, but our hope is that sharing our latest thinking will be useful for others.

Finally, we’re planning to write more at a later date about the cost-effectiveness of our “long-termist” and animal-inclusive grantmaking and the implications for our future resource allocation.

^[1] See e.g. Subjective Well‐Being and Income: Is There Any Evidence of Satiation? (archive)

For instance Deaton (2008) and Stevenson and Wolfers (2008) find that the well-being–income relationship is roughly a linear-log relationship, such that, while each additional dollar of income yields a greater increment to measured happiness for the poor than for the rich, there is no satiation point.

^[2] We’re eliding a huge amount of complexity here in terms of modeling the domestic welfare impacts of various policy changes, which we recognize. In practice, our calculations are often very crude, though we try to be roughly consistent in considering distributional issues and weighing whether incomes are increasing due to productivity changes, prevented waste, or other causes.

^[3] See footnote 33 in GiveWell’s writeup on GiveDirectly.

^[4] 2017 average U.S. per capita income was $34,489, per the U.S. Census. (archive)

^[5] $34,000 / ($288.35 / 0.9) = ~106. Using median U.S. income rather than mean would reduce this ~20% but seems less apt as a comparison since we’re partially modeling foregone spending and taxes are moderately progressive.

^[6] See Economic Growth and Subjective Well-Being: Reassessing the Easterlin Paradox. (archive)

^[7] Too much: there is some evidence of satiation (archive) in terms of self-reported wellbeing even in log terms as incomes get very high by global standards. Additionally, if you think very high incomes carry net negative externalities (e.g., through carbon emissions or excess political influence), you may even think additional income at the high end should be treated as negative. Finally, placing high moral weight on marginal consumption for high-income people seems to imply that their lives “have a lot more value in them” or “are worth a lot more,” which seems problematic.
Too little: people continue to exercise substantial effort to increase their own income, even at high levels, and there seem to be obvious benefits beyond subjective wellbeing that accrue to them from doing so (such as increased lifespan or educational access). Additionally, if you’re discounting income or consumption logarithmically or more, even very small positive spillovers from high income people to others (e.g., through employment, charity, or bequests) could swamp the first order effects in a utility calculation.

^[8] This 5-10x cash range translated to roughly $2,000-4,000 per “life saved equivalent” in the 2015 cost-effectiveness calculation (XLSX).

^[9] Based on the median results from GiveWell’s final 2018 cost-effectiveness calculation, 8x cash implies a “cost-per outcome as good as saving an under-5 life” of ~$1,500. This is not directly comparable to the figures from 2015 because GiveWell made some changes in the values and framework used in their cost-effectiveness calculation, which affect both the outcome measures and the comparisons between them.

^[10] Another way to get similarly high overall ROI figures is from comparing GiveWell’s top charity “cost per life saved equivalent” figures to rich world “value of a statistical life” figures:

GiveWell estimates that bednets and seasonal malaria chemoprevention and vitamin A supplementation all can save lives for (low) single digit thousands of dollars and have substantial funding gaps remaining. We tend to place much more weight on GiveWell’s figures than other sources, but estimates in this range are not unusual; Gavi (archive) and the Global Fund (archive) make similar claims (not without controversy; archive, archive).
Standard “value of a statistical life” methods in the U.S. and other rich countries imply values in the (high) single digit millions of dollars. For instance, the U.S. Department of Transportation uses a value of $9.6 million for a statistical life (archive), which is on the high side relative to other estimates.
If we value the lives of children saved by GiveWell’s global health charities the same way the DOT values lives lost in car crashes in the U.S., we would estimate ~5,000x returns ($9.6M benefit/~$2K cost per life saved equivalent).

To be clear, this calculation violates the standard assumptions of value of a statistical life, one of which is that the value of a life depends on the income of a person who lives it, and is not endorsed by GiveWell (which has a more complicated moral weights system for comparing outcomes).
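
For reference, the division in this footnote can be reproduced as follows. This is a sketch using only the two figures quoted above; the variable names are illustrative, and $2,000 is simply the rough cost figure used in the footnote’s own calculation.

```python
US_DOT_VSL = 9_600_000       # USD, value of a statistical life cited in this footnote
COST_PER_LIFE_SAVED = 2_000  # USD, rough cost per life saved equivalent used in this footnote

implied_return = US_DOT_VSL / COST_PER_LIFE_SAVED
print(f"{implied_return:,.0f}x")  # 4,800x, which the footnote rounds to ~5,000x
```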

^[11] Since this post was first written, we came across Five-Hundred Life-Saving Interventions and Their Cost-Effectiveness (archive).

^[12] When we looked closely at one of the calculations in the DCP2, we found serious errors. We haven’t looked closely at the other sources at all. Overall, we expect the project of trying to estimate the cost-effectiveness of many different interventions in uniform terms to be extremely difficult and error-prone, so we don’t mean to endorse these specific estimates.

^[13] Some discussion of this in the comments of GiveWell’s 2011 post on errors in the DCP2.