Unlikely That the Main Conclusion of Google's Project Aristotle Is True
This is the second article in my series on how overemphasizing psychological safety on teams may have undermined team effectiveness. You can find my introduction here.
This post is about Google's Project Aristotle...the study that almost single-handedly caused what can only be described as an astonishing over-rotation towards psychological safety on teams by concluding that it differentiated high-performing teams from less successful ones and, "...more than anything else, was critical to making a team work."
In this article, I will share my logic for why I don't think their conclusion is true...at least not at this point.
This article is also an inquiry into why so many people took the bait…hook, line, and sinker…and offers a few thoughts about where to from here. I will have more suggestions as the series goes along.
To keep the length more manageable, I have put a significant acknowledgement, a few disclaimers, and an appendix at the end.
The appendix has my best guess about where Project Aristotle might have run into trouble methodologically, using info taken off a Google site. You don't need to read the appendix to understand the overall rationale, and it is primarily intended for propeller heads.
A What They Said vs. What Was Heard Problem? Not Really.
After my intro to this series was released, the hair-splitting online began immediately about what Project Aristotle actually concluded. So let's get grounded in that.
Remember, we don't have much to go on: the NYT article linked previously and this Google site.
Before we look at what was actually said, keep in mind that the "groups" (referred to next) that Aristotle was trying to distinguish between were high-performing "good" teams and less high-performing ones.
First, from the NYT...edited for brevity:
"As the researchers studied the groups, however, they noticed two behaviors that all the good teams generally shared. First, on the good teams, members spoke in roughly the same proportion, a phenomenon the researchers referred to as 'equality in distribution of conversational turn-taking.' [...] Second, the good teams all had high 'average social sensitivity'...a fancy way of saying they were skilled at intuiting how others felt based on their tone of voice, their expressions and other nonverbal cues. [...] When Rozovsky and her Google colleagues encountered the concept of psychological safety in academic papers, it was as if everything suddenly fell into place. [...] Google's data indicated that psychological safety, more than anything else [emphasis added], was critical to making a team work."
And from Google's own site:
With all of this data, the team ran statistical models to understand which of the many inputs collected actually impacted team effectiveness…in order of importance…#1 Psychological Safety...
It is also extremely important to highlight that Aristotle clearly reported other factors behind team success (Dependability, Structure & Clarity, Meaning, Impact...see the Google site link) and that teams mix and match those factors, and many others affecting team effectiveness, to find their own way to win.
That's what Google & the NYT said, but what was heard?
Based on the reactions across the press and blog-o-sphere, they should have called it Project Galileo instead of Project Aristotle because now, suddenly, psychological safety seemed to be the heavenly object that all high-performing teams orbited around.
In addition to being more important than the other team factors the Google researchers noted, psychological safety is apparently more important than Winning Norms that everyone is held accountable to. More important than Operating Mechanisms and an Operating Rhythm that drive transparency, accountability, and focus. More important than having the right Resources to complete the mission.
Since they say it is the #1 input, it is apparently...though it surprises me...even more important than strong Talent at every position on the team. And, again surprising to me, more important than the degree of Buy-in and having a Relentless Drive to Succeed across the team.
So while "what was said," "what was heard," and a "what was done with it" probably all contributed this over-rotation towards safety on teams, as you might have guessed, I think the real problem goes deeper.
Like many famous findings in the Social Sciences, I don't think the finding that "psychological safety, more than anything else, [is] critical to making a team work" would stand under the weight of scrutiny.
The Three Reasons Project Aristotle's Main Conclusion is Unlikely True
With the information we have right now, I think it is highly unlikely that the Project Aristotle conclusion about psychological safety is true.
I don't know that for certain, but I am willing to bet it's not true, for three reasons:
- The Overall Replication Crisis in the Social Sciences
- No Independent Eyes on Their Work Before "Publishing"
- Unwillingness to Just Share Their Procedures, Statistical Methods, and Data with Other Researchers
The Replication Crisis in the Social Sciences
By one estimate, only 36% of the studies in premier Psychology journals can be replicated. The problem is particularly pronounced (< 25% replication rate) in Social Psychology (under which team research would at least partly fall).
Let that sink in.
Here are a few highly touted, widely-believed psychological findings that have been walked back because they couldn't be fully replicated:
- The Ego-Depletion Effect: The claim was that self-control is a finite resource, and exercising willpower in one task depletes it for subsequent tasks. A large-scale replication project in 2016 found no evidence supporting the ego-depletion effect. The current view is that while the idea is appealing, evidence for a universal ego-depletion mechanism is weak.
- Power Poses: The claim here was that adopting expansive, high-power poses increases feelings of power, confidence, and even physiological markers like testosterone levels. Subsequent studies have failed to replicate the physiological effects, though some psychological effects (e.g., feeling more confident) may still hold under certain conditions. The current view is that the value of power poses as a means of improving outcomes is questionable, with some researchers dismissing the findings as over-hyped.
- Loss Aversion: The claim was that people strongly prefer avoiding losses to acquiring equivalent gains, with losses perceived as approximately twice as powerful as gains. This finding was regarded as axiomatic for decades. Loss aversion is still a valid concept, but its magnitude turns out to be very context-dependent.
So here's where we already are. Generally, only about 36% of the findings in this field can be replicated.
In front of you, you have not a research study but a report about a research study from this field...in this case, a report about the findings of Project Aristotle.
- Knowing the overall replication success rate, how much would you be willing to bet that the claim "psychological safety is the #1 input impacting team effectiveness" is true and can be replicated?
- I am assuming you don't make big bets with a 1-in-3 chance of success. So if you are willing to bet a sizeable amount, what information did you use to "update your priors," as Bayesians like to say? In other words, what additional pieces of information do you have that give you more confidence the conclusion is true and would thus tip the odds on your bet more in your favor? (A minimal worked sketch of this kind of updating follows these questions.)
- Does the notion of psychological safety on teams have a kind of "face validity" that gives you confidence? Is it what you personally believe to be true about teams? Is it the fact that Google did the research? Something else?
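To make the "update your priors" question concrete, here is a minimal sketch in Python. The 36% base rate is the replication estimate cited above, used as a stand-in prior; the likelihood ratios are invented placeholders for however much weight you think your extra evidence (face validity, "Google did it," etc.) actually deserves.

```python
# A minimal sketch of Bayesian updating, assuming the 36% replication rate
# as a stand-in prior. The likelihood ratios below are invented placeholders
# for the strength of whatever extra evidence you think you have.

def posterior(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability via odds and a likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

BASE_RATE = 0.36  # replication estimate used as the prior P(conclusion is true)

for lr in (1.0, 2.0, 5.0):
    print(f"likelihood ratio {lr}: P(true) = {posterior(BASE_RATE, lr):.2f}")
# likelihood ratio 1.0: P(true) = 0.36
# likelihood ratio 2.0: P(true) = 0.53
# likelihood ratio 5.0: P(true) = 0.74
```

Note what the arithmetic says: your extra evidence would have to be worth a likelihood ratio of roughly two just to get the bet up to a coin flip.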
Really, I could stop there. This alone likely gives the statistically-minded enough reason to question the conclusion.
But there's more.
No Independent Eyes on Their Methods Before "Publishing"
The original claims...about power poses, ego depletion, loss aversion...were first published in respected, peer-reviewed journals.
Just to stay grounded: a peer-reviewed journal means that multiple scientists (unknown to the researchers and with no vested interest in the findings) have to review a study's methodology (procedures, controls, data collection, statistical analyses) and, this is important, the conclusions drawn against those methods. They have to give their stamp of approval before that study can be published.
The reason for having peer review is to prevent weak research and groundless claims from getting published. Weaknesses can come from many sources: low sample sizes or samples that are biased in some way, poor use of control groups, procedures that don't ensure researcher "blindness," ill-conceived data analyses, conclusions that go far beyond what the data would allow one to conclude, etc.
Spurious findings can still emerge from relatively solid procedures and analyses and, of course, poor studies can still scamper across the QC desk, but peer review keeps a lot of chaff out of the wheat.
Still, after publication of the original results about power poses, etc., other scientists following the same published procedures could not replicate the original findings.
Thus, despite solid efforts on the quality and transparency front, the results about ego-depletion, loss aversion, etc., appear to be anomalies, i.e., the original claims were not true.
At Google, did any outside eyes, ones without a dog in the fight, look over Project Aristotle's procedures and methods before it was...not published, because it never was in the research sense, but broadcast...by the NY Times?
I don't know for sure, but since it was launched as an internal study, my guess is outside, independent eyes did not look over their approach.
Unwillingness to Just Share Their Procedures, Statistical Methods, or Data
If the general replicability problem in Psychology isn't enough to give you pause, consider this:
- aside from the NYT article, Google did not publish the study in a peer-reviewed journal. This despite the fact that their conclusion had the potential to be truly ground-breaking...something that could improve team performance globally! (It was certainly treated as such.) This will sound hyperbolic, but if replicable, this research, in my view, was worthy of a Nobel Prize nomination. That is the kind of global impact it could have had.
- Now, they may have reasons for not wanting to publish their research...they have a business to run...too busy making teams successful...on to the next cutting-edge research challenge...blah blah blah...that's fine. But they will not even share their procedures, data, or statistical methods with other scientists so they could try to replicate the study. Or, perhaps better yet, do what science does and stand on the shoulders of previous research, even failed research, to advance the field. Had they shared this so the science could further itself, it would have made those Google researchers close to Social Science royalty.
- Because of the first two points, the study has never been replicated.
And until it's replicated, the conclusion that "psychological safety, more than anything else, [is] critical to making a team work" would have to be regarded...not as false, but not true either...as a hypothesis, at best, in serious need of vetting.
"A crucial part of reform [for the replicability problem] is a commitment to openness—making data, analysis scripts, and hypotheses accessible to others so they can verify, critique, and improve the findings." ~E. J. Wagenmakers, Psychological Science in the Public Interest, 2016
Despite Warnings, Many Rose to the Bait. Why?
As the study's conclusion began to proliferate, well-known figures sounded the alarms.
Here are but a few:
"The idea that psychological safety is always good is wrong. You can create a psychologically safe environment where people don't take risks, and they don't challenge one another. But in a world where competition is fierce, you need a little bit of discomfort to perform." ~ Jeffrey Pfeffer, Stanford Professor, From the book Leadership BS.
"Psychological safety is important in that people should feel safe to speak up and express themselves, but I think it is just as important that people are held accountable and pushed to succeed. High performance comes from pushing people, not coddling them."
~Ben Horowitz, from the book, The Hard Thing About Hard Things
"Teams need conflict to thrive, and that means sometimes leaving the comfort zone. Psychological safety can be overrated if it leads to complacency." ~ Patrick Lencioni, The Five Dysfunctions of a Team
Many seemed to not care about the warnings and just ran with the headline.
To wit, if you search for Project Aristotle, you will need a snorkel to keep from drowning in the number of sites that just copied, pasted and published...as fact...that psychological safety was the key to high performing teams.
Maybe all the issues I am raising have solid explanations, but there are plenty of reasons to warrant a healthy skepticism.
Why then did something so consequential...with so many questions swirling about it...get so little scrutiny?
- Did people just see the words "Google" or "Aristotle" or the word "research" in the headline and throw their powers of discernment out the window?
- Was it the fact that the article was in the NYTimes?
- Did everyone forget this was an internal study of the teams in a single organization before they began breathlessly hyper-generalizing?
- And with all these potential issues, or at least glaring questions, about the research, it is worth asking: what was Google's motivation here? I understand the motivation to study teams...multiplying team effectiveness has a huge potential payout for organizations. But why not just leverage the findings internally? Had the findings had the power implied, Google could have used them to increase the effectiveness of teams company-wide, producing real competitive advantages. Who was the real stakeholder here for releasing anything? What did Google gain by broadcasting non-peer-reviewed, unreplicated research in the NYTimes?
- Finally, maybe something less insidious and much more mundane is behind the over-rotation we witnessed. One of the many ills that flesh is heir to is Confirmation Bias. Maybe the conclusion about psychological safety was just consistent with what most believed was true or what they wanted to be true about teams, so no need to question it.
That last point...people believing what they want/need to believe...shouldn't come as a surprise. It's so prosaic, it's barely worth highlighting.
Remember the 80s and 90s when they told us moderate alcohol consumption was good for us? "It helps with digestion," "It has Resveratrol," "It dilates the arteries and is good for circulation," "Look at how healthy those Europeans are...and how much wine they drink..."
We soooo wanted to believe what they told us. Why? Because it fit with our alcohol consumption patterns, preferences, beliefs, and the way we wanted to live.
Then came, dammit, those large-N, longitudinal studies. And where are we now? The Surgeon General wants to put cancer warnings on our bottles of wine and cans of beer! A literal buzzkill!
Where To From Here?
However we got here, if you are serious about trying to improve a given team's performance...as a leader, as a member of a team, or as a facilitator...you now have to ask yourself where you stand, really, on psychological safety and teams:
- Do you still think psychological safety is the most important factor differentiating high-performing teams? Why?
- Are you going to let your own beliefs about psychological safety lead you to concentrate your search for what might be limiting the team's performance around that issue? "I lost my keys over there, but I am looking here because the light is better."
- Are you going to nudge your teams towards navel-gazing sessions because you believe that is the kind of work teams need to do to "really get to the issues"? Or perhaps because it's the kind of work you like to do?
- Finally, do you think continuing to tout that psychological safety is the #1 input impacting team effectiveness, when it's at best a hypothesis, might actually be contributing to the over-rotation problem we're seeing? Can you see some value in, for now, keeping your powder dry until the study has been replicated?
To close, I'll give the project's namesake the last word:
Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. ~Aristotle
Part 3 will explore how variation and between-team differences are way more important than any conclusion about what applies to all successful teams, if there even is such a thing.
Dennis Adsit, Ph.D. is an executive coach, organization consultant, and the designer of The First 100 Days and Beyond, a consulting service that has helped hundreds of leaders get the best start to their careers.
Acknowledgement
Substantive input and editing suggestions provided by Rob Jones, author of A Hole in Science--Grammar of the Sociological Problem. Thank you, Rob.
Disclaimers
First, I am not saying and never have said that psychological safety on teams does not affect their performance. I'm challenging the claim that launched a thousand ships and articles and blog posts that it is "the #1 input impacting team effectiveness."
Second, I don't raise these doubts and concerns lightly. Many of the researchers on Project Aristotle were likely Industrial Psychologists, my brethren. We are cut from the same empirical cloth. I have sat with this discomfort for years and it still gives me pause to criticize what I had no visibility into.
Third, I of course don't know that my challenge to their finding is correct. And, if new information comes to light which provides empirical evidence that in fact psychological safety is the #1 input impacting team effectiveness, I will gladly, and I mean that, issue a retraction and an apology.
Appendix: My Best Guess, Using Google's Own Words, About the Study's Methodological Challenges
Here is what the researchers were trying to do. They were trying to create two "piles": a pile of "successful" teams and, let's say, a pile of less successful teams.
Once they had those two piles, they could then look for differences between the piles in terms of what the groups were doing differently that might contribute to their success or lack thereof.
This seems so straightforward.
But do you have any idea how unbelievably hard it is to reliably distinguish successful teams from unsuccessful teams, especially outside the lab?
For Industrial Psychologists, this is known as "the criterion problem," and it has been bedeviling research in our field since the field came into its own in World War I with its efforts to match Army recruits to jobs.
We know we have all these differences between teams, but we are trying to figure out which ones "matter." "Matter" means: the difference is present on the successful teams but less present, or absent, on the less successful teams.
All conclusions about the factors that might contribute to team success hinge on reliable and valid distinctions between those two piles.
I can't emphasize that enough.
All conclusions about the factors that might contribute to team success hinge on reliable and valid distinctions between those two piles.
And it is right here where things get complicated:
- Is there an objective measure of success being used, like percent of quota, or lines of code written? Was good or bad luck removed from the results? For example, any last-minute "bluebirds" that converted an about-to-miss-quota-and-thus-be-an-unsuccessful-team into a successful team or vice-versa?
- Can a team that doesn't make quota (or win a championship, speaking more generally) still be regarded as successful?
- Is self-report data being used to decide who is successful? Do the teams get to decide for themselves if they are successful?
- Or do higher-level VPs, whom the teams report into, do the rating? How are their biases removed?
- Is it one VP deciding or do we have multiple VPs who have visibility to multiple teams deciding so we can compare ratings?
- If we have multiple raters, how do we get them to agree on standards so we don't have a bunch of easy graders and a bunch of tough graders?
- If ratings were used, was a reliability coefficient calculated? Ratings can be notoriously unreliable, and if you are going to include them in the criterion score, they had better be solid. (A minimal sketch of one such check follows this list.)
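To make the rating-reliability question concrete, here is a minimal sketch of one standard check, Cohen's kappa, for two hypothetical VPs labeling the same ten teams. The ratings below are invented for illustration; we have no idea what, if anything, Google computed.

```python
# A minimal sketch of one reliability check: Cohen's kappa for two raters
# labeling the same teams "successful" (1) or "less successful" (0).
# The VP ratings below are invented for illustration.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1.0 - expected)

vp_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
vp_2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

print(f"kappa = {cohens_kappa(vp_1, vp_2):.2f}")  # kappa = 0.40
```

A kappa of 0.40 means these two VPs barely agree beyond chance; building a criterion on labels like that would be shaky ground.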
You can see here, in Google's own words from their own site, there was no real agreement on how to define "successful" teams and distinguish them from less successful teams.
"Executives were most concerned with results (e.g., sales numbers or product launches), but team members said that team culture was the most important measure of team effectiveness. Fittingly, the Team Leads' concept of effectiveness spanned both the big picture and the individuals’ concerns saying that ownership, vision, and goals were the most important measures."
They apparently settled on some aggregate measure or combination of measures which merged these various perspectives.
They never say how they did that aggregation, but I am guessing that is where the methodological weaknesses lie.
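Since we don't know how they merged the executive, team-member, and Team Lead perspectives, here is a hypothetical illustration of why that choice matters. The team scores and weights below are entirely invented; the point is only that the same teams can land in different piles depending on the weighting.

```python
# A hypothetical illustration of the aggregation problem: the same (invented)
# per-team scores produce different "successful" piles under different weights.

teams = {
    # name: (exec_results, team_self_report, lead_rating), all 0-100, invented
    "Team A": (90, 55, 70),
    "Team B": (60, 90, 75),
    "Team C": (75, 70, 72),
}

def composite(scores, weights):
    """Weighted sum of the three perspective scores."""
    return sum(s * w for s, w in zip(scores, weights))

for weights in [(0.6, 0.2, 0.2), (0.2, 0.6, 0.2)]:  # results-heavy vs. culture-heavy
    ranked = sorted(teams, key=lambda t: composite(teams[t], weights), reverse=True)
    print(f"weights {weights}: {' > '.join(ranked)}")
# weights (0.6, 0.2, 0.2): Team A > Team C > Team B
# weights (0.2, 0.6, 0.2): Team B > Team C > Team A
```

If the ranking flips this easily on made-up data, every downstream conclusion about what "successful" teams do differently inherits that fragility.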
It of course could be something else or a combination of weaknesses. There is so much we don't know.
For example, the NYT article implies that all the good teams had psychological safety...or maybe it means they had "more psychological safety" than the not-so-good teams.
But what does it mean to "have psychological safety?"
Is it the presence of certain behaviors? To what degree? Or did they use a survey to measure it? Did the good teams generally score higher on that survey? Or was there a cut-off score...above this level you have safety, below this level you do not? What instrument did they use? Customized or off-the-shelf? Reliable? Validated?
And on and on.
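For instance, if a survey was used, one basic check would be internal consistency. Here is a minimal sketch using Cronbach's alpha on an invented four-item psychological safety survey; the items, responses, and the instrument itself are all hypothetical.

```python
# A minimal sketch of one instrument check: Cronbach's alpha (internal
# consistency) for a hypothetical 4-item safety survey, 1-5 agreement scale.
# Rows are respondents, columns are items; the data are invented.
import numpy as np

responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
])

k = responses.shape[1]                          # number of items
item_vars = responses.var(axis=0, ddof=1)       # per-item variance
total_var = responses.sum(axis=1).var(ddof=1)   # variance of summed scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")  # ~0.94 for this invented data
```

None of this tells us what Google actually did; it just shows how answerable these questions would be if the procedures, instruments, and data were shared.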
I was discussing this methodological morass with a colleague and he decided to pose this question to Google's own DeepMind Gemini-1.5-Pro AI Bot: "Why has the Google Project Aristotle study never been published?"
Here is the most relevant part that came back from Google's own AI:
The Google Aristotle study, officially titled "Project Aristotle," has not been formally published in a peer-reviewed academic journal. While Google did release some of its findings in a New York Times Magazine article in 2016 by Charles Duhigg, the full study remains internal. There are several likely reasons for this. [...] Methodological Concerns: While the New York Times article presented some compelling narratives and general conclusions, the full study might have methodological limitations that Google didn't want to expose to rigorous academic scrutiny. This could include issues with sample size, data collection methods, or the generalizability of the findings beyond Google's specific context. Publishing might have opened the study up to criticism that could damage Google's reputation.
Yep.