Little to No Chance the Main Conclusion of Google's Project Aristotle is True
This is the second article in my series on how over-emphasizing psychological safety on teams might actually have undermined team effectiveness. You can find my introduction here.
This post is about Google's Project Aristotle ...the study that almost single-handedly caused what can only be described as an astonishing over-rotation towards psychological safety on teams by concluding it differentiated high-performing teams from less successful ones (that was the purpose of the project) and "was critical to making a team work."
I don't know this for a fact, but I will share my logic for why I don't think the study that launched a thousand ships has any chance of being true.
It is also an inquiry into why so many people took the bait…hook, line, and sinker…and a few thoughts about where to from here. I will have more suggestions as the series goes along.
To keep the length more manageable, I have put a significant acknowledgement, a few disclaimers, and an appendix at the end. The appendix has my best guess about where Project Aristotle went wrong methodologically, using info taken off a Google site. But it's really for propeller heads only.
A What They Said vs. What Was Heard Problem? Not Really.
After my intro to this series was released, the online hair-splitting about what Project Aristotle actually concluded began almost immediately. So let's get grounded in that.
Remember we don't have much to go on: the NYT article and this Google Site.
Before we note what was actually said, keep in mind that Aristotle set out to study what differentiated high-performing teams from less effective teams.
First from the NYT...edited for brevity:
"As the researchers studied the groups, however, they noticed two behaviors that all the good teams generally shared. First, on the good teams, members spoke in roughly the same proportion, a phenomenon the researchers referred to as ‘‘equality in distribution of conversational turn-taking.’’ [...] Second, the good teams all had high ‘‘average social sensitivity’’ — a fancy way of saying they were skilled at intuiting how others felt based on their tone of voice, their expressions and other nonverbal cues. [...] When Rozovsky and her Google colleagues encountered the concept of psychological safety in academic papers, it was as if everything suddenly fell into place. [...] But Google’s data indicated that psychological safety, more than anything else, was critical to making a team work."
And from Google's own site:
With all of this data, the team ran statistical models to understand which of the many inputs collected actually impacted team effectiveness…in order of importance…#1 Psychological Safety...
It is extremely important to note that Aristotle also reported that there were many other factors behind team success and that teams find many ways to combine those factors and others to organize, adapt, set norms, make decisions, prioritize, leverage team assets, and win. (More on that in Part 3.)
That's what was said, but what was heard?
Based on the reactions across the press and blog-o-sphere, they should have called it Project Galileo instead of Project Aristotle because now, suddenly, Psychological Safety seemed to be the heavenly object that all high-performing teams orbited around.
So we could have a "what was heard" problem and a "what was done with it" problem. But I think the real issue is deeper.
I'm not convinced about the research or the conclusion about psychological safety.
The Three Reasons that Give Project Aristotle's Main Conclusion a Near Zero Chance of Being True
Let me be unequivocal. Personally, I think there is a near zero chance that the Project Aristotle conclusion is true.
I don't know that, but I am willing to bet that their conclusion is not true for three reasons:
- The Overall Replication Crisis in the Social Sciences
- No Independent Eyes on Their Work Before "Publishing"
- Unwillingness to Just Share Their Procedures, Statistical Methods, and Data with Other Researchers
The Replication Crisis in the Social Sciences
By one estimation, only 36% of the studies in premier Psychology journals can be replicated. The problem is particularly pronounced (< 25% replication rate) in Social Psychology (which team research would, at least in part, fall under).
Let that sink in.
Here are a few highly-touted, widely-believed psychological findings that have been walked back:
- The Ego-Depletion Effect: The claim was that self-control is a finite resource, and exercising willpower in one task depletes it for subsequent tasks. A large-scale replication project in 2016 found no evidence supporting the ego-depletion effect. The current view is that while the idea is appealing, evidence for a universal ego-depletion mechanism is weak.
- Power Poses: The claim here was that adopting expansive, high-power poses increases feelings of power, confidence, and even physiological markers like testosterone levels. Subsequent studies have failed to replicate the physiological effects, though some psychological effects (e.g., feeling more confident) may still hold under certain conditions. The current view is that the value of power poses as a means of improving outcomes is questionable, with some researchers dismissing the findings as over-hyped.
- Loss Aversion: The claim was that people strongly prefer avoiding losses to acquiring equivalent gains, with losses being perceived as approximately twice as powerful as gains. This finding was regarded as axiomatic for decades. Loss aversion is still a valid concept, but its magnitude turns out to be very context dependent.
So here's where we already are. Generally, only 36% of the findings in this field can be replicated.
You have a study from this field in front of you, in this case Project Aristotle.
- Knowing the overall replication success rate, how much would you be willing to bet that the claim that psychological safety is the "#1 input impacting team effectiveness" is true and can be replicated?
- I am assuming you don't make big bets with a 1 in 3 base rate of success. So if you are willing to bet a sizeable amount, what information did you use to "update your priors," as Bayesians like to say? (A minimal sketch of that kind of update follows this list.) In other words, what additional pieces of information do you have that give you more confidence the conclusion is true and that would thus tip the odds on your bet more in your favor?
- Does the notion of psychological safety on teams have a "face validity" that gives you confidence? Is it what you personally believe to be true? Is it the fact that Google did the research? Something else?
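To make the "update your priors" question concrete, here is a minimal sketch of a Bayes-rule update in Python. The 36% prior is the replication base rate cited above; the likelihoods for how often a given piece of evidence accompanies true versus false findings are purely hypothetical placeholders, not anything taken from Project Aristotle.

```python
# A minimal Bayes-rule sketch: how much should extra evidence move you
# off the ~36% replication base rate? The likelihoods below are
# hypothetical placeholders, not estimates from Project Aristotle.

def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(finding is true | evidence) via Bayes' rule."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1 - prior) * p_evidence_if_false
    return numerator / denominator

prior = 0.36  # base rate: share of findings in the field that replicate

# Suppose "a big-name company did the research" is observed for 70% of
# true findings but also for 60% of false ones (made-up numbers).
posterior = update(prior, p_evidence_if_true=0.70, p_evidence_if_false=0.60)
print(f"Posterior after weakly diagnostic evidence: {posterior:.2f}")   # ~0.40

# Only strongly diagnostic evidence, like an independent replication,
# moves the needle much.
posterior = update(prior, p_evidence_if_true=0.90, p_evidence_if_false=0.10)
print(f"Posterior after strongly diagnostic evidence: {posterior:.2f}")  # ~0.84
```

If the only things you can point to are face validity, the Google name, or the NYT byline, the posterior barely budges off the base rate.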
Really, I could stop there. This alone should give the statistically minded pause.
But there is more.
No Independent Eyes on Their Methods Before "Publishing"
The original claims...about power poses, ego depletion, loss aversion...were first published in respected, peer-reviewed journals.
Just to stay grounded: publication in a peer-reviewed journal means that multiple scientists (unknown to the researchers and with no vested interest in the findings) have to review a study's methodology (procedures, controls, data collection, statistical analyses, and, importantly, the conclusions drawn against those methods) and give their stamp of approval before that study can be published.
The reason for having peer review is to prevent weak research from getting published. Weaknesses can come from many sources: small sample sizes or samples that are biased in some way, poor use of control groups, procedures that don't ensure researcher "blindness," ill-conceived data analyses, conclusions that go far beyond what the data would allow one to conclude, etc.
Spurious findings can still emerge from relatively solid procedures and analyses and poor studies can still scamper across the QC desk, but peer review keeps a lot of chaff out of the wheat.
Still, after publication of the original results about power poses, etc., other scientists following the same published procedures could not replicate the original findings.
Thus, despite solid efforts on the quality and transparency front, the results about ego depletion, loss aversion, etc., appear to be anomalies.
At Google, did any outside eyes, ones without a dog in the fight, look over Project Aristotle's procedures and methods before it was broadcast by the NY Times? (I say "broadcast" rather than "published" because it was never published in the research sense.)
I don't know for sure, but since it was launched as an internal study, my guess is outside, independent eyes did not look over their approach.
Unwillingness to Just Share Their Procedures, Statistical Methods, or Data
If the general replicability problem in Psychology isn't enough to give you pause, consider this:
- aside from the NYT article, Google did not publish the study in a peer-reviewed journal. This despite the fact that their conclusion had the potential to be truly ground-breaking...something that could improve team performance globally! (It was certainly treated as such.) This will sound hyperbolic, but if replicable, this research, in my view, was worthy of being a Nobel Peace Prize nominee. That is the kind of global impact it could have had.
- Now they may have reasons they don't want to publish...they have a business to run...too busy making teams successful...onto the next cutting-edge research challenge...blah blah blah...that's fine. But they will not even share their procedures, data, or statistical methods with other scientists so they could replicate the study or do what science does and stand on the shoulders of previous research, even failed research, to advance the field. Had they shared this just so the science could further itself, it would have made those Google researchers close to Social Science royalty.
- Because of #1 and #2, the study has never been replicated.
And if it can't be replicated, the conclusion that "psychological safety, more than anything else, [is] critical to making a team work"...like loss aversion, power poses, and ego depletion, would have to be regarded as false, even if you really, really want it to be true.
Despite Warnings, Many Rose to the Bait. Why?
As the study's conclusion began to proliferate, well-known figures sounded the alarms.
Here are but a few:
"The idea that psychological safety is always good is wrong. You can create a psychologically safe environment where people don't take risks, and they don't challenge one another. But in a world where competition is fierce, you need a little bit of discomfort to perform." ~ Jeffrey Pfeffer, Stanford Professor, From the book Leadership BS.
"Psychological safety is important in that people should feel safe to speak up and express themselves, but I think it is just as important that people are held accountable and pushed to succeed. High performance comes from pushing people, not coddling them."
~Ben Horowitz, from the book, The Hard Thing About Hard Things
"Teams need conflict to thrive, and that means sometimes leaving the comfort zone. Psychological safety can be overrated if it leads to complacency." ~ Patrick Lencioni, The Five Dysfunctions of a Team
Despite the warnings, many just ran with the headline.
To wit, if you search for Project Aristotle you will need a snorkel to keep from drowning in the number of sites that just took the paramount importance of psychological safety as gospel and hit Send.
Maybe all the issues I am raising have solid explanations, but there are plenty of reasons for skepticism.
Why then did something so consequential get so little scrutiny?
- Did people just see the words "Google" or "Aristotle" or the word "research" in the headline and throw their powers of discernment out the window?
- Was it the fact that the article was in the NYTimes?
- Did everyone forget this was an internal study of the teams in one organization before breathlessly hyper-generalizing?
- And with all these potential issues, or at least glaring questions about the research, it is worth asking: what was Google's motivation here? I understand the motivation to study teams...multiplying team effectiveness has a huge potential payout for organizations. But why not just leverage the findings internally? Had the findings had the power implied, Google could have used them to increase the effectiveness of all their teams, resulting in real competitive advantage. Who was the real stakeholder here? What did Google gain by broadcasting non-peer-reviewed, unreplicated research in the NYTimes?
- Finally, maybe something less insidious and much more mundane is behind this over-rotation. One of the many ills that flesh is heir to is Confirmation Bias. Maybe the conclusion about psychological safety was just consistent with what most believed was true or what they wanted to be true about teams, so no need to question it.
That last point...people believing what they want/need to believe...shouldn't come as a surprise. It's prosaic.
Remember the 80s and 90s when they told us moderate alcohol consumption was good for us? "It helps with digestion," "It has Resveratrol," "It dilates the arteries and is good for circulation," "Look at how healthy those Europeans are...and how much wine they drink..."
We soooo wanted to believe that. Why? Because it fit with our alcohol consumption patterns, preferences, and beliefs and the way we wanted to live.
Then come, damnit, those large-N, longitudinal studies, and where are we now? The Surgeon General wants to put cancer warnings on our bottles of wine and cans of beer! A literal buzz kill!
Where To From Here?
However we got here, if you are serious about trying to improve a given team's performance...as a leader, as a member of a team, or as a facilitator...you have to now ask yourself where you are, really, on psychological safety and teams:
- Do you still think psychological safety is the most important factor differentiating high-performing teams? Why?
- Are you going to let your own beliefs about psychological safety lead you to concentrate your search for what might be limiting the team's performance around that issue? "I lost my keys over there, but I am looking here because the light is better."
- Are you going to nudge your teams towards navel-gazing sessions because you believe that is the kind of work teams really need to do to "really get to the issues"? Or because it's the kind of work you like to do?
- Finally, do you think your continuing to tout the importance of psychological safety might be contributing to the problem? Would it make sense to hold your fire until the study has been replicated?
I'll give the project's namesake the last word:
Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. ~Aristotle
Part 3 will explore how between-team differences are more important than any conclusion about what applies to all successful teams, if there even is such a thing.
Dennis Adsit, Ph.D. is an executive coach, organization consultant, and the designer of The First 100 Days and Beyond, a consulting service that has helped hundreds of leaders get the best start to their careers, while also providing frameworks and templates for continued success long past their First Hundred Days.
Acknowledgement
Substantive input and editing suggestions provided by Rob Jones, author of A Hole in Science--Grammer of the Sociological Problem. Thank you, Rob.
Disclaimers
First, I am not saying and never have said that psychological safety on teams is not a factor. I'm challenging the claim, the one that launched a thousand ships and articles and blog posts, that it is "the #1 input impacting team effectiveness."
Second, I don't make a "little to no chance" claim like this lightly. Many of the researchers on Project Aristotle were likely Industrial Psychologists, my brethren. We are cut from the same empirical cloth. I have sat with this discomfort for years and it still gives me pause to criticize what I had no visibility into.
Third, I of course don't know that my challenge to their finding is correct. And, if new information comes to light which suggests my inferences are incorrect, I will gladly, and I mean that, issue a retraction and an apology.
Appendix: My Best Guess, Using Google's Own Words, About the Study's Major Weakness that Made Project Aristotle Impossible to Publish (For Propeller Heads Only)
Here is what the researchers were trying to do. They were trying to create two "piles": a pile of "successful" teams and a, let's say, pile of less successful teams.
Once they had those two piles, they could then look for differences between the piles in terms of what the groups were doing differently that might contribute to their success or lack thereof.
This seems so straightforward.
But do you have any idea how unbelievably hard it is to reliably distinguish successful teams from unsuccessful teams anywhere, but especially outside the lab?
For Industrial Psychologists, this is known as "the criterion problem" and it has been bedeviling research in our field since it came into its own in World War I with its efforts to match Army recruits to jobs.
We know we have all these differences between teams, but we are trying to figure out which ones "matter." Matter means: it is present on the successful teams but less or not present on the less successful teams.
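To make "looking for differences between the piles" concrete, here is a minimal sketch, with entirely invented survey scores, of the kind of group comparison such a study might run. Nothing here reflects Google's actual data or analyses; the point is only that the comparison itself is easy once the piles exist, and meaningless if the piles are shaky.

```python
# A toy comparison of one candidate factor (a team survey score) across
# the "successful" and "less successful" piles. All scores are invented
# for illustration; this is not Google's data or their analysis.
from statistics import mean, stdev

successful      = [4.2, 3.9, 4.5, 4.1, 3.8, 4.4]  # hypothetical survey scores
less_successful = [3.6, 4.0, 3.3, 3.9, 3.5, 3.7]

diff = mean(successful) - mean(less_successful)

# Pooled standard deviation (equal group sizes) -> Cohen's d effect size.
pooled_sd = ((stdev(successful) ** 2 + stdev(less_successful) ** 2) / 2) ** 0.5
cohens_d = diff / pooled_sd

print(f"Mean difference: {diff:.2f}, Cohen's d: {cohens_d:.2f}")
# The result only means something if the two piles themselves were
# formed reliably, which is exactly the criterion problem.
```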
All conclusions about the factors that might contribute to team success hinge on reliable and valid distinctions between those two piles.
I can't emphasize that enough.
All conclusions about the factors that might contribute to team success hinge on reliable and valid distinctions between those two piles.
And it is right here where things get complicated:
- Is there an objective measure of success being used, like percent of quota, or lines of code written? Was good or bad luck removed from the results? For example, any last-minute "bluebirds" that converted an about-to-miss-quota-and-thus-be-an-unsuccessful-team into a successful team or vice-versa?
- Can a team that doesn't make quota (or win the championship) still be regarded as successful?
- Is self-report data being used to decide who is successful? Do the teams get to decide for themselves if they are successful?
- Or do higher-level VPs, to whom the teams report, do the rating? Is it one VP deciding or do we have multiple VPs who have visibility into multiple teams deciding so we can compare ratings?
- If we have multiple raters, how do we get them to agree on standards so we don't have a bunch of easy graders and a bunch of tough graders?
- If ratings were used, was a reliability coefficient calculated? Ratings can be notoriously unreliable, and if you are going to include them in the criterion score, they had better be solid. (A minimal sketch of one such reliability check follows this list.)
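For illustration only, here is a minimal sketch of one common reliability check, Cohen's kappa, for two raters sorting the same teams into "successful" and "less successful." The ratings are invented, and I have no idea what reliability check, if any, Project Aristotle ran.

```python
# Cohen's kappa for two hypothetical VPs rating the same ten teams as
# "S" (successful) or "L" (less successful). Invented data for illustration.

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (observed - expected) / (1 - expected)

vp_1 = ["S", "S", "L", "S", "L", "L", "S", "L", "S", "L"]
vp_2 = ["S", "L", "L", "S", "L", "S", "S", "L", "S", "L"]

print(f"Cohen's kappa: {cohens_kappa(vp_1, vp_2):.2f}")  # 0.60 for this toy data
# Values near 1 mean the raters agree well beyond chance; values near 0
# mean the "piles" are little better than coin flips.
```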
You can see here, in Google's own words from their site, there was no real agreement on how to define "successful" teams and distinguish them from less successful teams.
"Executives were most concerned with results (e.g., sales numbers or product launches), but team members said that team culture was the most important measure of team effectiveness. Fittingly, the Team Leads' concept of effectiveness spanned both the big picture and the individuals’ concerns saying that ownership, vision, and goals were the most important measures."
They apparently settled on some aggregate measure or combination of measures which merged these various perspectives.
They never say how they did that aggregation, but I am guessing that is where the methodological weaknesses lie.
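We can only guess at how that aggregation was done, but here is a minimal sketch of one plausible approach: standardize each stakeholder group's measure and combine them with weights. The measures, weights, and numbers are all hypothetical. The point is that every choice in such an aggregation is a judgment call, and even modest changes to the weights can move a team from one pile to the other.

```python
# One hypothetical way to merge executive, team-member, and team-lead
# views of "effectiveness" into a single criterion score. The measures,
# weights, and numbers are invented; Google never described theirs.
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of raw scores (mean 0, sd 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

teams          = ["A", "B", "C", "D", "E"]
exec_results   = [102, 87, 95, 110, 78]     # e.g., % of quota (hypothetical)
member_culture = [4.1, 4.6, 3.2, 3.8, 4.4]  # e.g., survey score (hypothetical)
lead_ownership = [3.9, 4.2, 3.5, 4.5, 3.6]  # e.g., survey score (hypothetical)

weights = (0.50, 0.25, 0.25)  # arbitrary: change these and the piles change

composite = [
    weights[0] * e + weights[1] * m + weights[2] * l
    for e, m, l in zip(z_scores(exec_results),
                       z_scores(member_culture),
                       z_scores(lead_ownership))
]

for team, score in sorted(zip(teams, composite), key=lambda pair: -pair[1]):
    print(f"Team {team}: composite criterion = {score:+.2f}")
```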
Finally, regarding methodology, I was discussing this morass with a colleague on LinkedIn and he decided to pose this question to Google's own DeepMind Gemini-1.5-Pro AI Bot: "Why has the Google Project Aristotle study never been published?"
Here is the most relevant part that came back from Google's own AI:
The Google Aristotle study, officially titled "Project Aristotle," has not been formally published in a peer-reviewed academic journal. While Google did release some of its findings in a New York Times Magazine article in 2016 by Charles Duhigg, the full study remains internal. There are several likely reasons for this. [...] Methodological Concerns: While the New York Times article presented some compelling narratives and general conclusions, the full study might have methodological limitations that Google didn't want to expose to rigorous academic scrutiny. This could include issues with sample size, data collection methods, or the generalizability of the findings beyond Google's specific context. Publishing might have opened the study up to criticism that could damage Google's reputation.
Yep.
"A crucial part of reform [for the replicability problem] is a commitment to openness—making data, analysis scripts, and hypotheses accessible to others so they can verify, critique, and improve the findings." ~E. J. Wagenmakers, Psychological Science in the Public Interest, 2016