Conversation with Pablo Rosado

Pablo Rosado is the Principal Data Scientist at Our World in Data. He was formerly a researcher in astrophysics and, since leaving academia, has worked as a data scientist at companies in several different sectors. He also recently started a YouTube channel, AltruFísica, where he talks about how to make the world a better place using data and scientific evidence.

My usual warm-up question for these interviews is always related to the past. So, let’s travel back in time: in 2004 you started a double degree in Physics. Was it a calling?

When I was little I always thought I would probably do something related to arts, especially music. But in the last few years of high school, while studying piano, I realized that I was very good at physics, and I actually enjoyed it very much. That’s how I ended up dedicating more than a decade of my life to it.

In fact, right after your degree, you began a PhD in Gravitational Wave Astrophysics and Data Analysis. Tell me more about it.

I did my master’s work in nuclear physics, but early in my degree I knew I was more interested in astrophysics. I actually had no idea about gravitational waves until I had to prepare for my first interview for the PhD position. The person who interviewed me, Bruce Allen, who would become my PhD advisor (and whose own PhD advisor was Stephen Hawking!), was a highly inspiring, knowledgeable scientist. And thanks to him I came to learn about one of the most fascinating areas of physics to work on at this point in the history of science.

From a purely data analysis perspective, the challenge of detecting gravitational waves was really daunting. The most likely signal to be detected was the one produced by a pair of merging black holes. The signal itself would be a tiny blip embedded in a massive sea of noise. And the least likely signal to be detected was… the gravitational wave background, which happened to be the focus of my PhD.
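For readers curious about the data analysis involved: a standard way to dig a known waveform out of noise is matched filtering. The toy sketch below (plain NumPy; nothing like a real detector pipeline, which has to contend with coloured noise, huge template banks, and far subtler statistics) shows a weak blip being recovered from noise that completely hides it to the eye.

```python
import numpy as np

rng = np.random.default_rng(42)

# A short "blip" template: a windowed sine wave (a stand-in for a real waveform).
template = np.sin(2 * np.pi * np.linspace(0, 10, 500)) * np.hanning(500)

# Bury it in a long stretch of white noise. The signal never exceeds
# ~1 noise standard deviation per sample, so it is invisible by eye.
n = 100_000
data = rng.normal(0.0, 1.0, n)
true_pos = 60_000
data[true_pos:true_pos + 500] += template

# Matched filter: correlate the data against the template and find the peak.
# The peak lands at (or within a couple of samples of) the true position.
correlation = np.correlate(data, template, mode="valid")
print("peak found at:", correlation.argmax(), "| true position:", true_pos)
```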

To the surprise of many physicists, some years after I finished my PhD, gravitational waves from merging black holes were finally detected (and the collaboration was awarded the Nobel Prize in physics!). And since last year, we may even have evidence for a gravitational wave background, consistent with the predictions I worked on during my academic years.

I guess, at that point in your life, you were already engaged in Data Analytics, so I am curious to know more about how you were using data during your astrophysics years in Australia (2014-2016). I mean both in terms of the topics you were addressing and in terms of tools and methodologies.

Between the start of my degree and the end of my PhD, I worked with Fortran, C, PAW (used only in high energy physics), Mathematica, Matlab, and finally Python. Nowadays, I do most of my work in Python (and to be fair I don’t think I’d be able to write a single line of code in any of the previous languages). My research was a bit on the theoretical side. My main goal was to simulate and analyze a background signal made of many gravitational waves. The main culprits behind this signal were supermassive black hole binaries spread across the universe.

After that, you worked as a Data Scientist for a few years. You did some insightful work, but I would love to know more about your work at EUIPO (European Union Intellectual Property Office) in the field of Artificial Intelligence.

At some point I decided to leave academia. Partly because work-life balance in academia sucks, but also because I wanted to have a more direct positive impact with my work. And my best career option, given how much my research had been based around data, seemed to be data science.

I worked at a consultancy company in the banking sector, and then I worked at Holaluz (where we almost crossed paths!). In both cases I was doing mostly machine learning, especially natural language processing, computer vision, and time series forecasting. I was also a technical mentor at a summer school called Data Science for Social Good, which was a short but very intense and rewarding experience.

And then at some point I found out about EUIPO. Honestly, until then I had never really cared much about intellectual property (and now I care just a little bit more), but it was definitely a promising option to increase my career capital. We were using AI for tasks like assessing how similar two logos (small images, often with a short text) or two sets of goods and services (short texts) were. I worked a lot with word embeddings, and transformers were slowly becoming a thing around that time. And I also learned a fair bit about data engineering.

In 2022, you joined Our World in Data. For the record, I am a huge fan of this organization and, of course, Hans Rosling. But can you provide a high-level summary of the purpose of Our World in Data and its main areas of work?

I’m also a big fan of Hans Rosling’s work, and so are the majority of my colleagues! 

OWID’s motto is “Research and data to make progress against the world’s largest problems”. We cover a huge variety of topics, including (in no particular order, just as they come to mind) poverty, demography, energy, climate change, artificial intelligence, animal welfare, health and pandemics.

Most people have a very poor intuition about the world’s most pressing problems, and believe that things are getting worse and worse. Even experts in the relevant fields have intuitions that are totally misaligned with reality! Just to mention a few examples, people often think that we have a higher mortality rate now than in the past, due to diseases, war, or natural disasters. But when you start looking at the data, you realize that we are making progress in almost every area. Despite this, newspapers and social media tend to amplify the bad things and omit the incredible progress we’ve made (I made a video about this topic for my channel).

There are still lots of very hard and worrying problems today, but knowing about them should not come at the cost of neglecting progress.

Max Roser, the founder of OWID, summarized all this in a short, but very powerful way: “The world is awful. The world is much better. The world can be much better. All three statements are true at the same time.”

Let’s talk now about the projects you are working on in Our World in Data.

I now do much more “science” with data than I used to do at any of my previous jobs, so I think it’s fair to say that I do “data science”, even though I don’t do any machine learning anymore (which is what most data science jobs require in industry).

Unlike in most industry jobs, we don’t usually need to extract value from a single big dataset that exceeds your machine’s RAM (although we do have a few such cases). Rather, we usually need to extract value from the combination of a large number of small datasets.

We spend a lot of time curating “metadata”, which involves, for example, understanding and explaining what each variable in a dataset means, where it comes from, and all its caveats. When you start digging into any dataset, you always find all sorts of issues and small inconsistencies. Often, we need to accept that the data is imperfect, but still incredibly valuable.

In terms of data engineering, we developed a Python library that works “on top of pandas” to be able to, among other things, handle all that metadata when doing operations on dataframes.
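As a purely illustrative sketch of that general pattern (this is not OWID’s actual library or API, and how pandas propagates metadata varies by operation and version), one can subclass DataFrame so that a metadata attribute survives ordinary operations:

```python
import pandas as pd

class Table(pd.DataFrame):
    """Illustrative DataFrame subclass that carries metadata along."""

    # Attributes listed here are propagated by pandas to derived objects.
    _metadata = ["dataset_metadata"]

    @property
    def _constructor(self):
        # Make slicing, filtering, etc. return a Table, not a plain DataFrame.
        return Table

t = Table({"country": ["Spain", "France"], "co2": [244.0, 306.0]})
t.dataset_metadata = {"source": "Global Carbon Budget", "unit": "Mt CO2"}

subset = t[t["co2"] > 250]       # an ordinary pandas operation...
print(subset.dataset_metadata)   # ...and the metadata survives it
```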

While working on data, I also do a fair amount of research on very diverse topics, which I really enjoy, since I get to learn about lots of interesting things! Right now I’m looking into data on critical minerals for the energy transition.

I find your comment on the time spent curating metadata very relevant, because to some extent you are talking about a hot topic: Data Governance. Could you provide more details on this curation process? Learnings, tips, tools you use, etc.?

If you look at most data sources, even official government ones, very often the data is not well documented. Sometimes not even the units are mentioned anywhere. There is very little consistency in how data is shared and communicated. And data is always nuanced. You really need to understand where a data point comes from and what it means in order to extract value from it.

That’s why having good metadata (additional data that travels alongside the main data) is very useful.

“Curating metadata” usually means squeezing all the information and caveats that need to be known about a certain data point (or a time series, or a dataset) into a few words or a short paragraph. It also means ensuring that the original sources are properly cited and described.
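To make that concrete, here is a hypothetical example of what curated metadata for a single variable might contain. All field names and values below are invented for illustration; this is not OWID’s actual schema:

```python
# Hypothetical variable-level metadata (all fields and values are illustrative).
co2_emissions_metadata = {
    "title": "Annual CO2 emissions",
    "unit": "million tonnes",
    "description_short": "Annual emissions of carbon dioxide from fossil fuels and industry.",
    "caveats": "Excludes land-use change. Earlier decades rely on sparser records.",
    "source": {
        "name": "Global Carbon Budget",
        "url": "https://globalcarbonproject.org/",
        "date_accessed": "2024-05-01",
    },
}
```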

At OWID, almost everything we do is publicly available (both data and code). On the one hand, this makes things easier, since we don’t handle sensitive data. On the other hand, there’s more pressure knowing that thousands of people may find a bug before you do! That said, in the vast majority of cases where someone points out an issue with our data, it’s because there was something odd in the original data, not because we made a mistake in our processing.

One tip: add sanity checks to your data processing code. The more you dig into the data, the more weird things you will find. At some point you need to stop trying to make your data absolutely pristine. But at the very least, you should ensure that your basic sanity checks all pass with green lights.
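As a minimal sketch of what such checks might look like in pandas (the column names and the outlier threshold are invented for the example):

```python
import pandas as pd

def run_sanity_checks(df: pd.DataFrame) -> None:
    """Basic sanity checks for a country-year dataset (illustrative)."""
    # Physically impossible values should never appear.
    assert (df["energy_consumption"] >= 0).all(), "Negative energy consumption."
    # Key columns should never contain missing values.
    assert df["country"].notna().all(), "Missing country names."
    # Flag implausible jumps between consecutive years (threshold is arbitrary).
    change = (
        df.sort_values(["country", "year"])
        .groupby("country")["energy_consumption"]
        .pct_change()
    )
    assert (change.dropna().abs() < 10).all(), "Suspicious year-on-year jump."

df = pd.DataFrame({
    "country": ["Spain", "Spain", "France", "France"],
    "year": [2020, 2021, 2020, 2021],
    "energy_consumption": [950.0, 960.0, 1500.0, 1480.0],
})
run_sanity_checks(df)  # all green lights
```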

How do you think we could use Data and Artificial Intelligence for good? 

You may be disappointed, but I think that the best we can do with data at this very moment in history is rather boring. We should do more OWID-like work, to better understand and prioritize the most pressing problems, and less “rocket-data-science”. Let me explain: I’m excited about all the recent advances in AI, but I’m also very worried about the risks involved. I’m not just talking about issues related to biases or lack of interpretability in machine learning models. These are also big problems, but there are much, much bigger issues that can arise from a powerful AI.

To put it in more visual terms, society is currently traveling in a supersonic plane. We are accelerating, and we don’t know who’s piloting it or where it’s headed. So, in my opinion (and in the opinion of many AI experts), we should slow down that supersonic plane and decide its trajectory. For that, we need more people working on AI safety and governance. Some countries are making progress in this area, but globally we are not doing nearly enough to minimize the risks.

Let’s discuss now the role of Data in Climate Change. One year ago you were one of the speakers in the BcnAnalytics event on The role of Data in the Decarbonization and Energy Transition. Could you tell me the main data points we should all know to understand the magnitude of the tragedy? 

One very important take-away message is that, globally, we are making huge progress. We are, for example, decarbonizing our electricity grid at an incredible pace.

However, another take-away message is that we still have so much more to do. The more progress we make now, the better the future will be.

So, combining the two, we need optimism… cautious optimism.

I’d highly recommend the book my colleague Hannah Ritchie published some months ago, called Not the End of the World. Her book has a very similar vibe to Hans Rosling’s Factfulness, but in the context of environmental issues.

And how can we use data to effectively mitigate the risks? 

Data should be at the core of every conversation we have on climate change, and every decision we make. Last year, while working on climate change data, I made a video about this problem, and another one about the solutions. In the latter video, I suggested various solutions we can all implement at an individual level:

  • Donate to effective organizations working on climate change mitigation. In Spain, for example, you can donate to the Clean Air Task Force via Ayuda Efectiva.

  • Reduce your consumption of animal products. This, of course, has huge benefits in animal welfare, but it turns out to also be one of the most effective ways to reduce our negative impact on the environment.

  • Avoid flying and driving.

  • Adapt your house to be more energy efficient.

  • Help politics move in the right direction. You can do that by directly contacting politicians, or, simply, by voting. Vote for political parties that take climate change seriously. No party is perfect, and you don’t have to agree 100% with them. You just need to agree on the most important topics.

Let me play the devil’s advocate here. The increasing use of LLMs and data centers is leading to a surge in energy consumption. What’s your take on that?

I have heard about this issue multiple times but haven’t looked into the data yet, so I don’t have an informed opinion on it. I definitely need to look into it! But, to play the devil’s advocate (on your devil’s advocacy), I’d also point out that, so far, LLMs are making us much more efficient in accomplishing tasks, which may actually reduce our energy consumption.

In my view, the main issue with the development of LLMs is not energy consumption, but the increased risks from AI that I mentioned earlier.

You told me you are now focused on “animal welfare”. Could you share some more details about that work?

Farmed animal suffering is a problem of huge scale, highly neglected, and for which we already have tractable solutions. Therefore, looking at this problem from the perspective of Effective Altruism, we should assign a high priority to it.

At OWID, last year we published a topic page on animal welfare. There you can find data on, for example, how many animals are slaughtered to produce different kinds of food. Right now I’m not working on this topic, but we want to expand this page soon.

OWID aside, I’m currently preparing a video on this topic, which is taking me a long time to make. It’s a really tricky one, but also really interesting. It touches on many scientific and philosophical areas. And, even though many people have been working on it for decades, we now have much more data to work with.

I think you might also be interested in some of the work done by the Welfare Footprint Project. They have created a GPT that helps you evaluate and quantify the welfare impact of certain common practices used with animals (for example, keeping chickens in battery cages).

Thanks to projects like this, we now have ways to quantify not only the environmental impact of our food choices, but also their welfare impact.

Echoing what you said, the issue becomes even more Kafkaesque when you realize that around one third of the food we produce is wasted. This is a topic very close to my heart, and one of the main challenges, believe it or not, is the lack of reliable data on the amount of food wasted at each step of the food chain. Any thoughts on this?

According to some estimates, one in four animals raised in factory farms never even makes it to our plates. Those animals, after enduring a dystopian existence, either die before reaching the slaughterhouse or become food waste.

So yes, reducing food waste is another effective action we can take to help mitigate climate change. Especially if that food is an animal product, since animal products have a much larger carbon footprint.

Finally, I agree that the data on food waste is very limited. As is the data related to animal welfare.

You mentioned before your passion for music, which I share. And you also have previous experience in the field of intellectual property, so I am going to ask you the same question I recently asked both Marc Planagumà (aka DJ Kram) and Ramon Navarro in their respective interviews. So, here goes the question: what do you think about AI tools such as Udio or Suno, which are being used to “generate music”? I believe they could boost creativity… but at the same time they are getting criticism, and some record labels, including Sony and Universal, have filed lawsuits against them, arguing that they “steal music to spit out similar work”.

Oh yes, this is a thorny issue!

I used to be much more active in music. In particular, I co-produced an album and several music videos with my partner a few years ago. But lately, I’ve been spending less time on music, mainly to have more time to make videos (which is not as easily replaceable by AI… yet!).

On the one hand, I am excited about some of the tools that will come in the future. We may be able to press a button and instantly produce a song that combines, in a unique way, influences from all the music we loved in childhood. As a listener, this is an incredible gift.

On the other hand, part of the reason why we love music is that we nurture it and share it. If we reach a saturation point where we can generate hyper-personalized music for every single moment of our lives, none of that music will accompany us in the future. That nostalgic feeling I get now when I listen to Steven Wilson’s Happy Returns would not be there if I had not listened to that song hundreds of times, on different occasions, together with other people.

So there is a complicated trade-off to maximize the benefit of music in our lives. We want more and more music, but we also want that music to become part of us.

And, as someone who not only enjoys listening to music but also creating it, the idea of pressing a button to produce a whole song makes creator-me sad. Why should I put so much effort into creating a song, when an AI can do it in a second? Again, there is a complicated trade-off here. Software is making it increasingly easy to create new music. But creators don’t want to be removed from the creation process altogether.
