Conversation with Jaume Civit
Let's start this conversation in the year 2002, which is when you started your studies as a Telecommunications Technical Engineer. What led you to make that decision? Did you feel a calling to be an engineer?
Two things. First: I am curious by nature; I was always taking things apart, trying to understand how they work. I believe this curiosity is what makes an engineer. Second: back then we were living through a revolution in communications, and "telecos", for me, offered the basis for everything. The degree was also heavily advertised. Time proved to me that it was a great decision: go first for the bachelor's, with its more practical approach, and complete it later on with a Master's degree.
Once you finished your education, you began your professional career at Telefónica I+D, as an R&D engineer and later on as a Project Manager and Software Architect. Tell me a little bit about the type of projects you participated in. I am particularly interested in "the integration and deployment of the Recommendations Module of the Personalization Server Global platform of the Telefonica Group".
That is quite a story. I started at Telefonica R&D building and researching methods to improve 3D video reconstruction, with the aim of raising the user experience in video conferences by creating a feeling of immersion. Long story short, I did applied research in Image Processing for some years.
Around 2006-07, I got the chance to be involved in quite an ambitious project within the Telefonica Group, aiming to solve the dispersion of, and accessibility to, customer data. Leveraging the content gathered in this new system, we were in a position to enable the use of recommendations (for sales) in different business units. Later on, I was offered the chance to manage this component, and one could say this was my first immersion in big-data-related topics and their application in a real business.
It seems you developed an interest in the recommender field, because your next move was to Tuenti (part of Telefonica), where you were dealing with the design of methods and algorithms for recommendations in different domains.
Indeed. Leveraging data to create a better offering for the customer, the technical challenge implied in moving those volumes of information, and the possibility to experiment with new tech - I played a little with graph databases back then - were really exciting. But what moved me was the chance to create a real impact on the customer. Personalization, recommendations, etc. require a real understanding of your customer: one can't succeed by just scratching the surface of the data.
Echoing your journey through "recommenders", I would like to ask you about the main differences and changes in this field since your early days at Telefonica and Tuenti.
Regarding my time at Telefonica, we have to bear in mind we were in the early days of Big Data solutions: first, one needed to be able to gather the amount of information produced across all the touchpoints. Once you had that data, you had to roll out a Hadoop cluster - with all the implications and learning required - to make it work properly. That was already quite an effort. And then you had to learn how to apply those "collaborative filtering" methods at scale.
Cloud providers were in their very early days, so there was no chance to deploy a computing cluster with a couple of clicks; almost everything was happening on premises.
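To make the "collaborative filtering at scale" point above concrete, here is a minimal item-based collaborative filtering sketch on a toy interaction matrix. It only illustrates the technique; the data, library choices and scoring are illustrative, not the actual Telefonica implementation, which ran on Hadoop at a very different scale.

```python
# Minimal item-based collaborative filtering sketch (toy data, in-memory).
import numpy as np
from scipy.sparse import csr_matrix

# User x item interaction matrix: rows = users, columns = items.
interactions = csr_matrix(np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]))

# Cosine similarity between item columns.
item_norms = np.sqrt(interactions.power(2).sum(axis=0)).A.ravel()
item_norms[item_norms == 0] = 1.0
co_occurrence = (interactions.T @ interactions).toarray()
similarity = co_occurrence / np.outer(item_norms, item_norms)
np.fill_diagonal(similarity, 0.0)

def recommend(user_idx, top_n=2):
    """Score unseen items as a similarity-weighted sum of the user's history."""
    user_vector = interactions[user_idx].toarray().ravel()
    scores = similarity @ user_vector
    scores[user_vector > 0] = -np.inf  # never re-recommend seen items
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))  # item indices suggested for user 0
```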
At Tuenti we "experimented" with graph databases - it was a social network, so quite fitting. Those were the early days of Neo4j, and the challenge there was to work with very big graphs (millions of nodes and even more relationships).
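As a flavour of what that kind of workload looks like, below is a small friends-of-friends recommendation sketch using the official neo4j Python driver. The connection details, the Person/FRIEND graph model and the query are assumptions for illustration only, not Tuenti's actual schema or code.

```python
# Friends-of-friends suggestions on a hypothetical social graph
# (Person nodes connected by FRIEND relationships).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

FOAF_QUERY = """
MATCH (me:Person {id: $user_id})-[:FRIEND]-(friend)-[:FRIEND]-(candidate)
WHERE candidate <> me AND NOT (me)-[:FRIEND]-(candidate)
RETURN candidate.id AS suggestion, count(friend) AS mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10
"""

def suggest_friends(user_id):
    with driver.session() as session:
        result = session.run(FOAF_QUERY, user_id=user_id)
        return [(record["suggestion"], record["mutual_friends"]) for record in result]

print(suggest_friends("user-42"))
driver.close()
```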
In 2015 you became Head of Analytics at Wallapop. Could you let me know the main projects you were responsible for?
That was another jump in my career. Wallapop was growing super fast in its first year, with all the challenges of a startup: a lot to do and not much time to do it.
The first thing we did as a data team was to roll out the infrastructure and processes to do proper Business Intelligence and performance analysis. Basically, we needed to be able to answer questions quite fast (analyze competitors, respond to the VCs, understand why something is working and why the same thing is not working in another region, etc.).
But we had other challenges, like filtering out forbidden/illegal content. That was a project I particularly enjoyed.
We’ll come back to the forbidden content in a while, but before that, let’s talk about the Data Platform. I learnt you had to develop a solution able to handle peaks of “1M events/minute”. I am curious to know how you faced it.
One of the challenges was the concurrency of customers: we had tons of them with very, very long navigations (long sessions scrolling articles in the wall that triggered tons of events).
Kafka was our friend here; we developed a solution based on it, with the ability to scale horizontally: more data, more partitions. I have to say it took us a while to put the Kafka cluster in production and make it perform at scale.
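A stripped-down version of that ingestion pattern could look like the following. The topic name, broker address and library choice (kafka-python) are placeholders of mine, not necessarily what was used at Wallapop; the key point is that partitioning by a key lets the cluster and its consumers scale horizontally.

```python
# Event producer sketch: events are keyed by user id so that each user's
# events stay ordered within one partition, while partitions spread the
# load across brokers and consumers.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",    # wait for replicas: trade a bit of latency for durability
    linger_ms=20,  # small batching window to raise throughput under peaks
)

def track(user_id, event_type, payload):
    producer.send(
        "tracking-events",
        key=user_id,
        value={"user": user_id, "type": event_type, **payload},
    )

track("user-123", "item_viewed", {"item_id": 987})
producer.flush()
```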
And what would you have done differently on the platform taking into account the capabilities we have in 2024?
I think the solution has aged quite well. Now every cloud provider has a similar managed offering that avoids maintaining that Kafka cluster yourself (Event Hubs in Azure, Kinesis in AWS, Pub/Sub in GCP, ...). Maybe I would have gone for a Kubernetes-based approach for the data capture API, maybe, but I would try to avoid, for sure, doing too much Ops work (needed but not loved).
Let's get back to the forbidden/illegal content topic. In a marketplace such as Wallapop, you have to handle multiple fraud and scam attempts. I remember when you explained, many years ago, the challenge of people "selling pets" (which is illegal in Spain). Which solution did you design? And a second question: how would you do it now with the currently available solutions?
We tried several things: from a super manual approach, placing content moderators by category - not affordable; to text classification, training simple Bayesian classifiers on tokens extracted from fraudulent texts - a partial solution, but the diversity of content was too much; to more advanced methods like using Google Image Net to "tag" the images so that, in combination with the text, we could classify the content as sensitive or not, which substantially reduced the amount of articles to validate.
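The "simple Bayesian classifier" idea can be sketched in a few lines. The training samples below are invented for illustration, and the real pipeline obviously used far more data and features.

```python
# Toy sketch of a Bayesian text filter for forbidden listings (e.g. pet sales).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "vendo cachorros de bulldog con pedigri",  # forbidden: pet sale
    "regalo gatitos de dos meses",             # forbidden: pet sale
    "bicicleta de montana casi nueva",         # allowed
    "iphone 12 en perfecto estado",            # allowed
]
labels = [1, 1, 0, 0]  # 1 = forbidden, 0 = allowed

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

# Listings scoring above a threshold would be routed to human moderators.
print(model.predict_proba(["vendo cachorros baratos"])[0][1])
```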
Now, one can leverage more advanced solutions: models that can describe the content of an image in detail, LLMs that can extract the semantics of text. Maybe I would have fine-tuned something given the particular context of the company (short descriptions and multiple images)... who knows. The available tool stack has leaped a whole generation.
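One way to approach the same problem today - a sketch, not a statement of what Wallapop does - is zero-shot classification of a listing's text against moderation labels with an off-the-shelf model; the model name and labels below are placeholders.

```python
# Zero-shot moderation sketch: no task-specific training data is needed,
# the candidate labels define the "classes" at inference time.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

listing = "Selling two adorable bulldog puppies, vaccinated, 300 euros each"
labels = ["animal sale", "electronics", "furniture", "clothing"]

result = classifier(listing, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # top label and its score
```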
In 2018 you joined SCRM-Lidl as Chief Data Officer (CDO). Which are your main responsibilities there?
There, I am in charge of three main branches of data-related topics: 1) developing and maintaining the data platform able to support the millions of Lidl customers we handle; 2) data management: as a loyalty program we gather data that needs to be productized; 3) and finally, the initiatives related to the use of advanced methods such as AI in our domain of activity.
Let’s first talk about the platform side: what is the stack you are using, the team you have there, and the main challenges you have to address?
As of now we are on Azure, for everything. We usually don't talk about figures, but consider that Lidl has a presence in 32 countries, so our volumes are way beyond petabyte scale.
We are big fans of the Lakehouse approach, so the core of our setup is the data lake we run with the ecosystem Databricks offers (we are heavy users there).
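As an illustration of the Lakehouse pattern described here - with made-up paths and table names, not the Hub's actual pipelines - a typical bronze-to-silver refinement on Databricks looks roughly like this:

```python
# Lakehouse sketch: land raw events in a bronze Delta table, then refine
# them into a curated silver table. Assumes the bronze/silver databases exist.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

raw = spark.read.json("/mnt/landing/loyalty_events/")

# Bronze: store data as ingested, partitioned by day for cheap pruning.
(raw.withColumn("ingestion_date", F.current_date())
    .write.format("delta")
    .mode("append")
    .partitionBy("ingestion_date")
    .saveAsTable("bronze.loyalty_events"))

# Silver: deduplicate and keep only well-formed records.
(spark.table("bronze.loyalty_events")
      .dropDuplicates(["event_id"])
      .filter(F.col("customer_id").isNotNull())
      .write.format("delta")
      .mode("overwrite")
      .saveAsTable("silver.loyalty_events"))
```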
The magic sauce is a combination of tech and process: tech-wise, we manage to adapt what the market has to offer; process-wise, we established the right organization to turn data into a product, given the particularities of our company.
Now, let’s get into the governance and data management part. Which are the main pillars?
Again, the main activity of the Hub is to provide the Loyalty Platform for the group. Implicit in that is the fact that the data gathered through the platform is an important asset for the business. In order to honor this importance, we designed a set-up where the data is governed from the very source, so we can ensure trust and reliability in the information we deliver.
This is where data contracts come into place. A data contract is a definition of the content that is being produced, plus the definition of ownership and a statement of trust. Any data consumer, by subscribing to a contract, can be sure that the data producer will know that somebody is using their data. The contracts have to be honored: nobody changes the content or the schemas of the data without knowing who is going to complain, and what the impact will be.
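To give a feel for what such a contract might capture - the fields and values below are a hypothetical sketch, not the Hub's actual contract format - it could be expressed as simply as:

```python
# Hypothetical data contract structure: schema, ownership and the
# guarantees consumers subscribe to.
from dataclasses import dataclass, field

@dataclass
class DataContract:
    dataset: str
    owner_team: str                  # who answers when something breaks
    schema: dict                     # column name -> type
    freshness_sla_hours: int         # how stale the data may be
    consumers: list = field(default_factory=list)  # who to notify on changes

loyalty_purchases = DataContract(
    dataset="silver.loyalty_purchases",
    owner_team="loyalty-platform",
    schema={
        "customer_id": "string",
        "store_id": "string",
        "amount_eur": "decimal(10,2)",
        "purchased_at": "timestamp",
    },
    freshness_sla_hours=24,
    consumers=["crm-analytics", "promotions-engine"],
)

# Any schema change starts by checking who subscribed to the contract.
print(loyalty_purchases.consumers)
```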
As you know, data governance without business involvement is really tough, so I wanted to ask you how you split responsibilities between your team, the tech team, and business stakeholders.
Data contracts are the cornerstone for solving this dilemma. When a contract is defined, business owners, technical owners and product owners sit together to write down the expectations and responsibilities and, with the help of the data team, turn them into a contract.
Establishing a data contract looks like extra effort but, trust me, it pays off. In addition, not everything has to go under contract, so ideally this process is not triggered that often.
With the contract at hand, it is clear what we need the data for (business), how it is implemented (tech), and by whom (product).
And last but not least, tell me more about the advanced analytics predictive modeling side. Which are the main use cases?
It is no secret that we put a lot of effort into understanding our customers, so we can provide an experience as personalized as possible. Of course we apply predictive modeling, when allowed, in trying to fulfill this wish: churn prediction - we are a mobile app; next best offer - to personalize promotions; basket analysis to understand and predict purchase patterns. These are just examples that, even though they are quite common in the industry, pose a challenge due to our volume.
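As one small illustration of the basket-analysis use case - with toy transactions, nothing resembling real Lidl data - frequent itemsets and association rules can be mined with FP-Growth in Spark ML:

```python
# Basket analysis sketch: mine frequent item combinations and
# "if X then Y" rules that could feed personalized promotions.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

baskets = spark.createDataFrame(
    [(0, ["milk", "bread", "butter"]),
     (1, ["milk", "bread"]),
     (2, ["bread", "butter"]),
     (3, ["milk", "butter"])],
    ["basket_id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(baskets)

model.freqItemsets.show()      # frequent item combinations
model.associationRules.show()  # rules with confidence and lift
```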
How are you embedding LLMs and Generative AI in your operations?
To be honest, we are at quite an early stage of adoption, but we already leverage the strength of these tools to enrich data points we already have (with the use of embeddings, for example), structure descriptions of the goods we sell, extract nutritional information, and so on. As a group we have our position and we are moving forward towards the integration of these techniques/tools in our operations.
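A small sketch of the embedding-based enrichment mentioned above - the model name and product descriptions are placeholders, not what runs in production - could look like this:

```python
# Embed short product descriptions and compare them: high cosine
# similarity suggests the same underlying product or a near-duplicate.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

descriptions = [
    "Organic whole milk 1L",
    "Whole milk, organic, 1 litre bottle",
    "Dark chocolate bar 70% cocoa",
]
embeddings = model.encode(descriptions, convert_to_tensor=True)

print(util.cos_sim(embeddings, embeddings))
```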
Last time we talked, you told me about the STACKIT initiative. What can you tell me about it?
It is public information: Schwarz Digits, the Schwarz Group company we are part of, is developing a European cloud solution. I can't tell you much more than the fact that it is a very strategic move. You can get more details here.
I learnt you are also part of Anything. What is this project about? What is your role there?
I collaborate with a close friend of mine on this project, where we plan to help busy professionals get some of their time back. More to come.
Looking ahead: what big changes will we see in the next 3-5 years in the field of data analytics and Artificial Intelligence?
For me, it is about the integration and normalization of these solutions in daily life. Today we are exploring what to use them for, how to use them, and when. I am a believer in the human in the loop, but this human needs to be trained accordingly. We are in that phase now.