Dergilik Journey Part 2 - Text to Speech

Dergilik users can read articles on a wide range of topics, such as travel and tourism, cooking and food, car economy, and so on. As I mentioned in Part 1, we wanted to offer another option for moments when users are doing chores around the house or driving on their commute, and their eyes and hands are occupied. So we wanted to give our customers a listening experience 🎧 with Text-to-Speech (TTS).

We started by reviewing the TTS engines from Google, Microsoft, and Nuance, testing them on our own Turkish-language content. Google did a good job, but our AI team's engine did better. The only risk was that it was not yet ready for production. 🙈

Mobile App

I want to start with the mobile applications because they are easier to explain with pictures. We had excellent mobile developers on staff at Dergilik, and we came together with the fizy (music streaming app) mobile team to speed things up. Luckily, at Turkcell you can find experienced developers in every field whom you can ask questions. We picked up all the key points about streaming in the iOS/Android ecosystem. Now let me show you two pictures. The left one is our first version 🙈, and the right one is the current version. 🚀

You can easily see the differences, but let me explain further below.

The user could not change the playback speed in the first version.
The user could not share content in the first version.
After listening to one article, the user was stuck, because there was no playlist in the first version.
The user could not seek on the player in the first version.
The second one is far superior and more functional from a design standpoint.

We launched our first version in three months. We didn’t try to make it perfect; instead, we tried to iterate, adding something new in the shortest time possible. So we started with the most basic features and kept optimizing through A/B tests and user feedback.

CMS & AI

We met with the content management team and told them what we wanted to do, and they walked us through how they work.

They prepared 10 or more new articles on different topics every day.
They usually prepared articles between 08:00 and 11:00.
They were likely to change the content of an article after preparing it.
They could prepare drafts without publishing.
We asked them to listen to the TTS files before publishing, so the visibility of audio files defaulted to false.

We decided to make it mandatory to listen to articles before publishing because we wanted to be sure about the quality of the audio files. After launching the feature to whitelisted users, we got feedback from the content management team and made these changes.

We had built a simple player, but it lacked time information and a seek option. It was painful for our content team to listen to an article and jump to a specific time. To work around that, they had to download the mp3 files to their computers. So we quickly updated our player.
They made more changes to the articles than we expected. This overloaded the TTS platform and slowed down the text-to-speech conversion. The AI team found a clever solution: they cache each article and only process the sentences that changed. It solved our problem without expanding our infrastructure.
TTS is not an easy job. Every language has its own grammar, and in everyday use we rarely pay attention to it. We ran into various problems with abbreviations, proper names, and foreign words. Our AI team put great effort into these and quickly trained the TTS platform to handle them.
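
To make the caching idea above concrete, here is a minimal sketch in Python. It is not the AI team's actual implementation. The class name SentenceCache, the helper split_sentences, and the synthesize callback are all hypothetical, and the naive punctuation-based sentence split is an assumption that would need real work for Turkish abbreviations.

```python
import hashlib
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


class SentenceCache:
    """Caches synthesized audio per sentence so an edited article only
    re-synthesizes the sentences that actually changed."""

    def __init__(self, synthesize):
        # synthesize is a hypothetical callable: sentence text -> audio bytes.
        self._synthesize = synthesize
        self._audio_by_hash = {}

    def render(self, article_text):
        segments = []
        for sentence in split_sentences(article_text):
            key = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
            if key not in self._audio_by_hash:
                # Cache miss: this sentence is new or was edited.
                self._audio_by_hash[key] = self._synthesize(sentence)
            segments.append(self._audio_by_hash[key])
        return segments
```

With a cache like this, editing one sentence of a long article costs one synthesis call instead of one per sentence, which matches the effect described above: less load on the TTS platform with the same infrastructure.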

Later, we dropped the rule of listening to articles before publishing because it was inefficient. We let the content team publish articles without listening and put a Beta label on the Android player. Unfortunately, we couldn’t do the same on the Apple platform because of their policy.
We launched the application to a whitelist of Turkcell employees. A beta group like this is a great way to gather feedback.

Platform

Here is a diagram of the basic architecture. As I mentioned, we started with the mobile client because the TTS platform was not production-ready. While developing the services needed for TTS, we also changed our list architecture. We developed an architecture called Auto-List. Maybe we’re not good at naming :) The idea was that anybody could create an automatic list of articles, magazines, or newspapers without development or deployment. For example, they could create “most popular articles”, “articles from X magazine”, or “articles for business people” with basic SQL and no development, and the list would be live in production. We used that architecture for articles and podcasts, and I’ll go into detail in the podcast part.

Anyway, after Auto-List, we started developing the integration with the TTS platform. We sent the article text to the TTS platform and received the audio file location asynchronously. Once we had the location, we uploaded the audio file to our CDN and made it playable in the mobile apps. Of course, this was the happy path, and we also needed to handle the following:

We didn’t have unlimited resources; the TTS platform could be busy with ongoing work, so we had to implement a retry mechanism.
We had to be able to monitor ongoing tasks easily and make them visible to everyone.
If a task or process failed, we needed to notify the right people with the details. In some cases, we also needed to retry.
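
The three requirements above can be sketched together in a few lines of Python. This is a simplified, synchronous sketch under stated assumptions, while our real integration was asynchronous. The names TtsTask, PlatformBusy, submit, and notify are hypothetical, and the exponential backoff and attempt limit are illustrative choices rather than the values we used.

```python
import time
from dataclasses import dataclass, field


class PlatformBusy(Exception):
    """Hypothetical error raised when the TTS platform cannot take more work."""


@dataclass
class TtsTask:
    article_id: int
    status: str = "PENDING"  # visible state for monitoring dashboards
    attempts: int = 0
    errors: list = field(default_factory=list)


def run_with_retry(task, submit, notify, max_attempts=3, base_delay=1.0,
                   sleep=time.sleep):
    """Submit a task, retrying with exponential backoff while the platform is busy.

    submit(article_id) returns the audio file location or raises PlatformBusy.
    notify(task) is called once if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        task.attempts = attempt
        task.status = "RUNNING"
        try:
            audio_location = submit(task.article_id)
        except PlatformBusy as exc:
            task.errors.append(str(exc))
            if attempt < max_attempts:
                task.status = "RETRYING"
                sleep(base_delay * 2 ** (attempt - 1))
        else:
            task.status = "DONE"
            return audio_location
    task.status = "FAILED"
    notify(task)  # report to the responsible people, with collected errors
    return None
```

Keeping status, attempts, and errors on the task object is what makes monitoring and reporting cheap: the same record can feed a dashboard and the failure notification.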

The entire team (Development, QA, Content Management, Business, Operation, AI, Security, and many others) took the initiative and worked efficiently. We did not work extra hours for this job; it’s all the result of teamwork. In just three months, we were able to launch this feature. Over 2 million audio articles were listened to in the first 11 months alone.🚀✌🏻👏🏻

In Part 3, I’ll talk about how we changed the Speech Experience. As a keepsake, I’d like to share some highlights from our team. Since we started working remotely after COVID, we haven’t had a chance to get together. That’s why the photos may not include all team members.