Find out how an agile and easy-to-use data integration solution enabled Areeba to take their machine learning efforts to the next level.
Areeba is a leading Lebanese financial technology company that provides smarter, faster, and innovative payment solutions for banks, merchants, and governments in the Middle East region. Areeba is committed to investing in new capabilities and technologies that deliver an enhanced payment experience and more secure solutions, from biometric cards and mobile payments to smartPOS.
As an innovative company, Areeba is always striving for cutting-edge solutions and has started to apply data science methods to transactional and operational data in order to improve the performance of its services. Areeba ran into several challenges along the way, but ultimately overcame them with the Data Virtuality Logical Data Warehouse.
Large Amounts of Data in Various Data Sources
In order to apply machine learning models on the transactional and operational data, Areeba has to deal with huge amounts of data. Furthermore, all the data lies in different sources, such as MemSQL, MariaDB, and Oracle. This makes the challenge even more complex.
Manual Export of Data Resulting in CSV Chaos
Like many data scientists, the data science team at Areeba first tried to collect the data it needed from multiple sources and in different formats, such as CSV and text files. The data was then used to build predictions and models in languages such as Python, R, or Scala. However, this process was time-consuming, troublesome, and error-prone. The team began to see that this approach would cause problems for real-time analytics in the near future.
Looking for a Scalable Solution
Areeba knew that traditional data warehousing systems were inefficient and difficult to scale. Thus, they began to look for a solution that enabled data virtualization. At the end of the day, data is the main ingredient from which machine learning algorithms are trained. Although Areeba was able to gather good-quality data, the team was stuck copying and pasting data from CSV files into a central place and formatting it before they could even start building predictive models. That's when their data architecture team found the Data Virtuality Logical Data Warehouse.
Working hand in hand, the data architecture/analyst team and data science team at Areeba built a foundation and process in Data Virtuality which is efficient, scalable, and faster than ever before.
The data architecture team
Using the Data Virtuality virtual layer, the data architecture team began to integrate the data from the different data sources into a central place, using the JDBC and REST API connectors. In this virtual layer, the data architects could separate the modules by responsibility and thereby serve different teams with different requirements without editing or creating new code. Furthermore, the whole process is automated and scheduled, so the data is always up to date and ready to use. With this single source of truth in place, the data architecture team can not only manage all incoming requests in a timely manner but also ensure high data quality. Lastly, security risks have been greatly reduced, as the need to transfer CSV files across networks has been eliminated.
The data architecture team built a data as a service (DaaS) concept in Data Virtuality: data from all systems is available in the Data Virtuality layer, where it can be accessed by all services and apps. Without Data Virtuality, Areeba would need to code the connection to the databases, as well as the query definitions, for every single connection between a data source and a data science/machine learning tool. With Data Virtuality in place, Areeba now builds centralized view definitions in the Logical Data Warehouse and uses APIs to connect to and retrieve data from the centralized data model.
"Instead of coding the connections to databases and query definitions separately from each data source to different data science/machine learning tools, Areeba builds centralized view definitions in Data Virtuality and uses APIs to connect and retrieve the data from the centralized data model."
From the front-end perspective, Areeba connects from R to the Data Virtuality engine via a JDBC connector. They stream data of 40 to 50 million transactions this way.
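The idea of one centralized view definition that every consumer queries through a single connection can be pictured with a small sketch. This is not Data Virtuality code: it uses Python's built-in sqlite3 module, with two attached in-memory databases standing in for separate source systems. The schema names (transactions_src, merchants_src), the view name (merchant_payments), and the sample rows are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Create two stand-in 'sources' and one view that joins them."""
    conn = sqlite3.connect(":memory:")

    # Each attached database plays the role of an independent source system.
    conn.execute("ATTACH DATABASE ':memory:' AS transactions_src")
    conn.execute("ATTACH DATABASE ':memory:' AS merchants_src")

    conn.execute(
        "CREATE TABLE transactions_src.payments (merchant_id INTEGER, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE merchants_src.merchants (merchant_id INTEGER, name TEXT)"
    )

    # Illustrative sample rows only, not real data.
    conn.executemany(
        "INSERT INTO transactions_src.payments VALUES (?, ?)",
        [(1, 10.0), (1, 5.5), (2, 20.0)],
    )
    conn.executemany(
        "INSERT INTO merchants_src.merchants VALUES (?, ?)",
        [(1, "Shop A"), (2, "Shop B")],
    )

    # The join logic is defined once, centrally. SQLite only lets TEMP views
    # reference tables in other attached databases, hence CREATE TEMP VIEW.
    conn.execute(
        """
        CREATE TEMP VIEW merchant_payments AS
        SELECT m.name AS merchant, p.amount AS amount
        FROM transactions_src.payments AS p
        JOIN merchants_src.merchants AS m
          ON m.merchant_id = p.merchant_id
        """
    )
    return conn


def fetch_merchant_payments(conn: sqlite3.Connection) -> list:
    """A consumer queries the view without knowing where the data lives."""
    cursor = conn.execute(
        "SELECT merchant, amount FROM merchant_payments ORDER BY merchant, amount"
    )
    return cursor.fetchall()
```

A consumer would call fetch_merchant_payments(build_virtual_layer()) and never touch the underlying tables directly, which is the property the centralized data model is meant to provide.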
In addition to Data Virtuality and R, Areeba uses Spark, so the integration looks as follows:
R -> Spark -> Data Virtuality.
They configured calls in Spark so that data is retrieved in partitions and not in just one query. Handling large amounts of data from different data sources isn’t a challenge anymore.
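The source does not say how the partitioned retrieval was configured. Spark's JDBC reader does offer the options partitionColumn, lowerBound, upperBound, and numPartitions for this purpose. The sketch below illustrates the general idea in plain Python, assuming a numeric key column: one large query is split into several range-bounded queries that can be fetched independently. The function names, the view name, and the column name are hypothetical.

```python
from typing import List, Tuple


def partition_ranges(lower: int, upper: int, num_partitions: int) -> List[Tuple[int, int]]:
    """Split the half-open key range [lower, upper) into contiguous chunks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper <= lower:
        return []

    total = upper - lower
    # Never create more partitions than there are key values.
    count = min(num_partitions, total)
    base, extra = divmod(total, count)

    ranges = []
    start = lower
    for index in range(count):
        # The first `extra` partitions take one additional key each.
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def partition_queries(view: str, key_column: str, lower: int, upper: int,
                      num_partitions: int) -> List[str]:
    """Build one range-bounded SELECT per partition (identifiers are trusted here)."""
    return [
        f"SELECT * FROM {view} WHERE {key_column} >= {start} AND {key_column} < {end}"
        for start, end in partition_ranges(lower, upper, num_partitions)
    ]
```

Because the ranges are contiguous and do not overlap, the union of the partition results equals the result of the single unpartitioned query, while each individual request stays small enough to process in parallel.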
An in-memory database is used to materialize the data from the data lake and from Master Data Management (MDM) that requires high-speed access. Data that does not need high-speed access stays in the data lake and MDM. Taken together, this amounts to a virtualized data lake. Different areas in this virtual data lake use different storage and processing technologies, so that different types of data access requirements can be served in the best possible way.
“One of the most important learnings that we got out of this journey is how important data integration is for the machine learning process. And this refers to all parties involved: data architecture/analyst team as well as the data science team. Data Virtuality helped us to reduce the grunt work and eliminate idle time.”
Bernard Bardawil, Development Lead at Areeba
The data science team
The data science team no longer has to copy and paste data from different CSV files. Today, they go directly to the virtual layer to get any data they need. This has provided two big benefits:
- Elimination of Idle Time: The data scientists no longer have to wait for the data analyst team to provide the data they requested.
- Real-time Data From Multiple Sources with the Right Format: Data is always available in one single place.
The data scientists can now build the views and aggregations needed for the predictive models directly in the virtual layer. They can then bring the prepared data into Python or R for modeling.
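As a rough illustration of this workflow, the sketch below defines an aggregation as a SQL view and then loads only the aggregated rows into Python. It uses the built-in sqlite3 module as a stand-in for the virtual layer. The table, view, and column names (payments, merchant_features, txn_count, and so on) and the sample rows are hypothetical, not Areeba's actual schema.

```python
import sqlite3
from typing import Dict, List


def demo_connection() -> sqlite3.Connection:
    """Create a stand-in data layer with one table and one aggregation view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (merchant_id INTEGER, amount REAL)")

    # Illustrative sample rows only, not real data.
    conn.executemany(
        "INSERT INTO payments VALUES (?, ?)",
        [(1, 10.0), (1, 30.0), (2, 5.0)],
    )

    # The aggregation lives in the data layer, so every consumer
    # gets the same per-merchant features from the same definition.
    conn.execute(
        """
        CREATE VIEW merchant_features AS
        SELECT merchant_id,
               COUNT(*)    AS txn_count,
               SUM(amount) AS total_amount,
               AVG(amount) AS avg_amount
        FROM payments
        GROUP BY merchant_id
        """
    )
    return conn


def load_merchant_features(conn: sqlite3.Connection) -> List[Dict[str, float]]:
    """Fetch the aggregated rows as plain dictionaries, ready for modeling code."""
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # allows access to columns by name
    cursor.execute(
        "SELECT merchant_id, txn_count, total_amount, avg_amount "
        "FROM merchant_features ORDER BY merchant_id"
    )
    return [dict(row) for row in cursor.fetchall()]
```

Only one row per merchant crosses into Python here, rather than every raw transaction, which is the practical advantage of doing the aggregation in the data layer first.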
So, what kind of models does the data science team build and for what do they use them?
The data science team at Areeba applies machine learning to learn more about customer and merchant segmentation, merchant performance, and churn. The models used include time series, regression, decision trees, random forests, and various clustering algorithms.
Because the insights gained from these models are also important to business users, the data science team built a connection to Tableau in Data Virtuality. This could be done without the involvement of any developer.
Very little programming was required apart from the work done by the data science team. The usual steps of writing a SQL query in Java, combining the results, and exposing them as JSON are no longer necessary.
“We wished we had known Data Virtuality from the beginning when we started with our machine learning processes. This could have saved us a lot of time and energy! Data Virtuality is now a very essential part of our daily life as a data scientist. It helps us to capture, mix and consume data from different sources very easily. Thereby we can save time and focus more on the end result. Exciting times ahead!”
Khaled Eid, Data Scientist at Areeba
Now that Areeba has a scalable and reliable solution in place, they have even bigger projects in mind. In the near future, the data science team wants to expand into neural networks and deep learning. The models that the data science team builds are shared with business users for feedback. This feedback will be fed back into the existing machine learning models to enrich them and to learn more.