Online Data Analysis Project

Industry: Big Data
Location: Saudi Arabia
Collaboration: 2022-Ongoing

Customer

The customer is a large company in Saudi Arabia specializing in media monitoring and analysis. They cover media platforms (TV and radio channels, social networks, and news outlets), collect data from these sources, and analyze it for various market sectors.

Their clients typically need to monitor news, world events, and people’s reactions in order to adjust their marketing strategies.

The platform we designed for them collects data from an extensible set of sources and lets users run custom queries and extended searches against the collected data. The results are presented as interactive widgets, charts, and live reports.

Challenge

The challenges in the project were split into several phases.

The first set of challenges concerned collecting:

  • Online sources with RSS feeds from Google
  • Articles from RSS feeds
  • Posts from Twitter
  • User profiles from Twitter

Articles had to be obtained from RSS feeds. Suitable feeds were discovered on the Internet by an automated web crawler, and approved feeds were consumed in the background throughout the day.

This way, the database would grow every day: it would receive 150,000 posts and articles from RSS feeds overnight.
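The feed consumption step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it parses an RSS 2.0 document with the standard library's xml.etree.ElementTree (the project itself lists feedparser, which handles many more feed formats), and the names parse_rss_items, new_items, and SAMPLE_FEED are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <link>https://example.com/a1</link>
      <pubDate>Mon, 03 Jan 2022 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second article</title>
      <link>https://example.com/a2</link>
      <pubDate>Mon, 03 Jan 2022 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of dicts (title, link, published), one per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


def new_items(items, seen_links):
    """Keep only items whose link was not stored before, and record them."""
    fresh = []
    for entry in items:
        if entry["link"] and entry["link"] not in seen_links:
            seen_links.add(entry["link"])
            fresh.append(entry)
    return fresh
```

Deduplicating by link is an assumption made here so that the same feed can be polled repeatedly during the day without storing an article twice; in a real deployment the set of seen links would live in the database rather than in memory.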

In addition, third-party services could be activated for specific purposes.

The customer said that they would want to add more data sources to the list in the future. One such service could be eMedia Monitor for TV and radio analysis.

Once the collected data was stored in the database, the next set of challenges concerned data analysis.

The data had to be shown on the website in a meaningful format. There were five main types of analysis:

  • Online analysis
  • Social analysis
  • Account analysis (analysis of a user profile in Twitter)
  • Comparison (of social media content provided by different influencers)
  • 24/7 data analysis

Each module had to have interactive multi-dimensional widgets: charts in which the user could see stats, adjust dimensions, and monitor the amount of data at pivotal points.

The AI/ML component attached potential reach rankings, sentiments, targeted demographics, categorized topics, and other metrics to posts and articles, and these metrics were shown in the widgets.
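As an illustration of how a sentiment metric could be turned into a label for a widget, here is a minimal sketch. It assumes a compound score in the range [-1, 1], which is what vaderSentiment (listed among the technologies) produces, and it uses the ±0.05 thresholds that the VADER documentation suggests as a convention. Whether the project used these thresholds is an assumption, and the function name sentiment_label is hypothetical.

```python
def sentiment_label(compound, threshold=0.05):
    """Map a compound sentiment score in [-1, 1] to a coarse label.

    The default threshold follows the convention suggested in the VADER
    documentation; it is an assumption, not the project's confirmed setting.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must be within [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Keeping the threshold as a parameter makes it easy to tune how much of the content falls into the neutral bucket.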

The ML component was a separate challenge. It had to process posts written in multiple languages. In most cases the languages were English and Arabic, but since the database also contained posts in French, German, Russian, and other languages, the app had to translate those texts so that the AI module could process them.
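The translate-before-processing step can be sketched as a small routing function. This is a sketch under stated assumptions: each post is assumed to carry a language code, the target language for translation is assumed to be English (the source does not say), and the names SUPPORTED_LANGUAGES, prepare_for_ml, and the translate callable are hypothetical.

```python
# Assumption: languages the ML module can process without translation.
SUPPORTED_LANGUAGES = {"en", "ar"}


def prepare_for_ml(text, language, translate, target="en"):
    """Return (text, language) in a form the ML module can process.

    `translate` is any callable (text, source_lang, target_lang) -> str,
    for example a thin wrapper around a translation service.
    """
    if language in SUPPORTED_LANGUAGES:
        # English and Arabic posts pass through untouched.
        return text, language
    # Everything else is translated into the target language first.
    return translate(text, language, target), target
```

Passing the translator in as an argument keeps the routing logic independent of any particular translation provider and makes it easy to substitute a stub in tests.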

A predefined set of widgets had to be present on the screen. It could also be downloaded as an instant report, a PDF with custom styles.

In addition to the above, the project required a mobile application developed in React Native.

Altogether, the project posed many challenges, and the biggest one was performance.

Solution

The backend was built with mainstream technologies, which makes it easier to support. Since the app had to use the database extensively, communicate with an ML component, and process a huge amount of data in the background, the chosen technologies were:

  • PostgreSQL
  • Python / Django / Django REST Framework for the backend
  • Celery / Flower for the background processing
  • Vue.js for the frontend

Since the application architecture was planned ahead, it was possible to define the tables, references, and indexes up front. However, we still had to go through several stages of optimization.

Those stages involved index optimization (especially for full-text search) and SQL query optimization.

Queries for widgets were quite slow, and each project could have up to 50 widgets attached to it. This problem was solved by decomposing the tables and linking them properly.

Over time, the codebase grew and gained multiple modules for various types of analysis, background processing, data querying, filtering, converting user input into indicative data sets, and reporting.

The application deployment was set up using Dokku. The app was hosted on a dedicated Oracle server inside a Docker container.

Results

The development team's effort resulted in three closed releases, after which the company started to analyze data for specific keywords and compare it with results generated by other systems.

One of the advantages that the Online Data Analysis Project is going to have in the market is a fast, handy, and user-friendly interface. However, the application is still waiting to be released to early adopters: real clients who can use the platform extensively for real-life use cases and provide detailed feedback that will help the product owner improve the system.

After that phase, the system will be released to the public and become a competitor to older platforms.

Services Provided

Technologies Used

  • Python / Django
  • Django REST Framework
  • feedparser
  • Celery / Flower
  • unittest
  • Selenium
  • PostgreSQL
  • Vue.js / Vuex / SCSS
  • Chart.js / QuickChart
  • Aspose.Words
  • vaderSentiment
  • Docker / Dokku
  • Oracle servers