Real-Time Fraud Detection using Debezium

Real-time fraud detection architecture – RDBMS – Debezium – Kafka – ML Model – Alerting System

Fraud detection is a critical aspect of any financial institution or e-commerce platform. Traditional fraud detection systems rely on batch processing and historical data, which can lead to delayed detection and increased false positives and false negatives.

In 2017, Equifax [1], one of the largest credit reporting agencies in the United States, suffered a massive data breach that exposed the sensitive information of over 147 million people. The breach was caused by an unpatched vulnerability in one of the company’s web applications, and it took the company months to detect and respond to it. The Equifax breach highlights how much damage can accumulate when suspicious activity is not detected in real time.

In this post, I will build a prototype of a real-time fraud detection system using Debezium, Kafka, MySQL, and machine learning. Debezium, the CDC tool used here, has connectors for a wide variety of databases.

The Importance of Real-Time Fraud Detection

Real-time fraud detection is crucial in preventing financial losses and protecting customer information. By detecting anomalies and suspicious activity in real time, organizations can act immediately to stop further losses and limit the damage. The Equifax breach is a prime example: with real-time detection in place, Equifax might have been able to spot the breach earlier and prevent some of the damage.

How CDC tools Like Debezium Can Help

CDC (Change Data Capture) tools like Debezium play a critical role in real-time fraud detection. Debezium is an open-source CDC tool that captures changes to data as they happen, allowing for immediate analysis and action. By feeding Debezium's change events into machine learning models and analytics, financial institutions can detect anomalies and suspicious activity in real time, preventing fraud and minimizing losses. Debezium can also feed other systems, such as CRM [2] and ERP [3], to provide a comprehensive view of customer activity and behavior.

Debezium is not the only option: other CDC tools include Flink CDC (built on Apache Flink) and IBM InfoSphere Data Replication. Apache Kafka is not itself a CDC tool; in these setups it is the streaming platform that carries the change events.

Using CDC tools for real-time data monitoring has several benefits, including improvements in key incident management metrics.

  1. Reduced mean time to detect (MTTD) – The average time taken to detect a security incident. Real-time monitoring with CDC tools can reduce it significantly.
  2. Reduced mean time to respond (MTTR) – The average time required to respond to an incident. In a real-time monitoring setup, CDC tools can shorten it because the relevant changes are available as soon as they occur.
  3. Improved compliance – Meeting compliance SLAs is paramount for businesses in the financial, health care, and e-commerce domains, which often handle PII. A breach of sensitive information can bring down the whole business, so real-time monitoring is essential.

Prototype Setup

Debezium is a change data capture (CDC) tool that captures row-level changes in databases and sends them to Kafka topics. In the case of MySQL, Debezium uses the binlog (binary log) to capture changes.

Main Components:

  • MySQL – Stores the transactional data.
  • Debezium MySQL Connector – Captures changes from MySQL and pushes them to Kafka.
  • Apache Kafka – Acts as the real-time messaging system. The topics Debezium writes to are determined by the connector configuration.
  • Kafka Connect – Runs Debezium in distributed mode to ensure fault tolerance and scalability.
  • ML Service (Flask) – Receives transactions, predicts fraud, and sends alerts.
  • Alert System – A simple Flask app that receives alerts.

Here’s how binlog events reach Debezium and are pushed to a Kafka topic.

  • MySQL writes all changes (inserts, updates, deletes) to its binlog.
  • Debezium uses a MySQL connector to connect to the MySQL database and read the binlog events.
  • The Debezium connector reads the binlog events and converts them into a standardized format, which includes the event type (insert, update, delete), the table name, and the changed data.
  • The formatted events are streamed to a Kafka topic, where they can be consumed by other applications.

Registering the Debezium MySQL Connector

The connector is registered by sending a request to the Kafka Connect REST API. The request payload carries the MySQL connection details, the tables for which CDC events are captured, the Kafka topic settings Debezium writes to, and any connector-specific options, for instance how decimal types are handled.
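As a rough illustration of such a registration request, here is a minimal Python sketch. It assumes Debezium 2.x property names (older releases use `database.server.name` and `database.history.*` instead) and a Kafka Connect worker at `http://localhost:8083`. Every host name, credential, database and table name, and the `dbz_fraud` prefix is a placeholder, not the project's actual configuration.

```python
import json
import urllib.request


def build_connector_payload(name="fraud-mysql-connector"):
    """Build the JSON body for POST /connectors on Kafka Connect.

    All values below are hypothetical placeholders.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            # MySQL connectivity details
            "database.hostname": "mysql",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "change-me",
            "database.server.id": "184054",
            # Prefix of the Kafka topics Debezium writes to
            "topic.prefix": "dbz_fraud",
            # Table(s) for which CDC events are captured
            "table.include.list": "payments.transactions",
            # Emit DECIMAL columns as doubles instead of encoded bytes
            "decimal.handling.mode": "double",
            # Where the connector stores the schema history it needs
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": "schema-changes.payments",
        },
    }


def register_connector(connect_url="http://localhost:8083"):
    """Send the payload to Kafka Connect (requires a running Connect worker)."""
    body = json.dumps(build_connector_payload()).encode("utf-8")
    request = urllib.request.Request(
        f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `register_connector()` only makes sense once MySQL, Kafka, and Kafka Connect are running; `build_connector_payload()` can be inspected on its own.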

Debezium standardized format

{
  "before": null,
  "after": {
    "id": 123,
    "amount": 400,
    "status": "FAILED"
  },
  "source": { ... },
  "op": "c"  // operation type: c = create, u = update, d = delete
}
Debezium message example – Notice op type “c”.
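To make the envelope concrete, the following Python sketch shows one way a consumer might pull the operation and the changed row out of such a message. The function name `parse_change_event` is made up for illustration. The sketch assumes the message value arrives as JSON text and, when the JSON converter is configured with schemas enabled, that the envelope is nested under a `payload` key.

```python
import json

# Debezium operation codes ("r" marks rows read during the initial snapshot)
OPERATIONS = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}


def parse_change_event(raw):
    """Return (operation, row) from a Debezium change event given as JSON text."""
    event = json.loads(raw)
    # With schemas enabled, the envelope sits under "payload"; otherwise it is the top level.
    envelope = event.get("payload", event)
    op = envelope["op"]
    # Deletes carry the old row in "before"; other operations carry the new row in "after".
    row = envelope["before"] if op == "d" else envelope["after"]
    return OPERATIONS.get(op, op), row


sample = '{"before": null, "after": {"id": 123, "amount": 400, "status": "FAILED"}, "op": "c"}'
operation, row = parse_change_event(sample)
print(operation, row["id"], row["amount"])  # prints: create 123 400
```

The sketch does not handle the tombstone messages (null values) that Debezium emits after deletes by default; a real consumer would skip those before parsing.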

GitHub Link

Check out the project on my GitHub: Real-Time Fraud Detection

Execution

The GIF shows what happens when a new record is inserted into MySQL: the change is captured by Debezium and streamed to a KafkaConsumer. From there, the model makes a prediction, and based on the result, an alert is triggered.
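The flow just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: `score_transaction` is a hard-coded rule standing in for the trained model, the alert is a plain callable rather than a request to the Flask alert app, and `run` assumes the third-party kafka-python client.

```python
import json


def score_transaction(row):
    """Hypothetical stand-in for the ML model: a fixed rule, not a trained model."""
    return 0.9 if row["status"] == "FAILED" and row["amount"] >= 300 else 0.1


def handle_event(envelope, predict=score_transaction, alert=print, threshold=0.5):
    """Score a newly inserted row and call `alert` when the score crosses the threshold."""
    if envelope.get("op") != "c":  # this sketch only reacts to inserts
        return None
    row = envelope["after"]
    score = predict(row)
    if score >= threshold:
        alert({"transaction_id": row["id"], "score": score})
    return score


def run(topic, bootstrap_servers="localhost:9092"):
    """Consume change events and handle each one (requires a running broker)."""
    from kafka import KafkaConsumer  # third-party: pip install kafka-python

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        value_deserializer=lambda v: json.loads(v) if v else None,
        auto_offset_reset="earliest",
    )
    for message in consumer:
        if message.value is not None:  # skip tombstones (null values)
            # Unwrap "payload" if the JSON converter includes schemas.
            handle_event(message.value.get("payload", message.value))
```

Passing `predict` and `alert` as parameters keeps the decision logic testable without Kafka; in the prototype described here, those two roles are played by the Flask ML service and the Flask alert app.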

The topics that Debezium uses to push CDC events are determined by the configuration supplied when the connector is registered.

In this setup, the topic whose name is prefixed with dbz_ is where Debezium streams the CDC events.
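For orientation, the Debezium MySQL connector by default names each table's topic from the configured prefix (`topic.prefix` in Debezium 2.x), the database name, and the table name, unless a routing transform renames it. The short sketch below reproduces that convention. The helper name and the example values are hypothetical.

```python
def debezium_topic(topic_prefix, database, table):
    """Default Debezium MySQL topic name: <prefix>.<database>.<table>."""
    return f"{topic_prefix}.{database}.{table}"


print(debezium_topic("dbz_fraud", "payments", "transactions"))
# prints: dbz_fraud.payments.transactions
```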

Summary: Extending Debezium CDC Beyond Fraud Detection

Setting up Debezium can feel finicky at first, with its distributed mode, Kafka Connect connectors, and the need to configure multiple services. But once the setup is in place, it eliminates the need to poll databases or query transactional systems directly for real-time data.

Debezium listens to the change data capture (CDC) events from databases and publishes them to Kafka topics. In this case, I have used this for real-time fraud detection, where each transaction is pushed to a machine learning model for inference, and an alert system reacts accordingly.

This architecture is not limited to fraud detection.

Extended Use Cases

  • Real-Time Dashboards: Stream transaction updates into a real-time analytics store such as Apache Druid and serve dashboards from it.
  • Analytics Pipelines: Push the same CDC messages to a data warehouse (e.g., via Kafka Connect + BigQuery/Snowflake sink connectors).
  • Audit Logging: Automatically store all changes, especially for key tables and entities, in secure storage for compliance.

References

    1. https://archive.epic.org/privacy/data-breach/equifax/
    2. CRM – Customer Relationship Management: software that helps businesses manage relationships with customers. Popular CRM tools include Salesforce and HubSpot.
    3. ERP – Enterprise Resource Planning: software that helps businesses manage core internal operations such as payroll and HR. A popular example is Oracle ERP.