
Building Dataspot: Lessons from Real-World Fraud Detection


The obsession with understanding fraud

I believe fraud detection is one of the most interesting problems there is. During my time leading risk operations, what fascinated me most was understanding how fraudsters operated and what exactly they did to commit fraud. The cases were endlessly creative:

  • Social engineering. Contacting the CEO directly to request product activations or configuration changes, just to push more transactions through.
  • Tax evasion schemes. Using third-party cards to accumulate discounts when paying taxes, exploiting promotional loopholes.
  • Internal corruption. Attempting to bribe people inside the company to activate fraudulent merchants.
  • And many more.

But for a long time, one question kept turning in my head: fraud detection has to have a starting point. Some pattern, something in common. I sat with it for months — while sleeping, while eating, while doing other things. There had to be something.

The key insight

The idea that changed everything was simple: I need to see concentrations.

Not all concentrations or anomalies are fraud, but every fraud leaves some concentration or anomaly behind. I needed something that could find them — and, above all, something that wasn't complicated to implement.

The algorithm: concentrations as threads grouping together

I thought of the algorithm as threads grouping together, an image that stuck with me from a documentary about quantum physics and string theory. The intuition was this: if data concentrates by its nature, I need a way to represent it that makes the concentration visible.

It wasn't immediate. After several days, the missing piece arrived: using JSON paths. If two records share structure, they share a path; if they're almost identical, their paths look very similar. That similarity is, literally, the concentration. I started testing it by hand until I realized I had something.

If you want to see how it works, I left an algorithm visualization and the full documentation.
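To make the path idea concrete, here is a minimal sketch in plain Python. It is not Dataspot's actual implementation or API: the function names, the `field=value` path format, the `min_share` threshold, and the sample transactions are all assumptions made for illustration. Each record is turned into a chain of path prefixes, and a prefix shared by a large fraction of records is reported as a concentration.

```python
from collections import Counter


def path_counts(records, fields):
    """Count how many records share each 'field=value' path prefix."""
    counts = Counter()
    for record in records:
        path = []
        for field in fields:
            path.append(f"{field}={record.get(field)}")
            # Every prefix of the record's path gets one more hit.
            counts[" > ".join(path)] += 1
    return counts


def concentrations(records, fields, min_share=0.5):
    """Return (path, count, share) for paths covering at least min_share of records."""
    total = len(records)
    if total == 0:
        return []
    counts = path_counts(records, fields)
    hits = [
        (path, count, count / total)
        for path, count in counts.items()
        if count / total >= min_share
    ]
    # Largest concentrations first; ties broken alphabetically by path.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical sample data, invented for this example.
transactions = [
    {"country": "US", "device": "mobile", "card": "A"},
    {"country": "US", "device": "mobile", "card": "A"},
    {"country": "US", "device": "mobile", "card": "B"},
    {"country": "FR", "device": "desktop", "card": "C"},
]

result = concentrations(transactions, ["country", "device", "card"])
for path, count, share in result:
    print(f"{path}: {count} records ({share:.0%})")
```

In this toy data, three of the four records share the path through `country=US` and `device=mobile`, and two of them also share `card=A`. Those shared prefixes are the kind of concentration described above: not proof of fraud, but a place to start looking.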

Silent development, real use

I developed it quietly for a good while. I even built an interface to draw the nodes and see the data clearly. It became a tool I used for two concrete things:

  • Evaluating and detecting fraud.
  • Helping the team understand complex cases.

It's not a magic solution that solves every problem. It's a tool to find problems, clearly visualize concentrations, and support detection — one of those tools worth having in your toolbox for when you need it.

The technical challenges

Building something like this comes with real challenges. When you represent data as a tree, you run into recursion, algorithmic complexity (Big-O), nested loops, and the problem of visualizing the result. A big part of the work was designing it to minimize the cost of all that, so it could process large datasets without taking forever.
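One common way to keep those costs in check is to build the tree in a single pass over the data and to traverse it with an explicit stack instead of recursive calls. The sketch below shows that general technique only. It is an assumption about how such a tree could be handled, not a description of Dataspot's internals, and `build_tree`, `walk`, and the sample records are hypothetical names and data.

```python
def build_tree(records, fields):
    """Build a count tree in one pass: O(len(records) * len(fields))."""
    root = {"count": 0, "children": {}}
    for record in records:
        node = root
        node["count"] += 1
        for field in fields:
            key = f"{field}={record.get(field)}"
            # Reuse the child node if it exists, otherwise create it.
            node = node["children"].setdefault(key, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def walk(root):
    """Yield (depth, key, count) in pre-order using an explicit stack, no recursion."""
    stack = [
        (0, key, child)
        for key, child in sorted(root["children"].items(), reverse=True)
    ]
    while stack:
        depth, key, node = stack.pop()
        yield depth, key, node["count"]
        # Push children in reverse order so they pop in sorted order.
        for child_key, child in sorted(node["children"].items(), reverse=True):
            stack.append((depth + 1, child_key, child))


# Hypothetical sample data, invented for this example.
records = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "FR", "device": "desktop"},
]

tree = build_tree(records, ["country", "device"])
nodes = list(walk(tree))
for depth, key, count in nodes:
    print("  " * depth + f"{key} ({count})")
```

Each record touches one node per field, so building the tree costs time proportional to the number of records times the number of fields, and the stack-based traversal avoids hitting Python's recursion limit on deep trees.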

Why I released it as open source

Development was iterative: removing what wasn't needed, spotting improvements, fixing bugs along the way. At first I didn't even think of it as a formal project — no unit tests, no load tests, no API, no project structure.

Over time, I felt this was a tool that could help other people the way it had helped me. So I decided to make it official and release it. I had always wanted to contribute an open-source project to the community, and this felt like the right one: a useful building block for anyone working on pattern and anomaly detection.

Let's build it together

I continue to maintain it, a little at a time, and the goal is for it to grow with the community. If you work in fraud detection or data analysis, your experience adds value: ideas, improvements, code fixes, use cases. Every contribution is welcome.

The best fraud detection isn't built in isolation. It's built by sharing tools and ideas among the people facing the same problems. Dataspot is our contribution to that conversation.

↗ Project Dataspot on GitHub

Dataspot is open source

The concentration engine, available as a Python library for the whole community. Use it, star it, or contribute ideas and improvements — let's build it together.
