Designing Splitwise - Data Modelling of an Expense Sharing Application
A few years back, when travelling to college with a few friends in a shared Uber/Ola/auto every single day, tracking expenses was becoming quite a hassle. Every day we had to keep track of which friends were present, what the total bill was, what each person's share was, and who owed how much to whom.
That is when I found a shiny new utility app in the market called Splitwise. Since then, I have never used any other application or method to split bills with my friends, and I guess everyone around me has used it or still does. This made me think: how is Splitwise storing the data, building an application for millions of users and calculating the balance sheet, customised for every user, in real time? It is almost as heavy as maintaining a medium-sized bank's balance sheet for every account, with millions of transactions.
Understanding Basic Working
The basics of Splitwise, or any other expense-sharing application, is to take an input from a user -- who paid, who was involved, and how much each person's share was. Simple.
For example, if I booked the cab and paid Rs. 150 while travelling with 2 friends, I would give the input - user_1 150 user_2 user_3 EQUAL - saying that I (user_1) am paying 150 and sharing it with my friends (user_2, user_3) equally. That means user_2 owes me 50 and user_3 owes me 50.
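To make that concrete, here is a minimal sketch of how such an input could be turned into per-person shares - plain JavaScript, with the function and field names being my own hypothetical choices, not Splitwise's:

function splitEqually(paidBy, amount, participants) {
  // Each participant, payer included, owes an equal share of the bill.
  const share = amount / participants.length;
  return participants
    .filter((user) => user !== paidBy) // the payer does not owe themselves
    .map((user) => ({ from: user, to: paidBy, amount: share }));
}

// splitEqually("user_1", 150, ["user_1", "user_2", "user_3"])
// => [ { from: "user_2", to: "user_1", amount: 50 },
//      { from: "user_3", to: "user_1", amount: 50 } ]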
Understanding Requirements
The basic views (or queries) the application would need to serve are a user's total pending/outstanding amount and the per-friend breakdown of who owes how much to whom.
Data Entities and Models
As we can see, we basically have two major models in our system - the User and the Transaction.
The user model would be a simple one holding the basic details of the user - name, id, contact info, etc. The focus would be on storing and retrieving the transactions. Let's start with a simple model -
{
  "id": "transaction_1",
  "paid_by": {
    "user": "user_1",
    "amount": 150,
    "distribution": "EQUAL"
  },
  "users": ["user_1", "user_2", "user_3"]
}
Now, this is a great way to store the transactions, but how good is it for satisfying our queries in real time?
Show me my total pending / outstanding
To get any user's outstanding amount, we need to perform a lookup fetching every transaction in which the user was a participant. Once we have all the transactions, we need to group them and perform the following mathematics for each one: if the user was the payer, add every other participant's share to the amount owed to the user; otherwise, subtract the user's own share from their balance.
Now, imagine we need to perform this match clause, unwind the users array, calculate the shares and evaluate an if/else for every transaction this user has ever been part of. That is a lot of CPU and memory load when done for millions of users in real time..!
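As a rough illustration - assuming a MongoDB-style document store (which the match/unwind phrasing hints at), a hypothetical transactions collection holding the model above, and EQUAL distributions only - the query would look something like this:

db.transactions.aggregate([
  // Every transaction in which user_1 participated.
  { $match: { users: "user_1" } },
  // Per-head share, assuming an EQUAL distribution.
  { $addFields: { share: { $divide: ["$paid_by.amount", { $size: "$users" }] } } },
  // If user_1 paid, everyone else owes them their share;
  // otherwise user_1 owes one share to whoever paid.
  { $project: {
      delta: {
        $cond: [
          { $eq: ["$paid_by.user", "user_1"] },
          { $subtract: ["$paid_by.amount", "$share"] },
          { $multiply: [-1, "$share"] }
        ]
      }
  } },
  { $group: { _id: null, outstanding: { $sum: "$delta" } } }
])

Every stage here touches every transaction the user ever participated in, which is exactly the cost we want to avoid.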
A Better Model?
The above model was very efficient at capturing individual transactions but makes it very tough to compute the views/results when we query. In order to simplify this, let's revisit our modelling technique and use the Division Pattern -
{
  "source": "user_1",
  "destination": "user_2",
  "timestamp": 1234,
  "amount": 50
},
{
  "source": "user_2",
  "destination": "user_1",
  "timestamp": 1234,
  "amount": -50
},
{
  "source": "user_1",
  "destination": "user_3",
  "timestamp": 1234,
  "amount": 50
}
We divide the information into separate per-pair records, each saying - from the source user's perspective, the destination user owes them the given (signed) amount for the transaction at timestamp T; the mirrored negative record exists so the other user can run the same query from their side. Now this is an awesome way to get the results of our queries --> filter all documents with source as user_1, group them with a sum over amount, and you get user_1's total outstanding. Obviously, you can represent this information in a much more optimised manner, but you do get the point of why we do this.
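Under the same MongoDB-style assumptions (with a hypothetical balances collection holding these per-pair records), the outstanding query collapses to a trivial pipeline:

db.balances.aggregate([
  // Every per-pair record written from user_1's perspective.
  { $match: { source: "user_1" } },
  // Positive amounts are owed to user_1, negative amounts are owed by user_1.
  { $group: { _id: "$source", outstanding: { $sum: "$amount" } } }
])

Grouping on "$destination" instead would give the per-friend breakdown of who owes how much.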
Now the question arises --> do we need 1 record per source per destination per transaction? Imagine 1 million transactions involving an average of 3 users each, and we get 1M * 6 = 6 million records (u1-u2, u1-u3, u2-u1, u2-u3, u3-u1, u3-u2).
Can we somehow minimise these records? Let's check out the Bucketing pattern..!
{
  "source": "user_1",
  "destination": "user_2",
  "transactions": [
    {
      "timestamp": "t1",
      "amount": 50
    },
    {
      "timestamp": "t2",
      "amount": -50
    },
    ...
  ],
  "totalAmount": 100
}
Now we can group all transactions between user_1 and user_2 into a single record, appending transactions as and when they arrive and maintaining totalAmount in the same bucket. This makes our solution very optimised for reads, but it has another problem -- the transactions array of a record is unbounded -- it can grow to millions of entries.
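Appending to a bucket while keeping totalAmount in sync can be a single atomic upsert - again a sketch under the same MongoDB-style assumptions, with a hypothetical buckets collection:

db.buckets.updateOne(
  { source: "user_1", destination: "user_2" },
  {
    // Append the new entry and keep the running total consistent in one write.
    $push: { transactions: { timestamp: "t3", amount: 75 } },
    $inc: { totalAmount: 75 }
  },
  { upsert: true } // creates the bucket on the pair's first transaction
)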
A Little More Optimised
{
  "source": "user_1",
  "destination": "user_2",
  "startTime": "01 Jan 2020",
  "endTime": "31 Jan 2020",
  "transactions": [
    {
      "timestamp": "t1",
      "amount": 50
    },
    {
      "timestamp": "t2",
      "amount": -50
    },
    ...
  ],
  "totalAmount": 100
}
Keeping everything else the same, we can now split these buckets by a time range (as shown above) or by a maximum number of transactions per bucket. This helps us maintain smaller records, as well as archive old data as and when required. You can also start a new bucket when two users "settle up" all previous amounts.
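A common way to enforce the size cap (a sketch under the same assumptions, with a limit of 1000 entries picked arbitrarily) is to make the bucket's fullness part of the update filter, so a fresh bucket is created automatically once the current one fills up:

db.buckets.updateOne(
  {
    source: "user_1",
    destination: "user_2",
    // Match only a bucket that still has room (fewer than 1000 entries);
    // this assumes at most one open bucket per pair at a time.
    "transactions.999": { $exists: false }
  },
  {
    $push: { transactions: { timestamp: "t4", amount: 30 } },
    $inc: { totalAmount: 30 },
    $setOnInsert: { startTime: "01 Feb 2020" } // stamped only when a new bucket is created
  },
  { upsert: true }
)

When no open bucket matches, the upsert starts a new one; the old, full bucket stays behind as an immutable, archivable record.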
Conclusion
Understanding, designing and building a data model is very important for any given use case, and you should plan ahead for how your system will behave in different scenarios depending on your scale, growth plans, effort budget, migration plans, etc. Planning is much more important than "optimising" everything on day 1.
For more hands-on practice around System Design, Machine Coding and Data Modelling, you can register on Eazy Develop and experience it for yourself...!