Designing Splitwise - Data Modelling of an Expense Sharing Application
A few years back, when travelling to college with a few friends in a shared Uber/Ola/auto every single day, tracking expenses was becoming quite a hassle. Every day we had to keep track of which friends were present, what the total bill was, what each person's share was, and who owed how much to whom.
That is when I found a shiny new utility app in the market called Splitwise. Since then, I have never used any other application or method to split bills with my friends, and I guess everyone around me has used it or still does. This made me think: how is Splitwise storing the data, building an application for millions of users and calculating the balance sheet, customised for every user, in real time? It is almost as heavy as maintaining a medium-sized bank's balance sheet for every account, with millions of transactions.
Understanding Basic Working
The basics of Splitwise, or any other expense-sharing application, is to take an input from a user -- who paid, who was involved, and how much each person's share was. Simple.
For example, if I booked the cab and paid Rs. 150 while travelling with 2 friends, I would give the input - user_1 150 user_2 user_3 EQUAL - saying that I (user_1) am paying 150 and sharing it with my friends (user_2, user_3) equally. That means user_2 owes me 50 and user_3 owes me 50.
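To make that concrete, here is a minimal sketch of how such an input could be turned into per-person shares - plain JavaScript, with the function and field names being my own hypothetical choices, not Splitwise's:

function splitEqually(paidBy, amount, participants) {
  // Each participant, payer included, owes an equal share of the bill.
  const share = amount / participants.length;
  return participants
    .filter((user) => user !== paidBy) // the payer does not owe themselves
    .map((user) => ({ from: user, to: paidBy, amount: share }));
}

// splitEqually("user_1", 150, ["user_1", "user_2", "user_3"])
// => [ { from: "user_2", to: "user_1", amount: 50 },
//      { from: "user_3", to: "user_1", amount: 50 } ]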
Understanding Requirements
The basic views (or queries) the application would need to serve are a user's total pending/outstanding amount and the per-friend breakdown of who owes how much to whom.
Data Entities and Models
As we can see, we basically have two major models in our system - the User and the Transaction.
The user model would be a simple one holding the basic details of the user - name, id, contact info, etc. The focus would be on storing and retrieving the transactions. Let's start with a simple model -
{
  "id": "transaction_1",
  "paid_by": {
    "user": "user_1",
    "amount": 150,
    "distribution": "EQUAL"
  },
  "users": ["user_1", "user_2", "user_3"]
}
Now, this is a great way to store the transactions, but how good is it for satisfying our queries in real time?
Show me my total pending / outstanding
To get any user's outstanding amount, we need to perform a lookup fetching every transaction in which the user was a participant. Once we have all the transactions, we need to group them and perform the following mathematics for each one: if the user was the payer, add every other participant's share to the amount owed to the user; otherwise, subtract the user's own share from their balance.
Now, imagine we need to perform this match clause, unwind the users array, calculate the shares and evaluate an if/else for every transaction this user has ever been part of. That is a lot of CPU and memory load when done for millions of users in real time..!
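As a rough illustration - assuming a MongoDB-style document store (which the match/unwind phrasing hints at), a hypothetical transactions collection holding the model above, and EQUAL distributions only - the query would look something like this:

db.transactions.aggregate([
  // Every transaction in which user_1 participated.
  { $match: { users: "user_1" } },
  // Per-head share, assuming an EQUAL distribution.
  { $addFields: { share: { $divide: ["$paid_by.amount", { $size: "$users" }] } } },
  // If user_1 paid, everyone else owes them their share;
  // otherwise user_1 owes one share to whoever paid.
  { $project: {
      delta: {
        $cond: [
          { $eq: ["$paid_by.user", "user_1"] },
          { $subtract: ["$paid_by.amount", "$share"] },
          { $multiply: [-1, "$share"] }
        ]
      }
  } },
  { $group: { _id: null, outstanding: { $sum: "$delta" } } }
])

Every stage here touches every transaction the user ever participated in, which is exactly the cost we want to avoid.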
A Better Model?
The above model was very efficient at capturing individual transactions but makes it very tough to compute the views/results when we query. In order to simplify this, let's revisit our modelling technique and use the Division Pattern -
{
  "source": "user_1",
  "destination": "user_2",
  "timestamp": 1234,
  "amount": 50
},
{
  "source": "user_2",
  "destination": "user_1",
  "timestamp": 1234,
  "amount": -50
},
{
  "source": "user_1",
  "destination": "user_3",
  "timestamp": 1234,
  "amount": 50
}
We divide the information into separate per-pair records, each saying - from the source user's perspective, the destination user owes them the given (signed) amount for the transaction at timestamp T; the mirrored negative record exists so the other user can run the same query from their side. Now this is an awesome way to get the results of our queries --> filter all documents with source as user_1, group them with a sum over amount, and you get user_1's total outstanding. Obviously, you can represent this information in a much more optimised manner, but you do get the point of why we do this.
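Under the same MongoDB-style assumptions (with a hypothetical balances collection holding these per-pair records), the outstanding query collapses to a trivial pipeline:

db.balances.aggregate([
  // Every per-pair record written from user_1's perspective.
  { $match: { source: "user_1" } },
  // Positive amounts are owed to user_1, negative amounts are owed by user_1.
  { $group: { _id: "$source", outstanding: { $sum: "$amount" } } }
])

Grouping on "$destination" instead would give the per-friend breakdown of who owes how much.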
Now the question arises --> do we need 1 record per source per destination per transaction? Imagine 1 million transactions involving an average of 3 users each, and we get 1M * 6 = 6 million records (u1-u2, u1-u3, u2-u1, u2-u3, u3-u1, u3-u2).
Can we somehow minimise these records? Let's check out the Bucketing pattern..!
{
  "source": "user_1",
  "destination": "user_2",
  "transactions": [
    {
      "timestamp": "t1",
      "amount": 50
    },
    {
      "timestamp": "t2",
      "amount": -50
    },
    ...
  ],
  "totalAmount": 100
}
Now we can group all transactions between user_1 and user_2 into a single record, appending transactions as and when they arrive and maintaining totalAmount in the same bucket. This makes our solution very optimised for reads, but it has another problem -- the transactions array of a record is unbounded -- it can grow to millions of entries.
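Appending to a bucket while keeping totalAmount in sync can be a single atomic upsert - again a sketch under the same MongoDB-style assumptions, with a hypothetical buckets collection:

db.buckets.updateOne(
  { source: "user_1", destination: "user_2" },
  {
    // Append the new entry and keep the running total consistent in one write.
    $push: { transactions: { timestamp: "t3", amount: 75 } },
    $inc: { totalAmount: 75 }
  },
  { upsert: true } // creates the bucket on the pair's first transaction
)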
A Little More Optimised
{
  "source": "user_1",
  "destination": "user_2",
  "startTime": "01 Jan 2020",
  "endTime": "31 Jan 2020",
  "transactions": [
    {
      "timestamp": "t1",
      "amount": 50
    },
    {
      "timestamp": "t2",
      "amount": -50
    },
    ...
  ],
  "totalAmount": 100
}
Keeping everything else the same, we can now split these buckets by a time range (as shown above) or by a maximum number of transactions per bucket. This helps us maintain smaller records, as well as archive old data as and when required. You can also start a new bucket when two users "settle up" all previous amounts.
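A common way to enforce the size cap (a sketch under the same assumptions, with a limit of 1000 entries picked arbitrarily) is to make the bucket's fullness part of the update filter, so a fresh bucket is created automatically once the current one fills up:

db.buckets.updateOne(
  {
    source: "user_1",
    destination: "user_2",
    // Match only a bucket that still has room (fewer than 1000 entries);
    // this assumes at most one open bucket per pair at a time.
    "transactions.999": { $exists: false }
  },
  {
    $push: { transactions: { timestamp: "t4", amount: 30 } },
    $inc: { totalAmount: 30 },
    $setOnInsert: { startTime: "01 Feb 2020" } // stamped only when a new bucket is created
  },
  { upsert: true }
)

When no open bucket matches, the upsert starts a new one; the old, full bucket stays behind as an immutable, archivable record.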
Conclusion
Understanding, designing and building a data model is very important for any given use case, and you should plan ahead for how your system will behave in different scenarios depending on your scale, growth plans, effort budget, migration plans, etc. Planning is much more important than "optimising" everything on day 1.
For more hands-on practice around System Design, Machine Coding and Data Modelling, you can register on Eazy Develop and experience it for yourself...!