Evaluating a system's design
Engineers often design new systems based on their past experience, knowledge and gut feeling. Here are some questions I usually ask myself to make sure I have addressed the common concerns.
High-level design questions:
- What is the bottleneck of the system? Is it a hard or soft bottleneck?
- What components of the system are not horizontally scalable?
- Is there a simpler design that meets our needs? Does the proposed system feel overly complex?
- What kind of hardware is right for such a system?
- How much hardware would be required as a function of transaction rates, data sizes, expected growth rate and any other parameters along which the system needs to scale?
- Which system SLAs could be improved if cost weren't a factor?
- What other approaches are reasonable for our requirements? What is the trade-off in recommending this design over the others?
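To make the hardware-sizing question above concrete, I find it useful to write the estimate down as a small function. The following is only a back-of-the-envelope sketch: the function name, its parameters, and the 30% headroom default are all assumptions for illustration, and a real system would likely have more dimensions (memory, network, IOPS) than the two modeled here.

```python
import math


def machines_needed(peak_tps, tps_per_machine, data_gb, gb_per_machine,
                    annual_growth=0.0, years=0, headroom=0.3):
    """Rough machine count: the larger of the throughput and storage needs.

    All parameters are hypothetical inputs you would measure or estimate:
      peak_tps        -- peak transactions per second today
      tps_per_machine -- sustained transactions per second one machine handles
      data_gb         -- data size today, in GB
      gb_per_machine  -- usable storage per machine, in GB
      annual_growth   -- expected yearly growth as a fraction (0.5 means 50%)
      years           -- planning horizon in years
      headroom        -- fraction of each machine deliberately left unused
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    if tps_per_machine <= 0 or gb_per_machine <= 0:
        raise ValueError("per-machine capacities must be positive")

    # Assume load and data grow at the same compound rate.
    growth = (1 + annual_growth) ** years
    usable = 1 - headroom

    for_throughput = math.ceil(peak_tps * growth / (tps_per_machine * usable))
    for_storage = math.ceil(data_gb * growth / (gb_per_machine * usable))

    # Whichever dimension needs more machines is the sizing bottleneck.
    return max(for_throughput, for_storage, 1)


if __name__ == "__main__":
    # Compare the baseline with much larger load multipliers.
    for factor in (1, 10, 100):
        count = machines_needed(peak_tps=2_000 * factor, tps_per_machine=500,
                                data_gb=5_000 * factor, gb_per_machine=2_000)
        print(f"{factor}x load: about {count} machines")
```

Even a crude model like this shows which dimension (throughput or storage) drives the machine count, which is a good hint about where the bottleneck is.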
System expansion questions:
- What is the complexity and development cost of adding new algorithms, new data sources, new workflows, new features, or other minor modifications to the system?
- What-if scenarios: for example, what if the system became a wild success (say 10x or 100x the most optimistic estimates)?
- What other use cases could be onboarded with minor design changes?
- How could the system evolve to reduce costs if cost becomes the driving factor?
- What would the various phases of system development look like and what use cases would be delivered in each phase?
Routine operations questions:
- What is the process and cost of doing deployments/code rollouts?
- How would the system/operations team monitor system health?
- What kind of tools would be required for effective operations?
- What is the cost of recovering the system from hardware and software failures (e.g. data center outages, key dependencies going down)?
- Does onboarding customers require ops involvement? Does onboarding "special" customers require ops involvement?
- How do operational costs change with the number of users, number of machines, etc.?
- How far in advance do we need to order additional capacity? Is the hardware commodity or customized?
- Can additional capacity be added one machine at a time? Can extra capacity be removed equally easily?
- What is the process for installing new hardware capacity? Does it require notifying/involving other teams? Does it have any visible customer impact? Does adding capacity require any manual involvement?
- How does the system behave when overloaded (say, by an unexpected traffic spike)?
- How does the system guard against accidental abuse by users (unexpected load, bad data, bad requests, etc.)?
- Can the system be moved to the cloud piecemeal if needed?
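For the overload and accidental-abuse questions above, one common answer is some form of admission control in front of the expensive parts of the system. The sketch below shows a token-bucket rate limiter as one possible approach; it is not the only one, and the class name, parameters, and injectable clock are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter (illustrative sketch, not thread-safe).

    rate     -- tokens added per second (sustained request rate allowed)
    capacity -- maximum tokens stored (size of the burst allowed)
    clock    -- function returning the current time in seconds; injectable
                so the limiter can be tested without sleeping
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst passes
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now

        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Hypothetical per-user limit: 5 requests/second with bursts of up to 10.
    limiter = TokenBucket(rate=5, capacity=10)
    accepted = sum(limiter.allow() for _ in range(100))
    print(f"accepted {accepted} of 100 back-to-back requests")
```

Rejecting excess requests early and cheaply keeps a traffic spike or a misbehaving client from degrading the system for everyone else. Keeping one bucket per user or per API key is a typical way to contain accidental abuse.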
Thanks
Umesh