
How to scale an EdTech Platform indefinitely?

White Paper

This white paper is intended to provide information to IT decision-makers regarding scaling software products from thousands to one million simultaneous users.
The paper describes the challenges of software scalability. Software scalability is the ability of a tool or a system to increase its capacity and functionality in response to its users' demand. Scalable software remains stable while adapting to changes, upgrades, overhauls, and resource reduction. This case study shows how the scalability issue was solved by the ENBISYS software engineering team in a specific project, and what results were achieved.

The project

The project described is an EdTech IT product that evolved from a startup into a full-scale IT company. At the startup phase it used a single database, which was good enough to support several thousand users. The Client's goal was to transform the product into a high-load international enterprise system whose database supports hundreds of thousands of users, with subsequent scaling up to millions of users.

This process took several years of development, and we describe it below in chronological order.

Development challenges


Database schema integrity
Within several months of the project start, our development team discovered that too much logic lived in the database in the form of stored procedures. On its own this could have been harmless, but it led to a bigger issue: any change introduced to the database schema could break a stored procedure, leaving the application non-functional.

Developers had to spend significant amounts of time manually checking database schema integrity every time a change was introduced. This was an obstacle to writing good-quality code within a reasonable time, so we had to solve the problem as quickly as possible.
By that time, Microsoft had already issued several versions of the SSDT (SQL Server Data Tools) framework, which introduced improved support for database projects in Visual Studio. It wasn't production-ready yet, but it was good enough to solve our database schema problem. In less than a week, we cleaned up our database and made the whole database project consistent and compilable.

This database project kept development stable while the team expanded from 2 to 6 developers in total.

Database project and unit tests


After the database solution had been implemented, the number of database schema integrity errors dropped to zero. Still, too much logic remained in the database: it could not be refactored quickly and was difficult to maintain at proper quality due to the lack of a testing infrastructure for SQL databases.

After a quick analysis, we did not find a production-ready solution simple enough to make writing database unit tests as easy as writing regular unit tests. Hence, we assembled a custom framework from components we had used before to test databases.
The framework was built from several well-known pieces:
  1. SSDT as the seed database schema storage, plus a PowerShell script to build and deploy SSDT database projects.
  2. Dapper + SimpleCRUD as the ORM everything is wrapped into.
  3. A small custom base class for the MS UnitTest framework.

Even though the database unit tests were designed to be quite simple, the testing framework proved effective both for writing unit tests and for integration tests handling complex schemas spanning several databases.

Performance challenges


Introduction


For the first two years, database performance was smooth, and all arising issues were solved with optimizations. These optimizations required no significant architectural changes across the tiers.

But the initial single-database architecture was running into its limits, due to the growing number of users and application features:
  • All users are highly synchronized in time – they all start at about the same time.
  • Most users generate short, high-frequency bursts of data to be stored in the database.
  • A relatively small number of users require complicated summary statistics and analytics based on the data produced.

The table below shows insert statistics for 160 thousand users working within the high-load system in one day. The application and the database must be able to handle peak loads during the most intensive hours of usage with minimal response time.
Table 1. Production insertion operations speed
Figure 1 shows the typical load distribution during the hour with the most intensive load, which correlates with activity on the most heavily used table.
Figure 1. Daily throughput pattern

Vertical scalability and optimization


The first version of the database schema had not been designed for high load. It had major performance issues that were relatively simple to fix: missing indexes, time-consuming calculations over large amounts of data inside the database, and suboptimal queries. These could be fixed with minimal resource investment to achieve better and more reliable product performance. We also migrated most of the complexity from the database level into the application backend and even into front-end logic.

This way, the whole ecosystem was able to continue running on a single database while new features were introduced to win a bigger market share. As the number of users gradually grew, every performance issue could be solved either by optimizing bottleneck queries or by introducing a more powerful server with more CPU cores and memory. Everything went well until we started approaching the IOPS limit of the cloud disk subsystem, when one of our major tables was no longer able to keep up with concurrent insert operations. We had to find a solution before we hit that limit.
The issue lay in the internal implementation of MS SQL Server. For a table with a monotonically growing primary key, the database engine writes all newly inserted records to the same page. Each new record goes to that page until it has no room left, and it is impossible to add more than one record at a time to the same page.

There were three possible solutions to this issue:
1. Delayed durability.
2. Table partitioning.
3. A non-monotonically growing primary key.
Solution 1 solves the issue by simply delaying the flush of write results to disk and releasing transactions before the data is fully persisted in the transaction log. This eases contention for the single hot page because the application no longer has to wait. It is the simplest of the three, but it doesn't guarantee that all data will reach the disk in the case of a sudden server crash.

Another drawback is that this option was only available in SQL Server 2014, which was still in beta.
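For reference, delayed durability is enabled at the database level and can then be requested per transaction. The sketch below is illustrative only; the database and table names are hypothetical, not the project's actual schema:

```sql
-- Allow (but don't force) delayed durability for the database.
ALTER DATABASE EdTechDb SET DELAYED_DURABILITY = ALLOWED;

-- A hot insert path can then opt in per transaction: the commit
-- returns before the log records are flushed to disk, so a sudden
-- crash can lose the most recently committed transactions.
BEGIN TRANSACTION;
INSERT INTO dbo.UserActivity (UserId, OccurredAt, Payload)
VALUES (42, SYSUTCDATETIME(), N'answer submitted');
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
```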
Solution 2 requires changing the database schema and splitting the single table into several partitions. The number of pages available for parallel insertions equals the number of partitions. Furthermore, the performance of this solution can be increased by binding each partition to a separate drive. Another benefit is that maintenance procedures for this table, such as index rebuilds, can run faster by updating partitions simultaneously.
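A minimal sketch of this approach, assuming a hypothetical activity table (the names, the four-way split, and the modulo partitioning column are illustrative, not the production schema): a partition function and scheme give the table several independent insertion points instead of one hot page.

```sql
-- Four partitions keyed on Id % 4, so concurrent inserts with
-- monotonically growing Ids land in different partitions (and pages).
CREATE PARTITION FUNCTION pfActivity (int)
    AS RANGE LEFT FOR VALUES (0, 1, 2);          -- partitions: <=0, 1, 2, >2

CREATE PARTITION SCHEME psActivity
    AS PARTITION pfActivity ALL TO ([PRIMARY]);  -- or one filegroup per drive

CREATE TABLE dbo.UserActivity
(
    Id          bigint IDENTITY NOT NULL,
    PartitionNo AS CAST(Id % 4 AS int) PERSISTED NOT NULL,
    UserId      int            NOT NULL,
    OccurredAt  datetime2      NOT NULL,
    Payload     nvarchar(max)  NULL,
    -- The partitioning column must be part of the clustered key.
    CONSTRAINT PK_UserActivity PRIMARY KEY CLUSTERED (PartitionNo, Id)
) ON psActivity (PartitionNo);

-- Maintenance can also target one partition at a time, e.g.:
-- ALTER INDEX PK_UserActivity ON dbo.UserActivity REBUILD PARTITION = 2;
```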
Solution 3 doesn't rely on any newly introduced MS SQL features; in its simplest form it replaces the existing integer primary key column with a GUID-typed column. But it requires significant changes to the database schema, increases the size of each record, and potentially increases fragmentation of the table's indexes.
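In sketch form (again with hypothetical names), the change replaces the IDENTITY key with a random GUID so that inserts scatter across the index instead of contending for the last page:

```sql
-- Before: bigint IDENTITY key, all inserts hit the last page.
-- After: random GUID key spreads inserts across the clustered index,
-- at the cost of wider rows and more page splits / fragmentation.
CREATE TABLE dbo.UserActivity
(
    Id         uniqueidentifier NOT NULL
                   CONSTRAINT DF_UserActivity_Id DEFAULT NEWID(),
    UserId     int           NOT NULL,
    OccurredAt datetime2     NOT NULL,
    Payload    nvarchar(max) NULL,
    CONSTRAINT PK_UserActivity PRIMARY KEY CLUSTERED (Id)
);
-- Note: NEWSEQUENTIALID() would avoid the fragmentation, but its keys
-- grow monotonically and would reintroduce the hot-page bottleneck.
```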

We chose Solution 2, since it provided scalability without the risk of data loss. It also resulted in more predictable behavior compared to a non-monotonically growing primary key.

Horizontal scalability


As the number of users grew, we were able to solve all performance issues with optimization and vertical scalability. However, the single-server solution was no longer working well, and the business goal was to make another iteration and grow the user base faster. Our single-database architecture could handle about 160K unique users working with it daily; the next goal was to add 100K more users to the system within 6 months.

The single database could not handle that many new users, so a new, easily scalable database architecture was required.
There were three major issues the new architecture had to solve:
  1. Decrease the latency of every user request, since the time for handling each operation is limited.
  2. Spread the rapidly growing number of records across multiple smaller databases.
  3. Reduce cost: vertical scalability requires a powerful and expensive database server, while multiple smaller databases can run on less powerful servers, reducing the overall budget spent on cloud computation power.

Three basic ideas were proposed to the R&D group:
  1. Partially migrate to a NoSQL database.
  2. Multiply the existing database without schema changes.
  3. Split the current database schema into several databases.

Several considerations had to be taken into account:
  1. All existing functionality should be migrated to the new architecture, preferably without changes, meaning that the majority of business logic and queries should remain unchanged.
  2. Implementing the new architecture should neither stop regular development nor affect the feature delivery schedule.
The new architecture should not introduce excessive complexity for developers. It should preferably be transparent for the developers busy with regular business features. The team should not be concerned about the performance issues of regular requests or how to reach the data stored in different databases.
After these considerations had passed through the R&D phase for each of the three proposed solutions, a composite solution combining Solutions 1 and 3 was introduced.
The second approach was rejected mainly due to its complexity and much higher implementation cost. It would have required rewriting all administrative applications, and it had a problem with common, non-user-specific data, a copy of which would have to be kept in sync across all database copies. It also raised one more issue: to locate the database a given user should connect to, we would have had to introduce new user-location infrastructure.

The main idea of the solution was to split our database schema into three different units, so that each database had its own dedicated responsibility:
1. Main database: a single database instance that contains common data structures. It holds all top-level information about organization units, users, and memberships, and none of the actual user data created by daily user activity.

2. User data database: a database that can be installed on a virtually unlimited number of servers. Each database can carry the data of a single organizational unit or of a set of independent units sharing the same database. The schema remains almost identical to the original, except for the part of the organization structure that is stored in the Main database.

3. Statistic collector database: a completely new database designed from scratch with only one purpose in mind: collecting statistical information across all user databases.

4. NoSQL statistics storage: complicated summary statistics and analytics, aggregated from the User databases and stored in a MongoDB NoSQL database. During the business day, the NoSQL storage provides rapid reads of the calculated data.
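The split above implies a routing step: the Main database knows which user-data database hosts each organizational unit. A minimal sketch of that lookup, with plain Python dicts standing in for the Main database tables (all names and connection strings here are illustrative, not the real schema):

```python
# Conceptual sketch: the Main database maps each organizational unit
# to the user-data database that hosts its records.

MAIN_DB_UNIT_SHARDS = {          # org_unit_id -> user-data DB name
    "district-42": "userdata-01",
    "school-7":    "userdata-01",  # independent units may share one DB
    "school-9":    "userdata-02",
}

CONNECTION_STRINGS = {           # user-data DB name -> connection string
    "userdata-01": "Server=udb01;Database=UserData01",
    "userdata-02": "Server=udb02;Database=UserData02",
}

def resolve_user_database(org_unit_id: str) -> str:
    """Return the connection string of the user-data DB for a unit."""
    shard = MAIN_DB_UNIT_SHARDS[org_unit_id]
    return CONNECTION_STRINGS[shard]

print(resolve_user_database("school-7"))  # Server=udb01;Database=UserData01
```

Because the mapping lives only in the Main database, adding a new user-data server is a matter of registering it and assigning units to it; no application code changes.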
There are a few more things that should be mentioned here.

As one may notice from the description, the user database shares all common data with the main database. The sharing is implemented with a one-way push transactional replication mechanism, which fits our needs perfectly. One-way means that changes are replicated only from the main database to the user data databases. Push means that the publisher decides when the collected transactions are sent to the clients. Transactional is a type of replication that reads transaction records from the transaction log and sends (replicates) them to subscribers.

Transactional replication provides near real-time data synchronization, but sometimes that is not enough. In such cases, we use synchronization via server links and make changes to the cluster database within a single transaction, which guarantees the integrity of the changes. Tables replicated this way use triggers to detect changes in the main database and deliver them to all subscribers. This approach also lets us put stored procedures under replication; the stored procedure must be created on both the core (main) and the cluster (user data) databases.

Usually, the core implementation contains no logic: it is just a dummy object whose sole purpose is to be called and create a record in the transaction log, which is then replicated to all clusters.
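The one-way push model described above can be reduced to a toy simulation: the publisher (main database) collects committed changes in a transaction log and decides when to push them to every subscriber (user data database). This is only a conceptual model of SQL Server transactional replication, with invented class and table names:

```python
# Toy model of one-way push transactional replication.

class Publisher:
    def __init__(self):
        self.log = []          # stands in for the transaction log
        self.subscribers = []

    def commit(self, change):
        self.log.append(change)

    def push(self):
        """The publisher decides when collected transactions are sent."""
        for sub in self.subscribers:
            for change in self.log:
                sub.apply(change)
        self.log.clear()

class Subscriber:
    def __init__(self):
        self.data = {}

    def apply(self, change):
        table, key, value = change
        self.data[(table, key)] = value

main = Publisher()
u1, u2 = Subscriber(), Subscriber()
main.subscribers += [u1, u2]

main.commit(("Users", 1, "Alice"))  # a "dummy" core procedure does just this:
main.push()                         # write a log record, then replicate it
print(u1.data == u2.data)           # True: all clusters see the same change
```

The "dummy" stored procedure mentioned above plays exactly the role of `commit` here: it exists only to produce a log record that the replication mechanism carries to every cluster.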
The main idea of the MongoDB NoSQL data storage in our system is to store ready-to-show aggregated statistics for each page.
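"Ready-to-show" means the document is precomputed, so rendering a page is a single key lookup rather than an aggregation query at request time. A possible document shape, with field names that are our assumptions rather than the product's actual schema:

```python
# Illustrative shape of a precomputed per-page statistics document.

page_stats = {
    "_id": "class-5b/progress",        # one document per page
    "updated_at": "2023-09-01T08:00:00Z",
    "totals": {"students": 28, "completed_lessons": 314},
    "per_student": [
        {"user_id": 101, "score": 87},
        {"user_id": 102, "score": 92},
    ],
}

def render_totals(doc):
    """Reading a page needs no joins or aggregation at request time."""
    return (f"{doc['totals']['students']} students, "
            f"{doc['totals']['completed_lessons']} lessons completed")

print(render_totals(page_stats))  # 28 students, 314 lessons completed
```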
MongoDB became the most heavily loaded data source in our application. We achieved good performance by spreading data and requests over multiple shards. Each shard is a replica set with one Primary and two Secondary instances: write requests go to the Primary, and read requests go to the less loaded Secondary instances.

Another challenge was to bring the new three-database concept to the application level. This turned out to be fairly straightforward with the dependency injection framework Autofac, which was already used in the application, together with its multitenant extension.
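The actual implementation uses Autofac's multitenant extension in C#/.NET; as a language-neutral sketch of the same pattern, the container below resolves a tenant-specific repository, falling back to shared defaults when a tenant has no override (all names and connection strings are illustrative):

```python
# Sketch of multitenant dependency resolution: per-tenant registrations
# layered over shared defaults, as in Autofac's MultitenantContainer.

class Repository:
    def __init__(self, connection_string):
        self.connection_string = connection_string

class TenantContainer:
    def __init__(self, default_conn):
        self.default_conn = default_conn
        self.overrides = {}      # tenant_id -> connection string

    def configure_tenant(self, tenant_id, conn):
        self.overrides[tenant_id] = conn

    def resolve(self, tenant_id):
        """Hand each request a repository bound to its tenant's database."""
        conn = self.overrides.get(tenant_id, self.default_conn)
        return Repository(conn)

container = TenantContainer("Server=main;Database=Main")
container.configure_tenant("school-9", "Server=udb02;Database=UserData02")

print(container.resolve("school-9").connection_string)
```

Application code asks the container for a repository and never deals with connection strings directly, which is what made the three-database split nearly transparent at the application level.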

Conclusion


The implemented database structure was not yet infinitely scalable, and the architecture still had to be improved. At that point, some user data remained in the main database, so the applications still needed it whenever users fetched or changed information about their organization. Most of these issues were solved by simple caching. The solution has proved its reliability and durability, properly supporting millions of users.