For decades, organizations have used traditional relational databases and tried to fit everything into them, whether key-value user session data, unstructured log data, or analytics data for a data warehouse. However, a relational database is designed for transactional data, and it doesn’t work very well for other data types.
For each specific data need, you should choose the tool that can do the heavy lifting and scale without compromising performance. Solution architects need to weigh multiple factors to match data storage to the right technology. Here are the important ones:
- Durability requirement: How should data be stored to prevent data loss or corruption?
- Data availability: How available must the storage system be to deliver data when it is requested?
- Latency requirement: How fast should the data be available?
- Data throughput: What volume of reads and writes must be supported?
- Data size: How much data needs to be stored?
- Data load: How many concurrent users need to be supported?
- Data integrity: How will the accuracy and consistency of data be maintained?
- Data queries: What will be the nature of the queries?
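
The factors above can be captured as a simple checklist before any technology is chosen. The following is a minimal sketch, not a prescribed method: the `StorageRequirements` class, its field names, and the sample values are all hypothetical and only mirror the eight questions in the list.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class StorageRequirements:
    """One field per factor in the list; None means 'not answered yet'.

    All field names and units are illustrative assumptions.
    """
    durability: Optional[str] = None              # e.g. "replicated copies"
    availability_pct: Optional[float] = None      # e.g. 99.9
    max_latency_ms: Optional[float] = None
    throughput_ops_per_sec: Optional[int] = None
    size_gb: Optional[float] = None
    concurrent_users: Optional[int] = None
    integrity: Optional[str] = None               # e.g. "ACID transactions"
    query_pattern: Optional[str] = None           # e.g. "key lookup"


def unanswered(req: StorageRequirements) -> List[str]:
    """Return the names of the factors that still have no answer."""
    return [f.name for f in fields(req) if getattr(req, f.name) is None]


# Example: a partially specified workload (values are made up).
session_store = StorageRequirements(
    max_latency_ms=5,
    concurrent_users=10_000,
    query_pattern="key lookup",
)
print(unanswered(session_store))
```

Listing the unanswered factors makes it obvious which questions still need an answer before a storage type can be selected.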
The following table lists different types of data, with examples and the appropriate storage type for each. Technology decisions need to be made based on the storage type, as shown here:

| Data Type | Data Example | Storage Type | Storage Example |
|---|---|---|---|
| Transactional, structured schema | User order data, financial transaction | Relational database | Amazon RDS, Oracle, MySQL, Microsoft SQL Server |
| Key-value pair, semi-structured | User session data, application log, review | NoSQL | Amazon DynamoDB, MongoDB, Apache HBase, Apache Cassandra, Azure Table Storage |
| Analytics | Sales data, supply chain intelligence, business flow | Data warehouse | Teradata, IBM Netezza, Greenplum, Amazon Redshift, Google BigQuery |
| In-memory | User home page data, common dashboard | Cache | Redis, Memcached, Amazon ElastiCache |
| Object | Image, video | File-based | NAS, Amazon EFS, Amazon S3, Azure Blob Storage, Google Cloud Storage |
| Block | Installable software | Block-based | SAN, Amazon EBS, Azure Disk Storage |
| Streaming | IoT sensor data, clickstream data | Temporary storage for streaming data | Apache Kafka, Amazon Kinesis, Spark Streaming, Apache Flink |
| Archive | Any kind of data | Archive storage | Amazon Glacier, magnetic tape storage, virtual tape library storage |
| Web storage | Static web content such as images and videos | CDN | Amazon CloudFront, Akamai CDN, Azure CDN, Google Cloud CDN, Cloudflare |
| Search | Product search, content search | Search index store | Amazon Elasticsearch Service, Apache Solr, Apache Lucene |
| Data catalog | Table metadata, data about data | Metadata store | AWS Glue, Hive metastore, Informatica data catalog, Collibra data catalog |
| Monitoring | System log, network log, audit log | Monitor dashboard and alert | Splunk, Amazon CloudWatch, Sumo Logic, Loggly |

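
To make the "decide by storage type first" idea concrete, part of the table can be expressed as a lookup. This is only an illustrative sketch: the dictionary covers a subset of rows, and the key spellings and the `storage_type_for` helper are hypothetical names chosen for this example.

```python
# Data type -> storage type, for a subset of the rows in the table.
# The lowercase keys are an assumption made for this sketch.
STORAGE_TYPE_BY_DATA_TYPE = {
    "transactional": "Relational database",
    "key-value": "NoSQL",
    "analytics": "Data warehouse",
    "in-memory": "Cache",
    "archive": "Archive storage",
    "web storage": "CDN",
}


def storage_type_for(data_type: str) -> str:
    """Look up the storage type for a data type, ignoring case and spacing."""
    key = data_type.strip().lower()
    if key not in STORAGE_TYPE_BY_DATA_TYPE:
        raise ValueError(f"no storage type recorded for data type: {data_type!r}")
    return STORAGE_TYPE_BY_DATA_TYPE[key]


print(storage_type_for("Analytics"))
print(storage_type_for("key-value"))
```

The lookup returns a storage type rather than a product, which keeps the product choice (for example, one specific data warehouse over another) as a separate, later decision.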
As the preceding table shows, data comes in various forms, such as structured, semi-structured, unstructured, key-value pair, and streaming. Choosing the right storage improves not only the performance of the application but also its scalability. For example, you can store user session data in a NoSQL database, which allows application servers to scale horizontally while still maintaining user sessions.
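
The session example can be sketched as follows. This is a simplified illustration under stated assumptions: `SessionStore` is an in-memory stand-in for a NoSQL key-value table (a real one would be a network service), and `AppServer`, `login`, and `add_to_cart` are hypothetical names, not part of any product API.

```python
import copy
import uuid
from typing import Dict, List, Optional


class SessionStore:
    """In-memory stand-in for a shared NoSQL key-value table (assumption)."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def put(self, session_id: str, data: dict) -> None:
        # Store a copy so callers cannot mutate stored state by accident.
        self._items[session_id] = copy.deepcopy(data)

    def get(self, session_id: str) -> Optional[dict]:
        item = self._items.get(session_id)
        return copy.deepcopy(item) if item is not None else None


class AppServer:
    """Stateless application server: session state lives only in the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def login(self, user: str) -> str:
        session_id = uuid.uuid4().hex
        self.store.put(session_id, {"user": user, "cart": []})
        return session_id

    def add_to_cart(self, session_id: str, item: str) -> List[str]:
        session = self.store.get(session_id)
        if session is None:
            raise KeyError("unknown session")
        session["cart"].append(item)
        self.store.put(session_id, session)
        return session["cart"]


# Two servers share one store, so either can serve any request.
store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

sid = server_a.login("alice")
server_b.add_to_cart(sid, "book")        # a different server handles this request
print(server_a.add_to_cart(sid, "pen"))  # prints ['book', 'pen']
```

Because neither server keeps session state locally, servers can be added or removed behind a load balancer without logging users out.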
While choosing storage options, you need to consider the temperature of the data, which could be hot, warm, or cold:
- For hot data, you need sub-millisecond latency, which requires cache data storage. Some examples of hot data are stock trading and making product recommendations at runtime.
- For warm data, such as financial statement preparation or product performance reporting, you can tolerate latency from seconds to minutes, and you should use a data warehouse or a relational database.
- For cold data, such as storing 3 years of financial records for audit purposes, you can plan for latency measured in hours and store the data in archive storage.
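
The three temperatures can be turned into a simple classification rule. The sketch below is one possible reading, not a definitive rule: it picks the coldest tier whose latency still fits what the workload can tolerate, and the two cutoffs (`WARM_MIN_SECONDS` and `COLD_MIN_SECONDS`) are assumptions chosen to fill the gaps between "sub-millisecond", "seconds to minutes", and "hours".

```python
from enum import Enum

# Assumed cutoffs: the text gives ranges, not exact boundaries.
WARM_MIN_SECONDS = 1.0       # tolerating >= 1 second allows warm storage
COLD_MIN_SECONDS = 3600.0    # tolerating >= 1 hour allows cold storage


class Temperature(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


# Storage suggestions taken from the three bullets above.
STORAGE_BY_TEMPERATURE = {
    Temperature.HOT: "cache",
    Temperature.WARM: "data warehouse or relational database",
    Temperature.COLD: "archive storage",
}


def classify(tolerable_latency_seconds: float) -> Temperature:
    """Pick the coldest tier whose latency fits within the tolerance."""
    if tolerable_latency_seconds < 0:
        raise ValueError("latency tolerance cannot be negative")
    if tolerable_latency_seconds >= COLD_MIN_SECONDS:
        return Temperature.COLD
    if tolerable_latency_seconds >= WARM_MIN_SECONDS:
        return Temperature.WARM
    return Temperature.HOT


# Sample workloads; the latency tolerances are illustrative.
for label, seconds in [
    ("stock trade", 0.0005),
    ("sales report", 30),
    ("audit records", 4 * 3600),
]:
    tier = classify(seconds)
    print(f"{label}: {tier.value} -> {STORAGE_BY_TEMPERATURE[tier]}")
```

A workload that tolerates less than the warm cutoff is treated as hot here, since warm storage could not meet its latency need.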