Sunday, April 11, 2010

How to improve application web response using open source

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.

Cassandraa is distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure. Reliability at massive scale is a very big challenge. At this scale, small and large components fail continuously. Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service.

Cassandra values Availability and Partitioning tolerance (AP). Tradeoffs between consistency and latency are tunable in Cassandra. You can get strong consistency with Cassandra (with an increased latency). But, you can't get row locking: that is a definite win for HBase.


Cassandra vs MySQL with 50GB of data
MySQL
Cassandra

~300ms write
~0.12ms write

~350ms read
~15ms read


Eventually Consistent
Consistency describes how and whether a system is left in a consistent state after an operation. In distributed data systems like Cassandra, this usually means that once a writer has written, all readers will see that write.
On the contrary to the strong consistency used in most relational databases (ACID for Atomicity Consistency Isolation Durability) Cassandra is at the other end of the spectrum (BASE for Basically Available Soft-state Eventual consistency). Cassandra weak consistency comes in the form of eventual consistency which means the database eventually reaches a consistent state. As the data is replicated, the latest version of something is sitting on some node in the cluster, but older versions are still out there on other nodes, but eventually all nodes will see the latest version.
More specifically: R=read replica count W=write replica count N=replication factor Q=QUORUM (Q = N / 2 + 1)


  • If W + R > N, you will have consistency
  • W=1, R=N
  • W=N, R=1
  • W=Q, R=Q where Q = N / 2 + 1
Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).
You get consistency if R + W > N, where R is the number of records to read, W is the number of records to write, and N is the replication factor. A ConsistencyLevel of ONE means R or W is 1. A ConsistencyLevel  of QUORUM means R or W is ceiling((N+1)/2). A ConsistencyLevel  of ALL means R or W is N. So if you want to write with a ConsistencyLevel  of ONE and then get the same data when you read, you need to read with ConsistencyLevel  ALL. 

Limitations:

  • Cassandra's compaction code currently deserializes an entire row (per columnfamily) at a time. So all the data from a given columnfamily/key pair must fit in memory. 
  • Cassandra has two levels of indexes: key and column. But in super columnfamilies there is a third level of subcolumns; these are not indexed, and any request for a subcolumn deserializes _all_ the subcolumns in that supercolumn. So you want to avoid a data model that requires large numbers of subcolumns.
  • The byte size of a value can't be more than 2^31-1. 

High scalability Memcached

Memcached: Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

Where we can use memchaced in application?
If you have a high-traffic site that is dynamically generated with a high database load that contains mostly read threads then memcached can help lighten the load on your database. If your DB load is low but CPU usage very high, you could cache computed objects and renderred templates. You can reduce writes related to session handling, temporarily stash data, cache small but frequently accessed files, cache results from web "services" or RSS feeds...

What's the limitation of memcached?
  • Objects larger than 1MB. Memcached is not for large media and streaming huge blobs.
  • The maximum size of a key is 250 characters  The maximum size of a value you can store in memcached is 1 megabyte. If your data is larger, consider clientside compression or splitting the value up into multiple keys.
  • Anyone can just telnet to any memcached server. If you're on a shared system.  Do not run in DMZ zone.
  • No persistence
  • Memcached doesn't support fail over. Remove dead nodes from your list. Be very careful! With default clients adding or removing servers will invalidate all of your cache! Since the list of servers to hash against has changed, most of your keys will likely hash to different servers. It's like restarting all of your nodes at the same time