, Distributed Data Systems @ LinkedIn
If you have used "atomic increment" in your programming life, you are probably familiar with the ability to increment a number and store it in main memory without fear that another thread will sneak in intermediate changes that would result in some increments being lost.
If you are familiar with Java, Java has classes under the java.util.concurrent.atomic package that provide features like atomic increment or atomic compare-and-set.For example, AtomicInteger's getAndIncrement() method returns the current value, increments it, and stores it in main memory in a thread-safe manner. Hence, if 10 threads were to call AtomicInteger.getAndIncrement() simultaneously (assuming the starting value of the variable were 0). the final value of the variable would always be 10.Distributed systems take this operation to another level. Not only does an increment need to be atomically executed in one machine (with multiple CPUs and memory barriers), the increment needs to be reliably carried out on multiple machines in case one of the machines dies. Systems like Cassandra are built around the concept that data and access to the data survives random node failures.Cassandra offered Distributed Counters as mentioned by Cameron in version 0.8. However, there are problems with it.One problem had to do with idempotence.Under load, an increment operation issued across the network can time out. The client machine does not know if the increment reached the destination, so it reissues the increment. The result is that the increment is carried out once or twice.If the client were not to reissue the increment, the result is that the increment is carried out zero or once. Cassandra counters (when I last used them in 1.0) were not idempotent. To make them idempotent would require that clients issue a "request token" that the Cassandra servers kept around for "some time". If a previously (successfully) executed token were seen again during the "some time" duration, the call would not be re-executed on the server side. Clients would need to re-send increment requests on time out operations. Zookeeper is another distributed system that offers counters in the form of recipes. I have not used them, but have heard they are reliable