CouchDB Replication Scheduler – tweaking and tuning

A few tweaking hints for the Replication Scheduler in CouchDB 2.x

Lately we’ve been experimenting a lot with CouchDB and its replication features.

It’s a very cool paradigm that allows you to hide many layers of complexity related to data synchronisation between different systems behind an automated and almost-instant replication process.

Basically, CouchDB implements two kinds of replication: a “one-shot” replication and a “continuous” replication. In the first case a process starts, replicates an entire DB and then moves to a “Completed” state, while in the second case an always-on, iterating process, using some kind of internal sequence numbers (something conceptually close to the journal log of a filesystem), keeps the slave database continuously in sync with the master one.
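
As a concrete example, a continuous replication can be declared by writing a document into the _replicator database. This is just a sketch: the hostnames, credentials and database names below are placeholders.

curl -X PUT http://admin:password@localhost:5984/_replicator/mydb-continuous \
     -H "Content-Type: application/json" \
     -d '{"source": "http://admin:password@master-host:5984/mydb",
          "target": "http://admin:password@slave-host:5984/mydb",
          "continuous": true}'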

When dealing with many databases and replication processes it’s pretty easy to reach a point where you have a lot of replication processes running on a single server, and that may lead to slowness and, in general, a high load from (effectively idle) activity on the machines.

To avoid such circumstances CouchDB, since version 2.1, implements a replication scheduler that cycles through all the replication jobs in a sort of circular (round-robin) fashion, pausing and restarting jobs to avoid resource exhaustion.

The Replication Scheduler is controlled by a few tunable parameters (see http://docs.couchdb.org/en/stable/config/replicator.html#replicator for more details). Three of those parameters matter the most, as they control the basic aspects of the scheduler:

  • max_jobs – which controls the maximum number of simultaneously running jobs;
  • interval – which controls how often the scheduler is invoked to rearrange replication jobs, pausing some jobs and starting some others;
  • max_churn – which controls the maximum number of jobs to be “replaced” (read: one job is paused, another one is started) in any single execution of the scheduler.
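
To check what these are currently set to on a running node you can query the configuration API. A quick sketch, where credentials and host are placeholders (if your CouchDB doesn’t accept the _local alias, use the actual node name, e.g. couchdb@127.0.0.1):

# read the whole [replicator] section
curl -s http://admin:password@localhost:5984/_node/_local/_config/replicator
# or a single key
curl -s http://admin:password@localhost:5984/_node/_local/_config/replicator/max_jobs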

This is a basic diagram outlining the Replication Scheduler process:

[Figure: the Replication Scheduler process]

So, basically, with “max_jobs” you control how much you want to stress your server, with “interval” you control how often you want to shuffle things up, and with “max_churn” you control how violently the scheduler will act.
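
You can actually watch the scheduler doing this shuffling through the _scheduler/jobs endpoint, which lists every replication job together with its current state (running, pending, crashing…). Again just a sketch, with placeholder credentials and host:

curl -s http://admin:password@localhost:5984/_scheduler/jobs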

  • If max_jobs is too high your server load will increase (a lot!).
  • If max_jobs is too low your replication will be less “realtime”, as there is a higher chance that a replication job could be paused.
  • If interval is too high a paused replication job could stay paused for way too long.
  • If interval is too low a running replication job could be paused too early, before it can actually catch up with its queued activity.
  • If max_churn is too high there may be a high cost in setup and kick-off time (when a replication process is started it has to connect to the server, authenticate, check that everything is aligned and so on…)
  • If max_churn is too low the amount of time a process could stay paused may be pretty long.

As usual, your working environment – I mean, database size, hardware performance, document sizes, whatever – has a huge impact on how you tweak those parameters.

My only personal consideration is that the default value of max_jobs (500) seems like a pretty high value for a common server. After some tweaking, on the “small” virtual machine we use for development we settled on max_jobs set to 20, interval set to 60000 (60 seconds) and max_churn set to 10. On the production server, with better hardware (real HW instead of a VM, SSD drives, more CPU cores, and so on), we expect a higher value for max_jobs – but in the 2x/3x range, so maybe something like 40/60 – and I strongly doubt we could ever reach a max_jobs value of 500.
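
For reference, those development values can be put under the [replicator] section of local.ini, or set at runtime through the configuration API. A sketch with placeholder credentials and host (values are passed as JSON strings):

curl -X PUT http://admin:password@localhost:5984/_node/_local/_config/replicator/max_jobs -d '"20"'
curl -X PUT http://admin:password@localhost:5984/_node/_local/_config/replicator/interval -d '"60000"'
curl -X PUT http://admin:password@localhost:5984/_node/_local/_config/replicator/max_churn -d '"10"'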

Have fun.

MySQLfs updates

I’ve already written about this in the past, but in Italian, so let’s do a little recap for English readers.

Just recently I became involved in a project where a cluster of machines had to replicate their data constantly, in an active-active fashion and with geographical distribution.

We checked different kinds of solutions based on drbd, zfs, hast, coda and so on… Well, there’s a lengthy post on this topic just a few posts before this one, so I suggest checking that out.

At the end of our comparison, the solution that suited us best turned out to be mysqlfs. So I started investigating it and quickly found some things that could be improved.

Main points were:

  • performance, as mysqlfs turned out to be pretty slow under certain circumstances
  • transaction awareness
  • better integration with mysql replication

As I dug into the code, I found a pretty good general infrastructure, but quite frankly I don’t think mysqlfs was ever really used in a production environment.
Apart from that, the project was quite young (the latest official version was 0.4.0) and also pretty “static”, with its latest release dating back to 2009.

So, without any knowledge of C or FUSE whatsoever, I took the sources of the latest “stable” release and began experimenting with it.

A few weeks have passed and I think I’ve reached a very interesting point. These are the goals I have reached so far:

  • mysqlfs is now using InnoDB instead of MyISAM
  • all the write operations are now enclosed in a transaction that gets rolled back if something bad happens halfway (there’s a small sketch of the idea right after this list)
  • using transactions also means better replication interoperability, since InnoDB and the binary log don’t fight for the drives: InnoDB first, binary log after
  • mysqlfs now uses FUSE API version 2.6 instead of the old… Mmh… 2.5, I think
  • using that FUSE API version allowed switching on FUSE “big writes”, the option that allows the kernel and the file system to exchange data blocks bigger than the standard 4k
  • the mysqlfs internal data block was changed from 4k to 128k, both to reduce block fragmentation, to reduce the number of rows in the “data block” table, to reduce the number of inserts, and finally to match the big-writes setting. Receiving 128kb of data and then pushing it into the db 4k at a time wouldn’t really make any sense
  • moving to the 2.6 API allowed mysqlfs to become FreeBSD compliant. I don’t know (yet) about Mac OS or anything else, but having the devil at the party is a good plus for me.
  • today I started working on the new file system functions introduced in the latest versions of FUSE, which should do even better for speed and such
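
To give an idea of what the transactional write path boils down to, here is a conceptual sketch. The table and column names are made up for illustration and are not the actual mysqlfs schema: a write that touches both the inode row and its 128k data blocks is wrapped in a single InnoDB transaction, so either everything lands or nothing does.

mysql -u mysqlfs -p mysqlfs <<'SQL'
START TRANSACTION;
-- bump the file size on the (hypothetical) inode row
UPDATE inodes SET size = size + 131072 WHERE inode = 42;
-- append one 128k (131072 bytes) data block for that inode
INSERT INTO data_blocks (inode, seq, data) VALUES (42, 7, REPEAT('x', 131072));
-- COMMIT only if both statements succeeded; on any error the code issues ROLLBACK instead
COMMIT;
SQL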

Next steps are…

  • implement file and inode locking
  • implement some kind of mechanism for a write-through cache. Many basic system commands (cp, tar, gzip) use very small write buffers (8/16k), so the benefit of the big-writes switch gets lost and performance degrades a lot. I’m a bit afraid this task is above my (null) C knowledge, but I have some interesting ideas in mind…
  • implement the missing functions from fuse
  • introduce some kind of referential integrity in the db, although I have to understand the performance downside
  • introduce some kind of internal data checksum, although everything comes at a price in terms of CPU time
  • introduce some kind of external API (let’s say PHP, for example) to allow PHP applications to access the file system directly, while it’s mounted, maybe from different machines at the same time, without having to go through the file-system functions… Wouldn’t it be cool to have a web app, a Linux server and a FreeBSD server all working on the same file system at the same time? Yes I know, I’m insane 🙂
  • improve replication interoperability by introducing server signatures in the stored data
  • introduce some kind of command line tools to interact with the db and check total size usage and such
  • in the longer run, introduce some kind of versioning algorithm for the stored data

Right now the modifications I made to mysqlfs aren’t public yet, as I couldn’t really figure out the status of the project on SourceForge. Furthermore, its code base on SourceForge is not aligned with the version shipped in the installer (the one I started working from), and it’s stored in svn, which I really don’t know how to use.

I know the latter aren’t big issues, but my spare time is very limited, and I definitely can’t waste it learning svn or manually merging the “new” code that’s in svn and differs from the tgz I started from. If any of you is willing to do it, I’ll be glad to help.

My internal git repository isn’t public yet, partly out of laziness and partly because I’d like to publish something that’s at least usable, and I have to fix a couple of problems before it can be called “idiot proof”.

In the meantime if any of you is willing to try I’d be very glad to share the modified code, or if any of you is willing to contribute then I’d be twice as glad as long as we try to keep development a bit aligned.

Cheers

SCP between two remote hosts

Anyone who has worked with SSH for more than 10 minutes will surely know SCP, the command that lets you copy a file between two machines over an ssh session. The principle is simple: you can copy a file from an ssh server to your local machine, or a file from your local machine to the server. Simple. An example?

From here to there:

scp file.txt utente@server:/directory/

From there to here:

scp utente@server:/directory/file.txt .

What few people realize is that the same command can be used to copy a file between two servers that are both remote. The principle is simple:

scp utente1@serverorigine:/directory/file.txt utente2@serverdestinazione:/directory/

What the documentation forgets to tell you, though, is that, contrary to what you might have thought, your machine will not act as a bridge (it’s a feature, not a bug!); instead, the first server will open a direct connection to the second one to copy the file.

This has some non-trivial implications, namely:

  • the two servers must be able to talk to each other, topologically speaking. Forget about acting yourself as a gateway between two different networks
  • to perform the copy, the first server must be able to authenticate automatically against the second one (authorized_keys… you know the drill?), since the session is “blind” and the first server has no way to come back and ask you for a password (there’s a quick sketch of this right after the list). This constraint obviously doesn’t apply to the connection between you and the first server, which can happily be password based
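
A minimal sketch of that second point, reusing the placeholder user and host names from the example above: generate a key pair on the source server and install the public key on the destination server, so the server-to-server hop can authenticate without a password.

# on serverorigine, as utente1: create a key pair if there isn't one already
# (an empty passphrase, or a loaded ssh-agent, is needed for a truly unattended copy)
ssh-keygen -t ed25519
# install the public key on the destination server
ssh-copy-id utente2@serverdestinazione
# quick test: this must work without a password prompt
ssh utente2@serverdestinazione true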

Happy copying.