Verifying an SSL certificate and its chain

How to verify an HTTPS certificate plus its related chain

Since Google is the best post-it note one can have, I'm pasting here the command to verify the SSL certificate and the related chain of a given HTTPS server with SNI enabled.

echo | openssl s_client -showcerts -connect www.google.com:443 -servername www.google.com | more
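Should you need a quick summary of the served certificate (subject, issuer and validity dates), the same s_client output can be piped through openssl x509 – a small variation on the command above:

echo | openssl s_client -showcerts -connect www.google.com:443 -servername www.google.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates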

CouchDB Replication Scheduler – tweaks and tuning

A few tweaking hints for the Replication Scheduler in CouchDB 2.x

Lately we’ve been experimenting a lot with CouchDB and its replication features.

It’s a very cool paradigm that allows you to hide many layers of complexity related to data synchronisation between different systems behind an automated and almost-instant replication process.

Basically CouchDB implements two kinds of replication: a “one-shot” replication and a “continuous” replication. In the first case there’s a process that starts, replicates an entire DB and then goes into a “Completed” state, while in the second case there’s an always-on iterating process that, using some kind of internal sequence numbers (something conceptually close to the journal log of a filesystem), keeps the slave database continuously in sync with the master one.
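As an illustration, a continuous replication can be declared by writing a document into the _replicator database; a minimal sketch, where hosts, database names and credentials are placeholders:

curl -X POST http://admin:password@localhost:5984/_replicator \
     -H 'Content-Type: application/json' \
     -d '{"source": "http://localhost:5984/master_db", "target": "http://localhost:5984/slave_db", "continuous": true}'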

When dealing with many databases and replication processes it’s pretty easy to reach a point where you have many replication processes running on a single server, and that may lead to slowness and, in general, a high load of (effectively idle) activity on the machines.

To avoid such circumstances CouchDB, since version 2.1, implements a replication scheduler that cycles through all the replication jobs in some sort of circular fashion, pausing and restarting jobs to avoid resource exhaustion.

The Replication Scheduler is controlled by a few tunable parameters (see http://docs.couchdb.org/en/stable/config/replicator.html#replicator for more details). Three of those parameters are the ones that really matter, as they control the basic aspects of the scheduler:

  • max_jobs – which controls the maximum number of simultaneously running jobs;
  • interval – which controls how often the scheduler is invoked to rearrange replication jobs, pausing some jobs and starting others;
  • max_churn – which controls the maximum number of jobs to be “replaced” (read: one job is paused, another one is started) in any single run of the scheduler.

This is a basic diagram outlining the Replication Scheduler process:

[Diagram: the Replication Scheduler process]

So, basically, with “max_jobs” you control how much you want to stress your server, with “interval” you control how often you want to shuffle things up, and with “max_churn” you control how aggressively the scheduler will act.

  • If max_jobs is too high your server load will increase (a lot!).
  • If max_jobs is too low your replication will be less “realtime”, as there is a higher chance that a replication job gets paused.
  • If interval is too high a paused replication job could stay paused for way too long.
  • If interval is too low a running replication job could be paused too early, before it can actually catch up with its queued activities.
  • If max_churn is too high there may be a high cost in setup and kick-off time (when a replication process is started it has to connect to the server, authenticate, check that everything is aligned and so on…).
  • If max_churn is too low a paused job may stay paused for quite a long time.

As usual, your working environment – I mean database size, hardware performance, document sizes, whatever – has a huge impact on how you tweak those parameters.

My only personal consideration is that the default value of max_jobs (500) seems to me a pretty high value for a common server. After some tweaking, on a “small” Virtual Machine we use for development we settled on max_jobs set to 20, interval set to 60000 (60 seconds) and max_churn set to 10. On the Production server, with better hardware (real HW instead of a VM, SSD drives, more CPU cores, and so on) we expect a higher value for max_jobs – but in the 2x/3x range, so maybe something like 40/60 max_jobs – I strongly doubt we could ever reach a max_jobs value of 500.
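For reference, the development values above live in the [replicator] section of the node's configuration (local.ini, or the equivalent keys through the configuration HTTP API):

[replicator]
max_jobs = 20
interval = 60000
max_churn = 10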

Have fun.

PHP Proc_Open and STDIN – STDOUT – STDERR

In gCloud Storage, our Storage-as-a-Service system, some years ago we developed some chaining technologies that allow us to dynamically expand the features of the Storage subsystem, letting it translate incoming or outgoing files.

A while ago we developed a chain that allows our users to securely store a file by ciphering it when it enters the system and deciphering it when it’s fetched, without our side ever saving the password.

After some thinking we decided to embrace already existing technologies and to rely on openssl for the purpose.

So we had to write some code able to interact with a spawned openssl process. We did some trial and error and, of course, we did our research on Google. After various attempts we found this code, which proved to be pretty reliable:

stdin, stdout, stderr with proc_open in PHP

We tried it first on our Mac OS machines, then on our FreeBSD servers, and it worked flawlessly for a couple of years. Recently one of our customers asked for an on-premises installation of a stripped-down clone of gCloud Storage, which had to run on Linux (CentOS, if that’s relevant). We were pretty confident that everything would go smoothly, but that wasn’t the case. When the system went live we found out that, when deciphering files, it would lose some ending blocks.

Long story short, we found that on Linux a child process can finish while leaving data still in the stdout buffer, while – apparently – it can’t on FreeBSD.

The code we adopted had a specific control to make sure that it wasn’t trying to interact with a dead process. Specifically:

if (!is_resource($process)) break;

was the guilty portion of the code. What was happening was that the openssl process was exiting, the code was detecting it and bailing out before fetching the whole stdout/stderr.

So in the end we came out with this:

public function procOpenHandler($command = '', $stdin = '', $maxExecutionTime = 30) {

    $timeLimit = (time() + $maxExecutionTime);

    $descriptorSpec = array(
        0 => array("pipe", "r"),
        1 => array('pipe', 'w'),
        2 => array('pipe', 'w')
    );

    $pipes = array();

    $response = new stdClass();
    $response->status = TRUE;
    $response->stdOut = '';
    $response->stdErr = '';
    $response->exitCode = '';

    $process = proc_open($command, $descriptorSpec, $pipes);
    if (!$process) {
        // could not exec command
        $response->status = FALSE;
        return $response;
    }

    $txOff = 0;
    $txLen = strlen($stdin);
    $stdoutDone = FALSE;
    $stderrDone = FALSE;

    // Make stdin/stdout/stderr non-blocking
    stream_set_blocking($pipes[0], 0);
    stream_set_blocking($pipes[1], 0);
    stream_set_blocking($pipes[2], 0);

    if ($txLen == 0) {
        fclose($pipes[0]);
    }

    while (TRUE) {

        if (time() > $timeLimit) {
            // max execution time reached: kill the child before closing,
            // otherwise proc_close() may block waiting for it to exit
            @proc_terminate($process);
            @proc_close($process);
            $response->status = FALSE;
            break;
        }

        $rx = array(); // The program's stdout/stderr

        if (!$stdoutDone) {
            $rx[] = $pipes[1];
        }

        if (!$stderrDone) {
            $rx[] = $pipes[2];
        }

        $tx = array(); // The program's stdin

        if ($txOff < $txLen) {
            $tx[] = $pipes[0];
        }

        $ex = NULL;
        // Wait until r/w is possible. A 1 second timeout (instead of
        // blocking forever) lets the max execution time check above fire
        // even if the child produces no output
        stream_select($rx, $tx, $ex, 1);

        if (!empty($tx)) {
            $txRet = fwrite($pipes[0], substr($stdin, $txOff, 8192));
            if ($txRet !== FALSE) {
                $txOff += $txRet;
            }
            if ($txOff >= $txLen) {
                fclose($pipes[0]);
            }
        }

        foreach ($rx as $r) {

            if ($r == $pipes[1]) {

                $response->stdOut .= fread($pipes[1], 8192);

                if (feof($pipes[1])) {

                    fclose($pipes[1]);
                    $stdoutDone = TRUE;
                }
            } else if ($r == $pipes[2]) {

                $response->stdErr .= fread($pipes[2], 8192);

                if (feof($pipes[2])) {

                    fclose($pipes[2]);
                    $stderrDone = TRUE;
                }
            }
        }
        if (!is_resource($process)) {
            // process handle already closed: stop feeding stdin
            $txOff = $txLen;
        }

        $processStatus = proc_get_status($process);
        if (array_key_exists('running', $processStatus) && !$processStatus['running']) {
            // the child has exited: stop waiting on stdin, but keep draining
            // stdout/stderr until EOF (this is the fix for the Linux
            // behaviour described above)
            $txOff = $txLen;
        }

        if ($txOff >= $txLen && $stdoutDone && $stderrDone) {
            break;
        }
    }

    // Ok - close process (if still running)
    $response->exitCode = @proc_close($process);

    return $response;
}
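A minimal usage sketch of the handler above – the cipher choice, the passphrase handling and the method context are illustrative placeholders, not our production setup:

// Hypothetical call from within the same class: cipher $plaintext
// by piping it through a spawned openssl process.
$result = $this->procOpenHandler(
    'openssl enc -aes-256-cbc -pass pass:mysecret', // placeholder command
    $plaintext,                                     // fed to the child's stdin
    30                                              // max execution time (seconds)
);

if ($result->status && $result->exitCode === 0) {
    $cipherText = $result->stdOut; // stdout fully drained, even after exit
} else {
    error_log('openssl failed: ' . $result->stdErr);
}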

Have Fun! 😉

FreeBSD 10.0 bhyve – VMWare ESXi 5.5 comparison – part 2

A few days ago I posted a comparison between FreeBSD’s bhyve and VMWare ESXi 5.5. I received a lot of feedback on the results of our tests, so we decided to investigate further with a new round of tests and a more scientific approach.

As in the previous test, we used a standard “empty” FreeBSD 10 machine + latest portsnap as our main “template”. The VM was using “ahci-hd” as the storage backend and the tests were run over SSH, not on the local console. We always started from this template and ran the same test in different scenarios. The hardware was the same as in the previous tests.

Note: I didn’t write it in the past post, but our first round of tests was run on a ZFS filesystem with both compression and deduplication enabled.
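For reference, enabling that combination on the dataset hosting the VM images boils down to something like this (the pool/dataset name is a placeholder):

zfs set compression=on tank/vms
zfs set dedup=on tank/vms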

Read more: “FreeBSD 10.0 bhyve – VMWare ESXi 5.5 comparison – part 2”

FreeBSD 10.0 BHyVe – VMWare ESXi 5.5 comparison

Hey, I wrote a “part 2” to this article, you may want to check it out!

Hello,

recently FreeBSD 10 has come out, and one of the most interesting new features is the introduction of bhyve, a “type 2 hypervisor” that allows you to easily create a Virtual Machine inside a FreeBSD host.
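To give an idea of how easy it is, in FreeBSD 10.0 a guest can be booted with the vmrun.sh helper shipped among the examples – a sketch, where the disk image, the tap interface and the VM name are placeholders:

sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 1024M -t tap0 -d guest.img guestname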

As with every new technology it is still very rough, but the first “driving” experience was very good. Recently we had a new project starting and some new hardware still unused, and in general I’m not very fond of VMWare, so we decided to do a comparison between VMWare and bhyve to understand what the real performance cost of adopting a new technology would be.

Read more: “FreeBSD 10.0 BHyVe – VMWare ESXi 5.5 comparison”

MySQLfs 0.4.1

Hello everybody.

I’m pleased to announce release 0.4.1 of MySQLfs. This is the first new official version after a long period of inactivity, so please handle it with care. Furthermore, this is my first release so, although I have double checked everything, I may still have made some tremendous mistake.

These are the main improvements in this version:

  • InnoDB usage instead of MyISAM
  • Basic transaction support
  • Upgrade to FUSE API 2.6
  • Enabled support for “big_writes” to speed up FS operations
  • New datablock size
  • FreeBSD (FUSE4BSD) support “out-of-the-box”
  • Support for new FUSE 2.6 API functions:
    • fuse::create – create a file
    • fuse::statfs – returns stats about file system usage (needed for df and such)
    • corrected used block count (needed for du and such)
  • Fixed command line issues: now you can use -obig_writes, -oallow_other (to allow other users to read the mounted filesystem) and -odefault_permissions (which, as of this version, is mandatory when using mysqlfs under FreeBSD)

Please note that this is not a production-ready version (yet), but I ask you to test it wildly and please report all the issues that you may have. I’ll try to fix them.

You can download the package here: mysqlfs-0.4.1.tar (232kb).

PLEASE NOTE THAT THE DATABASE SCHEMA HAS BEEN CHANGED FROM 0.4.0 TO 0.4.1!

If your plan is to upgrade from a previous installation, my suggestion is to compile the new version alongside the old one, create a new, separate FS, mount the new FS and then copy the data from the old FS to the new one.

If you really need to do a live upgrade of a 0.4.0 database please take a look at the (unrecommended and incomplete!) upgrade script in the sql subdir.

Installation

To install mysqlfs just make sure you have installed fuse and all its libs, plus mysql and all its devel libraries, unpack the tar.gz and just run

./configure
make
make install (as root)

Then create a database with proper permissions and use the file schema.sql in the sql dir to create the database definitions.
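Something along these lines should do – database name, user and password are placeholders:

mysql -u root -p -e "CREATE DATABASE mysqlfs; GRANT ALL ON mysqlfs.* TO 'mysqlfs'@'localhost' IDENTIFIED BY 'secret';"
mysql -u mysqlfs -psecret mysqlfs < sql/schema.sql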

Run mysqlfs --help to see all the available options.
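For example, the new switches from this release can be combined like this (the “…” stands for the MySQL connection options, which I’m omitting here – check mysqlfs --help for the exact names):

mysqlfs -obig_writes -oallow_other -odefault_permissions … /mnt/mysqlfs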

You’re done. Have fun.

MySQLfs updates

I have already talked about this previously, but I wrote in Italian, so let’s do a little recap for English readers.

Just recently I became involved in a project where a cluster of machines had to replicate their data constantly, in an active-active fashion and with geographical distribution.

We checked different kinds of solutions, based either on drbd or zfs or hast or coda or… Well, there’s a lengthy post on this issue just a few posts before this one, so I suggest checking that one out.

At the end of our comparison, the solution that suited us best turned out to be mysqlfs. So I started investigating it and quickly found some things that could be improved.

Main points were:

  • performance, as mysqlfs turned out to be pretty slow under certain circumstances
  • transaction awareness
  • better integration with mysql replication

As I dug into the code, I found a pretty good general infrastructure but, quite frankly, I don’t think mysqlfs was ever really used in a production environment.
Apart from that, the project was quite young (the latest official version was 0.4.0) and also pretty “static”, with its latest release dating back to 2009.

So, without any knowledge of C or Fuse whatsoever, I took the sources of the latest “stable” release and began experimenting with it.

A few weeks have passed and I think I’ve reached a very interesting point. These are the goals I have reached as of now:

  • mysqlfs is now using InnoDB instead of MyISAM
  • all the writing operations are now enclosed in a transaction that gets rolled back if something bad happens halfway
  • using transactions also means better replication interoperability, since innodb and the binary log don’t fight for the drives: innodb first, bin log after
  • mysqlfs now uses fuse API version 2.6 instead of the old… Mmh… 2.5, I think
  • the new fuse API allowed switching on fuse “big writes”, the switch that allows the kernel and the filesystem to exchange data blocks bigger than the standard 4k
  • the mysqlfs internal data block was changed from 4k to 128k, both to reduce block fragmentation, to reduce the rows in the “data block” table, to reduce the inserts and, finally, to match the big writes setting. Receiving 128kb of data and then pushing it into the db 4k at a time wouldn’t really make any sense
  • moving to the 2.6 API allowed mysqlfs to be FreeBSD compliant. I don’t know (yet) about Mac OS or anything else, but having the devil at the party is a good plus for me
  • today I started working on the new file system functions that were introduced in the latest versions of fuse, which should do even better for speed and such

Next steps are…

  • implement file and inode locking
  • implement some kind of write-through cache mechanism. Many basic system commands (cp, tar, gzip) use very small write buffers (8/16k), so the benefit of the big writes switch gets lost and performance degrades a lot. I’m a bit afraid this task is above my (null) C knowledge, but I have some interesting ideas in mind…
  • implement the missing functions from fuse
  • introduce some kind of referential integrity in the db, although I have to understand the performance downside
  • introduce some kind of internal data checksum, although everything comes at a price in terms of CPU time
  • introduce some kind of external API – let’s say php, for example – to allow applications to access the file system directly, while it’s mounted, maybe from different machines at the same time, without having to go through the file system functions… Wouldn’t it be cool to have a web app, a Linux server and a FreeBSD server all working on the same file system at the same time? Yes I know, I’m insane 🙂
  • improve replication interoperability by introducing server signatures in the stored data
  • introduce some kind of command line tools to interact with the db and check total size usage and such
  • in the long run, introduce some kind of versioning algorithm for the stored data

Right now the modifications I made to mysqlfs aren’t public yet, as I couldn’t really understand the status of the project on sourceforge. Furthermore, its code base on sourceforge is not aligned with the version that was in the installer (the one I started working on), and it’s stored in svn, which I absolutely don’t know how to use.

I know the latter ones aren’t big issues, but my spare time is very thin, and I definitely can’t waste time learning svn or manually merging the “new” code that’s in the svn and that is different from the tgz I started from. If any of you is willing to do it, I’ll be glad to help.

My internal git repository isn’t public yet because of laziness, and because I’d like to publish something that’s at least usable and I have to fix a couple of problems before it can be defined “idiot proof”.

In the meantime, if any of you is willing to try, I’d be very glad to share the modified code; and if any of you is willing to contribute, I’d be twice as glad, as long as we try to keep development a bit aligned.

Cheers

MySQLfs, a new approach to the shared filesystem concept

For a few months now I have been working on a project where a certain amount of data has to be stored on a filesystem shared among several active nodes of a cluster, with geographically distributed replication as well.

We evaluated a number of different solutions, based on different technologies and with different approaches to the problem, but we had quite some trouble finding something that satisfied all of our prerequisites. In order, we evaluated…

  • DRBD: a block-based approach to replication. Its working logic is that every operation performed by a node on a given disk gets replicated to another node connected over the local network. Synchronization is performed at block level, using bitmap tables to identify, once the initial synchronization has been completed, which blocks have been modified. The underlying approach is filesystem-agnostic and analogous to that of md, the Linux software mirror. This native approach poses two big limits:
    • the first is that a block-level replication, precisely because it is filesystem-agnostic, will always be very expensive in terms of the amount of data travelling between one node and the other (blocks are small, writes cannot be optimised or sequenced…) and, above all, however optimised by the bitmap maps, if the slave node were temporarily disconnected from the primary it would be forced to re-check the alignment of the entire volume – very important factors, since they make it practically impossible to use DRBD in a scenario where the two servers are not on the same local network
    • the second is that the slave node is a hot standby, and the secondary node has no access to the data available on its disk until it becomes the master. In all intellectual honesty, it must be said that for a couple of years DRBD has also included a fake multi-master mode in which both volumes can be mounted simultaneously on both servers and synchronization is bidirectional, but this is again half a rip-off: DRBD is not a filesystem (and is filesystem-agnostic), so with it alone you can do nothing. An ext3 filesystem created on the shared DRBD volume cannot be accessed by the second server, because ext3 is not multi-session. DRBD implements no quorum partitions, no multi-node locking mechanisms… Nothing. The suggested solution is therefore to use a clustered filesystem like RedHat’s GFS or Oracle’s OCFS2 on top of the DRBD volume, which at that point would merely pretend to be an improbable network-shared storage… In my opinion a real mess, considering the heavy impact in terms of technical and technological implications of implementing a cluster filesystem (multicast for GFS, the use of clvm, a filesystem that cannot easily be remounted in a Disaster Recovery scenario in the case of OCFS2)

    As if all that weren’t enough, DRBD has no internal logic whatsoever for managing the cluster itself, which according to the minds behind it should be delegated to something external like HA. In short, DRBD claims to do everything but actually does rather little.

  • FreeBSD’s HAST, just long enough to realise it has the exact same limits as DRBD. What a pity for a project born practically yesterday which, on top of that, is crippled even further by the lack of filesystems like GFS and OCFS2 on FreeBSD. In practice, the stupid brother of something that is already incomplete. Epic fail.
  • the DataReplication of HP’s EVA-something. Very cool, incremental and WAN-aware (complete with compression of the data in transit); too bad the secondary node is, once again, an inoperative hot standby.
  • timed synchronizations based on rsync: paradoxically, the solution that all things considered met most of the prerequisites – it’s incremental, it works over the internet, it’s compressed… Not bad at all… The problem is that, since the resynchronization has to be scheduled at time intervals, precisely because it is timed it has alignment gaps. Sure, one can schedule very frequent synchronizations, but the problem is that the resynchronization run, even though it doesn’t transfer already-aligned data, is pretty slow to get going, especially as the number of stored files grows. With a few tens of thousands of files, over the internet, the realistic minimum alignment frequency is an interval of 10/15 minutes.
  • ZFS on FreeBSD: once again a super-cool solution, this time even easy to back up, but once again not compatible with a dual-master scenario.
  • HDFS: multi-master, geographically distributed, incremental… But… 64MB blocks? The node dispatcher cannot be clustered and by design has no slave? Are we kidding? OK, let’s just say I read the documentation badly.

At this point in my research I must say I was starting to feel pretty discouraged, when I remembered having read, some time ago, about a project to create a virtual filesystem running on top of a MySQL database: MySQLfs.

I searched around and on sourceforge I found the project page, unfortunately apparently not very active, given that the latest version, 0.4.0, dates back to 2009… In a burst of optimism I decided to download and compile it anyway.

MySQLfs is, in truth, a module for FUSE, the project that allows you to write custom filesystems running in userland instead of at kernel level. The advantages are many, including the possibility of easily writing a module that interfaces with the standard mysql libraries. In this specific case, mysqlfs simply “translates” into SQL the writes and reads performed against the filesystem it abstracts. The interesting thing is that the database in which the data is stored can be replicated to other mysql nodes. MySQL replication is obviously incremental, which indirectly turns the resulting filesystem into a true journaled, copy-on-write filesystem, and the database is obviously intrinsically multi-session, so it can safely be mounted in multi-master mode on several nodes. The mysql protocol on which replication is based is compressed and behaves very well even over geographical links.
The database can easily be backed up both integrally (dump) and incrementally (for example by backing up the replication binary logs themselves), and it’s atomic and transactional…
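Since mysqlfs stores everything in ordinary tables, the replication it relies on is plain MySQL master/slave replication; a minimal sketch of an era-appropriate setup, with host names and credentials as placeholders:

# On the master, my.cnf must enable the binary log:
#   [mysqld]
#   server-id = 1
#   log-bin   = mysql-bin
# On the slave (server-id = 2), point it at the master and start replicating:
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='master.example.com', MASTER_USER='repl', MASTER_PASSWORD='secret'; START SLAVE;"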

In short, on paper it would seem to be the cure for all evils.

But of course there’s always a “but”.

As it stands, mysqlfs 0.4.0 is pretty slow.

Fortunately I’ve worked on it a bit, to bring it up to date and to optimise a few things, including:

  • the switch to version 26 of the fuse API, which performs better and supports the “big writes” mode which, in a nutshell, reads and writes bigger blocks of data, drastically reducing the number of queries executed
  • the implementation of transaction support, which drastically optimises mysql’s use of the binary log, avoiding racing conditions between the db engine’s writes to disk and those of the binary log thread

The preliminary results are excellent and I hope to release a public version of this new release of mysqlfs as soon as possible.

I will try to contact the original authors to have my changes included in the official distribution.