Category Archives: JavaScript

Converting a file to a JSON array

For some reason I need that. OK, not any reason. For integrating a CloudInit YAML file into an AWS CloudFormation template. By using this article as reference, I made a simple node.js script for doing just that.

#!/usr/bin/env node
 
var fs = require('fs');
 
fs.readFile(process.argv[2], function (err, file) {
	if (err) {
		console.error(err);
		process.exit(1);
	}
	file = file.toString().split('\n');
	var idx, aux = [];
	for (idx = 0; idx < file.length; idx++) {
		aux.push(file[idx]);
		aux.push('\n');
	}
	file = JSON.stringify(aux);
	console.log(file);
});

Save as something.js, make it an executable, then invoke it with ./something.js /path/to/file.

The end.

Performance breakdown for libxml-to-js

Background

libxml-to-js was born to solve a specific problem: to support my early efforts with aws2js. At the time, the options were fairly limited. xml2js was a carry-over from aws-lib which aws2js initially forked. I was never happy with xml2js for a couple of reasons: performance and error reporting. Therefore I looked for a solution to have a drop-in replacement. Borrowed some code from Brian White, made it fit to the xml2js (v1) formal specifications, then pushed it to GitHub. At some point the project had five watchers and five contributors. I guess it hit a sweet spot. That’s why it’s got support for XPath and CDATA, most of it from external contributions. And only then I started using it for other XML related stuff.

The name was chosen to make a distinction from libxmljs which sits at the core of this library which actually binds to Gnome’s libxml2.

Due to the fact that aws2js gained some popularity and I’m doing a complete rewrite with 0.9, the output of libxml-to-js most probably won’t change beyond the “specs” of xml2js v1.

Performance

The actual reason for why I’m writing this article is the fact that people keep asking about the reason for choosing libxml-to-js over xml2js, therefore next time when this question arrives, I am going to simply link this article.

Even now, two and a half years later, with some crappy benchmark that I pushed together, it is somewhere around 25-30% faster than xml2js under usual circumstances. In only specific cases that don’t apply to the XML returned by AWS, xml2js closes in. The part where it really shines is still the error reporting where besides the fact that’s accurate, it is also screaming fast compared to xml2js. In my tests it came out to be around 27X faster.

The code:

var Benchmark = require('benchmark');
 
var suite = new Benchmark.Suite;
 
var parser1 = require('libxml-to-js');
var parser2 = new require('xml2js').Parser({
    mergeAttrs: true,
    explicitRoot: false,
    explicitArray: false
}).parseString;
 
require('fs').readFile(process.argv[2], function(err, res) {
    if (err) {
        console.error(err);
        return;
    }
    var xml = res.toString();
    // add tests
    suite.add('XML#libxml-to-js', function() {
        parser1(xml, function(err, res) {});
    })
        .add('XML#xml2js', function() {
            parser2(xml, function(err, res) {});
        })
    // add listeners
    .on('cycle', function(event) {
        console.log(String(event.target));
    })
        .on('complete', function() {
            console.log('Fastest is ' + this.filter('fastest').pluck('name'));
        })
    // run async
    .run({
        'async': true
    });
 
});

The results, based onto the XML files from the libxml-to-js unit tests and the package.json for the error speed test:

# package.json
XML#libxml-to-js x 18,533 ops/sec ±3.46% (75 runs sampled)
XML#xml2js x 673 ops/sec ±1.35% (68 runs sampled)
Fastest is XML#libxml-to-js
 
# ec2-describeimages.xml
XML#libxml-to-js x 1,122 ops/sec ±4.59% (74 runs sampled)
XML#xml2js x 818 ops/sec ±7.02% (83 runs sampled)
Fastest is XML#libxml-to-js
 
# ec2-describevolumes-large.xml
XML#libxml-to-js x 65.41 ops/sec ±3.13% (65 runs sampled)
XML#xml2js x 50.88 ops/sec ±2.14% (65 runs sampled)
Fastest is XML#libxml-to-js
 
# element-cdata.xml
XML#libxml-to-js x 14,689 ops/sec ±5.41% (72 runs sampled)
XML#xml2js x 11,551 ops/sec ±2.36% (88 runs sampled)
Fastest is XML#libxml-to-js
 
# namespace.xml
XML#libxml-to-js x 9,702 ops/sec ±5.75% (72 runs sampled)
XML#xml2js x 5,802 ops/sec ±2.41% (81 runs sampled)
Fastest is XML#libxml-to-js
 
# root-cdata.xml
XML#libxml-to-js x 22,983 ops/sec ±7.11% (69 runs sampled)
XML#xml2js x 14,849 ops/sec ±6.01% (87 runs sampled)
Fastest is XML#libxml-to-js
 
# text.xml
XML#libxml-to-js x 2,669 ops/sec ±3.68% (78 runs sampled)
XML#xml2js x 2,617 ops/sec ±2.41% (88 runs sampled)
Fastest is XML#libxml-to-js
 
# wordpress-rss2.xml
XML#libxml-to-js x 2,056 ops/sec ±4.08% (75 runs sampled)
XML#xml2js x 1,226 ops/sec ±2.79% (84 runs sampled)
Fastest is XML#libxml-to-js

The tests ran under node.js v0.10.22 / OS X 10.9 / Intel Core i5-4250U CPU @ 1.30GHz with the latest module versions for both libxml-to-js and xml2js.

Inlining the PEM encoded files in node.js

Multi line strings in JavaScript are a bitch. At least till ES6. The canonical example for a node.js HTTPS server is:

// curl -k https://localhost:8000/
var https = require('https');
var fs = require('fs');
 
var options = {
  key: fs.readFileSync('test/fixtures/keys/agent2-key.pem'),
  cert: fs.readFileSync('test/fixtures/keys/agent2-cert.pem')
};
 
https.createServer(options, function (req, res) {
  res.writeHead(200);
  res.end("hello world\n");
}).listen(8000);

All fine and dandy as the sync operation doesn’t penalize the event loop. It is associated with the server startup cost. However, jslint yells about using sync operations. As the code is part of the boilerplate for testing http-get, refactoring didn’t make enough sense. Making jslint to STFU is usually the last option. The content of the files never changes, therefore it doesn’t make any sense to read them from the disk either. Inlining is the obvious option.

Couldn’t find any online tool to play with. Therefore I fired a PHP REPL, then used my PCRE-fu to solve this one. The solution doesn’t look pretty, but it gets the job done:

php > var_dump(preg_replace('/\n/', '\n\\' . "\n", file_get_contents('server.key')));
string(932) "-----BEGIN RSA PRIVATE KEY-----\n\
MIICXAIBAAKBgQCvZg+myk7tW/BLin070Sy23xysNS/e9e5W+fYLmjYe1WW9BEWQ\n\
iDp2V7dpkGfNIuYFTLjwOdNQwEaiqbu5C1/4zk21BreIZY6SiyX8aB3kyDKlAA9w\n\
PvUYgoAD/HlEg9J3A2GHiL/z//xAwNmAs0vVr7k841SesMOlbZSe69DazwIDAQAB\n\
AoGAG+HLhyYN2emNj1Sah9G+m+tnsXBbBcRueOEPXdTL2abun1d4f3tIX9udymgs\n\
OA3eJuWFWJq4ntOR5vW4Y7gNL0p2k3oxdB+DWfwQAaUoV5tb9UQy6n7Q/+sJeTuM\n\
J8EGqkr4kEq+DAt2KzWry9V6MABpkedAOBW/9Yco3ilWLnECQQDlgbC5CM2hv8eG\n\
P0xJXb1tgEg//7hlIo9kx0sdkko1E4/1QEHe6VWMhfyDXsfb+b71aw0wL7bbiEEl\n\
RO994t/NAkEAw6Vjxk/4BpwWRo9c/HJ8Fr0os3nB7qwvFIvYckGSCl+sxv69pSlD\n\
P6g7M4b4swBfTR06vMYSGVjMcaIR9icxCwJAI6c7EfOpJjiJwXQx4K/cTpeAIdkT\n\
BzsQNaK0K5rfRlGMqpfZ48wxywvBh5MAz06D+NIxkUvIR2BqZmTII7FL/QJBAJ+w\n\
OwP++b7LYBMvqQIUn9wfgT0cwIIC4Fqw2nZHtt/ov6mc+0X3rAAlXEzuecgBIchb\n\
dznloZg2toh5dJep3YkCQAIY4EYUA1QRD8KWRJ2tz0LKb2BUriArTf1fglWBjv2z\n\
wdkSgf5QYY1Wz8M14rqgajU5fySN7nRDFz/wFRskcgY=\n\
-----END RSA PRIVATE KEY-----\n\
"
php > var_dump(preg_replace('/\n/', '\n\\' . "\n", file_get_contents('server.cert')));
string(892) "-----BEGIN CERTIFICATE-----\n\
MIICRTCCAa4CCQDTefadG9Mw0TANBgkqhkiG9w0BAQUFADBmMQswCQYDVQQGEwJS\n\
TzEOMAwGA1UECBMFU2liaXUxDjAMBgNVBAcTBVNpYml1MSEwHwYDVQQKExhJbnRl\n\
cm5ldCBXaWRnaXRzIFB0eSBMdGQxFDASBgNVBAMTC1N0ZWZhbiBSdXN1MCAXDTEx\n\
MDgwMTE0MjU0N1oYDzIxMTEwNzA4MTQyNTQ3WjBmMQswCQYDVQQGEwJSTzEOMAwG\n\
A1UECBMFU2liaXUxDjAMBgNVBAcTBVNpYml1MSEwHwYDVQQKExhJbnRlcm5ldCBX\n\
aWRnaXRzIFB0eSBMdGQxFDASBgNVBAMTC1N0ZWZhbiBSdXN1MIGfMA0GCSqGSIb3\n\
DQEBAQUAA4GNADCBiQKBgQCvZg+myk7tW/BLin070Sy23xysNS/e9e5W+fYLmjYe\n\
1WW9BEWQiDp2V7dpkGfNIuYFTLjwOdNQwEaiqbu5C1/4zk21BreIZY6SiyX8aB3k\n\
yDKlAA9wPvUYgoAD/HlEg9J3A2GHiL/z//xAwNmAs0vVr7k841SesMOlbZSe69Da\n\
zwIDAQABMA0GCSqGSIb3DQEBBQUAA4GBACgdP59N5IvN3yCD7atszTBoeOoK5rEz\n\
5+X8hhcO+H1sEY2bTZK9SP8ctyuHD0Ft8X0vRO7tdt8Tmo6UFD6ysa/q3l0VVMVY\n\
abnKQzWbLt+MHkfPrEJmQfSe2XntEKgUJWrhRCwPomFkXb4LciLjjgYWQSI2G0ez\n\
BfxB907vgNqP\n\
-----END CERTIFICATE-----\n\
"
php >

This gave me usable multi line strings that don’t break the PEM encoding.

Update: shell one liner with Perl

cat certificate.pem | perl -p -e 's/\n/\\n\\\n/'

Poor man’s tail recursion in node.js

If you find yourself in the situation of doing recursion over a large-enough input in node.js, you may encounter this:

node.js:201
        throw e; // process.nextTick error, or 'error' event on first tick
              ^
RangeError: Maximum call stack size exceeded

Oops, I smashed the stack. You may reproduce it with something like this:

var foo = []
 
for (var i = 0; i < 1000000; i++) {
    foo.push(i)
}
 
var recur = function (bar) {
    if (bar.length > 0) {
        var baz = bar.pop()
        // do something with baz
        recur(bar)
    } else {
        // end of recursion, do your stuff
    }
}
 
recur(foo)

“Thanks, that’s very thoughtful. But you’re not helping.” Bear with me. The solution is the obvious tail call elimination. But JavaScript doesn’t have that optimization.

However, you may wrap the tail call in order to call the above recur() function in a new stack. The proper recur() implementation is:

var recur = function (bar) {
    if (bar.length > 0) {
        var baz = bar.pop()
        // do something with baz
        process.nextTick(function () {
            recur(bar)
        })
    } else {
        // end of recursion, do your stuff
    }
}

Warning: please read this carefully. I gave you the solution for recurring over such a large input, but the performance is poor. Using process.nextTick (or a timer function such as setTimeout for that matter, slower BTW) is an expensive operation. Didn’t test where’s the actual bottleneck (epoll itself under Linux, libuv | libev, etc).

time node recur.js
node recur.js  1.36s user 0.28s system 101% cpu 1.610 total

The cost of this method is high. Therefore, don’t attempt this in a web application. It kills the event loop. For instance, I don’t use node for writing web applications. It is a difficult task, while the cost of the event loop itself isn’t that negligible as you may think. It useful as long as the CPU time is negligible compared to the time spent doing IO. Therefore, please, don’t include me in the group of people that thinks about node as the hammer for all the problems you throw at it.

If you’re wondering why I won’t just simply iterate the object, the answer is simple: because that “do something with baz” involves some async IO that would kill the second data provider. Sequential calls ensure that everybody in the architecture stays happy. Besides, I don’t actually use bar.pop(), but something like bar.splice(0, 5000) for packing more data in less remote calls and less events. bar.shift() in a situation like this is as slow as molasses in January. In an async framework, the order of the items from a TODO list is not relevant, therefore use the fastest way.

If you’re still wondering why I still use a solution like this, the above technique is part of the cost associated with the start-up cost. The application fetches all the required data in RAM. Having the application to kill the event loop for 20-30 seconds before hitting the Internet pipe is negligible for a process that runs for hours or days. After the application hits the Internet, only then I can say that node is in use for the stuff where it shines. I know, before this, I listed all the wrong reasons for using node as a tool.

Computing file hashes with node.js

Since node.js has the shiny crypto module which binds some stuff to the openssl library, people might be tempted to compute file hashes with node.js. At least the crypto manual page shows how to do a SHA1 for a given file (mimics sha1sum). Should people do this? The answer is: NO. Some may say because it blocks the event loop. I say: because it is as slow as molasses in January. At least compared to dedicated tools.

Let’s have a look:

var filename = process.argv[2];
var crypto = require('crypto');
var fs = require('fs');
 
var shasum = crypto.createHash('sha256');
 
var s = fs.ReadStream(filename);
s.on('data', function(d) {
  shasum.update(d);
});
 
s.on('end', function() {
  var d = shasum.digest('hex');
  console.log(d + '  ' + filename);
});

time node hash.js ubuntu-10.04.3-desktop-i386.iso
208fb66dddda345aa264f7c85d011d6aeaa5588075eea6eee645fd5307ef3cac ubuntu-10.04.3-desktop-i386.iso
node hash.js ubuntu-10.04.3-desktop-i386.iso 28.92s user 0.80s system 100% cpu 29.661 total

time sha256sum ubuntu-10.04.3-desktop-i386.iso
208fb66dddda345aa264f7c85d011d6aeaa5588075eea6eee645fd5307ef3cac ubuntu-10.04.3-desktop-i386.iso
sha256sum ubuntu-10.04.3-desktop-i386.iso 4.86s user 0.21s system 99% cpu 5.093 total

time openssl dgst -sha256 ubuntu-10.04.3-desktop-i386.iso
SHA256(ubuntu-10.04.3-desktop-i386.iso)= 208fb66dddda345aa264f7c85d011d6aeaa5588075eea6eee645fd5307ef3cac
openssl dgst -sha256 ubuntu-10.04.3-desktop-i386.iso 4.40s user 0.17s system 100% cpu 4.567 total

Edit: to sum up for those with little patience:

node hash.js – 29.661s
sha256sum – 5.093s
openssl dgst -sha256 – 4.567s

/Edit

That’s a ~6.5X speed boost just by invoking openssl alone instead of binding to its library. node.js does something terribly wrong somewhere since the file I/O is not to blame for the slowness:

var filename = process.argv[2];
var fs = require('fs');
 
var s = fs.ReadStream(filename);
s.on('data', function(d) {
 
});
 
s.on('end', function() {
 
});

time node read.js ubuntu-10.04.3-desktop-i386.iso
node read.js ubuntu-10.04.3-desktop-i386.iso 0.62s user 0.60s system 106% cpu 1.148 total

This little example that I hacked together shows that using child_process.exec is pretty fine:

var exec = require('child_process').exec;
exec('/usr/bin/env openssl dgst -sha256 ' + process.argv[2], function (err, stdout, stderr) {
	if (err) {
		process.stderr.write(err.message);
	} else {
		console.log(stdout.substr(-65, 64));
	}
});

time node hash2.js ubuntu-10.04.3-desktop-i386.iso
208fb66dddda345aa264f7c85d011d6aeaa5588075eea6eee645fd5307ef3cac
node hash2.js ubuntu-10.04.3-desktop-i386.iso 4.44s user 0.19s system 100% cpu 4.630 total

So you can have your cake and eat it too. The guys with the philosophy got this one right.