Vagrant, Docker provider, (and Puppet)

30 Jul

While this is not exactly a Web or Data performance post, it is indeed about performance: speeding up our DEV environments…

[Figure: Vagrant + docker provider presentation at the Docker meetup in Barcelona]

I regularly use Vagrant to share DEV VMs among team developers and for open-source projects.  Vagrant lets you share, build, and provision the server environments in which code runs.  However, as you keep adding VMs to your system, they start consuming resources and slowing things down.  After reading about Docker, I decided to merge the two.

Docker is a virtualization replacement for several scenarios, based on Linux containers (LXC).  It allows you to package applications along with their configuration and OS without the need for virtualization (on recent Linux kernels).  Containers give you isolation, automation, portability, and sharing: many of the features found in a VM (minus security, migration, etc.), but containers run simply as processes on your system, so there is no need for reserved resources and competing VMs.  The Linux kernel scheduler is in charge of deciding when, and how, to run the distinct processes.  So, besides being lighter-weight on our DEV machines, this allows us, for example, to better replicate a production environment on our laptops.  And many more features…

While Vagrant has an official Docker provider, building a Vagrant-compatible box from scratch turned out to be a challenge: I couldn’t find clean, step-by-step instructions.  So I decided to build my own, present it, and share it!

The source code for the project can be found at:  https://github.com/npoggi/vagrant-docker

Basically, to build a Docker image compatible with Vagrant’s defaults, the following 7 steps need to be performed (a minimal Dockerfile sketch follows the list):

  1. Import the base image (FROM repo:tag)
  2. Create the vagrant user
    1. Create a password
    2. Grant sudo permissions and password-less login
  3. Configure SSH
    1. Set up the keys
  4. Install base packages
  5. Install a configuration management system (optional)
    1. Puppet, etc…
  6. Expose the SSH port
  7. Run SSH as a daemon
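
The sketch below is my illustration of those steps, not the repo’s actual Dockerfile: the base image, password, package list, and the URL of Vagrant’s well-known insecure public key are assumptions to check against the repo.

    # Minimal sketch of the 7 steps; values are illustrative.
    FROM ubuntu:14.04

    # 2. Create the vagrant user: default password, password-less sudo
    RUN useradd -m -s /bin/bash vagrant \
     && echo 'vagrant:vagrant' | chpasswd \
     && echo 'vagrant ALL=(ALL) NOPASSWD:ALL' > /etc/sudoers.d/vagrant \
     && chmod 0440 /etc/sudoers.d/vagrant

    # 3./4. Install base packages and configure SSH with Vagrant's
    # insecure public key, so `vagrant ssh` works out of the box
    RUN apt-get update \
     && apt-get install -y openssh-server sudo curl \
     && mkdir -p /var/run/sshd /home/vagrant/.ssh \
     && curl -fsSL https://raw.githubusercontent.com/hashicorp/vagrant/master/keys/vagrant.pub \
          -o /home/vagrant/.ssh/authorized_keys \
     && chmod 0700 /home/vagrant/.ssh \
     && chmod 0600 /home/vagrant/.ssh/authorized_keys \
     && chown -R vagrant:vagrant /home/vagrant/.ssh

    # 5. Optional: install a configuration management system
    RUN apt-get install -y puppet

    # 6. Expose the SSH port
    EXPOSE 22

    # 7. Run SSH as a daemon (in the foreground, so the container stays alive)
    CMD ["/usr/sbin/sshd", "-D"]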

Getting the steps right took me a good deal of trial and error, so I hope this saves some time for other people interested.

So, follow the presentation; then clone the repo, and vagrant up!
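Something like the following should do it (assuming Vagrant 1.6+ with the built-in Docker provider; the --provider flag is only needed if the Vagrantfile doesn’t already set it as the default):

    git clone https://github.com/npoggi/vagrant-docker
    cd vagrant-docker
    vagrant up --provider=docker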

Summary of the current data model

4 Jul

From Spark Summit 2014: slide 13 of Eric Baldeschwieler’s (@jeric14) presentation “Spark and the future of big data applications”, shown in the image below, summarizes quite well the current layers of how Web companies should align their different data needs:

[Figure: the application data model (slide 13)]

While Eric calls it the Big Data model, I believe it applies to the overall data model of any medium-to-large Web company: how data flows within the architecture, and the separate but interrelated data needs and uses.

You can download the presentation at: Spark and the future of big data applications

Data Analytics: MyISAM vs ARCHIVE vs Infobright ICE vs Infobright IEE benchmark (part2)

14 May

This is Part 2 of my benchmark of the MySQL-based engines I have been testing lately.  Part 1 covered data load times, table sizes on disk, and system specs.  In Part 1 we saw how the ARCHIVE engine and both the free and commercial versions of Infobright gave significant gains in disk space.  In load times, the commercial version of Infobright (IEE) was the fastest, while MyISAM (with keys disabled), ARCHIVE, and Infobright ICE were similar.  InnoDB performed poorly importing data from CSV, taking days to import; InnoDB works better with INSERT statements than with CSV imports.  The next figure shows query performance on this test data for two different queries:

[Figure: query performance for the different MySQL engines]


Data Analytics: MyISAM vs ARCHIVE vs Infobright ICE vs Infobright IEE benchmark

3 Apr

What follows is a quick benchmark/comparison of different MySQL-based storage engines I have been working with lately for Big Data analytics.  The comparison includes disk space used, load time, and query performance, as well as some comments.  It is not intended as a formal benchmark.

During the last few days I have been running out of disk space on the 2TB partition I use for my research experiments.  On that partition I mainly have a MySQL database with tables partitioned by week, holding over 2 years of web log and performance data.  At first, I was comparing InnoDB vs MyISAM query performance and disk usage.  MyISAM is quite a bit faster than InnoDB at loading data, especially when DISABLING KEYS first, but re-enabling the keys afterwards was a problem for MyISAM on large tables.  MyISAM doesn’t seem to scale well to a large number of partitions, while InnoDB does.  An advantage of MyISAM tables, besides fast loading, is that they occupy less disk space than InnoDB: InnoDB takes about 40% more space than MyISAM for this type of table, consisting of various numeric IDs and a couple of URLs.  However, I had many crashes with MyISAM and had to repair tables many times.  For data analysis that is an annoyance, but not a serious problem.  I wouldn’t use MyISAM on production/OLTP servers, though; maybe back in the early 2000s…
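
For reference, the fast-load pattern mentioned above looks roughly like this; the table and file names are made up for illustration:

    -- Hypothetical table and file names; the pattern is what matters.
    ALTER TABLE weblog DISABLE KEYS;

    LOAD DATA INFILE '/data/weblog_week01.csv'
    INTO TABLE weblog
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

    -- Rebuilding the indexes is the step that stalls on very large
    -- MyISAM tables.
    ALTER TABLE weblog ENABLE KEYS;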

Anyhow, after optimizing the configuration for both engines, I had to choose between:

  • InnoDB: reliable, but large on disk and slow to load tables.  It could take a week to load the 2 years of data.

  • MyISAM: faster to load, medium size on disk (a bit less than the CSV source), but unreliable for large tables

I decided to explore other options (short of distributed file systems like Hadoop) that allowed easy migration from MySQL, and found:

  • ARCHIVE: a compressed engine for MySQL; it doesn’t support keys (except the numeric primary key).  I was already familiar with it for backups, it is integrated into MySQL, and it supports basic partitioning.

  • Infobright ICE: a compressed columnar storage engine, forked from MySQL, open source, with fast loading.  As cons, it requires a separate installation, and the advanced features are only in the commercial version.

  • Infobright IEE: the commercial version of the storage engine.  It promises multi-core improvements in querying and loading over the open-source version, so I decided to give it a try for comparison.

[Figure: size on disk for the different MySQL engines]


Improve your DNS lookups with namebench

30 Jan

Ilya Grigorik’s great course on web performance made me aware of the importance of DNS server performance and of how poorly maintained DNS servers tend to be.  The Domain Name System was invented in ’82 and is one of the oldest core services of the Internet; however, it is often disregarded, as it is assumed to be fast and one usually connects to whatever server is offered through DHCP.  DNS requires very few resources: it uses UDP, client- and server-side caches, and highly optimized code.  DNS is also very reliable, as clients have a pool of servers to connect to and requests can be forwarded between servers.  In general, though, DNS servers are poorly maintained and not optimized regularly, as most of the time they just “work”.

Ilya suggested trying namebench, an open-source tool to benchmark DNS servers and help you choose the most appropriate ones for your location.  What’s cool about the tool (besides being Python based and having a multi-platform GUI) is that it can take the domain names for its benchmark from your browser’s history and produce graphical reports.
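
From memory, the invocation looks something like this; the resolver IPs are just examples, and ./namebench.py --help lists the actual options:

    # Launch the GUI with sensible defaults:
    ./namebench.py

    # Or include specific resolvers in the comparison
    # (Google DNS and OpenDNS here):
    ./namebench.py 8.8.8.8 208.67.222.222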

[Figure: namebench DNS latency results]


PHP’s XDebug tracing overhead in production

9 Aug

This is a post with some production information I had been meaning to write up… The XDebug PHP extension is a great, open-source debugger and profiler for PHP, and a good alternative to the commercial Zend Server. Besides its profiling capabilities, XDebug adds full stack traces to PHP error messages when enabled (see the config options). These include Fatal Errors and Warnings, as shown in the next picture:

[Figure: XDebug PHP Fatal error trace]
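
For reference, a minimal php.ini sketch for an XDebug 2.x setup of that era; the extension path is illustrative:

    ; load the extension (the path varies per system)
    zend_extension = /usr/lib/php5/modules/xdebug.so
    ; show full stack traces on Fatal Errors and Warnings (the default)
    xdebug.default_enable = 1
    ; include function arguments in the traces (0 to 4, increasing detail)
    xdebug.collect_params = 1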


A tribute to Zabbix, a great network+ monitoring system

7 Apr

Just a short post to recommend a great monitoring system if you haven’t heard of it yet:  Zabbix.

[Figure: Zabbix]

Zabbix has been available to the public since around 2001, with a stable version since 2004.  However, I see very few posts about it, and it is far less popular than Nagios, even though it is more feature-rich.  I have been using it since 2004 for various projects, and it is great.  It is very simple to install, it has had Windows and Unix agents since the beginning (so you don’t need to set up SNMP on your network), and it scales very well.  I even use it to measure and keep track of the performance of my own dev machine.

However, the most important feature I find, besides monitoring servers, keeping performance and availability history, and the graphics and charts, is that you can extend it and import application data easily!
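
As an illustration of that last point (mine, not from the original post), a custom application metric can be pushed with zabbix_sender; the host name and item key below are made up, and the item must be defined as a Zabbix trapper item on the server:

    # Push a custom application metric to the Zabbix server
    # (hypothetical host and item key):
    zabbix_sender -z zabbix.example.com -s web-01 \
        -k app.checkout.latency -o 0.42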

GROUP BY a SQL query by N minutes

2 Feb

Just a quick post with a tip for grouping a date-time field in SQL [tested on MySQL] into buckets of N minutes using FLOOR().

Functions such as DAY(), HOUR(), and MINUTE() are very useful for grouping by dates, but what about every 30, 15, or 10 minutes?
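
A minimal sketch of the idea; the table and column names are made up, and the exact query may differ from the full post:

    -- Group rows into 15-minute buckets: FLOOR(UNIX_TIMESTAMP(ts) / (15 * 60))
    -- maps every timestamp in the same 15-minute window to one integer.
    SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(ts) / (15 * 60)) * (15 * 60)) AS bucket_start,
           COUNT(*) AS hits
    FROM access_log
    GROUP BY bucket_start
    ORDER BY bucket_start;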
