Kiev 2012





Agenda

  • Why JRuby?
  • Unix and JVM tools for debugging JRuby
  • Our issues with JRuby in production
  • Current state of art

Why JRuby? Our Requirements

  • Single-process application
  • Shared memory
  • Ability to use native-threads

Production Configuration

  • JRuby 1.6.5(later updated to 1.6.7)
  • Kirk as application server(Jetty-based)
    https://github.com/strobecorp/kirk
  • Akka 1.3 - as Actor-library
  • Java - OpenJDK 6
  • Upstart for daemonizing
  • Ubuntu 10.04

Useful Unix-tools

Can be used for debugging any Unix-process:

  • ps and top - find information about required process(es)
  • strace - tracing all kernel-calls from process
  • lsof - observe all opened sockets/files
  • netstat - monitoring all network activities(interfere with lsof)
  • tcpdump/wireshark - dump and investigation of all network traffic

lsof - list open files

  • Observe all open files/sockets for process
  • Observe all opened files in your system
  • Find-out where all your logs are written
  • Often used in combination with strace

Usage:

$ sudo lsof -p <PID>

lsof usage example

$ sudo lsof -p 21647
COMMAND   PID   USER   FD   TYPE    DEVICE SIZE/OFF      NODE NAME
java    21647 ubuntu  cwd    DIR       8,1     4096    321163 /home/ubuntu
java    21647 ubuntu  rtd    DIR       8,1     4096         2 /
java    21647 ubuntu  txt    REG       8,1    39584      9436 /usr/lib/jvm/java-6-openjdk/jre/bin/java
java    21647 ubuntu  mem    REG       8,1   105288    295079 /lib/x86_64-linux-gnu/libresolv-2.15.so
java    21647 ubuntu  mem    REG       8,1    31104    297067 /lib/x86_64-linux-gnu/libnss_dns-2.15.so
java    21647 ubuntu  mem    REG       8,1   424384      9408 /usr/lib/nss/libfreebl3.so
...
java    21647 ubuntu    0u   CHR       1,3      0t0       735 /dev/null
java    21647 ubuntu    1w   REG       8,1   636410    297370 /var/www/app/rails/log/kirk.log
java    21647 ubuntu    2w   CHR       1,3      0t0       735 /dev/null
....
java    21647 ubuntu   42u  IPv6 146353046      0t0       TCP 127.0.0.1:49872->127.0.0.1:postgresql (ESTABLISHED)
...
java    21647 ubuntu   64u  IPv6 146353326      0t0       TCP localhost:3000 (LISTEN)
java    21647 ubuntu   65u  IPv6 149554050      0t0       TCP 127.0.0.1:37047->127.0.0.1:6379 (CLOSE_WAIT)

strace

  • Trace system-calls
    • open/close file/socket
    • read/write to open file/socket
    • write to stdout/stderr
  • Great for debugging process start-up issues
  • Get a clue why your process stuck
  • Can trace sub-process and threads

strace usage for JRuby

Usage:

$ sudo strace -e trace='!gettimeofday,clock_gettime,futex' -fp <PID>
#
# -e trace='!...' - skip trace of these calls
# -f - trace childs and threads
# -p - with PID

lsof + strace: demo

Run in irb next query:

ActiveRecord::Base.connection.execute('SELECT pg_sleep(10000)')
Find appropriate process PID:
$ jps -l
13690 sun.tools.jps.Jps
13356 org.jruby.Main
Run Strace:
$ sudo strace -fp 13356 -e trace='!gettimeofday,futex,clock_gettime'
Process 13356 attached with 12 threads - interrupt to quit
[pid 13372] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 13362] recv(15,  <unfinished ...>
[pid 13365] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 13372] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)

lsof + strace: demo

lsof in da house:

$ sudo lsof -p 13356
COMMAND   PID USER   FD   TYPE     DEVICE SIZE/OFF    NODE NAME
...
java    13356 cris    0u   CHR      136,9      0t0      12 /dev/pts/9
java    13356 cris    1u   CHR      136,9      0t0      12 /dev/pts/9
java    13356 cris    2u   CHR      136,9      0t0      12 /dev/pts/9
java    13356 cris    3r   REG        8,6 62447715 2232993 /usr/lib/jvm/java-7-oracle/jre/lib/rt.jar
java    13356 cris    4r   REG        8,6 10017442 3419090 /home/cris/.rvm/rubies/jruby-1.6.7/lib/jruby.jar
java    13356 cris    5r   REG        8,6   575675 2239123 /usr/lib/jvm/java-7-oracle/jre/lib/jsse.jar
java    13356 cris    6r   CHR        1,8      0t0    5953 /dev/random
java    13356 cris    7u  unix 0xefb78240      0t0  905568 socket
java    13356 cris    8r   CHR        1,9      0t0    5954 /dev/urandom
...
java    13356 cris   14u  unix 0xc583aac0      0t0  906952 socket
java    13356 cris   15u  IPv6     906954      0t0     TCP localhost:59814->localhost:postgresql (ESTABLISHED)

Issue #1: limited performance

  • At first we started from Amazon Large instance
  • We didn't get required performance during load-testing
  • Next we tried XL-instance and it didn't help
  • Then we realized our Kirk-process used only 512Mb
  • Setting JRUBY_OPTS="-J-Xmx2048m" increased maximum allowed memory and resolved issue

Tool for debug: top

Lessons learned: learn JVM configuration parameters, it can be handy

Issue #2: slow connection to Postgres

Symptoms:
  • Everything went good in development environment
  • In production environment our DB and Application were on different hosting-platform
  • When we run application it looks like it totally stuck
  • But after ~20 seconds it started responding

Tools for debug: tcpdump/wireshark

Debugging:
  • plain traffic sniffing with tcpdump and analysis in wireshark showed JRuby sends many spam-selects to Postgres
  • This issue was known and update of postgres jdbc gem fixed it

Lessons learned: use updated gems and JRuby

JVM-toolbox

  • Useful for debugging any JVM-language: Java, JRuby, Clojure, Scala, Groovy
  • Out of the box tools: jps, jconsole, jvisualvm, jstack, jdb, jmap, jhat, jstat, jstatd
  • Dozens of free and commercial third-party tools: MAT, YourKit, ...

Useful JVM-tools

  • jps -l - show all JVM-processes for current user
  • jstack - get stacktrace of all running threads
  • jvisualvm - monitoring different JVM activities: running threads, memory consumption, GC, and allow to obtain JRuby stacktraces
  • jdb - Java Debugger
  • jmap - obtain heap-dump of running JRuby process
  • MAT - Eclipse Memory Analyzer. Incredible for debugging memory-leaks and other runtime-related issues

VisualVm (jvisualvm)

  • One of the killer-app for JRuby
  • Get common information about JVM-process runtime
  • Work with local and remote JVM-process
  • Monitoring memory consumption and GC
  • Monitoring running threads activity
  • Can create HeapDump of running process
  • Charles Nutter wrote good introduction to VisualVM:
    http://bit.ly/OsaFkA
  • VisualVM deserves its own talk!
  • Obtaining JRuby stacktraces

VisualVm: obtaining JRuby stacktrace

  • Most useful VisualVM feature for troubleshooting!
  • JRuby stacktrace is incredebly useful when JRuby thread stuck
  • Stacktrace in 90% of cases is all you need to find the cause of issue!

Recipe: JRuby stacktrace in VisualVM

1. Run VisualVM:

# Run jvisualvm under the same user as for jruby-process
$ jvisualvm

2. Install "VisualVM-MBeans" plugin:

Tools > Plugins > Available Plugins
Check "VisualVM-MBeans" and click "Install"
Reload VisualVM

3. Run JRuby with "--manage" option:

$ export JRUBY_OPTS="--manage"
$ rails s

Recipe: JRuby stacktrace in VisualVM

4. Open appropriate JRuby-process in "Applications > Local" toolbar

5. Open "MBeans" in right-toolbar and select

org.jruby > Runtime > SomeNumber > Runtime

6. And last step:

Operations > Thread Dump

JRuby stacktrace via VisualVM!

jstack or kill -QUIT

  • Useful for obtaining Java-stacktrace of running JRuby thread
  • Requires some knowledge of Java and JRuby-internals
  • jstack prints its output right into your console
    Run it under the same user as JRuby-process owner(or via su/sudo)!
  • kill -QUIT <PID> prints its output into process stdout

Example: "jstack" or "kill -QUIT"

$ ps -aux | grep jruby
 ubuntu     20071  0.0  2.6 718612 81360 pts/0    Sl+  Sep26   0:48 /usr/lib/jvm/java-7-oracle/bin/java ...
 ubuntu     29775  0.0  0.0   4372   832 pts/8    S+   16:15   0:00 grep jruby
$ kill -QUIT 20071
# OR
$ jstack 20071
2012-09-27 16:19:08
Full thread dump Java HotSpot(TM) Server VM (23.3-b01 mixed mode):

"RMI TCP Connection(16)-127.0.0.1" daemon prio=10 tid=0x8cc08400 nid=0x7451 runnable [0x8bbf9000]
   java.lang.Thread.State: RUNNABLE
...
"main" prio=10 tid=0xb6506c00 nid=0x4e6d runnable [0xb68a4000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.read(Native Method)
    at jline.Terminal.readCharacter(Terminal.java:99)
    at jline.UnixTerminal.readVirtualKey(UnixTerminal.java:122)
    at jline.ConsoleReader.readVirtualKey(ConsoleReader.java:1606)
    at jline.ConsoleReader.readBinding(ConsoleReader.java:764)
    at jline.ConsoleReader.readLine(ConsoleReader.java:511)
    at jline.ConsoleReader.readLine(ConsoleReader.java:457)
    at org.jruby.ext.Readline.s_readline(Readline.java:237)
    at org.jruby.ext.Readline$s$s_readline.call(Readline$s$s_readline.gen:65535)
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:332)
...

Issue #3: Rails 3.1 ain't thread-safe

Symptoms:
  • All Rails-worker threads stuck
  • Kirk can't process any application requests
  • Only JVM-threads stay active

Tools for debug: jstack, VisualVM

Debugging:
  • jstack showed that all JRuby threads are waiting for Mutex
  • We obtained JRuby-stacktrace via VisualVm
  • Stacktrace points us to connection_pool.rb
  • Scrutiny code review of this file revealed nasty bugs in mutex logic
  • Gist with Fix: https://gist.github.com/2658680

Lessons learned: pay closer attention to thread-safety in your gems

Dream about Magic Debugging Tool

Magic Debugging Tool

  • Ability to investigate the whole running runtime
  • All running threads
  • Ability to obtain access to all objects and properties
  • Ability to jump into any JRuby Frame

jdb - Java Debugger

  • Can sneak into any JVM-process
  • Can investigate all threads
  • But it require deep knowledge of JRuby-internals and it's hard to obtain access to complex data structure

MAT - Eclipse Memory Analyzer

  • MAT - http://www.eclipse.org/mat/downloads.php
  • Real JRuby killep-app!
  • The simplest memory-leak detection tool
  • The simplest debugging tool
    • Shows all environment for running threads
    • Allows to observe any object on JVM heap
    • Require some basic knowledge of JRuby internals
  • Deserves for two distinct talks

MAT usage

1. Obtain heap-dump via: jmap or VisualVm:

# normal way
$ jmap -dump:live,format=b,file=snapshot.hprof <PID>

# forced way(when process doesn't respond)
$ jmap -dump:format=b,file=snapshot.hprof -F <PID>

2. Feed it into MAT:

File > Open Heap Dump

Issue #4: Locks on Postgres transaction

Symptoms:
  • One by one worker-threads became locked until all server stuck

Tools for debug: jstack, jmap, MAT

Debugging:
  • jstack showed that part of JRuby threads are waiting for response from Postgres, and part of JRuby threads are waiting for mutex
  • But it was not clear, what kind of queries produce such behavior
  • We did HeapDump with jmap and fed it into MAT
  • Going up by JRuby frames we managed to find actual SQL-query
  • Reason was in tricky deadlock with combination of PG-transaction and our mutex logic

Lessons learned: avoid using mutexes, prefer non-blocking actors

Issue #5: autoload isn't threadsafe

Symptoms:
  • Time-to-time after deploy we got random errors with AR models
  • It looked like some part of model isn't presented at all

Tools for debug: brain

Debugging:
  • Having multithreaded environment we understood it can be a reason
  • Main issue can be when Ruby-class is under construction and some object is created in parallel thread
  • We reproduced this behavior even in MRI 1.9.3
  • The main reason was in gem with "autoload"-s

Fix: explicitly list all autoloaded classes in initializer

Lessons learned: all Ruby-classes should be loaded before going into multithreaded mode

Summary: Why JRuby thread can stuck?

  • Waiting on blocking system call [jstack/(strace + lsof)]
    • Waiting on response from Database
    • Trying to reach some file on broken FS
    • Trying to read/write from file/socket(operation without timeout)
  • Waiting for Mutex [jstack, jvisualvm]

New Recommended Configuration

  • JRuby 1.7.0(when it will be released)
    • 1.7.0preview2 is about 30% faster then 1.6.7
  • Java - OpenJDK 7
    • invokedynamic on new Java doesn't show any speed improvements
  • Actors via: Akka or Celluloid
    • Avoid writting home-made mutex-logic
    • Prefer using some non-blocking Actor-solution
    • We moved our Airbrake-notificator into actor
  • Application server: Trinidad or Jetpack
  • Ubuntu 12.04
    • New Upstart with many appealing improvements

Current State of Business

  • JRuby works very well and stable
  • Actors and threads benefeat us with lower latency on IO-waiting
  • No need to use background workers - instead we use actors
  • And you can try JRuby if it meets your requirements

<Thank You!>