Contents

Monologue

is not a blog

DICE

IRC/Web Server for MS Windows®

Japanese Side

Japanese contents

 

Favorites

Beyond3D

The forum runs @ 120fps.

Slashdot

News for nerds, stuff that doesn't matter - powered by Slash

The Code Project

Windows Free Source Code and Tutorials

SceneCritique

Powered by DICE

The Old New Thing

For all Windows programmers

Links

Home

Home (alt.)

Perl, Ruby, Multithreading, Embedding

by RyuK ( )

Nov. 15, 2006

 

For the first half of this article the main topic is multithreading in the 2 scripting languages, Perl and Ruby. By writing a multithreaded download manager application in Perl and then porting it to Ruby, it'll show you how to write a multithread application in the both languages and show you how different they are in this area. This section should be fairly easy and doesn't require much knowledge about these scripting languages, but it's expected that you have basic grasp of multithread programming.

The second half is for a bit more advanced programming topic; it's about how to write a C++ application with an embedded Perl or Ruby interpreter. Just embedding them is not rocket science, but using them in an effective manner is not a very easy task right now because of the implementations of these languages. If you are familiar with .NET you might know it's embedded-friendly with AppDomain and COM interfaces, on the other hand you have only raw C interfaces for these scripting languages, let alone scarce documents. As for Perl embedding, the sample code is based on the one I actually implemented in the web server of the DICE. Since it's realized by the mixture of C++ code and Perl hack, it requires some knowledge of C/C++, advanced Perl programming, and Perl internals. But don't worry, I'll annotate most lines in the code to make it useful for as many people as possible because it's the very purpose of this article! Last but not least, the platform for those experiments is Microsoft Windows XP and Visual C++ 7.1, but due to the platform-neutral nature of these scripting languages most things should be applicable to any platforms.

 

Though I use Microsoft Windows for my primary desktop, I still use the Perl scripting language to make a code snippet for my trivial text manipulation work such as list generation in the shell and some network test. I know the Ruby object-oriented scripting language has become the hot topic among tech-savvy people, but I'd prefered Perl just because I was familiar with it for longer time through developing Perl CGI applications for a while in the earlier part of my programming history. My most preferred programming language is by far C++, but it's not exactly handy to write a tiny application. I like C# and .NET too, but it's still overkill. So which tool should I choose? I hate dynamic languages and would like to criticise them, but it's true that they are handy.

There are many scripting languages, Perl, Ruby, Python, PHP... First Python is out, I don't like significant whitespaces because I do lots of indents just to tidy up a source code. Next I drop PHP, its focus seems to be becoming a good companion with Apache and nothing else. Ruby is cool, but I didn't like the begin...end block. You can use braces for blocks, but can't in other places. Basically I'm too fond of that C-style (or Algol-style, to be nitpicky) syntax. Perl 5 is a clumsy procedural scripting language built on the C-like syntax, and the only point I applaud in Perl is that it recommends C-like syntax (albeit with a little difference here and there) if you try to be consistent. If I could use Ruby with C/Java like syntax I'd completely abandon Perl anyday. But it didn't happen.

Besides, object-orientedness in Ruby doesn't matter in a small code snippet as humans are not that dumb and can use non-object-oriented, non-intuitive expressions. Well it's not totally useless, but it makes more sense in the other situations. When you are building a huge software stack or a huge loosely-coupled software network, capsulation is very important. There are other goodies in Ruby, but I couldn't care less about esoteric syntax sugars. So Perl is the perfect language for writing a code snippet? I tended to think so, but there's more to think about.

One day I noticed my favorite download manager application for Windows didn't support chunked HTTP transfer properly. The downloader / web site grabber application, Irvine, is an excellent Japanese software but the development halted a few years ago. Its document said it would support it in future, but it's apparent that it wouldn't arrive for a while. If I have enough time I want to develop an HTTP download manager just for myself in C# or something, but it's not possible then. What I needed at that time was downloading many files off a web site that is configured to send all files in chunked transfer. Therefore I began evaluation of a network library of Perl to see if it can do chunked transfer. If it can, I can just write a short Perl script to get the result I want. But it doesn't end there, as I had to download hundreds of files it's preferable that it downloads multiple files at the same time just like the aforementioned Windows application. I knew Perl had implemented multithreading years ago, but still had the impression that it might be awkward. So this is a nice occasion to do an experiment for these 2 points: HTTP networking and multithreading.

Thinking about them, it came to my mind that I'd read somewhere that Ruby supports them out of the box. Ruby supports multithreading even in DOS by its own non-native threads. Also Ruby's library can handle networking with ease, as far as I knew. I'd known how Ruby works and what kind of things are available for it for years but didn't write a Ruby code as Perl was sufficient for my use. On the other hand I knew the transition to Perl 6 was not exactly smooth and I'd waited years for it to come but eventually the whole Perl scene got out of my interest when I was into other things. So this is a good occasion to test Ruby for myself and see how my prejudice against it which I wrote above about its non-C-like appearance can stand and if I can move to Ruby for my code snippets.

So let's write a multithreaded download manager for Perl. First, we must choose how to download files. Since my purpose is to get an application that can handle chunked transfer properly and just that, it's obvious I can't bother to write a complicated network negotiation with basic Perl Socket classes! libwww-perl (LWP) is a set of Perl modules which provides API to write a web client. I tested if it can download a file from the web site from which the Windows downloader application failed to download a file and fortunately it worked. I forget if the standard Perl distribution for Windows available from ActiveState has LWP, if it's not installed you have to download it from the author's page at the CPAN and build it with the nmake of Visual C++ (probably freely available from Microsoft) from the command line. With libwww-perl you can control a web client object to surf the web.

Another thing to look at is how Perl's multithreading works. Perldoc has a tutorial for Perl multithreading, perlthrtut. This document has most of the info you need to program a multithread application. If there's only one point which is specific to the Perl thread, it's that memory space is not shared by threads unlike usual thread implementations. All usual premitives that are used in multithreaded programming is available in there. For more info about thread-related modules in Perl such as threads::shared, see their documents from the author's page at CPAN.

By the way I first tried to make my downloader application in this way - it spawns multiple worker threads from the main thread and suspends the main thread by putting a blocking semaphore (by setting the counter value 0) in it, then a worker thread downloads a file in the download queue, and it spawns another thread at the end of a thread, finally when the queue becomes empty the semaphore in the main thread is lifted, then all threads are joined and the main thread ends. The number of worker threads is kept at the same number throughout execution. But it didn't work, as the Perl interpreter crashes at the end of execution everytime I run it though it can download all the files in the list. For writing a native multithread application for Microsoft Windows this should work, but in Perl's case you have to join all threads without spawning a new thread from a running thread. I first thought it's caused by a bug in the thread module and browsed the forum for the module, and what I found was there are non-thread-safe modules which can't be used in Perl threads. libwww-perl looks like that and if I removed it the program ended without an error. Seeking the solution I downloaded the latest version of the threads module from the author's page and built it. I tested threads-1.42 which was the latest at the time I wrote the Perl script. It had a small problem when compiling it with VC++, it couldn't detect the existence of a C compiler even though I used the command console provided in the Microsoft Platform SDK. The solution is to edit Makefile.pl and skip the have_cc subroutine. Then you can compile and install it. But it had still the same crash. Without a choice I abandoned this design and decided to use the more traditional, pthread-esque design.

In this actual design of the script, it spawns worker threads and they get in infinite loops that pick up a job from a job queue one by one. The main thread blocks by calling the join function on these worker threads. This looks fairly simple even from the description of it and may look better than what I described above, but I prefer a more asynchoronous style if it works. I wrote it and this time it worked flawlessly without a crash and resource leaks. The following is the code. It's tested with the Win32 version of Perl 5.8.8 available from ActiveState, threads-1.42 module, and libwww-perl-5.805 module.

The simple user configuration section comes at the top.


# user configuration begin ################################

# You have to list the URLs for the files to be downloaded in a file named
# "download_files.txt" and put it in the same directory as this script. my $username = ""; my $password = ""; my $number_of_threads = 3; my $download_interval_sec = 3; my $download_list_filename = "download_files.txt"; my $storage_directory = 'E:\program\src\downloader\store'; # set "" for the current directory my $user_agent_string = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"; # user configuration end ##################################

$username and $password are for basic authentication at password-protected pages. $number_of_threads tells how much worker threads do the downloading work simultaneously. You can set how many seconds it waits before making a new download connection in $download_interval.

After user configuration options, you have to load necessary modules and other tools useful throughout this application.


use strict;
use warnings;

use threads;
use threads::shared;
#use Thread::Queue;
use Thread::Semaphore;

use LWP::UserAgent;
use Cwd;

my $current_directory = Cwd::getcwd();

if (!$storage_directory)
{
	$storage_directory = $current_directory;
}

$| = 1;

my @download_queue : shared; # used for optimization instead of Thread::Queue

my $sem_download_queue = new Thread::Semaphore;
my $sem_stdout = new Thread::Semaphore;

my $last_download_time : shared = 0;	  
	  

The strict and warnings modules are in action here to ban obscure expressions which are often seen in Perl hacks. Though this is a relatively short script it's better not to be bothered by minor things. $| = 1 expression is what you always see in Perl scripts that processes realtime I/O. It prevents buffering and makes Perl show what it has to show immediately. The synchoronized download queue where all jobs are put and fetched in the serial manner is @download_queue. To make it accessible from multiple threads, you have to mark it explicitly with the shared keyword and this is Perl-specific way as I wrote above. Actually Perl already has a useful tool for this use as the Thread::Queue module, I don't use it here as I control the scope of synchonization explicitly to gain a bit better performance. So it requires a semaphore object as a synchronization primitivem, $sem_download_queue. $sem_stdout is the semaphore to synchronize the standard output as multiple threads try to print text reports simultaneously. $last_download_time is the variable used to force threads to put an interval inbetween downloads as explained in the configuration options.

The next part is the main loop of the script.


print "Perl threads version: " . $threads::VERSION . "\n";  

open(IN, $current_directory . "\\" . $download_list_filename) || die("Can't open " . $download_list_filename);

while (<IN>)
{
	chomp;
	if ($_ =~ /^http:/i)
	{
		push @download_queue, $_;
	}
}

close(IN);

print @download_queue . " URIs have been loaded from the download list\n";

for (my $i = 0; $i < $number_of_threads; ++$i)
{
	new threads(\&download_thread_func);
}

foreach my $thr (threads->list)
{
	if ($thr->tid)
	{
		$thr->join;
	}
}

print "Download completed\n";

################################################################
	 

Very short, don't you think? It's better that there are less things to tweak than to be error-prone. It reads the content of the URL list and puts all of them in the download job queue. Then spawns worker threads, and waits them by calling join. As the join function blocks the main thread can stay on the memory waiting all worker threads return from the work. If it doesn't block, the main thread just exits and the whole program halts leaking thread resources.

A spawned worker thread calls the function referenced as download_thread_func. Let's see the part that does the heavy work.


sub print_ts
{
	$sem_stdout->down;
	my $tid = threads->self->tid();
	print ($tid . ": " . shift @_);
	$sem_stdout->up;
}

sub download_thread_func
{
	for (;;)
	{
		my $u = "";
		my $sleep_time = 0;

		$sem_download_queue->down;

		if (@download_queue == 0)
		{
			$sem_download_queue->up;
			print_ts "The download queue is empty\n";
			return;
		}
		else
		{
			$u = shift @download_queue;

			# $last_download_time is protected by $sem_download_queue
			my $t = time();
			if ($t < $last_download_time)
			{
				$sleep_time = $last_download_time - $t + $download_interval_sec;
			}
			else
			{
				if ($t - $last_download_time < $download_interval_sec)
				{
					$sleep_time = $download_interval_sec - ($t - $last_download_time);
				}
				$last_download_time = $t + $sleep_time;
			}
		}

		$sem_download_queue->up;

		if ($sleep_time)
		{
			print_ts "Sleeping for " . $sleep_time ." seconds\n";
			sleep($sleep_time);
		}

		download_uri($u);
	}
}
	

print_ts is just a debug output function that is used mainly for showing HTTP headers. It's a print functon synchronized by a semaphore. The download_thread_func subroutine is it fetches a job from the job queue and download it. It continues to do it until the queue gets empty. The part that tests and modifies the download queue is synchronized by a single semaphore. The actual download is handled by the download_uri subroutine desctibed below.


sub download_uri
{
	my $uri = shift @_;

	print_ts "Downloading $uri\n";

	my $ua = new LWP::UserAgent;
	$ua->cookie_jar({});

	my $req = new HTTP::Request(GET => $uri);
	$req->header(
		"User-Agent" => $user_agent_string,
		"Accept" => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
		"Accept-Charset" => 'iso-8859-1,*,utf-8',
		"Accept-Language" => 'en-US'
	);

	if ($username)
	{
		$req->authorization_basic($username, $password);
	}

	my $response = $ua->request($req);

	if ($response->is_success)
	{
		print_ts "header begin-----------------------------------------\n"
			. $response->headers_as_string
			. "header end-------------------------------------------\n";

		my $is_chunked = (
			($response->header('Transfer-Encoding') && $response->header('Transfer-Encoding') =~ /Chunked/i)
				|| ($response->header('Client-Transfer-Encoding') && $response->header('Client-Transfer-Encoding') =~ /Chunked/i)
			)
					? 1 : 0;

		if ($response->header('Content-Length') && $response->header('Content-Length') == 0 && !$is_chunked)
		{
			print_ts "Content-Length is zero\n";
		}
		else
		{
			my $image_name = "tmp.bin";
			if ($uri =~ /\/([^\/]+)$/)
			{
				$image_name = $1;
			}

			if ($response->header('Content-Disposition'))
			{
				if ($response->header('Content-Disposition') =~ /filename=(.+)/i)
				{
					$image_name = $1;
					$image_name =~ s/"//g;
				}
			}

			if (!open(OUT, ">$storage_directory\\$image_name"))
			{
				print_ts "open error : $storage_directory\\$image_name\n";
				exit();
			}
			else
			{
				binmode(OUT);
				#my $fsize = $response->header('Content-Length');
				if (!defined(syswrite OUT, $response->content, length($response->content)))
				{
					print_ts "syswrite error : $storage_directory\\$image_name\n";
				}
				syswrite OUT, $response->content, length($response->content);
				close(OUT);

				print_ts "Downloaded $image_name\n";
			}
		}
	}
	else
	{
		my $st = $response->status_line;
		print_ts "Error: $uri : $st\n";
	}
}	  
	  

It instantiates an object of the LWP::UserAgent class to do HTTP negotiation and sets necessary HTTP headers for it. After the request is done, the object carries the HTTP response header and body. As the purpose of this script is to download a file off of a setver that does chunked transfer, it does check if it founds the HTTP header that indicates it, Transfer-Encoding or Client-Transfer-Encoding. In a chunked transfer, the actual file name is communicated by the Content-Disposition header, so it scans this header too. When all the necessary info are available the rest thing to do is to save the received data into a file. As the file can be a binary file, you have to call binmode to set the file handle in the binary mode in Windows. The data length can be obtained by the Content-Length header for a nomal HTTP response but the chunked transfer is the way to send a file without sending the Content-Length header. Conveniently you know the data size already by the length of the $response->content data.

That's all for the Perl downloader script. It's about 200 lines and it works, no big deal. So let's move onto the next task, porting it to Ruby.

Before writing the code, as I read Japanese as my 1st language I went to the Japanese side of the Ruby HQ, ruby-lang.org for the Ruby language/library refererence. It's known that Ruby was invented in Japan and some non-Japanese users often complain the document of Ruby is weak compared to Perl and Python in a language comparison war. But the reality is, the Japanese document for Ruby is downright terrible. At least the official manual is very sparse and unorganized. Probably mailing lists and code samples in Japanese may have beefier contents than English equivalents, but the Japanese documents at ruby-lang.org is not something you yearn for and I actually found English documents linked from the English side are more useful.

Writing a Ruby code immediately after writing a Perl code is a bit puzzling experience. In Ruby, the @ prefix means an instance variable (or an object property, in a more usual object-oriented lingo) unlike the array expression in Perl. In Ruby it's about the variable scope and in Perl it's about the data type. I like Ruby's scope which is more object-oriented and cleaner than Perl's one which is heavily dependent on its symbol table and often the source of dirty hacks. But the problem is not that. It's about braces. As I wrote at the beginning of this article I like braces and would like to use it as much as I can do. But Ruby only allows it at blocks and I often trip on it by using braces where it accepts only do...end. Another thing is, Ruby don't requires you to end a line with a semicolon, but when you put an unwanted carriage-return it just emits syntax errors and stops. So you can't hit the Enter key just to tidy up the layout. It is especially problematic when you declare a block parameter for a block which is one of the strengths of Ruby. So my advice for writing a Ruby code is; don't use braces and don't hit the Enter key too often.

Let's start examining the Ruby version. The user configuration part is at the beginning and it won't need much explanation as it's almost identical to the Perl version. Some variables are declared as instance variables of the main function as they are used not before they are declared in some methods. In the Ruby version, HTTP negotiation is handled by the standard net/http library which, fortunately again, can download chunked transfer files. thread is Ruby's intrinsic class for multithread programming. uri is the small utility class to handle a URI. Net::HTTP.version_1_2 is the pragma to instruct net/http to use the newer implementation. if $0 == __FILE__ line is an include guard but it means nothing in this demo. This script is tested with Ruby 1.8.5.


	  
# user configuration begin ################################

@username = ""
@password = ""

number_of_threads = 3
@download_interval_sec = 3

download_list_filename = "download_files.txt"

@storage_directory = 'E:\program\src\downloader\store' # set "" for the current directory

@user_agent_string = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# user configuration end ##################################

require 'net/http'
require 'thread'
require 'uri'

Net::HTTP.version_1_2

if $0 == __FILE__
	  

The following section is the Ruby versions of the functions with the same name as in the Perl version. The only difference is it uses synchronization primitives (mutex) more intuitively.


def puts_ts(x)
	@mutex_pt.synchronize do
		puts Thread.current.object_id.to_s + ": " + x
	end
end

def download_thread_func
	while true do
		u = ""
		sleep_time = 0

		@mutex_dq.synchronize do
			if @download_queue.empty?
				puts_ts "The download queue is empty"
				return
			else
				u = @download_queue.shift
				# @last_download_time is protected by @mutex_dq
				t = Time.now.gmtime.to_i
				if t < @last_download_time
					sleep_time = @last_download_time - t + @download_interval_sec
				else
					if t - @last_download_time < @download_interval_sec
						sleep_time = @download_interval_sec - (t - @last_download_time)
					end
					@last_download_time = t + sleep_time
				end
			end
		end

		if sleep_time != 0
			puts_ts "Sleeping for " + sleep_time.to_s + " seconds"
			sleep sleep_time
		end

		download_uri u
	end
end
	  

The next section is again the function with the same name as the Perl version, but its content is a little different.


def download_uri(uri)

	puts_ts "Downloading #{uri}"

	req = Net::HTTP::Get.new(URI.parse(uri).path)

	req["User-Agent"] = @user_agent_string
	req["Accept"] = 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*'
	req["Accept-Charset"] = 'iso-8859-1,*,utf-8'
	req["Accept-Language"] = 'en-US'

	req.basic_auth @username, @password unless @username.empty?

	begin
		Net::HTTP.start(URI.parse(uri).host, 80) do |http|
			http.request(req) do |res|
				h = "#{res.code} #{res.message}\nheader begin-----------------------------------------\n"

				is_chunked = false
				filename = ""
				res.canonical_each do |n, v|
					h += (n + ": " + v + "\n")
					if n =~ /Transfer-Encoding/i && v =~ /Chunked/i
						is_chunked = true
					end
					if n =~ /Content-Disposition/i && v =~ /filename=(.+)/i
						filename = $1
						filename.gsub(/^"|"$/, '')
					end
				end
				h += "header end-------------------------------------------"
				puts_ts h

				unless res.code == "200"
					return
				end

				if filename == "" && uri =~ /\/([^\/]+)$/
					filename = @storage_directory + "\\" + $1
				end

				open(filename, "wb") do |file|
					res.read_body do |str|
						file.write str
					end
				end
			end
		end
	rescue Exception => e
		puts_ts e.to_s
	end

end
	  

By giving a code block to it the Net::HTTP class can process an incoming HTTP response-body stream in multiple small chunks as the read_body function of the Net::HTTPResponse class returns. It means you don't have to store the whole received file in the memory before dumping it onto a file like the Perl version. This can be an advantage when a downloaded file is large.

The rest is the main function. I think you can see the pattern here and it needs no further explanation. << is a useful overloaded operator to push an element to the end of an array as a stack.


# main begin #############################################################
@mutex_pt = Mutex::new

if @storage_directory.empty?
	@storage_directory = Dir.pwd
end

@download_queue = []
@mutex_dq = Mutex::new
@last_download_time = 0;

File.foreach(download_list_filename) {|line|
	@download_queue << line.chomp if line.match(/^http:/i)
}

threads = []

for i in 1..number_of_threads
	threads << Thread::start do download_thread_func end
end

for i in 1..number_of_threads
	threads[i - 1].join
end

# main end ###############################################################
end	  
	  

That's all for the Ruby version.

Now how do they compare? The first thing to be noticed is the script length. The Perl version is over 200 lines and 4,890 bytes. The Ruby version is under 150 lines and 3,383 bytes. The second thing to see is the performance as they are multithreaded to squeeze extra performance in the first place. I used them to download dozens of files and the Perl version used about 15MB RAM and the Ruby version occupied only 5MB. This is probably due to the difference of downloading methods between 2 network libraries (whole download vs partial), but also due to the implementation of interpreter threads in these languages since 15MB RAM usage is a bit too high for downloading files around 100KB. The speed seemed a tie, but in some cases the Perl version was faster and the Ruby version stuttered. Maybe it's due to more file I/O and some context-switching awkwardness in the Ruby version.

My final verdict on this subject is, Ruby is the winner, contrary to some of the possible doubts I suggested above. Of course I still don't like begin...end, but if you like the way Java programs are written I guess you like Ruby too as Ruby is even nicer than Java. At least there's nothing Ruby can't do but Perl can do except for Perl-specific symbolic hacks. Why not just jump ship and switch to Ruby for good?

But this discourse doesn't end there. Ruby is certainly a new kid on the block (well not very new actually but leave it at that now) that presents great opportunities. Even then, there may be something useful in an older and proven tool. My server application, DICE, ended up with an embedded Perl interpreter instead of Ruby. DICE has already had a Common Language Runtime (CLR) embedded since years ago, but for the upcoming new version I planned to add something new to make a bullet point for the update list as it took over a year to release a new version. Though at first I wanted to embed PHP which is fairly popular for a web application because of the affinity with the Apache HTTPd, I abandoned that idea for now as I have relatively less experience with the PHP as a language and the information about PHP embedding seemed scarce on the web. (BTW I downloaded the source code of the PHP5 and browsed it, it's not much bigger than the source code of the DICE in size which was a surprise for me.) So the candidates are, as you expect, Ruby and Perl. What I want to accomplish is to embed an interpreter in the DICE and making it process a web application, but not to execute it everytime ala CGI. However, the point to include such an interpreter is to support existing applications available without writing a new code. The DICE has a CLR embedded in it to execute a web application, but it requires a user to write a stub code, or an entire application. This time I want existing CGI applications to run on the DICE. Therefore the goal is to add mod_perl or mod_ruby like capability to it. The requirements are

1. The interpreter is persistent.
2. All input from the standard input and all output to the standard output must be hooked by the DICE because stdin/stdout are the way CGI scripts interact with an internet user through a web server.

This section of this article begins with Ruby. Before writing a code for embedding Ruby, you have to learn the license of Ruby. It adopts the dual license by the GPL and the other BSD-like license. It seems you can do whatever with Ruby if you don't choose GPL, but in reality it's not that simple. The current Ruby implementation (1.8) contains the GNU regular expression library and it forces you to make your application GPLed if you just embed a Ruby interpreter as is. Actually there's a workaround and it's very simple. The current development version of Ruby (1.9) has the Oniguruma regular expressions library which is under the BSD license to avoid this license issue. If you want to stick to 1.8, you can still download the source code of Oniguruma version 2.x and integrate it onto Ruby 1.8 source codes (the instruction is in the Onigruma document). After building the whole Ruby, you can open the binary (msvcr71-ruby19.dll if you use VC++ 7.1 and Ruby 1.9) with Dependency Walker to see if it has Onigruma symbols (prefixed with Onig) and not GNU regex ones.

The first thing you have to do is to download the Ruby source code. Because of the reason I wrote above I recommend the 1.9 development nightly snapshot (snapshot.tar.gz) found in the download page of the ruby-lang.org. Then obviously you have to learn how to embed Ruby somewhere for embedding Ruby. The particular document I referred to when I wrote my sample code was magazine articles written by Shugo Maeda, the author of mod_ruby. They are found around here and here at Mr. Maeda's web site, and his VIM patch is here. They discuss how he embedded Ruby into the vi-clone editor VIM. Unfortunately they are in Japanese though they have some code samples. And this is the real horror story, I couldn't find other detailed Japanese documents for embedding Ruby. Actually the English EmbedRuby article at the Ruby Garden is probably the second best source of info on this matter. After you read the tutorial, you can download the C source code of mod_ruby to learn how a real-world application hosts a Ruby interpreter though it may be hard for those who are not familiar with how an Apache module works.

In addition to the basics of embedding Ruby, Maeda's article offers how to interrupt the standard output of Ruby and trap it to feed to VIM instead of the console screen. Ruby's outputs are all assigned in the $> special variable and its real body in the C implementation of Ruby is the object named rb_defout. By hooking the write method of this object you can trap all Ruby outputs to the standard output. But how? You can do that by the rb_define_singleton_method function defined in the class.c of the Ruby source code. It defines a "singleton method" which is a special method for a special object, which is rb_defout in this case. Also you can redefine each output method in Ruby if you will by the rb_define_global_function function. It's useful to preserve the way each output method behaves differently. If you read the source code of mod_ruby you'll notice it actually redefines all related methods in Ruby with respective hook functions. VALUE is a Ruby type which can hold any Ruby object which is something like the VARIANT type in COM, or the void* type in C.


-- class.c --
void
rb_define_singleton_method(VALUE obj, const char *name, VALUE (*func)(ANYARGS), int argc)
{
    rb_define_method(rb_singleton_class(obj), name, func, argc);
}	  

void
rb_define_method(VALUE klass, const char *name, VALUE (*func)(ANYARGS), int argc)
{
    rb_add_method(klass, rb_intern(name), NEW_CFUNC(func, argc), NOEX_PUBLIC);
}

void
rb_define_global_function(const char *name, VALUE (*func)(ANYARGS), int argc)
{
rb_define_module_function(rb_mKernel, name, func, argc);
}
-- eval.c -- void rb_add_method(VALUE klass, ID mid, NODE *node, int noex) { }

To build a C++ code with a Ruby interpreter embedded, you need Ruby header files, the Ruby static library and the Ruby dynamic link library. The header file, ruby.h, is found in the root directory of the Ruby source code archive. The static/dynamic libraries can be obtained by building the Ruby source code. To build it on Windows, open your VC++/Platform SDK build environment console and run the configure.bat in the win32 directory then type nmake. You get msvcr71-ruby19.lib for the static library and msvcr71-ruby19.dll for the dll. Note that it doesn't produce a static library for static linking. These are release builds that refer to the release version of the MS C-runtime library. I tried to build a debug version by editing the makefile by changing the compiler option /MD to /MDd but the build stopped in building one of .exe. It seemed to have created libraries at least, but I couldn't link it with an embedding C++ code in the Debug mode nor in the Release mode.

Once the Ruby library is ready you have to set up your build environment for your Ruby-embedded application. Set the root directory of the Ruby source code in your include path and set the library path in your LIB directory (In VC++ it's in Tool | Options | Projects | VC++ Directories | Library files). As the resulted executable requires msvcr71-ruby19.dll to run you should copy it in the working directory by yourself. For the C-runtime library, you have to choose the multi-threaded DLL or the multi-threaded debug DLL since Ruby's memory manager is built with the multithread version of the library. Unfortunately the library built by the default makefile has conflicts with other library in the Debug mode with the multi-threaded debug DLL (/MDd). You have to address it by putting libcmtd.lib in Linker | Input | Ignore Specific Library in the VC++ project options for the Debug mode. In the Release mode it's not required, but it still issues linker warnings LNK4049. I tried to create a project file by myself but failed to do that as the Ruby I built couldn't get the current directory as it can't access some APIs. I hope these issues with Windows are resolved in the future.

The environment is ready, let's see the actual C++ code. This sample code is written for a Ruby 1.9 development snapshot and doesn't work with Ruby 1.8 as there are significant changes in the Ruby source code in 1.9. What this code does is fairly simple, initializes a Ruby interpreter, loads and execute a Ruby script, then destroys the interpreter. It includes necessary files at the beginning. For the sake of simpleness I included the library file by the pragma, but it may be safe to do in in the project setting if your project contains many files that can conflict with it. (Actually for Perl embedding it didn't work if I did this.) In addition to the ruby.h, standard C++ headers are included for utility functions.


#pragma comment(lib, "E:\\program\\lib\\rubylib\\ruby_win32\\usr\\lib\\msvcr71-ruby19-static.lib")

extern "C" {

#include <ruby.h>

}

#include <iostream>
#include <string>

using namespace std;	  
	  

First, output hook functions are defined.


VALUE ruby_write_hook(VALUE self, VALUE str)
{
	str = rb_obj_as_string(str);
	cout << string(RSTRING_PTR(str), RSTRING_LEN(str));

	return Qnil;
}

VALUE ruby_p_hook(int argc, VALUE *argv, VALUE self)
{
	VALUE str = rb_str_new("", 0);

	for (int i = 0; i < argc; i++)
	{
		if (i > 0)
			rb_str_cat(str, ", ", 2);

		rb_str_concat(str, rb_inspect(argv[i]));
	}

	cout << RSTRING_PTR(str);

	return Qnil;
}

ruby_write_hook is a function to hook the singleton method of rb_defout as explained already. It extracts a char* pointer and its length from a Ruby string that is assigned to this special object, and outputs it by itself with cout. Qnil is the null value in the Ruby C implementation. ruby_p_hook is defined to redefine the kernel method p that is used to print the state of an object in a human readable form like ToString method in C#/Java. It create a new Ruby string by rb_str_new and enumerates its elements by adding them by rb_str_concat.

init_ruby is the function that initializes the Ruby interpreter. ruby_init_loadpath initializes the library path of Ruby. show_error_pos and show_exception_info are the functions that shows detailed error information just like what you get when you feed an erroneous Ruby script to the Ruby interpreter.


void init_ruby()
{
	ruby_init();
	ruby_init_loadpath();
}

void show_error_pos()
{
	ID this_func = rb_frame_this_func();

	if (ruby_sourcefile)
	{
		if (this_func)
		{
			cout << ruby_sourcefile << ":" << ruby_sourceline << ":in" << rb_id2name(this_func) << endl;
		}
		else
		{
			cout << ruby_sourcefile << ":" << ruby_sourceline << endl;
		}
	}
}

void show_exception_info()
{
	if (NIL_P(ruby_errinfo))
		return;

	VALUE errat = rb_funcall(ruby_errinfo, rb_intern("backtrace"), 0);
	if (!NIL_P(errat))
	{
		VALUE mesg = (RARRAY_PTR(errat))[0];

		if (NIL_P(mesg))
		{
			show_error_pos();
		}
		else
		{
			cout << string(RSTRING_PTR(mesg), RSTRING_LEN(mesg));
		}
	}

	VALUE eclass = CLASS_OF(ruby_errinfo);

	char* einfo;
	int elen;
	int state;
	VALUE estr = rb_protect(rb_obj_as_string, ruby_errinfo, &state);
	if (state)
	{
		einfo = "";
		elen = 0;
	}
	else
	{
		einfo = RSTRING_PTR(estr);
		elen = RSTRING_LEN(estr);
	}

	if (eclass == rb_eRuntimeError && elen == 0)
	{
		cout << ": unhandled exception" << endl;
	}
	else
	{
		VALUE epath;

		epath = rb_class_path(eclass);
		if (elen == 0)
		{
			cout << ": " << string(RSTRING_PTR(epath), RSTRING_LEN(epath)) << endl;
		}
		else
		{
			char* tail  = 0;
			int len = elen;

			if ((RSTRING_PTR(epath))[0] == '#')
				epath = 0;

			if (tail = strchr(einfo, '\n'))
			{
				len = tail - einfo;
				tail++;
			}

			cout << ": " << string(einfo, len);
			if (epath)
			{
				cout << " (" << string(RSTRING_PTR(epath), RSTRING_LEN(epath)) << endl;
			}

			if (tail)
			{
				cout << string(tail, elen - len - 1) << endl;
			}
		}
	}

	if (!NIL_P(errat))
	{
		const int TRACE_HEAD = 8;
		const int TRACE_TAIL = 5;
		const int TRACE_MAX = TRACE_HEAD + TRACE_TAIL + 5;

		RArray* ep = RARRAY(errat);

		long len =  RARRAY_LEN(errat);
		for (int i = 1; i < len; ++i)
		{
			if (TYPE((RARRAY_PTR(errat))[i]) == T_STRING)
			{
				cout << "  from " << string(RSTRING_PTR((RARRAY_PTR(errat))[i]), RSTRING_LEN((RARRAY_PTR(errat))[i])) << endl;
			}

			if (i == TRACE_HEAD && len > TRACE_MAX)
			{
				cout << "   ... " << len - TRACE_HEAD - TRACE_TAIL << "ld levels..." << endl;
				i = len - TRACE_TAIL;
			}
		}
	}
}

The last part is the function that loads and executes a Ruby script (execute_ruby) and the main function. It registers output hook functions in the Ruby interpreter by rb_define_singleton_method and rb_define_global_function.


void execute_ruby(const char* pScriptName)
{
	int state = 0;

	extern VALUE rb_defout;

	typedef VALUE (*rubyfunc)(...);

	rb_defout = rb_obj_alloc(rb_cObject);
	rb_define_singleton_method(rb_defout, "write", (rubyfunc)ruby_write_hook, 1);
	rb_define_global_function("p", (rubyfunc)ruby_p_hook, -1);

	// Always need a full path
	rb_load_protect(rb_str_new2(pScriptName), 0, &state);
	if (state)
	{
		switch (state)
		{
		case 0x1: // TAG_RETURN
			cout << "unexpected return" << endl;
			show_error_pos();
			break;
		case 0x2: // TAG_BREAK
			cout << "unexpected break" << endl;
			show_error_pos();
			break;
		case 0x3: // TAG_NEXT
			cout << "unexpected next" << endl;
			show_error_pos();
			break;
		case 0x4: // TAG_RETRY
			cout << "retry outside of rescue clause" << endl;
			show_error_pos();
			break;
		case 0x5: // TAG_REDO
			cout << "unexpected redo" << endl;
			show_error_pos();
			break;
		case 0x6: // TAG_RAISE
		case 0x8: // TAG_FATAL
			show_exception_info();
			break;
		default:
			cout << "unknown longjmp status " << state << endl;
			break;
		}
	}

	rb_gc();
}

int _tmain(int argc, _TCHAR* argv[])
{
	init_ruby();

	execute_ruby("test.rb");

	ruby_finalize();

	return 0;
}

The rb_load_protect function is the protected version of the rb_load function and it doesn't cause a segmentation fault even when loading fails. Unless there's a special reason you should always use the protected version of Ruby functions. After evaluation of a script is done, it shows error information if any. rb_gc is called at the end to invoke the garbage collector.

When this program is executed it loads the "test.rb" Ruby script and executes it. I tested it with a simple script that does print some strings and apparently it could output characters through the redefined hook functions. I thought that my experiment was a success and I'd move onto writing the real code for embedding Ruby. But I did one more test. It was to feed the Ruby downloader script which I described above to this test code. To my surprise, it halted with the "memory error" in the Ruby interpreter. I have no idea what's wrong, but apparently the culprit was the net/http library. As the library path is in the $LOAD_PATH variable it's not a loading error. At this point I had no idea what happned and wanted to debug it, but the Ruby library is compiled without debug symbols as explained at the beginning of this section. I was really disappointed in this result as I thought the experiment was almost a success. Eventually I abandoned the idea of embedding Ruby into my application. It is known that the Ruby author himself admits that the weakness of the current Ruby implementation is in embedding because it can't have multiple interpreters side by side, but I didn't know there was such a basic problem. It's expected the next version of the Ruby VM addresses some of the issues in embedding, but apparently it's far off in 2008. There may be some workaround to this if I look more carefully into the mod_ruby source code, but I lost my energy for further research for the time being. If you know a solution please email me, thanks in advance.

The last hope is naturally on Perl. First you need the source code of Perl. I used a Perl development snapshot available at here but snapshots of the development trunk in this directory often fail to build or pass the test. After looking at the latest version number of the stable release at Perl.com , you may pick up the latest stable snapshot in another directory. You can find exactly which snapshot is broken at the perl.perl5.porters newsgroup.

To build it on Windows, edit the Makefile in the win32 direcrtory (in most cases just editing the install path and the compiler selection are enough). Then open a VC++/Platform SDK build environment console and type nmake to stard the building process. After it's done nmake test to test it and type nmake install to build and install all Perl libraries and the Perl interpreter into the install path you specified in the Makefile.

The overall information on how to embed a Perl interpreter into your C/C++ application is available in the perlembed document. However, I recommend you to read perlgut and perlcall before going to perlembed. Also perlapi may be useful for a reference. perlgut explains how the C implementation of Perl represents Perl's data types and subroutines in C. perlcall explains how to call Perl subroutines from C. If you still have questions after reading these documents, you can always grep the Perl source code for functions, function macros and their notes.

For an actual embedding code sample, mod_perl is probably the most complete. If you read mod_perl.c in the src\modules\perl directory of mod_perl 2, you see the modperl_response_handler_cgi function is the main functon that executes a Perl CGI script. The basic flow of this function is

1. Setup environmental variables to pass to Perl
2. Override Perl's standard input and standard output
3. Process an HTTP request to a Perl CGI
4. Restore everything back

Basically it's in line with what I wrote as what I wanted to do earlier in this article. Apache does most of these things by directly manipulating Perl by a C code. But I don't need as much security as the Apache module does. Right now the DICE is a server to host its owner's code and nothing else and don't have to watch naughty users. Also I don't intend to run multiple interpreters for now as the context switching method by PERL_SET_CONTEXT in the current Perl implementation sucks. It's based on heavy use of macro and it forces me to use some nasty macro hacks. Even with a single interpreter, it's not a fun to integrate a C++ code that has a Perl interpreter in a larger project. The Perl source code predetermines the name of the interpreter object as "my_perl" and predefined macros can't take other names, so you have to assign your a Perl interpreter pointer to a local variable PerlInterpreter* my_perl everytime you want to do something with it. Also the perl header file redefines many symbols for its own API, which means if your C++ class has a member function named write it conflicts with Perl. So my advice is, don't include perl.h in a header file that is included by many other files in your project. Include it at the beginning of a .cpp source code file and put all Perl-related things in it. But how can you define a class that has a pointer to a Perl interpreter as its property? You can put a char* or something equivalent in the place of the pointer to an interpreter in your class declaration and cast it to PerlInterpreter* everytime you use it in the .cpp source file.

I decided to offload most of the sandbox work to Perl to keep the C++ part simple. The basic design is like this:

1. The Perl interpreter in the DICE loads a setup Perl script which does STDIN/STDOUT hooks and has other necessary utilities for sandboxing a Perl script
2. Whenever a CGI script is called, if it's the first time the Perl interpreter evaluates it as a subroutine in a unique package by using eval in Perl and caches the compiled code. If it is an already compiled code in the cache, it just returns a reference to it.
3. The Perl interpreter executes the compiled Perl code by dereferencing the subroutine reference and the C++ part gets the output by reading a Perl variable that has all Perl output. The C++ part is protected and synchronized by a critical section and you don't have to worry about this Perl variable is destroyed by others (though there's certain performance penalty)

Let's take a look at the Perl setup script (EmbeddedPerlSandbox.pl). This script is the one actually used in the DICE but a newer version may be included in actual package, to see how it works you should download the DICE and run its web server by yourself. First, it defines the package EmbeddedPerlSandboxOut that is used to intercept and overrides the standard out (STDOUT in Perl). The required Perl version is specified as 5.8.0, but I'm not sure if this script works for a lower version of Perl or not.


# EmbeddedPerlSandbox for DICE
# (c) RyuK 2006 All Rights Reserved
#
# klassphere[at.mark]gmail.com
# http://aiueo.da.ru/
# http://zzz.zggg.com/
#
# This file is not redistributable.
#

use 5.8.0;

package EmbeddedPerlSandboxOut;

sub TIEHANDLE
{
	my $classname = shift;
	my $buffer = "";
	bless \$buffer, $classname;
}

sub PRINT
{
	my $buffer = shift;
	my $s = shift;
	$$buffer .= $s;
}

sub PRINTF
{
	my $buffer = shift;
	my $s = shift;
	$$buffer .= sprintf($s, @_);
}

sub WRITE
{
	my ($buffer, $data, $len, $offset) = @_;

	if (!defined($len))
	{
		$len = length($data);
	}

	if (!defined($offset))
	{
		$offset = 0;
	}

	$$buffer .= substr($data, $offset, $len);
}

sub READLINE
{
	my $buffer = shift;
	return $$buffer;
}

sub BINMODE
{
	# does nothing
	return 1;
}

sub CLOSE
{
	my $buffer = shift;
	undef $buffer;
}

sub DESTROY
{
	my $buffer = shift;
	undef $buffer;
}  
	  

When a Perl script is executed, this class is associated with STDOUT by the tie Perl function which calls the TIEHANDLE subroutine. It creates an object which is just a scalar value that works as a buffer to hold the content of the standard output. After that, if print tries to write something onto STDOUT the PRINT function in this class is called. READLINE is called by <>.

The next section defines EmbeddedPerlSandboxIn to override the standard input in the same way.


package EmbeddedPerlSandboxIn;

sub TIEHANDLE
{ 
	my $classname = shift;
	my $buffer = shift;
	bless \$buffer, $classname;
}

sub READLINE
{
	my $buffer = shift;

	# substr EXPR,OFFSET,LENGTH,REPLACEMENT
	return (length($$buffer) ? substr(
		$$buffer,
		0,
		(defined($/) ? index($$buffer, "$/"): length($$buffer) - 1) + 1,
		""
	)
		: undef);
}

sub READ
{
	my ($buffer, $len, $offset) = ($_[0], $_[2], $_[3]);

	if (!defined($offset))
	{
		$offset = 0;
	}

	if ($len > length($$buffer))
	{
		$len = length($$buffer);
	}

	# You can use the substr() function as an lvalue
	substr($_[1], $offset, $len) = substr($$buffer, 0, $len, "");

	return $len;
}

sub GETC
{
	my $buffer = shift;
	return (length($$buffer) ? substr(
		$$buffer,
		0,
		1,
		""
	)
		: undef);
}

sub BINMODE
{
	# does nothing
	return 1;
}

sub CLOSE
{
	my $buffer = shift;
	undef $buffer;
}

sub DESTROY
{
	my $buffer = shift;
	undef $buffer;
}

The TIEHANDLE function receives a scalar value as the content of the standard input when it's associated with STDIN with tie and holds it. For READ, the second parameter ($_[1]) is the reference (not a "reference" in the Perl, but in the broader computer science terminology) to the receiving variable as Perl subroutines are call-by-reference.

From here on the main class EmbeddedPerlSandbox is defined.


package EmbeddedPerlSandbox;

$SANDBOX_OUTPUT = "";

%CODE_STORE = {};

BEGIN
{
	push @INC, Win32::GetCwd() . "\\perl_lib";

	*CORE::GLOBAL::exit = \&EmbeddedPerlSandbox::exit;
	*CORE::GLOBAL::flock = \&EmbeddedPerlSandbox::flock;
}

$sandbox_output is the global package variable that holds the standard output. The C++ part retrieves the output of a Perl user script by referencing this variable later. %CODE_STORE is a cache that holds compiled Perl scripts with its unique package name as a key. The BEGIN block evaluated when this setup script is loaded by the C++ code. @INC is the array of library paths it adds the "perl_lib" directory under the current directory to it. To obtain the current directory it uses Win32::GetCwd which is a function in a builtin class in the Windows version of Perl without loading the Cwd module. Also it overrides the exit and the flock builtin functions by assigning subroutine references to their type globs. The exit function ends the Perl interpreter by calling exit() in C, which means not only the Perl interpreter but also the entire application just ends if it's called. Such a situation must be avoided for an obvious reason. The flock function should work in theory even in the Windows version, but when I tested it it stopped execution there by an unknown reason. As I explained already the C++ section is synchronized by a critical section, therefore the Perl section is synchronized too. I decided to trap flock and substitute it with a functon that does nothing.

The next subroutine is necessary to make a user script behave like a Perl CGI. If a persistent interpreter just executes the same compiled code 2 times, all global variables persist in the second execution. For example,


if (!defined($var))
{
	$var = 1;
}

print "$var\n";

$var++;

if this code is executed repeatedly in a persistent interpreter the displayed number continues to increase. This behavior is OK for a Java servlet and other persistent web applications, but not desirable for emulating a Perl CGI. To stop it, it has to clear global package variables previously defined in the package before executing a user code. This cleanupSymbolTable subroutine does it by scanning the symbol table hash (also called "stash" in the Perl jargon) of the user code package. In the Perl language, the reflection system is very easily available in the form of the symbol table and it's almost too exposed if you ask me. Though I don't like it as it can be a source of nasty Perl hacks, the reflection itself is certainly a necessary feature when creating a plugin system or a self-contained virtual world (Matrix, anyone?).

 
sub cleanupSymbolTable
{
	my $id = shift;

	while (my ($name, $glob_entry) = each %{$id . "::"})
	{
		local *v = $glob_entry;
		if (defined($v))
		{
			undef $v;
		}

		if (defined(@v))
		{
			undef @v;
		}

		if (defined(%v))
		{
			undef %v;
		}
	}
}

$id is the name of the user code package. It enumerates all symbols in that package and receives a typeglob for its symbol (See the "Symbol Tables" in perlmod). Then if there is a variable (scalar/array/hash) defined within that typeglob it's undefined. Note that it tests all value types as there may be a typeglob that has variables of all the 3 types defined.

The next subroutine compiles a user code as the name suggests.

 	  
sub compile
{
	my ($id, $content, $current_dir) = @_;

	if (!length($content))
	{
		return $CODE_STORE{$id};
	}

	local $SIG{__WARN__} = \&warn;

	# the part undefining global variables must come before $content,
	# otherwise $content can't contain 'use strict'
	my $sandbox =<< "SBOX";
package $id;
sub __wrapper
{
	EmbeddedPerlSandbox::cleanupSymbolTable("$id");
	\@ARGV = \@_;
	$content ;
}
SBOX

	my @tmp = @INC;
	push @INC, $current_dir;
	eval $sandbox;
	@INC = @tmp;

	if ($@)
	{
		my $e = $@;
		if ($e !~ /via package/ && $e =~ /line (\d+)/)
		{
			my $n = $1;
			my $m = $n - 6;
			$e =~ s/line $n/line $m/g;
		}

		$e =~ s/\(eval \d+\) //g;

		$SANDBOX_OUTPUT = "Content-Type: text/plain\n\nPerl Compilation Error: " . $e . "\n";
	}

	return $CODE_STORE{$id} = *{$id . "::__wrapper"}{CODE}; # CODE reference - see perlsub
}

$id is the unique package name for this script. It has to be a unique name not to conflict with other scripts compiled later in this interpreter instance. $content is the actual Perl CGI script content. If it's empty it just returns the already compiled code for this unique package name. In the actual implementation in the DICE the C++ part checks the file time of a script to see if it has to be reloaded or not, but you can implement it in Perl too if you want. Though it may not be necessary the __WARN__ signal is trapped with the warn subroutine to suppress a potential interrupt. The $sandbox here document is the code that wraps the usercode into a unique package that has only a wrapper subroutine code in it. The user code is embedded in this wrapper function and placed after the symbol table cleaner function that is explained above. In this "here document" you have to add "\" to a variable you don't want to be evaluated at the time of the definition of $sandbox. $current_dir has the current directory for this script. Since the use function in Perl is evaluated in the compile time you have to add the current directory particular to this script to @INC here and restores after the evaluation of the user code by eval. If the script has an error, $@ has the error information. When the evaluation is done, the reference to the compiled code can be obtained by getting the "CODE" entry in the subroutine's typeglob.

These subroutines are just utility functions. clearCodeStore is a function that clears the code store as the name suggests. When you run a persistent interpreter you can call this function to reclaim unused resources. addRuntimeError is an error reporting function in executing a user code.


sub clearCodeStore
{
	%CODE_STORE = ();
}

sub addRuntimeError
{
	my $e = shift;
	if (length($SANDBOX_OUTPUT))
	{
		if ($SANDBOX_OUTPUT !~ /Perl Runtime Error/)
		{
			$SANDBOX_OUTPUT .= "Perl Runtime Error: " . $e . "\n";
		}
	}
	else
	{
		$SANDBOX_OUTPUT = "Content-Type: text/plain\n\nPerl Runtime Error: " . $e . "\n";
	}
}	  
	  

The execute subroutine executes a compiled user code.


sub execute
{
	my ($id, $compiled, $stdin, $env, $current_dir, $clear_codestore) = @_;

	local $SIG{__WARN__} = \&warn;

	if (!chdir($current_dir))
	{
		addRuntimeError("chdir: " . $current_dir);
		return;
	}

	my $oh = tie(*STDOUT, "EmbeddedPerlSandboxOut");
	my $ih = tie(*STDIN, "EmbeddedPerlSandboxIn", $stdin);

	my %env_vars = ();
	foreach my $kv (split(/\n/, $env))
	{
		if ($kv =~ /([^=]+)=(.+)/)
		{
			$env_vars{$1} = $2;
		}
	}

	%main::ENV = %env_vars;

	my @tmp = @INC;
	push @INC, $current_dir;
	eval {$compiled->();}; # you need this semicolon...
	@INC = @tmp;

	if ($@)
	{
		unless ($@ =~ /EmbeddedPerlSandbox::exit/)
		{
			my $e = $@;
			if ($e !~ /via package/ && $e =~ /line (\d+)/)
			{
				my $n = $1;
				my $m = $n - 6;
				$e =~ s/line $n/line $m/g;
			}

			$e =~ s/\(eval \d+\) //g;

			addRuntimeError($e);
		}
	}

	undef $ih;
	untie *STDIN;

	if (!$@)
	{
		$SANDBOX_OUTPUT = <STDOUT>;
	}

	undef $oh;
	untie *STDOUT;

	if ($clear_codestore)
	{
		clearCodeStore();
	}
}	  
	  

First it changes the current directory to the directory of the user script by calling chdir. I tried to move the current directory in the C++ side by the SetCurrentDirectory Windows API and the PerlDir_chdir Perl C API, however for some reason the Perl side was not affected by them. Then it overrides the standard input and the standard output by using tie and the EmbeddedPerlSandboxIn and EmbeddedPerlSandboxOut classes as explained already. Unfortunately it seems not to be able to override the implicit STDIN expression in <> so it's required to specify <STDIN> in a user script. $stdin passes the content of the standard input which is the input by a web user in the case of a web server. $env has the content of environmental variables for CGI. As it's passed in a scalar value delimited by '\n' and '=' it's parsed and stored in the %main::ENV hash to which a CGI code refers. $compiled is the code reference and it's invoked by $compiled->() subroutine expression in the eval block. After it's done the content of the standard output is saved in $SANDBOX_OUTPUT then the standard input and the standard output are restored.

The last section is for the hook functions for these dangerous global functions.


sub exit
{
	$SANDBOX_OUTPUT = <STDOUT>;
	die("EmbeddedPerlSandbox::exit");
}

sub flock
{
	# does nothing
	return 1;
}

sub warn
{
	die("Perl Warning: " . $_[0] . "\n");
}

1;

The exit function saves the content of the standard output in $SANDBOX_OUTPUT and stops execution by the die function. As the die function can be trapped by eval, the exit function can be emulated this way. flock does nothing and warn is almost the same with exit. A Perl module has to return 1 at the end.

That's all for the Perl side (EmbeddedPerlSandbox.pl). Let's check out the C++ sample code that drives a Perl interpreter. To build it you have to add the Perl source code directory to the header include directories of your project and have to add the perl59.lib static library to your linker input. The resultant executable requires perl59.dll to run.

   
#include <string>
#include <cstdio>

using namespace std;

#include <EXTERN.h>
#include <perl.h>

EXTERN_C void xs_init (pTHX);
EXTERN_C void boot_DynaLoader (pTHX_ CV* cv);

EXTERN_C void xs_init(pTHX)
{
	char *file = __FILE__;
	dXSUB_SYS;

	/* DynaLoader is a special case */
	newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
}
	  

The xs_init glue code is just what you get by the ExtUtils::Embed tool (See "Using Perl modules, which themselves use C libraries, from your C program" in perlembed).


int _tmain(int argc, _TCHAR* argv[])
{	
	int argc_perl = 0;
	char* embedding[] = {"", "E:\\program\\src\\perlembedtest2\\EmbeddedPerlSandbox.pl"};

	PERL_SYS_INIT(&argc_perl, (char***)embedding);

	PerlInterpreter* my_perl = perl_alloc();
	perl_construct(my_perl);

	PL_exit_flags |= PERL_EXIT_DESTRUCT_END;

	perl_parse(my_perl, xs_init, 2, embedding, NULL);
	perl_run(my_perl);

The embedding char pointer array is the argument list passed to the Perl interpreter. In the second parameter it set the path of the EmbeddedPerlSandbox.pl. Then it creates a Perl interpreter and runs it.

The next part calls the EmbeddedPerlSandbox::compile subroutine in the Perl side to compile a Perl script. The unique package name for this script is in id and the script content is in content for the sake of this sample code.


	const char* id = "test";
	string content(
		"my $data = ''; read(STDIN, $data, 2); print \"$data\n\"; print getc(STDIN) . \"\n\";"
		"while (<STDIN>) {print ($i++ . ': ' . $_);}"
		"open FH, '>log.txt';print FH 'test'; close(FH);"
	);

	dSP;

	ENTER;
	SAVETMPS;
	PUSHMARK(SP);

	XPUSHs(sv_2mortal(newSVpv(id, 0)));
	// don't use 0 as the second argument since Perl uses strlen for 0
	XPUSHs(sv_2mortal(newSVpv(content.c_str(), content.size())));

	PUTBACK;

	call_pv("EmbeddedPerlSandbox::compile", G_EVAL);

	SPAGAIN;

	SV* sandbox = 0;
	if (SvTRUE(ERRSV))
	{
		POPs; // see perlcall for G_EVAL

		printf("ERRSV\n");
		printf(SvPVX(ERRSV));
		return 0;
	}
	else
	{
		sandbox = newSVsv(POPs);
	}

	PUTBACK;
	FREETMPS;
	LEAVE;
	  

It pushs two strings id and content in a locally copied Perl stack. The newSVpv function creates a new Perl scalar value for a string (pv: pointer value). If you set 0 for the second parameter it uses C strlen() to calculate the length of the string pointer by a char* pointer. Since a Perl script can contain a null character (though unlikely) this sample gives the length of the script explicitly by using C++ basic_string for content. When the parameters are ready, it puts the stack back to Perl by the PUTBACK macro and calls the subroutine by call_pv. See the details for Perl stack manipulation in perlcall. If call_pv is successful it pops the return value from the stack and assigns it to a new Perl scalar value sandbox that holds a compiled Perl code reference.

In this section it actually execute the compiled Perl code. in is a C++ basic_string object that has an emulated standard input. env is supposed to have environmental variables, but this sample code omits it. new_dir is the current directory for this Perl code. As the sample Perl script fiddles with a file, it is created in this directory. Though the EmbeddedPerlSandbox::execute in Perl takes 6 parameters, the 6th parameter is omitted again for this sample. It calls EmbeddedPerlSandbox::execute with 5 arguments with no return value expected (G_VOID).

  
	string in("abcHello world\nI think\nTherefore I am\n");

	string env;

	string new_dir("E:\\program\\src\\perlembedtest2\\Debug");

	ENTER;
	SAVETMPS;
	PUSHMARK(SP);

	XPUSHs(sv_2mortal(newSVpv(id, 0)));
	XPUSHs(sandbox);
	XPUSHs(sv_2mortal(newSVpv(in.c_str(), in.size())));
	XPUSHs(sv_2mortal(newSVpv(env.c_str(), env.size())));
	XPUSHs(sv_2mortal(newSVpv(new_dir.c_str(), new_dir.size())));

	PUTBACK;

	call_pv("EmbeddedPerlSandbox::execute", G_VOID);

	SPAGAIN;

	STRLEN n_a;
	const char* output = SvPV(get_sv("EmbeddedPerlSandbox::SANDBOX_OUTPUT", FALSE), n_a);
	int len = (int)n_a;
	printf("returned: %d bytes\n", len);

	// it can contain a null character
	for (int i = 0; i < len; ++i)
	{
		if (i == 0)
			printf("[%c", output[i]);
		else
			printf("%c", output[i]);
	}
	if (len != 0)
		printf("]\n");

	PUTBACK;
	FREETMPS;
	LEAVE;	  
	  	  

After EmbeddedPerlSandbox::execute is executed, it retrieves the Perl scalar value of EmbeddedPerlSandbox::SANDBOX_OUTPUT by calling get_sv and converting it into a char* by SvPV. output is the string that holds the standard output of the Perl script.


	PL_perl_destruct_level = 0;

	perl_destruct(my_perl);
	perl_free(my_perl);
	PERL_SYS_TERM();

	return 0;
}

When the work is done it frees the Perl interpreter. It's the end of the C++ sample code. This works flawlessly including using Perl modules, and I implemented it in the DICE with some additional attention for a persistent interpreter.

The winner for this round is Perl. But it's not free from a problem, for example its header file redefines common API names with C macros and are not include-friendly. Also it's not designed to host multiple interpreters from the beginning and plagued with excessive C macro use. I hope these issues can be solved in Perl 6 with a new VM (Parrot), just like Ruby's new VM (YARV) plans. If Perl 6 becomes the mainstream this article has to be rewritten since the embedding method should be fairly different from the current one, though I somehow doubt the likelihood that Perl 6 becomes too popular. One thing for sure is I'm satisfied with the relative robustness of Perl for now.