I’ve been determined to find a reason to use Node.js in a project since, as the adage goes, you have to start somewhere. What I came up with is something that I think I’ll actually use in future projects.
I’ve often wished there were an easy way to give clients access to rake tasks, but in the browser. Rake tasks are great because you can provide real time feedback in the terminal to show the user what’s going on as it’s happening. Node.js along with Socket.io make creating this experience in the browser really easy.
This project is a Rails 3.1 engine, so it’s super easy to hook up in your Rails application. Aside from using Rails 3.1+, you need to have Node.js and Socket.io installed. So to get this running:
In your Gemfile:
gem 'rake_ui', :git => "git://github.com/rbrant/rake_ui.git"
In your routes.rb file:
Rails.application.routes.draw do
  mount RakeUi::Engine => "/rake_ui"
end
Then you need to start the Node server. From your app’s root:
rake app:start_node_server
Once your Rails app is started, visit /rake_ui to see your rake tasks. All stdout generated in your task via something like:
$stdout.puts "hey now"
will be displayed in the black window to the right of the Rake task command listing.
The way it works is that when the app is initialized, all the available rake tasks are stored in memory along with an ID key. The index page displays all the rake tasks and identifies them by their ID. When a task is selected, it gets looked up by its ID and then called via:
Kernel.system("#{@rake_task.command} #{@rake_task.arguments} &> #{Rails.root}/log/rake.log")
As you can see, stdout is sent to a log file. This log file gets tailed by Node and hooked up to the browser via Socket.io. Very cool, eh? I need to find a real node project..
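Here’s a minimal sketch of that flow in plain Ruby, not the engine’s actual code. The RakeTask struct, build_registry and run_task are hypothetical names, and echo stands in for a real rake command. One detail worth knowing: &> is a bash shorthand, so the sketch uses the portable > file 2>&1 form, which behaves the same when /bin/sh is not bash.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the engine's task record; the real class may differ.
RakeTask = Struct.new(:id, :command, :arguments)

# Store every command in memory under an integer ID, as the index page expects.
def build_registry(commands)
  commands.each_with_index.map do |command, index|
    id = index + 1
    [id, RakeTask.new(id, command, '')]
  end.to_h
end

# Run the selected task, sending stdout and stderr to the log file.
# "> file 2>&1" is used instead of "&>" so it also works when /bin/sh is not bash.
def run_task(task, log_path)
  Kernel.system("#{task.command} #{task.arguments} > #{log_path} 2>&1")
end

registry = build_registry(['echo hey now']) # echo stands in for a rake command
log_path = File.join(Dir.mktmpdir, 'rake.log')
run_task(registry[1], log_path)
puts File.read(log_path)
```

Whatever tails the log (Node, in this project) only ever sees the file, so the Rails side doesn’t need to know anything about sockets.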
To make this engine truly useful, I’ll probably add a way to blacklist certain tasks so they aren’t available in the UI. Probably not a good idea to allow your clients to drop their database. I also need to scrub the optional arguments.
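A rough sketch of what the blacklist and the argument scrubbing could look like, assuming task names are plain strings and arguments arrive as one raw string. BLACKLIST, allowed_tasks and scrub_arguments are hypothetical names, not part of the engine.

```ruby
require 'shellwords'

# Hypothetical blacklist; these patterns are examples only.
BLACKLIST = [/\Adb:drop/, /\Adb:reset/].freeze

# Drop any task whose name matches a blacklisted pattern.
def allowed_tasks(task_names)
  task_names.reject { |name| BLACKLIST.any? { |pattern| name =~ pattern } }
end

# Split the raw string into words, then escape each word so shell
# metacharacters such as ";" reach the task as literal text.
def scrub_arguments(raw)
  Shellwords.split(raw).map { |arg| Shellwords.escape(arg) }.join(' ')
end

puts allowed_tasks(%w[db:drop db:migrate stats]).inspect
puts scrub_arguments('RAILS_ENV=production; rm -rf /')
```

Shellwords.split raises ArgumentError on unbalanced quotes, so a real controller would want to rescue that and reject the request.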
The project is available on Github:
The problem:
You have a large number of records that need to be modified. You don’t have the processing resources to accomplish this as quickly as you’d like.
The solution:
Break the data into chunked csv files, with each file containing a certain number of records. Spin up a bunch of EC2 instances, each of which processes one file on startup. This lets you run a large number of processes concurrently to get the job done quickly. 100 ‘micro’ instances all running at the same time will cost $10 per hour ($.10/hour each). That’s not bad!
Creating the chunked files
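file_num = 1
Thing.all.in_groups_of(2500, false) do |thing_group|
  csv_data = FasterCSV.generate do |csv|
    thing_group.each do |thing|
      csv << [thing.id, thing.name] # placeholder columns; the original row contents were lost
    end
  end
  File.open("chunk_#{file_num}.csv", 'w') { |f| f.write(csv_data) } # hypothetical file name
  file_num += 1
end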
Configure an EC2 instance to serve as the source instance
Configure an EC2 instance from which you will create an AMI to be used to spawn the other worker instances. The important part is to make sure this instance has a copy of all the chunked files created above. I'll explain why later. I used an AMI from bitnami; this one, to be specific (listed toward the bottom of the page). It's an EBS-backed AMI, which makes creating AMIs from running instances easier than from S3-backed ones. This may have changed; I'm not sure.
Use cron to run the script at start up
Create a cron task set to run when the instance is booted. You can use the @reboot shortcut to accomplish this:
@reboot /path/to/ruby /path/to/your/script
The goal here is to have your script run when the server boots. This would be pretty straightforward; however, we need to be sure the EC2 instances don't process the same file at once, or process the same file twice. Enter SQS.
SQS
SQS basically allows you to create a simple queue that holds messages. For us, the messages are the names of the files we want to process. When the server boots and the script runs, it hits the SQS service and asks for the name of a file. SQS responds with a message off the queue (which is the name of a file). That message is locked so it won't be handed out in response to any other request. Once the message is retrieved, you can pop it off the queue. Each spawned instance is guaranteed to get its own filename.
right_aws is an awesome wrapper for AWS. The code to create the queue looks like this:
sqs = RightAws::SqsGen2.new(aws_key, aws_secret)
queue = RightAws::SqsGen2::Queue.create(sqs, 'chunk_queue')
Populate the queue with your filenames
queue.push 'your_filename'
Processing the file
To get a filename from the queue:
chunked_filename = queue.pop.to_s
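To see the hand-out behavior without touching AWS, here is a small simulation. It uses Ruby's built-in thread-safe Queue as a stand-in for SQS, not right_aws, and fill_queue and claim_file are hypothetical helper names. Real SQS hides a received message for a visibility timeout rather than locking it forever, so treat this as a sketch of the idea only.

```ruby
require 'thread'

# Stand-in for the SQS queue: Ruby's thread-safe Queue, not right_aws.
def fill_queue(filenames)
  queue = Queue.new
  filenames.each { |name| queue.push(name) }
  queue
end

# Each simulated worker claims one filename, or nil when the queue is empty.
def claim_file(queue)
  queue.pop(true) # non-blocking pop raises ThreadError on an empty queue
rescue ThreadError
  nil
end

filenames = (1..5).map { |n| "chunk_#{n}.csv" }
queue = fill_queue(filenames)

# Five "instances" boot at once and each asks for a file.
claimed = Array.new(5) { Thread.new { claim_file(queue) } }.map(&:value)
puts claimed.sort.inspect
```

Every filename is handed out exactly once, and a sixth worker would get nil, which is the cue for a late instance to shut itself down.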
Now we have the file with CSV data to process. Loop through it, hitting whatever service you need to.
Push your modified data back to S3
When you are finished processing your file, you'll want to put the results somewhere. S3 is an obvious choice. Again with right_aws, it's pretty easy. This will basically write your data back to a file on S3:
s3 = RightAws::S3.new(aws_key, aws_secret)
bucket = s3.bucket(s3_bucket_name)
key = RightAws::S3::Key.create(bucket, "#{s3_bucket_key}#{chunked_filename}")
key.put(your_data)
Altogether now, this is some pseudo code (not tested) that addresses the core parts of the process. This is the script that would be executed at start up:
require 'rubygems'
require 'right_aws'
require 'fastercsv'
# aws
aws_key = 'your key'
aws_secret = 'your secret'
# s3 bucket
s3_bucket_name = 'your_bucket'
s3_bucket_key = 'name_of_key'
sqs = RightAws::SqsGen2.new(aws_key, aws_secret)
queue = RightAws::SqsGen2::Queue.create(sqs, 'chunk_queue')
# pop a file name off the queue
chunked_filename = queue.pop.to_s
modified_data = []
# process the filename SQS gave us (a copy of all files is on each instance)
file_to_process = "#{File.expand_path(File.dirname(__FILE__))}/#{chunked_filename}"
FasterCSV.foreach(file_to_process) do |row|
  data_for_api = "#{row[0]}, #{row[1]}, #{row[2]} #{row[3]}"
  results = data_for_api # placeholder: hit your web service or do what you need to here
  # use api results
  modified_data << results
end
So I’m all set to take advantage of Amazon Web Services’ new ‘micro’ instance type on EC2 (yup, another client moving to EC2). I need to manage it via the elasticfox Firefox extension, but the micro instance type isn’t an option yet in the new instance dialog.
And you thought all the clever caching names were taken.
ActsAsCachola is a plugin that lets you cache any class method by simply prepending ‘cachola_’ to the method name when calling it. Here’s how it works:
Given the following model:
class Internet
Now you can call the method, ‘cachola_get_a_million_numbers,’ and the return value of ‘get_a_million_numbers’ will be cached automatically.
Note that if the method accepts arguments, each unique call will have its own key in the cache. For example:
class Internet
Calling Internet.cachola_get_numbers(100) and Internet.cachola_get_numbers(500) will result in two keys (with different values) stored in the cache.
The cached method is then expired automatically when the class in which the plugin has been included is saved or destroyed. It’s restored to the cache the next time it’s called.
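To make the idea concrete, here is a minimal sketch of prefix-based caching in plain Ruby. It is not the plugin’s code: PrefixCache, STORE and expire_cachola! are hypothetical names, a Hash stands in for the Rails cache, and expiry is triggered by hand rather than by save or destroy callbacks.

```ruby
# Hypothetical module illustrating the technique; not the plugin's implementation.
module PrefixCache
  STORE = {} # stand-in for the Rails cache

  # Intercept calls like cachola_get_numbers(100) and cache the real method's result.
  def method_missing(method_name, *args, &block)
    target = method_name.to_s[/\Acachola_(.+)\z/, 1]
    return super unless target && respond_to?(target)
    key = [name, target, args] # one cache key per class, method and argument list
    STORE.fetch(key) { STORE[key] = public_send(target, *args) }
  end

  def respond_to_missing?(method_name, include_private = false)
    target = method_name.to_s[/\Acachola_(.+)\z/, 1]
    (target && respond_to?(target)) || super
  end

  # Delete every cached entry belonging to this class.
  def expire_cachola!
    STORE.delete_if { |key, _value| key.first == name }
  end
end

class Internet
  extend PrefixCache
  @calls = 0

  class << self
    attr_reader :calls

    def get_numbers(count)
      @calls += 1 # counts how often the real method runs
      (1..count).to_a
    end
  end
end

Internet.cachola_get_numbers(100)
Internet.cachola_get_numbers(100) # served from the cache
Internet.cachola_get_numbers(500) # different argument, separate key
puts Internet.calls
```

Because the argument list is part of the key, each unique call gets its own entry, and expiring by class name clears them all in one pass.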
Now, what if your Internet class method ‘get_a_million_numbers’ depends on other objects getting saved or destroyed? That’s the other thing I wanted to make easier. Rather than setting up observers or sweepers, you can add the following to the other model:
class WhereAmI [:internet] end
Now when your WhereAmI model is either saved or destroyed, the cached methods in the Internet model will be deleted.
Installation
script/plugin install git://github.com/rbrant/acts_as_cachola.git
Where is this going from here?
Not sure. It does what I need it to do right now. It’s something I’ve found myself doing on two different projects that I thought would just make my life easier.
Project Info
ActsAsCachola is hosted on Github: http://github.com/rbrant/acts_as_cachola, where your contributions, forkings, comments and feedback are greatly appreciated. Please do add tests if you want me to pull in any changes.
I had to process some pretty big XML docs recently from the USPTO. Each doc is about 60 MB and (oddly enough) contains several thousand individual documents all concatenated. So the document isn’t valid XML, but that’s a different story.
The reason for writing this was to show a quick demo of how to use SAX to process a large XML file. You can read about SAX here, but basically, SAX (Simple API for XML) is an event-driven model. Rather than reading an entire tree structure into memory, which can be realllly sloooow, it reads the stream of data and raises events along the way.
The code below uses the Nokogiri library (which as a side note has this odd, albeit entertaining tagline: “XML is like violence – if it doesn’t solve your problems, you are not using enough of it.”). Most other XML parsing libraries also have SAX implementations.
The code below looks for the root node of each doc and builds a string for each individual document. After the doc has been assembled, it can be processed via the more pleasant:
doc = Nokogiri::HTML(xml)
serial = doc.css("application-reference document-id doc-number").inner_text
So this ends up being sort of a hybrid and much, much faster than loading the entire doc at once. It would be faster still not to parse each doc a second time, but the docs are too deeply nested for that; I need to be able to use xpath to get at what I need.
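The splitting half of that hybrid can be sketched without SAX at all. The version below is a simpler, line-based alternative, not the SAX handler described above: it streams the file line by line and assumes every sub-document starts with an XML declaration at the beginning of a line. each_document is a hypothetical helper name, and if the declaration assumption doesn’t hold for your files you’d need to watch for the root element instead.

```ruby
require 'stringio'

# Yields each concatenated sub-document as its own string.
# Assumes every sub-document starts with "<?xml" at the beginning of a line.
def each_document(io)
  return enum_for(:each_document, io) unless block_given?

  buffer = +''
  io.each_line do |line|
    # A new declaration means the previous document is complete.
    if line.start_with?('<?xml') && !buffer.empty?
      yield buffer
      buffer = +''
    end
    buffer << line
  end
  yield buffer unless buffer.empty?
end

sample = <<~XML
  <?xml version="1.0"?>
  <doc><doc-number>1</doc-number></doc>
  <?xml version="1.0"?>
  <doc><doc-number>2</doc-number></doc>
XML

docs = each_document(StringIO.new(sample)).to_a
puts docs.size
```

Pass File.open on the real file in place of the StringIO; only one sub-document is held in memory at a time, and each yielded string can go straight to the Nokogiri call shown earlier.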
It’s easy to forget what you’ve learned and what tools you used from project to project. I thought it might be worthwhile to sort of sum up these things either on a weekly basis or project basis. I had a lot of fun on a recent project and thought it would be a good place to start. I recently built what is described as a ‘tool for intelligently searching US patent application Image File Wrappers (IFWs).’
Technically, the system allows users to upload PDF documents and have their content indexed and made searchable. The documents are reasonably sized, averaging 25 megs each with several hundred pages. So once uploaded to the server, they are handed to delayed job to be processed in the background. I’m using collective idea’s fork after watching Ryan Bates’ screencast on delayed job that points out this fork has a few generators and rake tasks not part of the original.
In order to index the document, the PDF needs to be examined by OCR (optical character recognition) software. But before the OCR software can do its OCR-ing, it needs to have an image to examine. So we need to convert the individual PDF pages into images. To accomplish that, I used ghostscript. It’s really easy to use and fast. You can hand ghostscript the document, and it will churn out an image of each PDF page, in the resolution of your choice. I’m using 300×300, which seems to be a nice balance between processing time, space, and readability/ocr results.
Once the document has been converted into images the OCR software, tesseract-ocr, will iterate through each image and produce a text file with the contents of the page. Now, with a directory full of text files, it’s time to store the contents of each page in the database. That’s where sphinx and thinking sphinx come into play. Sphinx is the full text search engine and thinking sphinx is a ‘concise and easy-to-use Ruby library that connects ActiveRecord to the Sphinx search daemon, managing configuration and searching.’ I actually started the project with ferret/acts_as_ferret, but after reading so many good reviews of sphinx, and my own problems with ferret, I switched. The only downside is that setup is a little trickier and thinking sphinx doesn’t automatically update the index the way acts_as_ferret does, so there’s a cron job that handles that. The indexer is super fast though, so frequent indexing isn’t a problem.
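For the conversion and OCR steps, the two command lines can be sketched like this. The output device (png16m), the page file naming and the helper names are my assumptions for illustration; the post only specifies ghostscript at 300 dpi followed by tesseract. The helpers only build the argument lists, so they can be inspected without either tool installed.

```ruby
# Build the ghostscript call that renders every PDF page to an image.
# png16m and the page_%04d.png naming are assumptions, not taken from the project.
def ghostscript_command(pdf_path, out_dir, dpi = 300)
  ['gs', '-dNOPAUSE', '-dBATCH', '-dSAFER', '-sDEVICE=png16m', "-r#{dpi}",
   "-sOutputFile=#{File.join(out_dir, 'page_%04d.png')}", pdf_path]
end

# Build the tesseract call for one page image.
# tesseract appends ".txt" to the output base name itself.
def tesseract_command(image_path)
  output_base = image_path.sub(/\.png\z/, '')
  ['tesseract', image_path, output_base]
end

puts ghostscript_command('wrapper.pdf', '/tmp/pages').join(' ')
puts tesseract_command('/tmp/pages/page_0001.png').join(' ')
```

In the delayed job you would run them with system(*ghostscript_command(...)) and then loop over the generated images; passing an array to system skips the shell, so odd characters in uploaded filenames don’t need escaping.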
The site also offers multiple file download, and I used the rubyzip library which makes it simple to zip up a bunch of docs into one.
As for design, we used a theme from themeforest.net. I was impressed by the quality of generic templates they have. They aren’t free, but are dirt cheap – $5 or $10 for most. I’ve used
It’s a Rails application, so as for gems/plugins, the usual suspects are there: acts_as_commentable, exception_notification, restful_authentication, role_requirement, mislav-will_paginate, attachment_fu, and a few others: slicehost, thinking-sphinx, delayed_job, and rubyzip.
The real jewel in the list is ‘slicehost’, which gives you a bunch of rake tasks for setting up your slice at slicehost, which is my favorite hosting provider.
One other thing worth mentioning was an issue with the delayed_job process not stopping properly during deploys, so I kept getting multiple instances of delayed job running because the one running during the deploy never stopped. It was noted on github (with the solution below) in the issues section but I can’t find it now. Basically, the restart task looks like this:
namespace :delayed_job do # namespace name assumed; the opening line is missing from the original
  desc "Restart the delayed_job process"
  task :restart, :roles => :app do
    stop
    wait_for_process_to_end('delayed_job')
    start
  end
end

def wait_for_process_to_end(process_name)
  run "COUNT=1; until [ $COUNT -eq 0 ]; do COUNT=`ps -ef | grep -v 'ps -ef' | grep -v 'grep' | grep -i '#{process_name}'|wc -l` ; echo 'waiting for #{process_name} to end' ; sleep 2 ; done"
end
As a final note, all the software used in this project is open source. I’m constantly reminded of that and impressed by it. My thanks to all who have contributed to the software used in this project!
It was definitely worth upgrading to Snow Leopard from Leopard. My machine is noticeably quicker and I picked up quite a bit of new storage space, but it was not without its issues on the dev side of things. The most affected area so far has been MySQL.
You need to get the 64-bit version:
http://dev.mysql.com/downloads/mysql/5.1.html#macosx-dmg
I know you can use my.cnf and point to your previous versions’ data files, and then run mysql_upgrade, but that wasn’t working for me. I copied all the files from my previous install’s data directory and restarted the server. I suspect this method won’t work for all the different storage engines though.
You also need to reinstall any gems with native extensions.
The install itself, it’s worth noting, was a bit strange. It took a looong time, probably 120 minutes or so. The screen went dark toward the end. I was tempted to try restarting it because it seemed like it was hanging. It did eventually restart on its own, but when it restarted the screen was dark again. Magically after about three restarts it resolved itself.