The problem: You have a large number of records that need to be modified. You don’t have the processing resources to accomplish this as quickly as you’d like.
The solution: Break the data into chunked CSV files, with each file containing a certain number of records. Spin up a bunch of EC2 instances, each of which processes one file on startup. This lets you run a large number of processes concurrently to get the job done quickly. 100 ‘micro’ instances all running at the same time will cost $10 per hour ($.10/hour each). That’s not bad!
Creating the chunked files
file_num = 1
Thing.all.in_groups_of(2500, false) do |thing_group|
  # build the CSV data for this group of records
  csv_data = FasterCSV.generate do |csv|
    thing_group.each do |thing|
      csv << [thing.attr1, thing.attr2]
    end
  end
  # write the file inside the loop so every group gets its own numbered file
  File.open("chunked_things/file#{file_num}.csv", 'w') { |f| f.write(csv_data) }
  file_num += 1
end
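Outside of Rails, the same chunking idea can be sketched with Ruby's standard library. This is only a sketch under assumptions: `write_chunks` is a hypothetical helper, the rows are plain arrays rather than ActiveRecord objects, and the stdlib `csv` library stands in for FasterCSV.

```ruby
require 'csv'
require 'fileutils'

# Hypothetical helper (not from the original post): writes rows to
# numbered CSV files, chunk_size rows per file, and returns the paths.
def write_chunks(rows, dir, chunk_size)
  FileUtils.mkdir_p(dir)
  rows.each_slice(chunk_size).with_index(1).map do |group, file_num|
    path = File.join(dir, "file#{file_num}.csv")
    CSV.open(path, 'w') { |csv| group.each { |row| csv << row } }
    path
  end
end
```

Calling `write_chunks(rows, 'chunked_things', 2500)` would produce file1.csv, file2.csv and so on, matching the naming used above.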
Configure an EC2 instance to serve as the source instance
Configure an EC2 instance from which you will create an AMI to be used to spawn the other worker instances. The important part is to make sure this instance has a copy of all the chunked files created above. I’ll explain why later. I used an AMI from bitnami, this one to be specific (listed toward the bottom of the page). It’s an EBS backed AMI, which makes creating AMIs from running instances easier than from S3 backed ones. This may have changed, not sure.
Use cron to run the script at start up
Create a cron task set to run when the instance is booted. You can use the @reboot shortcut to accomplish this:
@reboot /path/to/ruby /path/to/your/script
The goal here is to have your script run when the server boots. This would be pretty straightforward, but we need to be sure the EC2 instances don’t all process the same file at once, or process the same file twice. Enter SQS.
SQS
SQS basically allows you to create a simple queue that holds messages. For us, the messages are the names of all the files we want to process. When the server boots and the script runs, it hits the SQS service and asks for the name of a file. SQS responds with a message off the queue (which is the name of a file). That message is locked so no other request will receive it. Once the message is retrieved, you can pop it off the queue. Each spawned instance is guaranteed to get its own filename.
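To make the hand-out behaviour concrete without touching AWS, here is a small in-process sketch. It is an assumption-laden stand-in: Ruby's thread-safe `Queue` plays the role of SQS and four threads play the role of EC2 instances. Real SQS standard queues are documented as at-least-once delivery, so a production worker may still want to tolerate the occasional duplicate.

```ruby
# In-process stand-in for SQS (an assumption for illustration): Ruby's
# thread-safe Queue hands each filename to exactly one worker thread.
queue = Queue.new
filenames = (1..20).map { |n| "file#{n}.csv" }
filenames.each { |name| queue << name }

claimed = Queue.new # collects [worker_id, filename] pairs
workers = 4.times.map do |worker_id|
  Thread.new do
    loop do
      begin
        name = queue.pop(true) # non-blocking; raises ThreadError when empty
      rescue ThreadError
        break
      end
      claimed << [worker_id, name]
    end
  end
end
workers.each(&:join)

# Every filename was claimed once and only once.
processed = []
processed << claimed.pop.last until claimed.empty?
```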
right_aws is an awesome wrapper for AWS. The code to create the queue looks like this:
sqs = RightAws::SqsGen2.new(aws_key, aws_secret)
queue = RightAws::SqsGen2::Queue.create(sqs, 'chunk_queue')
Populate the queue with your filenames
queue.push 'your_filename'
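Pushing every chunk file in one go might look something like the sketch below. `enqueue_chunk_files` is a hypothetical helper, and it only assumes the queue object responds to `push`, so a plain Array can stand in for the right_aws queue while trying it out.

```ruby
# Hypothetical helper: pushes the basename of every CSV file in dir onto
# queue and returns the names it pushed. queue can be anything that
# responds to push (the right_aws queue above, or an Array for testing).
def enqueue_chunk_files(queue, dir)
  names = Dir.glob(File.join(dir, '*.csv')).map { |path| File.basename(path) }.sort
  names.each { |name| queue.push(name) }
  names
end
```

Only the basename goes on the queue, which fits a worker that looks for the file relative to its own script directory.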
Processing the file
To get a filename from the queue:
chunked_filename = queue.pop.to_s
Now we have the file with CSV data to process. Loop through it, hitting whatever service you need to.
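One edge worth guarding: if more instances boot than there are files, some will find the queue empty. Since `nil.to_s` is an empty string, the script would carry on with a blank filename. The sketch below is a hypothetical helper and assumes `pop` returns nil on an empty queue, which is how I understand right_aws to behave.

```ruby
# Hypothetical helper: returns the next filename, or nil when the queue
# has nothing left. Assumes queue.pop returns nil on an empty queue.
def next_chunk_filename(queue)
  message = queue.pop
  return nil if message.nil?

  name = message.to_s.strip
  name.empty? ? nil : name
end
```

A worker could then bail out early with something like `exit if chunked_filename.nil?`.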
Push your modified data back to S3
When you are finished processing your file, you’ll want to put the results somewhere. S3 is an obvious choice. Again with right_aws, it’s pretty easy. This will basically write your data back to a file on S3:
s3 = RightAws::S3.new(aws_key, aws_secret)
bucket = s3.bucket(s3_bucket_name)
key = RightAws::S3::Key.create(bucket, "#{s3_bucket_key}#{chunked_filename}")
key.put(your_data)
Altogether now, this is some pseudo code (not tested) that addresses the core parts of the process. This is the script that would be executed at start up:
require 'rubygems'
require 'right_aws'
require 'fastercsv'
# aws
aws_key = 'your key'
aws_secret = 'your secret'
# s3 bucket
s3_bucket_name = 'your_bucket'
s3_bucket_key = 'name_of_key'
sqs = RightAws::SqsGen2.new(aws_key, aws_secret)
queue = RightAws::SqsGen2::Queue.create(sqs, 'chunk_queue')
# pop a file name off the queue
chunked_filename = queue.pop.to_s
modified_data = []
# process the filename SQS gave us (a copy of all files is on each instance)
file_to_process = "#{File.expand_path(File.dirname(__FILE__))}/#{chunked_filename}"
FasterCSV.foreach(file_to_process) do |row|
data_for_api = "#{row[0]}, #{row[1]}, #{row[2]} #{row[3]}"
results = call_your_service(data_for_api) # placeholder name: hit your web service or do what you need to here
# use api results
modified_data << [results.value1, results.value2]
end
# generating csv data
csv_data = FasterCSV.generate{ |csv| modified_data.each{ |modified| csv << modified } }
# put the file on S3
s3 = RightAws::S3.new(aws_key, aws_secret)
# grab the bucket
bucket = s3.bucket(s3_bucket_name)
# create the S3 key where the csv data will be stored
key = RightAws::S3::Key.create(bucket, "#{s3_bucket_key}#{chunked_filename}")
# write the data to S3
key.put(csv_data)
So I’m all set to take advantage of Amazon Web Services’ new ‘micro’ instance type on EC2 (yup, another client moving to EC2). I need to manage it via the elasticfox Firefox extension, but the micro instance type isn’t an option yet in the new instance dialog.
And you thought all the clever caching names were taken.
ActsAsCachola is a plugin that lets you cache any class method by simply prepending ‘cachola_’ to the method name when calling it. Here’s how it works:
Given the following model:
class Internet < ActiveRecord::Base
acts_as_cachola
def self.get_a_million_numbers
1.upto(1_000_000).inject([]){ |numbers, x| numbers << x }
end
end
Now you can call the method, ‘cachola_get_a_million_numbers,’ and the return value of ‘get_a_million_numbers’ will be cached automatically.
Note that if the method accepts arguments, each unique call will have its own key in the cache. For example:
class Internet < ActiveRecord::Base
acts_as_cachola
def self.get_numbers(to_number)
1.upto(to_number).inject([]){ |numbers, x| numbers << x }
end
end
Calling Internet.cachola_get_numbers(100) and Internet.cachola_get_numbers(500) will result in two keys (with different values) stored in the cache.
The cached method is then expired automatically when the class in which the plugin has been included is saved or destroyed. It’s restored to the cache the next time it’s called.
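For anyone curious how a prefix like this can work, here is a rough sketch of the general technique using `method_missing`. It is not the plugin's actual source: `SketchCache`, the in-memory Hash and `expire_cachola!` are all assumptions made for illustration, and the real plugin ties expiry to save and destroy.

```ruby
# Illustrative only: NOT the plugin's actual source. SketchCache and the
# in-memory Hash are assumptions standing in for the real cache store.
module SketchCache
  def cache_store
    @cache_store ||= {}
  end

  # Intercepts calls like cachola_get_numbers(100) and caches the result
  # of get_numbers(100) under a key built from the name and arguments.
  def method_missing(name, *args, &block)
    target = name.to_s[/\Acachola_(.+)\z/, 1]
    return super unless target && respond_to?(target)

    key = [target, args]
    cache_store.fetch(key) { cache_store[key] = public_send(target, *args) }
  end

  def respond_to_missing?(name, include_private = false)
    target = name.to_s[/\Acachola_(.+)\z/, 1]
    (target && respond_to?(target)) || super
  end

  # Stand-in for the expiry that happens on save or destroy.
  def expire_cachola!
    cache_store.clear
  end
end

class Numbers
  extend SketchCache

  # Counts how often the uncached method really runs.
  def self.calls
    @calls ||= 0
  end

  def self.get_numbers(to_number)
    @calls = calls + 1
    (1..to_number).to_a
  end
end
```

Calling `Numbers.cachola_get_numbers(3)` twice runs `get_numbers` once, while a call with a different argument gets its own cache key.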
Now, what if your Internet class method ‘get_a_million_numbers’ depends on other objects getting saved or destroyed? That’s the other thing I wanted to make easier. Rather than setting up observers or sweepers, you can add the following to the other model:
class WhereAmI < ActiveRecord::Base
acts_as_cachola_notifier => [:internet]
end
Now when your WhereAmI model is either saved or destroyed, the cached methods in the Internet model will be deleted.
script/plugin install git://github.com/rbrant/acts_as_cachola.git
Not sure. It does what I need it to do right now. It’s something I’ve found myself doing on two different projects that I thought would just make my life easier.
ActsAsCachola is hosted on Github: http://github.com/rbrant/acts_as_cachola, where your contributions, forkings, comments and feedback are greatly appreciated. Please do add tests if you want me to pull in any changes.
I had to process some pretty big xml docs recently from the USPTO. Each doc is about 60mb and (oddly enough) contains several thousand individual documents all concatenated. So the document isn’t valid xml..but that’s a different story.
The reason for writing this was to show a quick demo of how to use SAX to process a large XML file. You can read about SAX here, but basically, SAX (Simple API for XML) is an event-driven model that solves the problem of having to read an entire tree structure into memory which can be realllly sloooow, and instead reads the stream of data and raises events along the way.
The code below uses the Nokogiri library (which as a side note has this odd, albeit entertaining tagline: “XML is like violence - if it doesn’t solve your problems, you are not using enough of it.”). Most other XML parsing libraries also have SAX implementations.
What the code below does is look for the root node of each doc and build a string for each individual document. After the doc has been assembled, it can be processed via the more pleasant:
doc = Nokogiri::HTML(xml)
serial = doc.css("application-reference document-id doc-number").inner_text
So this ends up being sort of a hybrid and much, much faster than loading the entire doc at once. It would be faster not parsing the doc again at all but the docs have too much nested complexity that requires the ability to use xpath to get at what I need.
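As a rough illustration of the "build a string per document" step, here is a simplified stand-in that does not use SAX at all. It is a hypothetical helper, not the original code: it streams the file line by line and starts a new document whenever it sees an XML declaration, which assumes every embedded document begins with `<?xml` at the start of a line.

```ruby
require 'stringio'

# Hypothetical helper, not the SAX code from the post: yields each
# embedded document as its own string. Assumes every document starts
# with an XML declaration ("<?xml") at the beginning of a line.
def each_embedded_doc(io)
  return enum_for(:each_embedded_doc, io) unless block_given?

  buffer = +''
  io.each_line do |line|
    if line.start_with?('<?xml') && !buffer.empty?
      yield buffer
      buffer = +''
    end
    buffer << line
  end
  yield buffer unless buffer.empty?
end

# Two tiny documents concatenated into one stream.
sample = "<?xml version=\"1.0\"?>\n<a>1</a>\n<?xml version=\"1.0\"?>\n<a>2</a>\n"
docs = each_embedded_doc(StringIO.new(sample)).to_a
```

Because only one document is held in memory at a time, each yielded string can be handed to Nokogiri on its own. With a real file, `File.open(path) { |f| each_embedded_doc(f) { |xml| ... } }` would do the same thing.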
It’s easy to forget what you’ve learned and what tools you used from project to project. I thought it might be worthwhile to sort of sum up these things either on a weekly basis or project basis. I had a lot of fun on a recent project and thought it would be a good place to start. I recently built what is described as a ‘tool for intelligently searching US patent application Image File Wrappers (IFWs).’
Technically, the system allows users to upload PDF documents and have their content indexed and made searchable. The documents are reasonably sized, averaging 25 megs each with several hundred pages. So once uploaded to the server, they are handed to delayed job to be processed in the background. I’m using collective idea’s fork after watching Ryan Bates’ screencast on delayed job that points out this fork has a few generators and rake tasks not part of the original.
In order to index the document, the PDF needs to be examined by OCR (optical character recognition) software. But before the OCR software can do its OCR-ing, it needs to have an image to examine. So we need to convert the individual PDF pages into images. To accomplish that, I used ghostscript. It’s really easy to use and fast. You can hand ghostscript the document, and it will churn out an image of each PDF page, in the resolution of your choice. I’m using 300x300, which seems to be a nice balance between processing time, space, and readability/ocr results.
Once the document has been converted into images the OCR software, tesseract-ocr, will iterate through each image and produce a text file with the contents of the page. Now, with a directory full of text files, it’s time to store the contents of each page in the database. That’s where sphinx and thinking sphinx come into play. Sphinx is the full text search engine and thinking sphinx is a ‘concise and easy-to-use Ruby library that connects ActiveRecord to the Sphinx search daemon, managing configuration and searching.’ I actually started the project with ferret/acts_as_ferret, but after reading so many good reviews of sphinx, and my own problems with ferret, I switched. The only downside is that setup is a little trickier and thinking sphinx doesn’t automatically update the index the way acts_as_ferret does, so there’s a cron job that handles that. The indexer is super fast though, so frequent indexing isn’t a problem.
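The two conversion steps can be sketched as command builders. These are hypothetical helpers that only build the argument lists and do not run anything. The `tiffg4` output device, the page file naming and the paths are assumptions; the post doesn't say which image format was used.

```ruby
# Hypothetical helpers: build (but do not run) the commands for each step.
# The tiffg4 device, file naming, and paths are assumptions.
def ghostscript_command(pdf_path, out_dir, dpi = 300)
  ['gs', '-dNOPAUSE', '-dBATCH', '-dQUIET', '-sDEVICE=tiffg4', "-r#{dpi}",
   "-sOutputFile=#{File.join(out_dir, 'page-%04d.tif')}", pdf_path]
end

def tesseract_command(image_path)
  # tesseract appends .txt to the output base name itself
  output_base = image_path.sub(/\.tif\z/, '')
  ['tesseract', image_path, output_base]
end
```

Each array can be passed to `system(*command)`, which avoids shell quoting problems with odd filenames.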
The site also offers multiple file download, and I used the rubyzip library which makes it simple to zip up a bunch of docs into one.
As for design, we used a theme from themeforest.net. I was impressed by the quality of generic templates they have. They aren’t free, but are dirt cheap - $5 or $10 for most. I’ve used
It’s a Rails application, so as for gems/plugins, the usual suspects are there: acts_as_commentable, exception_notification, restful_authentication, role_requirement, mislav-will_paginate, attachment_fu, and a few others: slicehost, thinking-sphinx, delayed_job, and rubyzip.
The real jewel in the list is ‘slicehost’ which gives you a bunch of rake tasks for setting up your slice at slicehost, which is my favorite hosting provider.
One other thing worth mentioning was an issue with the delayed_job process not stopping properly during deploys, so I kept getting multiple instances of delayed job running because the one running during the deploy never stopped. It was noted on github (with the solution below) in the issues section but I can’t find it now. Basically, the restart task looks like this:
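# assumed: the enclosing namespace line was cut from the original snippet
namespace :delayed_job do
  desc "Restart the delayed_job process"
  task :restart, :roles => :app do
    stop
    wait_for_process_to_end('delayed_job')
    start
  end
end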
def wait_for_process_to_end(process_name)
run "COUNT=1; until [ $COUNT -eq 0 ]; do COUNT=`ps -ef | grep -v 'ps -ef' | grep -v 'grep' | grep -i '#{process_name}'|wc -l` ; echo 'waiting for #{process_name} to end' ; sleep 2 ; done"
end
As a final note, all the software used in this project is open source. I’m constantly reminded of that and impressed by it. My thanks to all who have contributed to the software used in this project!
Was definitely worth upgrading to snow leopard from leopard. My machine is noticeably quicker and I picked up quite a bit of new storage space, but it was not without its issues on the dev side of things. The most affected area so far has been mysql.
You need to get the 64 bit version: http://dev.mysql.com/downloads/mysql/5.1.html#macosx-dmg
I know you can use my.cnf and point to your previous versions’ data files, and then run mysql_upgrade, but that wasn’t working for me. I copied all the files from my previous install’s data directory and restarted the server. I suspect this method won’t work for all the different storage engines though.
You also need to reinstall any gems with native extensions.
The install itself, it’s worth noting, was a bit strange. It took a looong time, probably 120 minutes or so. The screen went dark toward the end. I was tempted to try restarting it because it seemed like it was hanging. It did eventually restart on its own, but when it restarted the screen was dark again. Magically after about three restarts it resolved itself.
http://www.paulgraham.com/makersschedule.html
so incredibly well said.
I have a client that wants parts of their application available offline. After looking at the various approaches to solving this problem, it’s clear (to me at least!) that Gears is the way to go. It’s cross-platform and cross-browser, and it works across a variety of mobile devices.
Rather than trying to get Gears integrated into the existing application, I decided to first put together a sample application to get it all running properly. Step two will be taking what I’ve done and stitching it into the application. Isolating it this way gives you complete control and avoids any complications that existing libraries/code may cause.
I wanted to put this out there for others to see, mainly because I couldn’t find the sort of samples I needed at the time. And without it, it’s not as easy to wrap your head around the gestalt of it. To get an app running offline, you need two basic things to happen.
You need the physical pages available offline. This is handled via Gears’ LocalServer module. In the sample, most of this is handled in the ‘Store’ js object. And much of the code is refactored, but based on a sample provided by Google.
You need a facility for storing the data locally. This is handled by Gears’ Database, which is SQLite underneath. This is where JStORM comes into play. JStORM is a truly awesome javascript library that makes handling the local storage easier and better than Google’s api. From the original JStORM announcement, it ‘gives you a way to declare your tables as objects and provide a nicer OO interface than the normal Google Gears api.’
The server side application is Rails. It’s a very basic CRUD sample. The sample lets you take the application offline, and work disconnected from the internets and then put it back online. The key part there is the syncing of your local data with the remote data. In the sample, when you go offline, the remote data is brought down to the gears storage facility (SQLite). When you go back online your data is pushed back to the remote database.
jQuery glues it all together and makes use of the existing forms. Based on a cookie that flags us as offline or online, jQuery binds to the form and submits to the local database rather than the remote one. It’s pretty easy to follow what’s going on, but that’s after looking at it for days, so maybe not. Email me with any questions.
As a final note, I’m sure there’s plenty in here to improve upon, so please let me know what you’ve come up with. A couple things come to mind, including automating the JStORM model creation, tying it into Rails migrations, better error handling, and perhaps syncing the model validations from Ruby to js. And more generally, building an offline component for your application is kinda cool, but tedious at the same time. It would be nice if there were a way to build in offline support dynamically. I think gearsonrails had this goal in mind, along with less involvement with javascript, but I couldn’t get their samples to work, and it still appeared to require quite a bit of massaging anyway. At the very least, with this approach you aren’t abstracted too far from seeing what’s goin’ on under the hood.
very psyched to have found this…
Trac: http://labs.urielkatz.com/wiki/JStORM
Intro by the developer:
http://www.urielkatz.com/archive/detail/introducing-jstorm/
Postgres doesn’t seem to handle imports very well. At least not as gracefully as MySQL. When the primary key sequence of a table gets out of whack, you can reset it via psql directly:
SELECT setval('table_name_id_seq', (SELECT MAX(id) FROM table_name)+1);
But that doesn’t seem to solve the problem for my rails application. This does:
ActiveRecord::Base.connection.reset_pk_sequence!('table_name')
I posted this because all roads seem to lead to the former solution and not the latter..
This is an easy way to reset all the keys (from the console):
ActiveRecord::Base.connection.tables.each{|t| ActiveRecord::Base.connection.reset_pk_sequence!(t) unless t == 'schema_info'}
My wife had what I thought was a great idea: a diary for twitter updates.
When my son was born I tried writing a little bit about each day; just regular stuff that happened each day so I could look back on it later. Fun at first, but became kind of a burden to keep up with it and it eventually went away. Tweetary solves this problem for me. I’m already on Twitter, so now I can just record something privately if I want. Plus, the brevity Twitter requires will make it more likely that I’ll continue to use it. Yeah, you could just email yourself, or start a second account, but that wouldn’t allow me to try out the twitter-auth plugin/gem and mess around with the Twitter api, which was the other motivation. The plugin is great and the Twitter api is very straight forward. The whole thing took only a few hours.
It’s free, and you can export whatever you send there, so there’s no risk of losing it. Try it out and see what you think.
First deployment on a brand new machine is never without a little frustration, eh? If you see this error, be sure to change the ownership of your environment.rb file to ‘www-data’.
If you see this message in your qbwc log, chances are you’ve hit the undocumented qbwc request limit of 2 minutes.
I had a lot of xml being returned by quickbooks and the log kept reporting a timeout (‘Error message: The operation has timed out’). I spent time optimizing the way the receiveResponseXML method was handled by the application, thinking that the way Rails was updating one rec at a time was too time consuming. Well, it was actually. Using the ar-extensions plugin, which makes use of mysql’s on duplicate key update syntax, improved performance tremendously. That was a nice benefit of having to deal with the issue, but not the key to fixing it. The key was discovering that the timeout always arrived the same interval after the initiation of the request, to within a second, on every occasion. I’ve confirmed the behavior in the intuit forums, at least.
If you aren’t analyzing your traffic, you’re crazy, especially since such a powerful and free solution exists in the form of Google analytics. I’m blown away by the depths of the reporting. I’ve just scratched the surface of what’s available it seems, too.
We have an unlimited number of sub domains that we wanted to track for paperconcierge.com and it turned out it’s easy to add. You can set up scheduled reports to be emailed in a variety of formats, too. Although it looks like the scheduled email reports are limited to 10 recipients.
I’m working on a project these days that has quite a bit of Quickbooks integration. For those who have already traveled this path you know it’s no party. SOAP, Windows only..and that’s just the start. I can’t relieve your headache, but I do have some decent resources and tips to mention.
Here’s how:
Run regedit (Start > Run > regedit.exe)
Navigate to: HKEY_CURRENT_USER\Software\Intuit\QBWebConnector
Change the ‘Level’ key to VERBOSE
More to come as I continue to bang my head against the wall..
What a pain… it seems that as a security consideration, Firefox 3 no longer provides access to the full file path via the file type input. I can see the reasons for masking the path, but there are also legitimate reasons you may need the path of the file as it is on the client’s machine. The application on which I’m currently working, for example, needs to know where the Quickbooks data file is stored on the client’s machine. When users configure their QB setup, they store this info on the site. Then when the web connector authenticates against the application, it pulls down this info and does its thing. I can imagine there are many similar setups out there now, and many annoyed developers, too.
The best solution I could find is here in the comments of the bug report thanks to this post.
However, the class file referred to in the bug isn’t the one you want. You want this one.
I did some poking around and it appears that IE8 has this same wonderful security feature, so you will be needing this!
Just installed bbPress for our team members and while cruising around the admin, saw this awesome plugin - ‘Bozo Users,’ which is described as:
Allows moderators to mark certain users as a “bozo”. Bozo users can post, but their content is only visible to themselves.
Now that’s funny.