Extending Ruby: The Sorted Array

Posted by Christopher Wojno Mon, 23 Jun 2008 20:14:00 GMT

I’ve been working on my personal projects and I came across a problem with the ruby Array class. I have an array containing approximately 500 elements. I need to find each element approximately 500 times ( O(n^2) as the array search is linear ). As you can guess, it takes about 0.5 seconds for all these searches. I figured, if the linear time is 0.5, the binary search taking log_2(n) should take significantly less time to perform.

Attempt #1: A ruby Sorted Array

I first attempted to make a sorted array class in pure ruby. While creation of the class was easy, it was not as efficient as hoped. As a matter of fact, according to my tests, the original ruby Array’s find_index was twice as fast as mine on average. In fact, in order for my set to be as efficient as the original Array, the value being sought had to be over 4/5 the way into the array. Only after that point did my SortedArray overtake the Array in speed. Clearly, this was not the way to go.

After several attempts at re-working and simplifying the loop, I decided that the only reason the Array was so fast, was because it was written natively in C, not ruby.

Attempt #2: A native C ruby Sorted Array

Most of the guiding code to the Sorted Array is from the Pragmatic Ruby Programmer’s Guide: Extending Ruby.

The C-file

First, we start with a new project tree, a-la SVN (this isn’t too terribly important, but I prefer it). Next, we create the first file: sorted_array.c.

In it I included the required ruby file: ruby.h. Next, I went to the subversion repository for the Array code. I’m using the 1.9.0.2 tag.

Take a good look at array.c. I’s packed full of goodness. Unfortunately, I’ll need to copy most of the code so that it supports all the functions we’re used to with the standard Array (though, some cannot be used with this version, such as insert).

There, we now have a fully functional SortedArray that doesn’t sort. It’s a good start.

Turning a normal Array into a SortedArray

The standard Array and the SortedArray are essentially the same. The exception is that all insertions should be made in such a fashion as to preserve the ordering of the elements, that a special constructor be provided to accept un-natural sorting functions, and that searches can be optimized to take advantage of the knowledge of the arrangement of data.

The c-version of the Array has 3 major components in the data structure: length (len), ptr (the start of the array in memory), and aux. Aux has lots of things that support the array, such as the capacity. Keep these in mind when mucking with the internals.

Remap the names

We can’t just use the code as is. Let’s rename all the functions from “rb_ary.

*
” to “rb_sorted_ary_.
*
”. With vim, I do: :%s/rb_ary_/rb_sorted_ary_/g. Poof, all renamed.

Enter the Constructors

One of the constructors takes a variable argument list as the 2nd and above parameters. We need to correct this as it simply inserts values. We need to push each individual element. Yes, this will slow down the construction of these types of arrays, but again, we’re banking on searching being the primary operation, not construction.

My version of array.c (1.9.0.2) looks like this:

VALUE
rb_ary_new3(long n, ...)
{
    va_list ar;
    VALUE ary;
    long i;

    ary = rb_ary_new2(n);

    va_start(ar, n);
    for (i=0; i<n; i++) {
    RARRAY_PTR(ary)[i] = va_arg(ar, VALUE);
    }
    va_end(ar);

    RARRAY(ary)->len = n;
    return ary;
}

It needs a little bit of changing. Let’s change: RARRAY_PTR(ary) line to: “rb_sorted_ary_push(ary,va_arg(ar,VALUE));” That should fix it’s wagon. It’s a variable array, so we need to use the c-library variable argument accessors: va_arg and cast it to Ruby’s VALUE type. After that, we just push it to the array, one at a time.

As you may have guessed, I’ve decided to make the primary changes to push and delegating all insertion to that method. We’ll get to this later, however.

Next, we tackle the new4 function. This is the “copy-constructor” if you’re used to C++. Essentially, it is expecting an array passed in as the second parameter (“elts” in my version). The C-code makes a straight copy form a C-array (not a ruby Array!). However, this is not possible if the incoming array is not sorted. We need to insert them one at a time, inserting each. Originally, it looks like this:

VALUE
rb_sorted_ary_new4(long n, const VALUE *elts)
{
    VALUE ary;

    ary = rb_sorted_ary_new2(n);
    if (n > 0 && elts) {
    MEMCPY(RARRAY_PTR(ary), elts, VALUE, n);
    RARRAY(ary)->len = n;
    }

    return ary;
}

The if statement is assuring that the incoming array is not empty or null (nil). That’s a good check and we’ll keep it. But we need to augment it. We know how long the array is, so let’s just iterate through each and push the values onto the array. I threw away the MEMCPY and replaced it with:

for(long i = 0; i < n; ++i) {
  rb_sorted_ary_push( ary, elts[i] );
}

That does it for the constructors (for now, we still need to add a way to provide an unnatural sorting Proc call, but we’ll come back to this later after a bit of testing).

ary_make_* and ary_.* functions

There are two functions that do not adhere to the rb_ary_.* syntax: ary_make_shared and ary_make_hash. Stick a “sorted_” in front of those definitions and in every place they are used.

Do the same with ary_.* functions (such as ary_new and ary_alloc and ary_shared_first). You can do this by simply using vim to search for ”\<ary_” and it will locate any ary_ that has a space or nothing before it (starts a “word”).

to_ary

Keep them as is. I see no need to convert to a sorted array, as we may need to add a parameter to take in a custom sort function (which we’re coming back to later).

Initialize

We now hit the initialize function. It has 4 forms:

  1. Array.new(size=0,obj=nil)
  2. Array.new(array)
  3. Array.new(size) {|index| block}

The first is easy: create an array of size, size, and then fill it with objects given in obj. This version requires no sorted, as all objects will be the same. The second version requires a bit of changing. We need to push each value of the incoming array one at a time to ensure consistency. Jump down to rb_sorted_ary_replace.

The Replacement Detour: before fixing initialize

VALUE
rb_sorted_ary_replace(VALUE copy, VALUE orig)
{
    VALUE shared;
    VALUE *ptr;

    orig = to_ary(orig);
    rb_sorted_ary_modify_check(copy);
    if (copy == orig) return copy;
    shared = sorted_ary_make_shared(orig);
    if (!ARY_SHARED_P(copy)) {
    ptr = RARRAY(copy)->ptr;
    xfree(ptr);
    }
    RARRAY(copy)->ptr = RARRAY(orig)->ptr;
    RARRAY(copy)->len = RARRAY(orig)->len;
    RARRAY(copy)->aux.shared = shared;
    FL_SET(copy, ELTS_SHARED);

    return copy;
}

They simply make a copy of the array’s pointer and declare the source array to be shared. Since we’re not sharing the array, there’s no need to do this anymore, yank out the make_shared line, the line at sets the aux.shared to shared, and the FL_SET. Next, we’ll yank out the pointer and length copying. Since we’re replacing our contents, we need to drop the old contents. Simply drop that if-statement around the xfree(ptr); line, and that will be taken care of. We’ll convert it to a for-loop, iterating through the original, don’t forget to use Ruby’s handy macros! To save some work, I’ll re-use the existing “constructor:”

VALUE
rb_sorted_ary_replace(VALUE copy, VALUE orig)
{
    VALUE shared;
    VALUE *ptr;

    orig = to_ary(orig);
    rb_sorted_ary_modify_check(copy);
    if (copy == orig) return copy;
        ptr = RARRAY(copy)->ptr;
        xfree(ptr);
        // create a new array
        long len = RARRAY_LEN(orig);
        if( len == 0 ) len++;
        RARRAY(copy)->ptr = ALLOC_N(VALUE, len);
        RARRAY(copy)->len = 0;
        RARRAY(copy)->aux.capa = len;
        for( int i = 0; i < RARRAY_LEN(orig); ++i ) {
            rb_sorted_ary_push(copy,RARRAY(orig)->ptr[i]);
        }

    return copy;
}

OK, this is probably the most confusing part so far. Why did we throw away the shared? Ruby cheats because instead of making a true, deep copy of the array it’s making a copy of, it declares the contents of the array as “shared” and simply copies pointers. We can’t do that as we need to re-arrange our data. So that’s why I yanked all the code related to sharing. So now, each object remains unlinked as far as the garbage collector is concerned. Yes, this does make things slower, and yes, does double the memory usage. But we can’t modify the incoming array as it may not be intended to be sorted. So we’re stuck. Because we’re doing a replacement, it’s best to just start from scratch. We do this by freeing up the c-array pointer, using xfree. Next, we get the length of the original array. We then use this to size-up our copy. Why do I add 1 in the case that the length is zero? Because, allocating an array of size 0 will cause C to throw a fit. So, we avoid that mess entirely by creating a starting array with 1 place to store things, but it is empty. We create the appropriately sized, internal c-array. We then set the size of the array to zero to spoof a blank array. We then set the capacity, to tell the Array that it need not resize itself when pushing the next set of items, one at a time.

When all items are added, the array will have the correct length (no longer zero, if the original wasn’t empty).

Back to Initialize to finish the last case

Back to the “initializer,” we now need to take care of the strange case that uses the block to initialize the array, based on the index. We cannot assume that the values created by this block will always be in order. So, after each generation (rb_yield), we need to push them onto the array as usual. The line looks like: “rb_sorted_array_store(ary, i, rb_yeild(LONG2NUM))”. Simply yank that line, and the line incrementing the length (as the length will be modified by the push function). Replace those lines with: “rb_ary_push(ary, rb_yield(LONG2NUM));” That should fix initialize.

Next, the rb_sorted_ary_s_create needs to be changed to, you guessed it, use push. Drop the length and the MEMCPY and insert the for loop, with the push call:

static VALUE
rb_sorted_ary_s_create(int argc, VALUE *argv, VALUE klass)
{
    VALUE ary = sorted_ary_alloc(klass);

    if (argc < 0) {
    rb_raise(rb_eArgError, "negative array size");
    }
    RARRAY(ary)->ptr = ALLOC_N(VALUE, argc);
    RARRAY(ary)->aux.capa = argc;
        for( int i = 0; i < argc; ++i ) {
            rb_sorted_ary_push(ary, argv[i] );
        }

    return ary;
}

Guessing an index

Next, we need to fix up push to stop simply appending. But before that, we need to create a special function to guess an index. Why guess? Because the value may not yet exist, but if it did, this is the index where it would be located in the array. This will be the basic function for both searching, deleting, and pushing.

I’ll call my new function: rb_sorted_ary_guess_index(VALUE ary, VALUE item). Not very creative, but very descriptive. Here’s the problem though: we want to be able to use both natural, and unnatural sorting algorithms. And, either is fine, so long as the ordering is monotonic (goes one-way and is predictable). Before we get into that, let’s just assume we’re using natural ordering, for simplicity. I wrote the following code:



	

Head up to sorted_array_alloc. Hmmm…. We need to add something to our arrays! We need to store the optional sort-by block!

The Sort-by block

Create a new file called “sorted_array.h”. Create the data structure:

#include "ruby.h" 

struct RSArray {
    struct RBasic basic;
    long len;
    union {
        long capa;
        VALUE shared;
    } aux;
    VALUE *ptr;
    ID cmp;
};

This is an exact duplicate of the RArray in ruby.h. The only exception is the additional ID *cmp, our comparison function. Be sure you’ve included this structure file in the sorted_array.c file.

Back in the sorted_array.c file, make sure you cap the new cmp variable with a NULL in sorted_ary_alloc: ary->cmp = NULL; Just put it under the aux.capa assignment. Now, we need to “fix” the sort_2 function to fall-back to the default comparison function, but to prefer a specified function if given. Let’s create a class-specific sort function called: sorted_array_sort_3.



  
  

Posted in  | Tags , , , , , , , ,  | no comments

Array.valN: Working smarter with Rails, with Ruby

Posted by Christopher Wojno Mon, 13 Aug 2007 18:37:00 GMT

Rails has a powerful form builder mechanism. However, if you want to make a list of things and have the data flow back to you without involving ActiveRecord, you’re in for some hurt; well, you were, unless you decide to use this module. Warning: this article is moderately advanced. I’m assuming you’re familiar with ActionControllers and the ActionView in my examples. If you’re familiar with how to use the basics of Rails, you should probably only read the first 1/2. Read the first code block to see what my code can do. But to see why this code is useful, read on to the examples.

Say I have a set of fox phrases. Now, if a certain fox is to have a list of catch phrases, and you don’t want to store them in a database, you can’t use Rails’ form helpers (text_field, hidden_field, etc.) and the params[] calls without a good deal of ugly, hack code (creating spoof instance variables to display data and creating a fake active record objects to receive that data from a form). To help beautify code, I created an additional way of accessing (setting and getting) elements in an array (valN). Behold the ArrayIterAccessors:

>> @catch_phrases = ['Addiction is like Pokemon!','Hey! There goes my pickup!','Chucky Bacon!']
=> ["Addiction is like Pokemon!", "Hey! There goes my pickup!", "Chucky Bacon!"]
>> @catch_phrases.val2
=> "Chucky Bacon!" 
>> @catch_phrases.val2 = 'Chunky Bacon! Chunky Bacon! Chunky Bacon!'
=> "Chunky Bacon! Chunky Bacon! Chunky Bacon!" 
>> @catch_phrases
=> ["Addiction is like Pokemon!", "Hey! There goes my pickup!", "Chunky Bacon! Chunky Bacon! Chunky Bacon!"]

“Now,” you say, “why the heck is he not just using the []!?” Well, I say, Try doing that within ActionView (rhtml template) for a text field:

<%= text_field 'catch_phrases', '[0]' %>

That will fail horribly. Depending on what you set @catch_phrases to, you’ll get:

undefined method `[0]' for ['Addiction is like Pokemon!','Hey! There goes my pickup!','Chucky Bacon!']:Array

And that doesn’t get you anywhere. So, supposin’ you say, oh, I dunno:

# My special module (should be a plugin but for demonstration
# purposes, I'm defining it inline
# Skip this code for now
class CatchPhrasesController < ApplicationController
  module ArrayIterAccessors
  alias :old_method_missing :method_missing
  def method_missing( sym, *args, &block )
  sym_s = sym.to_s; t = sym_s[/\d+/]
    if t and sym_s[/\Aval\d+=?\Z/]
      if sym_s[/=\Z/]
        self[t.to_i] = args.first; return self[t.to_i]
      else
        return self[t.to_i]
      end
    else
      return old_method_missing( sym, *args, &block )
    end
  end
end
  # this is an action for your CatchPhrases Controller URL:
  # http://site.example.com/catch_phrases/new
  def new
    Array.send( :include, ArrayIterAccessors )
    @catch_phrases = ['Addiction is like Pokemon!','Hey! There goes my pickup!','Chucky Bacon!']
  end
end
### in your view: new.rhtml ###
<% @catch_phrases.each_with_index do |item,index| %>
<%= text_field 'catch_phrases', "val#{index}" %>
<% end %>

So when your view renders, you’ll see 3 text fields, each with the catch phrase pre-filled (so you’ll have an addiction catch phrase, a pickup catch phrase, and a chunky bacon catch phrase). You can edit them and when you submit the form, you can reconstitute your array by:

@catch_phrases = params[:catch_phrases].keys.collect{|k| k.to_s[/\d+/] }.sort.collect{|k| params[:catch_phrases][('val'+k.to_s).to_sym] }

If you didn’t change the catch phrases, you’ll get the exact array: @catch_phrases, back. And if you DID change the phrases, you’ll get an array, in order, of the altered catch phrases!

Pitfalls

  1. No ActiveRecord (well, I claimed that as a plus above) means: no validation. That’s up to you, unless you include a validatable mixin (there are a few of those). This also means there is no way of reporting errors back automatically. I suppose I could write something later to do this.
  2. That last line is ugly and slow.
  3. The use example is for demonstration purposes and is awful: DO NOT USE IT. That will mix the module in EVERY time you render the “new” action. To do this correctly take the module and the
    Array.send :include, ArrayIterAccessors
    code out of your controller and:
    1. Put it in a plugin OR
    2. Create a new file in lib/array_iter_accessors.rb (in your Rails project) and put the module into that. Then, in your environment.rb file (at the bottom), put the
      Array.send :include, ArrayIterAccessors
      code. That way, the mixin will only be done once.

What I did

I overrided the method_missing method for the Array class (then mixed it in). I scan for any method calls for methods beginning with “val” and ending in a number. I also parse for an ’=’ in case you’d like to use it for assignment too. Then, I call the array’s [] operator to access the values at the desired position. And because I’m using the array’s built-in operator, you’ll see all the error messages and behavior you’re used to seeing without my module.

Why val?

Because it’s two letters shorter than “value.” What? You expected a profound reason? Sorry if this precludes your ability to keep track of your girlfriends.

You have enough here to write your own plug-in, or just to use it when you need it. Happy riding.

Posted in ,  | Tags , , , , , , , , ,  | no comments

Session of Fear

Posted by Christopher Wojno Fri, 10 Aug 2007 23:48:00 GMT

Now, before you think “FUD Time!”, I preface this post with the following warning: Yes, this is a problem with not just Rails but all sessions; no, it’s not unsolvable. So, in light of the bad news, there is much good news.

The Problem

Rails applications are very useful. However, most require sessions in order to be useful. Sure, you can provide an ActionWebService or REST-based service to not use sessions, but for the majority of user services, you’re stuck with sessions.

A Session About Sessions

(Go ahead and skip this part if you’re already familiar with how sessions work)

A session is a way a server remembers who you are (in case you didn’t know). The most common way of allowing a server to ID you, without knowing who you are exactly, is by sharing a secret! Well, it’s supposed to be secret and we’ll come to that in a bit. When a new session is created, your session ID is generated (it’s random, hopefully) and saved in the server. The id can be associated with other data, such as your name, or permissions. When you request a page, the server will give you this ID. You almost never see that ID because it’s sent in the background as a “cookie.” You’ve probably heard of these before. From now on, every time you request a page from the server, your cookie data is automatically included in the request (your web browser takes care of this) so the server can figure out who you are. Done correctly, no personally identifiable information is sent over the Web. Just your (hopefully) random ID.

Sounds good so far. But if you’re not using HTTPS (SSL/TSL encryption), your secret contained in that cookie can be seen when it is first sent, and when you access any page thereafter (because you’re sending it each time remember). So, if someone has access to your traffic (if you’re using a shared WiFi or a hub or have a shady ISP), they can pull out your secret. Here’s an old session I sniffed using a program called snort:

GET /javascripts/prototype.js?1186461045 HTTP/1.1..
Host: christopher.wojno.com..
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.8.1.6) Gecko/20070810 Firefox/2.0.0.6..
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5..Accept-Language: en-us,en;q=0.5..
Accept-Encoding: gzip,deflate..
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7..
Keep-Alive: 300..
Connection: keep-alive..
Cookie: _session_id=db392fa5b39125fa7c1e581b9c1ec71d; is_admin=yes....

There it is, the last line. “Cookie.” So when my browser was trying to download some javascript libraries (Prototype is quite good), it sent my session ID. Now, the javascript doesn’t care about my session ID, but it was sent any way. If someone were to see that sent over the Internet (at any point on it’s way to the server), he or she could create a fake cookie with that ID and login as me without my password! (don’t try it, I’ve already logged out and reset the session) They could then, go into the settings and lock me out of my own system (until I change the password in the ruby console… that’ll show ‘em?). In the meantime, the damage has been done and it can happen again.

SSL!

I won’t go into SSL, but that little piece of technology has enabled credit card purchases over the Internet for years and will hopefully continue to do so for years to come. Any way, in short, it’s encryption and it will encrypt cookies too!

So, we need to force rails to encrypt cookies. Not as simple as it sounds. First, you need to setup an SSL certificate on your server. Easier said than done, though, some Apache packages come with the tools to automatically generate one. For a confusing, yet interesting read, see X509 on Wikipedia.

I assume you installed your SSL certificate and have configured Apache1 to run with SSL (or lighty, if you’re using that instead (mongrel is not mentioned as it does not have SSL, though there is a way to use SSL and mongrel together)). If not, there are loads of tutorials. I warn you though, it’s fairly technical.

Hold onto your Cookies!

So, you have SSL and a Rails application running (I’m assuming). How do you tell Rails to make sure your cookies are only sent via SSL? Rails lets you specify it.

Tell Rails to use SSL Only

By default, Rails does NOT enforce SSL on cookies or sessions (that would be frustrating for development, wouldn’t it?). So you have to enable that enforcement yourself. If you want your session cookie to be sent over SSL site-wide (and generally, you do!), head into your rails application directory and open (in your favorite editor) app/controllers/application.rb Add the following anywhere in the ApplicationController class:

class ApplicationController < ActionController::Base
  session :session_key => '_session_id', :session_secure => true
end

The ApplicationController class definition should already exist, don’t duplicate it. Also, make sure that session isn’t already specified. If it is, the important part here is ”:session_secure => true”. Rails will now tell the browser to only send the session cookie if the browser is using the https (SSL) protocol. This feature is poorly documented but hopefully this will help keep your applications that you write, safe. NOTE: You MUST use SSL if you enable this or your application will become extremely forgetful (Who are you? What are you doing in my kitchen?!).

If you’re interested in storing OTHER data (not overly recommended though due to this and another exploit) in other cookies, Rails offers cookie—manipulation (not really management, the API leaves something to be desired). You can tell individual cookies to only be sent over encrypted connections, just like the session cookie. The other exploit: if the computer is shared and the browser doesn’t clear out the cookies, the next person to sit at the computer can harvest the cookie information, so don’t store passwords, e-mail addresses… really anything in cookies, it’s just a bad idea. DO store worthless information you don’t want to save in your server, such as squirrel preferences.

Admittedly, generally the odds of someone having access to your network traffic (and your cookies, no! MY cookies!) directly is moderate. For the attacker to successfully pick out your cookie data from the slew of traffic wizzing by, he/she would have to be looking and looking for a specific cookie name. So make sure no one has it out for you.

I don’t mean to sound down on the Rails team. Dang-fine-job I say. Cookies aren’t really important and the session is cookie-based, so session security falls by the way-side. It’s up to all developers to keep his or her eyes open for potential pitfalls.

1_000.thank( Rails::DevTeam.members.collect{|m|m.email} )

1 Note: Apache doesn’t know how to run Ruby code as of this writing. You need to use an Apache’s mod_proxy. This will (after it’s been configured) then pass the requests from Apache, to mongrel, lighty, or… WEBrick (if you’re nuts)

Posted in  | Tags , , , , , , ,  | 2 comments

Easy And/Or in Ruby on Rails

Posted by Christopher Wojno Tue, 07 Aug 2007 05:05:00 GMT

This missing feature of rails has really bugged me, but it’s so useful.

If you have a list of words such as: apples, oranges, and bananas as an array:

>> list = ['apples','oranges','bananas']
=> ["apples", "oranges", "bananas"]

You’d like to be able to have a variable length list and still have it look correct in the view. So a smaller list:

>> list = ['oranges','bananas']
=> ["oranges", "bananas"]

Should look like: “oranges and bananas”.

>> list = ['apples','oranges','bananas']
=> ["apples", "oranges", "bananas"]
>> and_or_list 'and', list
=> "apples, oranges, and bananas" 
>> list.pop
=> "bananas" 
>> and_or_list 'and', list
=> "apples and oranges" 
>> list.pop
=> "oranges" 
>> and_or_list 'and', list
=> "apples" 

The following block of code will do just that:

  def and_or_list( andor, list )
    list = list.dup
    comma = (list.size > 2 ? ',' : '')
    list2 = list.pop if list.size > 1
    s = list.join(', ')
    s << comma+' '+andor+' ' + list2 if list2
    s
  end

Vioa!

Instant and easy listing of various things, in English. Your users will never know it’s generated.

I’ve wrapped it up in a neat little plug-in for you. Just install it in your vendor/plugins directory. It will automatically be available in your views.

AndOrList Module

Posted in  | Tags , , , , , ,  | 2 comments