Sorting Titles Without Articles Fast

Mark Dalrymple wrote an interesting article about the speed difference between isEqual: and isEqualToString: methods. Choosing between these methods is a small decision programmers make everyday. The article should reaffirm readers beliefs that concentrating on readability is more important than performance.

The short of it is that he runs the methods 10 million times to show results in a speed difference of 3% to 7%. About .0000000005 seconds gained for every call of isEqualToString method over isEqual. Most of my code would use those methods at most a few tens of times when a user makes a choice in the Pedias and five billionths of a second is certainly not a difference anyone can notice. In fact the difference in results is low enough that they become statistically meaningless. Trying to unravel the implications of compiler optimizations as well as the testing environment is more complex and time consuming than the numbers justify.

However the article did get me thinking about an old function I optimized a few year back. The Pedias include a bit of code for sorting that ignores the article at the beginning of a title. When sorting a large number of entries this code can be run in the vicinity of a million times. This fits my number one rule about optimization: loops are the place where optimization makes the most difference.

The original code used the standard NSString hasPrefix: method to do the comparison.

if ([aString hasPrefix:@"\""])
    aString = [aString substringFromIndex:1];

if ([aString hasPrefix:@"The "])
    return [aString substringFromIndex:4];
else if ([aString hasPrefix:@"A "])
    return [aString substringFromIndex:2];
else if ([aString hasPrefix:@"An "])
    return [aString substringFromIndex:3];
return aString;

On testing collections with 20,000 items the sort felt sluggish. Especially when compared to a sorting that did not take the article into account, hence I knew the article comparison could do with some optimization. I changed the code to do the comparison based on each individual character. (This all started with a much longer version of the code that ignores articles in other languages that take into account a lot more than the possible 3 articles in English.) By looking at the first letter and making chain decisions I thought I would save a lot of time.

NSUInteger length = [aString length];
if (length) {
	unichar aChar = [aString characterAtIndex:0];

	if (aChar == '\"' && length > 1) {
		aString = [aString substringFromIndex:1];
		aChar = [aString characterAtIndex:0];
		length--;
	}

	if (aChar == 'T' && length > 4 && [aString characterAtIndex:1] == 'h' && [aString characterAtIndex:2] == 'e' && [aString characterAtIndex:3] == ' ')
		aString = [aString substringFromIndex:4];
	else if (aChar == 'A') {
		if (length > 3) {
            aChar = [aString characterAtIndex:1];
		    if (aChar == ' ')
			     aString = [aString substringFromIndex:2];
		    else if (aChar == 'n' && [aString characterAtIndex:2] == ' ')
			     aString = [aString substringFromIndex:3];
		    }
	    }
    }
return aString;

With the new code the sorting was blazing fast in testing and it’s been great for actual users ever since. My interest peaked by Mark Dalrymple article and his timing code convenient and thoughtfully in a gist I ran my two methods in it to finally get some raw numbers.

Title “King Kong”:
Time for characterAtIndex: 0.300036
Time for hasPrefix: 3.165767

Title “The Lord of the Rings”:
Time for characterAtIndex: 3.218064
Time for hasPrefix: 4.198877

Right off you can see the new code (which is really 5 years old) is 10x faster than the old version when no article is present. For the string with “The” the code is much slower in both cases. But this can quickly be attributed to the actual “substringFromIndex:” method that both versions need to return the string to be sorted. Removing this operation, since we are only interested in the comparison timing, gives new numbers.

Title “The Lord of the Rings” with an assignment call instead of substringFromIndex.
Time for characterAtIndex: 0.665993
Time for hasPrefix: 1.568902

As you can see the difference is still 0.98 seconds between them but the time it takes to do 10 million substringFromIndex: calls has been stripped, better reflecting that in the case of “The” article being present the new call is 2x as fast. A title with “An” of course is even faster as the original method does not test for just the presence of “A” first but the entire “A ” and “An ” prefix.

So sometimes it does make sense to write your own versions of the built-in functions and lengthen your code when you can use the internal information in the algorithm to achieve more specific results. However, keep in mind it’s not really worth the time unless making millions of calls to that function.

As a side note, it’s faster in 32 bit than in their 64 bit counterparts. A meaningless statistic since the code is so basic that there is no advantage or disadvantage in the code itself that would affect the timing depending on the architecture.

For programmers out there looking to ignore articles when sorting, the above code will do the trick nicely when wrapped in the below category method of NSString and used from a NSSortDescriptor:

@implementation NSString (NSStringExtensions)

- (NSComparisonResult)ignoringArticleCompare:(NSString *)anotherString {
	NSString *string1 = removeArticleFromString(self);
	NSString *string2 = removeArticleFromString(anotherString);

	return [string1 compare:string2 options:NSCaseInsensitiveSearch | NSNumericSearch range:NSMakeRange(0,[string1 length]) locale:[NSLocale currentLocale]];
}

@end

Here is the code for also ignoring articles in English, French, German, Italian and Spanish:

static inline NSString* removeForeignArticleFromString(NSString *aString) {

	// A faster version than the original below
	NSUInteger length = [aString length];
	if (length) {
		unichar aChar = [aString characterAtIndex:0];

		// ", The, A, An, El, Ein, Eine, Einen, Einem, Einer, Eines, Un, Una, Une, Des, Der, Dem, Den, Die, Das, L', Lo, La, Le, Les, Los, Las, Il
		if (aChar == '\"' && length > 1) {
			aString = [aString substringFromIndex:1];
			aChar = [aString characterAtIndex:0];
			length--;
		}

		if (aChar == 'T' && length > 4 && [aString characterAtIndex:1] == 'h' && [aString characterAtIndex:2] == 'e' && [aString characterAtIndex:3] == ' ')
			return [aString substringFromIndex:4];
		else if (aChar == 'A') {
			if (length > 3) {
				if ([aString characterAtIndex:1] == ' ')
					return [aString substringFromIndex:2];
				else if ([aString characterAtIndex:1] == 'n' && [aString characterAtIndex:2] == ' ')
					return [aString substringFromIndex:3];
			}
		}
		else if (aChar == 'E') {
			if (length > 3) {
				if ([aString characterAtIndex:1] == 'l' && [aString characterAtIndex:2] == ' ') // el
					return [aString substringFromIndex:3];
			}
			if (length > 4) {
				if ([aString characterAtIndex:1] == 'i' && [aString characterAtIndex:2] == 'n' && [aString characterAtIndex:3] == ' ') // ein
					return [aString substringFromIndex:4];
			}
			if (length > 5) {
				if ([aString characterAtIndex:1] == 'i' && [aString characterAtIndex:2] == 'n' && [aString characterAtIndex:3] == 'e' && [aString characterAtIndex:4] == ' ') // eine
					return [aString substringFromIndex:5];
			}
			if (length > 6) {
				if ([aString characterAtIndex:1] == 'i' && [aString characterAtIndex:2] == 'n' && [aString characterAtIndex:3] == 'e' && [aString characterAtIndex:5] == ' ') {
					if (([aString characterAtIndex:4] == 'n') // einen
						|| ([aString characterAtIndex:4] == 'm') // einem
						|| ([aString characterAtIndex:4] == 'r') // einer
						|| ([aString characterAtIndex:4] == 's') // eines
						)
						return [aString substringFromIndex:6];
				}
			}
		}
		else if (aChar == 'U') {
			if (length > 3) {
				if ([aString characterAtIndex:1] == 'n' && [aString characterAtIndex:2] == ' ') // Un
					return [aString substringFromIndex:3];
			}
			if (length > 4) {
				if ([aString characterAtIndex:1] == 'n' && [aString characterAtIndex:3] == ' ') {
					if (([aString characterAtIndex:2] == 'a') //Una
						|| ([aString characterAtIndex:2] == 'e') // Une
						)
						return [aString substringFromIndex:4];
				}
			}
		}
		else if (aChar == 'D') {
			if (length > 4) {
				if ([aString characterAtIndex:1] == 'e' && [aString characterAtIndex:3] == ' ') {
					if (([aString characterAtIndex:2] == 's') // Des
						|| ([aString characterAtIndex:2] == 'r') // Der
						|| ([aString characterAtIndex:2] == 'm') // Dem
						|| ([aString characterAtIndex:2] == 'n') // Den
						)
						return [aString substringFromIndex:4];
				}
				else if (([aString characterAtIndex:1] == 'i' && [aString characterAtIndex:2] == 'e' && [aString characterAtIndex:3] == ' ') // Die
						 || ([aString characterAtIndex:1] == 'a' && [aString characterAtIndex:2] == 's' && [aString characterAtIndex:3] == ' ') // Das
						 )
					return [aString substringFromIndex:4];
			}
		}
		else if (aChar == 'L') {

			if (length > 3) {
				if ( [aString characterAtIndex:2] == ' ' &&
					([aString characterAtIndex:1] == '\'' // L'
					 || [aString characterAtIndex:1] == 'o' // Lo
					 || [aString characterAtIndex:1] == 'a' // La
					 || [aString characterAtIndex:1] == 'e')) // Le
					return [aString substringFromIndex:3];
				else if ([aString characterAtIndex:1] == '\'') // L' Without space suffix
					return [aString substringFromIndex:2];
			}
			if (length > 4) {
				if ([aString characterAtIndex:2] == 's' && [aString characterAtIndex:3] == ' ') {
					if (([aString characterAtIndex:1] == 'e') // Les
						|| ([aString characterAtIndex:1] == 'o') // Los
						|| ([aString characterAtIndex:1] == 'a' ) // Las
						)
						return [aString substringFromIndex:4];
				}
			}
		}
		else if (aChar == 'I') {
			if (length > 3) {
				if (([aString characterAtIndex:1] == 'l' && [aString characterAtIndex:2] == ' ') // il
					)
					return [aString substringFromIndex:3];
			}
		}
	}
	return aString;
}

Tags: ,

3 Responses to “Sorting Titles Without Articles Fast”

  1. Mark Dalrymple Says:

    Thanks for the mention! I have to give credit to Advanced iOS instructor Jonathan Blocksom for putting the timing code into gist. He was the one that suggested it getting me up to speed on the whole thing.

  2. Mike Abdullah Says:

    You could probably eke out a fair bit more performance by adjusting the range passed into -compare:options:range:locale:, rather than deriving so many new substrings a lot of the time.

  3. Conor Says:

    Hi Mike,

    Thank you of the tip. I had not thought of that. I tested the simple case of returning the range for the receiver string and using the range attribute, but it actually runs slightly slower than the previous version. Code used:

    NSInteger rangeOfArticle = rangeOfArticleFromString(self);
    return [self compare:string2 options:NSCaseInsensitiveSearch | NSNumericSearch range:NSMakeRange(rangeOfArticle,[self length] – rangeOfArticle) locale:[NSLocale currentLocale]];

    However, I would have to test it by doing both and not just the receiver, but it sound complicated to flip the comparison order of the strings and then flip the NSOrder value returned to keep the same ordering. But I shall give it a try.