{"id":691,"date":"2012-03-30T06:56:33","date_gmt":"2012-03-30T10:56:33","guid":{"rendered":"http:\/\/bruji.com\/articles\/?p=691"},"modified":"2012-03-30T06:56:33","modified_gmt":"2012-03-30T10:56:33","slug":"sorting-titles-without-articles-fast","status":"publish","type":"post","link":"https:\/\/bruji.com\/blog\/2012\/03\/30\/sorting-titles-without-articles-fast\/","title":{"rendered":"Sorting Titles Without Articles Fast"},"content":{"rendered":"<p>Mark Dalrymple wrote an interesting article about the <a href=\"http:\/\/weblog.bignerdranch.com\/?p=334\">speed difference between <em>isEqual<\/em>: and <em>isEqualToString<\/em>:<\/a> methods. Choosing between these methods is a small decision programmers make everyday. The article should reaffirm readers beliefs that concentrating on readability is more important than performance.<\/p>\n<p>The short of it is that he runs the methods 10 million times to show results in a speed difference of <strong>3%<\/strong> to <strong>7%<\/strong>. About <strong>.0000000005<\/strong> seconds gained for every call of <em>isEqualToString<\/em> method over <em>isEqual<\/em>. Most of my code would use those methods at most a few tens of times when a user makes a choice in the Pedias and five billionths of a second is certainly not a difference anyone can notice. In fact the difference in results is low enough that they become statistically meaningless. Trying to unravel the implications of compiler optimizations as well as the testing environment is more complex and time consuming than the numbers justify.<\/p>\n<p>However the article did get me thinking about an old function I optimized a few year back. The Pedias include a bit of code for sorting that ignores the article at the beginning of a title. When sorting a large number of entries this code can be run in the vicinity of a million times. This fits my number one rule about optimization: loops are the place where optimization makes the most difference.<\/p>\n<p>The original code used the standard NSString <em>hasPrefix<\/em>: method to do the comparison.<\/p>\n<pre class=\"brush: objc; title: ; notranslate\" title=\"\">\r\nif (&#x5B;aString hasPrefix:@&quot;\\&quot;&quot;])\r\n    aString = &#x5B;aString substringFromIndex:1];\r\n\r\nif (&#x5B;aString hasPrefix:@&quot;The &quot;])\r\n    return &#x5B;aString substringFromIndex:4];\r\nelse if (&#x5B;aString hasPrefix:@&quot;A &quot;])\r\n    return &#x5B;aString substringFromIndex:2];\r\nelse if (&#x5B;aString hasPrefix:@&quot;An &quot;])\r\n    return &#x5B;aString substringFromIndex:3];\r\nreturn aString;\r\n<\/pre>\n<p>On testing collections with 20,000 items the sort felt sluggish. Especially when compared to a sorting that did not take the article into account, hence I knew the article comparison could do with some optimization. I changed the code to do the comparison based on each individual character. (This all started with a much longer version of the code that ignores articles in other languages that take into account a lot more than the possible 3 articles in English.) By looking at the first letter and making chain decisions I thought I would save a lot of time.<\/p>\n<pre class=\"brush: objc; title: ; notranslate\" title=\"\">\r\nNSUInteger length = &#x5B;aString length];\r\nif (length) {\r\n\tunichar aChar = &#x5B;aString characterAtIndex:0];\r\n\r\n\tif (aChar == '\\&quot;' &amp;&amp; length &gt; 1) {\r\n\t\taString = &#x5B;aString substringFromIndex:1];\r\n\t\taChar = &#x5B;aString characterAtIndex:0];\r\n\t\tlength--;\r\n\t}\r\n\r\n\tif (aChar == 'T' &amp;&amp; length &gt; 4 &amp;&amp; &#x5B;aString characterAtIndex:1] == 'h' &amp;&amp; &#x5B;aString characterAtIndex:2] == 'e' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ')\r\n\t\taString = &#x5B;aString substringFromIndex:4];\r\n\telse if (aChar == 'A') {\r\n\t\tif (length &gt; 3) {\r\n            aChar = &#x5B;aString characterAtIndex:1];\r\n\t\t    if (aChar == ' ')\r\n\t\t\t     aString = &#x5B;aString substringFromIndex:2];\r\n\t\t    else if (aChar == 'n' &amp;&amp; &#x5B;aString characterAtIndex:2] == ' ')\r\n\t\t\t     aString = &#x5B;aString substringFromIndex:3];\r\n\t\t    }\r\n\t    }\r\n    }\r\nreturn aString;\r\n<\/pre>\n<p>With the new code the sorting was blazing fast in testing and it&#8217;s been great for actual users ever since. My interest peaked by Mark Dalrymple article and his <a href=\"https:\/\/gist.github.com\/2201940\">timing code convenient and thoughtfully in a gist<\/a> I ran my two methods in it to finally get some raw numbers.<\/p>\n<p>Title &#8220;King Kong&#8221;:<br \/>\nTime for characterAtIndex: <strong>0.300036<\/strong><br \/>\nTime for hasPrefix: <strong>3.165767<\/strong><\/p>\n<p>Title &#8220;The Lord of the Rings&#8221;:<br \/>\nTime for characterAtIndex: <strong>3.218064<\/strong><br \/>\nTime for hasPrefix: <strong>4.198877<\/strong><\/p>\n<p>Right off you can see the new code (which is really 5 years old) is 10x faster than the old version when no article is present. For the string with &#8220;<em>The<\/em>&#8221; the code is much slower in both cases. But this can quickly be attributed to the actual &#8220;<em>substringFromIndex:<\/em>&#8221; method that both versions need to return the string to be sorted. Removing this operation, since we are only interested in the comparison timing, gives new numbers.<\/p>\n<p>Title &#8220;The Lord of the Rings&#8221; with an assignment call instead of substringFromIndex.<br \/>\nTime for characterAtIndex: <strong>0.665993<\/strong><br \/>\nTime for hasPrefix: <strong>1.568902<\/strong><\/p>\n<p>As you can see the difference is still 0.98 seconds between them but the time it takes to do 10 million substringFromIndex: calls has been stripped, better reflecting that in the case of &#8220;<em>The<\/em>&#8221; article being present the new call is 2x as fast. A title with &#8220;<em>An<\/em>&#8221; of course is even faster as the original method does not test for just the presence of &#8220;<em>A<\/em>&#8221; first but the entire &#8220;<em>A<\/em> &#8221; and &#8220;<em>An<\/em> &#8221; prefix.<\/p>\n<p>So sometimes it does make sense to write your own versions of the built-in functions and lengthen your code when you can use the internal information in the algorithm to achieve more specific results. However, keep in mind it&#8217;s not really worth the time unless making millions of calls to that function.<\/p>\n<p>As a side note, it&#8217;s faster in 32 bit than in their 64 bit counterparts. A meaningless statistic since the code is so basic that there is no advantage or disadvantage in the code itself that would affect the timing depending on the architecture.<\/p>\n<p>For programmers out there looking to ignore articles when sorting, the above code will do the trick nicely when wrapped in the below category method of NSString and used from a NSSortDescriptor:<\/p>\n<pre class=\"brush: objc; title: ; notranslate\" title=\"\">\r\n@implementation NSString (NSStringExtensions)\r\n\r\n- (NSComparisonResult)ignoringArticleCompare:(NSString *)anotherString {\r\n\tNSString *string1 = removeArticleFromString(self);\r\n\tNSString *string2 = removeArticleFromString(anotherString);\r\n\r\n\treturn &#x5B;string1 compare:string2 options:NSCaseInsensitiveSearch | NSNumericSearch range:NSMakeRange(0,&#x5B;string1 length]) locale:&#x5B;NSLocale currentLocale]];\r\n}\r\n\r\n@end\r\n<\/pre>\n<p>Here is the code for also ignoring articles in English, French, German, Italian and Spanish:<\/p>\n<pre class=\"brush: objc; collapse: true; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\nstatic inline NSString* removeForeignArticleFromString(NSString *aString) {\r\n\r\n\t\/\/ A faster version than the original below\r\n\tNSUInteger length = &#x5B;aString length];\r\n\tif (length) {\r\n\t\tunichar aChar = &#x5B;aString characterAtIndex:0];\r\n\r\n\t\t\/\/ &quot;, The, A, An, El, Ein, Eine, Einen, Einem, Einer, Eines, Un, Una, Une, Des, Der, Dem, Den, Die, Das, L', Lo, La, Le, Les, Los, Las, Il\r\n\t\tif (aChar == '\\&quot;' &amp;&amp; length &gt; 1) {\r\n\t\t\taString = &#x5B;aString substringFromIndex:1];\r\n\t\t\taChar = &#x5B;aString characterAtIndex:0];\r\n\t\t\tlength--;\r\n\t\t}\r\n\r\n\t\tif (aChar == 'T' &amp;&amp; length &gt; 4 &amp;&amp; &#x5B;aString characterAtIndex:1] == 'h' &amp;&amp; &#x5B;aString characterAtIndex:2] == 'e' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ')\r\n\t\t\treturn &#x5B;aString substringFromIndex:4];\r\n\t\telse if (aChar == 'A') {\r\n\t\t\tif (length &gt; 3) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == ' ')\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:2];\r\n\t\t\t\telse if (&#x5B;aString characterAtIndex:1] == 'n' &amp;&amp; &#x5B;aString characterAtIndex:2] == ' ')\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:3];\r\n\t\t\t}\r\n\t\t}\r\n\t\telse if (aChar == 'E') {\r\n\t\t\tif (length &gt; 3) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == 'l' &amp;&amp; &#x5B;aString characterAtIndex:2] == ' ') \/\/ el\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:3];\r\n\t\t\t}\r\n\t\t\tif (length &gt; 4) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == 'i' &amp;&amp; &#x5B;aString characterAtIndex:2] == 'n' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ') \/\/ ein\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:4];\r\n\t\t\t}\r\n\t\t\tif (length &gt; 5) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == 'i' &amp;&amp; &#x5B;aString characterAtIndex:2] == 'n' &amp;&amp; &#x5B;aString characterAtIndex:3] == 'e' &amp;&amp; &#x5B;aString characterAtIndex:4] == ' ') \/\/ eine\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:5];\r\n\t\t\t}\r\n\t\t\tif (length &gt; 6) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == 'i' &amp;&amp; &#x5B;aString characterAtIndex:2] == 'n' &amp;&amp; &#x5B;aString characterAtIndex:3] == 'e' &amp;&amp; &#x5B;aString characterAtIndex:5] == ' ') {\r\n\t\t\t\t\tif ((&#x5B;aString characterAtIndex:4] == 'n') \/\/ einen\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:4] == 'm') \/\/ einem\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:4] == 'r') \/\/ einer\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:4] == 's') \/\/ eines\r\n\t\t\t\t\t\t)\r\n\t\t\t\t\t\treturn &#x5B;aString substringFromIndex:6];\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t}\r\n\t\telse if (aChar == 'U') {\r\n\t\t\tif (length &gt; 3) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == 'n' &amp;&amp; &#x5B;aString characterAtIndex:2] == ' ') \/\/ Un\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:3];\r\n\t\t\t}\r\n\t\t\tif (length &gt; 4) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == 'n' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ') {\r\n\t\t\t\t\tif ((&#x5B;aString characterAtIndex:2] == 'a') \/\/Una\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:2] == 'e') \/\/ Une\r\n\t\t\t\t\t\t)\r\n\t\t\t\t\t\treturn &#x5B;aString substringFromIndex:4];\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t}\r\n\t\telse if (aChar == 'D') {\r\n\t\t\tif (length &gt; 4) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:1] == 'e' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ') {\r\n\t\t\t\t\tif ((&#x5B;aString characterAtIndex:2] == 's') \/\/ Des\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:2] == 'r') \/\/ Der\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:2] == 'm') \/\/ Dem\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:2] == 'n') \/\/ Den\r\n\t\t\t\t\t\t)\r\n\t\t\t\t\t\treturn &#x5B;aString substringFromIndex:4];\r\n\t\t\t\t}\r\n\t\t\t\telse if ((&#x5B;aString characterAtIndex:1] == 'i' &amp;&amp; &#x5B;aString characterAtIndex:2] == 'e' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ') \/\/ Die\r\n\t\t\t\t\t\t || (&#x5B;aString characterAtIndex:1] == 'a' &amp;&amp; &#x5B;aString characterAtIndex:2] == 's' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ') \/\/ Das\r\n\t\t\t\t\t\t )\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:4];\r\n\t\t\t}\r\n\t\t}\r\n\t\telse if (aChar == 'L') {\r\n\r\n\t\t\tif (length &gt; 3) {\r\n\t\t\t\tif ( &#x5B;aString characterAtIndex:2] == ' ' &amp;&amp;\r\n\t\t\t\t\t(&#x5B;aString characterAtIndex:1] == '\\'' \/\/ L'\r\n\t\t\t\t\t || &#x5B;aString characterAtIndex:1] == 'o' \/\/ Lo\r\n\t\t\t\t\t || &#x5B;aString characterAtIndex:1] == 'a' \/\/ La\r\n\t\t\t\t\t || &#x5B;aString characterAtIndex:1] == 'e')) \/\/ Le\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:3];\r\n\t\t\t\telse if (&#x5B;aString characterAtIndex:1] == '\\'') \/\/ L' Without space suffix\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:2];\r\n\t\t\t}\r\n\t\t\tif (length &gt; 4) {\r\n\t\t\t\tif (&#x5B;aString characterAtIndex:2] == 's' &amp;&amp; &#x5B;aString characterAtIndex:3] == ' ') {\r\n\t\t\t\t\tif ((&#x5B;aString characterAtIndex:1] == 'e') \/\/ Les\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:1] == 'o') \/\/ Los\r\n\t\t\t\t\t\t|| (&#x5B;aString characterAtIndex:1] == 'a' ) \/\/ Las\r\n\t\t\t\t\t\t)\r\n\t\t\t\t\t\treturn &#x5B;aString substringFromIndex:4];\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t}\r\n\t\telse if (aChar == 'I') {\r\n\t\t\tif (length &gt; 3) {\r\n\t\t\t\tif ((&#x5B;aString characterAtIndex:1] == 'l' &amp;&amp; &#x5B;aString characterAtIndex:2] == ' ') \/\/ il\r\n\t\t\t\t\t)\r\n\t\t\t\t\treturn &#x5B;aString substringFromIndex:3];\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n\treturn aString;\r\n}\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Mark Dalrymple wrote an interesting article about the speed difference between isEqual: and isEqualToString: methods. Choosing between these methods is a small decision programmers make everyday. The article should reaffirm readers beliefs that concentrating on readability is more important than performance. The short of it is that he runs the methods 10 million times to [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[92,91],"class_list":["post-691","post","type-post","status-publish","format-standard","hentry","category-cocoa","tag-objective-c-speed","tag-sorting-articles"],"_links":{"self":[{"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/posts\/691","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/comments?post=691"}],"version-history":[{"count":27,"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/posts\/691\/revisions"}],"predecessor-version":[{"id":762,"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/posts\/691\/revisions\/762"}],"wp:attachment":[{"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/media?parent=691"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/categories?post=691"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bruji.com\/blog\/wp-json\/wp\/v2\/tags?post=691"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}